The Ridge

Early experiments with AI coding tools
2026-01-08

Over the last few days I have been experimenting with AI coding tools to help build a weather app. As development time and cost decrease, I think it will become common to build applications to your own specifications. The tools I used were Cursor, Antigravity, Windsurf, Roo and Codex. They are all dependent on Visual Studio Code (though I think Codex can be standalone). I expect these tools to diversify in the future, but with the rush to market, building off VS Code was the obvious choice. They fall into two categories: all except Roo are forks of Visual Studio Code with their proprietary AI UI baked in. Roo, on the other hand, has opted to be an extension, primarily for traditional Visual Studio Code, though it could also be added to any of the other forks.

The downside of Roo being an extension is that it is beholden to VS Code's design choices. For example, the prioritisation of Copilot meant having to tweak the UI to get the Roo agent chat and the workspace explorer open at the same time.

It is important to note that all these tools are merely ways to interact with the underlying language models, so most of the mainstream models, such as Gemini and Claude, are available regardless of the platform.

What drew me to Roo was its pricing transparency: it is pay as you go, allowing you to pay directly for tokens, which can then be spent on any model you wish. Most of the other services offer a subscription where it is unclear how much access you get to the more powerful models. Pay as you go also opened my eyes to how expensive these models can be; I blew through $10 of credit in a couple of hours with Claude Opus 4.5. I don't know how this compares to taking out a subscription. Once the investor-capital subsidies stop and these companies have to start making profits, it will be interesting to see what happens to prices, though advances in models and hardware will likely help.
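To make the burn rate concrete, here is a rough sketch of the arithmetic. The per-token rates and context sizes are assumptions I have picked for illustration, not quoted figures; check the provider's pricing page for real numbers.

```python
# Back-of-the-envelope cost of an agentic session. The rates below
# are my assumptions for illustration, not quoted prices.
INPUT_RATE = 5.00 / 1_000_000    # assumed $ per input token
OUTPUT_RATE = 25.00 / 1_000_000  # assumed $ per output token

# Each agentic turn re-sends a large context (code, logs, chat
# history), so input tokens dominate. Say ~50k tokens of context in
# and ~2k tokens of generated code out per turn:
turns = 30
cost = turns * (50_000 * INPUT_RATE + 2_000 * OUTPUT_RATE)
print(f"${cost:.2f}")  # $9.00 -- ten dollars does not last long
```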

Since I last dabbled with AI coding tools around 8 months ago, the new trend is toward agentic development. This means that instead of the tooling autocompleting or one-shotting your commands, it works more like a developer. The flow seemed similar across platforms: the tooling prompts the model to generate a plan based on your requirements, then iterates through it and undertakes all the coding tasks. What was impressive is that it will then attempt to run console commands to start a server, or open a web browser to verify that the application works correctly, reading logs and performing commands to debug before going back into coding mode to act on what it found. Antigravity was able to open the Chrome browser and navigate its way around my test application; the Roo extension opened a browser tab in VS Code.
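The loop looks roughly like the sketch below. To be clear, `ask_model` and `apply_edits` are hypothetical stand-ins I have invented to show the shape of the flow, not any vendor's actual API.

```python
import subprocess

# Hypothetical stand-ins for the model calls -- placeholders for the
# shape of the loop, not a real tool's API.
def ask_model(prompt: str) -> str:
    raise NotImplementedError("wire up to a real model API")

def apply_edits(changes: str) -> None:
    raise NotImplementedError("write the proposed changes to disk")

def agentic_session(requirements: str, max_attempts: int = 5) -> None:
    # The tool first asks the model for a plan, then works task by task.
    plan = ask_model(f"Break this into coding tasks:\n{requirements}")
    for task in plan.splitlines():
        for _ in range(max_attempts):
            apply_edits(ask_model(f"Write the code changes for: {task}"))
            # It then runs real commands to verify its own work...
            result = subprocess.run(
                ["npm", "test"], capture_output=True, text=True
            )
            logs = result.stdout + result.stderr
            # ...and feeds the logs back to the model to judge and debug.
            if "pass" in ask_model(f"Do these logs look healthy?\n{logs}"):
                break  # task verified, move on to the next one
            task = f"{task}\nPrevious attempt failed. Logs:\n{logs}"
```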

I activated a free trial for ChatGPT Plus, which gives access to Codex. This is quite a capable model and was a noticeable step up from the free offerings in terms of being able to work on a complex feature for a long period of time. Although you are limited in how often you can use the system per day, I think it offers more compute for $20 per month than buying tokens for something like Claude Opus 4.5 directly.

Although impressive, it's still early days for these kinds of interactions. The tools stumbled with many commands, for example not checking whether the web server had been started before trying to access a page, which adds a delay while the agent tries to understand what is going wrong. Another agent ran a git command to roll back to the previous commit without realising that there had not been any commits that session, so it wiped hours of code. Fortunately, in that case it was able to remember what the file had contained. The Cursor IDE pushed being able to launch and orchestrate multiple concurrent agents. I have not tried this feature, but from a cost and mental-bandwidth perspective I think it's still early to go down the multi-agent route. By mental bandwidth I mean that the amount of focus you must give just one agent is quite high. It can churn out features quickly, but quite often there will be nuances and quirks which require multiple prompts to iron out. Other times it will use its own initiative and go in a different direction to the original prompt, wasting tokens and forcing manual intervention.
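A cheap guard would have avoided that loss. As a sketch of the kind of safety net I mean (my own suggestion, not something any of these tools currently do), stash uncommitted work before any hard rollback:

```python
import subprocess

def working_tree_dirty() -> bool:
    # `git status --porcelain` prints one line per changed file and
    # nothing at all when the tree is clean.
    out = subprocess.run(
        ["git", "status", "--porcelain"], capture_output=True, text=True
    ).stdout
    return bool(out.strip())

# Gate the destructive rollback on the check the agent skipped:
if working_tree_dirty():
    # Stash first so the session's uncommitted work survives.
    subprocess.run(["git", "stash", "push", "-m", "pre-rollback safety net"])
subprocess.run(["git", "reset", "--hard", "HEAD"])
```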

One of the most important lessons I learned is to regularly start a new prompting session. When a feature was implemented incorrectly, I found that continuing to re-prompt in the same context led to the model repeatedly trying the same steps to fix the problem and expecting a different result. Or it would look to oversimplify my requirements; for example, instead of digging into documentation to understand an API, it would just hardcode a response and move on. Starting a new context felt like a developer taking a break and coming back to the problem: the model appeared more focused and willing to try a different approach. Being able to quickly read the codebase meant that even a fresh context could soon get up to speed on implementing a prompt.
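In API terms the habit amounts to throwing away the accumulated message history rather than appending to it. A minimal sketch, with `send` as a hypothetical stand-in for whichever model API the tool is wired to, and the weather-app prompt purely invented:

```python
def send(messages: list[dict]) -> str:
    raise NotImplementedError("stand-in for the tool's model API")

SYSTEM = {"role": "system", "content": "You are a coding agent."}
task = "The forecast page renders blank. Find and fix the bug."

messages = [SYSTEM, {"role": "user", "content": task}]
for _ in range(3):
    reply = send(messages)
    messages.append({"role": "assistant", "content": reply})
    if "fixed" in reply.lower():
        break
    # Re-prompting into the same context tends to repeat the same fix.
    messages.append({"role": "user", "content": "Still broken, try again."})
else:
    # Fresh context: keep only a short summary of what was already
    # tried, like a developer coming back after a break.
    messages = [SYSTEM, {"role": "user", "content":
        task + " Previous attempts patched the renderer; "
               "try a different approach."}]
    send(messages)
```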

As I was working on a hobby side project, I was reluctant to spend any more money on tokens, so I opted to explore the free offerings further. I quickly hit the limits of Antigravity, which caused me to switch to Windsurf. At the time of writing there appear to be very generous limits on the SWE 1.5 model, and I was surprised how capable it was. This would be my recommendation for now if you are averse to a subscription service; it seemed much better value than buying more Claude Opus 4.5 tokens. However, I think Codex may well be worth it at $20 a month.

It is clear we are still in the early stages of AI coding agents; there is a lot of work to be done. Currently I feel the biggest bottlenecks are cost and token limits. Even if models did not get more sophisticated, being able to have multiple agents review, critique and agree on a single agent's changes would go a long way.