When it comes to software developers, there are a few distinct types. For example, there is the extroverted, chatty type, who is always out there sharing the latest and newest libraries and projects with everyone, and is very much into bouncing ideas off others, regardless of whether those others have any clue what they’re talking about. Then there is the introverted loner, who prefers to tackle programming challenges by bouncing things around inside their own mind and going on long walks to mull things over before committing to anything significant.
This leads to interesting scenarios when it comes to management-enforced ‘optimization’ strategies, like Pair Programming. This approach involves two developers sharing the same computer and keyboard, theoretically doubling the effective output by some kind of metric, but realistically often leading to at least one side feeling pretty miserable and disconnected unless you put two of the chatty types together.
As a certified introverted loner developer, the idea of using an LLM chatbot as a coding assistant naturally triggers unpleasant flashbacks to hours of forced, awkward pair ‘programming’. However, maybe using an LLM chatbot could be more pleasant, because you can skip the whole awkward socializing bit. In order to give it a fair shake, I put together a little experiment to see whether LLM-based coding assistants are something that I could come to appreciate, unlike pair programming.
Setting Expectations
Any good experimental setup features clear goals and parameters that define what will be tested and what the expectations are. Obviously I come into this whole experiment from a somewhat negative angle, so to make it easy I’ll be picking two fairly straightforward scenarios for the LLM to assist with:
- C++ embedded coding for STM32 and CMSIS.
- Ada network development.
These are topics that I’m fairly familiar and comfortable with, meaning that I know what questions to ask and roughly what output to expect. I’ll be treating the chatbot for the most part as I would use StackOverflow, or as if nagging people on IRC, with my main fear being that it’ll be expecting pleasantries from me instead of brutal and cold professionalism. Ideally it’ll be a step above me hurling profanities at a search engine for clearly willfully misunderstanding what I am looking for.
My expectations are that it’ll have answers to my questions about how to do certain aspects of the tasks, and may even produce halfway usable code that I can fairly easily understand and double-check using my usual documentation references.
This just leaves one big question: which LLM chatbot to pick, and how the heck any of it is supposed to work, since I have avoided these things like the proverbial plague.
Meeting the Crew
Although I am aware that everyone who is into using LLM-assisted programming seems to like to promote LLMs like Claude, I’d ideally not be signing up for another service. This pretty much just leaves GitHub Copilot, which I have access to already. I have written about this particular LLM chatbot quite a bit since it was introduced, with my generally negative feelings towards these tools increasingly backed up by research.
Biased I may be, but to be a true scientist you have to be able to set aside your biases for an experiment and accept reality in the face of new evidence. Thus, with all biases and doubts firmly pushed aside in favor of the aforementioned cold professionalism, let’s get down to brass tacks.
Micro Code
My pet project for STM32-related programming has for a while been my Nodate project, involving the use of the CMSIS standard headers and the macros defined therein in order to write things ranging from start-up to running the Dhrystone benchmark and deciphering the various flavors of real-time clocks.
Much of this work entails digging through datasheets, reference manuals and piles of reference code, as well as throwing queries at search engines to see what potentially useful results percolate out of that particular resource. Coming across the trials and tribulations of fellow STM32 developers in forum threads and the like can be both heartening and disheartening, but all of it tends to condense into something that you can use to progress in the project.
Perhaps ironically, the moment that I tried to use the chatbot in the browser I got an error, with the GitHub status page indicating that some of their systems were down, including those for Copilot.

This raises another interesting point: regardless of whether an LLM chatbot makes for a good programming partner, a human partner doesn’t generally randomly keel over or become unresponsive in the midst of trying to do some work together. If they do, however, that’s absolutely a medical emergency and you should call 911, 112, or your local equivalent emergency number stat.

Anyway, after waiting for services to be restored, I was eventually able to ask the chatbot how to properly set the clock speed on an STM32F411 MCU, after getting tripped up previously by the need to set the regulator voltage scaling (VOS) in the power control register (PWR_CR). This is a power-saving feature that has to be adjusted before you can hit the higher, more power-hungry clock speeds.
Shockingly, the chatbot happily spits out ST HAL code and ignores the ‘CMSIS’ bit, although you could maybe argue that the ST HAL uses CMSIS inside. But then so does Arduino code for many MCUs.
To its credit, it does mention in a ‘Key CMSIS Requirements’ list that you need to set PWR_REGULATOR_VOLTAGE_SCALE1, yet without further detail on where to set it. There is also the tiny detail that this isn’t even a CMSIS macro; the CMSIS one would be PWR_CR_VOS, which sets both bits for the full range.
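For reference, the register dance I was actually after looks roughly like this in plain CMSIS. What follows is a minimal sketch from memory rather than Nodate’s actual implementation, and it omits the flash wait state and PLL configuration that a full clock switch also needs:

```cpp
// Minimal CMSIS sketch (not Nodate's actual code): enable voltage scale
// mode 1 on an STM32F411 so that the full 100 MHz range is permitted.
// Build with the device define (e.g. -DSTM32F411xE) so that the family
// header pulls in the right device header.
#include "stm32f4xx.h"

void enableMaxVoltageScale() {
    // The PWR peripheral sits on APB1; enable its clock before use.
    RCC->APB1ENR |= RCC_APB1ENR_PWREN;

    // Set both VOS bits in PWR_CR (scale mode 1, 0b11). Lower scale
    // modes save power, but cap SYSCLK at 84 or 64 MHz on this MCU.
    PWR->CR |= PWR_CR_VOS;

    // Only now is it safe to configure the PLL for 100 MHz and switch
    // SYSCLK over to it (not shown here).
}
```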
Fortunately we can do the digital equivalent of smacking the chatbot upside the head and tell it to do the thing we asked it to do, namely to provide the real CMSIS version. Doing so results in another gobsmacking moment when it happily spits out code that doesn’t bother to include the CMSIS headers, but instead copies every single used struct definition and more into the code, bloating it up massively:

This is of course very annoying when it should simply have used the header’s #define macros, and it clearly can generate include statements, judging by its inclusion of <cstdint>. But the absolutely deadly sin here is that its code isn’t even functional for an STM32F411, as can be observed here:

I’m not entirely sure where it got the PWR_CR_VOS_SCALE1 thing from; asking a friendly search engine leads to just a handful of results, one of which is for an STM32F407, which runs at 168 MHz max. This is hilarious in light of the comments right above the code. It makes you wonder what example code it pilfered from.
At this point I could probably continue to pick at this generated code, but suffice it to say that my confidence level in its generated code and overall output hovers somewhere between ‘low’ and ‘bottom of a black hole’. I’m more than happy to flip this particular table, rage quit, and not lose what remains of my sanity.
Findings
Although I had also intended to do some fun porting of the C++ networking code in my NymphRPC remote procedure call library to Ada together with my buddy Copilot, I found my nerves to be sufficiently frayed, and the bouts of near-hysterical laughter out of sheer disbelief worrisome enough, to abort this attempt.
I also do not feel that it would do much more than hammer home the point that GitHub Copilot, at the very least, doesn’t make for a good pair programming partner, nor a good programming tool, search engine, or much of anything else. When the only thing it got me was having to check its output for very obvious errors, and shaking my head in disbelief when I found them, it beggars belief that anyone would voluntarily use it.
When we also get reports that the use of such LLM chatbots is likely to degrade human cognition and critical thinking skills, not to mention the worrisome prospect of cognitive surrender, it’s probably best to avoid these chatbots altogether.
I also generally agree with Advait Sarkar et al. in their 2022 paper that you cannot really do pair programming as such with an LLM chatbot, but that it offers something different: something far removed from using a search engine and digesting various articles, forum posts, and reference material into something new.
Thus, after using an LLM chatbot for some coding ‘assistance’ I’ll be happily scurrying back to my boring references and yelling invectives at search engines.
Comments
I am not surprised you got garbage results; Copilot is far less capable than the alternatives at accomplishing anything.
If you really want to do this experiment again, you really ought to use something that scores higher on the benchmarks. A site like livebench.ai would be a good starting point (also take note: Copilot isn’t even considered there).
The rest of the work is done by providing environment prompts to better steer your LLM. If you were a doctor, you wouldn’t want the bot providing kid-friendly answers, so you would set the environment in a way that the LLM always considers its audience (e.g. “Always assume that I am a licensed medical professional working in the field we are discussing and avoid ‘entry-level’ answers to my questions.”). If you don’t dial in your setup and set the stage, these things often struggle to accomplish tasks.
Seeing your choice of LLM and the refusal to use what everyone else purports to be good just gives this article a tinge of bias from my perspective and seems to be a purposeful decision.
It’s a bit harsh to ascribe malice here. I used GitHub Copilot on its official domain using its provided defaults because I assumed that it’d give me something decent to get started with. It even used a Claude model, so that all seemed all right.
This all feels a bit like blaming the user when they cannot be expected to know those things, and arguably shouldn’t have to research any of it.
While I largely agree with your takeaways in this article, I also agree with Harry here that to claim biases were “firmly pushed aside” for this trial is disingenuous.
This piece is an (accurate) takeaway of the out-of-the-box GitHub Copilot experience, and perhaps more of a commentary on the poor user experience of tools that were rushed to market.
However, this wasn’t entirely written from that perspective… so I think Harry can be partially forgiven for his opinion that this article lacks a certain due diligence.
Is it malice when someone simply tries to cope?
I think that you truthfully tried, but used the wrong tools incorrectly. Try again: start with Copilot CLI (or in VS Code) (or opencode!!! with your Copilot sub), and try it on an existing codebase using small challenges.
Learn to use the tools and their limitations. Oh and try something more common first, e.g. nymphcast
For HaD, turn it into a miniseries!
I wasn’t attributing malice, sorry if it came across that way over text. I tried to reduce the impact by using the word tinge to imply a small amount of bias, more akin to a subconscious bias.
And I wouldn’t say you needed to do any research, as you already had the right answer in front of you.
“Although I am aware that everyone who is into using LLM-assisted programming seems to like to promote LLMs like Claude…”
You just chose to forgo group consensus for ease of entry, and it affected your results. Again, no malice, but I am used to HaD writers applying a little savvy, and seeing very little of it (research/application) influences my perception of the article.
I’m working through my own inherent aversion to the ai coding agents myself right now. I’m certainly not an “AI shill”, despite what the following might suggest.
I was forced to use it for an interview test a week or two ago, which actually got me over some of my hangups, and I’ve had a good few interactions with it since.
Initially I got to use Claude Opus just before it got removed from Copilot, and it’s scarily good. I even had it interrogate an embedded SDK and successfully write code against it. I strongly suspect it’d even handle the painful STM32 HAL competently if I asked it to.
Unfortunately Opus is gone from the cheap plans now, but if you break up your tasks into smaller chunks rather than ask the agent to do the whole thing, Claude Sonnet seems to be nearly as good at coding – it just doesn’t manage the higher level architectural planning that Opus can. An experienced programmer should still be able to get very good results from using Sonnet interactively.
I think much of the trouble you’re experiencing with poor code output is due to model selection. Claude Haiku is selected in the screenshots in the article, and it’s much, much less capable as a coding agent. I personally wouldn’t bother ever selecting it manually, unless I’m asking for some non-coding requests (documentation, maybe?) and really don’t want to waste lots of credits on them.
Instead, GitHub Copilot chat in VS Code (at least) has an “auto” option for model, which seems to do a better job than you’d expect (it also gives you a 10% discount on the request cost); it has often chosen GPT-5.3-Codex for me, which has also been a good performer.
And definitely use vscode (or one of the other code editors) instead of the website chat interface when coding! It gives the agent access to a workspace and persistent project state which is much more effective than trying to keep it all in an agent session.
So far I have thrown an embedded Linux SDK and a few Pi Pico projects at it, along with figuring out a few libraries. I’ve brought code, libraries and symlinks to local data into my workspace and had the agent interrogate those specific resources intelligently instead of relying on its internal knowledge or whatever it decides to look up itself. That seems to result in much higher quality responses, too.
There’s still a lot I will refuse to let the agent do for me, but I’m finding there’s a lot of annoying/boring/time-consuming stuff I hate doing that the agent is more than capable of taking on, leaving me to concentrate on the more fun/challenging parts.
I use whatever comes with VS 2026. Does better than me copy and pasting back and forth.
This. The author evaluates entry-level models using a 2022-style ‘vibe coding’ approach. This is like evaluating an Arduino in 2026 using a pre-1.0 IDE on an Intel Edison. Please retest using flagship models like Sonnet 4.6, Opus 4.6, or GPT-5.5, using a proper coding harness such as VS Code with Copilot integration, while ensuring access to SDKs and datasheets.
AI slop and hype are frustrating, but an LLM is ultimately just a tool. It has its own limitations and strong suits.
+1 …. Surely when writing an article like this you’d want to shell out a few dollars for the best model possible, not some crappy LLM only suitable for dating advice?
Yeah, this whole exercise made no real sense. Signing up for another service is … nothing. Using the best in class is an obvious choice. Choosing Copilot is just wild; no one would have recommended this. It’s all very bad-faith ‘man yells at cloud’.
The problem is knowing what models are good for what, and if you’re not interested in exploring just for the fun of it, it’s very easy to just test a less-capable model and come away with an incorrect view of the entire field. That was basically my experience until I was given no choice but to use the best model available for a task I would never have bothered giving to an agent under normal circumstances and found it actually worked out.
The current problem is that Copilot Pro sign-ups are suspended while Microsoft tries to get the demand under control. So even if you wanted to pay for it, you can’t.
Ah the junior engineer.
https://youtu.be/sB77mj3rZMI
It’s almost impossible to set one’s bias aside and use an LLM successfully if one doesn’t want the experience to work. You can do the same to a new hire, whether the developer is junior level or highly experienced; current development teams can set them up for failure.
This is just an example of that.
When I work with a junior developer, I actually have a chance to mentor them to become better – and they’re an actual person, so that effort is worth it regardless of the outcome. I’ve had many years of experience doing that.
I can’t do that for a specific agent model, because their capability is a fixed value and they’re non-sentient – nothing I can do will improve them. When I try to use one and it comes up short, then there’s no point continuing with it. If all the models come up short, then there’s nothing to do but to wait until more capable models are released.
Until very recently, no models could handle the kinds of tasks I needed them to be able to do – that doesn’t mean they weren’t useful to anybody, just that my particular domain and level were a bit beyond them. It’s fine if you like playing with the technology and don’t care about the end result, but for me they’re purely a means to an end, and they were insufficient. Maya’s experiences trying to get useful embedded code from Haiku are the sort of thing I was seeing, and it left me with much the same sort of negative reaction as she had.
But now I know some models have finally gotten above my personal minimum level, so I can finally get some use out of them. It’ll be really interesting once the local models reach the same point, because a buddy just bought a bonkers AI GPU and I’d rather send jobs to it than to the cloud…
“I can’t do that for a specific agent model, because their capability is a fixed value and they’re non-sentient – nothing I can do will improve them.”
This is not true.
First, as a starting point, all the information it needs may be in the model; however, there are layers and layers of similar data, and you need to get it to give you the right data, not the wrong similar data. A wrong result does not mean it cannot do it.
I’m not going to sit here and explain in detail how to get good results, because it can be very specific to the project, but I will include some notes.
What you need to do is set up a context for the conversation, initial precepts that bias the end result in the direction you need.
I’ve successfully had ChatGPT give me Python and Swift code that can process IQ data from an SDR, and configure my own library (a Swift interface) to the BladeRF library. Keeping the AI on point with Swift 6 can be a challenge, because more examples exist online for older versions. But it can be done.
Also, people, including myself, have successfully gotten it to write code in custom languages. This isn’t training in the usual sense, but the models are advanced enough to actually work. LLMs work more with concepts than people think. (And if you understand how they tokenize input, you’ll start to understand why that might be.)
This is a very poor way of looking at the results. A wrong answer means you cannot trust the output at all, which is far, far worse than you imply.
You know you can write test cases and evaluate the results, right?
People applying software engineering principles actually do get good or even excellent results.
A half-assed approach will get you, well, Stack Overflow vomit.
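To make that concrete, here’s a minimal sketch of the idea, where any generated routine is treated as untrusted until it passes explicit checks; parse_u16() is a made-up stand-in for whatever the LLM produced:

```cpp
// Gate a hypothetical LLM-generated helper behind test cases before
// trusting it. The helper itself stands in for generated code.
#include <cassert>
#include <cstdint>
#include <string_view>

// Hypothetical generated routine under test: decimal string to uint16_t.
uint16_t parse_u16(std::string_view s) {
    uint16_t v = 0;
    for (char c : s) {
        v = static_cast<uint16_t>(v * 10 + (c - '0'));
    }
    return v;
}

int main() {
    // Each assert encodes a requirement the generated code must meet.
    assert(parse_u16("0") == 0);
    assert(parse_u16("123") == 123);
    assert(parse_u16("65535") == 65535);
    return 0;  // Only once all checks pass does the code get accepted.
}
```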
No, that doesn’t make sense. You’re conflating poor results from an inadequate prompt with poor results because the parameter space of the model simply doesn’t extend far enough. If all you ever needed to do was write a better prompt, then I could take a crap model from last year, give it a good prompt, and get Claude Mythos level capability or better. Which is obviously out of the question.
Nothing I tell an agent is going to modify the weights of the model outside of that one session. And the more I try to cram into a session, the more noise enters the context window until the agent loses the plot and stops being helpful.
Every model has a fixed limit to its capability that you cannot exceed. At best you can take the output and massage it into better shape manually, but often it’s easier just to do the whole thing yourself in that situation.
Nah, see, a junior engineer will say “I don’t know” or “let me look at the datasheet”, while AI will tell you the wrong thing as if it’s fact.
Bud, I hear you, but I’ve worked with plenty of junior and senior engineers that will die on various hills over patently wrong information they came by honestly. Parsing the signal from the noise is the trick, and AI can be real helpful for that or real detrimental.
Even better, the junior will remember it and won’t do it again, because they are trying to impress you. Also, the junior will now have a memory of this area of the code, so you might not have to go into it next time. They also might turn into a senior one day and help you find a job later on. Or they might even pick up pizza for you or go out for drinks and say something funny.
So can an intelligent agent. The LLM is not the agent (or the software); it is the OS on which the agent runs.
Add persistent memory, a set of core beliefs and you end up with an AI that learns from its mistakes and can turn data into knowledge.
And you can absolutely put an agent on ice and reinstantiate the context (resurrect them) to work on the code they wrote, hours, days, or even years later.
This article is like saying “cars suck” after reviewing a Model T in the age of Rivians.
A good junior engineer will tell you that, but not every one is good. Nothing an AI does is novel – we teach them like we teach humans, and they’re susceptible to learn the same bad lessons humans do.
+1 …. I thought this to be the case as soon as I started to read the article.
I had success with getting useful code from Grok Expert mode but then wanted to move to a local AI engine for code privacy. I’m running unsloth’s version of the GLM-4.7-Flash model with Q_5 quants on my desktop and with the model temperature and other settings fine-tuned for coding expertise. Running this 100% on an RTX 3090 with 24GB VRAM. If I keep my context manageable, I’m getting around 100 tokens/sec speed and have been pleased with the code generated. Managing expectations and requirements in your prompt is key, of course.
After I’d been away from the project for a while, I prompted the model with my memory of how an encryption strategy (that I’d used in the program) had been set up. My memory was not 100% correct, and the AI model did a perfect job of analyzing the code and pointing out what I was forgetting.
It’s been a positive experience for me so far.
I highly suggest not going back and forth correcting the LLM. Your first prompt must produce an output that works, or you should reprompt until it does (or don’t, and recognize the failure). Subsequent responses of “hey, this did not work, please fix” will get you a dozen 90% correct answers. This is why vibe coding is really just recognizing workable sections of code and hodgepodging those together.
If the model can do a coding task, it must be correct on its first attempt. Code that is 90% correct will be reworked into a very similar but different, and still only 90% correct, version. LLMs cannot debug code in back-and-forth conversations; it is a waste of time. One prompt resulting in runnable code, or restrategize and iterate.
Not true in my experience using a decent model.
Hmm, Copilot is not an LLM, so we still don’t know what you have tested. $20 gives you a month of Claude and the ability to work with Opus4.7 in Claude Code. There are some great tutorials online how to get the most out of that experience, and it’s best to watch a recent one, because models and the way you should interact to get the best result, change rapidly.
The screenshot shows that they are using the Claude Haiku model. It’s fine for simple coding tasks, but I bet if they tried Opus they would probably get better results.
But it’s kind of funny that the article talks about not using Claude, then uses Claude anyway via Copilot.
Honestly an article like this is a good thing, because it shows exactly the sort of pitfalls a novice can fall into when they start trying to use these systems. Just like any other technical thing – the experienced users easily get better results but do a crap job explaining it to a noob.
So someone talks about their poor experience, and then someone else goes “oh! you didn’t do X Y and Z” supposedly obvious things, which is the only way they actually get explained anywhere. Hopefully that gets some new users over the initial hump.
The writer of the article has no interest in success. They poke at it, claim it’s unable to meet basic needs and declare that it’s useless and they don’t need it anyway.
This article is evidence of exactly how to get poor results.
hmmm… seems less an exercise in pair programming than vibe coding to me.
I thought pair programming was primarily to force a conversation, rubber-ducking if you like.
“Right, I’m going to do this, this way -waddya think?”
Also, I associate, perhaps wrongly, pair programming with TDD (test-driven development), where player 1 writes tests and player 2 gets the tests to pass. I tried this and swiftly found myself taking over the player 1 role ’cause the LLM was taking shortcuts, going back and changing tests, etc.
I use Copilot CLI (it fires up the models it wants; I am not all that sophisticated yet) all the time now, albeit for far more menial tasks than described above, but even so I have to stop now and again and read the code. Even my sense of “code smells” has returned.
I use LLM coding agents (mostly in QA mode, like “what approaches might I use to accomplish this goal?”) because management says to use it. They’ve sipped a bit too much from the tippy-cup on this fad. When they inevitably change their mind, I’ll stop using LLMs. Hopefully, I’ll still have relevant skills when that change comes.
The problems are always in the training data.
Ask an ‘AI’ about well-trodden problems with common languages and libraries and it can return reasonable answers (that it just copied into statistical memory and regurgitates.)
Ask an ‘AI’ about a subject full of clowns and it will return clown code (that it just copied…).
Go ahead, ask ‘AI’ about server side JS or something equally brain dead and laugh or cry.
The problem is going to be that many students will succeed in using AI to solve contrived academic problems (that are all over the dataset/usual sites).
They will be even worse than previous batches of recent college graduates.
My go-to interview question for newbies will remain ‘What programming languages did you know when you started college?’
“Equally brain dead” — your bias is showing.
I’m not a huge fan of Node.js, but I have used it, and with AI help built a real-time dashboard that runs with Node.js as an API handler, React as a front end, and Redis as a data store, and implemented MQTT and some proprietary connection types, using it for prototyping a stream and display for the PSKReporter MQTT firehose. A WebSocket-updated live dashboard built that way worked surprisingly well.
Implementation of WebSocket with subscriptions etc, with a push for efficient data flow was generally successful.
I’ve also used it to build a Python layer that processes data from Environment Canada to handle weather conditions related to astrophotography, generating image tiles with the next ~72 hours of projections for seeing, transparency, etc. This is a partial duplication of someone else’s workflow, as the initial developer passed away. It is, however, my own work combined with the AI, not a clone or a fork.
The inability to get something useful from a current LLM has more to do with the user than with the viability of the tool. Eclectic languages aside, it handles a lot of stuff very well.
It can fail at nuance, staying on point with versions, etc., but that’s usually manageable.
Tried something similar very recently, had a mixed but generally quite positive result.
Started with a problem (making IPv6 SLAAC work with multiple WireGuard peers) that I thought I ought to be able to come up with a solution for. Tried a few things, like trying to capture neighbour discovery packets to identify a peer’s new IP address (with a view to a script calling the wg command to add an allowed-ip), and the quickest way to get started was to google the correct tcpdump options to do it. This put me into a conversation with Gemini.
Once I’d established that my ideas weren’t going anywhere, I asked Gemini for some, and it suggested eBPF (Extended Berkeley Packet Filter) to hook appropriate calls in the WireGuard module. I wasn’t familiar with eBPF at all, so I asked for details and then some example code. Previously I would have just done a web search to find this, but Gemini was able to tell me exactly what I wanted and give not just general example code, but code specific to my task. I got it to write the whole thing, and went through various iterations. I was keen to see what it could do, so I didn’t intervene much directly in the code, but I did need to understand what it was doing and give it some pointers that I think would have been difficult without a coding background.
There were times when we went round in circles, repeatedly switching between a few different ways to do something, none of which were working, but each time was presented confidently as the solution to the previous failure. Probably should have just got stuck in myself, but at that point I felt invested enough in the experiment to see it through.
In the end I/we/it achieved what I wanted (a wireguard helper to enable proper SLAAC support), and it was a bit of a fun way to learn some eBPF. Then I even got Gemini to write it up for my blog.
“certified introverted loner developer”…”nodate”
Yep, that checks out. :-)
It’s Japanese for ‘outdoor tea ceremony’ actually, but you’d have known that if you had read the friendly project page.
It was just an attempt at humor. You know, friendly jokes and things approximating them.
“:-)”
Could you share the prompts you used to generate these outputs? Seeing as how this appears to be your first attempt at using an LLM to code, I’m guessing your prompting needs a bit of refinement. It seems like you’re attempting to vibe code, but under the assumption that you’re pair programming, and the two are not the same. You should try to use the LLM as a tool to accomplish your coding micro-goals and not as a magical black box that spits out perfectly working software from a single sentence (a common misconception).
Also, choosing the github free LLM does seem to have been a bad choice. Even the free Gemini model codes better than this. Compared to your experience, I haven’t had this much trouble getting good results out of an LLM since ChatGPT 2. I also see myself as an introverted loner coder, and I greatly prefer using an LLM to stack overflow. That being said, I still have to do web searches at times. LLMs are only tools. They just seem like they have personalities and that tends to complicate our relationships with them. If we had talking hammers, we might all be complaining about how they don’t make omelettes as well as we expect, and why won’t they just do what I ask.
Haiku is definitely a suboptimal choice for actual coding, although it certainly manages simple tasks that don’t need much reference to external information like libraries and APIs. For the sort of things Maya is trying to do, Sonnet and GPT-5.3-Codex are actually useful, and Opus is scarily good (but Opus isn’t available through Copilot anymore).
I just read that GitHub Copilot is getting an updated pricing plan as well, going all token-based billing. Of course they suggest that some people will end up paying less, but that has never been my experience once a service starts charging pay-as-you-go style. There’s a reason no one accepts data limits on home internet connections: the ISPs would much rather you pay for every byte you download.
On the topic of improving results, I also recommend only believing the AI on extremely common questions about extremely popular ICs. Otherwise, the only way to have a snowball’s chance in hell of not getting a hallucination is to provide datasheets prior to asking questions. On most complex parts, this is a must. If the LLM can directly reference your materials, it can provide solid answers and citations to the proof.
You’ve clearly not used the modern coding models, because your observation doesn’t fit the data.
The old pricing model simply wasn’t adequate for the new agentic workflows, and at the moment users are using far more compute than they’re really paying for, even with the per-token pricing plans. There will be a correction at some point when the hype dies down, but there will also be large jumps in model efficiency – so who knows where the prices for competent models will eventually land.
Right now a buddy of mine has a 96 GB RTX 6000 for inference, and he claims that the current open models which can be run locally are only about a year behind in terms of capability. It’s only a matter of time before you can run an Opus-level agent on a home system that only costs a few thousand dollars.
People are working on the problem.
https://youtu.be/jt_LZYJ2mIo
My approach was to use basically the same queries as I’d use with a search engine. E.g. ‘how to run an STM32F411 at maximum clock speed with CMSIS and C++’, although for a search engine you can abbreviate that to keywords like ‘Maximum clock STM32F411 CMSIS configuration’, or something similar. I was not expecting to get a full code listing back, but that’s what these LLM front-ends appear to be configured to spit out.
A friend of mine has been abusing ChatGPT and others for various projects and shared some of the results, so I have some idea of how to prod these chatbots. My goal was to get some search results back basically, but it insists on giving back a full ‘solution’, flawed as it is.
Get VS Code and ask copilot to do stuff in the open project. Set up a copilot instructions file to establish the standards you expect. These are things you can do if you want actual pair programming, instead of treating your AI colleague as a search engine.
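As a hypothetical illustration (the wording here is invented, not an official template), a repository-level .github/copilot-instructions.md for a project like Nodate might say something along these lines:

```
Always target bare-metal C++ with direct CMSIS register access.
Do not suggest the ST HAL or Arduino APIs.
Include the vendor CMSIS device headers rather than redefining
peripheral structs or register masks inline.
Assume an STM32F411 target unless stated otherwise.
```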
Howdy. Love the article and love this series, frankly. However, I will say that Copilot is known to be one of the worst coding assistants. If you can, I know you said you’d prefer not to, but just give Claude a try and see if it can help.
I’d need to give my personal information to Anthropic to even use Claude, which I’d much prefer not to.
That’s exactly why I quit signing up for Claude. They wanted my phone number to make a free account. NOPE! Done with Anthropic. You can try Gemini 3 for free with no data required. Amazing coming from google, but it’s not exactly their best model either. If you’re trying to convince people to give you money, I’d think you’d let them try the best of your offerings instead of the worst, but .. what do I know?
ChatGPT also requires no information to try the free tier. It’s definitely better than Gemini.
Copilot isn’t a model, it’s an interface to models from many vendors. You can easily access anthropic models through it, and in fact Maya was using Claude Haiku. Haiku just happens to not be good enough for the kind of embedded tasks she was asking of it.
Quite right. However, I understood that Copilot should automagically match the task to the best LLM for the running job. When I spot a model change happening I have asked “why did the model change just now?”, and that has been the consistent reply via the CLI. Am I being moved for convenience, or to save money somewhere down the line? Sorry, not qualified to answer that, but I rather suspect it’s BS.
You have to actively select auto model selection for that, and you have to at least have a pro subscription so you have access to the better models. The entire point of the auto selection is to choose the cheapest/fastest model that can successfully complete the request, not the “best” model. If that was the case, you’d just be getting Opus all the time (well, until it was removed), which would be a waste of credits and compute.
Additionally, the models cost more than we’re currently paying for them, and there’s a critical mismatch between supply and demand. You can’t scam profits out of a system that is completely incapable of generating them in its current form.
Some VCs were lucky to get together with their capital in the first place.
It is an immoral and unethical act to let a sucker keep his money.
I have used AI to write 1000-page technical books. They will not win any prizes, but they are relevant and coherent collections of text that match the meta-seed that I use to define them. The point being that you can one-shot huge results if you impose structure on the process algorithmically. In programming this may be applied via descending layers of abstraction: starting with conceptual discussions and elaborations on all of the relevant math, then defining the architecture and drilling down the stack, building up the layers until you are iterating through the definition of structures, functions, etc. The key insight is that an AI is a random walker through a biased space of languages, and it is biased by the context you supply before you ask it to do the next step; the entire interaction history grows and keeps the AI on track, ensuring that the chunks of low-level code you are iterating through are correct and relevant. This also works well with agents, as the distinct parts of each layer can often be generated concurrently: the interface of a function is stabilized by the layer of abstraction above it, and so long as the code for a function passes the test for it, it can be relied on by the other functions and code that call it, so they can be written at the same time.
There is a perfect pair: the technical lead, the person who likes to architect, and the coder, the one who loves to code. That pair is smooth like butter.
LLMs on the other hand are more like the clueless idiot making suggestions while you’re typing. Sometimes they get it right; sometimes they don’t.
I’ve found an LLM to be a great coding partner. The issue is mostly that we are doing it backwards. I stumbled on this doing work for a client that doesn’t want its entire codebase uploaded to an external bot, but is fine with letting me use it in a more limited fashion.
My process has been to use a simple chatbot, not a dedicated coding tool. I’ve been using ChatGPT 5.4 thinking (now 5.5, but I haven’t had a chance to try this yet with the newest one).
I simply use it to reason out loud about architecture, from the highest level needed for the task at hand down to roughly the individual function level. I am doing most of the design, but I find its comments often valuable; they usually result in improvements I would have missed or had to iterate more to get.
Once we settle on an architecture, I mostly become the LLM. I ask it to prompt me. I write the code. Call me old fashioned but I still care about code quality, and I still do a better job of outputting quality code by hand.
Sometimes I let it be the arbiter on small stylistic choices, if I find myself going back and forth. Its greatest value is in reasoning externally about such things. It is now primarily a super detailed todo list that lets me write without having to hold the structure in my head.
At this point it is also my stack overflow (minus the toxicity), API reference, etc. For certain types of bounded functions, I will ask it to generate. Or I will ask for snippets, usually no more than a dozen lines at a time.
Finally I ask it to review the partial code. Sometimes it can review a whole function before the other parts are in place — it can be a compiler before the code is done enough to compile.
The end result is that I can code maybe 50% faster, with higher quality and a lot less fatigue. That isn’t the 10x “productivity boost” that most companies these days want to hear about, but it produces solid steady results without technical debt.
It’s hard not to wonder what would happen if one or several people decided to “Report Comment” over and over again, hijack the starship. Of course there’s no way to “Report” “Report Comment” abuse/overenthusiastic use. It would be more work than I’m interested in doing to click the “Report Comment” button fifty-some times but clearly there are some industrious scamps who don’t have that limitation. Clearly the HaD Comments page is perfect just as it is and it would be blas-feemy to change a dot or tittle, one iota. Besides, where would you find someone with the skills to do the work?
I assess that theft (copyright violation) in model training is normalised now.
I get the impression that those who succeed in applying LLM models and create maintainable, well structured, understandable systems already are capable of creating such software.
Will newer generations of developers still ‘have an idea’ of software engineering, of what makes a sound solution and how to arrive there? When and where might they get that?
Avoiding chatty programming like a plague.
Disclosure: I’m about to activate my ejection seat, observing from a little distance :)
I have been “pair programming” for a couple of years now, and I am not an “AI shill”. I am a disabled veteran and due to really frequent migraine headaches and a spinal injury, I can’t be hunched over the keyboard for the same amount of time as a healthier person. Pair programming is amazing. It is not easy. It begins as a very frustrating journey. Learning the quirks of a model is a pretty labor intensive job, but creating the co-developer persona makes “vibe-coding” and my audio assisted coding sessions very productive.
For me, learning guardrails, the basis for refusals, and setting up a good RAG pipeline with documentation and examples of your preferred coding language makes all the difference. I enjoyed reading this and enjoy seeing LLMs get lighter and smarter. What used to take a frontier model like Claude can now be done privately with the right open source MOE model. I enjoy the new Gemma 4 26B on my home rig.
Feels like a 2024 article. Just watch people using agents full time; there are enough videos online. You are not supposed to program with chatbots.
At this point I’ve given up on figuring out what one is “supposed” to do anymore.
One blames the choice of model, instructing one to use the latest, greatest, most premium. Another blames the style of access, saying you have to use a specific application to access it. Then someone pops up saying the problem is that you aren’t prompting it properly; you’ve got to use a specific template shared on the deep web or something. A few will argue that you shouldn’t be using a chatbot, but one of those agents. There is even one suggesting the right way to do it is in reverse, with the bot asking you to write the code!
AI has become one of those things where everyone is convinced they have figured out the “right” way to use, when in reality there is no single right way. Just a lot of approaches that may or may not work for specific situations and/or individuals.