
For some time now, IT managers and developers have been bombarded with alternating predictions of doom and heaven. AI is finally coming for software development jobs and, depending on what kind of media you read, coding assistants are either great tools to help you become more productive, or about to replace you or (part of) your team. That is, of course, unless you work on developing AIs or applications with AI, because then the sky is the limit.
This is a new market that is developing rapidly (pun intended): every other week new products are unleashed, along with a seemingly unending stream of reviews and comparisons. So, rather than trying to decide which is best, or poking fun at some hideous programming mistakes, let’s instead look at what market these products are trying to target, how well they manage that, and what conclusions we can draw for the future.
Why are AI coding assistants flooding the market?
This is actually a lot more unusual than you might realize: most of the time, developer tooling starts out with Open Source projects rather than commercial ones, with the possible exception of low- and no-code environments. Developers tend to “scratch” their own “itches,” as it was termed in the early days of the Internet, and readily share their tools when their employer is not operating in that particular niche. Developing the engine powering an AI of the current generation, however, is horrendously expensive, and any free offerings are either using their downloaders as testers, or their owners have some other way of recouping the investment.
I did not refer to low-code and no-code by accident, because that particular niche was born from the same itch that many AI coding assistants are targeting: the “problem” of developer productivity.
The Need for Speed in Software Development

Traditionally, software development is hampered by a lack of skilled developers. While we tend to see COBOL as a venerable, but now finally outdated (actually, it is not: there’s more of it than you could envision in your darkest nightmares) programming language, it is actually one of the early attempts at a low-code system. Scroll down in the Wikipedia article on COBOL to find some actual code examples, and you will see they read like something approaching conversational English.

If you haven’t learned the art of “Enterprise Software Development,” you can easily underestimate how much care and effort needs to go into the realization of something like “just a screen to enter my address,” to take a random example. I’m not going into the causes of that here, but the upshot is that IT management is constantly worried about the number of developers needed compared to the numbers available, and the speed with which they can produce working applications. If you’re unlucky enough to have the metaphorical wall between “Business” and “IT” in your company, the worries are increased exponentially by the amount of software development happening outside of IT.
So, there is only one way out: find ways to increase developer productivity. Managing by Key Performance Indicator, or KPI, is, in this field at least, an almost hopeless approach, because developers (1) tend to actually care about the quality of the code they produce, and (2) are very good at analyzing problems, such as how to meet a target with the least effort involved. So, you want lines of code? Fine, you get lots of lines. You want test coverage? No problem, everything will be covered. Just don’t expect those numbers to reflect an actual change in productivity, because the developers were mostly already doing the best they could, since they (hopefully) care about their work. So instead, management is convinced to buy low-code and no-code environments, which promise to convert non-coders, now called “citizen developers,” into productive coders.
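To make the coverage point concrete, here is a minimal and entirely hypothetical C++ sketch of how such a target can be “met”: the test executes every line of the function, so the coverage report shows 100%, yet nothing is verified.

```cpp
#include <string>

// Hypothetical production code: format a postal address line.
std::string formatAddress(const std::string& street, const std::string& city) {
    if (street.empty() || city.empty()) {
        return "<incomplete address>";
    }
    return street + ", " + city;
}

// A "test" written to satisfy a coverage KPI: it executes both branches of
// formatAddress, so the coverage tool reports 100%, but there is not a
// single assertion, so it can never fail and verifies nothing.
int main() {
    formatAddress("Main Street 1", "Springfield");
    formatAddress("", "");
    return 0;  // always "green"
}
```

The numbers go up; the quality does not.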
Moving Quality Left
If you look at where the most effort is spent on an application, you’ll find that getting an application up and running is usually not the issue. That is something IT Project Managers have traditionally become real adepts at, by making careful agreements on how to define “Finished.” This target is not as fixed as the date it should be delivered on, which is why they include equally careful agreements on how to adjust the target if needed. As a consequence (the 80-20 rule, yes!), after 80 percent of the planned work has been happily delivered, the second 80 percent of the work is done “in maintenance.” Worse: what has been delivered is often not entirely finished, will likely still have some low-priority bugs, and will certainly need updates to reflect new requirements.
The goals we should strive for are twofold: set our targets less ambitiously, and move quality left.
- If we take small steps, we are much better at estimating how much time it will take, and less likely to plan for things that we need to adjust even before they are done.
- Next, we should ensure that developers are not rushed to prioritize the delivery of features, now commonly referred to as “velocity,” over making the delivered code as bug-free and well-designed as possible. I’m not talking about polishing diamonds, I’m talking about the aphorism “Always code as if the guy who ends up maintaining your code is a violent psychopath who knows where you live.”
The best way to support this is to have regular sessions where you don’t code alone, but work together on one piece of code, one at the keyboard and the other sitting next to them. That is a very difficult thing to accept if your gut says that you now have two people doing the work of one, but that is not actually what is happening. See it as two pilots in the cockpit, one “pilot flying” and the other “pilot monitoring.” Both have tasks, both do things, but most importantly: they constantly talk to each other about what is happening, both introducing and validating ideas, and catching errors before they end up in the code base. Look at it from the perspective of the lifetime of that piece of code, and they are more productive together than they would have been alone. This is what we call “pair programming.”
And now the big question: if a second developer is most valuable as a pair programmer, can an AI coding assistant fulfill that role?
So, what does a coding assistant do?
Sidestepping what is happening in research for the moment, we generally have two different approaches to software development by AIs: smart autocomplete and chat.
First of all, the assistant can try to predict what the developer is about to type. It suggests, inline, the most likely bit of code that should follow. This is generally referred to as “autocomplete” or “inline autocomplete,” and is an improvement on what was already pretty commonplace: the editor has information on valid continuations of what is being typed and provides suggestions based on that. All modern IDEs can do this in the sense that they have knowledge of the programming language used and can suggest names of keywords, types, variables, and functions. The basic AI prediction extends that with the most likely expressions or whole statements, given the current context. More complex predictions use the meaning of names, or comments about the upcoming code, to suggest larger blocks than just how to finish the current line. The developer can then accept the prediction in part or in full, or reject it and continue typing, hopefully giving the AI an “Aha, so that is what you want!” moment and letting it retry.
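As a small illustration (my own, not the output of any particular tool), imagine typing only the comment and the function signature below; an inline assistant will typically grey in a suggested body that you can accept with a single keystroke:

```cpp
#include <algorithm>
#include <iostream>
#include <stdexcept>
#include <vector>

// The developer typed the comment and the signature; everything inside the
// braces is the kind of continuation an assistant might propose inline.
// Return the median of the given values.
double median(std::vector<double> values) {
    if (values.empty()) {
        throw std::invalid_argument("median of an empty vector");
    }
    std::sort(values.begin(), values.end());
    const std::size_t mid = values.size() / 2;
    if (values.size() % 2 == 1) {
        return values[mid];
    }
    return (values[mid - 1] + values[mid]) / 2.0;
}

int main() {
    std::cout << median({3.0, 1.0, 2.0}) << '\n';  // prints 2
}
```

Whether the suggestion comes out this clean depends entirely on the surrounding context; the point is that the developer reviews and accepts or rejects it, rather than typing it out.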
The alternative approach is for the developer to switch to a separate window and start a chat with the AI, giving context and describing what is needed. This is about the same as what we already know from applications such as ChatGPT. It is not uncommon to start with such an approach to discuss alternative solutions, before switching to the code editor and working it out. Generic AIs with a chat interface can easily support this second approach already, because they have been trained on the Internet’s vast collection of content, which includes a lot of (discussions on) software. Specialized coding engines have had their training focused on software development and are biased towards such tasks in their answers.
Current Coding Assistant Strengths

There is a lot of variation in how the different assistants have been trained. Just as the publicly available engines are primed to avoid answers on specific topics, or to respond only after a bit of “query tweaking,” some assistants are only meant for specific programming languages and will try to stay away from the others. The advantage is that a lot more quality control can be put on the training materials. An AI can also be tooled to check that an answer is not taken straight from a public repository whose quality cannot be verified, or which may have licensing issues that you want to avoid. Also, while you do want to feed back the usability of the answers, commercial users may want to prevent their code from being exposed to other customers. This functionality is readily available in the current generation of tools.
Current Coding Assistant Weaknesses
While the current generation of tools does pretty well, and developers who start using them typically don’t want to lose them, all is not sunshine and roses. Generative AI is essentially about generating a (random) piece of text that is most likely what you want. Just as generations of programs have fooled us into believing we are chatting with a real human, because their replies appear reasonable, there was (and still is) no real understanding of what we’re talking about. If the Internet lacks a good collection of examples of efficiently used C++ from the newest standard, the AI will not be able to produce it either. Giving it my entire code-base will certainly make it produce fitting additions, but they’ll look like what I already have; it will not extrapolate that into the unknown. It may be able to discuss the new additions to C++26, but ask it to use them and it will be lost, because none of its training data used that version. If you’re unlucky, it will give a reply claiming it used that version, while it actually didn’t.
There is a whole new branch of IT-related jobs now developing: that of AI-prompt developer. Here the task is to find the best formulation of the AI’s pre-cooked instructions, so chat sessions will produce the wanted responses while preventing unwanted ones. There is no guarantee of success, however: adding more information can steer the AI in the right direction, but you’d better also add instructions to never let it reply if it isn’t sure about the result. And even then, this is not the same as validating the response. Instead, it is adding weight to an evasive response that hopefully outweighs the wrong one. Yes, this is a matter of statistics and probability, because that is how neural networks, the technology behind all this, work. So usually, companies employing an AI chatbot will add a warning that an AI is in use and its answers may be wrong. Sorry to burst your bubble, and that kind of thing.
ChatGPT attempting to be AlphaGo
Ok, the current generation of generative AIs helping developers has its limits, but it is also a huge leap forward compared to what we had 5 years ago. Can we get over the current hump? As an example, let me tell you about a little experiment I did, prompted by what may well have been the impulse behind the current sprint in AI development: AlphaGo. Google’s DeepMind surprised not just the world’s Go professionals by creating an AI capable of beating them. It also surprised the world’s AI researchers, because the game of Go was considered the “unbeatable” challenge. Now, ChatGPT is most emphatically not to be compared to AlphaGo, as the latter was purpose-built for the game. However, ChatGPT does know the rules of the game. In fact, it told me:

I can’t play the game of Go directly, but I can help you learn about it, provide strategies, explain rules, or analyze moves in a game. If you’d like to play a game, I can assist by taking your moves and suggesting responses. Alternatively, you could use online platforms like KGS or OGS to play against dedicated Go engines or other players.
Would you like to learn or play a game?
So, while it starts by saying it can’t play the game directly, it still offers to teach me, analyze positions, or even suggest moves, ending with … an invitation to play. When I agreed to a short game on the smallest board size, 9 by 9, the game quickly degenerated due to a complete lack of insight by ChatGPT into what was happening on the board. In the end, it started playing moves on top of existing stones, which is not allowed, or saying it played a move and then showing a board with that move somewhere else. It apologized for breaking the rules, but kept doing it.
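To illustrate how mechanically simple the rule is that it kept breaking, here is a minimal sketch of my own (not a real Go engine, and no part of the experiment): checking that a stone is not placed on an occupied point takes a single lookup.

```cpp
#include <array>
#include <iostream>

// A 9x9 board where each point is empty or holds a black or white stone.
enum class Point { Empty, Black, White };
using Board = std::array<std::array<Point, 9>, 9>;

// Only checks bounds and occupancy; real Go also forbids suicide and ko,
// but occupancy is the rule ChatGPT kept violating.
bool isLegalPlacement(const Board& board, int row, int col) {
    if (row < 0 || row >= 9 || col < 0 || col >= 9) {
        return false;                        // off the board
    }
    return board[row][col] == Point::Empty;  // occupied points are illegal
}

int main() {
    Board board{};                           // all points start empty
    board[4][4] = Point::Black;              // a stone already on the board
    std::cout << isLegalPlacement(board, 4, 4) << '\n';  // 0: point occupied
    std::cout << isLegalPlacement(board, 2, 6) << '\n';  // 1: legal move
}
```

A purpose-built engine never gets this wrong; a text predictor has no such board in its head.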
Afterwards, I asked ChatGPT to explain its reasoning, but the answer was not based on any memory of how it arrived at its response. Rather, it provided a new response that somewhat fitted the move. With the latest preview engine, which is supposed to add reasoning capabilities, it gave excellent generic explanations, like “I was aiming to extend my influence.” At the same time, though it expressed the right intent, the move it played never realized that intent. It felt like it was telling me what I hoped to hear, without actually understanding its own words.
Wrap-up: Intelligent Search or Reasoned Reply?
Let’s be honest: I like what the AI revolution has done for software development. When I lack a human developer near me to discuss an idea with, talking through a possible implementation in an AI chat session does sometimes get me great responses. In the editor, the suggested continuations are great when churning out large blocks of predictable code. It is a huge improvement on what we had just 5 years ago. But still, I cannot see it replacing human coders any time soon, because it hasn’t learned to truly understand what makes quality code. It spews out mistakes just as quickly as the good stuff, so it still needs a human to validate what it did. Code being the puzzle that it often is, that validation requires a deeper understanding than simply “Is this what you want?”
Just last week GitHub wrote that their tests show GitHub Copilot increases code quality. Their key findings are:
- The developers using Copilot had a higher likelihood of passing the unit tests set for the study,
- Code was considered more readable, with more lines of code produced before readability issues developed,
- Several code quality metrics increased significantly, in accordance with the 2024 DORA report,
- Code was more likely to get through a review.
The first thing that caught my eye was the reference to the DORA report, and indeed it also predicted a quality increase for code when using AI, but the increase mentioned in both reports is well below 5%. The biggest increase was for the quality of documentation, at 7.5%, which is not surprising given the natural-language abilities of these AIs. Another study from a few months back instead saw an increase in bugs, with one from yet further back reporting a “downward pressure on code quality.” So things look to be improving, but not yet consistently, nor with really impressive numbers.
Have we created the ultimate yes-men?
These studies were all about developers working by themselves with help of an AI, but we already know that this is not how we can get the best results. Let’s step up the game: could an AI qualify for pair programming? For that it needs to understand the task as well as the problem domain — not just phrasing the correct strategy, but also validating the tactics and execution. In pair programming, developers collaborate closely, creating a synergy that allows them to write, review, and refine code, leveraging their combined knowledge to produce higher-quality code than either could have produced alone.
This level of understanding requires an awareness that, for an AI at least, remains firmly in the realm of Science Fiction. In the nineties, we pursued this through knowledge bases, fed into inference and case-based reasoning engines. Today, we appear capable of simulating that awareness, but in our haste to capitalize on being able to pass the Turing test, we forget that the AI is still just guessing. For it to improve, it needs to genuinely know what it’s talking about, just as it must understand its own limits. A legal disclaimer is not an awareness of fallibility. I’d rather hear “Let’s find an expert and ask” than the statistically most likely response I want to hear. I am afraid that we’ve created a coding yes-man, whose ability to please the human is more important than the quality of the result. Sure, it can take over some of the work, and that has a positive effect, but is it really what we need?