Why ChatGPT is the worst programming language ever
Lee Iverson is a Co-Founder and CTO at Ibbaka. Connect with him on LinkedIn.
I’ve been a professional software developer for 45 years now; I have a PhD, have taught software engineering at UBC, and work as the CTO of a small but leading-edge software and consulting company. I like to tell people that it’s difficult to name a successful programming language that I have not written programs in. So, I’m pretty familiar with programming languages and their variations. Whether it’s structured or unstructured, functional, declarative or procedural, high-level or low-level, I’ve probably used it.
So, what’s the purpose of a programming language? Ultimately, it’s to codify how to solve a particular problem: to describe in detail how any new input can be processed and transformed into subsequent outputs and other actions (e.g. flying a drone or monitoring the temperature and humidity in your refrigerator). In essence, it’s a way of telling a computer what kinds of input to expect, exactly what to do with them, how to handle different inputs differently, and then exactly how to respond.
Programming languages vary by the kinds of problems they’re designed to solve (e.g. systems programming or web programming), how you describe the problems and the means to solve them (e.g. logic programming formed the initial foundation of Siri), and the kinds of output they’re capable of producing (e.g. there are languages called “shader languages” specifically designed for producing textures in computer graphics systems). Sometimes a language designed for one thing becomes particularly useful for something completely different (e.g. those “shader languages” were adapted to become the foundation of training and running generative AI systems on NVIDIA GPUs).
So why would I refer to ChatGPT (and I’m really using that term as a placeholder for all of the conversational generative AI systems including Perplexity, Anthropic, DeepSeek, and their APIs) as a programming language? Well, it is a system for taking input provided by a user and producing “meaningful” output. If it stopped at that then I probably would just continue to think of it as a nice little machine, useful for augmenting web searches and summarizing long pages. However, it has been constructed and is being widely used as a programmable component in AI-assisted systems, primarily as backend technology. That means that it’s being integrated into software systems that are designed to do a bunch of different things, all of which can be characterized as depending on text processing and interpretation.
At Ibbaka, we are using it to streamline the creation of value models for our clients via a series of semi-structured “prompts,” which include instructions, context, examples, one or more descriptions of roles that ChatGPT should adopt, and ultimately a question or direction to “do something”. This pattern of constructing a sequence of prompts in order to produce semi-structured text is referred to as “prompt orchestration,” but it is really a model for using ChatGPT as a programmable text and knowledge processing system. The preferred prompt “language” is actually the programming language for that system, and these prompt sequences are the programs themselves.
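To make the pattern concrete, here is a minimal sketch of what such an orchestration can look like. The `call_llm` helper, the `build_value_model` function, and the prompt text are all hypothetical stand-ins rather than Ibbaka’s actual implementation; the point is only the shape of the “program”: role, context, an intermediate question, then a final instruction.

```python
# Hypothetical sketch of "prompt orchestration": a sequence of prompts
# (role, context, instruction, question) sent to a conversational model,
# with each step's output feeding the next step's input.

def call_llm(messages: list[dict]) -> str:
    # Placeholder: swap in a real client call to your provider of choice.
    return "(model output for: " + messages[-1]["content"][:40] + "...)"

def build_value_model(company_description: str) -> str:
    # Step 1: set a role and provide context.
    messages = [
        {"role": "system", "content": "You are a pricing analyst building a value model."},
        {"role": "user", "content": f"Background on the client:\n{company_description}"},
    ]
    # Step 2: ask for an intermediate, semi-structured artifact.
    messages.append({"role": "user",
                     "content": "List the client's main value drivers as bullet points."})
    drivers = call_llm(messages)

    # Step 3: feed that output back in and ask for the final "do something" step.
    messages.append({"role": "assistant", "content": drivers})
    messages.append({"role": "user",
                     "content": "Draft a value model that quantifies each of those drivers."})
    return call_llm(messages)

print(build_value_model("A mid-market SaaS company selling workforce-planning software."))
```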
But… and there’s a big but here. Programming languages have specifications. When I write a statement in any other programming language, it does a very specific thing. For example, the code “a = a + f(c)” in many languages means “assign to the variable named ‘a’ the result of adding the current value of ‘a’ to the return value of the function named ‘f’ applied to the value of the variable ‘c’”. Whenever you see that line in a program you know exactly what it will do; the only uncertainty you might have is what the function ‘f’ does with the value ‘c’ and whether or not it’s advisable to overwrite the value of ‘a’ with this new value. It is unambiguous in its meaning and consequence. And if I don’t understand a function written in that language, I can always read the code for ‘f’ with the language specification or documentation at the ready.
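For instance, in Python (and the same holds in most languages), that statement has exactly one meaning, and a quick sketch shows it behaving identically on every run:

```python
def f(c):
    # Whatever f computes, the assignment below has exactly one meaning:
    # evaluate f(c), add it to the current value of a, store the result in a.
    return c * 2

a = 10
c = 5
a = a + f(c)
print(a)  # always prints 20, on every run, on every machine
```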
But ChatGPT is different. I might be able to give it instructions, provide context, ask it to play a role, and then ask it a question (all in plain language), but I really don’t know what it will do with all of that. I can hope. I can predict. But I can’t know. You may have experienced this when interacting with it or with one of the image-generating systems. You can give it direction, sometimes very explicit direction, but what it produces is rarely exactly what you intended and often requires multiple rounds of disambiguation, testing, and exploration before you produce something acceptable.
And far too often you’re faced with asking a question and getting an answer that you know to be simply wrong. It can’t tell you how many ‘r’s there are in “strawberry”. It fails on simple reasoning tasks and makes elementary mistakes in both attribution and factual reasoning. It is very good at summarizing and repackaging information, but very often responds with general descriptions when asked about specifics. Most frustrating of all is when you provide a simple correction and it either incorporates it incorrectly or reinterprets the rest of its answer entirely. Examples of these dialogues are easy to find and verify.
In essence, treating GenAI as a programming language means that you are trying to program an unreliable system with a largely unknown instruction set and no guarantee that it won’t confidently respond with a result that has all the appearances of correctness without actually being correct. Even worse, the same inputs and the exact same context can result in significantly different outputs, in part because the systems are essentially probabilistic, but also because sometimes the models will actually change over time. And yet, entirely new business models are being built on this new “knowledge processing” infrastructure.
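A toy sketch makes the contrast with the assignment example above plain. The `call_llm` stub below just samples from a list with Python’s `random` module; that is obviously not how any real model works internally, and the candidate strings are invented, but it is the behaviour the caller actually observes: the same input, with no guarantee of the same output.

```python
import random

# Toy stand-in for a generative model: it samples an answer rather than
# computing a single determined one. The candidate strings are invented
# purely for illustration.
def call_llm(prompt: str) -> str:
    return random.choice([
        "Value-based pricing tied to measurable customer outcomes.",
        "Outcome-based pricing anchored to the value each segment receives.",
        "Pricing aligned with quantified customer value drivers.",
    ])

prompt = "Summarize our pricing strategy in one sentence."
first = call_llm(prompt)
second = call_llm(prompt)

# With a conventional function, first == second would be guaranteed.
# With a probabilistic "instruction set", it frequently is not.
print(first == second)  # often False
```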
Put another way, programming languages are like motor vehicles: some are cars, some are trucks, and some are even motorcycles. They can be fast or slow, have large towing capacity, or be designed specifically for limited purposes like delivering parcels or mail. But they all share certain characteristics. If you know how to drive, you can pretty much jump into any of them and make them move. You may need specialized skills to manage certain of them (e.g. the manual transmission of a big rig, or a motorcycle), but they essentially just do what you want them to do if you’ve got enough skill to drive them. Yes, they’re not always safe and, frankly, most of them are much less safe than people think they are, but at least we have good models of and experience with how they can go wrong, and reasonable intuitions about when they are dangerous and how they break down. They are simple, usable, well-understood tools.
ChatGPT is a mule. Yes, it can get you places that you can’t get with a car (although after a hundred years of car development, there’s probably a specialized vehicle that will do the job of a mule). And yes, a mule is actually smarter than a car, and that’s very easy to demonstrate. And yes, it’s definitely cheaper to rent or even buy a mule than to buy a vehicle and hire a driver to get you to a similar destination (e.g. I once rode a mule for 10 hours in a day in order to climb a mountain in Northern Africa. It was a good experience – with a guide – if a little painful). But mules are stubborn and need skilled handlers to get them to do what you want. They are really hard to train, in part because they are probably smarter than horses, which can do similar jobs but are much more fragile. They occasionally refuse to do even familiar tasks for their own reasons and need forcing. Managing a small group of mules is especially difficult since they all have their own distinct personalities, strengths, and weaknesses; only a skilled handler familiar with each animal can even hope to control a group of them. And yes, anyone can go to a petting zoo and feed and pet one (similar to using ChatGPT for a simple conversation or document summary), but actually getting one to do useful work is hard and needs training and often a lot of trial and error. The only way this metaphor doesn’t work is that mules are a lot easier on the environment than cars.
Now, none of this should be met with pure pessimism. These systems are clearly useful, and they are much more predictable and reliable at certain tasks than at others. They are very good at summarizing specific documents and collections of documents. They can often answer specific questions that have already been asked and answered elsewhere. If you depend on that, however, they must be considered unreliable and should always be asked for references; treat them as summarizing search engines so you can check their work – on that point, I think Perplexity is the right model. They are unreliable when faced with factually wrong or misleading sources and information – just as we are, perhaps even more so. In no case should they be used to replace people who have judgment and reasoning abilities, which they definitely do not yet have.
Maybe these systems will be more like cars someday, but remember that it took decades, and a lot of transformation of the society around them, before that was true even of cars. For now, it’s best to treat them like a mule: only use them for what they’re really good at, don’t trust them, and make sure you have a good guide.