The Bitter Lesson Was Right. Then the Sun Got Involved.

The bitter lesson was most true in 2019. It is slightly less true today. And it will be less true still in the future. Not because scaling stopped working. But because the assumptions underneath it are quietly falling apart.

Three monkeys crowded around a red typewriter, papers flying everywhere, looking bored and chaotic: an illustration of the infinite monkey theorem.
The reading committee is not optimistic.

A seven-year retrospective on Richard Sutton's most uncomfortable essay.

In March 2019, Richard Sutton published a short essay called The Bitter Lesson. It is about 1,125 words. It has probably influenced more research dollars than most academic papers ten times its length.

The argument is clean. Seventy years of AI history shows one repeating pattern: brute force always won. The lesson is bitter because it implies that the clever things we spent our careers building will be outrun by someone with a bigger cluster.

Seven years later, the conventional wisdom is that Sutton was proven right beyond his wildest expectations. Large language models. GPT-4. Claude. Gemini. The scaling hypothesis didn't just survive. It became the entire industry.

I want to argue the opposite. The bitter lesson was most true in 2019. It is slightly less true today. And it will be less true still in the future.

Not because scaling stopped working. But because the assumptions underneath it are quietly falling apart.

Now, let the word count start.


The Engine Nobody Mentioned

Sutton never says this explicitly, but the bitter lesson only works if compute keeps getting cheaper. Moore's Law is the silent engine of the whole argument. Without it, "just scale" is not a viable strategy.

For fifty years, Moore's law held. The cost per computation fell reliably enough that you could build an entire research philosophy around it.

But we are now pushing toward 2nm nodes and beyond. At this scale, quantum tunneling becomes a real problem. Heat dissipation becomes nearly unsolvable. The cost per transistor, which fell for decades, will likely rise. It is no longer guaranteed that each new generation will be cheaper than the last.

This is not speculation. Ask TSMC.


The Sun

A photon is born deep in the solar core through nuclear fusion. But before it begins its eight-minute journey to Earth, it can spend thousands of years bouncing around inside the sun, absorbed and re-emitted by plasma so dense that its path is essentially a random walk.

Now, compute THAT.

We cannot. Not only because we lack the compute. The energy required to perform that calculation exceeds the energy the photon itself carries. The computation costs more than the thing being computed. We would need to spend more than the sun has to track what the sun does.
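The energy claim can be made concrete with a rough back-of-envelope using Landauer's bound, the minimum energy an irreversible bit operation must dissipate at temperature T. The numbers below are my own illustrative choices (room temperature, a 500 nm visible photon), not figures from the essay:

```python
import math

# Landauer's bound: minimum energy to erase one bit at temperature T.
k_B = 1.380649e-23      # Boltzmann constant, J/K
T = 300.0               # room temperature, K
landauer_per_bit = k_B * T * math.log(2)   # ~2.9e-21 J

# Energy of one visible photon (500 nm, green light): E = h * c / wavelength.
h = 6.62607015e-34      # Planck constant, J*s
c = 2.99792458e8        # speed of light, m/s
photon_energy = h * c / 500e-9             # ~4.0e-19 J

# How many irreversible bit operations the photon's own energy could pay for.
bits = photon_energy / landauer_per_bit
print(f"Landauer bound per bit: {landauer_per_bit:.2e} J")
print(f"Photon energy (500 nm): {photon_energy:.2e} J")
print(f"Photon energy buys ~{bits:.0f} bit erasures")
```

On these assumptions a single photon's energy pays for only on the order of a hundred bit erasures, while simulating its millennia-long random walk would take astronomically more. The computation really does cost more than the thing being computed.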

This is the structure of reality. Some processes are their own fastest simulation. The universe does not have a shortcut for itself. If calculating all the paths of photons ever emitted from the sun is not possible, can we claim that scaling will solve everything?

Unlike entropy, scaling is a local phenomenon at most.


Hire William.

They say given enough monkeys at enough typewriters for enough time, one of them will produce Hamlet. The key is 'enough', or scaling.

But remember this: the number of monkeys required is cosmologically larger than anything that physically exists. Every atom in the observable universe, typing since the Big Bang, would almost certainly not get there.
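The gap is easy to check with crude arithmetic. The alphabet size, text length, and typing rate below are my own illustrative assumptions, deliberately generous to the monkeys:

```python
import math

# Probability that random typing produces Hamlet's text in one attempt:
# (1 / alphabet) ** hamlet_chars, handled in log10 to avoid underflow.
alphabet = 27                 # crude: 26 letters plus space
hamlet_chars = 130_000        # rough character count of Hamlet
log10_p = -hamlet_chars * math.log10(alphabet)

# Generously overcount the typists physics could ever supply.
atoms_in_universe = 1e80
seconds_since_big_bang = 4.4e17
keystrokes_per_second = 1e3   # absurdly fast monkeys
log10_attempts = math.log10(
    atoms_in_universe * seconds_since_big_bang * keystrokes_per_second
)

print(f"log10 P(one attempt succeeds) ~ {log10_p:.0f}")      # around -186,000
print(f"log10 of all attempts available ~ {log10_attempts:.0f}")  # around 101
```

One attempt succeeds with probability around 10^-186,000; the entire universe, repurposed as typists, supplies around 10^101 attempts. "Enough" is not a quantity that exists.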

But that's not all.

Even if a monkey types Hamlet, someone has to find it. We need a second army of atoms to read the output, page by incomprehensible page, and recognize when the play has appeared.

William wrote Hamlet on roughly 20 watts of compute. And in this experiment, that 20-watt machine cannot be escaped: it shows up either at the generation stage or at the evaluation stage.

So why not just hire William? The monkeys need typewriters, rooms, food, electricity, water, and a reading committee that never sleeps. William needs a quill. Brute force was never the powerful choice. It was always the expensive one dressed up as pragmatism.


Hire Vincent.

We have scaled image generation enormously. Billions of images. Models trained on everything ever painted. And we can produce something that resembles a sunflower on a screen.

But there is exactly one Starry Night. Exactly one Sunflowers. What makes them matter is inseparable from the fact that Vincent, in a specific state of mind, through a specific history of suffering and seeing, built them up layer by physical layer of paint. A billion generated sunflowers do not accumulate into that purpose or outcome.

Hire Vincent. Give him lots of paint.

But do not make the mistake of wishing he were Leonardo. The fantasy of general intelligence, the one ring, is closer to a procurement error.

Leonardo never finished the Mona Lisa, and not for lack of talent. He was busy acquiring new knowledge: anatomy, optics, hydrology, engineering. The richer his perception, the harder completion became. Leo was also an expensive man. He lived in a king's castle and, as the legend goes, died in a king's arms.

Why would we need millions of Leos if all we need is millions of paintings? Vincents would happily take the job, with no need for castles or kings.

The bitter lesson calls human knowledge a dead end. But it confuses transferable rules with human intelligence and creativity. One of them can be scaled around. The other keeps showing up at the evaluation desk, asking if you found Hamlet yet, forever.

Hire Vincent.


Efficiency Was Free. Now It Isn't.

Here is something Sutton wrote: "we should stop trying to find simple ways to think about the contents of minds."

Fair enough, but we chose LLMs precisely because language is the simplest possible interface. Language is how humans already think, store knowledge, and transmit culture. It is a simplification of the contents of minds, except we have nothing better.

And language is already an approximation. It is a lossy compression of reality, transmitted and then rebuilt at the receiving end. Even if an LLM is trained on every word ever written, it is still training on humanity's approximation of reality. Everything that was never written is permanently outside the training distribution.

We cannot sample our way to originality. It is closer to a mathematical limitation. Think of it like a Fourier transform. More samples give us a better approximation of the original signal, but not the frequencies that were never in the signal to begin with. Scaling does not introduce new information.
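The Fourier analogy can be demonstrated directly. The toy signal below is my own construction: it contains exactly two frequencies, 3 Hz and 7 Hz, and no matter how densely we sample it, the spectrum never gains a third:

```python
import numpy as np

def dominant_freqs(n_samples, duration=1.0):
    """Return the integer frequencies carrying significant energy."""
    t = np.linspace(0.0, duration, n_samples, endpoint=False)
    # A signal with exactly two frequency components: 3 Hz and 7 Hz.
    signal = np.sin(2 * np.pi * 3 * t) + 0.5 * np.sin(2 * np.pi * 7 * t)
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(n_samples, d=duration / n_samples)
    # Keep only bins holding more than 10% of the peak energy.
    return set(freqs[spectrum > 0.1 * spectrum.max()].round().astype(int))

# Doubling the sample count repeatedly: the same two frequencies, every time.
for n in (64, 256, 1024):
    print(n, dominant_freqs(n))   # always {3, 7}
```

More samples sharpen the estimate of what is already there; they cannot conjure a frequency the signal never contained. Scaling a sampler has the same shape.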

Vincent did not sample previous paintings more carefully. He introduced a new frequency.


The Voters

Data centers are now competing with small towns for electricity and water. Communities are already voting them down. Here is the question: do we power machines, or humans?

Without the data centers, scaling does not happen. The democratic ceiling may arrive before we hit the mathematical, physical, or thermodynamic ones.

Don't assume compute is going to be abundant forever.

Let the word count stop.


p.s. The bitter lesson was true during the free scaling era.

Efficiency did not matter when growth masked everything. It matters now. And when efficiency matters, the calculus shifts completely. Specialization becomes valuable. Human insight becomes worth investing in. Finding the right person and giving them a typewriter starts to look like the rational strategy, not the sentimental one.

The knowledge of all things is possible. Just not by brute force.

BTW, for the love of Leonardo, I'll be releasing a new song dedicated to him. Watch for it.