How much LLM training data is there, in the limit?

Recent large language models such as Llama 3 and GPT-4 are trained on gigantic amounts of text. Next generation models need 10x more. Will that be possible? To try to answer that, here’s an estimate of all the text that exists in the world.


Revised Chinchilla scaling laws – LLM compute and token requirements

There’s a nice blog post from last year called Go smol or go home. If you’re training a large language model (LLM), you need to choose a balance between training compute, model size, and training tokens. The Chinchilla scaling laws tell you how these trade off against each other, and the post was a nice guide to the implications.

A new paper shows that the original Chinchilla scaling laws (from Hoffmann et al.) have a mistake in the key parameters. So below I’ve recalculated some scaling curves based on the corrected formulas.
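
To make the trade-off concrete, here is a minimal sketch of splitting a compute budget between model size and training tokens. It does not use the fitted Chinchilla parameters, original or corrected. It assumes two widely quoted rules of thumb: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and a fixed ratio of roughly 20 tokens per parameter. The function name and the default ratio are illustrative assumptions, not values taken from the paper.

```python
import math


def compute_optimal_allocation(compute_flops, tokens_per_param=20.0):
    """Split a training compute budget into (parameters, tokens).

    Assumptions (rules of thumb, not fitted scaling-law parameters):
      * training compute C is approximately 6 * N * D FLOPs
      * the token count is a fixed multiple of the parameter count,
        D = tokens_per_param * N
    Solving C = 6 * N * (tokens_per_param * N) for N gives
      N = sqrt(C / (6 * tokens_per_param)).
    """
    if compute_flops <= 0 or tokens_per_param <= 0:
        raise ValueError("compute_flops and tokens_per_param must be positive")
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Hypothetical budgets, chosen only to show how both quantities scale.
    for budget in (1e21, 1e23, 1e25):
        n, d = compute_optimal_allocation(budget)
        print(f"C={budget:.0e} FLOPs -> N={n:.2e} params, D={d:.2e} tokens")
```

Under these assumptions both N and D grow with the square root of compute, so a 100x larger budget buys a 10x larger model trained on 10x more tokens. A different tokens-per-parameter ratio shifts the split but not that square-root behaviour.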


On the ordering of miracles

AI is starting to arrive. Those close to the action have known this for a while, but almost everyone has been surprised by the precise order in which things occurred. We now have remarkably capable AI artists and AI writers. They arrived out of a blue sky, displaying a flexibility and finesse that was firmly in the realm of science fiction even five years ago. Other grand challenge problems like protein folding and Go also fell at a speed that took experts by surprise. Meanwhile, seemingly simpler mechanical tasks like driving a car remain out of reach of our best systems, despite 15+ years of focused, well-funded efforts by top quality teams.

What gives? Why does AI seem to race ahead on some problems, while remaining stuck on others? And can we update our understanding, so that we’re less surprised in future?


How to win a fight with God

The breeze whispers of a transformation, a time of great trial and tribulation for mankind. We face an adversary whose complexity is almost unimaginable. Its vast computing power makes the human brain look like a toy. Worse still, it wields against us a powerful nanotechnology, crafting autonomous machines out of thin air. Already, more than one percent of the Earth is under its sway. I’m talking of course about the mighty Amazon rainforest.

But humans long ago learned to live with our ancient enemies, the plants and the animals. Perhaps we can take some lessons for how to navigate our new friends, the AIs, who may be arriving any day now.

The rainforest is not actively trying to kill us, at least not most of the time. It is locked into a fierce struggle with itself, deploying its vast resources in internal competition. As a side effect, it produces large amounts of oxygen, food and other ecosystem services that are of great benefit to humanity. But the rainforest doesn’t like us, doesn’t care about us, mostly doesn’t even notice us. It exists in a private hell of hyper-competition, honed to a sharp point by the lathe of selection turning for a billion years.

So here is one model for AI safety. Don’t hope for direct control. Don’t dream of singletons. Instead, design a game that locks the AIs in a competition we don’t care about, orthogonal to the human world, perhaps with a few ecosystem services thrown our way as a side-effect. We will only collect scraps from the table of the Gods. But while the Gods are busy with their own games, we can get on with ours.

I think of the Irish proverb: What’s as big as half the Moon? Answer: The other half. Good advice for fighting with God.

Written with assistance from ChatGPT :-)

Google+ Archive

Google is shutting down Google+ in the next few days. I’m archiving my G+ stream here for posterity.

Over 7 years I posted exactly 1,001 times to Google+. I know it was never an especially popular social network, but somewhat to my surprise I found that I enjoyed it a lot. There was a strong set of people on G+ interested in deep learning, robotics and related topics, at least for a period of several years. The unlimited post length meant you could have meaningful conversations in a way you couldn’t on Twitter. For me Google+ was a thoughtful place, with high quality people, interesting content and meaningful discussion. The fact that it was a small, ignored community mostly interested in technical topics provided the conditions for that.

I have now reluctantly moved to Twitter, and I also have this blog for occasional long form content. Twitter is (alas) not much of a substitute for G+. No matter how carefully I curate who I follow, my Twitter stream is invariably full of political anger and culture wars. I am as susceptible to this as anyone else, and a Twitter session is a pretty sure way to make me feel angry and unhappy. I very much wish there was a way to turn down the emotion in my Twitter feed. Unfortunately I am yet to find that setting. Nevertheless, I do still get some useful technical and professional news from Twitter, so I will likely proceed with it and accept the unhappiness tax that it imposes.

So RIP G+, you were not much loved by most, but I will miss you.


Two and a half years ago I left Google and set out to build a new kind of search engine. This may sound a little crazy, but all the best things are like that :-)

We’ve been avoiding the tech press and trying to build things quietly, but this week we launched our user-facing app. I’m really proud of what the team has built, so it’s exciting to finally be able to say a bit more about it.

The problem we’ve been working on is finding specific items locally. For example, a light bulb just broke and it has a strange fitting: where’s the nearest place you can get a new one? Or you’re halfway through a recipe and realise you’re missing an ingredient – where do you get it?



Imagine you’re a road engineer and you’re designing an access road for a new town. The town will soon be built in a previously uninhabited area. You’re managing the construction project, but unfortunately no one can tell you what the population of the town will be.

Taking your job seriously, you sit down to design the best road that you can build. You settle on constructing a seven lane highway with regular flyovers to minimize traffic. The road will be fully lit with a state-of-the-art LED lighting system. You add crash barriers and regularly spaced emergency telephones. After much consideration you decide to also include a rest area with parking and toilets. This involves designing a self-contained water and sewerage system, but it’s obviously worth it.

With three months to go until launch day, you discover problems with road drainage. After the panic subsides, the construction team agrees to work around the clock to refit a completely new system for surface water management. By a minor miracle, the work is completed on time.

Opening day finally arrives and the excitement is intense. Everyone agrees the finished product is an engineering marvel. The new town will have the best road in the world.

Unfortunately, it turns out that the town is a remote settlement with a population of 57. The road is mainly used by an old man and a donkey.

The next year, you are again given a road construction project for another new town. Having learned your lesson, you build a modest single lane road. It’s well constructed but nothing special.

Opening day comes again, and it’s revealed that this time the “town” is in fact a major city with a population of 14 million. There are 50 mile tailbacks for six years before a larger road can be built. Your face appears on wanted posters throughout the nation, and you flee the country in disgrace.

Twitter, I forgive you the Fail Whale. And I hope to always walk the middle *ahem* road.

Epiphenomenalism for Computer Scientists

It’s hard to work on robotics or machine learning and not occasionally think about consciousness. However, it’s quite easy not to think about it properly! I recently concluded that everything I used to believe on this subject is wrong. So I wanted to write a quick post explaining why.

For a long time, I subscribed to a view on consciousness called “epiphenomenalism”. It just seemed obvious, even necessary. I suspect a lot of computer scientists may share this view. However, I recently had a chance to think a bit more carefully about it, and came upon problems which I now see as fatal. Below I explain briefly what epiphenomenalism is, why it is so appealing to computer scientists, and what convinced me it cannot be right. Everything here is old news in philosophy, but might be interesting for someone coming to the issue from a computer science perspective.

Building a DIY Street View Car

A little blast from the past here. Several years ago I built something very like a Google Street View car to gather data for my PhD thesis. At the time I wrote up a blog post about the experience, as a guide for anyone else who might want to build such a thing. But I never quite finished it. Upgrading WordPress today, I came across this old post sitting in my drafts folder from years ago, and decided to rescue it. So here it is. The making of a DIY StreetView car.