Recent large language models such as Llama 3 and GPT-4 are trained on gigantic amounts of text. Next generation models need 10x more. Will that be possible? To try to answer that, here’s an estimate of all the text that exists in the world.
Firstly, here’s the size of some recent LLM training sets, with human language acquisition for scale:
| | Training Set (Words) | Training Set (Tokens) | Relative size (Llama 3 = 1) |
| --- | --- | --- | --- |
| **Recent LLMs** | | | |
| Llama 3 | 11 trillion | 15T | 1 |
| GPT-4 | 5 trillion | 6.5T | 0.5 |
| **Humans** | | | |
| Human, age 5 | 30 million | 40 million | 10⁻⁶ |
| Human, age 20 | 150 million | 200 million | 10⁻⁵ |
And here’s my best estimate of how much useful text exists in the world:
| | Words | Tokens | Relative size (Llama 3 = 1) |
| --- | --- | --- | --- |
| **Web Data** | | | |
| FineWeb | 11 trillion | 15T | 1 |
| Non-English Common Crawl data (high quality) | 13.5 trillion | 18T | 1 |
| All high-quality web text | 45 – 120 trillion? | 60 – 160T? | 4 – 11? |
| **Code** | | | |
| Public code | – | 0.78T | 0.05 |
| Private code | – | 20T | 1.3 |
| **Academic publications and patents** | | | |
| Academic articles | 800 billion | 1T | 0.07 |
| Patents | 150 billion | 0.2T | 0.01 |
| **Books** | | | |
| Google Books | 3.6 trillion | 4.8T | 0.3 |
| Anna’s Archive (books) | 2.8 trillion | 3.9T | 0.25 |
| Every unique book | 16 trillion | 21T | 1.4 |
| **Court documents** | | | |
| US federal court documents | 2 trillion | 2.7T | 0.2 |
| **Social media** | | | |
| Twitter / X | 8 trillion | 11T | 0.7 |
| Weibo | 29 trillion | 38T | 2.5 |
| Facebook | 105 trillion | 140T | 10 |
| **Publicly available audio (transcribed)** | | | |
| YouTube | 5.2 trillion | 7T | 0.5 |
| TikTok | 3.7 trillion | 4.9T | 0.3 |
| All podcasts | 560 billion | 0.75T | 0.05 |
| Television archives | 50 billion | 0.07T | 10⁻³ |
| Radio archives | 500 billion | 0.6T | 0.04 |
| **Private data** | | | |
| All stored instant messages | 500 trillion | 650T | 45 |
| All stored email | 900 trillion | 1,200T | 80 |
| **Total human communication** | | | |
| All human communication (daily) | 115 trillion | 150T | 10 |
| All human communication (since 1800) | 3 million trillion | 4,000,000T | ~10⁵ |
| All human communication (all time) | 6 million trillion | 8,000,000T | ~10⁵ |
So is training data running out?
At 15 trillion tokens, current LLM training sets seem within an order of magnitude of using all high-quality public text. For English, you could maybe get to somewhere in the 40 – 90T range using more web crawl and some harder to reach sources. Including non-English data might possibly get you to the 100 – 200T range. That seems like the upper limit.
Private data is much larger. Facebook posts alone likely come to 140T, Google has around 400T tokens in Gmail, and with all text everywhere you could maybe reach 2,000 trillion tokens. This data seems clearly off limits to responsible private actors [1], though it’s worth keeping in mind that it exists. It’s potentially an option open to intelligence agencies or nefarious actors [2].
Model data requirements have historically gone up 10x per generation. If we assume that (A) commercial model makers won’t train on private data and (B) more data is required for performance gains, then future models must rely heavily on synthetic data. GPT-5 might get away with mostly just scaling up data collection, but at the GPT-6 level, synthetic data or some other new idea is required. This conclusion will shock nobody who’s been paying attention, but I still found it useful to work through exactly where the limits of the current approach are.
Notes and Sources
Tokens vs Words
The ratio of tokens per word depends on the tokenizer and the language. I have assumed 0.75 words per token, which is about right for English text using OpenAI tiktoken. This means the counts in the table will be slightly off for non-English sources, and for models like Llama that use other tokenizers. We’re interested in orders of magnitude here though, so the differences aren’t significant.
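For concreteness, here’s the conversion as a couple of lines of Python (a minimal sketch; the 0.75 ratio is the assumption above, not a property of any particular tokenizer):

```python
# Rough word <-> token conversion used throughout this post.
# Assumes ~0.75 words per token (about right for English with OpenAI tiktoken).
WORDS_PER_TOKEN = 0.75

def tokens_to_words(tokens: float) -> float:
    return tokens * WORDS_PER_TOKEN

def words_to_tokens(words: float) -> float:
    return words / WORDS_PER_TOKEN

print(f"{tokens_to_words(15e12) / 1e12:.1f}T words")    # Llama 3: 15T tokens ≈ 11T words
print(f"{words_to_tokens(3.6e12) / 1e12:.1f}T tokens")  # Google Books: 3.6T words ≈ 4.8T tokens
```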
Recent LLM training sets
Llama 3
Data from the official blog post. A 15T token training set is used for both the 8B and 70B models. For the conversion to words, I have assumed 0.75 words per token, which will be slightly off since Llama uses a different tokenizer than OpenAI.
GPT-4
Estimate from EpochAI. Quoting them: “Speculative. Reported secondhand by online sources such as Semianalysis, but not verified by OpenAI. If total number of tokens seen was 13T, text was repeated for 2 epochs, and text was the majority of tokens, then dataset size roughly is 13T*0.75/2 = 4.9T words.” I have rounded to 5T.
Google Translate n-gram model (2007)
For historical interest, I note that Google trained an n-gram model on 2 trillion tokens of general web data almost twenty years ago. They continued to use this model until at least 2013, with the same sized training set [3]. So modern LLM training sets aren’t unprecedentedly large.
Human language acquisition
Human, age 5
The total number of words heard in early childhood is pretty well studied in the literature. Good estimates come in at 25 – 50 million words by age 5.
Human, age 20
My estimate. People hear about 5 million words a year, assuming the childhood data linked above remains valid into later years. Adding to that, an avid reader who gets through 1 book a week could read maybe 5 million more words per year. So that gives about 100 – 200 million words by age 20.
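Here’s that arithmetic as a quick sketch (the per-year rates are the assumptions above; the reading figure only applies to avid readers):

```python
# Words a person has been exposed to by age 20 (rough sketch).
heard_per_year = 5e6    # assumes the childhood rate holds into later years
read_per_year  = 5e6    # an avid reader: ~1 book a week at ~100k words each
years          = 20

heard = heard_per_year * years            # ~100 million
total = heard + read_per_year * years     # ~200 million for a keen reader
print(f"{heard / 1e6:.0f}M heard, up to {total / 1e6:.0f}M total by age 20")
```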
Web text
Common Crawl
Common Crawl is the basis for many large models, as it’s a convenient source of massive-scale web data. It’s sometimes referred to as being “the whole web”. This isn’t true, but it does cover a substantial fraction of all public HTML content. It misses dynamically rendered websites, PDF content, anything behind a login, etc. Google certainly has something much more comprehensive internally, and OpenAI and Anthropic also run their own custom crawls.
In terms of tokens, raw Common Crawl is at least 100 trillion tokens. I haven’t worked out a more precise number, because it’s fairly meaningless for our purposes. The raw data is full of junk and duplication, so any practical use for LLM training will start with heavy filtering.
FineWeb
FineWeb comes in at 15 trillion tokens. It’s a filtered English subset of Common Crawl dumps since 2013, and serves as a reasonable proxy for all useful English web text from Common Crawl.
One important detail is that FineWeb is (intentionally) not fully deduplicated. The FineWeb team empirically found that full deduplication results in lower model performance, so they don’t do it. Effectively the dataset bakes in multiple epochs, but in a data-dependent way, where more popular content is seen more often.
Full deduplication would reduce the size by about 75%, leaving 3T to 4T tokens (depending on the exact deduplication method used). I’m going to quote the headline 15T number, since this is what people train on, but bear in mind that the 3 – 4T unique token count is more comparable to the other numbers in the rest of this document.
Non-English Common Crawl
FineWeb is entirely English, while Common Crawl is only about 45% English. So there should be roughly 18T tokens of comparable quality available in other languages.
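As a back-of-the-envelope sketch (this assumes other languages yield usable tokens at the same rate as English, which is my assumption rather than something FineWeb reports):

```python
# Scale FineWeb's English tokens to the non-English share of Common Crawl.
fineweb_english_tokens = 15e12
english_share = 0.45

non_english_tokens = fineweb_english_tokens * (1 - english_share) / english_share
print(f"{non_english_tokens / 1e12:.0f}T comparable non-English tokens")  # ~18T
```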
All high-quality web text
[When I first wrote this post, I assumed Common Crawl contained most HTML-based web content that was of value for LLM training. Several people contacted me after posting to let me know that wasn’t right. So I have added the below correction].
Even if we limit our focus to static HTML pages, Common Crawl is not the entire web. This isn’t too surprising given that the total funding for Common Crawl is only about $4m over the last 15 years. Google and Bing certainly have more comprehensive indexes, and others such as Meta, OpenAI or Anthropic could produce equivalents given time. How much useful HTML web content exists if you gather everything?
I don’t have a good number for this. Based on some extremely crude reasoning (here), my guess would be 30-75T additional English tokens, and 65-160T across all languages. This is a big and important number to know, so if you have a better estimate please contact me.
Code
Publicly accessible code
The Stack v2 is a dataset covering most publicly accessible code. It contains 775 billion tokens in the largest variant, though if you want common languages and stricter near-duplicate removal it’s around half that. It’s based on the Software Heritage archive which includes all publicly accessible code from all the major code hosting platforms, regardless of licence.
All code
Estimates of the total amount of code ever written tend to land between 1 trillion and ~3 trillion lines. A quick estimate [4] suggests about 10 tokens per line, so that gives about 10 – 30T tokens of code in total. At first glance this seems implausibly large, given FineWeb is about the same size. We have almost 0.8T tokens of public code though, so ~20x that as an overall total seems reasonable. Most of it is private, and a substantial amount has probably been lost forever, so it’s of limited relevance for most LLM purposes. Still, it’s useful to have the number as an upper bound.
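The lines-to-tokens conversion, as a sketch (the ~10 tokens per line figure comes from footnote [4]):

```python
# All code ever written, converted from lines to LLM tokens (rough sketch).
total_lines_low, total_lines_high = 1e12, 3e12   # estimates of all code ever written
tokens_per_line = 10                             # ~39 chars/line at ~4 chars/token

low  = total_lines_low * tokens_per_line
high = total_lines_high * tokens_per_line
print(f"{low / 1e12:.0f}T to {high / 1e12:.0f}T tokens of code in total")  # 10T to 30T
```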
Academic publications and patents
Academic articles
A 2014 estimate concluded there were 114 million English-language academic papers on the web. English papers are a large majority in most academic fields; estimates suggest 75 – 90%. So the total publication count was probably around 140m in 2014. The rate of publishing has increased a lot over time, running at about 4 – 5 million per year over the last decade. That gives about 180m total publications as of 2024, of which roughly 75% are in English. The mean length of a publication is about 4,500 words. So the total comes to 800 billion words or 1T tokens. Since papers are almost all PDFs, a significant fraction would require OCR to extract. Many are also paywalled, though 100m are accessible via shadow libraries such as Anna’s Archive.
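The same chain of reasoning as a sketch (the 80% English share and 4.5m papers per year are my round-number choices within the ranges above):

```python
# Academic articles: publication count and total words (rough sketch).
english_papers_2014 = 114e6
english_share       = 0.80
papers_2014         = english_papers_2014 / english_share   # ~140M
papers_2024         = papers_2014 + 4.5e6 * 10              # ~185M

words_per_paper = 4_500
total_words  = papers_2024 * words_per_paper    # ~840B; the post rounds to ~800B
total_tokens = total_words / 0.75               # ~1.1T
print(f"{total_words / 1e9:.0f}B words, {total_tokens / 1e12:.1f}T tokens")
```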
Patents
Google Patents includes 13m grants (dating back to 1790) and 7.7m applications (since 2001). Modern patent applications are about 12,000 words on average. There is some overlap between applications and grants, and size has gone up over time. Overall this comes to maybe 150 billion words or 200 billion tokens.
Books
Google Books
An official 2019 blog post says they have 40 million scanned books. They are not scanning much anymore, so that’s probably still current. I have assumed 90k words per book based on this, which gives 3.6 trillion words or 4.8 trillion tokens [5].
Library of Congress
I’m including this as a scale reference. Most of it is contained within Google Books, so it’s not a separate source. The library has 39 million print books and 167 million items total. Using the print books number, and the same word count assumption as above, we get 3.5 trillion words or 4.7T tokens.
Shadow libraries
The largest shadow libraries such as Anna’s Archive contain 31 million books, which would be 2.8T words or 3.9T tokens. I don’t know to what extent this overlaps with Google Books, but it must be much less than 100%. Google’s scanning effort of historical print books was unique, and as far as I know was never scraped, whereas Anna’s Archive mostly comprises more recent e-books.
All books everywhere
I have used the estimate from Google that there were 130m distinct titles in the world as of 2010, and the estimate here that mainstream publishing adds around 0.5 – 1 million titles per year, but up to 4 million titles if you include self-published works. I’ve included the latter, which means about 180m books as of 2024. Assuming 90k words per book gives about 16 trillion words, or 21T tokens.
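As a sketch of that arithmetic (4m new titles a year is the upper, self-publishing-inclusive figure):

```python
# Every unique book ever published (rough sketch).
titles_2024    = 130e6 + 4e6 * 14    # ~186M; the post rounds to ~180M
words_per_book = 90_000

total_words = titles_2024 * words_per_book
print(f"{total_words / 1e12:.0f}T words, {total_words / 0.75 / 1e12:.0f}T tokens")
# ~17T words / ~22T tokens; with the rounded 180M titles this is the 16T / 21T quoted above
```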
Court Documents
US Federal court documents
In the US, federal court documents are accessible electronically via a system called Pacer. Pacer had over 1 billion documents as of 2014, with an average length of 9.1 pages. Based on a quick manual sample, around 200 words per page seems typical [6]. Allowing for growth since 2014, that’s over 2T words or 2.7T tokens. There is a lot of boilerplate, but even adjusting for that, this is a substantial text source. Unfortunately Pacer charges 10 cents per page for access, so downloading the entire corpus would cost around $1B [7]. However, it is publicly accessible to someone with deep pockets.
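Here’s the Pacer arithmetic as a sketch (the 1.1 billion document count is my own guess for modest growth since the 2014 figure):

```python
# US federal court documents on Pacer (rough sketch).
documents      = 1.1e9     # >1B as of 2014, allowing a little growth since
pages_per_doc  = 9.1
words_per_page = 200       # from a quick manual sample

total_words = documents * pages_per_doc * words_per_page
print(f"{total_words / 1e12:.1f}T words, {total_words / 0.75 / 1e12:.1f}T tokens")  # ~2.0T / ~2.7T

cost = documents * pages_per_doc * 0.10    # Pacer charges $0.10 per page
print(f"${cost / 1e9:.1f}B to download the whole corpus")
```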
A few tens of millions of documents (with transcribed full text) are available via the RECAP archive which is mirrored by Archive.org, but this is only about 1% of what’s on Pacer. So maybe ~30B tokens.
Worldwide court documents
Figuring out court document access policies worldwide is too big a project to take on here. I do know that most countries don’t make as much available as the US does through Pacer. For the UK and Ireland, about 1 million documents are available here. However, this is mostly limited to judgements; unlike the US, other court documents are not generally available. For Germany, Open Legal Data has 50B tokens, which represents all publicly available decisions. I haven’t checked other countries. Overall my impression is that much of this kind of data is in web crawl, and probably runs to a few hundred billion tokens at most.
The full set of global legal data is much larger, likely tens of trillions of tokens. For example, Open Legal Data estimates that the 50B tokens they have for Germany are only 1% of total decisions, implying 5T tokens total, albeit with a lot of repetition and boilerplate. However, accessing this data is clearly difficult.
Social media
Twitter / X
Twitter reported an average volume of 500 million tweets a day in 2014. As far as I can see no more recent number is available, so let’s go with this as a current rate [8]. About 25% of tweets are estimated to come from bots, so removing those leaves 375m tweets per day. Twitter released data in 2018 showing the average length of a tweet was 33 characters, which is about 6 words [9]. That gives 0.8 trillion words per year. Twitter is 20 years old, and has had similar daily volume for at least 10 years. So let’s be conservative and count just the last 10 years: we get a total of 8 trillion words or 11T tokens.
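And the Twitter arithmetic as a sketch:

```python
# Twitter / X total words (rough sketch).
tweets_per_day  = 500e6
bot_fraction    = 0.25
words_per_tweet = 6           # ~33 characters per tweet (2018 figure)
years           = 10          # conservatively count only the last decade

human_tweets   = tweets_per_day * (1 - bot_fraction)      # 375M/day
words_per_year = human_tweets * words_per_tweet * 365     # ~0.8T
total_words    = words_per_year * years
print(f"{total_words / 1e12:.0f}T words, {total_words / 0.75 / 1e12:.0f}T tokens")  # ~8T / ~11T
```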
Twitter (and most other dynamically rendered websites) is absent from Common Crawl, so this data is additional to what you’d find in FineWeb. Of course, Twitter is ‘weird text’ and only a fraction of this will be high quality for LLM training purposes. As a user, I find it high signal-to-noise once you get used to it, so a reasonable fraction of these tokens could be useful in principle.
Weibo
Weibo’s metrics are very similar to Twitter’s. The platforms are about the same age, and the DAU counts are also similar. This study found Weibo users post 1.2 times per day on average, or 0.85 excluding reposts, which is similar to the 1.5 we found for Twitter. One difference is that the average post length is 55 chars, and given that Chinese is much denser per character, this is about 38 words per post, almost 6x what we found for Twitter. Scaling on this, I get 29 trillion words or 38T tokens for Weibo. Pretty substantial!
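The scaling from the Twitter estimate, as a sketch (this assumes the user bases really are comparable in size, per the discussion above):

```python
# Weibo, scaled from the Twitter estimate (rough sketch).
twitter_words = 8e12
posts_ratio   = 0.85 / 1.5    # Weibo posts/user/day vs Twitter
length_ratio  = 38 / 6        # words per post, Weibo vs Twitter

weibo_words = twitter_words * posts_ratio * length_ratio
print(f"{weibo_words / 1e12:.0f}T words, {weibo_words / 0.75 / 1e12:.0f}T tokens")  # ~29T / ~38T
```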
Meta (Facebook and Instagram)
Facebook had 2 trillion searchable posts as of 2015, per an official blog post. At that point, Facebook was just over 10 years old and had 1.5B users. Average user count over the 2004-2014 period was about 600m, implying 0.9 posts per user per day. This seems high to me, but Facebook was at the height of its popularity in this period.
At the time of writing, Facebook has about 3B users, and average user numbers over the 2015 – 2024 period were about 1.5b. If we assume the same rate of posts per user, we get a little over 6 trillion posts at the time of writing. Anecdotally, public posting activity on Facebook has slowed down a lot in recent years, so this may be an overestimate. The total must be somewhere in the 2 – 6 trillion range though, which is a pretty narrow band for our purposes. I’m not sure if this number includes replies, or just primary posts. Given the initial data comes from a search index, I’d guess that replies are not indexed. Instagram is not counted either, though it’s more text-light. Both of these factors would push the total higher, potentially much higher. However, let’s stick with 6 trillion as a reasonable estimate based on the data available.
The average Facebook post is about 17.5 words. So that gives 105 trillion words, or 140T tokens. Quality and privacy considerations mean that this data might not be that useful for LLM training. For Meta’s recent Llama 3 model, it was reported that “No Meta user data was used, despite Zuckerberg boasting that it’s a larger corpus than the entirety of Common Crawl”. That lines up with our estimate.
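Putting the Facebook chain together as a sketch (the 1.5B average user count for 2015 – 2024 is the assumption above):

```python
# Facebook posts and words (rough sketch).
posts_2015          = 2e12                 # searchable posts as of 2015
avg_users_2004_2014 = 600e6
posts_per_user_day  = posts_2015 / (avg_users_2004_2014 * 10 * 365)   # ~0.9

avg_users_2015_2024 = 1.5e9
posts_now = posts_2015 + avg_users_2015_2024 * posts_per_user_day * 9.5 * 365
print(f"{posts_now / 1e12:.1f}T posts")    # a little over 6T

words_per_post = 17.5
total_words = 6e12 * words_per_post        # using the rounded 6T figure
print(f"{total_words / 1e12:.0f}T words, {total_words / 0.75 / 1e12:.0f}T tokens")  # ~105T / ~140T
```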
Finally, if Instagram Reels is similar in scale to TikTok (see below), the transcribed audio might add another 5T tokens.
Publicly available audio
YouTube
A recent academic project made a careful estimate and came up with 14.7 billion public YouTube videos as of 2024 (research paper here, some colour here).
Next we need to know how many of those videos have speech. From the paper: “96.81% of videos had some audio. Among these, 40.56% were judged to be entirely or almost entirely music. 53.87% had spoken language in the first minute. A human talking to the camera was seen in 18.32% of videos and 9.13% of videos include footage of a public or semi-public event”. So 54% have some spoken language, and for 18% someone talking is the main focus of the video. Let’s say that 25% of YouTube videos could provide useful tokens for LLM training.
Videos have a mean duration of 615 seconds (though the median is 60 seconds, so a few very long videos skew the mean a lot). Assuming 140 words per minute, this gives 5.2 trillion words.
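The YouTube arithmetic as a sketch:

```python
# YouTube transcribed speech (rough sketch).
videos           = 14.7e9
useful_fraction  = 0.25       # videos assumed to contain useful speech
mean_duration_s  = 615
words_per_minute = 140

total_words = videos * useful_fraction * (mean_duration_s / 60) * words_per_minute
print(f"{total_words / 1e12:.1f}T words, {total_words / 0.75 / 1e12:.0f}T tokens")  # ~5.2-5.3T / ~7T
```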
TikTok
TikTok discloses numbers in their transparency report. If I am reading it correctly, a whopping 163 billion TikTok videos were posted between July 2020 and Dec 2023, with a current annual run rate of 70 billion per year. TikTok has 1b users, and the top 25% post 98% of the content. So this works out at 0.76 posts per day from this group. That’s higher than I would have guessed, but plausible. If we do a little extrapolation, total videos posted to TikTok since inception is probably around 250 billion.
TikTok videos are short, around 35-40 seconds on average. I don’t know what percentage contains speech, but let’s say 18%, which is the lower estimate for YouTube. At 140 words per minute, that gives 3.7 trillion words or 4.9T tokens, which is similar to YouTube. How much of it is “high quality” from an LLM training perspective is hard to say, but TikTok has all sorts of content, so presumably at least some of it is valuable.
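Same method as for YouTube, as a sketch:

```python
# TikTok transcribed speech (rough sketch).
videos           = 250e9
speech_fraction  = 0.18       # the lower YouTube speech estimate
mean_duration_s  = 35
words_per_minute = 140

total_words = videos * speech_fraction * (mean_duration_s / 60) * words_per_minute
print(f"{total_words / 1e12:.1f}T words, {total_words / 0.75 / 1e12:.1f}T tokens")  # ~3.7T / ~4.9T
```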
Podcasts
I estimate 100 million episodes exist, based on this podcast stats tracker which gives 94 million episodes on Apple. (Alternative websites list up to 150m episodes, but the source seems less reliable.) I have assumed an average length of 40 minutes, based on this estimate. Assuming 140 words per minute, that gives 560 billion words.
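As a one-line sketch:

```python
# Podcasts (rough sketch): episodes x minutes x words per minute.
total_words = 100e6 * 40 * 140
print(f"{total_words / 1e9:.0f}B words, {total_words / 0.75 / 1e12:.2f}T tokens")  # 560B / 0.75T
```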
Television
It’s surprisingly hard to find good numbers on the total amount of unique TV content in existence. However, it seems clear that as a source of words, TV is small. Given the high production costs of an hour of TV relative to YouTube, that isn’t too surprising.
Nielsen lists 1.1 million unique video titles available to U.S. audiences across all platforms as of October 2023. This includes lots of back catalog on streaming services. That works out at around 4 – 5 billion words, which is a rounding error compared to other sources we’ve considered.
The UK has good statistics due to government regulation, and they list 27,000 hours of domestic content produced a year, which is 150 – 200 million words. Even if we allow 50 years of archives (which is unlikely, presumably hours produced were much lower 50 years ago), this only comes to 10 billion words for all of UK TV history.
Total world archives seem likely to be a few hundred billion words at most. The amount you could access for LLM training might only be in the tens of billions. I’m going to guess 50 billion with high uncertainty, but it’s such a small source that precision here doesn’t matter.
Radio
The total number of radio stations worldwide is estimated to be between 44,000 and 66,000. A lot of the content is of course music, with maybe around 15% spoken word [10]. If talk stations produce 10 hours a day of unique content, that’s around 160 billion words per year worldwide.
If 10 years’ worth of archival recordings exist, that’s 1.6T words. If you could access 50 years of archives, and station counts were as high in the past (both of which seem dubious), there’s potentially as much as 8T words here. Sadly, I suspect no such archive exists. The biggest archives I can find listed are the BBC / British Library archive, which has a few million items, and the Internet Archive radio section, which has about half a million items. That would represent no more than a few billion words. If an internet radio service like TuneIn kept archives, it could be fairly large, but they don’t seem to. All told, my guess is that there might be only a few hundred billion words accessible via a mishmash of small archives. I’m going to go with 500 billion.
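The archive scenarios, as a sketch built on the ~160B words per year figure above:

```python
# Radio archives (rough sketch).
words_per_year = 160e9

print(f"{10 * words_per_year / 1e12:.1f}T words if 10 years of archives exist")   # 1.6T
print(f"{50 * words_per_year / 1e12:.0f}T words if 50 years somehow existed")     # 8T
```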
Private data
All instant messages
Meta reported in 2023 that it handles over 140 billion instant messages per day across all apps (WhatsApp, FB Messenger, Instagram). Less reliable WeChat estimates come to 45 billion per day, but the number seems reasonable given the relative size of the user base. Apple iMessage numbers are harder to find, but ok-ish ones put it at around 8 billion per day. Snapchat seems to be 4 – 5 billion per day [11]. I didn’t try to find numbers for Telegram, QQ, etc, because they’re unlikely to change the order of magnitude much. Global SMS volume estimates suggest 23 billion messages per day, though I can’t find a reliable source for that, and if my own experience is any guide it’s mostly automated messages or marketing, so I won’t include it.
Overall this comes to about 200 billion messages a day. None of these sources specify if this is messages sent or messages received. Messages received would be inflated by one-to-many sending in group chats. Meta has about 3.5 billion users, so 140b daily messages implies 40 sends per person per day. That seems too high to me, so I’m going to assume it’s messages received. Sends might be 50% of that, which would mean 100 billion messages sent per day.
Based on this study, the average WhatsApp message is 5.85 words. So overall we get about 0.6 trillion words or 0.8T tokens per day.
The total stock of stored messages is harder to estimate. Some are ephemeral and not backed up, but this is not the default on the most popular platforms. WhatsApp messages are backed up centrally by default (unencrypted, until quite recently). Meta’s other apps store messenger histories centrally at account level. I assume WeChat is the same. So let’s say 75% of daily volume is stored as backups or logs somewhere.
The messenger apps have been fairly mature in terms of user numbers for the last few years. So let’s say about 3 years of stored history exists, across some set of company servers.
So all stored chat logs contain just under 500 trillion words or 650 trillion tokens.
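The full chain, as a sketch:

```python
# Stored instant messages (rough sketch).
messages_received_day = 200e9
sent_fraction         = 0.5      # group chats inflate the received count
words_per_message     = 5.85
stored_fraction       = 0.75     # share backed up or logged centrally
years_of_history      = 3

words_per_day = messages_received_day * sent_fraction * words_per_message    # ~0.6T
stored_words  = words_per_day * stored_fraction * years_of_history * 365
print(f"{words_per_day / 1e12:.1f}T words/day")
print(f"{stored_words / 1e12:.0f}T words stored, {stored_words / 0.75 / 1e12:.0f}T tokens")
# ~480T / ~640T, which the table rounds to 500T words / 650T tokens
```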
All email
Email is very big. Precise numbers are hard to come by, but global email volume estimates tend to come in around 350 billion messages [12] per day. Of this, between 50% and 85% is spam. If we take the higher spam estimate, that leaves us with 45 billion non-spam messages per day. I believe this figure counts emails with multiple recipients multiple times. So let’s take away another 75% to account for that, leaving 11 billion unique emails per day. That implies that an average internet user writes 2 emails per day, which seems plausible. If we assume an average of 50 words per email [13], we get 0.5 trillion words per day in email.
Half a trillion words per day is roughly comparable to the totals for instant messages and phone calls (see below). However, email is distinctive in that people tend to keep much longer histories. It’s pretty common for users to have years or even decades’ worth of email, and storage is mostly centralized and unencrypted. Popular services like Gmail are 20 years old, and I personally have some emails from over two decades ago.
Let’s guess that the average user has 5 years of email. That works out at 900 trillion words. Google and Microsoft have a sizable fraction of this stored in one place, though they probably couldn’t ever make use of it for privacy reasons.
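And the email chain, as a sketch:

```python
# Stored email (rough sketch).
non_spam_per_day = 45e9                       # ~350B/day with the higher spam estimate removed
unique_per_day   = non_spam_per_day * 0.25    # discount multi-recipient copies: ~11B
words_per_day    = unique_per_day * 50        # ~0.55T; rounded to 0.5T above
years_of_history = 5

stored_words = 0.5e12 * years_of_history * 365    # using the rounded 0.5T/day figure
print(f"{stored_words / 1e12:.0f}T words, {stored_words / 0.75 / 1e12:.0f}T tokens")  # ~900T / ~1,200T
```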
All phone calls
Estimates suggest 13.5 billion phone calls per day, with an average length of about 1.8 minutes. The sources I’ve found on this aren’t great (here), but they seem about the right order of magnitude and align with other metrics on total talk time, so it’s close enough for our purposes.
If we allow for some hold time and spam, let’s say 1 minute of speaking time per call, at 140 words per minute. That gives 1.9 trillion words per day across all phone calls.
To my knowledge, phone call content is not retained anywhere at scale, except by state actors such as the NSA. NSA Mystic was retaining 30 days’ worth of calls for entire countries (specifically the Bahamas and Afghanistan) from 2011 onwards. Whether these programs are still in operation is unknown. With advances in storage and voice recognition since 2011, they could potentially now be retaining call content more broadly. If they had, say, 10% of global call content retained for one year, that would be about 70 trillion words. I think it’s useful to know this order of magnitude, even if no such recordings exist. For our purposes let’s assume phone calls are ephemeral and contribute zero words to our count.
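As a sketch, including the hypothetical retention scenario (the 10%-for-one-year retention is purely illustrative):

```python
# Phone calls (rough sketch).
calls_per_day    = 13.5e9
speech_min_call  = 1          # after discounting hold time and spam
words_per_minute = 140

words_per_day = calls_per_day * speech_min_call * words_per_minute
print(f"{words_per_day / 1e12:.1f}T words/day")                     # ~1.9T

retained = words_per_day * 0.10 * 365    # hypothetical: 10% of calls kept for a year
print(f"{retained / 1e12:.0f}T words retained")                     # ~70T
```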
Other private text
Private documents, corporate memos, etc. are small compared to email (think how often you write an email vs a document). It won’t affect our count much either way, and it’s hard to get any data on it, so I won’t estimate this.
Other Data Sources
There are probably other interesting pots of data that I’ve missed. Here are a few candidates which I didn’t get around to looking at:
- Usenet archives
- Government documents
- Newspaper archives
I don’t think they would move the needle too much overall, though they may be useful for specific applications. There are probably other sources I haven’t considered.
Total human communication
Total words spoken (per day)
At the time of writing, world population is 8.1 billion people, of which about 90% are old enough to speak. Decent studies put average words spoken per person per day at about 16,000. So total words spoken per day worldwide is about 115 trillion. About 1% of this is recorded, mostly in the form of emails and instant messages. The rest is lost in time, like tears in rain.
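As a sketch:

```python
# Total words spoken worldwide per day (rough sketch).
population       = 8.1e9
speaking_share   = 0.9
words_per_person = 16_000

words_per_day = population * speaking_share * words_per_person
print(f"{words_per_day / 1e12:.0f}T words spoken per day")   # ~115-117T
```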
Total words spoken (since 1800)
There are fairly good estimates for the number of people born each year since 1800. The total comes to just under 18 billion. Adjusting for years lived, I estimate around 3 million trillion words spoken since 1800.
Total words spoken (all time)
Finally, how many words have been spoken since language first evolved? Deep time population estimates are tricky, but a reasonable one comes out at 117 billion people. However, infant mortality was extremely high in earlier periods, so many of these people would not have lived long enough to speak. Adjusting for life expectancy, I estimate total words ever spoken at around 6 million trillion. Almost half of these words have been spoken since 1800.
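A sketch of both historical estimates; the per-person speaking-years figures are my own crude assumptions, chosen to reflect years lived and infant mortality:

```python
# Words spoken since 1800 and over all of human history (very rough sketch).
words_per_speaking_year = 16_000 * 365           # ~5.8M words per year of life

born_since_1800 = 18e9
since_1800 = born_since_1800 * 30 * words_per_speaking_year   # ~30 speaking-years each
print(f"{since_1800:.1e} words since 1800")                   # ~3e18 = 3 million trillion

born_before_1800 = 117e9 - born_since_1800                    # ~99B people
# Earlier populations average only a handful of effective speaking-years each,
# reflecting very high infant mortality and short life expectancy.
all_time = since_1800 + born_before_1800 * 5 * words_per_speaking_year
print(f"{all_time:.1e} words in all of history")              # ~6e18 = 6 million trillion
```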
Footnotes
1. Conceivably there may be privacy-preserving learning methods that allow you to train public models on private data, but the cost of a mistake would be enormous, so I can’t see major companies going this route any time soon.
2. Perhaps even a rogue AI, though I don’t think that’s very likely.
3. Source: my own paper :-)
4. There are about 4 chars per token of code. I did a crude sample of some random Github files and the average line of code came to 39.4 characters. That works out at 9.8 tokens per line. Let’s round that off to 10 tokens per line.
5. After I posted this, Twitter user @c_stroebele pointed me to two DeepMind papers that disclose some relevant numbers. The first is the Gopher paper, which uses a sample of 4m books (presumably from Google Books). The authors write that their “average book contains 120,000 tokens”, which is precisely in line with our estimate. However, in a later paper, Table 8 lists a dataset of 20,472,632 English books. This must be the entire English portion of Google Books. They explicitly give the total token count for this as 3,423,740 million tokens, using SentencePiece with a 128k vocab. I’m unsure what to make of this. On one hand, it’s the number we’re looking for, direct from the horse’s mouth. On the other hand, it works out at 167k tokens per book, which doesn’t tally with the other DeepMind paper, and implies the average book in the corpus must be over 500 pages. That doesn’t seem right, so I feel I must be missing something. I’m going to stick with my original number for now. If anyone knows better, please get in touch.
6. You can access a sample of the documents here. They range from boilerplate forms to long written submissions. A surprising number of them are handwritten.
7. There was a recent successful legal challenge to these charges, so potentially this may change, but it’s correct at the time of writing.
8. It seems like they would have celebrated reaching a billion, so the fact they didn’t suggests volumes stagnated or maybe declined. As a sanity check, Twitter has about 250m daily active users in 2024. So 375m tweets is 1.5 tweets per user per day, which seems plausible.
9. This seems too short to me, based on my own experience with the platform. I sampled my own feed at random and the average tweet was 31.6 words (about 180 chars), which is 6x longer, and in line with the data from Weibo. However, Twitter has changed post length limits substantially in recent years, so this may be a recent phenomenon. The 33 character average comes from an official Twitter blog post, so it must be correct up to 2018.
10. I found it hard to get a good number on this, but ~15% was mentioned in various sources, which seems reasonable based on my own experience.
11. I couldn’t find an original source for this, but the number is repeated around the web and attributed to Snapchat. It seems about the right order of magnitude.
12. This is another one of those numbers that’s repeated around the internet with no clear source, so I am a little dubious, but it does seem like the right order of magnitude. The most official-looking source I can find is a report from a small consultancy here, with no methodology given.
13. This is about right based on my own inbox, and is in line with some public email corpora such as the Enron emails.
How much publicly available unique code is there?
The data for the following can be found in my Evidence-based Software Engineering book, with pdf, code and all data freely available here:
http://knosof.co.uk/ESEUR/
The Software Heritage archive contains around 10^8 unique files (fig 1.15), with say 64 lines per file (table 7.2) and 26 non-whitespace characters per line (fig 7.39), giving 1.6 × 10^11 non-whitespace characters.
Now, how many tokens is that? The most common statement is `a=b;`, which contains 4 language tokens. There is an exponential decline in language tokens per line (Fig 770.17 here: http://knosof.co.uk/cbook/). The question is how many LLM tokens there are per computer language identifier; identifiers tend to be abbreviated, and I have no idea how they translate to LLM tokens.
If we say 10 LLM tokens per line, we get: 10^8 × 64 × 10 → 6.4 × 10^10 LLM tokens.
What percentage of the publicly available code is stored by Software Heritage? Given the exponential growth seen in fig 1.15, perhaps 1% or less.