Are We Running Out of Data for AI? (Featuring Encord)
- Kieren Sharma
- Nov 28, 2024
- 4 min read
Updated: Feb 28
In this episode, we tackle a question that's becoming increasingly relevant in the world of AI: are we running out of data? Initially, we planned to discuss how AI might be progressing faster than anticipated, but recent research suggests a potential limit to the current methods of AI development. This episode explores how the rapid advancements in AI have been significantly fuelled by scaling up models, compute power, and training data. But is this sustainable?

Recent AI Progress
The recent progress in AI, particularly in large language models (LLMs), has been impressive. Let’s look at the progression of OpenAI’s models, for example:
GPT-2 (2019): 1.5 billion parameters, trained on 40GB of text data.
GPT-3 (2020): 175 billion parameters, trained on 570GB of text data. Its capabilities paved the way for products like ChatGPT.
GPT-4 (2023): estimated at around 1.7 trillion parameters (OpenAI has not officially confirmed the parameter count, and the training data size is not publicly known).
These models show a clear trend of increasing size and, consequently, capability. However, it’s not just about scaling up the model size; there's also a crucial relationship between model size, data size, and training time.
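To make that relationship concrete, a useful rule of thumb comes from the "Chinchilla" scaling work (Hoffmann et al., 2022): a compute-optimal training run uses roughly 20 training tokens per model parameter, and training consumes about 6 FLOPs per parameter per token. The sketch below applies that heuristic to the parameter counts quoted above (the GPT-4 figure is a rumoured estimate, not an official one); it illustrates the scaling relationship rather than describing how these models were actually trained.

```python
# Rough compute-optimal scaling heuristic from the Chinchilla paper
# (Hoffmann et al., 2022): train on ~20 tokens per model parameter,
# with total training compute C ~= 6 * N * D FLOPs.
# Parameter counts below are illustrative, not official figures.

def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate number of training tokens for a compute-optimal run."""
    return tokens_per_param * n_params

def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard rule of thumb: ~6 FLOPs per parameter per training token."""
    return 6.0 * n_params * n_tokens

for name, n in [("GPT-2", 1.5e9), ("GPT-3", 175e9), ("GPT-4 (rumoured)", 1.7e12)]:
    d = chinchilla_optimal_tokens(n)
    print(f"{name}: ~{d / 1e12:.1f}T tokens, ~{training_flops(n, d):.2e} FLOPs")
```

Under this heuristic, a trillion-parameter-scale model "wants" tens of trillions of training tokens, which is exactly why the question of how much usable text exists becomes pressing.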
The Data Bottleneck
While computational power and training time have been improving, we're starting to see a potential bottleneck with data. The amount of data on the internet is a finite resource, and we may be approaching the limits of what’s available.
Public vs Private Web: AI models primarily train on publicly available online data, which is only a fraction of the total data created. This includes websites, forums, blogs, and platforms like Reddit and Wikipedia. Private data such as emails and personal messages is not accessible to these models.
Common Crawl: The most widely used dataset is Common Crawl, which contains about three billion web pages, amounting to roughly 100 terabytes of compressed text, or approximately 100 trillion tokens.
Estimates: The public web contains an estimated 500 trillion tokens, while private data may hold around 3,000 trillion tokens.
Current Usage: The largest training datasets for LLMs have reached about 15 trillion tokens, indicating that they aren't using the full Common Crawl due to data quality issues.
What are tokens? Tokens are the chunks of text that large language models (LLMs) actually process: whole words, sub-word fragments, or punctuation marks. With most tokenizers in use today, one token corresponds to roughly 0.75 English words on average.
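For a hands-on feel for this, the minimal sketch below uses OpenAI's open-source tiktoken library (chosen here purely as an example; any modern BPE tokenizer would illustrate the same point) to encode a short sentence and inspect the resulting tokens.

```python
# Minimal tokenization example using the open-source tiktoken library
# (pip install tiktoken). The "cl100k_base" encoding is the one used by
# GPT-3.5/GPT-4-era models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Are we running out of data for AI?"
tokens = enc.encode(text)

print(tokens)                              # list of integer token IDs
print(len(tokens), "tokens for", len(text.split()), "words")
print([enc.decode([t]) for t in tokens])   # each token ID mapped back to its text piece
```

Short sentences will land a little above or below the 0.75 words-per-token average, but the ratio becomes very stable over large corpora, which is why token counts are the standard unit for measuring training data.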
Multimodality and Data Diversity
The conversation then expanded to include other modalities beyond text, such as image, video, and audio data. There's an estimated 10 trillion seconds of video and 500 billion to a trillion seconds of audio on the public web. These modalities carry complementary context and information, enriching the overall pool of training data.
How Long Until We Run Out?
A study from Epoch AI suggests we could face a data shortfall between 2026 and 2032, with a median date of 2028. This is driven largely by the heavy reliance on text data: the amount of data required to train frontier models is growing much faster than new data is being created on the internet.
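To see roughly where a date like 2028 comes from, here is a deliberately crude back-of-envelope calculation using the figures quoted above (a ~500 trillion-token public stock and ~15 trillion-token training sets) plus an assumed growth rate in training-set size of about 2.5x per year. The growth rate is an illustrative assumption, not Epoch AI's actual methodology.

```python
# Toy back-of-envelope estimate of when training sets could exhaust public text.
# The stock and current dataset sizes come from the figures quoted above; the
# 2.5x/year growth rate is an illustrative assumption, not Epoch AI's model.
import math

stock_tokens = 500e12        # estimated stock of public text on the web
current_dataset = 15e12      # largest training sets today (~15T tokens)
growth_per_year = 2.5        # assumed yearly growth in training-set size

years = math.log(stock_tokens / current_dataset) / math.log(growth_per_year)
print(f"Crossover in roughly {years:.1f} years")   # ~3.8 years with these inputs
```

With those inputs the crossover lands a little under four years out, i.e. around 2028, which is in the same ballpark as Epoch AI's median estimate even though the real analysis is far more careful about data quality and reuse.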
Encord is Here to Help!
The second half of the episode featured Oscar Evans, a machine learning solutions engineer from Encord, a data development platform that focuses on data curation and labelling. Encord helps companies manage their data at scale, emphasising the importance of data quality over quantity. Key takeaways from Oscar included:
Data Quality is Paramount: High-quality data tailored to the specific use case matters more than sheer volume.
Managing Data at Scale: Encord helps clients find and manage specific data within large datasets.
AI to Understand AI: AI can be used to understand and categorise data sets.
Multimodality: The fusion of different data types is essential for developing agentic AI models.
Data Poisoning: Even very small amounts of poor-quality data can negatively impact models.
High-Quality Labels: It is critical to have high-quality, human-reviewed labels for accurate results.
Data-Centric Approach: Instead of focusing on simply scaling up data, focusing on the quality and relevance of data is more effective.
The Future of AI Training
We discussed how companies are beginning to use AI-generated or 'synthetic' data to train AI models. This is currently being done in areas where correctness is easily verified, such as coding tasks and mathematics. The potential risk is that models will collapse when they are trained repeatedly on previous versions of their own outputs. There's an increasing acknowledgement that simply scaling up models and data isn't a sustainable approach. New model architectures, data quality, and synthetic data will play an increasingly important role.
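As a rough illustration of how synthetic coding data can be verified, the hypothetical sketch below keeps a model-generated solution only if it passes a set of known test cases before it is added to a training set. The function names and the trivial example are made up for illustration and do not reflect any particular company's pipeline.

```python
# Hypothetical sketch of filtering synthetic coding data: keep a model-generated
# solution only if it passes automatically checkable tests. Names and the example
# "model output" below are illustrative only.

def passes_checks(candidate_code: str, test_cases: list[tuple[int, int]]) -> bool:
    """Run the generated function against known input/output pairs."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)          # define the generated function
        solve = namespace["solve"]
        return all(solve(x) == expected for x, expected in test_cases)
    except Exception:
        return False                             # broken code is discarded

# A synthetic sample: the "model output" and the tests that verify it.
generated = "def solve(n):\n    return n * n\n"
tests = [(2, 4), (3, 9), (10, 100)]

training_set = []
if passes_checks(generated, tests):
    training_set.append(generated)               # only verified samples are kept

print(len(training_set), "verified synthetic sample(s)")
```

This kind of automatic verification is what makes coding and mathematics attractive domains for synthetic data: correctness can be checked cheaply, which helps guard against the feedback loop that leads to model collapse.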
Conclusion
The episode revealed that while AI progress has been rapid, the current trajectory isn’t infinitely sustainable. As we approach data limits, there's a growing need to prioritise data quality, explore new modalities, and develop innovative methods for training AI models. The conversation with Oscar Evans from Encord highlighted that data is the key, and there's significant potential for progress by being more thoughtful about the data used to train AI models.
If you enjoyed reading, don’t forget to subscribe to our newsletter for more, share it with a friend or family member, and let us know your thoughts—whether it’s feedback, future topics, or guest ideas, we’d love to hear from you!