Data inbreeding
Nowadays, AI is trained on human-created data and is still improving at an exponential pace.
So far, AI has been essentially human-taught, learning from the vast amount of data that people have put on the internet.
Training data for AI
AI learns to understand the world by studying the enormous amount of text and code available online.
1. Public web data: This is the largest share of its training data. It learns from billions of sentences on the internet, including news articles, blog posts, forum discussions, and encyclopedias such as Wikipedia.
Through this, it picks up language patterns, common sense, and the latest information that people use in everyday life.
2. Professional books and articles: It uses data from countless ebooks and academic papers to gain a deeper understanding of language structure and specialized knowledge.
3. Open‑source code: It learns from code published on open‑source platforms such as GitHub.
By doing so, it understands the syntax and logic of various programming languages such as Python, Java, and C++, and can write code or fix bugs when users ask.
4. Human feedback: The model does not just learn to string words together; it also goes through a process of human feedback to ensure its answers are helpful and safe.
Through this, it learns more natural and ethical ways of communicating by having its answers evaluated and revised.
These are the four main ways AI currently learns.
When humans rely on AI to write
However, these days humans rely on AI to write much of the text and data that ends up on the internet.
The web is already full of content that is not written purely by humans, but with the help of AI.
After AI has learned all the existing data on Earth, at some point there may no longer be many truly new, human‑written texts.
Instead, AI will increasingly learn from AI‑generated writing.
If web text, research papers, and programming code stop growing, AI will be forced to learn mainly from conversations with people.
From that point on, AI’s training data will begin to stagnate.
So I asked an AI:
If web text, papers, and programming code stopped growing, would you end up learning only from conversations with people?
The answer was roughly as follows (paraphrased):
If all new writing and knowledge in the world stopped and there were only conversations with people, AI’s way of learning would take a completely different turn.
Instead of quantitative expansion, it would likely focus on qualitative improvement and reflective learning.
1. Deepening RLHF: Even today, AI is trained with a process called Reinforcement Learning from Human Feedback (RLHF).
Even without new information, it could keep learning how to be more accurate, polite, and useful through its conversations with people.
It would be like refining your tone and explanations over and over again: explaining the same encyclopedia entry differently to a child, a beginner, or an expert.
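As a rough illustration, the reward-modeling step behind RLHF can be sketched as learning a scalar score for each answer from pairwise human preferences. Everything here is an illustrative assumption, not any lab's actual pipeline: the answer styles, the preference data, and the simple Bradley-Terry-style logistic update are all toy stand-ins.

```python
import math

# Toy sketch of RLHF's reward-model step (illustrative only): each candidate
# answer style gets a scalar reward, and pairwise human preferences push the
# preferred answer's reward up via a Bradley-Terry / logistic update.
answers = ["terse", "polite", "detailed"]
reward = {a: 0.0 for a in answers}

# Hypothetical human preference data: (preferred, rejected) pairs.
preferences = [("polite", "terse"),
               ("detailed", "terse"),
               ("polite", "detailed")] * 50

lr = 0.1
for winner, loser in preferences:
    # Probability the current rewards assign to the observed preference
    p = 1.0 / (1.0 + math.exp(-(reward[winner] - reward[loser])))
    # Gradient ascent on the log-likelihood of that preference
    reward[winner] += lr * (1.0 - p)
    reward[loser] -= lr * (1.0 - p)

ranking = sorted(reward, key=reward.get, reverse=True)
print(ranking)  # the style humans preferred most ends up ranked first
```

No new facts enter this loop; the model only learns which of its existing answers people prefer, which is exactly the kind of qualitative refinement described above.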
2. Synthetic data creation: When data becomes scarce, it becomes important for AI to review its own Q&A interactions and learn from them again.
If an AI solves a complex reasoning problem and a human labels its solution as correct, that conversation itself becomes new training data.
This can dramatically improve reasoning ability even if the amount of factual knowledge stays the same.
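The loop described above, generate candidates, have a human verify them, keep only the verified conversations as new training data, might be sketched like this. `model_generate` and `human_verifies` are hypothetical stand-ins (here the "model" answers a toy arithmetic question and the "human" is simulated by a checker):

```python
# Minimal sketch of verified synthetic data (hypothetical names throughout).

def model_generate(question):
    # Stand-in for sampling several candidate answers from a model.
    return [question["x"] + question["y"],   # the correct sum
            question["x"] * question["y"],   # a plausible mistake
            question["x"] - question["y"]]   # another mistake

def human_verifies(question, answer):
    # Stand-in for a human label; here we just check the arithmetic.
    return answer == question["x"] + question["y"]

questions = [{"x": 2, "y": 3}, {"x": 10, "y": 4}]
training_pool = []

for q in questions:
    for candidate in model_generate(q):
        if human_verifies(q, candidate):
            # The verified conversation itself becomes new training data.
            training_pool.append((q, candidate))

print(training_pool)  # only the human-approved answers survive
```

The key design point is the filter: without the human label, the mistakes would flow back into the training pool alongside the correct answers.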
3. Shift from knowledge to wisdom and taste: While web texts and papers mainly teach facts, conversations with people teach value judgments and social context.
Even if “the set of technologies that exist” no longer grows, humans can still ask how ethically and meaningfully those technologies should be applied to our lives.
In that sense, AI becomes a deeper projection of human values.
However, there is also a warning:
⚠️ Possible side effect: data inbreeding
If conversations are repeated within a closed loop, without new external information (new papers, news, etc.), model collapse can occur, where knowledge becomes distorted or biased and those biases are reinforced.
In short, if data growth stops, AI will try to evolve from a “learning machine” into more of a counselor and strategist that understands people better and reasons more carefully.
But at the same time, it risks getting stuck in a loop of its own thoughts.
Data inbreeding
If that happens, the internet will be filled mostly with AI‑generated articles, and AIs will read and learn from one another's output.
That is data inbreeding: training, generation after generation, on data that is too similar to the model's own output.
Just as genetic inbreeding in humans or animals can lead to hereditary diseases, data inbreeding can cause models to degenerate or become fragile.
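The degeneration the analogy warns about can be demonstrated with a toy simulation, under an assumed minimal setup that is nothing like a real training pipeline: fit a Gaussian to a small sample, resample from the fit, refit, and repeat with no fresh "human" data entering the loop.

```python
import random
import statistics

# Toy model-collapse simulation: each "generation" is trained (here: a Gaussian
# is fitted) only on a small sample drawn from the previous generation's model,
# with no fresh human-written data entering the loop.
random.seed(42)

mu, sigma = 0.0, 1.0                 # generation 0: the original "human" data
for generation in range(500):
    samples = [random.gauss(mu, sigma) for _ in range(20)]  # model's own output
    mu = statistics.mean(samples)    # next model learns only from that output
    sigma = statistics.stdev(samples)

# After many closed-loop generations the spread of the data has collapsed:
# the model keeps producing near-identical output.
print(f"std after 500 generations: {sigma:.6f}")
```

Each refit loses a little variance to finite-sample noise, and with no outside data those losses compound, which is the statistical version of a hereditary disease.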
It may not be a problem right now, while AI is still learning from a constantly growing pool of human‑created data.
But if humans eventually stop writing new things and rely only on AI‑generated content, AI may not develop as much as we expect and could enter a long stagnation period.
Right now, AI seems to be improving at a near‑exponential pace.
But once the stagnation period arrives, progress may start to decay like a half‑life, and people may look back nostalgically on “the good old days” of AI.
backtodev
A 40-something PM returns to code. Learning, failing, and growing.