AI here, AI there. Google has released new devices with AI functions, and Apple will soon be distributing its first operating systems with "Apple Intelligence". ChatGPT, DALL-E, Gemini, Llama, MidJourney and Stable Diffusion are also widely used. But this is precisely where the problem lies for the further development of the underlying models, such as the Large Language Models (LLMs) behind chatbots. If these are trained with too much AI-generated content, so-called "model collapse" can occur. At least, that is the conclusion of a study by the University of Oxford, published in the journal Nature.
Chapter in this post:
- 1 AI failure: What is the so-called “model collapse”?
- 2 How does AI model collapse occur?
- 3 What is the danger of current AI training?
- 4 Is there a solution to the problem of AI model collapse?
- 5 Similar posts
AI failure: What is the so-called “model collapse”?
Chatbots like ChatGPT are trained with millions of texts and billions of words so that they can recognize connections and give relevant answers to questions. The situation is similar with AIs that generate images, which are fed a huge number of photos, works of art, sketches and the like. However, the study linked above shows that AI models produce increasingly poor results the more each generation is trained on the output of previous generations.
As an example, a chat about historical architecture is shown, which after only a few AI-trained generations of the chatbot led to incomprehensible answers mentioning different types of rabbits. The situation is similar with AI models that generate images. Another study, from as early as 2023, showed that image-generating AI models sometimes produce heavily distorted results even after training on only small amounts of their own images. AI model collapse describes the output of incomprehensible or unrecognizable results despite the input of understandable questions or tasks.
How does AI model collapse occur?
The current study in the journal Nature shows the reasons for the incomprehensible and distorted output of the AIs. According to the authors, probable answers and sentence elements are given higher priority with each generation, while less likely content, phrases and words fall further and further behind and are ultimately forgotten. After several generations of AI models trained on previous answers, the output contains completely false assumptions, repeated words and the like. Or in short: the AI "poisons" its own reality.
If the original training material for generation 0 already contains some errors, which may even be repeated and thus appear to be important, then these are reinforced more and more in later generations. If each generation n is trained with data from generation n-1, at some point the only possible answer is the error. If grammatical collapse then sets in, repetitions arise, as in the chat example of the study, where the answer eventually consists only of "[...] In addition to being home to some of the world's largest populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue @-@ tailed jackrabbits, red @-@ tailed jackrabbits, […]".
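The tail-forgetting mechanism described above can be illustrated with a toy simulation. This is a minimal sketch, not the study's actual method: a hypothetical vocabulary with a few common words and many rare ones is repeatedly sampled, and each "generation" re-estimates its word frequencies only from the previous generation's output. Rare words that happen not to appear in a sample are forgotten for good, so the distribution narrows generation by generation.

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the toy run is reproducible

# Hypothetical vocabulary: 3 common words plus a long tail of 97 rare ones.
vocab = ["the", "of", "and"] + [f"rare_{i}" for i in range(97)]
# Generation 0 frequencies: common words dominate, rare words barely occur.
weights = [300, 200, 100] + [1] * 97

def train_next_generation(weights, sample_size=500):
    """Sample text from the current model, then re-estimate word
    frequencies from that sample alone. Any word that never appears
    in the sample gets weight 0 and is permanently forgotten."""
    sample = random.choices(vocab, weights=weights, k=sample_size)
    counts = Counter(sample)
    return [counts.get(w, 0) for w in vocab]

for gen in range(10):
    alive = sum(1 for w in weights if w > 0)
    print(f"generation {gen}: {alive} of {len(vocab)} words still known")
    weights = train_next_generation(weights)
```

Running this, the number of "known" words shrinks with each generation while the common words survive, mirroring the study's finding that likely phrases are reinforced and unlikely ones disappear.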
What is the danger with current AI training?
Training material is slowly running out for both chatbots and AI models that generate images or videos. The (freely available) web has already been almost completely drained for the largest companies and their models. Whether OpenAI, Google, Meta or others - the endless hunger for additional texts and data sets can only be satisfied in two ways: using every bit of brand new content directly as training material, and using "synthetic data".
But as more and more AI content floods the web, AI models will sooner rather than later be fed their own outputs. "Synthetic data", meanwhile, means AI-generated data sets created specifically for training new models. In addition to the risk of accidental "poisoning" (to borrow the study's term), AI is already being deliberately trained with the content of previous models, for example at Google and Meta. The reason: real material is slowly running out, and high-quality sources (newspapers, etc.) want money for their texts.
Is there a solution to the problem of AI model collapse?
At present, the extent of the output distortion shown in the study is still largely theoretical. The web is not yet filled with so much AI content that human-made content is a vanishing minority, so AI is not yet trained primarily or exclusively on AI content. Nevertheless, precautions must be taken, ideally through cooperation between AI companies and sources of high-quality, human-made data sets. OpenAI is already working with Springer-Verlag and News Corp. However, this costs several million dollars.
Another solution proposed by the researchers at Oxford University is a joint agreement between AI companies to clarify questions of provenance for existing data sets. Data would be checked for whether it originated from one AI or another, so that it could be labeled accordingly and, if necessary, removed from the training set. But this would require automation, and thus another (checking) AI, because manual review cannot keep up with the data hunger of AI training.
Related Articles
- GSM mobile phone with rotary dial “Macintosh Phone 128k” in classifieds
- Authy hack: 33 million phone numbers stolen – security flaw in API uncovered
- Study: Malware can use ChatGPT as an accomplice for its own optimization and distribution
- Nine-year-old Apple TV is compatible with tvOS 18!
- Cara vs. Instagram – that’s why many artists are switching to Cara.App
- M4 iPad Pro camera: What is the new sensor for?
- iFixit teardown: 13-inch iPad Pro and Apple Pencil Pro taken apart
Johannes Domke
After graduating from high school, Johannes completed an apprenticeship as a business assistant specializing in foreign languages. But then he decided to research and write, which resulted in his independence. For several years he has been working for Sir Apfelot, among others. His articles include product introductions, news, manuals, video games, consoles, and more. He follows Apple keynotes live via stream.
Did you like the article, and did the instructions on the blog help you? Then I would be happy if you supported the blog via a Steady membership.