- Researchers from top American universities report that extending pre-training can harm performance
- Too much pre-training can yield worse performance due to something akin to the butterfly effect
- The more extensively models are pre-trained, the more sensitive they become to small changes that can disrupt the final result
Researchers from Carnegie Mellon, Stanford, Harvard and Princeton are challenging one of the accepted core beliefs of AI development – that more pre-training data always means better performance.
As reported by HPCwire, a new paper describes the concept of "catastrophic overtraining," whereby extended pre-training can harm a model's performance after fine-tuning.
The researchers compared two versions of the OLMo-1B model, one trained on 2.3 trillion tokens and another on 3 trillion. Despite the larger training set, the more extensively trained model reportedly performed up to 3% worse on benchmarks like AlpacaEval and ARC.
Reaching the inflection point
This performance drop, the study claims, is linked to a phenomenon called "progressive sensitivity."
As the token count increases, the model becomes more fragile. Even small adjustments, like tweaks during fine-tuning, or the introduction of noise, can reverse earlier gains.
The authors demonstrated this by injecting Gaussian noise into pre-trained models, noting that performance degraded more sharply the longer the model had been trained.
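To make that probe concrete, here is a minimal sketch of a noise-injection experiment of this kind, assuming a PyTorch model; the `perturb_parameters` helper, the `load_checkpoint`/`evaluate` names, and the noise scales are illustrative assumptions, not the authors' exact setup.

```python
import torch

@torch.no_grad()
def perturb_parameters(model: torch.nn.Module, sigma: float) -> None:
    """Add i.i.d. Gaussian noise N(0, sigma^2) to every parameter, in place."""
    for param in model.parameters():
        param.add_(torch.randn_like(param) * sigma)

# Usage idea (load_checkpoint and evaluate are hypothetical helpers): score
# the same benchmark before and after perturbing. "Progressive sensitivity"
# predicts the drop grows with how long the base model was pre-trained.
# for sigma in (1e-4, 1e-3, 1e-2):
#     model = load_checkpoint("...")   # hypothetical checkpoint loader
#     base = evaluate(model)           # hypothetical benchmark harness
#     perturb_parameters(model, sigma)
#     print(sigma, base - evaluate(model))
```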
The point where this additional training begins to degrade performance is called the "inflection point."
Once reached, the benefits of training start to be outweighed by the risk of internal instability. The study found that this tipping point often occurs beyond 2.5 trillion tokens in smaller models, like OLMo-1B.
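As a toy illustration of that tipping point, one could locate it as the pre-training budget where downstream scores peak. The numbers below are made up, loosely shaped to the pattern the study reports for OLMo-1B; they are not data from the paper.

```python
def inflection_point(budgets: list[float], scores: list[float]) -> float:
    """Return the pre-training token budget with the best downstream score.

    budgets -- pre-training token counts, in ascending order
    scores  -- post-fine-tuning benchmark scores at each budget
    """
    best = max(range(len(scores)), key=lambda i: scores[i])
    return budgets[best]

# Illustrative numbers only: scores rise, then degrade past ~2.5T tokens.
budgets = [1.5e12, 2.0e12, 2.5e12, 3.0e12]
scores = [61.2, 63.0, 63.8, 60.9]
print(inflection_point(budgets, scores))  # -> 2.5e12
```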
"Catastrophic overtraining may be inevitable … especially when the pre-training and fine-tuning tasks are misaligned," the authors warn in their paper, which you can access via the arXiv pre-print server.
While the researchers do not suggest an end to pre-training, they believe developers should consider how much pre-training is enough. As the paper concludes, "our results call for a renewed focus on model scaling that considers the entire training pipeline."
For AI developers chasing scale, the message seems clear: sometimes, less really is more.




