
Hugging Face shared a new case study last week showing how small language models (SLMs) can outperform much larger models. In the post, researchers from the platform argued that instead of increasing the training time of artificial intelligence (AI) models, developers can scale test-time compute to get improved results. The latter is an inference strategy that lets AI models spend more time working on a problem and offers different approaches, such as self-refinement and searching against a verifier, that can improve their performance.
How test-time compute scaling works
In the article, Hugging Face points out that traditional methods of improving an AI model's capabilities are often resource-intensive and extremely expensive. The typical approach is known as train-time compute, where the pretraining data and algorithms are improved so that the underlying model breaks down a query and arrives at a solution more effectively.
Instead, the researchers propose focusing on test-time compute scaling, a technique that lets AI models spend more time solving a problem and correct themselves along the way, which can deliver similar gains.
The researchers highlighted OpenAI's o1 as an example of a reasoning model that uses test-time compute, and noted that the technique can let AI models show improved performance even without any changes to the training data or pretraining methods. However, there is a problem: since most reasoning models are closed, it is not possible to know which strategies they actually use.
The researchers used a Google DeepMind study and reverse engineering to unravel how LLM developers scale test-time compute in the post-training phase. According to the case study, simply increasing processing time does not produce a significant improvement in output on complex queries.
Instead, the researchers suggest using a self-refinement algorithm that allows the AI model to evaluate its responses over subsequent iterations, identifying and correcting potential errors. In addition, a verifier that the model can search against can further improve responses. Such a verifier can be a learned reward model or a hard-coded heuristic.
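To make the idea concrete, here is a minimal sketch of an iterative self-refinement loop paired with a verifier. The generate, critique, and verify callables are hypothetical placeholders for model and reward-model calls, not Hugging Face's actual implementation; the sketch only shows how extra test-time iterations and a verifier score fit together.

```python
from typing import Callable


def self_refine(
    generate: Callable[[str], str],       # hypothetical: model call that returns a candidate answer
    critique: Callable[[str, str], str],  # hypothetical: model call that returns feedback on an answer
    verify: Callable[[str, str], float],  # hypothetical: verifier score (learned reward model or heuristic)
    problem: str,
    max_iterations: int = 3,
    accept_threshold: float = 0.9,
) -> str:
    """Draft an answer, then repeatedly verify and revise it at test time."""
    answer = generate(problem)
    for _ in range(max_iterations):
        score = verify(problem, answer)
        if score >= accept_threshold:
            break  # the verifier is satisfied, so stop spending compute
        feedback = critique(problem, answer)
        # Fold the feedback back into the prompt and try again.
        answer = generate(
            f"{problem}\n\nPrevious attempt:\n{answer}\n\n"
            f"Feedback:\n{feedback}\n\nRevise the answer."
        )
    return answer
```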
More advanced techniques involve a Best-of-N approach, where the model generates multiple responses for each problem and a judge assigns a score to pick the most appropriate one. Such an approach can be paired with a reward model. Beam search, another strategy the researchers emphasize, prioritizes step-by-step reasoning and assigns a score to each step.
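A Best-of-N selection loop can be sketched in a few lines. Here, sample and reward are hypothetical stand-ins for a stochastic model call and a reward-model scorer, not the study's code; the structure simply shows N candidates being generated and the highest-scoring one returned.

```python
from typing import Callable


def best_of_n(
    sample: Callable[[str], str],         # hypothetical: one stochastic completion for the problem
    reward: Callable[[str, str], float],  # hypothetical: reward-model score for (problem, answer)
    problem: str,
    n: int = 8,
) -> str:
    """Sample N candidate answers and keep the one the reward model scores highest."""
    candidates = [sample(problem) for _ in range(n)]
    scores = [reward(problem, candidate) for candidate in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]
```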
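Beam search over reasoning steps follows a similar pattern, except partial solutions are scored and pruned step by step rather than whole answers at the end. The propose_steps, score_step, and is_complete callables below are hypothetical placeholders for a step generator, a process reward model, and a completion check; this is a sketch of the general technique, not the researchers' exact procedure.

```python
from typing import Callable, List, Tuple


def step_beam_search(
    propose_steps: Callable[[str, List[str]], List[str]],  # hypothetical: next-step candidates for a partial solution
    score_step: Callable[[str, List[str]], float],         # hypothetical: process reward model score for a partial solution
    is_complete: Callable[[List[str]], bool],              # hypothetical: does the partial solution end in a final answer?
    problem: str,
    beam_width: int = 4,
    max_steps: int = 10,
) -> List[str]:
    """Keep only the highest-scoring partial solutions at each reasoning step."""
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(max_steps):
        expanded: List[Tuple[float, List[str]]] = []
        for _, steps in beams:
            if is_complete(steps):
                # Finished solutions are carried forward unchanged.
                expanded.append((score_step(problem, steps), steps))
                continue
            for next_step in propose_steps(problem, steps):
                candidate = steps + [next_step]
                expanded.append((score_step(problem, candidate), candidate))
        # Prune to the top-scoring partial solutions.
        beams = sorted(expanded, key=lambda b: b[0], reverse=True)[:beam_width]
        if all(is_complete(steps) for _, steps in beams):
            break
    return max(beams, key=lambda b: b[0])[1]
```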
By using the above strategies, Hugging Face researchers were able to take the Llama 3B SLM and make it outperform the much larger Llama 70B model on the MATH-500 benchmark.