A new study by Microsoft estimates the sizes of some of the most powerful AI models in existence today, figures that the companies themselves have kept secret. Microsoft suggests that Claude 3.5 Sonnet consists of about 175 billion parameters and that o1-preview has about 300 billion.
The technology company also suggests that OpenAI’s smaller models, o1-mini and GPT-4o mini, are made up of 100 billion and 8 billion parameters, respectively.
This got people excited. GPT-4o mini is a capable model from OpenAI, ranking above the larger Claude 3.5 Haiku and comparable to the latest Llama 3.3 70B on Artificial Analysis quality metrics.
An 8B-parameter model could, in principle, be run locally on portable devices. In a post on X, Yuchen Jin, CTO of Hyperbolic Labs, asked OpenAI chief Sam Altman whether the model could be made available so that people could run it on their own local devices.
However, some people speculate that GPT-4o mini, like GPT-4o, is a Mixture-of-Experts (MoE) model, which uses smaller, specialized models internally to handle different parts of a problem.
Oscar Le, CEO of SnapEdit, one of the most popular AI photo-editing apps, said, “We believe 4o-mini has about 40B total parameters and maybe 8B active (MoE).”
“We found that while being very fast, it retains much more knowledge than typical 8B models when you ask about facts. Furthermore, since GPT-4o is an MoE, the mini may use the same architecture,” he added in a post on X.
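For a sense of what that estimate implies, here is a rough back-of-envelope sketch of MoE parameter counting. The split between shared and expert parameters, the expert count, and the top-1 routing below are all illustrative assumptions; OpenAI has not disclosed GPT-4o mini’s architecture.

```python
# Illustrative MoE parameter arithmetic (assumed numbers, not disclosed figures).
shared_params = 3.2e9      # assumed embeddings, attention and router weights, used by every token
params_per_expert = 4.6e9  # assumed size of each feed-forward expert
num_experts = 8            # assumed number of experts
active_per_token = 1       # assumed top-1 routing: one expert runs per token

total_params = shared_params + num_experts * params_per_expert
active_params = shared_params + active_per_token * params_per_expert

print(f"Total parameters:      {total_params / 1e9:.1f}B")   # ~40.0B
print(f"Active params / token: {active_params / 1e9:.1f}B")  # ~7.8B
```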
Microsoft used these models in a study that develops a benchmark for detecting and correcting medical errors in clinical notes. The parameter counts it cites, however, are not exact.
“The exact number of parameters of several LLMs has not been publicly disclosed. Most of the parameter figures are estimates, reported to provide more context for understanding model performance,” Microsoft says in the study.
OpenAI, Anthropic, and Google have not released detailed technical reports outlining the architectural details and techniques behind their latest models, likely out of concern about exposing proprietary technology. Notably, GPT-4, released in 2023, was the last model for which OpenAI published a technical report.
However, companies such as Microsoft, Chinese AI giant Alibaba (with its Qwen models), and DeepSeek have published detailed technical documentation for their models. Microsoft recently released full details of its Phi-4 model.
Harkirat Behl, one of the creators of Microsoft’s Phi-4 model, said in an interview with AIM that the company is taking a different approach from OpenAI and Google. “In fact, we have given away all the secret recipes and very complex techniques (for the model), which no one else in the world has implemented.”
“We published all those details in print. That’s how much we love open source at Microsoft,” Behl added.
“You don’t just need a big model.”
The parameter counts of AI models have been shrinking in recent years, and the latest figures confirm this trend. Last year, EpochAI published parameter estimates for several frontier models, including GPT-4o and Claude 3.5 Sonnet.
Like Microsoft, EpochAI estimated that GPT-4o has about 200 billion parameters. According to EpochAI, Claude 3.5 Sonnet has about 400 billion parameters, in stark contrast to Microsoft’s estimate of 175 billion. Regardless, the figures suggest that AI labs are no longer prioritizing sheer parameter count.
Between GPT-1 and GPT-3, the parameter count grew by a factor of roughly 1,000, and between GPT-3 and GPT-4 by another factor of 10, from 175 billion to a reported 1.8 trillion. Since then, however, the trend has reversed.
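Those factors check out against GPT-1’s publicly known size of 117 million parameters; the 1.8-trillion figure for GPT-4 remains an unconfirmed estimate.

```python
# Rough parameter-scaling factors between GPT generations.
# GPT-4's 1.8T figure is a widely reported estimate, not an official number.
gpt1 = 117e6   # GPT-1 (2018)
gpt3 = 175e9   # GPT-3 (2020)
gpt4 = 1.8e12  # GPT-4 (2023), estimated

print(f"GPT-1 -> GPT-3: ~{gpt3 / gpt1:,.0f}x")  # ~1,496x, roughly a thousandfold
print(f"GPT-3 -> GPT-4: ~{gpt4 / gpt3:.1f}x")   # ~10.3x
```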
“Far from reaching the 10-trillion-parameter mark, current frontier models such as the original GPT-4o and Claude 3.5 Sonnet are probably an order of magnitude smaller than GPT-4,” Ege Erdil, a researcher at EpochAI, noted last December.
Initially, increasing parameter count improved model performance. Over time, however, the gains stopped keeping pace with the growing compute and parameter budgets, and the scarcity of new training data has added to these diminishing returns.
“Models with more parameters are not necessarily better; they are generally more expensive to run and require more RAM than a single GPU card can accommodate,” Yann LeCun said in a post on X.
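LeCun’s RAM point can be illustrated with a rough weights-only memory estimate. The model sizes and precisions below are illustrative, and KV cache, activations, and framework overhead are ignored.

```python
# Approximate GPU memory needed just to hold model weights at inference time
# (ignores KV cache, activations, and framework overhead).

def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * 1e9 * bytes_per_param / 1024**3

for name, params in [("8B model", 8), ("70B model", 70), ("400B model", 400)]:
    fp16 = weight_memory_gb(params, 2)    # 16-bit weights
    int4 = weight_memory_gb(params, 0.5)  # 4-bit quantized weights
    print(f"{name}: ~{fp16:.0f} GB at FP16, ~{int4:.0f} GB at 4-bit")

# An 8B model fits on a single consumer GPU (or a laptop, once quantized);
# a 400B dense model far exceeds the memory of any single card.
```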
For this reason, engineers have sought efficient ways to scale models at the architectural level. One such technique is MoE, which GPT-4o and 4o mini reportedly use.
“(MoE is) a neural network consisting of several specialized modules, only one of which is executed for a given prompt. Therefore, the effective number of parameters used at any one time is less than the total number,” LeCun further stated.
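As a toy illustration of that description (a minimal sketch, not the architecture of GPT-4o or any production model), a top-1 MoE layer routes each token to a single expert, so only that expert’s parameters are exercised for that token:

```python
# Minimal top-1 Mixture-of-Experts layer (toy sketch, not any production architecture).
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        expert_idx = self.router(x).argmax(dim=-1)        # one expert chosen per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():                                # only the selected expert runs
                out[mask] = expert(x[mask])
        return out

layer = Top1MoE(d_model=64, d_hidden=256, num_experts=8)
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64]) -- each token used 1 of 8 experts
```

Production MoE models typically add load-balancing objectives and may route each token to more than one expert, but the principle is the same: far fewer parameters are active per token than the model contains in total.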
As 2024 drew to a close, models built with innovative techniques began to rival the frontier models. Microsoft’s Phi-4, released in December, was trained on small, carefully curated, high-quality datasets and outperforms many leading models, including GPT-4o.
Just two weeks ago, DeepSeek released its open-source MoE model, V3. Not only does it outperform GPT-4o on most benchmarks, it was also trained for just $5.576 million. By comparison, GPT-4 reportedly cost $40 million to train and Gemini Ultra $30 million.
Therefore, in 2025 we are likely to see more optimization and scaling techniques that take models to higher levels at significantly lower costs.
“While model size has stagnated or even decreased, researchers are now looking at the pertinent question of test-time training, or neuro-symbolic approaches such as test-time search, program synthesis, or the use of symbolic tools,” wrote François Chollet, creator of Keras and the ARC-AGI benchmark, in a post on X.
“We don’t just need bigger models. We need better ideas. Now, better ideas are finally being put into practice,” he added.