A team of AI researchers from Stanford University and the University of Washington achieved a breakthrough in AI development by training a sophisticated “reasoning” model for under $50 in cloud computing credits. The researchers used Google’s and Alibaba’s AI models to create a chatbot comparable to ChatGPT-maker OpenAI’s o1 LLMs.
The model, named S1, is claimed to perform comparably to state-of-the-art reasoning models such as OpenAI’s o1 and DeepSeek’s R1 on tests assessing mathematics and coding abilities.
The S1 model is now available on GitHub, along with the data and code used in its training.
How the researchers developed the reasoning model
The researchers behind S1 said they started with a readily available base model and refined it through a process known as distillation, a technique that extracts “reasoning” capabilities from another AI model by training on its answers.
The researchers say S1 was distilled from Google’s Gemini 2.0 Flash Thinking Experimental model. The approach mirrors the one Berkeley researchers used last month to create an AI reasoning model for around $450, as illustrated in the sketch below.
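While the paper’s exact pipeline is not reproduced here, a minimal sketch of this kind of distillation step, collecting reasoning traces from a teacher model via Google’s google-generativeai Python client, might look like the following. The model identifier, prompt wording, sample questions and output file are illustrative assumptions, not the S1 team’s actual code.

# Hedged sketch: collecting reasoning traces from a teacher model for distillation.
# Model id, prompt format and file names are assumptions, not the authors' pipeline.
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential
teacher = genai.GenerativeModel("gemini-2.0-flash-thinking-exp")  # assumed model id

questions = [
    "How many positive divisors does 360 have?",
    "What is the sum of the first 50 odd numbers?",
]

traces = []
for q in questions:
    # Ask the teacher to show its working so the reply contains a reasoning trace.
    response = teacher.generate_content(
        f"Solve step by step, then give the final answer:\n{q}"
    )
    traces.append({"question": q, "trace": response.text})

# Save question/trace pairs as the fine-tuning dataset for the student model.
with open("distillation_traces.json", "w") as f:
    json.dump(traces, f, indent=2)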
The researchers behind S1 aimed to identify the simplest approach to achieving strong reasoning performance and “test-time scaling”, which allows an AI model to deliberate more before generating a response.
“Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance,” the researchers said in a paper published last week (via TechCrunch), adding that they sought the simplest approach to achieve test-time scaling and strong reasoning performance.
The S1 research suggests that reasoning models can be distilled effectively using a relatively small dataset and a process called supervised fine-tuning (SFT), in which an AI model is explicitly trained to mimic specific behaviours within the dataset.
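To make the SFT step concrete, here is a hedged sketch of fine-tuning a small open model on question/reasoning-trace pairs with the Hugging Face transformers library. The base model name, data file, prompt template and hyperparameters are placeholders, not the paper’s exact configuration.

# Minimal sketch of supervised fine-tuning (SFT) on distilled reasoning traces.
# Assumes a JSON file of {"question": ..., "trace": ...} records; names and
# hyperparameters are placeholders, not the S1 authors' setup.
import json
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

class TraceDataset(Dataset):
    def __init__(self, path, tokenizer, max_len=1024):
        with open(path) as f:
            self.records = json.load(f)
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        r = self.records[idx]
        # Concatenate question and reasoning trace into one training sequence.
        text = f"Question: {r['question']}\nThinking: {r['trace']}"
        enc = self.tokenizer(text, truncation=True, max_length=self.max_len,
                             padding="max_length", return_tensors="pt")
        ids = enc["input_ids"].squeeze(0)
        return {"input_ids": ids,
                "attention_mask": enc["attention_mask"].squeeze(0),
                "labels": ids.clone()}

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")  # placeholder base model
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

loader = DataLoader(TraceDataset("distillation_traces.json", tokenizer),
                    batch_size=2, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for epoch in range(1):                  # a single pass, for illustration only
    for batch in loader:
        loss = model(**batch).loss      # standard next-token cross-entropy
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()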
“First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces, relying on three criteria we validate through ablation: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending ‘Wait’ multiple times when the model tries to end its generation. This can lead the model to double-check its answer, often fixing incorrect reasoning steps,” the researchers said.
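A hedged sketch of the budget-forcing idea described in that quote appears below: the generation loop either lets the model stop or, for a set number of rounds, strips its end-of-thinking marker and appends “Wait” to push it to keep deliberating. The delimiter string, model name, budgets and function name are illustrative assumptions, not the paper’s implementation.

# Illustrative sketch of budget forcing at inference time (not the authors' code).
# Assumes the model closes its thinking phase with a delimiter string; the
# delimiter, model id and budget values are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

END_OF_THINKING = "</think>"   # assumed delimiter between reasoning and answer
WAIT_TOKEN = "Wait"            # appended to force additional deliberation
MIN_EXTENSIONS = 2             # how many times to force extra thinking
MAX_NEW_TOKENS = 512           # cap on tokens generated per round

tokenizer = AutoTokenizer.from_pretrained("simplescaling/s1-32B")  # placeholder id
model = AutoModelForCausalLM.from_pretrained("simplescaling/s1-32B")

def generate_with_budget(prompt: str) -> str:
    text = prompt
    extensions = 0
    while True:
        inputs = tokenizer(text, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS)
        text = tokenizer.decode(out[0], skip_special_tokens=True)
        # If the model tries to close its thinking phase too early, remove the
        # delimiter and append "Wait" so it re-examines its reasoning.
        if END_OF_THINKING in text and extensions < MIN_EXTENSIONS:
            text = text.split(END_OF_THINKING)[0] + WAIT_TOKEN
            extensions += 1
            continue
        return text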