
Google has open-sourced the synthetic data engine Simula, which can generate specialized training datasets from scratch without real data.

According to monitoring by Perceiving AI, the Google research team has published a paper introducing Simula, a framework that elevates synthetic data generation from crafting one data point at a time to designing entire datasets. The paper was published in the journal Transactions on Machine Learning Research. Simula is widely deployed inside Google and serves as the primary data source for specialized models in the Gemma series, such as ShieldGemma (safety filtering), MedGemma (medical), and FunctionGemma (function calling). It also provides training data for Gemini safety classifiers, Android call-fraud detection, and Google Messages spam filtering.

Existing synthetic data methods mostly optimize one data point at a time, relying on manual prompts or real data as seeds, and cannot precisely control a dataset's overall coverage, diversity distribution, and quality. Simula requires no seed data; instead, a reasoning model constructs the entire dataset from scratch in four independently controlled steps:

1. Global Diversity: The reasoning model recursively decomposes the target domain into a hierarchical knowledge tree (e.g., a complete taxonomy of network security threats) to ensure coverage of tail scenarios.
2. Local Diversity: It generates various scenarios and representations at each knowledge node to prevent monotony in expressing the same concept.
3. Complexification: It can selectively increase the difficulty of some scenarios to independently adjust the difficulty distribution of the dataset.
4. Quality Control: A dual-reviewer mechanism independently assesses the correctness of each data point to counteract the model's bias towards seemingly plausible answers.
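Simula's internals are not public, but the four stages above can be sketched roughly as follows. Everything here is an illustrative assumption: `teacher` is a stub standing in for a call to a reasoning model such as Gemini 2.5 Flash, and all function names are hypothetical.

```python
import random

def teacher(prompt: str) -> str:
    """Stub LLM call; a real system would query a reasoning model."""
    return f"response to: {prompt}"

# 1. Global diversity: recursively expand the domain into a knowledge tree.
#    (A real system would ask the teacher for subtopics; here we stub them.)
def build_knowledge_tree(topic: str, depth: int = 2, fanout: int = 3) -> dict:
    if depth == 0:
        return {}
    subtopics = [f"{topic}/sub{i}" for i in range(fanout)]
    return {t: build_knowledge_tree(t, depth - 1, fanout) for t in subtopics}

def leaves(tree: dict) -> list:
    out = []
    for node, children in tree.items():
        out.extend(leaves(children) if children else [node])
    return out

# 2. Local diversity: several distinct scenarios per knowledge node.
def make_scenarios(node: str, n: int = 2) -> list:
    return [f"scenario {i} about {node}" for i in range(n)]

# 3. Complexification: selectively harden a fraction of the scenarios,
#    so the difficulty distribution can be tuned independently.
def complexify(scenarios: list, fraction: float, rng: random.Random) -> list:
    return [f"HARD: {s}" if rng.random() < fraction else s for s in scenarios]

# 4. Quality control: two independent reviewers must both accept a point.
def dual_review(example: str) -> bool:
    review_a = "reject" not in teacher(f"review A: {example}")
    review_b = "reject" not in teacher(f"review B: {example}")
    return review_a and review_b

def generate_dataset(domain: str, hard_fraction: float = 0.3, seed: int = 0) -> list:
    rng = random.Random(seed)
    dataset = []
    for node in leaves(build_knowledge_tree(domain)):
        for scenario in complexify(make_scenarios(node), hard_fraction, rng):
            answer = teacher(f"solve: {scenario}")
            if dual_review(f"{scenario} -> {answer}"):
                dataset.append({"input": scenario, "target": answer})
    return dataset
```

The point of the sketch is the separation of concerns: coverage comes from the tree, variety from per-node scenarios, difficulty from the hardening fraction, and quality from the dual review, so each knob can be turned without touching the others.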

The research team used Gemini 2.5 Flash as the teacher model and Gemma-3 4B as the student model, generating and testing up to 512,000 data points across domains including network security, legal reasoning, grade-school mathematics (GSM8k), and multilingual academic knowledge (Global MMLU). The complete Simula pipeline outperformed simplified variants in every domain, but no single configuration was universally best: high-difficulty data improved accuracy on mathematical reasoning by 10% yet hurt legal reasoning, because the teacher model was weaker in that domain, making its high-difficulty data unreliable. The most important finding was that Simula reached higher downstream performance with less data, underscoring that data quality, not quantity, drives model improvement.
