
Google has open-sourced the synthetic data engine Simula, which can generate specialized training datasets from scratch without real data.

According to monitoring by Perceiving AI, the Google research team has published a paper introducing Simula, a framework that elevates synthetic data generation from crafting one data point at a time to designing entire datasets. The paper was published in the journal Transactions on Machine Learning Research. Simula is widely deployed inside Google and serves as the primary data source for specialized models in the Gemma series, such as ShieldGemma (safety filtering), MedGemma (medical), and FunctionGemma (function calling). It also provides training data for Gemini safety classifiers, Android call-fraud detection, and Google Messages spam filtering.

Existing synthetic data methods mostly optimize one data point at a time, relying on manual prompts or real data as seeds, and cannot precisely control a dataset's overall coverage, diversity distribution, and quality. Simula requires no seed data; instead, a reasoning model constructs the entire dataset from scratch in four independently controlled steps:

1. Global Diversity: The reasoning model recursively decomposes the target domain into a hierarchical knowledge tree (e.g., a complete taxonomy of network security threats) to ensure coverage of tail scenarios.
2. Local Diversity: It generates various scenarios and representations at each knowledge node to prevent monotony in expressing the same concept.
3. Complexification: It can selectively increase the difficulty of some scenarios to independently adjust the difficulty distribution of the dataset.
4. Quality Control: A dual-reviewer mechanism independently assesses the correctness of each data point to counteract the model's bias towards seemingly plausible answers.
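Simula's internals are not public, but the four stages above can be sketched roughly as follows. Everything here is an illustrative assumption: `teacher` is a stub standing in for a call to a reasoning model such as Gemini 2.5 Flash, and all function names are hypothetical.

```python
import random

def teacher(prompt: str) -> str:
    """Stub LLM call; a real system would query a reasoning model."""
    return f"response to: {prompt}"

# 1. Global diversity: recursively expand the domain into a knowledge tree.
#    (A real system would ask the teacher for subtopics; here we stub them.)
def build_knowledge_tree(topic: str, depth: int = 2, fanout: int = 3) -> dict:
    if depth == 0:
        return {}
    subtopics = [f"{topic}/sub{i}" for i in range(fanout)]
    return {t: build_knowledge_tree(t, depth - 1, fanout) for t in subtopics}

def leaves(tree: dict) -> list:
    out = []
    for node, children in tree.items():
        out.extend(leaves(children) if children else [node])
    return out

# 2. Local diversity: several distinct scenarios per knowledge node.
def make_scenarios(node: str, n: int = 2) -> list:
    return [f"scenario {i} about {node}" for i in range(n)]

# 3. Complexification: selectively harden a fraction of the scenarios,
#    so the difficulty distribution can be tuned independently.
def complexify(scenarios: list, fraction: float, rng: random.Random) -> list:
    return [f"HARD: {s}" if rng.random() < fraction else s for s in scenarios]

# 4. Quality control: two independent reviewers must both accept a point.
def dual_review(example: str) -> bool:
    review_a = "reject" not in teacher(f"review A: {example}")
    review_b = "reject" not in teacher(f"review B: {example}")
    return review_a and review_b

def generate_dataset(domain: str, hard_fraction: float = 0.3, seed: int = 0) -> list:
    rng = random.Random(seed)
    dataset = []
    for node in leaves(build_knowledge_tree(domain)):
        for scenario in complexify(make_scenarios(node), hard_fraction, rng):
            answer = teacher(f"solve: {scenario}")
            if dual_review(f"{scenario} -> {answer}"):
                dataset.append({"input": scenario, "target": answer})
    return dataset
```

The point of the sketch is the separation of concerns: coverage comes from the tree, variety from per-node scenarios, difficulty from the hardening fraction, and quality from the dual review, so each knob can be turned without touching the others.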

The research team used Gemini 2.5 Flash as the teacher model and Gemma-3 4B as the student model, generating and testing up to 512,000 data points across domains including network security, legal reasoning, grade-school mathematics (GSM8k), and multilingual academic knowledge (Global MMLU). The complete Simula pipeline outperformed simplified variants in every domain, but no single configuration was universally best: high-difficulty data improved accuracy on mathematical reasoning by 10% yet hurt legal reasoning, because the teacher model was weaker in that domain, making its high-difficulty data unreliable. The most important finding was that Simula reached higher downstream performance with less data, underscoring that data quality, not quantity, drives model improvement.
