According to 1M AI News, on March 4th Google released a preview of Gemini 3.1 Flash-Lite, positioned as the fastest and most cost-effective model in the Gemini 3 series. The model is based on the Gemini 3 Pro architecture and uses a Mixture of Experts (MoE) design, activating only a subset of its parameters per token to cut inference cost. API pricing is $0.25 per million input tokens and $1.50 per million output tokens, roughly one-eighth the price of Gemini 3.1 Pro ($2/$18).
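The "roughly one-eighth" figure depends on the input/output mix of a request. A minimal sketch of the arithmetic at the quoted rates (the 10k-in/1k-out request size is illustrative, not from the announcement):

```python
# Per-million-token rates quoted in the announcement (USD).
FLASH_LITE = {"input": 0.25, "output": 1.50}
PRO = {"input": 2.00, "output": 18.00}

def request_cost(rates, input_tokens, output_tokens):
    """Cost in USD for a single request at the given per-million-token rates."""
    return (rates["input"] * input_tokens
            + rates["output"] * output_tokens) / 1_000_000

# Illustrative request: a 10k-token prompt with a 1k-token reply.
lite = request_cost(FLASH_LITE, 10_000, 1_000)
pro = request_cost(PRO, 10_000, 1_000)
print(f"Flash-Lite: ${lite:.6f}  Pro: ${pro:.6f}  ratio: {pro / lite:.1f}x")
# → Flash-Lite: $0.004000  Pro: $0.038000  ratio: 9.5x
```

Output-heavy workloads skew the ratio higher (output is 12x cheaper vs. 8x for input), so the effective savings vary by workload.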
On performance, time to first token is 2.5x lower than Gemini 2.5 Flash, and output speed is 45% higher, reaching 363 tokens per second. The model supports up to 1 million input tokens and 64,000 output tokens, and accepts text, image, audio, and video input. Across 11 internal benchmarks, Flash-Lite beat GPT-5 mini and Claude 4.5 Haiku in 6 categories, scoring 86.9% on GPQA Diamond (PhD-level science QA), 76.8% on MMMU-Pro (multimodal reasoning), and 72.0% on LiveCodeBench (code generation).
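A quick back-of-the-envelope check on the quoted throughput figures, assuming "45% higher" means 1.45x the output speed of Gemini 2.5 Flash:

```python
# Throughput figure from the announcement.
FLASH_LITE_TPS = 363  # tokens per second

# Implied Gemini 2.5 Flash speed, if Flash-Lite is 45% faster.
baseline_tps = FLASH_LITE_TPS / 1.45

# Worst case: generating the full 64k-token output limit.
max_output_seconds = 64_000 / FLASH_LITE_TPS

print(f"Implied 2.5 Flash speed: ~{baseline_tps:.0f} tok/s")   # ~250 tok/s
print(f"Full 64k-token output:  ~{max_output_seconds:.0f} s")  # ~176 s
```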
The model ships with adjustable "thinking levels", letting developers control its reasoning depth in AI Studio and Vertex AI to balance quality against cost in high-throughput scenarios. It is currently available in preview through the Gemini API (Google AI Studio) and Vertex AI.
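A sketch of what a thinking-level request payload might look like for the Gemini API's generateContent endpoint. The model id (`gemini-3.1-flash-lite-preview`), the `thinkingLevel` field name, and the set of level values are assumptions based on the preview announcement, not confirmed API names:

```python
import json

def build_request(prompt: str, thinking_level: str) -> dict:
    """Build a hypothetical generateContent payload with a thinking level.

    The level values ("low", "medium", "high") are assumed for illustration.
    """
    if thinking_level not in ("low", "medium", "high"):
        raise ValueError(f"unknown thinking level: {thinking_level!r}")
    return {
        "model": "gemini-3.1-flash-lite-preview",  # assumed preview model id
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        "generationConfig": {
            # Lower levels trade reasoning depth for latency and cost,
            # which is the fit for high-throughput scenarios.
            "thinkingConfig": {"thinkingLevel": thinking_level},
        },
    }

payload = build_request("Summarize this log file.", "low")
print(json.dumps(payload, indent=2))
```

In a high-throughput pipeline one would pick "low" for routine requests and reserve deeper levels for the few that need them; consult the official Gemini API reference for the actual field names once the preview documentation is published.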
