According to Insight Beating's monitoring, the Perplexity research team has published a technical article disclosing the post-training process behind its web search agent. The process builds on the open-source models Qwen3.5-122B-A10B and Qwen3.5-397B-A17B and follows a two-stage approach: supervised fine-tuning (SFT) first establishes essential deployment behaviors such as instruction following and language consistency, and on-policy reinforcement learning (RL) then optimizes search accuracy and tool-use efficiency.
The RL stage uses the GRPO algorithm, with training data drawn from two sources. The first is an in-house synthetic, verifiable multi-hop question-answering dataset: starting from internal seed queries, questions requiring 2 to 4 hops of reasoning are constructed through entity chaining, and answer uniqueness is validated by multiple independent solvers. The second is a rubric-based general dialogue dataset that converts deployment requirements such as instruction following and format constraints into objectively verifiable atomic conditions; it is included during RL to prevent degradation of the behaviors established by SFT.
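The defining feature of GRPO is that each prompt is sampled several times and every rollout's reward is normalized against its own sampling group, replacing a learned value baseline. The article does not publish Perplexity's implementation; the sketch below only illustrates that group-relative normalization step, with all function names chosen for illustration.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize each rollout's
    reward by the mean and standard deviation of its sampling group,
    so no separate value network is needed as a baseline."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All rollouts scored identically: no learning signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# Rewards for, say, 4 rollouts of one multi-hop question:
advantages = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Rollouts that beat their group mean get positive advantages and are reinforced; the rest are pushed down, which is what lets verifiable pass/fail rewards drive policy updates.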
The core of the reward design is gated aggregation: preference scores enter the calculation only when the correctness baseline is met (the answer is accurate, or all rubric criteria are satisfied), which prevents high preference signals from masking factual errors. Efficiency penalties are applied through intra-group anchoring: using the correct answers within the same sampling group as the benchmark, excessive tool calls and generation length incur a smooth penalty.
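The gating and anchoring described above can be sketched as follows. This is a minimal illustration, not Perplexity's actual reward function: the preference weight, penalty scale, and the choice of the group's cheapest correct rollout as the anchor are all assumptions made for the example.

```python
def gated_reward(correct, preference, tool_calls, anchor_tool_calls,
                 pref_weight=0.3, penalty_per_call=0.05):
    """Illustrative gated aggregation (weights are hypothetical).

    correct:           did the answer pass the verifiable check / rubric gate
    preference:        preference score in [0, 1]
    tool_calls:        tool invocations used by this rollout
    anchor_tool_calls: tool invocations of the cheapest correct rollout
                       in the same sampling group (intra-group anchor)
    """
    if not correct:
        # Gate: preference never contributes to an incorrect rollout,
        # so a fluent but wrong answer cannot earn reward.
        return 0.0
    # Smooth efficiency penalty, anchored within the group: only the
    # calls *beyond* what a correct peer needed are penalized.
    excess = max(0, tool_calls - anchor_tool_calls)
    penalty = penalty_per_call * excess
    return max(0.0, 1.0 + pref_weight * preference - penalty)
```

Because the anchor comes from correct rollouts in the same group, a question that genuinely needs four hops is not penalized for four tool calls; only wasteful calls relative to a correct peer lose reward.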
Evaluation shows that the post-trained Qwen3.5-397B-SFT-RL performs best across multiple search benchmarks. On FRAMES, it reaches 57.3% with a single tool invocation, 5.7 percentage points above GPT-5.4 and 4.7 points above Sonnet 4.6. Under a moderate budget of 4 tool invocations it reaches 73.9% at a cost of 2.0 cents per query; under the same conditions, GPT-5.4 achieves 67.8% at 8.5 cents and Sonnet 4.6 achieves 62.4% at 15.3 cents. Cost figures are computed from each vendor's publicly available API pricing and exclude cache optimization.
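The accuracy and cost figures quoted above can be combined into a simple derived comparison. "Cents per correct answer" is not a metric reported in the article; it is computed here purely to illustrate the cost gap at the 4-invocation budget.

```python
# (FRAMES accuracy at 4 tool invocations, cents per query), as quoted above.
results = {
    "Qwen3.5-397B-SFT-RL": (0.739, 2.0),
    "GPT-5.4": (0.678, 8.5),
    "Sonnet 4.6": (0.624, 15.3),
}

# Derived metric for illustration: expected cost per correct answer.
for model, (accuracy, cents) in results.items():
    print(f"{model}: {cents / accuracy:.1f} cents per correct answer")
```

On these numbers the post-trained model is roughly four to nine times cheaper per correct answer than the two commercial baselines.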
