Microsoft Open-Sources Phi-Ground: 4-Billion-Parameter GUI Grounding Model Beats Operator and Claude on Click Accuracy

According to Dongcha Beating monitoring, Microsoft has open-sourced the Phi-Ground model family, which tackles a core problem in AI-driven computer control: deciding where on the screen to click. Given a screenshot and an instruction, the model outputs precise click coordinates. The open-sourced 4-billion-parameter version, paired with a larger model for instruction planning, achieved higher click accuracy than OpenAI Operator and Claude Computer Use on the Showdown benchmark, and took first place among models under ten billion parameters across all five evaluations, including ScreenSpot-Pro.

The team ran large-scale validation on over 40 million training examples and found that three training techniques common in earlier academic papers all failed once data volume grew. What actually works is quite simple: treat coordinates as ordinary numeric text, such as "523, 417." Several earlier papers had invented dedicated spatial vocabularies for coordinates, hoping the model would learn to "speak" coordinates like words, but at scale those new tokens were never learned well and training collapsed instead. A second key finding was to place the textual instruction before the image. Because large language models process their input unidirectionally, reading "click the blue settings icon" before seeing the image means the model already knows what to look for as it processes the pixels; if the image comes first, the model can only scan blindly, and accuracy drops sharply.
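The two data-format choices above can be sketched as a small preprocessing routine. This is an illustrative reconstruction, not the actual Phi-Ground code; the function names and the `<image>` placeholder token are assumptions for the sketch.

```python
# Hypothetical sketch of the data format described above.
# Assumptions: "<image>" marks where the screenshot's visual tokens are
# spliced in, and the target is plain decimal digits (no special vocabulary).

def build_prompt(instruction: str, image_token: str = "<image>") -> str:
    # Instruction-first ordering: the model reads what to look for
    # before it processes any pixels.
    return f"Instruction: {instruction}\n{image_token}\nAnswer: "

def encode_target(x: int, y: int) -> str:
    # Coordinates as ordinary numeric text, e.g. "523, 417".
    return f"{x}, {y}"

prompt = build_prompt("click the blue settings icon")
target = encode_target(523, 417)
```

The point of the sketch is the ordering and the target encoding: nothing here introduces new tokens, so the model's existing number tokens carry the coordinate signal.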

The team also found that reinforcement learning helps on purely visual tasks. The approach has the model predict multiple clicks on the same image, then trains on pairs of correct and incorrect clicks using Direct Preference Optimization (DPO), a preference-based method the article groups under reinforcement learning. Even after thorough fine-tuning, this step significantly improves accuracy. Reinforcement learning had previously been applied mainly to language tasks that require reasoning; that it also works on a pure perceptual "point at the target" task was an unexpected bonus. To handle buttons that are tiny on 4K screens (a button may occupy only 0.07% of the screen area), the team scaled screenshots down proportionally during training and pasted them onto a large white canvas, simulating extremely small elements at high resolution. This technique proved especially effective in complex professional software such as Photoshop.
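The white-canvas augmentation amounts to a simple coordinate transform: shrink the screenshot, paste it onto a larger blank canvas, and remap the click target accordingly. A minimal sketch of the geometry, assuming a centered paste and a 4K-sized canvas (the offsets, canvas size, and scale factor are illustrative, not the team's actual parameters):

```python
# Illustrative sketch of the white-canvas augmentation described above,
# not the actual Phi-Ground pipeline. Only the coordinate math is shown;
# in practice the image itself would be resized and pasted the same way.

def canvas_augment(img_w: int, img_h: int, x: int, y: int,
                   canvas_w: int = 3840, canvas_h: int = 2160,
                   scale: float = 0.5) -> tuple[int, int]:
    """Remap a click target (x, y) after scaling the img_w x img_h
    screenshot by `scale` and centering it on a blank canvas."""
    new_w, new_h = int(img_w * scale), int(img_h * scale)
    off_x = (canvas_w - new_w) // 2  # paste offset: centered horizontally
    off_y = (canvas_h - new_h) // 2  # paste offset: centered vertically
    return off_x + int(x * scale), off_y + int(y * scale)

# A button at (523, 417) in a 1920x1080 screenshot now covers a far
# smaller fraction of the frame, mimicking a tiny element on a 4K screen.
nx, ny = canvas_augment(1920, 1080, 523, 417)  # → (1701, 1018)
```

Because the pasted screenshot occupies only a quarter of the canvas area at `scale=0.5`, every UI element's relative size shrinks by the same factor, which is exactly the condition the team wanted the model to handle.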
