Microsoft Open-Sources Phi-Ground: 4-Billion-Parameter GUI Grounding Model Beats Operator and Claude on Click Accuracy

According to Dongcha Beating monitoring, Microsoft has open-sourced the Phi-Ground model family, which tackles a core problem in AI-driven computer control: deciding where on the screen to click. Given a screenshot and an instruction, the model outputs precise click coordinates. The open-sourced 4-billion-parameter version, paired with a larger model for instruction planning, achieved higher click accuracy than OpenAI Operator and Claude Computer Use on the Showdown benchmark, and took first place among models under ten billion parameters across all five evaluations, including ScreenSpot-Pro.

The team ran large-scale validation on over 40 million training examples and found that three training techniques common in earlier academic papers all failed once data volume grew. What actually works is quite simple: treat coordinates as ordinary numeric text, such as "523, 417." Several earlier papers had invented dedicated spatial vocabularies for coordinates, hoping the model would learn to "speak" coordinates like words, but at scale those new tokens were never learned well and training collapsed instead. A second key finding was to place the textual instruction before the image. Because large language models process their input unidirectionally, reading "click the blue settings icon" before seeing the image means the model already knows what to look for as it processes the pixels; if the image comes first, the model can only scan blindly, and accuracy drops sharply.
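The two data-format choices above can be sketched as a small preprocessing routine. This is an illustrative reconstruction, not the actual Phi-Ground code; the function names and the `<image>` placeholder token are assumptions for the sketch.

```python
# Hypothetical sketch of the data format described above.
# Assumptions: "<image>" marks where the screenshot's visual tokens are
# spliced in, and the target is plain decimal digits (no special vocabulary).

def build_prompt(instruction: str, image_token: str = "<image>") -> str:
    # Instruction-first ordering: the model reads what to look for
    # before it processes any pixels.
    return f"Instruction: {instruction}\n{image_token}\nAnswer: "

def encode_target(x: int, y: int) -> str:
    # Coordinates as ordinary numeric text, e.g. "523, 417".
    return f"{x}, {y}"

prompt = build_prompt("click the blue settings icon")
target = encode_target(523, 417)
```

The point of the sketch is the ordering and the target encoding: nothing here introduces new tokens, so the model's existing number tokens carry the coordinate signal.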

The team also found that reinforcement learning helps on purely visual tasks. The approach has the model predict multiple clicks on the same image, then trains on pairs of correct and incorrect clicks using Direct Preference Optimization (DPO), a preference-based method the article groups under reinforcement learning. Even after thorough fine-tuning, this step significantly improves accuracy. Reinforcement learning had previously been applied mainly to language tasks that require reasoning; that it also works on a pure perceptual "point at the target" task was an unexpected bonus. To handle buttons that are tiny on 4K screens (a button may occupy only 0.07% of the screen area), the team scaled screenshots down proportionally during training and pasted them onto a large white canvas, simulating extremely small elements at high resolution. This technique proved especially effective in complex professional software such as Photoshop.
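The white-canvas augmentation amounts to a simple coordinate transform: shrink the screenshot, paste it onto a larger blank canvas, and remap the click target accordingly. A minimal sketch of the geometry, assuming a centered paste and a 4K-sized canvas (the offsets, canvas size, and scale factor are illustrative, not the team's actual parameters):

```python
# Illustrative sketch of the white-canvas augmentation described above,
# not the actual Phi-Ground pipeline. Only the coordinate math is shown;
# in practice the image itself would be resized and pasted the same way.

def canvas_augment(img_w: int, img_h: int, x: int, y: int,
                   canvas_w: int = 3840, canvas_h: int = 2160,
                   scale: float = 0.5) -> tuple[int, int]:
    """Remap a click target (x, y) after scaling the img_w x img_h
    screenshot by `scale` and centering it on a blank canvas."""
    new_w, new_h = int(img_w * scale), int(img_h * scale)
    off_x = (canvas_w - new_w) // 2  # paste offset: centered horizontally
    off_y = (canvas_h - new_h) // 2  # paste offset: centered vertically
    return off_x + int(x * scale), off_y + int(y * scale)

# A button at (523, 417) in a 1920x1080 screenshot now covers a far
# smaller fraction of the frame, mimicking a tiny element on a 4K screen.
nx, ny = canvas_augment(1920, 1080, 523, 417)  # → (1701, 1018)
```

Because the pasted screenshot occupies only a quarter of the canvas area at `scale=0.5`, every UI element's relative size shrinks by the same factor, which is exactly the condition the team wanted the model to handle.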
