According to Dynamic Beating monitoring, Logan Kilpatrick, Senior Product Manager at Google DeepMind and Product Lead for Google AI Studio, said in a post on X that every company building AI-based products should establish its own benchmark (a standardized test set used to measure AI model performance). He described this as a way to make model progress "disproportionately benefit your company," and advised founders and business owners to "start tomorrow."
Currently, most companies choose AI models based on public leaderboards, but those leaderboards measure general capabilities and are often disconnected from specific business scenarios. For example, a company specializing in contract review cares most about clause-extraction accuracy, yet public benchmarks include no such test, so they cannot reveal how a model performs on that task. The benefits of a custom benchmark are twofold: first, each model update can be evaluated on your own business tasks, identifying the model that performs best in your specific scenario rather than the one ranked highest publicly; second, the test data can be shared with model providers to drive continuous optimization in the directions that matter to you.
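In practice, a custom benchmark can be as simple as a scored test set run against each candidate model. The sketch below illustrates the idea for the contract-review example; the test cases, stub "models," and all names are hypothetical stand-ins (a real harness would call actual model APIs):

```python
# Minimal sketch of an in-house benchmark harness.
# All data and model functions here are hypothetical placeholders;
# in practice, each "model" would wrap a real API call.

# A tiny clause-extraction test set: each case pairs a contract
# snippet with the clause a correct answer should contain.
TEST_SET = [
    {"text": "Either party may terminate with 30 days notice.",
     "expected": "terminate with 30 days notice"},
    {"text": "Payment is due within 45 days of invoice.",
     "expected": "due within 45 days of invoice"},
]

def evaluate(model, test_set):
    """Return the fraction of cases the model answers correctly."""
    correct = sum(
        1 for case in test_set
        if case["expected"] in model(case["text"])
    )
    return correct / len(test_set)

# Stub models standing in for real API calls.
def model_a(text):
    return text  # echoes the input, so the expected clause appears

def model_b(text):
    return "no clause found"  # never contains the expected clause

if __name__ == "__main__":
    scores = {name: evaluate(fn, TEST_SET)
              for name, fn in [("model_a", model_a),
                               ("model_b", model_b)]}
    # Pick whichever model scores best on *your* tasks,
    # not the one topping a public leaderboard.
    best = max(scores, key=scores.get)
    print(best, scores)
```

Re-running this harness on every model update (or every candidate model) yields exactly the scenario-specific ranking the article describes, and the test set itself is the artifact that can be shared with model providers.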
Kilpatrick noted that companies like Zapier and Sierra are already taking this approach, saying "there is a significant alpha (excess return) that can be created here."
