- Many AI pilots fail in real-world operations, and 95% of generative AI pilots never reach production, Salesforce says
- CRMArena-Pro lets companies stress test their AI agents with digital twins
- Two new benchmarks are also available for stress testing AI agents
Salesforce says companies are struggling with AI pilots that fail in real-world operations, and it has launched CRMArena-Pro, a new service that lets companies create a digital twin of their operations to stress test agents before deployment.
The company cited recent MIT research which found that 95% of generative AI pilots never reach the production stage.
CRMArena-Pro evaluates AI agents on realistic tasks, such as customer service, sales forecasting and supply chain disruptions, using synthetic data validated by experts.
Salesforce allows you to stress test AI agents using digital twins
In its announcement, the company said CRMArena-Pro creates a rigorous, contextual business environment built on synthetic data, in which it can safely assess API calls to relevant systems as well as an agent's ability to protect personally identifiable information (PII).
By adding real-world noise to the test environment, CRMArena-Pro can better assess performance, strengthen resilience and bridge the gap between pre-deployment and post-deployment behavior.
The result, according to Salesforce, is AI agents that are capable, consistent, trustworthy and fit for enterprise use.
Companies can also see how AI agents handle real-world challenges such as messy data, legacy systems and complex workflows.
Salesforce noted that part of the complexity comes from the wide range of models available today, and that knowing which specific model or combination of models to use is not straightforward.
To that end, the company has published two new benchmarks to measure agent performance: MCP-Eval, which evaluates agents through synthetic tasks, and MCP-Universe, which adds real tasks and execution-based evaluators to stress test agents in complex scenarios.
In an earlier article, Salesforce noted that CRMArena-Pro “lays the foundations for the next frontier: Enterprise General Intelligence”. For the moment, users can expect AI that is “safe, capable and impactful” for all organizations.