- Robots still fail quickly once removed from predictable factory environments
- Microsoft Rho-alpha links language understanding directly to robotic movement control
- Touch sensing is key to bridging the gap between software and physical action
Robots have long operated reliably in tightly controlled industrial settings, where conditions are predictable and deviations are limited, but outside of those settings they often struggle.
To address this problem, Microsoft announced Rho-alpha, the first robotics model derived from its Phi series of vision-language models, arguing that robots need better ways to see and to understand instructions.
The company believes such systems can operate beyond the assembly line by responding to changing conditions rather than following rigid scripts.
What Rho-alpha is designed for
Microsoft relates this to what is widely called physical AI, where software models are meant to guide machines in less structured situations.
It combines language, perception and action, reducing dependence on fixed production lines and rigid, pre-scripted instructions.
Rho-alpha translates natural language commands into robotic control signals and focuses on bimanual manipulation tasks, which require coordination between two robotic arms and precise control.
Microsoft describes the system as an extension of typical VLA approaches by expanding both perception and learning inputs.
“The emergence of vision-language-action (VLA) models for physical systems enables systems to perceive, reason, and act with increasing autonomy alongside humans in much less structured environments,” said Ashley Llorens, corporate vice president and general manager of Microsoft Research Accelerator.
Rho-alpha includes tactile sensing alongside vision, and additional sensing modalities such as force are still in development.
These design choices suggest an attempt to narrow the gap between simulated intelligence and physical interaction, although their effectiveness remains under evaluation.
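Microsoft has not published Rho-alpha's interface, so the sketch below is purely illustrative: it shows the general shape of a vision-language-action policy that also consumes tactile readings and emits joint commands for two arms, as described above. All class names, array shapes and the placeholder behavior are assumptions, not Microsoft's API.

```python
# Hypothetical sketch of a VLA-style policy interface with tactile input and
# bimanual output. Nothing here reflects Rho-alpha's actual implementation.
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    instruction: str      # natural-language command, e.g. "fold the towel"
    rgb: np.ndarray       # camera image, H x W x 3
    tactile: np.ndarray   # fingertip pressure readings (assumed layout)

@dataclass
class Action:
    left_arm: np.ndarray   # joint targets for the left arm
    right_arm: np.ndarray  # joint targets for the right arm
    grippers: np.ndarray   # open/close commands for both grippers

class VlaPolicy:
    """Stand-in for a trained model mapping (language, vision, touch) -> control."""
    def act(self, obs: Observation) -> Action:
        # A real model would run a network over tokenized text, image patches and
        # tactile features; this placeholder simply returns a zero action.
        return Action(np.zeros(7), np.zeros(7), np.zeros(2))

policy = VlaPolicy()
obs = Observation("hand me the screwdriver", np.zeros((224, 224, 3)), np.zeros(10))
print(policy.act(obs))
```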
A central part of Microsoft’s approach relies on simulation to generate robotic training data at scale, particularly data involving touch, which is otherwise in short supply.
Synthetic trajectories are generated by reinforcement learning within Nvidia Isaac Sim, then combined with physical demonstrations from commercial and open datasets.
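As a rough illustration of that mixing step, the snippet below pools synthetic trajectories (standing in for data exported from a simulator such as Isaac Sim) with real demonstrations and samples mixed training batches. The loader functions and the sampling ratio are assumptions made for the example, not details Microsoft has disclosed.

```python
# Hedged sketch: combine plentiful synthetic trajectories with scarce real
# demonstrations so the real data is not drowned out during training.
import random

def load_synthetic_trajectories():
    # Placeholder for trajectories produced by reinforcement learning in simulation.
    return [{"source": "sim", "id": i} for i in range(1000)]

def load_real_demonstrations():
    # Placeholder for demonstrations from commercial and open datasets.
    return [{"source": "real", "id": i} for i in range(100)]

def sample_training_batch(sim_data, real_data, batch_size=32, real_fraction=0.3):
    """Draw a mixed batch with a fixed fraction of real-world demonstrations."""
    n_real = int(batch_size * real_fraction)
    batch = random.sample(real_data, min(n_real, len(real_data)))
    batch += random.sample(sim_data, batch_size - len(batch))
    random.shuffle(batch)
    return batch

batch = sample_training_batch(load_synthetic_trajectories(), load_real_demonstrations())
print(sum(t["source"] == "real" for t in batch), "real /", len(batch), "total")
```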
“Training basic models that can reason and act requires overcoming the scarcity of diverse real-world data,” said Deepu Talla, vice president of Robotics and Edge AI at Nvidia.
“By leveraging NVIDIA Isaac Sim on Azure to generate physically accurate synthetic datasets, Microsoft Research is accelerating the development of versatile models like Rho-alpha that can master complex manipulation tasks.”
Microsoft also emphasizes human corrective intervention during deployment, allowing operators to intervene using teleoperation devices and provide feedback that the system can learn from over time.
This training loop combines simulation, real-world data, and human correction, reflecting an increasing reliance on AI tools to compensate for scarce embodied datasets.
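The correction loop described above resembles interactive imitation learning in the style of DAgger, though that framing is an inference rather than something Microsoft has stated. The sketch below shows the general pattern with stub classes; every name and the 10% intervention rate are hypothetical.

```python
# Hypothetical correction loop: the policy acts, a teleoperator may override it,
# and only the human-corrected steps are added back as new training data.
import random

class StubPolicy:
    def act(self, obs):
        return "policy_action"

class StubRobot:
    def observe(self):
        return {"image": "...", "tactile": "..."}
    def execute(self, action):
        pass

class StubOperator:
    def maybe_override(self, obs, proposed):
        # Pretend the operator intervenes on roughly 10% of steps.
        return "human_action" if random.random() < 0.1 else None

def run_with_corrections(policy, robot, operator, dataset, steps=100):
    """Deploy the policy and log human overrides as new supervision."""
    for _ in range(steps):
        obs = robot.observe()
        proposed = policy.act(obs)
        correction = operator.maybe_override(obs, proposed)
        robot.execute(correction if correction is not None else proposed)
        if correction is not None:
            dataset.append((obs, correction))
    return dataset

data = run_with_corrections(StubPolicy(), StubRobot(), StubOperator(), [])
print(len(data), "corrected steps collected")
```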
Abhishek Gupta, an assistant professor at the University of Washington, said: “Although generating training data through teleoperation of robotic systems has become standard practice, there are many contexts in which teleoperation is impractical or even impossible.
“We are working with Microsoft Research to enrich pre-training datasets collected from physical robots with various synthetic demonstrations using a combination of simulation and reinforcement learning.”