
Foundation models to achieve Pixel-to-Control

By: Karthigeyan Ganesh Shankar & Srividya Prasad

Ever notice how we can stroll from the factory floor to the IT desk without whipping out Google Maps for every ten steps we take? Our brain quietly stitches together landmarks, memories, and a pinch of intuition, guiding us on autopilot to our destination. That’s muscle memory at work on home turf: our brain runs through a well-rehearsed routine.

Now imagine giving that same “street-smart” superpower to an industrial robot. That is exactly what we have been tinkering with, teaching our bots to read the factory floor plan the way humans do, so they move to their destinations without having to clutch a digital compass.

The idea behind foundation models is precisely this ingrained ability, recreated in a bot. Given that the bot is navigating a known environment (or even an unseen one), would it be able to move to a destination with reasonable control and precision?

Recently, there has been a lot of activity around LLMs (Large Language Models) and the more advanced VLMs (Vision Language Models), which have been trained by ingesting large amounts of data and provide reasonably good predictions. Intuition says we can train robots to recognize aisles, pallets, and loading bays the same way we spot the coffee machine on a Monday morning: fast, natural, and with zero second-guessing!

The idea is simple: can we apply the recipe that made large language models successful, transformers trained on large datasets, to robotics? Instead of separate algorithms for mapping, localization, and path planning, imagine an end-to-end model that learns navigation from robot datasets and generalizes to new, unseen environments.

This vision is compelling, but the challenges are immense. Unlike text, robotic navigation must account for diverse environments, sensor data types, intrinsic parameters, actuators, physical dimensions, locomotion types, and degrees of freedom; the data is deeply multimodal. There is a lot of research in this area, and all of us are working towards, and waiting for, the day these models are ready for real-world deployment, right out of the box.

We took this idea forward with two promising foundation models on our Sherpa-RP, a mobile robot platform designed and engineered by Ati Motors for research purposes, equipped with sensors and motors. We adopted a simulation-first approach, testing and fine-tuning the two models in Nvidia’s Isaac Sim environment.

Sherpa RP (Reality)
Sherpa RP (Simulation)
An overview of the ROS2 Omnigraph to run Sherpa RP in Isaac Sim
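
For readers curious how this loop is wired up in practice, here is a minimal sketch of the kind of ROS 2 node that closes the pixel-to-control loop: it subscribes to the simulated camera stream coming out of the Isaac Sim Omnigraph and publishes velocity commands back to the robot. The topic names and the `policy` callable are illustrative assumptions, not the exact interfaces of our stack.

```python
# Minimal sketch of a pixel-to-control bridge node (assumed topic names and policy API).
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from geometry_msgs.msg import Twist
from cv_bridge import CvBridge


class PixelToControlBridge(Node):
    """Subscribes to the simulated camera feed and publishes velocity commands."""

    def __init__(self, policy):
        super().__init__("pixel_to_control_bridge")
        self.policy = policy                      # any callable: RGB frame -> (v, w)
        self.bridge = CvBridge()
        self.latest_frame = None
        # Topic names below are assumptions for illustration.
        self.create_subscription(Image, "/front_camera/rgb", self.on_image, 10)
        self.cmd_pub = self.create_publisher(Twist, "/cmd_vel", 10)
        # Run the control loop at 10 Hz, the rate mentioned later in this post.
        self.create_timer(0.1, self.on_timer)

    def on_image(self, msg: Image):
        self.latest_frame = self.bridge.imgmsg_to_cv2(msg, desired_encoding="rgb8")

    def on_timer(self):
        if self.latest_frame is None:
            return
        v, w = self.policy(self.latest_frame)     # linear and angular velocity
        cmd = Twist()
        cmd.linear.x = float(v)
        cmd.angular.z = float(w)
        self.cmd_pub.publish(cmd)


def main():
    rclpy.init()
    # A trivial stand-in policy: drive slowly straight ahead.
    rclpy.spin(PixelToControlBridge(policy=lambda frame: (0.2, 0.0)))
    rclpy.shutdown()


if __name__ == "__main__":
    main()
```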

ViNT (Visual Navigation Transformer) first builds a topological map from images collected while teleoperating the robot, then uses transformer attention to navigate between the map’s nodes by generating waypoints.

Topomap
Path taken during topomap creation
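
To make the ViNT pipeline above a little more concrete, here is a simplified sketch of topological navigation: at each tick the robot compares its current camera view against nearby topomap nodes, picks the node the model believes is closest, and asks the model for waypoints towards that subgoal. The `vint_model.predict(observation, subgoal_image)` interface returning a temporal distance and waypoints is an assumption we make for illustration; the real system has more moving parts.

```python
# Simplified sketch of topological navigation with a ViNT-style model.
# `vint_model.predict(observation, subgoal) -> (temporal_distance, waypoints)`
# is an assumed interface; the real one differs in detail.
import numpy as np


def select_subgoal(vint_model, observation, topomap_images, last_node, window=3):
    """Pick the next subgoal node from a local window around the last known node."""
    candidates = range(last_node, min(last_node + window, len(topomap_images)))
    best_node, best_dist = last_node, float("inf")
    for idx in candidates:
        dist, _ = vint_model.predict(observation, topomap_images[idx])
        if dist < best_dist:
            best_node, best_dist = idx, dist
    return best_node


def navigation_step(vint_model, observation, topomap_images, last_node):
    """One navigation tick: choose a subgoal and get waypoints towards it."""
    subgoal = select_subgoal(vint_model, observation, topomap_images, last_node)
    _, waypoints = vint_model.predict(observation, topomap_images[subgoal])
    # `waypoints` is a short horizon of (x, y) offsets in the robot frame;
    # a downstream tracker turns the first waypoint into velocity commands.
    return subgoal, np.asarray(waypoints)
```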

NoMaD (Navigation with Goal Masked Diffusion), by contrast, uses a diffusion model to predict actions from random noise, relying on the visual understanding it acquired from its training data.
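
The core of a diffusion policy like this can be sketched in a few lines: starting from Gaussian noise, a network repeatedly predicts and removes noise, conditioned on the visual context (and on the goal image when one is provided), until a clean sequence of actions remains. The sketch below uses the standard DDPM scheduler from the diffusers library for illustration; the `noise_pred_net` interface is an assumption, and NoMaD’s actual goal masking and conditioning are more involved.

```python
# Sketch of the reverse-diffusion loop that turns noise into an action sequence.
# `noise_pred_net(actions, t, cond)` is an assumed interface for the policy network.
import torch
from diffusers import DDPMScheduler


def sample_actions(noise_pred_net, cond, horizon=8, action_dim=2, steps=10):
    """Denoise a random action sequence conditioned on the visual context `cond`."""
    scheduler = DDPMScheduler(num_train_timesteps=1000)
    scheduler.set_timesteps(steps)

    actions = torch.randn(1, horizon, action_dim)            # start from pure noise
    for t in scheduler.timesteps:
        noise_pred = noise_pred_net(actions, t, cond)         # predict the noise
        actions = scheduler.step(noise_pred, t, actions).prev_sample  # remove it
    return actions[0]                                         # (horizon, action_dim) waypoints
```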

The comparison below summarizes our key observations with the two foundation models we studied on our platform.

ViNT

  • Most goal-reaching runs succeeded, with excellent repeatability
  • Reliable within mapped environments
  • Did not avoid dynamic obstacles such as humans walking into its path
  • Came close to colliding with chairs and glass walls
  • Took unnecessarily long paths when shorter paths existed in the map
  • More likely to succeed when the goal image contains a distinguishable object

NoMaD

  • Only a few runs were successful to completion
  • Successfully explored open areas, but repeatedly tried to go under warehouse racks
  • Actions were noisy, with spurious angular velocities even when the goal was straight ahead
  • Got lost in areas with similar-looking warehouse racks
Sample Goal Image from topomap
Failed Run
Successful Run

Key Insights:

  • Domain-Specific Learning is Critical
    NoMaD’s learned affordances failed in our virtual warehouse. While the pretrained model may have learnt from its training data to avoid obstacles in office hallways or to stay on the road outdoors, it did not recognize that the undersides of warehouse racks are non-navigable. Foundation models are not truly foundational until they are trained in the domain of interest; we must finetune the model instead of relying on zero-shot navigation.
Sherpa-RP trying to navigate under a rack.
Sherpa-RP in a cluttered environment
  • Robot Embodiment Matters 
    Both models struggled with spatial reasoning that depends on the robot’s physical dimensions. Without explicit embodiment knowledge, they misjudged the available space. Accurate geometry awareness is essential for reliable operation.
  • Safety Must Be Prioritized
    Despite claims of emergent obstacle avoidance, we saw inadequate safety behaviors. In industrial or service settings, the robot must detect and avoid dynamic obstacles and replan around static ones; such mechanisms cannot be compromised. A minimal sketch of one such safeguard follows this list.
The bot traversed very close to a human.
  • System Optimization 
    The navigation model runs at 4 Hz and the control loop at 10 Hz; the model needs GPU compute to meet this timing. Profiling showed time lost in input data formatting. We need smaller, optimized models for edge deployment.
  • Smarter Spatial Awareness and Path Planning
    If the robot gets lost, there is no feedback mechanism to help it recover, and it cannot relocalize without prior knowledge of the topomap nodes. While it follows node sequences well, it is still unclear what defines those sequences. The models should be extended to plan across the full topomap, not just between adjacent nodes.
The bot cannot recognise that the goal is just a little farther away from the start node
  • Smoother motion 
    Traditional path planning produces smooth, optimal trajectories. Both ViNT and NoMaD generated noisy actions with unnecessary angular velocity variations, even for straightforward goals. Using a better tracker to ensure we reach the waypoint could also help mitigate this.
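
As an example of the kind of non-negotiable safeguard mentioned in the safety insight above, here is a minimal sketch of a velocity gate that sits between a learned policy and the motor controller: it scales down or zeroes the commanded velocities whenever the nearest range reading falls below a threshold. The thresholds and the `ranges` input are illustrative assumptions, not our production safety stack.

```python
# Minimal sketch of a safety gate between a learned policy and the motors.
# `ranges` is assumed to be a list of distances (m) from a lidar or depth sensor.


def gate_command(v, w, ranges, stop_dist=0.4, slow_dist=1.0):
    """Scale down or zero the commanded velocities based on the nearest obstacle."""
    nearest = min(ranges) if ranges else float("inf")
    if nearest < stop_dist:
        return 0.0, 0.0                      # hard stop: obstacle inside the stop zone
    if nearest < slow_dist:
        scale = (nearest - stop_dist) / (slow_dist - stop_dist)
        return v * scale, w * scale          # slow down proportionally in the slow zone
    return v, w                              # clear ahead: pass the command through


# Example: a policy asks for 0.5 m/s but a person is 0.7 m away.
print(gate_command(0.5, 0.1, ranges=[2.3, 0.7, 1.8]))  # -> roughly (0.25, 0.05)
```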


Recent Research:

Several more recent approaches attempt to address the current limitations in robot navigation. One promising direction is using models like NaviDiffuser, which generate entire action sequences instead of just single-step actions. This enables longer-term planning that considers multiple objectives such as safety, efficiency, and operational cost.

To improve obstacle avoidance, the CARE framework enhances navigation by combining ViNT with depth estimation and a local costmap, adjusting trajectories to replan around obstacles or avoid them immediately.

Recent work like NaviBridger holds promise for smoother and smarter actions by denoising previous actions rather than denoising random actions from scratch for each frame.

Safety is also being tackled through hybrid approaches like Risk-Guided Diffusion, which fuses a fast, learned policy with a slower, physics-based controller. This balances the adaptability of foundational models with the reliability of formal safety guarantees.

In service-oriented scenarios, object-based navigation like LM-Nav leverages vision-language models to identify landmarks in images, constructing a navigation graph and planning paths to specific goals. While not ideal for repetitive warehouse automation, this technique offers significant value in object-rich environments.

Another innovative approach is LLM-guided planning, where large language models use their common-sense knowledge of the world to inform navigation. This semantic understanding acts as a powerful heuristic, enabling more intelligent and context-aware decision-making.
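
As a rough illustration of how such a semantic heuristic can plug into a planner, the sketch below asks a language model to rate candidate subgoals and mixes that rating with the planner’s own path cost. The `ask_llm` callable is a placeholder for whatever LLM interface is available; it is not an API from the cited papers.

```python
# Sketch of using an LLM's world knowledge as a heuristic over candidate subgoals.
# `ask_llm(prompt) -> str` is a placeholder for any chat/completions interface.


def llm_preference(ask_llm, goal, candidate_labels):
    """Ask the LLM to rate how promising each labelled subgoal is for reaching `goal`."""
    scores = []
    for label in candidate_labels:
        prompt = (
            f"You are guiding a warehouse robot towards: {goal}.\n"
            f"On a scale of 0 to 10, how promising is heading towards '{label}'? "
            "Answer with a single number."
        )
        try:
            scores.append(float(ask_llm(prompt).strip()))
        except ValueError:
            scores.append(0.0)               # unparseable answer -> no preference
    return scores


def pick_subgoal(ask_llm, goal, candidate_labels, path_costs, weight=0.5):
    """Combine the LLM preference with the planner's path cost for each candidate."""
    prefs = llm_preference(ask_llm, goal, candidate_labels)
    combined = [weight * p - (1.0 - weight) * c for p, c in zip(prefs, path_costs)]
    return candidate_labels[combined.index(max(combined))]


# Example usage with a stub LLM that always answers "5".
print(pick_subgoal(lambda prompt: "5", "the loading bay",
                   ["aisle 3", "charging dock"], path_costs=[4.0, 9.0]))
```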

Future Directions

Testing ViNT and NoMaD on the Sherpa-RP, our mobile robot platform for autonomy research, showed us both the potential and the current limitations. The experiments helped us understand which needs must be addressed for real-world applications, and how the latest research publications are solving, or at least thinking about, them.

We are actively researching:

  • Hybrid approaches supplementing the explored models with active feedback
  • Generating synthetic warehouse training data for finetuning these models
  • Multi-modal models with depth and IMU fusion

The future lies in thoughtful combinations of traditional and learned models, utilising the strengths of both. As researchers continue pushing boundaries and addressing current limitations, we are optimistic about the eventual realization of truly general robotic navigation systems.

References

  1. ViNT: A Foundation Model for Visual Navigation – https://arxiv.org/abs/2306.14846
  2. NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration – https://arxiv.org/abs/2310.07896
  3. NaviDiffuser: Cost-Guided Diffusion Model for Visual Navigation – https://arxiv.org/abs/2504.10003
  4. CARE: Enhancing Safety of Foundation Models for Visual Navigation through Collision Avoidance via Repulsive Estimation – https://arxiv.org/abs/2506.03834
  5. NaviBridger: Prior Does Matter – Visual Navigation via Denoising Diffusion Bridge Models – https://arxiv.org/abs/2504.10041
  6. Risk-Guided Diffusion: Toward Deploying Robot Foundation Models In Space, Where Failure Is Not An Option – https://arxiv.org/pdf/2506.17601
  7. LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action – https://arxiv.org/abs/2207.04429 
  8. Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning – https://arxiv.org/abs/2310.10103