
Foundation models to achieve Pixel-to-Control

By: Karthigeyan Ganesh Shankar & Srividya Prasad

Ever notice how we can stroll from the factory floor to the IT desk without whipping out Google Maps for every ten steps we take? Our brain quietly stitches together landmarks, memories, and a pinch of intuition, guiding us on autopilot to our destination. That’s muscle memory at work on home turf: our brain runs through a well-rehearsed routine.

Now imagine giving that same “street-smart” superpower to an industrial robot. That is exactly what we have been tinkering with, teaching our bots to read the factory floor plan the way humans do, so they move to their destinations without having to clutch a digital compass.

The idea behind foundation models is precisely this ingrained ability, recreated in a bot. Given that the bot is navigating a known environment (or even an unseen one), would it be able to move to a destination with reasonable control and precision?

Recently, there has been a lot of activity around LLMs (Large Language Models) and the more advanced VLMs (Vision Language Models), which have been trained by ingesting large amounts of data and provide reasonably good predictions. Intuition says we can train robots to recognize aisles, pallets, and loading bays the same way we spot the coffee machine on a Monday morning: fast, natural, and with zero second-guessing!

The idea is simple: can we apply the recipe that made large language models successful, transformers trained on large datasets, to robotics? Instead of separate algorithms for mapping, localization, and path planning, imagine an end-to-end model that learns navigation from robot datasets and generalizes to new, unseen environments.

This vision is compelling, but the challenges are immense. Unlike text, robotic navigation must account for diverse environments, sensor data types, intrinsic parameters, actuators, physical dimensions, locomotion types, and degrees of freedom; the data is deeply multimodal. There is a lot of research in this area, and all of us are working towards, and waiting for, the day these models are ready for real-world deployment, right out of the box.

We took this idea forward with two promising foundation models on our Sherpa-RP, a mobile robot platform designed and engineered by Ati Motors for research purposes, equipped with sensors and motors. We adopted a simulation-first approach, testing and fine-tuning the two models in Nvidia’s Isaac Sim environment.

Sherpa RP (Reality)
Sherpa RP (Simulation)
An overview of the ROS2 Omnigraph to run Sherpa RP in Isaac Sim
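
For readers curious how this loop is wired up in practice, here is a minimal sketch of the kind of ROS 2 node that closes the pixel-to-control loop: it subscribes to the simulated camera stream coming out of the Isaac Sim Omnigraph and publishes velocity commands back to the robot. The topic names and the `policy` callable are illustrative assumptions, not the exact interfaces of our stack.

```python
# Minimal sketch of a pixel-to-control bridge node (assumed topic names and policy API).
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image
from geometry_msgs.msg import Twist
from cv_bridge import CvBridge


class PixelToControlBridge(Node):
    """Subscribes to the simulated camera feed and publishes velocity commands."""

    def __init__(self, policy):
        super().__init__("pixel_to_control_bridge")
        self.policy = policy                      # any callable: RGB frame -> (v, w)
        self.bridge = CvBridge()
        self.latest_frame = None
        # Topic names below are assumptions for illustration.
        self.create_subscription(Image, "/front_camera/rgb", self.on_image, 10)
        self.cmd_pub = self.create_publisher(Twist, "/cmd_vel", 10)
        # Run the control loop at 10 Hz, the rate mentioned later in this post.
        self.create_timer(0.1, self.on_timer)

    def on_image(self, msg: Image):
        self.latest_frame = self.bridge.imgmsg_to_cv2(msg, desired_encoding="rgb8")

    def on_timer(self):
        if self.latest_frame is None:
            return
        v, w = self.policy(self.latest_frame)     # linear and angular velocity
        cmd = Twist()
        cmd.linear.x = float(v)
        cmd.angular.z = float(w)
        self.cmd_pub.publish(cmd)


def main():
    rclpy.init()
    # A trivial stand-in policy: drive slowly straight ahead.
    rclpy.spin(PixelToControlBridge(policy=lambda frame: (0.2, 0.0)))
    rclpy.shutdown()


if __name__ == "__main__":
    main()
```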

ViNT (Visual Navigation Transformer) first builds a topological map from images collected while teleoperating the robot, then uses transformer attention to navigate between the map’s nodes by generating waypoints.

Topomap
Path taken during topomap creation
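
To make the ViNT pipeline above a little more concrete, here is a simplified sketch of topological navigation: at each tick the robot compares its current camera view against nearby topomap nodes, picks the node the model believes is closest, and asks the model for waypoints towards that subgoal. The `vint_model.predict(observation, subgoal_image)` interface returning a temporal distance and waypoints is an assumption we make for illustration; the real system has more moving parts.

```python
# Simplified sketch of topological navigation with a ViNT-style model.
# `vint_model.predict(observation, subgoal) -> (temporal_distance, waypoints)`
# is an assumed interface; the real one differs in detail.
import numpy as np


def select_subgoal(vint_model, observation, topomap_images, last_node, window=3):
    """Pick the next subgoal node from a local window around the last known node."""
    candidates = range(last_node, min(last_node + window, len(topomap_images)))
    best_node, best_dist = last_node, float("inf")
    for idx in candidates:
        dist, _ = vint_model.predict(observation, topomap_images[idx])
        if dist < best_dist:
            best_node, best_dist = idx, dist
    return best_node


def navigation_step(vint_model, observation, topomap_images, last_node):
    """One navigation tick: choose a subgoal and get waypoints towards it."""
    subgoal = select_subgoal(vint_model, observation, topomap_images, last_node)
    _, waypoints = vint_model.predict(observation, topomap_images[subgoal])
    # `waypoints` is a short horizon of (x, y) offsets in the robot frame;
    # a downstream tracker turns the first waypoint into velocity commands.
    return subgoal, np.asarray(waypoints)
```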

NoMaD (Navigation with Goal Masked Diffusion), by contrast, uses a diffusion model to predict actions from random noise, relying on the visual understanding it acquired from its training data.
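
The core of a diffusion policy like this can be sketched in a few lines: starting from Gaussian noise, a network repeatedly predicts and removes noise, conditioned on the visual context (and on the goal image when one is provided), until a clean sequence of actions remains. The sketch below uses the standard DDPM scheduler from the diffusers library for illustration; the `noise_pred_net` interface is an assumption, and NoMaD’s actual goal masking and conditioning are more involved.

```python
# Sketch of the reverse-diffusion loop that turns noise into an action sequence.
# `noise_pred_net(actions, t, cond)` is an assumed interface for the policy network.
import torch
from diffusers import DDPMScheduler


def sample_actions(noise_pred_net, cond, horizon=8, action_dim=2, steps=10):
    """Denoise a random action sequence conditioned on the visual context `cond`."""
    scheduler = DDPMScheduler(num_train_timesteps=1000)
    scheduler.set_timesteps(steps)

    actions = torch.randn(1, horizon, action_dim)            # start from pure noise
    for t in scheduler.timesteps:
        noise_pred = noise_pred_net(actions, t, cond)         # predict the noise
        actions = scheduler.step(noise_pred, t, actions).prev_sample  # remove it
    return actions[0]                                         # (horizon, action_dim) waypoints
```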

The comparison below summarizes our key observations with the two foundation models we studied on our platform.

ViNT

  • Most goal-reaching runs succeeded, with excellent repeatability
  • Reliable within mapped environments
  • Did not avoid dynamic obstacles such as humans walking into its path
  • Came close to colliding with chairs and glass walls
  • Took unnecessarily long paths when shorter paths existed in the map
  • More likely to succeed when the goal image contains a distinguishable object

NoMaD

  • Only a few runs were successful to completion
  • Successfully explored open areas, but repeatedly tried to go under warehouse racks
  • Actions were noisy, with spurious angular velocities even when the goal was straight ahead
  • Got lost in areas with similar-looking warehouse racks
Sample Goal Image from topomap
Failed Run
Successful Run

Key Insights:

  • Domain-Specific Learning is Critical
    NoMaD’s learned affordances failed in our virtual warehouse. While the pretrained model may have learnt from its training data to avoid obstacles in office hallways or to stay on the road outdoors, it did not recognize that the undersides of warehouse racks are non-navigable. Foundation models are not truly foundational until they are trained in the domain of interest; we must finetune the model instead of relying on zero-shot navigation.
Sherpa-RP trying to navigate under a rack.
Sherpa-RP in a cluttered environment
  • Robot Embodiment Matters 
    Both models struggled with spatial reasoning that depends on the robot’s physical dimensions. Without explicit embodiment knowledge, they misjudged the available space. Accurate geometry awareness is essential for reliable operation.
  • Safety Must Be Prioritized
    Despite claims of emergent obstacle avoidance, we saw inadequate safety behaviors. In industrial or service settings, the robot must detect and avoid dynamic obstacles and replan around static ones; such mechanisms cannot be compromised. A minimal sketch of one such safeguard follows this list.
The bot traversed very close to a human.
  • System Optimization 
    The navigation model runs at 4 Hz and the control loop at 10 Hz; the model needs GPU compute to meet this timing. Profiling showed time lost in input data formatting. We need smaller, optimized models for edge deployment.
  • Smarter Spatial Awareness and Path Planning
    If the robot gets lost, there is no feedback mechanism to help it recover, and it cannot relocalize without prior knowledge of the topomap nodes. While it follows node sequences well, it is still unclear what defines those sequences. The models should be extended to plan across the full topomap, not just between adjacent nodes.
The bot cannot recognise that the goal is just a little farther away from the start node
  • Smoother motion 
    Traditional path planning produces smooth, optimal trajectories. Both ViNT and NoMaD generated noisy actions with unnecessary angular velocity variations, even for straightforward goals. Using a better tracker to ensure we reach the waypoint could also help mitigate this.
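
As an example of the kind of non-negotiable safeguard mentioned in the safety insight above, here is a minimal sketch of a velocity gate that sits between a learned policy and the motor controller: it scales down or zeroes the commanded velocities whenever the nearest range reading falls below a threshold. The thresholds and the `ranges` input are illustrative assumptions, not our production safety stack.

```python
# Minimal sketch of a safety gate between a learned policy and the motors.
# `ranges` is assumed to be a list of distances (m) from a lidar or depth sensor.


def gate_command(v, w, ranges, stop_dist=0.4, slow_dist=1.0):
    """Scale down or zero the commanded velocities based on the nearest obstacle."""
    nearest = min(ranges) if ranges else float("inf")
    if nearest < stop_dist:
        return 0.0, 0.0                      # hard stop: obstacle inside the stop zone
    if nearest < slow_dist:
        scale = (nearest - stop_dist) / (slow_dist - stop_dist)
        return v * scale, w * scale          # slow down proportionally in the slow zone
    return v, w                              # clear ahead: pass the command through


# Example: a policy asks for 0.5 m/s but a person is 0.7 m away.
print(gate_command(0.5, 0.1, ranges=[2.3, 0.7, 1.8]))  # -> roughly (0.25, 0.05)
```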


Recent Research:

Several more recent approaches attempt to address the current limitations in robot navigation. One promising direction is using models like NaviDiffuser, which generate entire action sequences instead of just single-step actions. This enables longer-term planning that considers multiple objectives such as safety, efficiency, and operational cost.

To improve obstacle avoidance, the CARE framework enhances navigation by combining ViNT with depth estimation and a local costmap, adjusting trajectories to replan around obstacles or avoid them immediately.

Recent work like NaviBridger holds promise for smoother and smarter actions by denoising previous actions rather than denoising random actions from scratch for each frame.

Safety is also being tackled through hybrid approaches like Risk-Guided Diffusion, which fuses a fast, learned policy with a slower, physics-based controller. This balances the adaptability of foundational models with the reliability of formal safety guarantees.

In service-oriented scenarios, object-based navigation like LM-Nav leverages vision-language models to identify landmarks in images, constructing a navigation graph and planning paths to specific goals. While not ideal for repetitive warehouse automation, this technique offers significant value in object-rich environments.

Another innovative approach is LLM-guided planning, where large language models use their common-sense knowledge of the world to inform navigation. This semantic understanding acts as a powerful heuristic, enabling more intelligent and context-aware decision-making.
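
As a rough illustration of how such a semantic heuristic can plug into a planner, the sketch below asks a language model to rate candidate subgoals and mixes that rating with the planner’s own path cost. The `ask_llm` callable is a placeholder for whatever LLM interface is available; it is not an API from the cited papers.

```python
# Sketch of using an LLM's world knowledge as a heuristic over candidate subgoals.
# `ask_llm(prompt) -> str` is a placeholder for any chat/completions interface.


def llm_preference(ask_llm, goal, candidate_labels):
    """Ask the LLM to rate how promising each labelled subgoal is for reaching `goal`."""
    scores = []
    for label in candidate_labels:
        prompt = (
            f"You are guiding a warehouse robot towards: {goal}.\n"
            f"On a scale of 0 to 10, how promising is heading towards '{label}'? "
            "Answer with a single number."
        )
        try:
            scores.append(float(ask_llm(prompt).strip()))
        except ValueError:
            scores.append(0.0)               # unparseable answer -> no preference
    return scores


def pick_subgoal(ask_llm, goal, candidate_labels, path_costs, weight=0.5):
    """Combine the LLM preference with the planner's path cost for each candidate."""
    prefs = llm_preference(ask_llm, goal, candidate_labels)
    combined = [weight * p - (1.0 - weight) * c for p, c in zip(prefs, path_costs)]
    return candidate_labels[combined.index(max(combined))]


# Example usage with a stub LLM that always answers "5".
print(pick_subgoal(lambda prompt: "5", "the loading bay",
                   ["aisle 3", "charging dock"], path_costs=[4.0, 9.0]))
```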

Future Directions

Testing ViNT and NoMaD on the Sherpa-RP, our mobile robot platform for autonomy research, showed us both the potential and the current limitations. The experiments helped us understand which needs must be addressed for real-world applications, and how the latest research publications are solving, or at least thinking about, them.

We are actively researching:

  • Hybrid approaches supplementing the explored models with active feedback
  • Generating synthetic warehouse training data for finetuning these models
  • Multi-modal models with depth and IMU fusion

The future lies in thoughtful combinations of traditional and learned models, utilising the strengths of both. As researchers continue pushing boundaries and addressing current limitations, we are optimistic about the eventual realization of truly general robotic navigation systems.

References

  1. ViNT: A Foundation Model for Visual Navigation – https://arxiv.org/abs/2306.14846
  2. NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration – https://arxiv.org/abs/2310.07896
  3. NaviDiffuser: Cost-Guided Diffusion Model for Visual Navigation – https://arxiv.org/abs/2504.10003
  4. CARE: Enhancing Safety of Foundation Models for Visual Navigation through Collision Avoidance via Repulsive Estimation – https://arxiv.org/abs/2506.03834
  5. NaviBridger: Prior Does Matter – Visual Navigation via Denoising Diffusion Bridge Models – https://arxiv.org/abs/2504.10041
  6. Risk-Guided Diffusion: Toward Deploying Robot Foundation Models In Space, Where Failure Is Not An Option – https://arxiv.org/pdf/2506.17601
  7. LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action – https://arxiv.org/abs/2207.04429 
  8. Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning – https://arxiv.org/abs/2310.10103