GeNIE: A Generalizable Navigation System for In-the-Wild Environments

Accepted at IEEE Robotics and Automation Letters (RA-L)
🏆 Winner of the Earth Rover Challenge, ICRA 2025
Presented at RSS'25 FM4RoboPlan Workshop

TL;DR: GeNIE integrates a generalizable traversability prediction model (SAM-TP) with a novel path fusion strategy that enhances planning stability in noisy and ambiguous settings. Deployed in the Earth Rover Challenge (ERC) at ICRA 2025 across six countries spanning three continents, GeNIE took first place and achieved 79% of the maximum possible score, outperforming the second-best team by 17%, and completed the entire competition without a single human intervention. These results set a new benchmark for robust, generalizable outdoor robot navigation.

GeNIE navigating through diverse terrains in real-world environments. Top-left: current RGB observation overlaid with the predicted traversability mask. Top-right: GPS-based map view showing the robot's estimated trajectory under noisy GPS signals. Bottom-left: traversability score map, where warmer colors (red) indicate higher traversability likelihood. Bottom-right: trajectory visualization showing sampled candidate paths (light blue) and the selected fused trajectory (red) guiding navigation.

GeNIE across diverse environments
Diverse scenes (top row), traversability predictions (middle row), and final path planning results (bottom row) across seven different environments. GeNIE generalizes well across challenging conditions, including heavy rain, high-contrast lighting, muddy lenses, and diverse terrain types.

Introduction

Navigation in the open world remains challenging due to diverse terrains, lighting, and sensor noise that cause models to fail under domain shifts. The Earth Rover Challenge (ERC) offers a valuable testbed for studying such generalization, deploying identical sidewalk robots across three continents and six countries in varied real-world settings, from urban streets to rural paths, under different weather and lighting conditions. Each robot relies only on low-cost, noisy sensors (RGB camera, IMU, GPS) and limited remote computation, with visual data streamed at just 3–5 Hz, making traditional SLAM pipelines infeasible. The difficulty of the task is underscored by the inaugural results, where the best autonomous system achieved only 36% of the total score.

Deployment locations
Examples of diverse terrains in ERC 2025, illustrating the requirement for robots to understand semantics in order to identify safe navigable areas in unstructured environments.

System Overview

GeNIE addresses the challenge of robust navigation through a carefully designed architecture that combines foundation models with novel planning strategies. Given an RGB input image, the system first predicts pixel-wise traversability scores in the image space. The predicted traversable regions are then projected into a local 2D bird's-eye view (BEV) cost map. Multiple candidate paths are sampled from this BEV map, followed by path fusion on the top-K paths with the lowest traversability costs to generate more stable and consistent trajectories. Finally, the best fused path is selected based on its alignment with the goal location.
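
For concreteness, here is a minimal Python sketch of one planning cycle. The helpers project_to_bev, sample_paths, and fuse_paths, and the sam_tp model interface, are hypothetical placeholders for exposition, not the actual GeNIE implementation:

import numpy as np

def path_cost(path, bev_cost):
    # Sum cost-map values along the waypoints (nearest-cell lookup).
    idx = np.clip(path.astype(int), 0, bev_cost.shape[0] - 1)
    return bev_cost[idx[:, 0], idx[:, 1]].sum()

def navigation_step(rgb_image, goal_direction, sam_tp, k=5):
    """One planning cycle, following the pipeline described above."""
    # 1. Pixel-wise traversability scores in image space.
    trav_mask = sam_tp.predict(rgb_image)            # (H, W) scores in [0, 1]

    # 2. Project traversable regions into a local 2D BEV cost map
    #    (low cost = traversable).
    bev_cost = project_to_bev(trav_mask)

    # 3. Sample candidate paths and keep the top-K with lowest cost.
    candidates = sample_paths(bev_cost)              # list of (N, 2) waypoint arrays
    top_k = sorted(candidates, key=lambda p: path_cost(p, bev_cost))[:k]

    # 4. Fuse similar candidates into stable representative paths.
    fused = fuse_paths(top_k)

    # 5. Select the fused path best aligned with the goal direction.
    def alignment(path):
        heading = path[-1] - path[0]
        return float(np.dot(heading / np.linalg.norm(heading), goal_direction))

    return max(fused, key=alignment)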

System architecture
Overview of the GeNIE system. Given an RGB input image, the SAM-TP module predicts navigable regions in the image space. These predictions are projected into a bird's-eye view (BEV) cost map. Path fusion is then performed to identify coherent and safe traversable paths, followed by path selection based on alignment with the goal direction. Finally, the control module outputs linear and angular velocities to follow the selected path.

SAM-TP: Traversability Prediction

Robust navigation requires the ability to perceive traversable regions consistently across diverse terrains, lighting conditions, and sensor configurations. To enable such generalization, we develop SAM-TP, a traversability-aware adaptation of the SAM2 model.

To train this model, we sample diverse outdoor frames from the FrodoBots-2K dataset and annotate them using a semi-automatic pipeline, producing 15,347 high-quality traversability masks. Each mask marks only regions directly reachable from the robot’s current position, assuming the robot stands near the mid-bottom of the image.

SAM-TP alters the SAM2 architecture by removing the original mask decoder and introducing a single learnable prompt token that captures traversability semantics. This token is optimized directly via backpropagation, eliminating the need for manual prompts at inference time.
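
As a rough illustration, a minimal PyTorch sketch of the learnable-prompt idea follows. The encoder and decoder interfaces here are assumptions for exposition and do not match the real SAM2 API; in particular, the lightweight decoder stands in for whatever replaces SAM2's original mask decoder:

import torch
import torch.nn as nn

class SAMTP(nn.Module):
    """Illustrative sketch: frozen SAM2-style encoder plus one learnable
    prompt token that encodes traversability semantics."""

    def __init__(self, image_encoder, light_decoder, embed_dim=256):
        super().__init__()
        self.image_encoder = image_encoder      # frozen SAM2-style backbone
        self.light_decoder = light_decoder      # lightweight replacement mask head
        # Single learnable prompt token, optimized via backpropagation,
        # so no manual point/box prompts are needed at inference time.
        self.trav_token = nn.Parameter(torch.randn(1, 1, embed_dim))
        for p in self.image_encoder.parameters():
            p.requires_grad = False

    def forward(self, image):
        feats = self.image_encoder(image)                   # image embeddings
        token = self.trav_token.expand(image.size(0), -1, -1)
        return self.light_decoder(feats, token)             # (B, 1, H, W) logits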

We evaluate SAM-TP on a novel benchmark with unseen environments and different robot embodiments, achieving 92.9% IoU on normal scenes, 79.9% under difficult conditions, and 92.6% in cross-embodiment settings. The model outperforms strong baselines, including large vision and language models and state-of-the-art traversability prediction methods.

SAM-TP comparison
Comparison of traversability predictions (red regions) from SAM, SAM2, VLM (gemini-2.5-flash), and our proposed SAM-TP approach. SAM-TP achieves superior segmentation of navigable regions across diverse conditions.

Path Fusion, Control and VLM-based Failure Recovery

Path Fusion Strategy

Traditional sampling-based planners struggle when goals are far away, often oscillating between paths of similar cost or getting trapped in local minima. To address this, we introduce a path fusion strategy that reasons at the path level instead of the pixel level. From the BEV cost map, we sample candidate paths represented as first- and second-order polynomials, then select the lowest-cost set. Using adaptive k-means clustering on the waypoint trajectories, we merge nearby paths based on silhouette-optimized clusters and centroid proximity. The resulting fused path set captures stable, semantically distinct routes, from which we select the one most aligned with the goal direction. This approach reduces path switching and improves long-horizon stability, achieving 84.6% Path Identification Precision and 85.7% Path Selection Accuracy, significantly outperforming planners without fusion.
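
A minimal sketch of the adaptive clustering step using scikit-learn is below; it assumes all candidate paths have been resampled to the same number of waypoints, and it omits the cost weighting used in the actual system:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def fuse_paths(paths, max_k=4):
    """Cluster candidate paths and return one fused (centroid) path per
    cluster. Sketch only; paths are (n_waypoints, 2) arrays of equal length."""
    X = np.stack([p.ravel() for p in paths])         # flatten each path
    best_k, best_score = 1, -1.0
    best_labels = np.zeros(len(paths), dtype=int)

    # Adaptive k: choose the cluster count with the best silhouette score.
    for k in range(2, min(max_k, len(paths) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels

    # Merge each cluster into its centroid trajectory.
    return [X[best_labels == c].mean(axis=0).reshape(-1, 2)
            for c in range(best_k)]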

Control and Collision Avoidance

The control system operates in a receding-horizon loop that continuously replans based on new observations. A simple waypoint-tracking controller aligns the robot's heading toward the next waypoint, waits for visual confirmation via epipolar geometry, and advances by 1 m before re-evaluating. A lightweight collision detection module monitors the traversability map to halt motion and trigger replanning when hazards appear near the front region. Because SAM-TP runs at 10 Hz while input frames arrive at only 3–5 Hz, every incoming frame can be checked for collisions in real time. This conservative replan–align–move cycle prioritizes robustness under ERC's high-latency, low-frequency sensing conditions.
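
The cycle can be summarized with the following sketch; the robot and planner interfaces are hypothetical stand-ins for the onboard stack:

STEP_DISTANCE_M = 1.0   # advance per cycle, as described above

def control_loop(robot, planner, goal):
    """Replan–align–move cycle with collision checking (sketch only)."""
    while not robot.at(goal):
        frame = robot.latest_frame()                       # frames arrive at 3–5 Hz
        trav_map = planner.predict_traversability(frame)   # model runs at ~10 Hz

        # Halt and replan if a hazard appears near the front region.
        if planner.hazard_ahead(trav_map):
            robot.stop()
            continue

        path = planner.plan(trav_map, goal)
        robot.turn_to(path.next_waypoint())       # align heading
        robot.wait_for_visual_confirmation()      # epipolar-geometry check
        robot.move_forward(STEP_DISTANCE_M)       # advance, then re-evaluate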

Path fusion visualization
Comparison of path planning with and without fusion. Without fusion, the planner tends to select paths with lower traversability when the goal weight is too small, or gets stuck in local minima when too large. Path fusion succeeds in both scenarios by operating at the path level.

VLM-Guided Failure Recovery

When the local planner stalls or drifts off the road, we trigger a lightweight two-stage recovery routine driven by a vision-language model (VLM). Unlike traditional escape behaviors that rely on hand-tuned heuristics or metric maps, this approach uses semantic reasoning over camera observations to select safe recovery actions.

Stage 1 — Monitor

During normal navigation, the robot continuously samples the front RGB camera and maintains a small FIFO buffer of recent views. At a fixed rate, we query the VLM using the latest frame plus short temporal context to classify the robot’s status as on-road or off-road. A majority vote across recent classifications suppresses noise from transient occlusions and lighting changes. If the VLM flags off-road in two consecutive evaluations, recovery mode is activated.
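
A minimal sketch of the monitor, assuming a query_vlm callable that returns "on-road" or "off-road" (the buffer size is an illustrative choice; the two-evaluation trigger follows the description above):

from collections import deque

WINDOW = 5     # recent classifications kept for voting (assumed value)
TRIGGER = 2    # consecutive off-road flags before recovery (per the text)

class OffRoadMonitor:
    """Stage 1 sketch: majority-vote filtering of VLM classifications."""

    def __init__(self, query_vlm):
        self.query_vlm = query_vlm
        self.votes = deque(maxlen=WINDOW)    # FIFO buffer of recent labels
        self.consecutive_off = 0

    def update(self, frame, context_frames):
        self.votes.append(self.query_vlm(frame, context_frames))
        # Majority vote suppresses transient occlusions and lighting noise.
        off_road = self.votes.count("off-road") > len(self.votes) // 2
        self.consecutive_off = self.consecutive_off + 1 if off_road else 0
        return self.consecutive_off >= TRIGGER   # True => activate recovery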

Stage 2 — Recover

Recovery consists of two steps (a control-flow sketch follows the list):

  1. Look-around selection: The robot pauses and performs a 360° look-around using four headings at 90° increments. The VLM inspects the four captured views as a set and selects the heading that is most likely to lead back to the road, using cues such as road texture, painted lines, or curb boundaries. If uncertainty is high, the VLM can request a repeat sweep before committing to a direction.
  2. Direction choice: Facing the selected heading, the robot presents a fresh view with three options: left, straight, or right. The VLM recommends a conservative short-horizon action or requests another sweep if the scene remains ambiguous. Each motion is executed at low speed and limited distance, with a brief settle pause before re-evaluation. If the VLM detects that the robot has rejoined the road, normal navigation resumes; otherwise, recovery iterates.
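
A control-flow sketch of the recovery routine, with hypothetical robot and vlm interfaces (the iteration cap is an assumed safeguard, not from the paper):

HEADINGS_DEG = [0, 90, 180, 270]   # 360° look-around at 90° increments

def recover(robot, vlm, max_iters=5):
    """Stage 2 sketch: look-around selection, then direction choice."""
    for _ in range(max_iters):
        # Step 1: capture four views and let the VLM pick a heading.
        views = [robot.capture_at_heading(h) for h in HEADINGS_DEG]
        choice = vlm.pick_heading(views)     # index into HEADINGS_DEG, or "repeat"
        if choice == "repeat":
            continue                         # high uncertainty: sweep again
        robot.turn_to_heading(HEADINGS_DEG[choice])

        # Step 2: conservative short-horizon action facing that heading.
        action = vlm.pick_direction(robot.latest_frame())  # left/straight/right/"repeat"
        if action == "repeat":
            continue
        robot.execute_slow(action)           # low speed, limited distance
        robot.settle_pause()                 # brief pause before re-evaluation

        if vlm.is_on_road(robot.latest_frame()):
            return True                      # rejoined the road; resume navigation
    return False                             # still off-road; recovery continues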

This design enables reliable recovery using only onboard perception: no GPS, metric maps, or handcrafted escape behaviors are required. In qualitative field trials, the method successfully escaped dead ends and roadside drifts in roughly 80% of cases, showing strong potential for language-guided corrective navigation.

VLM-based failure recovery
VLM-guided failure recovery process showing the two-stage approach: monitoring for off-road detection and recovery through look-around selection and direction choice.

Additional Demonstrations

Beyond outdoor navigation, GeNIE demonstrates strong generalization to diverse environments and robot embodiments, including indoor settings and different hardware platforms.

Cross-Embodiment Deployment on Robot Dog

GeNIE deployed on a quadruped robot platform, showcasing cross-embodiment capabilities and adaptability to different robot morphologies.

Indoor Environment Navigation

GeNIE navigating autonomously in indoor environments, demonstrating generalization beyond outdoor terrains.

Citation

If you find our work useful, please consider citing:

@article{wang2025genie,
  title={GeNIE: A Generalizable Navigation System for In-the-Wild Environments},
  author={Wang, Jiaming and Liu, Diwen and Chen, Jizhuo and Da, Jiaxuan and 
          Qian, Nuowen and Man, Tram Minh and Soh, Harold},
  journal={arXiv preprint arXiv:2506.17960},
  year={2025}
}