TL;DR: GeNIE integrates a generalizable traversability prediction model (SAM-TP) with a novel path fusion strategy that enhances planning stability in noisy and ambiguous settings. Deployed in the Earth Rover Challenge (ERC) at ICRA 2025 across six countries spanning three continents, GeNIE took first place and achieved 79% of the maximum possible score, outperforming the second-best team by 17%, and completed the entire competition without a single human intervention. These results set a new benchmark for robust, generalizable outdoor robot navigation.
Navigation in the open world remains challenging due to diverse terrains, lighting, and sensor noise that cause models to fail under domain shifts. The Earth Rover Challenge (ERC) offers a valuable testbed for studying such generalization, deploying identical sidewalk robots across three continents and six countries in varied real-world settings, from urban streets to rural paths, under different weather and lighting conditions. Each robot relies only on low-cost, noisy sensors (RGB camera, IMU, GPS) and limited remote computation, with visual data streamed at just 3–5 Hz, making traditional SLAM pipelines infeasible. The difficulty of the task is underscored by the inaugural results, where the best autonomous system achieved only 36% of the total score.
GeNIE addresses the challenge of robust navigation through a carefully designed architecture that combines foundation models with novel planning strategies. Given an RGB input image, the system first predicts pixel-wise traversability scores in the image space. The predicted traversable regions are then projected into a local 2D bird's-eye view (BEV) cost map. Multiple candidate paths are sampled from this BEV map, followed by path fusion on the top-K paths with the lowest traversability costs to generate more stable and consistent trajectories. Finally, the best fused path is selected based on its alignment with the goal location.
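As a rough illustration, the sketch below strings these stages together in Python. The perception and planning components (predict_trav, project_to_bev, sample_paths, fuse_paths) are hypothetical callables standing in for GeNIE's actual modules; only the cost ranking and goal-alignment helpers are spelled out.

```python
# Illustrative sketch of the perception-to-planning pipeline described above.
# The heavy components are passed in as callables because the actual GeNIE
# interfaces are not reproduced here.
import numpy as np

def path_cost(path_xy: np.ndarray, bev_cost: np.ndarray) -> float:
    """Sum the BEV cost under each waypoint (path_xy holds integer cell indices)."""
    return float(bev_cost[path_xy[:, 1], path_xy[:, 0]].sum())

def goal_alignment(path_xy: np.ndarray, goal_dir: np.ndarray) -> float:
    """Cosine similarity between the path's overall direction and the goal direction."""
    d = path_xy[-1] - path_xy[0]
    return float(d @ goal_dir / (np.linalg.norm(d) * np.linalg.norm(goal_dir) + 1e-9))

def plan_step(rgb, goal_dir, predict_trav, project_to_bev, sample_paths, fuse_paths, k=5):
    trav = predict_trav(rgb)                 # (H, W) pixel-wise traversability scores
    bev = project_to_bev(trav)               # (H_bev, W_bev) local BEV cost map
    candidates = sample_paths(bev)           # list of (N, 2) waypoint arrays
    top_k = sorted(candidates, key=lambda p: path_cost(p, bev))[:k]
    fused = fuse_paths(top_k)                # stable, semantically distinct routes
    return max(fused, key=lambda p: goal_alignment(p, goal_dir))
```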
Robust navigation requires the ability to perceive traversable regions consistently across diverse terrains, lighting conditions, and sensor configurations. To enable such generalization, we develop SAM-TP, a traversability-aware adaptation of the SAM2 model.
To train this model, we sample diverse outdoor frames from the FrodoBots-2K dataset and annotate them using a semi-automatic pipeline, producing 15,347 high-quality traversability masks. Each mask marks only regions directly reachable from the robot’s current position, assuming the robot stands near the mid-bottom of the image.
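For illustration, the sketch below enforces the "reachable from the mid-bottom" convention on a binary mask by keeping only the connected component that contains the robot's assumed position. This is not the authors' annotation pipeline, just one way to express the constraint.

```python
# Keep only the traversable region connected to the robot's assumed position
# near the mid-bottom of the image (a sketch of the reachability constraint,
# not the actual semi-automatic annotation pipeline).
import numpy as np
from scipy import ndimage

def keep_reachable(trav_mask: np.ndarray) -> np.ndarray:
    """trav_mask: (H, W) boolean traversability mask."""
    labels, _ = ndimage.label(trav_mask)      # connected components of the mask
    h, w = trav_mask.shape
    seed_label = labels[h - 1, w // 2]        # pixel at the mid-bottom of the image
    if seed_label == 0:                       # mid-bottom itself is not traversable
        return np.zeros_like(trav_mask)
    return labels == seed_label
```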
SAM-TP alters the SAM2 architecture by removing the original mask decoder and introducing a single learnable prompt token that captures traversability semantics. This token is optimized directly via backpropagation, eliminating the need for manual prompts at inference time.
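A minimal PyTorch sketch of the learnable-prompt-token idea follows; it uses a generic dot-product head rather than SAM2's actual decoder interface, so treat the module structure as an assumption.

```python
# Hedged sketch of a single learnable prompt token that is trained by
# backpropagation and scores every pixel embedding for traversability.
# This is NOT the real SAM-TP/SAM2 integration, only the core idea.
import torch
import torch.nn as nn

class LearnablePromptHead(nn.Module):
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # One token that captures traversability semantics; no manual prompts
        # are needed at inference time.
        self.prompt_token = nn.Parameter(torch.randn(embed_dim))
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, image_embeddings: torch.Tensor) -> torch.Tensor:
        # image_embeddings: (B, H*W, C) features from the frozen image encoder.
        q = self.proj(self.prompt_token)       # (C,)
        logits = image_embeddings @ q          # (B, H*W) per-pixel traversability logits
        return logits
```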
We evaluate SAM-TP on a novel benchmark with unseen environments and different robot embodiments, achieving 92.9% IoU on normal scenes, 79.9% under difficult conditions, and 92.6% in cross-embodiment settings. The model outperforms strong baselines, including large vision and language models and state-of-the-art traversability prediction methods.
Traditional sample-based planners struggle when goals are far away, often oscillating between paths of similar cost or getting trapped in local minima. To address this, we introduce a path fusion strategy that reasons at the path level instead of the pixel level. From the BEV cost map, we sample candidate paths represented as first- and second-order polynomials, then select the lowest-cost set. Using adaptive k-means clustering on the waypoint trajectories, we merge nearby paths based on silhouette-optimized clusters and centroid proximity. The resulting fused path set captures stable, semantically distinct routes, from which we select the one most aligned with the goal direction. This approach reduces switching and improves long-horizon stability, achieving 84.6% Path Identification Precision and 85.7% Path Selection Accuracy, significantly outperforming planners without fusion.
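The sketch below illustrates the fusion step under simplifying assumptions (all paths resampled to the same number of waypoints, clusters merged by their waypoint-wise mean); the exact features and merging rule in GeNIE may differ.

```python
# Sketch of path fusion: cluster the top-K lowest-cost candidate paths with
# k-means, pick k by silhouette score, and merge each cluster into one
# representative path. Details are assumptions, not the exact GeNIE code.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def fuse_paths(paths: list[np.ndarray]) -> list[np.ndarray]:
    """paths: list of (N, 2) waypoint arrays, all resampled to the same N."""
    feats = np.stack([p.reshape(-1) for p in paths])          # (K, 2N) path features
    best_k, best_labels, best_score = 1, np.zeros(len(paths), dtype=int), -1.0
    for k in range(2, len(paths)):                            # silhouette needs 2 <= k < K
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(feats)
        score = silhouette_score(feats, labels)
        if score > best_score:
            best_k, best_labels, best_score = k, labels, score
    # Merge each cluster of nearby paths into its waypoint-wise mean (centroid path).
    return [np.stack([p for p, l in zip(paths, best_labels) if l == c]).mean(axis=0)
            for c in range(best_k)]
```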
The control system operates in a receding-horizon loop that continuously replans based on new observations. A simple waypoint-tracking controller aligns the robot's heading toward the next waypoint, waits for visual confirmation via epipolar geometry, and advances by 1 m before re-evaluating. A lightweight collision detection module monitors the traversability map to halt motion and trigger replanning when hazards appear near the front region. Because SAM-TP inference runs at 10 Hz, faster than the 3–5 Hz rate at which frames arrive, every incoming frame can be collision-checked in real time. This conservative replan–align–move cycle prioritizes robustness under ERC's high-latency and low-frequency sensing conditions.
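A schematic version of this replan–align–move cycle is sketched below; the robot and planner interfaces (align_heading, advance, front_region_blocked, and so on) are hypothetical stand-ins, not the actual control API.

```python
# Sketch of the conservative replan-align-move cycle under ERC's low-frequency
# sensing. All robot/planner methods are hypothetical placeholders.
import time

STEP_DISTANCE_M = 1.0      # advance 1 m before re-evaluating

def navigation_loop(robot, planner, get_frame, done):
    while not done():
        frame = get_frame()                      # latest RGB frame (~3-5 Hz)
        path = planner.plan(frame)               # replan from the newest observation
        next_wp = path[0]

        robot.align_heading(next_wp)             # waypoint-tracking controller
        robot.wait_for_visual_confirmation()     # e.g., epipolar-geometry check

        # Inference is faster (~10 Hz) than the frame rate, so every frame can
        # be collision-checked before committing to the next step.
        if planner.front_region_blocked(frame):
            robot.stop()
            continue                             # hazard ahead: halt and replan

        robot.advance(STEP_DISTANCE_M)           # conservative 1 m step, then re-evaluate
        time.sleep(0.1)
```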
When the local planner stalls or drifts off the road, we trigger a lightweight two-stage recovery routine driven by a vision-language model (VLM). Unlike traditional escape behaviors that rely on hand-tuned heuristics or metric maps, this approach uses semantic reasoning over camera observations to select safe recovery actions.
During normal navigation, the robot continuously samples the front RGB camera and maintains a small FIFO buffer of recent views. At a fixed rate, we query the VLM using the latest frame plus short temporal context to classify the robot’s status as on-road or off-road. A majority vote across recent classifications suppresses noise from transient occlusions and lighting changes. If the VLM flags off-road in two consecutive evaluations, recovery mode is activated.
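The following sketch captures that trigger logic (FIFO frame buffer, periodic VLM queries, majority vote, two consecutive off-road verdicts); the classify_with_vlm callable is a placeholder for whatever VLM endpoint is used, not the actual GeNIE code.

```python
# Sketch of the off-road monitor: buffer recent frames, query the VLM at a fixed
# rate, majority-vote over recent classifications, and trigger recovery after
# two consecutive off-road verdicts.
from collections import Counter, deque

class OffRoadMonitor:
    def __init__(self, classify_with_vlm, buffer_size=5, vote_window=3):
        self.classify = classify_with_vlm         # frames -> "on-road" | "off-road"
        self.frames = deque(maxlen=buffer_size)   # FIFO of recent front-camera views
        self.votes = deque(maxlen=vote_window)    # recent per-query classifications
        self.consecutive_off = 0

    def add_frame(self, frame):
        self.frames.append(frame)

    def evaluate(self) -> bool:
        """Run one VLM query; return True when recovery mode should activate."""
        label = self.classify(list(self.frames))              # latest frame + short context
        self.votes.append(label)
        majority = Counter(self.votes).most_common(1)[0][0]   # suppress transient noise
        self.consecutive_off = self.consecutive_off + 1 if majority == "off-road" else 0
        return self.consecutive_off >= 2                      # two consecutive off-road flags
```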
Recovery consists of two steps:
This design enables reliable recovery using only onboard perception: no GPS, metric maps, or handcrafted escape behaviors are required. In qualitative field trials, the method successfully escaped dead ends and roadside drifts in roughly 80% of cases, showing strong potential for language-guided corrective navigation.
Beyond outdoor navigation, GeNIE demonstrates strong generalization to diverse environments and robot embodiments, including indoor settings and different hardware platforms.
If you find our work useful, please consider citing:
@article{wang2025genie,
  title={GeNIE: A Generalizable Navigation System for In-the-Wild Environments},
  author={Wang, Jiaming and Liu, Diwen and Chen, Jizhuo and Da, Jiaxuan and Qian, Nuowen and Man, Tram Minh and Soh, Harold},
  journal={arXiv preprint arXiv:2506.17960},
  year={2025}
}