Dawei Zhang¹*,
Nuo Chen³*,
Shuo Liu²,
Roberto Tron¹,
Zhiwen Fan³
¹Boston University
²Boston University Mechanical Engineering
³Texas A&M University ECE
*Equal contribution.
SafeDF is an online monocular perception-to-control framework that embeds semantic risk into the distance field used by Control Barrier Function (CBF)-based safe navigation and teleoperation. From RGB video, a foundation-model-based SLAM front end reconstructs dense 3-D geometry, while semantic observations are fused into the reconstructed scene. The resulting geometric-semantic representation is converted into a semantic-aware ESDF, where semantic labels identify safety-relevant regions and impose class-dependent inflation before field computation.
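To make the idea of class-dependent inflation before field computation concrete, the following is a minimal 2-D sketch, not the SafeDF implementation. It assumes a labelled occupancy grid, brute-force distance computation, and hypothetical class names and inflation radii; the function names `inflate_by_class` and `signed_distance_field` are introduced here for illustration only.

```python
import math


def inflate_by_class(labels, voxel_m, inflation_m):
    """Grow each labelled cell by its class radius and return the occupied cells.

    labels: 2-D list whose cells hold a class name, or None for free space.
    inflation_m: hypothetical map from class name to inflation radius in metres.
    """
    rows, cols = len(labels), len(labels[0])
    occupied = set()
    for r, row in enumerate(labels):
        for c, cls in enumerate(row):
            if cls is None:
                continue
            radius = inflation_m.get(cls, 0.0)
            reach = int(radius / voxel_m + 1e-9)  # radius in whole cells
            for dr in range(-reach, reach + 1):
                for dc in range(-reach, reach + 1):
                    rr, cc = r + dr, c + dc
                    if not (0 <= rr < rows and 0 <= cc < cols):
                        continue
                    if math.hypot(dr, dc) * voxel_m <= radius + 1e-9:
                        occupied.add((rr, cc))
    return occupied


def signed_distance_field(occupied, rows, cols, voxel_m):
    """Brute-force signed distance in metres, measured between cell centres.

    Free cells get the positive distance to the nearest occupied cell;
    occupied cells get the negative distance to the nearest free cell.
    """
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    free = [cell for cell in cells if cell not in occupied]
    field = [[0.0] * cols for _ in range(rows)]
    for r, c in cells:
        inside = (r, c) in occupied
        targets = free if inside else occupied
        if not targets:
            field[r][c] = -math.inf if inside else math.inf
            continue
        nearest = min(math.hypot(r - tr, c - tc) for tr, tc in targets)
        field[r][c] = (-nearest if inside else nearest) * voxel_m
    return field


# A one-row corridor: a wall on the left, a person on the right.
labels = [["wall", None, None, None, None, None, "person"]]
occupied = inflate_by_class(labels, voxel_m=0.1,
                            inflation_m={"wall": 0.0, "person": 0.2})
field = signed_distance_field(occupied, rows=1, cols=7, voxel_m=0.1)
```

In this toy corridor the person is inflated by two cells while the wall is not, so the clearance reported at the middle cell shrinks from 0.3 m to 0.1 m. A controller that reads the field therefore keeps a larger margin around the higher-risk class without any change to the controller itself.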
SafeDF expects model weights in the following locations:
checkpoints/MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric.pth
checkpoints/MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric_retrieval_trainingfree.pth
checkpoints/MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric_retrieval_codebook.pkl
efficientvit/l2.pt
Download the MASt3R weights from the MASt3R release and the EfficientViT ADE20K segmentation weights from Hugging Face:
mkdir -p checkpoints efficientvit
wget https://download.europe.naverlabs.com/ComputerVision/MASt3R/MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric.pth \
-P checkpoints/
wget https://download.europe.naverlabs.com/ComputerVision/MASt3R/MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric_retrieval_trainingfree.pth \
-P checkpoints/
wget https://download.europe.naverlabs.com/ComputerVision/MASt3R/MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric_retrieval_codebook.pkl \
-P checkpoints/
wget https://huggingface.co/han-cai/efficientvit-seg/resolve/main/efficientvit_seg_l2_ade20k.pt \
-O efficientvit/l2.pt
The simulation benchmark uses the six ScanNet++ scenes listed below. Download
ScanNet++ from the official dataset page
and arrange each scene as ${DATA_ROOT}/${SCENE}/iphone/rgb with the aligned
mesh at ${DATA_ROOT}/${SCENE}/scans/mesh_aligned_0.05.ply.
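Before starting a long run, it can help to confirm that the weights and the scene layout described above are in place. The following preflight sketch only checks for the paths named in this README; the helper name `missing_paths` is hypothetical and the script is not part of the SafeDF repository.

```python
from pathlib import Path

# Weight locations taken from the list above, relative to the repository root.
WEIGHTS = [
    "checkpoints/MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric.pth",
    "checkpoints/MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric_retrieval_trainingfree.pth",
    "checkpoints/MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric_retrieval_codebook.pkl",
    "efficientvit/l2.pt",
]

# The six ScanNet++ scene IDs used by the simulation benchmark.
SCENES = [
    "281bc17764", "689fec23d7", "7cd2ac43b4",
    "8a20d62ac0", "b26e64c4b0", "bc03d88fc3",
]


def missing_paths(repo_root, data_root, scenes=SCENES):
    """Return every expected weight file, RGB folder, or mesh that is absent."""
    repo_root, data_root = Path(repo_root), Path(data_root)
    missing = [repo_root / w for w in WEIGHTS if not (repo_root / w).is_file()]
    for scene in scenes:
        rgb_dir = data_root / scene / "iphone" / "rgb"
        mesh = data_root / scene / "scans" / "mesh_aligned_0.05.ply"
        if not rgb_dir.is_dir():
            missing.append(rgb_dir)
        if not mesh.is_file():
            missing.append(mesh)
    return missing


if __name__ == "__main__":
    # Replace the placeholder data root with the local ScanNet++ location.
    for path in missing_paths(".", "/path/to/scannetpp/data"):
        print("missing:", path)
```

Run it from the repository root; an empty output means every expected path was found.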
Create a clean environment named safedf:
git clone https://github.com/phai-lab/SafeDF.git
cd SafeDF
conda create -n safedf python=3.11 -y
conda activate safedf
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.1 -c pytorch -c nvidia -y
pip install -e thirdparty/mast3r
pip install -e thirdparty/in3d
pip install -e efficientvit
pip install -r requirements_semantic_esdf.txt
pip install --no-build-isolation -e .
To quickly try SafeDF on a monocular RGB video, run the semantic ESDF reconstruction with the non-subsampled configuration:
python main_semantic.py \
--dataset /path/to/video.mp4 \
--config config/base.yaml \
--save-as example_mp4 \
--no-viz \
--stream_img_size 224 \
--efficientvit_dataset ade20k \
--enable_semantic_input \
--use_semantic_in_geo \
--use_stable_semantic_in_geo \
--semantic_beta 0.2 \
--planning_pointcloud_outdir logs/example_mp4_esdf \
--enable_planning_tsdf_publish \
--planning_tsdf_radius_m 2.5 \
--planning_tsdf_voxel_m 0.05 \
--planning_tsdf_use_semantic \
--planning_tsdf_semantic_band_m 0.1 \
--planning_pointcloud_no_save_images
This writes the semantic ESDF snapshot to:
logs/example_mp4_esdf/global_esdf_snapshot.npz
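The array names stored in the snapshot are not documented in this section, so a reasonable first step is to list them rather than assume any. Because an .npz file is a zip archive of .npy members, the sketch below does this with the standard library alone; the helper name `list_npz_arrays` is hypothetical.

```python
import zipfile
from pathlib import Path


def list_npz_arrays(snapshot_path):
    """Map each array name in an .npz archive to its uncompressed size in bytes."""
    with zipfile.ZipFile(snapshot_path) as archive:
        return {
            Path(info.filename).stem: info.file_size
            for info in archive.infolist()
            if info.filename.endswith(".npy")
        }


if __name__ == "__main__":
    snapshot = Path("logs/example_mp4_esdf/global_esdf_snapshot.npz")
    if snapshot.is_file():
        for name, size in list_npz_arrays(snapshot).items():
            print(f"{name}: {size} bytes")
```

Once the names are known, `numpy.load` on the same path returns the arrays themselves, keyed by those names.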
For a higher-fidelity reconstruction, use a larger MASt3R input resolution and a smaller TSDF voxel size:
python main_semantic.py \
--dataset /path/to/video.mp4 \
--config config/base_subsample10.yaml \
--save-as example \
--no-viz \
--stream_img_size 512 \
--efficientvit_dataset ade20k \
--enable_semantic_input \
--use_semantic_in_geo \
--use_stable_semantic_in_geo \
--semantic_beta 0.2 \
--planning_pointcloud_outdir logs/example_semantic_esdf \
--enable_planning_tsdf_publish \
--planning_tsdf_radius_m 2.5 \
--planning_tsdf_voxel_m 0.025 \
--planning_tsdf_use_semantic \
--planning_tsdf_semantic_band_m 0.1 \
--planning_pointcloud_no_save_images
This writes:
logs/example_semantic_esdf/global_esdf_snapshot.npz
Interactive Viser visualization:
python scripts/view_esdf_snapshot_viser.py \
--snapshot logs/example_semantic_esdf/global_esdf_snapshot.npz \
--mesh-style semantic \
--host 127.0.0.1 \
--port 8080
Render an RGB/semantic-ESDF comparison video with the same camera trajectory:
python scripts/render_rgb_esdf_side_by_side.py \
--video /path/to/video.mp4 \
--snapshot logs/example_semantic_esdf/global_esdf_snapshot.npz \
--traj logs/example_dense.txt \
--out logs/example_semantic_esdf/rgb_semantic_esdf.mp4 \
--size 720 \
--semantic-mesh \
--background 0.88,0.89,0.90 \
--lighting-profile soft \
--sun-intensity 22000 \
--ibl-intensity 6000 \
--roughness 1.0 \
--match-video-length
Batch processing for a folder of videos:
bash scripts/run_video_esdf_batch.sh /path/to/videos logs/video_esdf_batch
For quality-oriented offline rendering, override the resolution and voxel size:
STREAM_IMG_SIZE=512 TSDF_VOXEL_M=0.025 \
bash scripts/run_video_esdf_batch.sh /path/to/videos logs/video_esdf_batch_hifi
The simulation benchmark uses six ScanNet++ scenes and reports the released SafeDF table artifacts. This repository provides the semantic ESDF reconstruction path, scene-specific semantic risk export, and table verification scripts.
Set the common paths first:
MAST3R_ROOT=/path/to/SafeDF
DATA_ROOT=/path/to/scannetpp/data
RISK_OUT=${MAST3R_ROOT}/outputs/semantic_risk_groups_balanced
SCENES=(281bc17764 689fec23d7 7cd2ac43b4 8a20d62ac0 b26e64c4b0 bc03d88fc3)
Reconstruct the ESDF snapshots for the six ScanNet++ scenes:
conda activate safedf
cd ${MAST3R_ROOT}
for SCENE in "${SCENES[@]}"; do
python main_semantic.py \
--dataset "${DATA_ROOT}/${SCENE}/iphone/rgb" \
--config config/base_subsample10.yaml \
--no-viz \
--efficientvit_dataset ade20k \
--enable_semantic_input \
--use_semantic_in_geo \
--use_stable_semantic_in_geo \
--semantic_beta 0.2 \
--planning_pointcloud_outdir "${MAST3R_ROOT}/logs/planning_pointcloud_scannetpp_${SCENE}_iphone" \
--enable_planning_tsdf_publish \
--planning_tsdf_radius_m 2.5 \
--planning_tsdf_voxel_m 0.05 \
--planning_tsdf_use_semantic \
--planning_tsdf_semantic_band_m 0.1
done
Export the scene-specific semantic risk groups. The mapping CSV in
resources/mesh_to_ade20k_all.csv maps ScanNet++ mesh labels to the ADE20K
labels predicted by EfficientViT; the export script then balances scene objects
into low-, mid-, and high-risk groups for the benchmark.
conda activate safedf
cd ${MAST3R_ROOT}
python scripts/export_scene_specific_semantic_risk_configs.py \
--data-root "${DATA_ROOT}" \
--mapping-csv "${MAST3R_ROOT}/resources/mesh_to_ade20k_all.csv" \
--outdir "${RISK_OUT}" \
--scene-id 281bc17764 \
--scene-id 689fec23d7 \
--scene-id 7cd2ac43b4 \
--scene-id 8a20d62ac0 \
--scene-id b26e64c4b0 \
--scene-id bc03d88fc3
For LIMO-style deployment, SafeDF can read RGB frames from a ZMQ bridge and optionally publish joystick velocity commands:
python main_semantic.py \
--input_source limo_zmq \
--bridge_ip 127.0.0.1 \
--bridge_vid_port 5555 \
--bridge_cmd_port 5556 \
--enable_joystick \
--config config/base.yaml \
--stream_img_size 224 \
--efficientvit_dataset ade20k \
--enable_semantic_input \
--use_semantic_in_geo \
--use_stable_semantic_in_geo \
--semantic_beta 0.2 \
--planning_pointcloud_outdir logs/limo_semantic_esdf \
--enable_planning_tsdf_publish \
--planning_tsdf_radius_m 2.5 \
--planning_tsdf_voxel_m 0.05 \
--planning_tsdf_use_semantic \
--planning_tsdf_semantic_band_m 0.1
Use --stream_img_size 512 and --planning_tsdf_voxel_m 0.025 only when
prioritizing reconstruction quality over online speed. The robot-side ROS/ZMQ
bridge and controller parameters are deployment-specific; this repository
exposes the reconstruction and semantic ESDF side used by the hardware
experiments.
@misc{zhang2026embeddingsemanticriskdistance,
title={Embedding Semantic Risk into Distance Fields and CBFs for Online Monocular Safe Control},
author={Dawei Zhang and Nuo Chen and Shuo Liu and Roberto Tron and Zhiwen Fan},
year={2026},
eprint={2606.01605},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2606.01605},
}
This code builds on MASt3R-SLAM for online monocular dense SLAM and EfficientViT for semantic segmentation. We thank the authors of these projects for releasing their code and models.