*: Equal contribution
1KAIST, 2Ghent University
| Project Page | arXiv | Code |
Our method aligns and flattens Gaussian covariances to scene surfaces inferred from monocular surface normal estimates.
Our method jointly reconstructs the static scene together with dynamic objects such as cars, which can then be relocated arbitrarily.
Neural rendering-based urban scene reconstruction methods commonly rely on images collected from driving vehicles with cameras facing and moving forward. Although these methods can successfully synthesize views similar to the training camera trajectory, directing the novel view outside the training camera distribution does not guarantee on-par performance. In this paper, we tackle the Extrapolated View Synthesis (EVS) problem by evaluating the reconstructions on views such as looking left, right or downwards with respect to the training camera distribution. To improve rendering quality for EVS, we initialize our model by constructing a dense LiDAR map, and propose to leverage prior scene knowledge from a surface normal estimator and a large-scale diffusion model. Qualitative and quantitative comparisons demonstrate the effectiveness of our method on EVS. To the best of our knowledge, we are the first to address the EVS problem in urban scene reconstruction. We will release the code upon acceptance.
The software requirements are the following:
- Conda (recommended for easy setup)
- C++ Compiler for PyTorch extensions
- CUDA toolkit 11.8 for PyTorch extensions
- C++ Compiler and CUDA SDK must be compatible
Please refer to the original 3D Gaussian Splatting repository for more details about requirements.
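As a quick sanity check on the CUDA requirement above, the following is a minimal sketch that parses the release number out of `nvcc --version` output. It is not part of this repository; the helper names `parse_nvcc_release` and `matches_required_cuda` are hypothetical, and the sample string only illustrates the usual shape of the `nvcc` release line.

```python
import re


def parse_nvcc_release(version_text):
    """Return (major, minor) parsed from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def matches_required_cuda(version_text, required=(11, 8)):
    """True if the reported toolkit release equals the required one."""
    return parse_nvcc_release(version_text) == required


# Illustrative sample of the release line printed by nvcc.
sample = "Cuda compilation tools, release 11.8, V11.8.89"
print(matches_required_cuda(sample))
```

To check a real installation, pass the text returned by `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout` instead of the sample string.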
# HTTPS
git clone https://github.com/deepshwang/vegs.git --recursive
or
# SSH
git clone [email protected]:deepshwang/vegs.git --recursive
Create and activate the environment with the required packages installed.
conda env create -f environment.yml
conda activate vegs
We provide a training pipeline for the KITTI-360 dataset. Please refer to the data documentation for details on the data structure.
Register and log in on the KITTI-360 page, then download the following data.
KITTI-360
└───calibration
└───data_2d_raw
│ └───2013_05_28_drive_{seq:0>4}_sync
└───data_3d_semantics
│ └───train
│ └───static
│ └───{start_frame:0>10}_{end_frame:0>10}.ply
│ └───dynamic
│ └───{start_frame:0>10}_{end_frame:0>10}.ply
└───data_3d_bboxes
│ └───train
│ └───2013_05_28_drive_{seq:0>4}_sync.xml
│ └───train_full
│ └───2013_05_28_drive_{seq:0>4}_sync.xml
└───data_poses
│ └───2013_05_28_drive_{seq:0>4}_sync
Since each sequence is too large to construct as a single scene model, we use scene segments pre-divided by frames, start_frame and end_frame.
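The placeholders in the directory tree, such as {seq:0>4} and {start_frame:0>10}, follow Python's format-spec syntax (zero-fill, right-align, fixed width). A minimal sketch of how they expand is shown here; the constant and function names are hypothetical and only the name templates themselves come from the tree above.

```python
# Name templates copied from the directory tree above.
SEQUENCE_DIR = "2013_05_28_drive_{seq:0>4}_sync"
SEGMENT_FILE = "{start_frame:0>10}_{end_frame:0>10}.ply"


def sequence_dir_name(seq):
    """Expand the sequence folder name, zero-padding the index to 4 digits."""
    return SEQUENCE_DIR.format(seq=seq)


def segment_file_name(start_frame, end_frame):
    """Expand the segment file name, zero-padding each frame to 10 digits."""
    return SEGMENT_FILE.format(start_frame=start_frame, end_frame=end_frame)


print(sequence_dir_name(9))
print(segment_file_name(3972, 4258))
```

For sequence 9 and frames 3972 to 4258 this yields 2013_05_28_drive_0009_sync and 0000003972_0000004258.ply.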
In addition to the LiDAR map, we use points triangulated from training images. To prepare the points, run the following command. (COLMAP must be installed to run)
python triangulate.py --data_dir ${KITTI360_DIR}
where ${KITTI360_DIR} is the KITTI-360 data directory. By default, the script triangulates all scene segments in the data and saves the results in the data_3d_colmap and data_3d_colmap_processed folders under the KITTI-360 data directory.
Alternatively, you may download the precomputed points from here and save them into ${KITTI360_DIR}/data_3d_colmap_processed
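Because triangulation depends on COLMAP, it can help to check for it before launching the script. The sketch below is an assumption-laden convenience wrapper, not repository code: it assumes the COLMAP executable is named `colmap` and is on `PATH`, and the function names are hypothetical. Only `triangulate.py` and its `--data_dir` flag come from the command above.

```python
import shutil


def colmap_available(executable="colmap"):
    """True if the COLMAP executable (assumed name: 'colmap') is on PATH."""
    return shutil.which(executable) is not None


def triangulate_command(kitti360_dir):
    """Build the argument list for the triangulation command shown above."""
    return ["python", "triangulate.py", "--data_dir", str(kitti360_dir)]


if not colmap_available():
    print("COLMAP not found on PATH; install it or download the points.")
print(" ".join(triangulate_command("./KITTI-360")))
```

The returned list can be passed to `subprocess.run(..., check=True)` from the repository root.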
We use omnidata for monocular surface normal estimation. Please download the pretrained model and place it at omnidata/pretrained_models/omnidata_dpt_normal_v2.ckpt. To prepare the data, run the following script, which saves the monocular surface normal estimates in data_2d_normal_omnidata_all under the KITTI-360 data directory.
bash bash_scripts/normal_preprocess_kitti360.sh ${GPU_NUM} ${KITTI360_DIR}
You may download pre-calculated monocular surface normal estimates from here, and save them into ${KITTI360_DIR}/data_2d_normal_omnidata_all.
Note that the file only contains the frame segment from 3972 to 4258 in sequence 0009, as the files for all sequences are too large.
To prepare the dataset for LoRA training, run the following command.
bash bash_scripts/lora_preprocess_kitti360.sh
This will prepare a square-cropped dataset and save it into lora/data/kitti360.
By default, this will prepare images for the scene segments listed in lora/data/kitti360/2013_05_28_drive_train_dynamic_vehicle_human_track_num_vehicles.txt, which includes scene fragments where vehicles are the only dynamic objects in the scene (our method cannot handle topologically-varying dynamic objects such as walking people). You may edit the text file to process only the scene segment of interest.
We use diffusers to fine-tune Stable Diffusion with LoRA. To train, run the following command.
bash bash_scripts/lora_train_kitti360.sh ${GPU_NUM}
By default, the script will train fine-tuned models for all scene segments listed in lora/data/kitti360/2013_05_28_drive_train_dynamic_vehicle_human_track_num_vehicles.txt.
You may download pre-trained LoRA weights from here and unzip them under lora/models/kitti360. Again, we only provide models for the scene segments listed in lora/data/kitti360/2013_05_28_drive_train_dynamic_vehicle_human_track_num_vehicles.txt.
To train VEGS for a scene segment of interest, run the following command.
bash bash_scripts/train_kitti360.sh ${GPU_NUM} ${DATA_PATH} ${SEQUENCE} ${START_FRAME} ${END_FRAME} ${EXPERIMENT_NOTE}
Parameter | Description | Default |
---|---|---|
${GPU_NUM} | Index of the GPU to use. | 0 |
${DATA_PATH} | Path to the KITTI-360 data directory. | ./KITTI-360 |
${SEQUENCE} | Index of the sequence to train on. | 0009 |
${START_FRAME} | Start frame number of the frame segment. | 3972 |
${END_FRAME} | End frame number of the frame segment. | 4258 |
${EXPERIMENT_NOTE} | Optional note for the run. The note is included in the name of the folder where the model is saved. | "" |
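For scripted runs over several segments, the positional arguments can be assembled programmatically. The following is a minimal sketch, assuming the script takes its arguments in the order shown in the command above; `build_train_command` is a hypothetical helper, and its defaults simply mirror the Default column of the table.

```python
def build_train_command(gpu_num=0, data_path="./KITTI-360", sequence="0009",
                        start_frame=3972, end_frame=4258, note=""):
    """Assemble the training command as an argument list.

    Arguments are positional and follow the order of the command above.
    """
    return [
        "bash", "bash_scripts/train_kitti360.sh",
        str(gpu_num), str(data_path), str(sequence),
        str(start_frame), str(end_frame), note,
    ]


# Default segment: sequence 0009, frames 3972 to 4258, on GPU 0.
print(build_train_command())
```

Passing the list to `subprocess.run(cmd, check=True)` from the repository root launches training; keeping the command as a list preserves an empty note as an explicit empty argument.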
The trained model and images rendered from conventional and extrapolated cameras will be saved in output.
We also provide a script to render and save from camera trajectories, along with novel cameras interpolated between adjacent pairs of the cameras within the trajectory for smooth video rendering.
bash bash_scripts/render_video.sh ${GPU_NUM} ${MODEL_PATH}
where ${MODEL_PATH} is the path of the trained Gaussian model. Running the script will give you smooth video renderings from both interpolated and extrapolated views.
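To illustrate what interpolating novel cameras between adjacent pairs of a trajectory can look like, here is a minimal, self-contained sketch. It is not the repository's implementation, whose interpolation scheme is not described here: it assumes each camera is a (position, unit quaternion in w, x, y, z order) pair, interpolates positions linearly and rotations spherically, and all function names are hypothetical.

```python
import math


def lerp(a, b, t):
    """Linear interpolation between two equal-length vectors."""
    return [x + (y - x) * t for x, y in zip(a, b)]


def slerp(q0, q1, t):
    """Spherical interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:  # flip one quaternion to follow the shorter arc
        q1 = [-c for c in q1]
        dot = -dot
    if dot > 0.9995:  # nearly identical rotations: normalized lerp is stable
        q = lerp(q0, q1, t)
    else:
        theta = math.acos(dot)
        s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
        s1 = math.sin(t * theta) / math.sin(theta)
        q = [s0 * a + s1 * b for a, b in zip(q0, q1)]
    norm = math.sqrt(sum(c * c for c in q))
    return [c / norm for c in q]


def interpolate_trajectory(cameras, steps_between):
    """Insert `steps_between` poses between each adjacent camera pair."""
    if not cameras:
        return []
    out = []
    for (p0, q0), (p1, q1) in zip(cameras, cameras[1:]):
        for i in range(steps_between + 1):
            t = i / (steps_between + 1)
            out.append((lerp(p0, p1, t), slerp(q0, q1, t)))
    out.append(cameras[-1])  # keep the final original camera
    return out


cams = [([0, 0, 0], [1, 0, 0, 0]), ([2, 0, 0], [1, 0, 0, 0])]
print(len(interpolate_trajectory(cams, 1)))
```

A trajectory of N cameras with k poses inserted per pair yields (N - 1) * (k + 1) + 1 poses, so denser interpolation gives smoother video at the cost of more rendered frames.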