This repository accompanies our IEEE Access paper "Visual Relationship Detection with Visual-Linguistic Knowledge from Multimodal Representations" and contains the validation-experiment code and trained models for the SpatialSense and VRD datasets.
This project is built with Python 3.6, PyTorch 1.1.0, and CUDA 9.0, and is largely based on VL-BERT.
Please follow the original VL-BERT instructions to set up a conda environment.
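A minimal setup sketch, assuming the versions stated above; the authoritative package list lives in the VL-BERT instructions, so treat `requirements.txt` below as a stand-in for whatever that guide specifies:

```bash
# Create and activate a Python 3.6 environment (versions from this README)
conda create -n rvl-bert python=3.6 -y
conda activate rvl-bert

# PyTorch 1.1.0 built against CUDA 9.0
conda install pytorch=1.1.0 cudatoolkit=9.0 -c pytorch -y

# Remaining dependencies per the VL-BERT setup guide
pip install -r requirements.txt
```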
- Download the SpatialSense dataset here.
- Put the files under `$RVL_BERT_ROOT/data/spasen` and unzip `images.tar.gz` as `images/` there. Ensure there are two folders (`flickr/` and `nyu/`) below `$RVL_BERT_ROOT/data/spasen/images/`.
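A quick sketch of the unpack-and-check steps, assuming `$RVL_BERT_ROOT` is set to the repository root:

```bash
cd $RVL_BERT_ROOT/data/spasen

# Extract the image archive into images/
tar -xzf images.tar.gz

# Both source folders should be present
ls images/   # expect: flickr  nyu
```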
- Download the VRD dataset: images (Backup: download `sg_dataset.zip` from Baidu) and annotations.
- Put the `sg_train_images/` and `sg_test_images/` folders under `$RVL_BERT_ROOT/data/vrd/images`.
- Put all `.json` files under `$RVL_BERT_ROOT/data/vrd/`.
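Similarly, a short check for the VRD layout (folder names are taken from the steps above):

```bash
cd $RVL_BERT_ROOT/data/vrd

# Train/test image folders and the annotation files should sit here
ls images/   # expect: sg_train_images  sg_test_images
ls *.json    # annotation files
```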
Download the pretrained weights here and put the `pretrained_model/` folder under `$RVL_BERT_ROOT/model/`.
Download the trained checkpoint here and put the `.model` file under `$RVL_BERT_ROOT/checkpoints/spasen/`.
Download the trained checkpoints and put the `.model` files under `$RVL_BERT_ROOT/checkpoints/vrd/`:
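Based on the checkpoint names used in the test commands below, the resulting layout should look like this:

```
checkpoints/
├── spasen/
│   └── full-model-e44.model
└── vrd/
    ├── basic-e59.model
    ├── basic-vl-e59.model
    ├── basic-vl-s-e59.model
    └── basic-vl-s-m-e59.model
```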
Run the following commands to reproduce the experiment results. A single GPU (NVIDIA Quadro RTX 6000, 24 GB memory) is used by default.
- Full model (SpatialSense):
```bash
python spasen/test.py --cfg cfgs/spasen/full-model.yaml --ckpt checkpoints/spasen/full-model-e44.model --bs 8 --gpus 0 --model-dir ./ --result-path results/ --result-name spasen_full_model --split test --log-dir logs
```
- Basic model (VRD):
```bash
python vrd/test.py --cfg cfgs/vrd/basic.yaml --ckpt checkpoints/vrd/basic-e59.model --bs 1 --gpus 0 --model-dir ./ --result-path results/ --result-name vrd_basic --split test --log-dir logs/
```
- Basic model + Visual-Linguistic Commonsense Knowledge:
```bash
python vrd/test.py --cfg cfgs/vrd/basic_vl.yaml --ckpt checkpoints/vrd/basic-vl-e59.model --bs 1 --gpus 0 --model-dir ./ --result-path results/ --result-name vrd_basic_vl --split test --log-dir logs/
```
- Basic model + Visual-Linguistic Commonsense Knowledge + Spatial Module:
```bash
python vrd/test.py --cfg cfgs/vrd/basic_vl_s.yaml --ckpt checkpoints/vrd/basic-vl-s-e59.model --bs 1 --gpus 0 --model-dir ./ --result-path results/ --result-name vrd_basic_vl_s --split test --log-dir logs/
```
- Full model:
```bash
python vrd/test.py --cfg cfgs/vrd/basic_vl_s_m.yaml --ckpt checkpoints/vrd/basic-vl-s-m-e59.model --bs 1 --gpus 0 --model-dir ./ --result-path results/ --result-name vrd_full_model --split test --log-dir logs/
```
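To run all four VRD variants in one go, here is a small wrapper sketch; the config/checkpoint pairs are exactly the ones listed above, but the loop itself is not part of the original scripts:

```bash
#!/usr/bin/env bash
# Run every VRD ablation listed above, pairing each config with its checkpoint.
for name in basic basic-vl basic-vl-s basic-vl-s-m; do
  cfg="cfgs/vrd/${name//-/_}.yaml"          # e.g. basic-vl -> cfgs/vrd/basic_vl.yaml
  ckpt="checkpoints/vrd/${name}-e59.model"  # e.g. checkpoints/vrd/basic-vl-e59.model
  python vrd/test.py --cfg "$cfg" --ckpt "$ckpt" --bs 1 --gpus 0 \
    --model-dir ./ --result-path results/ --result-name "vrd_${name//-/_}" \
    --split test --log-dir logs/
done
```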
This repository is mainly based on VL-BERT.
Please cite our paper if you find the paper or our code helpful for your research!
```bibtex
@ARTICLE{9387302,
  author={M. -J. {Chiou} and R. {Zimmermann} and J. {Feng}},
  journal={IEEE Access},
  title={Visual Relationship Detection With Visual-Linguistic Knowledge From Multimodal Representations},
  year={2021},
  volume={9},
  number={},
  pages={50441-50451},
  doi={10.1109/ACCESS.2021.3069041}
}
```