This repo is the official implementation for PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation. The paper has been accepted to CVPR 2023.
arXiv / project page / video
[2024.06.16] Sorry for the huge delay. The code and pre-trained model for MPI-INF-3DHP have been released. Please check here.
[2024.02.06] The environment requirements are updated. Also, check our NeurIPS 2023 paper ContextAware-PoseFormer (It outperforms sequence-based models with a single video frame as input)!
[2023.06.16] Code for in-the-wild video demos is released!
[2023.05.31] We have a narrated video introduction. Please check here.
[2023.03.31] Our paper on arXiv is ready!
[2023.03.28] We built a project page with more descriptions and video demos.
PoseFormerV2 is built upon PoseFormer. It improves PoseFormer's efficiency in processing long input sequences and its robustness to noisy 2D joint detections via a frequency-domain joint sequence representation.
Abstract. Recently, transformer-based methods have gained significant success in sequential 2D-to-3D lifting human pose estimation. As a pioneering work, PoseFormer captures spatial relations of human joints in each video frame and human dynamics across frames with cascaded transformer layers and has achieved impressive performance. However, in real scenarios, the performance of PoseFormer and its follow-ups is limited by two factors: (a) The length of the input joint sequence; (b) The quality of 2D joint detection. Existing methods typically apply self-attention to all frames of the input sequence, causing a huge computational burden when the frame number is increased to obtain advanced estimation accuracy, and they are not robust to noise naturally brought by the limited capability of 2D joint detectors. In this paper, we propose PoseFormerV2, which exploits a compact representation of lengthy skeleton sequences in the frequency domain to efficiently scale up the receptive field and boost robustness to noisy 2D joint detection. With minimum modifications to PoseFormer, the proposed method effectively fuses features both in the time domain and frequency domain, enjoying a better speed-accuracy trade-off than its precursor. Extensive experiments on two benchmark datasets (i.e., Human3.6M and MPI-INF-3DHP) demonstrate that the proposed approach significantly outperforms the original PoseFormer and other transformer-based variants.
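For intuition, here is a minimal, self-contained sketch of the idea (illustrative only, not the repository's code): a 2D joint trajectory is transformed with the DCT along the time axis, only a few low-frequency coefficients are kept as a compact sequence representation, and the inverse transform shows how much of the (smooth) motion those few coefficients retain. All shapes and the toy trajectory below are assumptions made for this example.

```python
# Illustrative sketch of the frequency-domain representation (not the repository's code).
import numpy as np
from scipy.fft import dct, idct

T, J, C = 81, 17, 2                                                  # frames, joints, (x, y)
np.random.seed(0)
clean = np.linspace(0, 1, T)[:, None, None] * np.ones((T, J, C))     # smooth toy trajectory
noisy = clean + 0.1 * np.random.randn(T, J, C)                       # simulated detector noise

n_coeff = 3
coeffs = dct(noisy, axis=0, norm='ortho')                            # to frequency domain along time
compact = coeffs[:n_coeff]                                           # compact representation: (3, J, C) instead of (81, J, C)

# Zero-pad and invert only to visualize what the low frequencies keep.
padded = np.zeros_like(coeffs)
padded[:n_coeff] = compact
recon = idct(padded, axis=0, norm='ortho')

print('mean abs. error, noisy input   :', np.abs(noisy - clean).mean())
print('mean abs. error, low-freq recon:', np.abs(recon - clean).mean())
```

Keeping a handful of low-frequency coefficients both shrinks the sequence that the transformer has to attend over and filters out much of the high-frequency detection jitter, which is the speed-accuracy and robustness trade-off exploited in the paper.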
If you find PoseFormerV2 useful in your research, please consider citing:
@InProceedings{Zhao_2023_CVPR,
author = {Zhao, Qitao and Zheng, Ce and Liu, Mengyuan and Wang, Pichao and Chen, Chen},
title = {PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023},
pages = {8877-8886}
}
The code is developed and tested under the following environment:
- Python 3.9
- PyTorch 1.13.0
- CUDA 11.7
conda create -n poseformerv2 python=3.9
conda activate poseformerv2
pip install torch==1.13.0+cu117 torchvision==0.14.0+cu117 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
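Optionally, a quick check (not part of the repository) that the expected PyTorch build and CUDA runtime are visible:

```python
# Optional environment sanity check for the setup above.
import torch

print('PyTorch:', torch.__version__)            # expected: 1.13.0+cu117
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))
```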
Please refer to VideoPose3D to set up the Human3.6M dataset as follows:
code_root/
└── data/
├── data_2d_h36m_gt.npz
├── data_2d_h36m_cpn_ft_h36m_dbb.npz
└── data_3d_h36m.npz
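A small helper like the one below (hypothetical, for convenience only) confirms the files are in place and readable:

```python
# Hypothetical check that the Human3.6M files are where the training code expects them.
import os
import numpy as np

expected = [
    'data/data_2d_h36m_gt.npz',
    'data/data_2d_h36m_cpn_ft_h36m_dbb.npz',
    'data/data_3d_h36m.npz',
]
for path in expected:
    assert os.path.isfile(path), f'Missing dataset file: {path}'
    archive = np.load(path, allow_pickle=True)
    print(path, '->', archive.files)
```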
You can train PoseFormerV2 on a single GPU with the following command:
python run_poseformer.py -g 0 -k cpn_ft_h36m_dbb -frame 27 -frame-kept 3 -coeff-kept 3 -c checkpoint/NAMED_PATH
This example trains PoseFormerV2 with 3 central frames and 3 DCT coefficients kept from a 27-frame sequence. You can set -frame-kept and -coeff-kept to any values you like, as long as they do not exceed the frame number :) A minimal sketch of what these two arguments select is shown below.
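The sketch uses made-up shapes and is not the repository's exact slicing; it only illustrates that -frame-kept picks the f central frames in the time domain while -coeff-kept keeps the n lowest DCT coefficients of the full window:

```python
# Illustrative only: what -frame-kept (f) and -coeff-kept (n) select from a 27-frame window.
import numpy as np
from scipy.fft import dct

seq = np.random.randn(27, 17, 2)                     # hypothetical (frames, joints, xy) window
f, n = 3, 3

center = seq.shape[0] // 2
central = seq[center - f // 2: center + f // 2 + 1]  # f central frames (time domain, odd f assumed)
low_freq = dct(seq, axis=0, norm='ortho')[:n]        # n lowest DCT coefficients (frequency domain)

print(central.shape, low_freq.shape)                 # (3, 17, 2) (3, 17, 2)
```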
We provide pre-trained models with different inputs:
Model | Sequence Length | f (frames kept) | n (DCT coefficients kept) | Depth | Hidden Dim. | MFLOPs | MPJPE (mm) | Download |
---|---|---|---|---|---|---|---|---|
PoseFormerV2 | 27 | 1 | 3 | 4 | 32 | 77.2 | 48.7 | model |
/ | 27 | 3 | 3 | 4 | 32 | 117.3 | 47.9 | model |
/ | 81 | 1 | 3 | 4 | 32 | 77.2 | 47.6 | model |
/ | 81 | 3 | 3 | 4 | 32 | 117.3 | 47.1 | model |
/ | 81 | 9 | 9 | 4 | 32 | 351.7 | 46.0 | model |
/ | 243 | 27 | 27 | 4 | 32 | 1054.8 | 45.2 | model |
You can evaluate PoseFormerV2 with a prepared checkpoint as follows:
python run_poseformer.py -g 0 -k cpn_ft_h36m_dbb -frame 27 -frame-kept 3 -coeff-kept 3 -c checkpoint/ --evaluate NAME_ckpt.bin
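If you want to peek inside a downloaded checkpoint before evaluating, something like this should work, assuming the released .bin files are ordinary torch.save archives (an assumption, not documented behavior):

```python
# Assumes the released .bin checkpoints are ordinary torch.save files (unverified assumption).
import torch

ckpt = torch.load('checkpoint/NAME_ckpt.bin', map_location='cpu')  # NAME_ckpt.bin is a placeholder
if isinstance(ckpt, dict):
    print('top-level keys:', list(ckpt.keys()))
else:
    print('loaded object of type:', type(ckpt))
```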
We followed P-STMO to prepare the data and train our model. Please click here for details.
Our code for in-the-wild video demos is adapted from MHFormer.
First, download the pretrained weights for YOLOv3 (here) and HRNet (here) and put them in the ./demo/lib/checkpoint directory. Then, put your in-the-wild videos in the ./demo/video directory.
NOTE: make sure you have also downloaded the weights for PoseFormerV2! The default path in the code is ./checkpoint, and the default model variant is 27_243_45.2.bin, which uses 243 frames as input.
Run the command below:
python demo/vis.py --video sample_video.mp4
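To process every video in ./demo/video in one go, a convenience loop like the following can be used (hypothetical helper; it relies only on the --video flag, which, as in the example above, is given just the file name):

```python
# Hypothetical convenience loop: run the demo on every .mp4 placed in ./demo/video.
import glob
import os
import subprocess

for path in sorted(glob.glob('demo/video/*.mp4')):
    name = os.path.basename(path)                    # vis.py is given the file name, as in the example
    print('Processing', name)
    subprocess.run(['python', 'demo/vis.py', '--video', name], check=True)
```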
Our code is mainly based on PoseFormer. We follow P-STMO for training on MPI-INF-3DHP and MHFormer for preparing our in-the-wild video demos and visualizations. Many thanks to the authors!