
Modalities beyond text

zhezhaoa edited this page Oct 27, 2023 · 2 revisions

In addition to text, TencentPretrain supports vision, audio, and cross-modal pre-training models. This section shows how to use TencentPretrain to pre-train and fine-tune models for these modalities.


An example of pre-training a ViT model on the CIFAR10 dataset:

python3 --corpus_path datasets/cifar10/train.tsv --tokenizer virtual \
                      --dataset_path --processes_num 8 --data_processor vit

python3 --dataset_path --tokenizer virtual \
                    --pretrained_model_path models/vit_base_patch16_224_model.bin \
                    --config_path models/vit/base-16-224_config.json \
                    --output_model_path models/cifar10_vit_base_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 2000 --save_checkpoint_steps 1000 --batch_size 32 \
                    --labels_num 10
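
For orientation, a ViT consumes a sequence of image patches rather than text tokens. The arithmetic below is a small sketch that assumes the config name base-16-224 means 16x16 patches on a 224x224 input, and that one classification token is prepended; neither detail is stated on this page.

```python
# Assumed reading of "base-16-224": 224x224 input image, 16x16 patches.
image_size = 224
patch_size = 16

patches_per_side = image_size // patch_size
num_patches = patches_per_side ** 2

# Assumption: one extra classification token is prepended to the patch sequence.
seq_length = num_patches + 1

print(patches_per_side, num_patches, seq_length)  # 14 196 197
```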

vit_base_patch16_224_model.bin can be found in the Model Zoo section. --tokenizer virtual is specified because image inputs are not tokenized as text. An example of fine-tuning and running inference on the CIFAR10 dataset:

python3 finetune/ --pretrained_model_path models/vit_base_patch16_224_model.bin \
                                         --tokenizer virtual \
                                         --config_path models/vit/base-16-224_config.json \
                                         --train_path datasets/cifar10/train.tsv \
                                         --dev_path datasets/cifar10/test.tsv \
                                         --output_model_path models/image_classifier_model.bin \
                                         --epochs_num 3 --batch_size 64

python3 inference/ --load_model_path models/image_classifier_model.bin \
                                                --tokenizer virtual \
                                                --config_path models/vit/base-16-224_config.json \
                                                --test_path datasets/cifar10/test.tsv \
                                                --prediction_path datasets/cifar10/prediction.tsv \
                                                --labels_num 10

The CIFAR10 dataset has 10 labels, hence --labels_num 10.
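
For another dataset, the value for --labels_num can be checked by counting the distinct labels in the training file. The helper below is a hypothetical sketch, not part of TencentPretrain; it assumes the TSV file has a header row with a column named label, which this page does not confirm.

```python
import csv

def count_labels(tsv_path, label_column="label"):
    """Return the number of distinct values in one column of a TSV file.

    Assumes the first line of the file holds the column names.
    """
    with open(tsv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return len({row[label_column] for row in reader})
```

Calling count_labels("datasets/cifar10/train.tsv") would be expected to return 10 if the file follows the assumed layout.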


An example of pre-training an S2T model on the LibriSpeech dataset: first change models/special_tokens_map.json to models/xlmroberta_special_tokens_map.json in the file tencentpretrain/utils/, then convert the data into a format that TencentPretrain can handle. A prepared 10h version is provided in the Downstream datasets section.

python3 scripts/ --input_path datasets/librispeech/train-10h \
                                            --output_path datasets/librispeech/train-10h.tsv

Then preprocess the data and pre-train:

python3 --corpus_path datasets/librispeech/train-10h.tsv \
                      --spm_model_path models/sentencepiece.bpe.model \
                      --dataset_path \
                      --processes_num 8 --data_processor s2t

python3 --dataset_path  \
                    --spm_model_path models/sentencepiece.bpe.model \
                    --config_path models/s2t/small_config.json \
                    --output_model_path models/output_model.bin \
                    --accumulation_steps 8 \
                    --world_size 4 --gpu_ranks 0 1 2 3 \
                    --total_steps 100000 --save_checkpoint_steps 10000 --report_steps 100 \
                    --batch_size 8 --learning_rate 2e-3
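
The pre-training command above combines a small batch size with gradient accumulation across several GPUs. As a rough sketch of how these flags interact, assuming --batch_size is the per-process batch size (not stated on this page), the number of samples contributing to one optimizer update is:

```python
batch_size = 8          # --batch_size
world_size = 4          # --world_size
accumulation_steps = 8  # --accumulation_steps

# Assumption: batch_size is per process, so each optimizer update
# aggregates gradients from this many samples.
effective_batch_size = batch_size * world_size * accumulation_steps

print(effective_batch_size)  # 256
```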

To fine-tune an S2T model, add the --add_column argument when preparing the dataset so that the column names are written in the first line. An example of fine-tuning on the LibriSpeech dataset:

python3 scripts/ --input_path datasets/librispeech/train-10h \
                                            --output_path datasets/librispeech/train-10h.tsv \
                                            --add_column

python3 scripts/ --input_path datasets/librispeech/dev-clean \
                                            --output_path datasets/librispeech/dev-clean.tsv \
                                            --add_column

python3 finetune/ --pretrained_model_path models/output_model.bin \
                                    --spm_model_path models/sentencepiece.bpe.model \
                                    --config_path models/s2t/small_config.json \
                                    --train_path datasets/librispeech/train-10h.tsv \
                                    --dev_path datasets/librispeech/dev-clean.tsv \
                                    --output_model_path models/finetuned_model.bin \
                                    --batch_size 8 --epochs_num 10 \
                                    --learning_rate 2e-4 --report_steps 200

During inference, beam search is applied; the beam size can be adjusted with --beam_width:

python3 scripts/ --input_path datasets/librispeech/test-clean \
                                            --output_path datasets/librispeech/test-clean.tsv \
                                            --add_column

python3 inference/ --load_model_path models/finetuned_model.bin \
                                           --spm_model_path models/sentencepiece.bpe.model \
                                           --config_path models/s2t/small_config.json \
                                           --test_path datasets/librispeech/test-clean.tsv \
                                           --prediction_path output.txt \
                                           --batch_size 8 --tgt_seq_length 100 \
                                           --beam_width 5
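
To illustrate what the beam width controls, here is a minimal, generic beam search over a toy three-token vocabulary. It is a sketch of the general technique only, not TencentPretrain's implementation; beam_search, toy_log_probs, and the probabilities are invented for illustration.

```python
import math

def beam_search(step_log_probs, beam_width, max_length, eos_id):
    """Toy beam search.

    step_log_probs(prefix) returns one log-probability per vocabulary id
    for the token that follows `prefix` (a tuple of ids).
    Returns the best (sequence, cumulative log-probability) pair found.
    """
    beams = [((), 0.0)]  # unfinished hypotheses
    finished = []        # hypotheses that ended with eos_id
    for _ in range(max_length):
        candidates = []
        for prefix, score in beams:
            for token_id, log_p in enumerate(step_log_probs(prefix)):
                candidates.append((prefix + (token_id,), score + log_p))
        # Keep only the beam_width highest-scoring expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            if prefix[-1] == eos_id:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_length
    return max(finished, key=lambda c: c[1])

def toy_log_probs(prefix):
    """Hypothetical model: ids 0 and 1 are ordinary tokens, id 2 is EOS."""
    if prefix == ():
        probs = [0.5, 0.4, 0.1]
    elif prefix == (0,):
        probs = [0.3, 0.3, 0.4]
    else:
        probs = [0.05, 0.05, 0.9]
    return [math.log(p) for p in probs]

greedy_seq, _ = beam_search(toy_log_probs, beam_width=1, max_length=5, eos_id=2)
beam_seq, _ = beam_search(toy_log_probs, beam_width=2, max_length=5, eos_id=2)
print(greedy_seq)  # (0, 2), probability 0.5 * 0.4 = 0.20
print(beam_seq)    # (1, 2), probability 0.4 * 0.9 = 0.36
```

With a width of 1 the search commits to the locally best first token and ends with a less likely sequence; a width of 2 keeps the second candidate alive long enough to find the better one. Larger widths trade decoding time for a wider search.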

When running inference with an S2T model downloaded from Hugging Face, change "cls_token": "<s>" to "cls_token": "</s>" in models/special_tokens_map.json.
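
The cls_token change can be made by hand or scripted. The helper below is a hypothetical sketch, not a TencentPretrain utility; it assumes the file is a flat JSON object in which "cls_token" maps to a plain string, as the snippet quoted above suggests.

```python
import json

def set_cls_token(path, new_value="</s>"):
    """Rewrite the "cls_token" entry of a special-tokens JSON file in place."""
    with open(path, encoding="utf-8") as f:
        tokens = json.load(f)
    tokens["cls_token"] = new_value
    with open(path, "w", encoding="utf-8") as f:
        json.dump(tokens, f, ensure_ascii=False, indent=2)
        f.write("\n")
    return tokens
```

For example, set_cls_token("models/special_tokens_map.json") applies the change described above. Keeping a backup copy of the original file makes it easy to switch back for models that expect "<s>".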

python3 scripts/ --input_model_path s2t_huggingface_model.bin \
                                                                   --output_model_path s2t_tencentpretrain_model.bin

python3 inference/ --load_model_path s2t_tencentpretrain_model.bin \
                                           --spm_model_path models/sentencepiece.bpe.model  \
                                           --config_path models/s2t/small_config.json \
                                           --test_path datasets/librispeech/test-clean.tsv \
                                           --prediction_path output.txt \
                                           --batch_size 8 --tgt_seq_length 100 \
                                           --beam_width 5