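Pretrained Model Zoo

With TencentPretrain, we have pretrained models with different properties (for example, different modalities, encoders, and training targets). All pretrained models listed below are in TencentPretrain format and can be loaded by TencentPretrain directly. More pretrained models will be released in the future. Unless otherwise noted, Chinese pretrained models use models/google_zh_vocab.txt as the vocabulary (the Chinese vocabulary from the original Google BERT project) and the BERT tokenizer as the tokenizer. models/bert/base_config.json is the default configuration file. Commonly used vocabularies and configuration files are included in the models folder, so users do not need to download them. In addition, we use scripts/convert_xxx_from_tencentpretrain_to_huggingface.py to convert models pretrained with TencentPretrain into the format supported by Huggingface Transformers, and we have uploaded them to the Huggingface model hub (user uer). Because TencentPretrain is developed on top of UER, the user name on Huggingface is uer. TencentPretrain is fully compatible with UER: it supports pretrained models in the same format and is used in essentially the same way (in some cases uer must be replaced with tencentpretrain). The sections below introduce these pretrained weights, give their download links, and explain how to use them. To save space, the detailed descriptions of the pretrained weights are kept in the corresponding Huggingface model repositories; a link to the repository is given wherever a specific set of weights is introduced.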

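Chinese RoBERTa pretrained models

This is a set of 24 Chinese RoBERTa pretrained models of different sizes. The corpus is CLUECorpusSmall. Configuration files are in the models/bert/ folder. Configuration files are provided only for the Tiny, Mini, Small, Medium, Base, and Large models. To load any of the other models below, modify emb_size, feedforward_size, hidden_size, heads_num, and layers_num in the configuration file. Note that emb_size equals hidden_size, feedforward_size is 4 times hidden_size, and heads_num equals hidden_size divided by 64. See here for more details.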
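The size rules above can be turned into a small helper. The following is a minimal sketch and not part of TencentPretrain: the function names are hypothetical, and it assumes a configuration file is a flat JSON object in which only the five size fields differ between models, which should be checked against the provided files before relying on it.

```python
import json


def size_fields(layers_num: int, hidden_size: int) -> dict:
    """Size-dependent config fields for an L/H combination, per the rules above."""
    if hidden_size % 64 != 0:
        raise ValueError("hidden_size must be divisible by 64")
    return {
        "emb_size": hidden_size,              # emb_size equals hidden_size
        "feedforward_size": 4 * hidden_size,  # 4 times hidden_size
        "hidden_size": hidden_size,
        "heads_num": hidden_size // 64,       # hidden_size divided by 64
        "layers_num": layers_num,
    }


def derive_config(template: dict, layers_num: int, hidden_size: int) -> dict:
    """Copy a template config and overwrite only the five size fields."""
    config = dict(template)
    config.update(size_fields(layers_num, hidden_size))
    return config


if __name__ == "__main__":
    # Stand-in template holding only the size fields of the Tiny model (L=2, H=128).
    # In practice the template would be loaded from a provided file such as
    # models/bert/tiny_config.json, whose other fields are copied unchanged.
    template = size_fields(2, 128)
    print(json.dumps(derive_config(template, 6, 512), indent=2))
```

The printed JSON could be saved under a file name of the user's choice and passed to --config_path when loading, for example, the 6/512 weights.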

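The links to the Chinese RoBERTa pretrained weights for different numbers of layers L (layers_num) and hidden sizes H (hidden_size) are listed below:

Layers / hidden size	H=128	H=256	H=512	H=768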
L=2	2/128 (Tiny)	2/256	2/512	2/768
L=4	4/128	4/256 (Mini)	4/512 (Small)	4/768
L=6	6/128	6/256	6/512	6/768
L=8	8/128	8/256	8/512 (Medium)	8/768
L=10	10/128	10/256	10/512	10/768
L=12	12/128	12/256	12/512	12/768 (Base)

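The Tiny pretrained weights are used here as an example of how to use the weights above. Download the Tiny weights through the link above and put them in the models/ folder. We can then continue pretraining from them (incremental pretraining):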

python3 preprocess.py --corpus_path corpora/book_review.txt --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 --data_processor mlm

python3 pretrain.py --dataset_path dataset.pt --pretrained_model_path models/cluecorpussmall_roberta_tiny_seq512_model.bin \
                    --vocab_path models/google_zh_vocab.txt --config_path models/bert/tiny_config.json \
                    --output_model_path models/output_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 5000 --save_checkpoint_steps 2500 --batch_size 64 \
                    --data_processor mlm --target mlm
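The command above is written for 8 GPUs. As a hedged sketch for adapting it to a different machine, the helper below builds the two distributed flags for a given GPU count. It assumes single-machine training in which --world_size equals the number of entries in --gpu_ranks and ranks are numbered from 0, as in the command above; the function name is hypothetical.

```python
def distributed_flags(num_gpus: int) -> list:
    """Build --world_size / --gpu_ranks arguments for single-machine training."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    ranks = [str(rank) for rank in range(num_gpus)]
    return ["--world_size", str(num_gpus), "--gpu_ranks", *ranks]


if __name__ == "__main__":
    print(" ".join(distributed_flags(8)))  # matches the flags used above
    print(" ".join(distributed_flags(2)))  # the same flags for a 2-GPU machine
```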

Or use them for classification:

python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_roberta_tiny_seq512_model.bin \
                                   --vocab_path models/google_zh_vocab.txt --config_path models/bert/tiny_config.json \
                                   --train_path datasets/book_review/train.tsv \
                                   --dev_path datasets/book_review/dev.tsv \
                                   --test_path datasets/book_review/test.tsv \
                                   --learning_rate 3e-4 --epochs_num 8 --batch_size 64

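In the fine-tuning stage, pretrained models of different sizes usually need different hyperparameters. The following example uses grid search to find the best hyperparameters for the classification model: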

python3 finetune/run_classifier_grid.py --pretrained_model_path models/cluecorpussmall_roberta_tiny_seq512_model.bin \
                                        --vocab_path models/google_zh_vocab.txt \
                                        --config_path models/bert/tiny_config.json \
                                        --train_path datasets/book_review/train.tsv \
                                        --dev_path datasets/book_review/dev.tsv \
                                        --learning_rate_list 3e-5 1e-4 3e-4 --epochs_num_list 3 5 8 --batch_size_list 32 64

The grid search script above can be used to reproduce the experimental results listed here.
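To make the cost of the search concrete, the sketch below enumerates the combinations implied by the three lists in the command above. It illustrates plain exhaustive grid search and is not taken from run_classifier_grid.py; the function names and the evaluate callback are hypothetical.

```python
from itertools import product


def build_grid(learning_rates, epochs_nums, batch_sizes):
    """All combinations of the three hyperparameter lists."""
    return [
        {"learning_rate": lr, "epochs_num": epochs, "batch_size": batch}
        for lr, epochs, batch in product(learning_rates, epochs_nums, batch_sizes)
    ]


def select_best(combinations, evaluate):
    """Return the combination with the highest score from evaluate(combination)."""
    return max(combinations, key=evaluate)


if __name__ == "__main__":
    grid = build_grid([3e-5, 1e-4, 3e-4], [3, 5, 8], [32, 64])
    print(len(grid))  # 3 * 3 * 2 = 18 fine-tuning runs

    # Placeholder score for demonstration only; a real evaluate() would
    # fine-tune with the given settings and return the dev-set accuracy.
    def toy_score(combo):
        return -abs(combo["learning_rate"] - 1e-4)

    print(select_best(grid, toy_score))
```

Each additional value in any list multiplies the number of fine-tuning runs, so the lists are worth keeping short for larger models.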

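Chinese RoBERTa-WWM pretrained models

This is a set of 7 Chinese RoBERTa-WWM pretrained models of different sizes. The corpus is CLUECorpusSmall. Configuration files are in the models/bert/ folder. We find that whole word masking (WWM) pretrained models often perform better on downstream tasks. See here for details of the Xlarge pretrained weights, and here for details of the other pretrained weights.

The links to the Chinese RoBERTa-WWM pretrained weights of different sizes are listed below:

Model link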
L=2/H=128 (Tiny)
L=4/H=256 (Mini)
L=4/H=512 (Small)
L=8/H=512 (Medium)
L=12/H=768 (Base)
L=24/H=1024 (Large)
L=36/H=1536 (Xlarge)

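The RoBERTa-WWM Tiny pretrained weights are used here as an example of how to use the weights above. Download the RoBERTa-WWM Tiny weights through the link above and put them in the models/ folder. We can then continue pretraining from them (incremental pretraining):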

python3 preprocess.py --corpus_path corpora/book_review.txt --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 --data_processor mlm

python3 pretrain.py --dataset_path dataset.pt --pretrained_model_path models/cluecorpussmall_roberta_wwm_tiny_seq512_model.bin \
                    --vocab_path models/google_zh_vocab.txt --config_path models/bert/tiny_config.json \
                    --output_model_path models/output_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 5000 --save_checkpoint_steps 2500 --batch_size 64 \
                    --data_processor mlm --target mlm

Or use them for classification:

python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_roberta_wwm_tiny_seq512_model.bin \
                                   --vocab_path models/google_zh_vocab.txt --config_path models/bert/tiny_config.json \
                                   --train_path datasets/book_review/train.tsv \
                                   --dev_path datasets/book_review/dev.tsv \
                                   --test_path datasets/book_review/test.tsv \
                                   --learning_rate 3e-4 --epochs_num 8 --batch_size 64

The following example uses grid search to find the best hyperparameters for the RoBERTa-WWM classification model:

python3 finetune/run_classifier_grid.py --pretrained_model_path models/cluecorpussmall_roberta_wwm_tiny_seq512_model.bin \
                                        --vocab_path models/google_zh_vocab.txt \
                                        --config_path models/bert/tiny_config.json \
                                        --train_path datasets/book_review/train.tsv \
                                        --dev_path datasets/book_review/dev.tsv \
                                        --learning_rate_list 3e-5 1e-4 3e-4 --epochs_num_list 3 5 8 --batch_size_list 32 64

The grid search script above can be used to reproduce the experimental results listed here.

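Word-based Chinese RoBERTa pretrained models

This is a set of 5 word-based Chinese RoBERTa pretrained models of different sizes. The corpus is CLUECorpusSmall. Configuration files are in the models/bert/ folder. The tokenization tool is Google sentencepiece, and the sentencepiece model used is models/cluecorpussmall_spm.model. Most mainstream Chinese pretrained models are character-based. We find that word-based pretrained models often perform better on downstream tasks and are also faster at inference, because the sequences are shorter. See here for more details.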
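The effect of word-based tokenization on sequence length can be illustrated with a toy comparison. This is only a sketch: the word segmentation below is written by hand for illustration, and the pieces actually produced by models/cluecorpussmall_spm.model may differ. The cost ratio uses the general fact that self-attention cost grows quadratically with sequence length.

```python
sentence = "我喜欢自然语言处理"

# Character-based tokenization: one token per Chinese character.
char_tokens = list(sentence)

# Hand-made word segmentation, for illustration only (not sentencepiece output).
word_tokens = ["我", "喜欢", "自然语言", "处理"]


def attention_cost_ratio(long_len: int, short_len: int) -> float:
    """Relative self-attention cost of two sequence lengths (quadratic in length)."""
    return (long_len / short_len) ** 2


if __name__ == "__main__":
    print(len(char_tokens), len(word_tokens))
    print(attention_cost_ratio(len(char_tokens), len(word_tokens)))
```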

The links to the word-based Chinese RoBERTa pretrained weights of different sizes are listed below:

Model link
L=2/H=128 (Tiny)
L=4/H=256 (Mini)
L=4/H=512 (Small)
L=8/H=512 (Medium)
L=12/H=768 (Base)

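The word-based Tiny pretrained weights are used here as an example of how to use the weights above. Download the word-based Tiny weights through the link above and put them in the models/ folder. We can then continue pretraining from them (incremental pretraining):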

python3 preprocess.py --corpus_path corpora/book_review.txt --spm_model_path models/cluecorpussmall_spm.model \
                      --dataset_path dataset.pt --processes_num 8 --data_processor mlm

python3 pretrain.py --dataset_path dataset.pt --pretrained_model_path models/cluecorpussmall_word_roberta_tiny_seq512_model.bin \
                    --spm_model_path models/cluecorpussmall_spm.model --config_path models/bert/tiny_config.json \
                    --output_model_path models/output_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 5000 --save_checkpoint_steps 2500 --batch_size 64 \
                    --data_processor mlm --target mlm

Or use them for classification:

python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_word_roberta_tiny_seq512_model.bin \
                                   --spm_model_path models/cluecorpussmall_spm.model \
                                   --config_path models/bert/tiny_config.json \
                                   --train_path datasets/book_review/train.tsv \
                                   --dev_path datasets/book_review/dev.tsv \
                                   --test_path datasets/book_review/test.tsv \
                                   --learning_rate 3e-4 --epochs_num 8 --batch_size 64

The following example uses grid search to find the best hyperparameters for the word-based classification model:

python3 finetune/run_classifier_grid.py --pretrained_model_path models/cluecorpussmall_word_roberta_tiny_seq512_model.bin \
                                        --spm_model_path models/cluecorpussmall_spm.model \
                                        --config_path models/bert/tiny_config.json \
                                        --train_path datasets/book_review/train.tsv \
                                        --dev_path datasets/book_review/dev.tsv \
                                        --learning_rate_list 3e-5 1e-4 3e-4 --epochs_num_list 3 5 8 --batch_size_list 32 64

The grid search script above can be used to reproduce the experimental results listed here.

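Chinese GPT-2 pretrained models trained on different corpora

We trained a series of GPT-2 language models on different corpora. Configuration files are in the models/gpt2/ folder. Their weight links and details links (Huggingface model repositories) are listed below:

Model link	Details link
General Chinese GPT-2-xlarge pretrained model	https://huggingface.co/uer/gpt2-xlarge-chinese-cluecorpussmall
General Chinese GPT-2-large pretrained model	https://huggingface.co/uer/gpt2-large-chinese-cluecorpussmall
General Chinese GPT-2-medium pretrained model	https://huggingface.co/uer/gpt2-medium-chinese-cluecorpussmall
General Chinese GPT-2 pretrained model	https://huggingface.co/uer/gpt2-chinese-cluecorpussmall
General Chinese GPT-2-distil pretrained model	https://huggingface.co/uer/gpt2-distil-chinese-cluecorpussmall
Chinese poem GPT-2 pretrained model	https://huggingface.co/uer/gpt2-chinese-poem
Chinese couplet GPT-2 pretrained model	https://huggingface.co/uer/gpt2-chinese-couplet
Chinese lyric GPT-2 pretrained model	https://huggingface.co/uer/gpt2-chinese-lyric
Ancient Chinese GPT-2 pretrained model	https://huggingface.co/uer/gpt2-chinese-ancient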

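Note that the poem and ancient Chinese models use extended vocabularies (models/google_zh_poem_vocab.txt and models/google_zh_ancient_vocab.txt, respectively). The General Chinese GPT-2-distil pretrained model uses the configuration file models/gpt2/distil_config.json; the other pretrained weights use models/gpt2/config.json.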

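The General Chinese GPT-2-distil pretrained weights are used here as an example of how to use the weights above. Download the GPT-2-distil weights through the link above and put them in the models/ folder. We can then continue pretraining from them (incremental pretraining):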

python3 preprocess.py --corpus_path corpora/book_review.txt \
                      --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 \
                      --seq_length 128 --data_processor lm 

python3 pretrain.py --dataset_path dataset.pt \
                    --pretrained_model_path models/cluecorpussmall_gpt2_distil_seq1024_model.bin \
                    --vocab_path models/google_zh_vocab.txt \
                    --config_path models/gpt2/distil_config.json \
                    --output_model_path models/book_review_gpt2_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 10000 --save_checkpoint_steps 5000 --report_steps 1000 \
                    --learning_rate 5e-5 --batch_size 64

Or use them for classification:

python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_gpt2_distil_seq1024_model.bin \
                                   --vocab_path models/google_zh_vocab.txt \
                                   --config_path models/gpt2/distil_config.json \
                                   --train_path datasets/book_review/train.tsv \
                                   --dev_path datasets/book_review/dev.tsv \
                                   --test_path datasets/book_review/test.tsv \
                                   --learning_rate 3e-5 --epochs_num 8 --batch_size 64

We can use a GPT-2 model to generate text. First create story_beginning.txt and enter the beginning of the text in it, then generate text with the generate_lm.py script in the scripts/ folder:

python3 scripts/generate_lm.py --load_model_path models/cluecorpussmall_gpt2_distil_seq1024_model.bin \
                               --vocab_path models/google_zh_vocab.txt \
                               --config_path models/gpt2/distil_config.json \
                               --test_path story_beginning.txt --prediction_path story_full.txt \
                               --seq_length 128
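The input file story_beginning.txt used above can be created with any editor. As a small sketch, it could also be written from Python as follows. The helper name and the example sentence are hypothetical, and the sketch assumes that a single line of UTF-8 text is what the generation script expects.

```python
from pathlib import Path


def write_story_beginning(text: str, path: str = "story_beginning.txt") -> str:
    """Write the beginning of the text as a single UTF-8 line and return it."""
    beginning = text.strip()
    if not beginning:
        raise ValueError("the beginning must not be empty")
    if "\n" in beginning:
        raise ValueError("keep the beginning on a single line")
    Path(path).write_text(beginning + "\n", encoding="utf-8")
    return beginning


if __name__ == "__main__":
    # Example beginning, chosen only for illustration.
    print(write_story_beginning("很久很久以前，"))
```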

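Chinese ALBERT pretrained models

We trained a series of ALBERT pretrained models on the CLUECorpusSmall corpus. Configuration files are in the models/albert/ folder. Their weight links and details links (Huggingface model repositories) are listed below:

Model link	Details link
General Chinese ALBERT-base pretrained model	https://huggingface.co/uer/albert-base-chinese-cluecorpussmall
General Chinese ALBERT-large pretrained model	https://huggingface.co/uer/albert-large-chinese-cluecorpussmall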

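The General Chinese ALBERT-base pretrained weights are used here as an example of how to use the weights above. Download the ALBERT-base weights through the link above and put them in the models/ folder. We can then fine-tune them on a downstream task and run inference with the fine-tuned model: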

python3 finetune/run_classifier.py --pretrained_model_path models/cluecorpussmall_albert_base_seq512_model.bin \
                                   --vocab_path models/google_zh_vocab.txt --config_path models/albert/base_config.json \
                                   --train_path datasets/book_review/train.tsv \
                                   --dev_path datasets/book_review/dev.tsv \
                                   --test_path datasets/book_review/test.tsv \
                                   --learning_rate 2e-5 --epochs_num 3 --batch_size 64

python3 inference/run_classifier_infer.py --load_model_path models/finetuned_model.bin \
                                          --vocab_path models/google_zh_vocab.txt \
                                          --config_path models/albert/base_config.json \
                                          --test_path datasets/book_review/test_nolabel.tsv \
                                          --prediction_path datasets/book_review/prediction.tsv \
                                          --labels_num 2
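The inference command above reads an unlabeled file, test_nolabel.tsv. If only a labeled file is at hand, one possible way to derive an unlabeled version is to drop the label column. The sketch below assumes a tab-separated file whose first row is a header containing a column named label (with text_a as the text column); both names are assumptions and should be checked against the actual dataset.

```python
def drop_label_column(lines, label_column="label"):
    """Remove the label column from tab-separated lines (header row included)."""
    header = lines[0].rstrip("\n").split("\t")
    index = header.index(label_column)  # raises ValueError if the column is missing
    result = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        del fields[index]
        result.append("\t".join(fields))
    return result


if __name__ == "__main__":
    # Tiny made-up sample in the assumed format.
    labeled = ["label\ttext_a", "1\t好书", "0\t一般"]
    for row in drop_label_column(labeled):
        print(row)
```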

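Chinese T5 pretrained models

We trained a series of T5 pretrained models on the CLUECorpusSmall corpus. Configuration files are in the models/t5/ folder. Their weight links and details links (Huggingface model repositories) are listed below:

Model link	Details link
General Chinese T5-small pretrained model	https://huggingface.co/uer/t5-small-chinese-cluecorpussmall
General Chinese T5-base pretrained model	https://huggingface.co/uer/t5-base-chinese-cluecorpussmall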

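The General Chinese T5-small pretrained weights are used here as an example of how to use the weights above. Download the T5-small weights through the link above and put them in the models/ folder. We can then continue pretraining from them (incremental pretraining):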

python3 preprocess.py --corpus_path corpora/book_review.txt \
                      --vocab_path models/google_zh_with_sentinel_vocab.txt \
                      --dataset_path dataset.pt \
                      --processes_num 8 --seq_length 128 \
                      --dynamic_masking --data_processor t5

python3 pretrain.py --dataset_path dataset.pt \
                    --pretrained_model_path models/cluecorpussmall_t5_small_seq512_model.bin \
                    --vocab_path models/google_zh_with_sentinel_vocab.txt \
                    --config_path models/t5/small_config.json \
                    --output_model_path models/book_review_t5_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 10000 --save_checkpoint_steps 5000 --report_steps 1000 \
                    --learning_rate 5e-4 --batch_size 64 \
                    --span_masking --span_geo_prob 0.3 --span_max_length 5
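The span masking flags above can be read as follows: span lengths are drawn from a geometric distribution with parameter 0.3 and capped at 5 tokens. This reading is an assumption based on the flag names and on the common formulation of span masking; the sketch below shows that formulation and is not necessarily how TencentPretrain implements the flags. Function names are hypothetical.

```python
import random


def span_length_probs(geo_prob: float, max_length: int) -> list:
    """Truncated, renormalized geometric distribution over lengths 1..max_length."""
    weights = [geo_prob * (1 - geo_prob) ** (k - 1) for k in range(1, max_length + 1)]
    total = sum(weights)
    return [weight / total for weight in weights]


def sample_span_length(geo_prob: float, max_length: int, rng=random) -> int:
    """Draw one span length from the truncated geometric distribution."""
    lengths = list(range(1, max_length + 1))
    probs = span_length_probs(geo_prob, max_length)
    return rng.choices(lengths, weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demo is repeatable
    print([round(p, 3) for p in span_length_probs(0.3, 5)])
    print([sample_span_length(0.3, 5, rng) for _ in range(10)])
```

Under this reading, short spans are the most likely, and a larger --span_geo_prob shifts even more probability toward spans of length 1.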

Or fine-tune them and run inference with the fine-tuned model:

python3 finetune/run_text2text.py --pretrained_model_path models/cluecorpussmall_t5_small_seq512_model.bin \
                                  --vocab_path models/google_zh_with_sentinel_vocab.txt \
                                  --config_path models/t5/small_config.json \
                                  --train_path datasets/tnews_text2text/train.tsv \
                                  --dev_path datasets/tnews_text2text/dev.tsv \
                                  --seq_length 128 --tgt_seq_length 8  --learning_rate 3e-4 --epochs_num 3 --batch_size 32

python3 inference/run_text2text_infer.py --load_model_path models/finetuned_model.bin \
                                         --vocab_path models/google_zh_with_sentinel_vocab.txt \
                                         --config_path models/t5/small_config.json \
                                         --test_path datasets/tnews_text2text/test_nolabel.tsv \
                                         --prediction_path datasets/tnews_text2text/prediction.tsv \
                                         --seq_length 128 --tgt_seq_length 8 --batch_size 32

The tnews dataset in text2text format can be downloaded here.

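Chinese T5-v1_1 pretrained models

We trained a series of T5-v1_1 pretrained models on the CLUECorpusSmall corpus. Configuration files are in the models/t5-v1_1/ folder. Their weight links and details links (Huggingface model repositories) are listed below:

Model link	Details link
General Chinese T5-v1_1-small pretrained model	https://huggingface.co/uer/t5-v1_1-small-chinese-cluecorpussmall
General Chinese T5-v1_1-base pretrained model	https://huggingface.co/uer/t5-v1_1-base-chinese-cluecorpussmall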

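The General Chinese T5-v1_1-small pretrained weights are used here as an example of how to use the weights above. Download the T5-v1_1-small weights through the link above and put them in the models/ folder. We can then continue pretraining from them (incremental pretraining):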

python3 preprocess.py --corpus_path corpora/book_review.txt \
                      --vocab_path models/google_zh_with_sentinel_vocab.txt \
                      --dataset_path dataset.pt \
                      --processes_num 8 --seq_length 128 \
                      --dynamic_masking --data_processor t5

python3 pretrain.py --dataset_path dataset.pt \
                    --pretrained_model_path models/cluecorpussmall_t5-v1_1_small_seq512_model.bin \
                    --vocab_path models/google_zh_with_sentinel_vocab.txt \
                    --config_path models/t5-v1_1/small_config.json \
                    --output_model_path models/book_review_t5-v1_1_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 10000 --save_checkpoint_steps 5000 --report_steps 1000 \
                    --learning_rate 5e-4 --batch_size 64 \
                    --span_masking --span_geo_prob 0.3 --span_max_length 5

Or fine-tune them and run inference with the fine-tuned model:

python3 finetune/run_text2text.py --pretrained_model_path models/cluecorpussmall_t5-v1_1_small_seq512_model.bin \
                                  --vocab_path models/google_zh_with_sentinel_vocab.txt \
                                  --config_path models/t5-v1_1/small_config.json \
                                  --train_path datasets/tnews_text2text/train.tsv \
                                  --dev_path datasets/tnews_text2text/dev.tsv \
                                  --seq_length 128 --tgt_seq_length 8  --learning_rate 3e-4 --epochs_num 3 --batch_size 32

python3 inference/run_text2text_infer.py --load_model_path models/finetuned_model.bin \
                                         --vocab_path models/google_zh_with_sentinel_vocab.txt \
                                         --config_path models/t5-v1_1/small_config.json \
                                         --test_path datasets/tnews_text2text/test_nolabel.tsv \
                                         --prediction_path datasets/tnews_text2text/prediction.tsv \
                                         --seq_length 128 --tgt_seq_length 8 --batch_size 32

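Chinese PEGASUS pretrained models

We trained a series of PEGASUS pretrained models on the CLUECorpusSmall corpus. Configuration files are in the models/pegasus/ folder. Their weight links and details links (Huggingface model repositories) are listed below:

Model link	Details link
General Chinese PEGASUS-base pretrained model	https://huggingface.co/uer/pegasus-base-chinese-cluecorpussmall
General Chinese PEGASUS-large pretrained model	https://huggingface.co/uer/pegasus-large-chinese-cluecorpussmall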

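Chinese BART pretrained models

We trained a series of BART pretrained models on the CLUECorpusSmall corpus. Configuration files are in the models/bart/ folder. Their weight links and details links (Huggingface model repositories) are listed below:

Model link	Details link
General Chinese BART-base pretrained model	https://huggingface.co/uer/bart-base-chinese-cluecorpussmall
General Chinese BART-large pretrained model	https://huggingface.co/uer/bart-large-chinese-cluecorpussmall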

Chinese RoBERTa models fine-tuned on downstream tasks

We fine-tuned RoBERTa pretrained models on a series of downstream tasks. The configuration file of the SBERT ChineseTextualInference natural language inference model is models/sbert/sbase_config.json. All other models use models/bert/base_config.json. Their weight links and details links (Huggingface model repositories) are listed below:

Model link	Details link
JD full sentiment classification	https://huggingface.co/uer/roberta-base-finetuned-jd-full-chinese
JD binary sentiment classification	https://huggingface.co/uer/roberta-base-finetuned-jd-binary-chinese
Dianping sentiment classification	https://huggingface.co/uer/roberta-base-finetuned-dianping-chinese
Ifeng news topic classification	https://huggingface.co/uer/roberta-base-finetuned-ifeng-chinese
Chinanews news topic classification	https://huggingface.co/uer/roberta-base-finetuned-chinanews-chinese
CLUENER2020 named entity recognition	https://huggingface.co/uer/roberta-base-finetuned-cluener2020-chinese
Extractive question answering	https://huggingface.co/uer/roberta-base-chinese-extractive-qa
ChineseTextualInference SBERT natural language inference	https://huggingface.co/uer/sbert-base-chinese-nli

The models above can be loaded for incremental pretraining, fine-tuning, and inference.

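Chinese pretrained models beyond Transformer

We trained a series of non-Transformer pretrained models on the CLUECorpusSmall corpus. Their weight links and details are listed below:

Model link	Config file	Model architecture details	Training details
General Chinese LSTM language model	models/rnn_config.json	--embedding word --remove_embedding_layernorm --encoder lstm --target lm	Steps: 500000; learning rate: 1e-3; batch size: 64*8 (number of GPUs); sequence length: 256
General Chinese GRU language model	models/rnn_config.json	--embedding word --remove_embedding_layernorm --encoder gru --target lm	Steps: 500000; learning rate: 1e-3; batch size: 64*8 (number of GPUs); sequence length: 256
General Chinese GatedCNN language model	models/gatedcnn_9_config.json	--embedding word --remove_embedding_layernorm --encoder gatedcnn --target lm	Steps: 500000; learning rate: 1e-4; batch size: 64*8 (number of GPUs); sequence length: 256
General Chinese ELMo pretrained model	models/birnn_config.json	--embedding word --remove_embedding_layernorm --encoder bilstm --target bilm	Steps: 500000; learning rate: 5e-4; batch size: 64*8 (number of GPUs); sequence length: 256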
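The training details in the table imply a fixed training budget per model. The arithmetic below is a rough sketch: it assumes that 64 is the per-GPU batch size, that 8 GPUs were used, and that every sequence is filled to the full length of 256, so the token count is an upper bound rather than a measured figure.

```python
per_gpu_batch = 64
gpus = 8
steps = 500_000
seq_length = 256

global_batch = per_gpu_batch * gpus             # sequences per training step
sequences_seen = steps * global_batch           # sequences over the whole run
max_tokens_seen = sequences_seen * seq_length   # upper bound on tokens processed

if __name__ == "__main__":
    print(f"global batch size: {global_batch}")
    print(f"sequences seen:    {sequences_seen:,}")
    print(f"tokens (at most):  {max_tokens_seen:,}")
```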

Chinese pretrained models from other organizations

Model link	Description	Details link
Google Chinese BERT-Base	Config file: models/bert/base_config.json; vocabulary: models/google_zh_vocab.txt; tokenizer: BertTokenizer	https://github.com/google-research/bert
Google Chinese ALBERT-Base	Config file: models/albert/base_config.json; vocabulary: models/google_zh_vocab.txt; tokenizer: BertTokenizer	https://github.com/google-research/albert
Google Chinese ALBERT-Large	Config file: models/albert/large_config.json; vocabulary: models/google_zh_vocab.txt; tokenizer: BertTokenizer	https://github.com/google-research/albert
Google Chinese ALBERT-Xlarge	Config file: models/albert/xlarge_config.json; vocabulary: models/google_zh_vocab.txt; tokenizer: BertTokenizer	https://github.com/google-research/albert
Google Chinese ALBERT-Xxlarge	Config file: models/albert/xxlarge_config.json; vocabulary: models/google_zh_vocab.txt; tokenizer: BertTokenizer	https://github.com/google-research/albert
HFL Chinese BERT-wwm	Config file: models/bert/base_config.json; vocabulary: models/google_zh_vocab.txt; tokenizer: BertTokenizer	https://github.com/ymcui/Chinese-BERT-wwm
HFL Chinese BERT-wwm-ext	Config file: models/bert/base_config.json; vocabulary: models/google_zh_vocab.txt; tokenizer: BertTokenizer	https://github.com/ymcui/Chinese-BERT-wwm
HFL Chinese RoBERTa-wwm-ext	Config file: models/bert/base_config.json; vocabulary: models/google_zh_vocab.txt; tokenizer: BertTokenizer	https://github.com/ymcui/Chinese-BERT-wwm
HFL Chinese RoBERTa-wwm-large-ext	Config file: models/bert/large_config.json; vocabulary: models/google_zh_vocab.txt; tokenizer: BertTokenizer	https://github.com/ymcui/Chinese-BERT-wwm

English pretrained models from other organizations

Model link	Description	Details link
English BERT-Base-uncased	Config file: models/bert/base_config.json; vocabulary: models/google_uncased_en_vocab.txt; tokenizer: BertTokenizer	https://github.com/google-research/bert
English BERT-Base-cased	Config file: models/bert/base_config.json; vocabulary: models/google_cased_en_vocab.txt; tokenizer: BertTokenizer	https://github.com/google-research/bert
English BERT-Large-uncased	Config file: models/bert/large_config.json; vocabulary: models/google_uncased_en_vocab.txt; tokenizer: BertTokenizer	https://github.com/google-research/bert
English BERT-Large-cased	Config file: models/bert/large_config.json; vocabulary: models/google_cased_en_vocab.txt; tokenizer: BertTokenizer	https://github.com/google-research/bert
English BERT-Large-WWM-uncased	Config file: models/bert/large_config.json; vocabulary: models/google_uncased_en_vocab.txt; tokenizer: BertTokenizer	https://github.com/google-research/bert
English BERT-Large-WWM-cased	Config file: models/bert/large_config.json; vocabulary: models/google_cased_en_vocab.txt; tokenizer: BertTokenizer	https://github.com/google-research/bert
English RoBERTa-Base	Config file: models/xlm-roberta/base_config.json; vocabulary: models/huggingface_gpt2_vocab.txt models/huggingface_gpt2_merges.txt; tokenizer: BPETokenizer	https://huggingface.co/roberta-base
English RoBERTa-Large	Config file: models/xlm-roberta/large_config.json; vocabulary: models/huggingface_gpt2_vocab.txt models/huggingface_gpt2_merges.txt; tokenizer: BPETokenizer	https://huggingface.co/roberta-large

ViT pretrained models from Google

Model link	Description	Details link
ViT-base-patch32-224-in21k	Config file: models/vit/base-32-224_config.json; tokenizer: VirtualTokenizer	https://huggingface.co/google/vit-base-patch32-224-in21k
ViT-base-patch16-224-in21k	Config file: models/vit/base-16-224_config.json; tokenizer: VirtualTokenizer	https://huggingface.co/google/vit-base-patch16-224-in21k
ViT-large-patch32-224-in21k	Config file: models/vit/large-32-224_config.json; tokenizer: VirtualTokenizer	https://huggingface.co/google/vit-large-patch32-224-in21k
ViT-large-patch16-224-in21k	Config file: models/vit/large-16-224_config.json; tokenizer: VirtualTokenizer	https://huggingface.co/google/vit-large-patch16-224-in21k
ViT-huge-patch14-224-in21k	Config file: models/vit/huge-14-224_config.json; tokenizer: VirtualTokenizer	https://huggingface.co/google/vit-huge-patch14-224-in21k

S2T pretrained models from Facebook

To use these models, change the path of the special tokens map in tencentpretrain/utils/constants.py from models/special_tokens_map.json to models/xlmroberta_special_tokens_map.json. Sentencepiece is used for tokenization (--spm_model_path models/sentencepiece.bpe.model --tokenizer bert).

Model link	Description	Details link
S2T-small-librispeech-asr	Config file: models/s2t/small_config.json; tokenizer: BertTokenizer	https://huggingface.co/facebook/s2t-small-librispeech-asr
S2T-medium-librispeech-asr	Config file: models/s2t/medium_config.json; tokenizer: BertTokenizer	https://huggingface.co/facebook/s2t-medium-librispeech-asr
S2T-large-librispeech-asr	Config file: models/s2t/large_config.json; tokenizer: BertTokenizer	https://huggingface.co/facebook/s2t-large-librispeech-asr
