
Finetuning transphone g2p #13

Open
pragvr opened this issue Jun 29, 2023 · 3 comments

@pragvr

pragvr commented Jun 29, 2023

Thank you for sharing your work!
I was wondering if it's possible to finetune the transphone G2P model with proprietary lexicons. If yes, could you please share some instructions on how to achieve this?

@xinjli
Owner

xinjli commented Jun 29, 2023

hi, thanks for your question!

It currently does not support training/fine-tuning with your own lexicon, but it should not be very difficult to modify the code to achieve this. To do this,
you can first implement your own dataset in transphone/model/dataset.py,
and then switch to loading your dataset in transphone/bin/train_g2p.py and use it to train with your own data (a rough sketch is below).

Then it should work, I think.
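
For reference, a minimal sketch of what such a lexicon dataset could look like. The file format (one `word<TAB>phone1 phone2 ...` entry per line) and the `LexiconDataset` name are assumptions for illustration; the real class would need to match whatever interface the existing datasets in transphone/model/dataset.py expose to transphone/bin/train_g2p.py.

```python
# Hypothetical custom lexicon dataset; names and interface are illustrative only.
from torch.utils.data import Dataset

class LexiconDataset(Dataset):
    """Reads a lexicon file with one 'word<TAB>phone1 phone2 ...' entry per line."""

    def __init__(self, lexicon_path):
        self.entries = []
        with open(lexicon_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                word, phones = line.split("\t", maxsplit=1)
                # graphemes as source tokens, phonemes as target tokens
                self.entries.append((list(word), phones.split()))

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, idx):
        return self.entries[idx]
```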

@pragvr
Author

pragvr commented Jul 5, 2023

Hi again,
I incorporated the changes you suggested and got training from scratch working. However, I still have trouble with the fine-tuning part, mainly because the src and tgt token embedding sizes differ between the pre-trained model and my data (which mostly adds stress markers for English that are not in the existing data). Would you have any suggestions on how to train this?

@xinjli
Owner

xinjli commented Jul 5, 2023

Yes, you can do it in two steps:

  1. There are vocab files in the pretrained model's directory. First append your new stress symbols to the end of the vocab.tgt file.
  2. Then modify the model loader so that the existing rows of the embedding are initialized from the pretrained embedding, while the newly appended vocab entries keep their random initialization (see the sketch below).
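
A rough sketch of step 2, assuming the model is a standard PyTorch module with an nn.Embedding over the target vocab; the attribute name `tgt_embed` and the state-dict key are hypothetical, not transphone's actual ones.

```python
import torch
import torch.nn as nn

def expand_embedding(old_weight: torch.Tensor, new_vocab_size: int) -> nn.Embedding:
    """Copy pretrained rows into a larger embedding; appended rows stay randomly initialized."""
    old_vocab_size, dim = old_weight.shape
    new_embed = nn.Embedding(new_vocab_size, dim)  # all rows randomly initialized
    with torch.no_grad():
        new_embed.weight[:old_vocab_size] = old_weight  # overwrite existing rows with pretrained weights
    return new_embed

# Hypothetical usage inside the model loader:
# state = torch.load("model.pt", map_location="cpu")
# model.tgt_embed = expand_embedding(state["tgt_embed.weight"], len(new_tgt_vocab))
```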
