
wav2tok: Deep Sequence Tokenizer for Audio Retrieval

Code for Reproducibility

Paper link: https://openreview.net/forum?id=v8Mi8KU6056

Citation (bibtex):

@inproceedings{banerjee2023wav2tok,
  title={wav2tok: Deep Sequence Tokenizer for Audio Retrieval},
  author={Banerjee, Adhiraj and Arora, Vipul},
  booktitle={The Eleventh International Conference on Learning Representations},
  year={2023}
}

Repository Structure:

Alongside wav2tok/Src, create three more folders: bin, weights, and Dataset.

    wav2tok
       /Src
       /bin
       /weights
       /Dataset

Training wav2tok

Keep the dataset in wav2tok/Dataset.

Make a list of the data splits and save it as a .bin file in wav2tok/bin:

    audios.bin == [X_train, X_test]

    X_train, X_test -> lists of audio paths

                       [audio path 1, audio path 2, ...]

OR

Make a list of data dictionaries and save it as a .bin file in wav2tok/bin:

    audios.bin == [X_train, X_test]

    X_train, X_test -> dictionaries of audio

    Keys -> classes or labels (e.g., song_id1, song_id2)

    Values -> lists of audio paths (e.g., 10 utterances for song_id1)

              {class 1 : list of audio, class 2 : list of audio, ...}
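For illustration, here is a minimal sketch of preparing audios.bin, assuming the .bin files are pickled Python objects (an assumption; check the save/load helpers in wav2tok/Src/new_function_library.py for the exact serialization). All file paths below are hypothetical.

    import pickle

    # List format: each split is a plain list of audio paths.
    X_train = ["Dataset/song1_take1.wav", "Dataset/song2_take1.wav"]
    X_test = ["Dataset/song1_take2.wav", "Dataset/song2_take2.wav"]

    with open("bin/audios.bin", "wb") as f:
        pickle.dump([X_train, X_test], f)

    # Dictionary format: class/label -> list of audio paths for that class.
    X_train = {"song_id1": ["Dataset/song1_take1.wav", "Dataset/song1_take2.wav"],
               "song_id2": ["Dataset/song2_take1.wav", "Dataset/song2_take2.wav"]}
    X_test = {"song_id1": ["Dataset/song1_take3.wav"],
              "song_id2": ["Dataset/song2_take3.wav"]}

    with open("bin/audios.bin", "wb") as f:
        pickle.dump([X_train, X_test], f)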

Code for Training wav2tok:

We have a dedicated function for training a wav2tok model.

                 wav2tok/Src/train.py

Functions used in wav2tok/Src/train.py:

 wav2tok() from wav2tok/Src/wav2tok.py

Trainer() from wav2tok/Src/training_function_library.py

To train a wav2tok model, run from the command line:

 python3 train.py --args1 arg_value1 --args2 arg_value2 

Arguments to pass:

Details of Args for Trainer(...) function

--debug -> int, 1 for debug mode, 0 for normal (work) mode

--dataset -> str, Dataset filename (dataset: {filename}.bin)

--is_dict -> int (Default: 0 -> False), whether the dataset splits are dictionaries (1) or lists (0)

--sample_subdataset -> int (Default: 0 -> False), sample random subsets of the data when training on large datasets
                       (works only if --is_dict == 0)

--subdata_split -> float (Default: 0.1), fraction of the full dataset used for each sampled subset

--is_triplet -> int (Default: 0 -> False), train with batches of triplets (anchor, positive, negative)

--is_single -> int (Default: 0 -> False), train with batches of single audios (anchor)

Note: by default, training uses pairs of audio (anchor, positive).


--same_length -> int (Default: 0 -> False), time-stretch the audios in each batch of (anchor), (anchor, positive), or (anchor, positive, negative) to the same length

--apply_augmentation -> int (Default: 0 -> False), works only if is_dict == True; for pairs (anchor, positive) sampled from the dictionary, apply augmentation to the positive

--batch_size -> int (Default: 4), Training batch size

--EPOCHS -> int (Default: 100), Number of full data passes 

--autosave_epoch -> int (Default: 5), autosave model parameters every {autosave_epoch} epochs

--patience -> int (Default: 5), stop training if the evaluation metric doesn't improve for {patience} epochs

--name -> str (Default: 'TrialTok' ), Model parameters save filename 

--epoch_start -> int (Default: 0), epoch at which to start (or resume) training

--device -> str (Default: 'cuda'), GPU device name

Details on Args for optimizer, learning rate scheduler, weight saving and loading

  --learning_rate -> float (Default: 2e-3), learning rate for training (the Adam optimizer with default settings is used)

  --use_scheduler -> int (Default: 1 -> True), use a learning rate scheduler (a linear schedule with warmup)

  --train_steps -> int (Default: None, calculated as EPOCHS * dataset_length // batch_size), number of training steps

  --warmup -> float (Default: 0.08), fraction of training steps used for warmup

  --load_dir -> str (Default: None), Model name to load
  
  --load_model_epochid -> int (Default: None), Epoch id to load 
  
  --best_model -> int (Default: 1 -> True), load the best version of the model
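For reference, here is a minimal sketch of a linear warmup-then-decay schedule of the kind described above (an illustration of the standard scheme, not necessarily the repo's exact implementation):

    def linear_warmup_lr(step, base_lr=2e-3, train_steps=10000, warmup=0.08):
        """Ramp the LR up linearly over the warmup fraction, then decay linearly to zero."""
        warmup_steps = int(warmup * train_steps)
        if step < warmup_steps:
            return base_lr * step / max(1, warmup_steps)
        return base_lr * max(0.0, (train_steps - step) / max(1, train_steps - warmup_steps))

    # train_steps defaults to EPOCHS * dataset_length // batch_size, e.g.
    # 100 epochs * 1600 audios // batch size 16 = 10000 steps.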

Details of Args for class wav2tok

--debug -> int (Default: 0), 1 for debug mode, 0 for normal (work) mode

--use_transformer -> int (Default: 0 -> False), use a Transformer network as the encoder;
                     its settings must be set in wav2tok/Src/wav2tok.py
                     (classes TransformerEncoder and TransformerSentenceEncoderLayer).

                     By default a BiLSTM encoder is used; its parameters can be tweaked
                     in wav2tok/Src/wav2tok.py (class Emb).

--input_dim -> int (Default: 39), input Feature dim (STFT dim or MFCC dim) 

--emb_dim -> int (Default: 256), Embedding dim (encoder output dim)

--num_tokens ->  int (Default: 50), number of tokens to use for tokenization 

--num_layers -> int (Default: 2), number of layers to use for BiLSTM model (no effect if you want to use Transformer) 
      
--device -> str (Default: 'cuda'), GPU device name

--dataset -> str (Default: None), dataset name for clustering ('audios'); the training split is used for clustering

--mfcc -> int (Default:  0 -> False), if you want to use MFCC features

--cluster_split -> float (Default: 1.0), fraction of the training data to use for clustering (sampled randomly)
           
--iter_clust -> int   (Default: 500), number of training steps before each clustering session

--clip -> int (Default: 0 -> False), works only if is_dict == False; clip the audio to a fixed duration
 
--clip_duration -> float (Default: 3), clip audio to {clip_duration} seconds

--sr -> int (Default: 16000), sampling rate of audio

--use_cosine -> int (Default: 0 -> False), use cosine similarity in the matching task instead of a parameterized similarity score

--temp -> float (Default: 0.1), temperature for the logits used in the cross-entropy calculation

--alpha, --beta -> floats (Default: 0.01, 0.01), positive constants in the likelihood loss
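As a toy illustration of the temperature: the logits fed to the cross-entropy are similarity scores divided by temp, so a smaller temp sharpens the distribution. This sketch uses cosine similarity; the repo's parameterized score may differ:

    import torch
    import torch.nn.functional as F

    temp = 0.1
    emb = F.normalize(torch.randn(8, 256), dim=-1)     # 8 anchor embeddings
    protos = F.normalize(torch.randn(8, 256), dim=-1)  # 8 matching prototypes
    logits = emb @ protos.T / temp                     # cosine similarities, sharpened by temp
    targets = torch.arange(8)                          # the i-th anchor matches the i-th prototype
    loss = F.cross_entropy(logits, targets)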

Brief on the functions present in class wav2tok:

 forward -> input: seq1 , seq2, training_steps

       output: loss with gradients, logs



cluster -> input: dataset name -> string ('audios')

       output: Performs clustering and 
               sets the token classifier codebook



 get_feats -> input: audio -> wav, mfcc -> Boolean

          output: MFCCs if mfcc == True else STFT matrix
                  (you can set the parameters for feature extraction
                   manually inside the code segment; see the sketch below)


get_embs -> input: audio -> STFT or MFCC

        output: numpy array of Embeddings


initialize_classifier -> input: codebook of token representations,
                                shape: (number of tokens, embedding dim)

                         output: sets the token classifier codebook to the input





ctc_loss_cal -> input: logits of shape (Time, classes), token sequence

            output: CTC loss or likelihood loss


gen_prototype -> input: concatenated sequences of representations {Z, Z'},
                        concatenated sequences of tokens {T, T'},
                        unique tokens in the concatenated sequence {T, T'}

                 output: dictionary of prototypes {token: prototype, i.e., the average of the
                         representations in {Z, Z'} assigned to that token}
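A minimal sketch of this prototype computation, averaging the frame representations assigned to each token (illustrative, with hypothetical tensor shapes):

    import torch

    def gen_prototypes(Z, T):
        """Z: (time, emb_dim) concatenated representations; T: (time,) token ids.
        Returns {token_id: mean of the representations assigned to that token}."""
        return {int(t): Z[T == t].mean(dim=0) for t in torch.unique(T)}

    Z = torch.randn(20, 256)        # toy representations
    T = torch.randint(0, 5, (20,))  # toy token assignments
    protos = gen_prototypes(Z, T)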



matching_loss -> input: Dictionary of Prototypes

             output: contrastive loss 



inter_dict_weights -> calculates distances from the codebook representations;
                      helper function for matching_loss

Weight Saving and Loading

The Trainer function saves the best weights, as well as checkpoints every 5 (default) epochs.

It uses the load_Weights and save_Weights functions in wav2tok/Src/new_function_library.py.

save_Weights -> input: model instance, epoch_id, name

            output: saves weights to wav2tok/weights/{name}_{epoch}.bin



load_Weights -> input: model instance, epoch_id, name

            output: loads the weights into the model
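If you need to reimplement these helpers, a typical PyTorch version might look like the following (a sketch assuming the model is a torch.nn.Module; the repo's own functions may differ):

    import torch

    def save_Weights(model, epoch_id, name):
        # Save the parameter dict to weights/{name}_{epoch_id}.bin
        torch.save(model.state_dict(), f"weights/{name}_{epoch_id}.bin")

    def load_Weights(model, epoch_id, name):
        # Load the parameter dict back into an existing model instance.
        model.load_state_dict(torch.load(f"weights/{name}_{epoch_id}.bin"))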

Code Examples

Args to pass to wav2tok/Src/train.py for different cases of audio dataset

Case 1:

wav2tok/bin/audios.bin == [X_train, X_test]

X_train, X_test -> dictionaries {class 1 : list of audio, class 2 : list of audio, ...}

python3 train.py --dataset audios --sr 16000 --is_dict 1 --cluster_split 1.0  --apply_augmentation 1 \
                 --iter_clust 1000 --input_dim 39 --emb_dim 256 --num_tokens 50 --batch_size 16 --name TrialTok

Set apply_augmentation to True (1) to sample another sequence of the same class
and apply augmentation to it,

or to False (0) to only sample another sequence of the same class.

Case 2:

wav2tok/bin/audios.bin == [X_train, X_test]

X_train, X_test -> lists of audio paths [audio path 1, audio path 2, ...]

 python3 train.py --dataset audios --sr 16000 --is_dict 0 --cluster_split 1.0 --iter_clust 1000 \
                    --input_dim 39 --emb_dim 256 --num_tokens 50 --batch_size 16 --name TrialTok


apply_augmentation has no effect here; the similar (positive) sequence is always generated via audio augmentations.
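For instance, a positive view of an anchor can be generated with simple librosa augmentations (illustrative; the repo's actual augmentation choices may differ):

    import librosa
    import numpy as np

    def augment(wav, sr=16000):
        """Create a 'positive' view of an anchor via random time stretch and pitch shift."""
        wav = librosa.effects.time_stretch(wav, rate=np.random.uniform(0.9, 1.1))
        wav = librosa.effects.pitch_shift(wav, sr=sr, n_steps=np.random.randint(-2, 3))
        return wav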
