transformers_domain_adaptation.VocabAugmentor

class transformers_domain_adaptation.VocabAugmentor(tokenizer, cased, target_vocab_size)

Find new tokens to add to a tokenizer’s vocabulary.

A new vocabulary is learnt from the training corpus using the same tokenization model as the tokenizer (WordPiece, BPE, or Unigram). From this new vocabulary, the most common tokens that do not already exist in the tokenizer’s vocabulary are selected.

Parameters:
  • tokenizer (transformers.tokenization_utils_fast.PreTrainedTokenizerFast) – A Rust-based 🤗 Tokenizer

  • cased (bool) – If False, ignore letter casing in the corpus (treat it as lowercased)

  • target_vocab_size (int) – Size of the augmented vocabulary; must be larger than the existing vocabulary size of tokenizer

Raises:
  • ValueError – If target_vocab_size is smaller than or equal to the existing vocabulary size of tokenizer

  • RuntimeError – If tokenizer uses an unsupported tokenization model (i.e., not WordPiece, BPE, or Unigram)
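
Example (a minimal sketch; the checkpoint name and target size below are illustrative, not prescribed by the library):

>>> from transformers import AutoTokenizer
>>> from transformers_domain_adaptation import VocabAugmentor
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
>>> augmentor = VocabAugmentor(
...     tokenizer=tokenizer,
...     cased=False,               # bert-base-uncased is an uncased model
...     target_vocab_size=31_000,  # must exceed the existing 30,522-token vocabulary
... )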

get_new_tokens(training_corpus)

Obtain new tokens found in training_corpus.

The new tokens are the most common tokens in training_corpus that do not already exist in the tokenizer’s vocabulary.

Parameters:

training_corpus (Union[Corpus, pathlib.Path, str]) – The training corpus

Return type:

List[Token]
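
Example (continuing the sketch above; the corpus path is hypothetical). The returned tokens can then be added to the tokenizer, and the model’s embedding matrix resized to match, using standard 🤗 Transformers calls:

>>> new_tokens = augmentor.get_new_tokens("path/to/domain_corpus.txt")
>>> from transformers import AutoModelForMaskedLM
>>> model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
>>> tokenizer.add_tokens(new_tokens)               # extend the tokenizer’s vocabulary
>>> model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix to match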