transformers_domain_adaptation.VocabAugmentor
class transformers_domain_adaptation.VocabAugmentor(tokenizer, cased, target_vocab_size)[source]

Find new tokens to add to a tokenizer's vocabulary.

A new vocabulary is learnt from the training corpus using the same tokenization model (WordPiece, BPE, or Unigram). The most common tokens of this new vocabulary that do not exist in the existing vocabulary are selected.
- Parameters:
  - tokenizer (transformers.tokenization_utils_fast.PreTrainedTokenizerFast) – A Rust-based 🤗 Tokenizer
  - cased (bool) – If False, ignore casing in the corpus
  - target_vocab_size (int) – Size of the augmented vocabulary
- Raises:
  - ValueError – If target_vocab_size is larger than or equal to the existing vocabulary size of tokenizer
  - RuntimeError – If tokenizer uses an unsupported tokenization model
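The selection step described above can be sketched in plain Python. This is an illustrative simulation, not the library's implementation: the helper name `select_new_tokens` and the toy token counts are made up for the example.

```python
from collections import Counter


def select_new_tokens(new_vocab_counts, existing_vocab, max_new_tokens):
    """Return the most common tokens of a newly learnt vocabulary
    that do not already exist in the existing vocabulary."""
    return [tok for tok, _ in new_vocab_counts.most_common()
            if tok not in existing_vocab][:max_new_tokens]


# Toy counts standing in for a vocabulary learnt on a domain corpus
new_vocab = Counter({"the": 50, "of": 40, "genome": 30, "##ase": 20, "protein": 10})
existing = {"the", "of", "and"}

print(select_new_tokens(new_vocab, existing, max_new_tokens=2))
# prints ['genome', '##ase']
```

Frequent tokens already covered by the tokenizer (here "the" and "of") are skipped, so only genuinely new domain tokens are proposed for addition.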