transformers_domain_adaptation.VocabAugmentor

class transformers_domain_adaptation.VocabAugmentor(tokenizer, cased, target_vocab_size)

Find new tokens to add to a tokenizer’s vocabulary.

A new vocabulary is learnt from the training corpus using the same tokenization model as the tokenizer (WordPiece, BPE, or Unigram). From this new vocabulary, the most common tokens that do not already exist in the tokenizer’s vocabulary are selected.

Parameters:
  • tokenizer (transformers.tokenization_utils_fast.PreTrainedTokenizerFast) – A Rust-based 🤗 Tokenizer

  • cased (bool) – If False, ignore letter casing in the corpus (treat it as lowercased)

  • target_vocab_size (int) – Size of the augmented vocabulary; must be larger than the existing vocabulary size of tokenizer

Raises:
  • ValueError – If target_vocab_size is smaller than or equal to the existing vocabulary size of tokenizer

  • RuntimeError – If tokenizer uses an unsupported tokenization model (i.e., not WordPiece, BPE, or Unigram)
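
Example (a minimal sketch; the checkpoint name and target size below are illustrative, not prescribed by the library):

>>> from transformers import AutoTokenizer
>>> from transformers_domain_adaptation import VocabAugmentor
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
>>> augmentor = VocabAugmentor(
...     tokenizer=tokenizer,
...     cased=False,               # bert-base-uncased is an uncased model
...     target_vocab_size=31_000,  # must exceed the existing 30,522-token vocabulary
... )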

get_new_tokens(training_corpus)

Obtain new tokens found in training_corpus.

The new tokens are the most common tokens in training_corpus that do not already exist in the tokenizer’s vocabulary.

Parameters:

training_corpus (Union[Corpus, pathlib.Path, str]) – The training corpus

Return type:

List[Token]
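
Example (continuing the sketch above; the corpus path is hypothetical). The returned tokens can then be added to the tokenizer, and the model’s embedding matrix resized to match, using standard 🤗 Transformers calls:

>>> new_tokens = augmentor.get_new_tokens("path/to/domain_corpus.txt")
>>> from transformers import AutoModelForMaskedLM
>>> model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
>>> tokenizer.add_tokens(new_tokens)               # extend the tokenizer’s vocabulary
>>> model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix to match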