transformers_domain_adaptation.DataSelector

class transformers_domain_adaptation.DataSelector(keep, tokenizer, similarity_metrics=None, diversity_metrics=None)

Select a subset of data that is likely to be beneficial for domain pre-training.

This class is sklearn-compatible and implements the sklearn transformer interface (fit() and transform()).

Parameters:
  • keep (Union[int, float]) – Number of documents to keep from the corpus. Pass an int to specify an absolute number of documents, or a float to specify a percentage of the documents in the corpus.

  • tokenizer (transformers.tokenization_utils_fast.PreTrainedTokenizerFast) – A Rust-based 🤗 Tokenizer

  • similarity_metrics (Optional[Sequence[str]]) – An optional list of similarity metrics

  • diversity_metrics (Optional[Sequence[str]]) – An optional list of diversity metrics

Note

For a list of the available similarity and diversity metrics, refer to the transformers_domain_adaptation.data_selection.metrics module.

Note

At least one similarity/diversity metric must be provided.
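
The two ways of specifying keep can be illustrated with a small sketch. The helper resolve_keep below is hypothetical and not part of the library; it also assumes that a float is a fraction between 0 and 1, which this page does not confirm.

```python
from typing import Union


def resolve_keep(keep: Union[int, float], corpus_size: int) -> int:
    """Hypothetical helper: turn `keep` into an absolute document count."""
    # bool is a subclass of int, so reject it explicitly
    if isinstance(keep, bool):
        raise TypeError("keep must be an int or a float")
    if isinstance(keep, int):
        if keep < 0:
            raise ValueError("keep must be non-negative")
        # Never ask for more documents than the corpus holds
        return min(keep, corpus_size)
    if isinstance(keep, float):
        # Assumption: a float is a fraction of the corpus in [0, 1]
        if not 0.0 <= keep <= 1.0:
            raise ValueError("a float keep must lie in [0, 1]")
        return int(keep * corpus_size)
    raise TypeError("keep must be an int or a float")


n_from_int = resolve_keep(3, corpus_size=10)      # absolute count
n_from_float = resolve_keep(0.5, corpus_size=10)  # half of the corpus
```

Under these assumptions, keep=3 and keep=0.5 request three and five documents respectively from a ten-document corpus.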

fit(ft_corpus)

Compute corpus-level term distribution of ft_corpus.

A new fitted attribute .ft_term_dist_ of shape (\(V\),) is created, where \(V\) is the size of the tokenizer vocabulary.

Parameters:

ft_corpus (Corpus) – The fine-tuning corpus. Not to be confused with the domain pre-training corpus, which is passed to transform().

Note

The ft_corpus is treated as a single “document”, which is then compared against each document of the in-domain corpus in transform().
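
A minimal sketch of what a corpus-level term distribution of shape (V,) could look like. This is not the library's implementation: the whitespace tokenizer, the toy vocab mapping, and the term_distribution function are stand-ins introduced for illustration, and token ids are assumed to be contiguous from 0 to V - 1.

```python
from collections import Counter
from typing import Dict, List, Sequence


def term_distribution(corpus: Sequence[str], vocab: Dict[str, int]) -> List[float]:
    """Return a length-V vector of relative token frequencies over the whole corpus."""
    counts: Counter = Counter()
    for doc in corpus:
        # Stand-in tokenizer: lowercase whitespace split, skipping out-of-vocabulary words
        counts.update(vocab[tok] for tok in doc.lower().split() if tok in vocab)
    total = sum(counts.values())
    if total == 0:
        # No known tokens at all: return an all-zero vector
        return [0.0] * len(vocab)
    # All documents are pooled, so the corpus behaves like one long "document"
    return [counts[i] / total for i in range(len(vocab))]


vocab = {"the": 0, "cell": 1, "protein": 2, "market": 3}  # toy vocabulary, V = 4
ft_corpus = ["The cell", "the protein the cell"]
ft_term_dist = term_distribution(ft_corpus, vocab)
```

Here ft_term_dist has one entry per vocabulary item and sums to 1, mirroring the role of the fitted attribute described above.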

transform(docs)

Create a relevant subset of documents from the training corpus based on the provided data selection metrics.

Parameters:

docs (Corpus) – The training corpus, i.e. the domain pre-training corpus from which documents are selected.

Returns:

A subset of relevant docs for domain pre-training

Return type:

Corpus
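
The selection step can be sketched as ranking documents by how close their term distribution is to that of the fine-tuning corpus and keeping the closest ones. The code below is an illustration under stated assumptions, not the library's algorithm: it uses a single similarity measure (Euclidean distance), ignores diversity metrics, and the names term_dist and select_docs as well as the whitespace tokenizer are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def term_dist(texts: Sequence[str], vocab: Dict[str, int]) -> List[float]:
    """Relative token frequencies over `texts`, as a length-V vector."""
    counts = Counter(
        vocab[tok] for text in texts for tok in text.lower().split() if tok in vocab
    )
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return [counts[i] / total for i in range(len(vocab))]


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def select_docs(
    docs: Sequence[str], ft_dist: Sequence[float], vocab: Dict[str, int], n_keep: int
) -> List[str]:
    """Keep the `n_keep` documents whose term distribution is closest to `ft_dist`."""
    # Rank document indices from most to least similar (smallest distance first)
    ranked = sorted(
        range(len(docs)),
        key=lambda i: euclidean(term_dist([docs[i]], vocab), ft_dist),
    )
    # Return the kept documents in their original corpus order
    return [docs[i] for i in sorted(ranked[:n_keep])]


vocab = {"the": 0, "cell": 1, "protein": 2, "market": 3}  # toy vocabulary
ft_dist = term_dist(["the cell protein"], vocab)           # "fit" step
docs = [
    "the market the market",
    "the cell protein cell",
    "market market",
    "protein the cell",
]
selected = select_docs(docs, ft_dist, vocab, n_keep=2)     # "transform" step
```

In this toy corpus the two biology-flavoured documents are retained and the two market-flavoured ones are dropped. How the library combines several similarity and diversity metrics into one ranking is not described on this page and is not modelled here.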