transformers_domain_adaptation.DataSelector
- class transformers_domain_adaptation.DataSelector(keep, tokenizer, similarity_metrics=None, diversity_metrics=None)
Select a subset of data that is likely to be beneficial for domain pre-training.
This class is sklearn-compatible and implements the scikit-learn transformer interface (fit() and transform()).
- Parameters:
keep (Union[int, float]) – Quantity of documents from the corpus to keep. To specify a number of documents, use int. To specify a percentage of documents in the corpus, use float.
tokenizer (transformers.tokenization_utils_fast.PreTrainedTokenizerFast) – A Rust-based 🤗 Tokenizer
similarity_metrics (Optional[Sequence[str]]) – An optional list of similarity metrics
diversity_metrics (Optional[Sequence[str]]) – An optional list of diversity metrics
Note
For a list of similarity and diversity metrics, refer to transformers_domain_adaptation.data_selection.metrics
Note
At least one similarity/diversity metric must be provided.
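The two forms of keep and the metric requirement can be illustrated with a small standalone sketch. It does not call the library: resolve_keep and check_metrics are hypothetical helpers written for this example, and the assumption that a float is a fraction in (0, 1] (rather than a value such as 50) is ours, not something this page states.

```python
from typing import Optional, Sequence, Union


def resolve_keep(keep: Union[int, float], corpus_size: int) -> int:
    """Hypothetical helper: turn `keep` into a number of documents.

    Assumption (not confirmed by the docs above): a float is a
    fraction of the corpus in (0, 1], an int is an absolute count.
    """
    if isinstance(keep, bool):  # bool is a subclass of int; reject it
        raise TypeError("keep must be an int or a float")
    if isinstance(keep, int):
        if keep <= 0:
            raise ValueError("an int keep must be positive")
        return min(keep, corpus_size)  # cannot keep more than exists
    if isinstance(keep, float):
        if not 0.0 < keep <= 1.0:
            raise ValueError("a float keep must lie in (0, 1]")
        return int(keep * corpus_size)
    raise TypeError("keep must be an int or a float")


def check_metrics(
    similarity_metrics: Optional[Sequence[str]] = None,
    diversity_metrics: Optional[Sequence[str]] = None,
) -> None:
    """Hypothetical check mirroring the note: at least one metric is needed."""
    if not similarity_metrics and not diversity_metrics:
        raise ValueError("provide at least one similarity or diversity metric")


n_by_count = resolve_keep(100, corpus_size=1000)      # absolute count
n_by_fraction = resolve_keep(0.25, corpus_size=1000)  # share of the corpus
```

Under these assumptions, an int larger than the corpus is capped at the corpus size; the library itself may handle that case differently.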
- fit(ft_corpus)
Compute the corpus-level term distribution of ft_corpus.
A new fitted attribute .ft_term_dist_ of shape (\(V\),) is created, where \(V\) is the size of the tokenizer vocabulary.
- Parameters:
ft_corpus (Corpus) – The fine-tuning corpus. Not to be confused with the domain pre-training corpus (which is used in transform()).
Note
The ft_corpus is treated as a single “document”, which will be compared against other documents in the in-domain corpus in transform().
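To make the idea of a corpus-level term distribution concrete, here is a minimal standalone sketch. It is not the library's implementation: the toy vocabulary, the whitespace tokenization, and the term_distribution helper are all stand-ins chosen for illustration, and normalizing counts to relative frequencies is an assumption about what "term distribution" means here.

```python
from collections import Counter
from typing import Dict, List, Sequence


def term_distribution(corpus: Sequence[str], vocab: Dict[str, int]) -> List[float]:
    """Pool every text in `corpus` into one "document" and return a
    length-V vector of relative term frequencies (V = len(vocab)).

    Whitespace splitting stands in for a real tokenizer; tokens
    outside the vocabulary are ignored.
    """
    counts: Counter = Counter()
    for text in corpus:
        for token in text.lower().split():
            if token in vocab:
                counts[vocab[token]] += 1

    dist = [0.0] * len(vocab)
    total = sum(counts.values())
    if total == 0:  # empty corpus or no in-vocabulary tokens
        return dist
    for index, count in counts.items():
        dist[index] = count / total
    return dist


# Toy vocabulary mapping token -> index (hypothetical, V = 4).
vocab = {"the": 0, "cell": 1, "protein": 2, "market": 3}
ft_corpus = ["The cell protein", "the protein"]

ft_term_dist = term_distribution(ft_corpus, vocab)
```

The result has one entry per vocabulary item regardless of how many texts the corpus holds, which matches the (\(V\),) shape described for .ft_term_dist_.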