transformers_domain_adaptation.data_selection.metrics.diversity
Diversity metrics for data selection introduced by Ruder and Plank.
The functions here were adapted and vectorized from those in the authors’ repo.
transformers_domain_adaptation.data_selection.metrics.diversity.number_of_term_types(example)
Count the number of term types (distinct terms) in the example.
- Parameters:
example (Sequence[Token])
- Return type:
int
transformers_domain_adaptation.data_selection.metrics.diversity.type_token_diversity(example)
Calculate diversity based on the type-token ratio of the example.
- Parameters:
example (Sequence[Token])
- Return type:
float
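The two count-based metrics above can be illustrated with a minimal sketch. It does not call the library: `count_term_types` and `type_token_ratio` are hypothetical stand-ins, `Token` is assumed to be a plain string, and returning 0.0 for an empty example is a choice made for this sketch rather than documented behaviour.

```python
from typing import Sequence

Token = str  # assumption: tokens are plain strings


def count_term_types(example: Sequence[Token]) -> int:
    """Number of distinct tokens (types) in the example."""
    return len(set(example))


def type_token_ratio(example: Sequence[Token]) -> float:
    """Distinct tokens divided by total tokens.

    Returning 0.0 for an empty example is an assumption of this sketch.
    """
    if not example:
        return 0.0
    return count_term_types(example) / len(example)


example = ["the", "cat", "sat", "on", "the", "mat"]
n_types = count_term_types(example)  # "the" repeats, so 5 types among 6 tokens
ttr = type_token_ratio(example)      # 5 / 6
```

A ratio close to 1 means few repeated tokens; a ratio close to 0 means the example is highly repetitive.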
transformers_domain_adaptation.data_selection.metrics.diversity.entropy(example, vocab2id)
Calculate Entropy (https://en.wikipedia.org/wiki/Entropy_(information_theory)#Definition).
- Parameters:
example (Sequence[Token])
vocab2id (Dict[Token, int]) – Vocabulary-to-id mapping
- Return type:
float
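As a rough illustration of the quantity involved, the sketch below computes Shannon entropy, H = -Σ p log p, over the tokens of a single example. Because the documented signature takes only `example` and `vocab2id`, the sketch assumes the term distribution is estimated from the example itself and that out-of-vocabulary tokens are skipped; both points are assumptions, and `shannon_entropy` is a hypothetical name, not the library function.

```python
import math
from collections import Counter
from typing import Dict, Sequence

Token = str  # assumption: tokens are plain strings


def shannon_entropy(example: Sequence[Token], vocab2id: Dict[Token, int]) -> float:
    """Shannon entropy (natural log) of the example's in-vocabulary tokens."""
    # Assumption: tokens missing from the vocabulary are ignored.
    ids = [vocab2id[tok] for tok in example if tok in vocab2id]
    if not ids:
        return 0.0
    counts = Counter(ids)
    total = len(ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


vocab2id = {"a": 0, "b": 1}
h_uniform = shannon_entropy(["a", "b", "a", "b"], vocab2id)  # two equally likely terms
h_single = shannon_entropy(["a", "a", "a"], vocab2id)        # a single repeated term
```

Two equally frequent terms give log 2, the maximum for a two-term distribution, while a single repeated term gives zero.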
transformers_domain_adaptation.data_selection.metrics.diversity.simpsons_index(example, train_term_dist, vocab2id)
Calculate Simpson’s Index (https://en.wikipedia.org/wiki/Diversity_index#Simpson_index).
- Parameters:
example (Sequence[Token])
train_term_dist (numpy.ndarray) – Term distribution of the training data
vocab2id (Dict[Token, int]) – Vocabulary-to-id mapping
- Return type:
float
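Simpson's index is the sum of squared term probabilities, Σ p². The sketch below shows one plausible reading of the signature: each distinct in-vocabulary token of the example is looked up in `train_term_dist` through `vocab2id`, and its squared probabilities are summed. The function name `simpsons_index_sketch` is hypothetical, skipping out-of-vocabulary tokens is an assumption, and a plain list stands in for the `numpy.ndarray` (any integer-indexable sequence of floats works here).

```python
from typing import Dict, Sequence

Token = str  # assumption: tokens are plain strings


def simpsons_index_sketch(
    example: Sequence[Token],
    train_term_dist: Sequence[float],
    vocab2id: Dict[Token, int],
) -> float:
    """Sum of squared training-set probabilities over the example's distinct terms."""
    # Assumption: each distinct token counts once; unknown tokens contribute nothing.
    return sum(
        train_term_dist[vocab2id[tok]] ** 2
        for tok in set(example)
        if tok in vocab2id
    )


vocab2id = {"a": 0, "b": 1, "c": 2}
train_term_dist = [0.5, 0.25, 0.25]  # stand-in for a numpy.ndarray
score = simpsons_index_sketch(["a", "b", "a", "zzz"], train_term_dist, vocab2id)
# 0.5**2 + 0.25**2 = 0.3125
```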
transformers_domain_adaptation.data_selection.metrics.diversity.renyi_entropy(example, domain_term_dist, vocab2id)
Calculate Rényi Entropy (https://en.wikipedia.org/wiki/R%C3%A9nyi_entropy).
- Parameters:
example (Sequence[Token])
domain_term_dist (numpy.ndarray)
vocab2id (Dict[Token, int]) – Vocabulary-to-id mapping
- Return type:
float
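Rényi entropy of order α (α > 0, α ≠ 1) is defined as H_α = (1 / (1 - α)) · log Σ p^α. The sketch below applies that definition to the distinct in-vocabulary tokens of an example. The documented signature exposes no order parameter, so `alpha` here is a hypothetical keyword argument with an arbitrary illustrative default; `renyi_entropy_sketch` is likewise a hypothetical name, and raising on an example with no known tokens is a choice made for this sketch only.

```python
import math
from typing import Dict, Sequence

Token = str  # assumption: tokens are plain strings


def renyi_entropy_sketch(
    example: Sequence[Token],
    domain_term_dist: Sequence[float],
    vocab2id: Dict[Token, int],
    alpha: float = 2.0,  # hypothetical parameter, not part of the documented signature
) -> float:
    """Rényi entropy of order alpha over the example's distinct known terms."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    total = sum(
        domain_term_dist[vocab2id[tok]] ** alpha
        for tok in set(example)
        if tok in vocab2id
    )
    if total == 0:
        # Sketch-only choice: log(0) is undefined, so refuse instead of guessing.
        raise ValueError("example contains no in-vocabulary tokens")
    return math.log(total) / (1 - alpha)


vocab2id = {"a": 0, "b": 1, "c": 2, "d": 3}
domain_term_dist = [0.25, 0.25, 0.25, 0.25]  # stand-in for a numpy.ndarray
h2 = renyi_entropy_sketch(["a", "b", "c", "d"], domain_term_dist, vocab2id)
h_half = renyi_entropy_sketch(["a", "b", "c", "d"], domain_term_dist, vocab2id, alpha=0.5)
```

For a uniform distribution over n terms the result is log n regardless of α, which is why both calls above agree on log 4.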
transformers_domain_adaptation.data_selection.metrics.diversity.diversity_func_factory(metric, train_term_dist, vocab2id)
Return the corresponding diversity function based on the provided metric.
- Parameters:
metric (str) – Diversity metric
train_term_dist (numpy.ndarray) – Term distribution of the training data
vocab2id (Dict[Token, int]) – Vocabulary-to-id mapping
- Raises:
ValueError – If metric does not exist in DIVERSITY_FEATURES
- Return type:
Callable[[Sequence[Token]], float]
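The factory pattern described above can be sketched as follows: look the metric name up, bind the shared arguments so that the returned callable takes only an example, and raise `ValueError` for unknown names. Everything in the sketch is illustrative. The metric names `"num_types"` and `"simpson"`, the helper functions, and `make_diversity_func` are hypothetical, and the real contents of `DIVERSITY_FEATURES` are not shown in this reference.

```python
from functools import partial
from typing import Callable, Dict, Sequence

Token = str  # assumption: tokens are plain strings

# Hypothetical metric names; the library's DIVERSITY_FEATURES may differ.
HYPOTHETICAL_METRICS = ("num_types", "simpson")


def _num_types(example: Sequence[Token]) -> float:
    return float(len(set(example)))


def _simpson(
    example: Sequence[Token],
    train_term_dist: Sequence[float],
    vocab2id: Dict[Token, int],
) -> float:
    return sum(
        train_term_dist[vocab2id[tok]] ** 2 for tok in set(example) if tok in vocab2id
    )


def make_diversity_func(
    metric: str,
    train_term_dist: Sequence[float],
    vocab2id: Dict[Token, int],
) -> Callable[[Sequence[Token]], float]:
    """Return a one-argument scoring function for the requested metric."""
    if metric == "num_types":
        return _num_types
    if metric == "simpson":
        # Bind the shared arguments so callers pass only the example.
        return partial(_simpson, train_term_dist=train_term_dist, vocab2id=vocab2id)
    raise ValueError(f"Unknown diversity metric: {metric!r}")


score_fn = make_diversity_func("simpson", [0.5, 0.5], {"a": 0, "b": 1})
score = score_fn(["a", "b"])  # 0.5**2 + 0.5**2 = 0.5
```

Binding the distribution and vocabulary once keeps the per-example call cheap to map over a whole corpus, since every returned function shares the same single-argument shape.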