transformers_domain_adaptation.data_selection.metrics.diversity

Diversity metrics for data selection introduced by Ruder and Plank.

The functions here were adapted and vectorized from those in the authors’ repo.

transformers_domain_adaptation.data_selection.metrics.diversity.number_of_term_types(example)[source]

Count the number of term types (distinct tokens) in the example.

Parameters:

example (Sequence[Token]) – Tokenized example

Return type:

int
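As a minimal sketch of the idea (not the library's source), counting term types amounts to counting the distinct tokens in a tokenized example. The name `count_term_types` and the use of plain strings as tokens are assumptions made for illustration.

```python
from typing import Sequence


def count_term_types(example: Sequence[str]) -> int:
    """Count the distinct tokens (types) in a tokenized example.

    Hypothetical stand-in for illustration; tokens are plain strings here.
    """
    return len(set(example))


tokens = ["the", "cat", "sat", "on", "the", "mat"]
n_types = count_term_types(tokens)  # "the" occurs twice, so 5 types
```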

transformers_domain_adaptation.data_selection.metrics.diversity.type_token_diversity(example)[source]

Calculate diversity based on the type-token ratio of the example.

Parameters:

example (Sequence[Token]) – Tokenized example

Return type:

float
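The type-token ratio divides the number of distinct tokens by the total number of tokens. The sketch below illustrates that definition; the function name `type_token_ratio` and the choice to return 0.0 for an empty example are assumptions, and the library may handle the empty case differently.

```python
from typing import Sequence


def type_token_ratio(example: Sequence[str]) -> float:
    """Ratio of distinct tokens to total tokens (hypothetical helper)."""
    if len(example) == 0:
        # Assumption: avoid division by zero by scoring empty input as 0.0.
        return 0.0
    return len(set(example)) / len(example)


tokens = ["the", "cat", "sat", "on", "the", "mat"]
ratio = type_token_ratio(tokens)  # 5 types / 6 tokens
```

A ratio close to 1.0 means almost every token is new, while a low ratio indicates heavy repetition.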

transformers_domain_adaptation.data_selection.metrics.diversity.entropy(example, vocab2id)[source]

Calculate Entropy (https://en.wikipedia.org/wiki/Entropy_(information_theory)#Definition).

Parameters:
  • example (Sequence[Token]) – Tokenized example

  • vocab2id (Dict[Token, int]) – Vocabulary-to-id mapping

Return type:

float
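For orientation, Shannon entropy over an example's term distribution can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the name `shannon_entropy` is hypothetical, out-of-vocabulary tokens are simply dropped, and the natural logarithm is used.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def shannon_entropy(example: Sequence[str], vocab2id: Dict[str, int]) -> float:
    """H = -sum(p_i * log(p_i)) over the example's in-vocabulary terms."""
    # Assumption: tokens missing from vocab2id do not contribute.
    counts = Counter(tok for tok in example if tok in vocab2id)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


vocab2id = {"a": 0, "b": 1}
h = shannon_entropy(["a", "b", "a", "b"], vocab2id)  # two equally likely terms: ln(2)
```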

transformers_domain_adaptation.data_selection.metrics.diversity.simpsons_index(example, train_term_dist, vocab2id)[source]

Calculate Simpson’s Index (https://en.wikipedia.org/wiki/Diversity_index#Simpson_index).

Parameters:
  • example (Sequence[Token]) – Tokenized example

  • train_term_dist (numpy.ndarray) – Term distribution of the training data

  • vocab2id (Dict[Token, int]) – Vocabulary-to-id mapping

Return type:

float
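Simpson's index is the sum of squared term probabilities. Given the signature above, one plausible reading is that the probabilities come from `train_term_dist` and are looked up through `vocab2id` for each distinct term in the example. The sketch below follows that reading; it is an assumption rather than the library's confirmed behavior, the name `simpsons_index_sketch` is hypothetical, and a plain list stands in for the `numpy.ndarray`.

```python
from typing import Dict, Sequence


def simpsons_index_sketch(
    example: Sequence[str],
    train_term_dist: Sequence[float],
    vocab2id: Dict[str, int],
) -> float:
    """Sum of squared training probabilities of the example's distinct terms."""
    # Assumption: terms missing from vocab2id are skipped.
    return sum(
        train_term_dist[vocab2id[term]] ** 2
        for term in set(example)
        if term in vocab2id
    )


vocab2id = {"a": 0, "b": 1, "c": 2}
train_term_dist = [0.5, 0.25, 0.25]  # stand-in for a numpy.ndarray
score = simpsons_index_sketch(["a", "b", "a", "zzz"], train_term_dist, vocab2id)
# 0.5**2 + 0.25**2 = 0.3125; "zzz" is out of vocabulary and ignored
```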

transformers_domain_adaptation.data_selection.metrics.diversity.renyi_entropy(example, domain_term_dist, vocab2id)[source]

Calculate Rényi Entropy (https://en.wikipedia.org/wiki/R%C3%A9nyi_entropy).

Parameters:
  • example (Sequence[Token]) – Tokenized example

  • domain_term_dist (numpy.ndarray) – Term distribution of the domain data

  • vocab2id (Dict[Token, int]) – Vocabulary-to-id mapping

Return type:

float
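Rényi entropy of order α is defined as H_α = 1 / (1 - α) · log Σ p_i^α. The sketch below applies that formula to the probabilities of the example's distinct terms. It is illustrative only: the documented function takes no order parameter, so the keyword `alpha`, the name `renyi_entropy_sketch`, and the error raised when no term is in the vocabulary are all assumptions.

```python
import math
from typing import Dict, Sequence


def renyi_entropy_sketch(
    example: Sequence[str],
    domain_term_dist: Sequence[float],
    vocab2id: Dict[str, int],
    *,
    alpha: float,
) -> float:
    """Renyi entropy of order alpha over the example's in-vocabulary terms."""
    if alpha == 1.0:
        raise ValueError("alpha = 1 is the Shannon limit; the formula is undefined there")
    summed = sum(
        domain_term_dist[vocab2id[term]] ** alpha
        for term in set(example)
        if term in vocab2id
    )
    if summed == 0:
        # Assumption: refuse to score examples with no in-vocabulary terms.
        raise ValueError("example shares no terms with the vocabulary")
    return math.log(summed) / (1.0 - alpha)


vocab2id = {"a": 0, "b": 1, "c": 2}
domain_term_dist = [0.5, 0.25, 0.25]  # stand-in for a numpy.ndarray
h2 = renyi_entropy_sketch(["a", "b", "c"], domain_term_dist, vocab2id, alpha=2.0)
# sum of squares is 0.375, so h2 = -log(0.375)
```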

transformers_domain_adaptation.data_selection.metrics.diversity.diversity_func_factory(metric, train_term_dist, vocab2id)[source]

Return the corresponding diversity function based on the provided metric.

Parameters:
  • metric (str) – Diversity metric

  • train_term_dist (numpy.ndarray) – Term distribution of the training data

  • vocab2id (Dict[Token, int]) – Vocabulary-to-id mapping

Raises:

ValueError – If metric does not exist in DIVERSITY_FEATURES

Return type:

Callable[[Sequence[Token]], float]
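The factory pattern described here can be sketched as a lookup that binds the shared arguments and returns a single-argument scoring function. Everything named below is hypothetical: `make_diversity_func`, the `SKETCH_FEATURES` set, and the two metric names are illustrative stand-ins, and the real contents of `DIVERSITY_FEATURES` may differ.

```python
from functools import partial
from typing import Callable, Dict, Sequence


def _type_token_ratio(example: Sequence[str]) -> float:
    return len(set(example)) / len(example) if example else 0.0


def _simpson(
    example: Sequence[str],
    train_term_dist: Sequence[float],
    vocab2id: Dict[str, int],
) -> float:
    return sum(
        train_term_dist[vocab2id[t]] ** 2 for t in set(example) if t in vocab2id
    )


# Hypothetical metric names, standing in for DIVERSITY_FEATURES.
SKETCH_FEATURES = {"type_token_ratio", "simpsons_index"}


def make_diversity_func(
    metric: str,
    train_term_dist: Sequence[float],
    vocab2id: Dict[str, int],
) -> Callable[[Sequence[str]], float]:
    """Return a function that maps a tokenized example to a diversity score."""
    if metric not in SKETCH_FEATURES:
        raise ValueError(f"Unknown diversity metric: {metric!r}")
    if metric == "type_token_ratio":
        return _type_token_ratio
    # Bind the distribution and vocabulary so callers pass only the example.
    return partial(_simpson, train_term_dist=train_term_dist, vocab2id=vocab2id)


vocab2id = {"a": 0, "b": 1}
dist = [0.5, 0.25]  # stand-in for a numpy.ndarray
score_fn = make_diversity_func("simpsons_index", dist, vocab2id)
score = score_fn(["a", "b"])  # 0.5**2 + 0.25**2 = 0.3125
```

Returning a closure over `train_term_dist` and `vocab2id` lets every metric be applied uniformly, for example when mapping one scoring function over a whole corpus.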