transformer_smaller_training_vocab

transformer_smaller_training_vocab.get_texts_from_dataset(dataset, key)

Extract the texts of a dataset given their key or keys.

Note

This function is only available if the datasets extra is installed.

Parameters:
  • dataset (Union[Dataset, DatasetDict]) – The Hugging Face dataset used for training.

  • key (Union[str, Tuple[str, str]]) – Either a single string, the key referring to the text, or a tuple of two strings, the keys referring to the two parts of a text pair.

Return type:

Iterator[Union[str, List[str], Tuple[str, str], Tuple[List[str], List[str]]]]

Returns: the texts or text pairs extracted from the dataset
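
As a rough illustration of the two forms of `key`, the sketch below mimics the extraction on a plain list of row dictionaries. The helper `iter_texts` and the sample rows are hypothetical stand-ins, not the library's implementation, and no real Hugging Face `Dataset` is involved.

```python
from typing import Dict, Iterator, List, Tuple, Union

Row = Dict[str, str]
Key = Union[str, Tuple[str, str]]


def iter_texts(rows: List[Row], key: Key) -> Iterator[Union[str, Tuple[str, str]]]:
    """Yield one text per row for a string key, or one text pair for a tuple key."""
    for row in rows:
        if isinstance(key, str):
            # Single key: the row contributes one text.
            yield row[key]
        else:
            # Tuple of two keys: the row contributes a (text, text) pair.
            first, second = key
            yield (row[first], row[second])


# Hypothetical rows standing in for a dataset split.
rows = [
    {"premise": "A man sleeps.", "hypothesis": "A man is awake."},
    {"premise": "It rains.", "hypothesis": "The street is wet."},
]

single_texts = list(iter_texts(rows, "premise"))
text_pairs = list(iter_texts(rows, ("premise", "hypothesis")))
```

With a string key each row yields one text; with a tuple key each row yields a pair, which matches the `str` and `Tuple[str, str]` members of the documented return type.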

transformer_smaller_training_vocab.recreate_vocab(model, tokenizer, used_tokens, saved_vocab, saved_embeddings, empty_cuda_cache=None)

Recreates the full vocabulary from a reduced model.

Combines the stored embeddings with the updated embeddings of the reduced model and stores the result in place, so that the model works on the full vocabulary again.

Parameters:
  • model (PreTrainedModel) – The reduced transformers model to recreate

  • tokenizer (PreTrainedTokenizer) – The reduced tokenizer to recreate

  • used_tokens (List[int]) – The ids of the tokens that are still contained in the reduced vocabulary.

  • saved_vocab (Dict[str, int]) – The full vocabulary that was saved.

  • saved_embeddings (Tensor) – The saved embeddings of the full transformer before training.

  • empty_cuda_cache (Optional[bool]) – Defaults to True if the model is stored on CUDA and False otherwise. If False, the weights are held in memory twice (full and reduced) for some time, until garbage collection removes the full weights from the cache. If True, the cache is emptied before the reduced weights are loaded onto the model's device, which avoids the temporarily higher memory footprint.

Return type:

None
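
The combination of saved and updated embeddings can be pictured with plain Python lists. This is a minimal sketch under the assumption that row `i` of the reduced embedding matrix belongs to the original token id `used_tokens[i]`; the function `merge_embeddings` is hypothetical and uses lists in place of tensors.

```python
from typing import List


def merge_embeddings(
    saved_embeddings: List[List[float]],
    reduced_embeddings: List[List[float]],
    used_tokens: List[int],
) -> List[List[float]]:
    """Copy each trained row of the reduced matrix back to its original token id."""
    # Start from a copy of the saved full matrix (rows of unused tokens stay as saved).
    full = [row[:] for row in saved_embeddings]
    # Assumption: reduced row i corresponds to original id used_tokens[i].
    for reduced_id, original_id in enumerate(used_tokens):
        full[original_id] = reduced_embeddings[reduced_id][:]
    return full


saved = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]  # four tokens before training
used = [1, 3]                                             # ids kept during training
trained = [[1.5, 1.5], [3.5, 3.5]]                        # reduced rows after training

restored = merge_embeddings(saved, trained, used)
```

Rows 1 and 3 carry the trained values, while rows 0 and 2 keep the values saved before training.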

transformer_smaller_training_vocab.reduce_train_vocab(model, tokenizer, texts, empty_cuda_cache=None, optimizer=None)

Context manager to temporarily reduce the model for training.

Examples

>>> with reduce_train_vocab(model, tokenizer, texts):
...     pass  # train the reduced model here
>>> # the full model is available again here and can be saved
Parameters:
  • model (PreTrainedModel) – The transformers model to reduce

  • tokenizer (PreTrainedTokenizer) – The tokenizer that belongs to the transformers model

  • texts (Sequence[Union[str, List[str], Tuple[str, str], Tuple[List[str], List[str]]]]) – A sequence of texts, pre-tokenized texts, text pairs or pre-tokenized text pairs. Usually the full training and validation data used for training. The model and tokenizer vocabulary will be reduced to only the tokens found in those texts.

  • empty_cuda_cache (Optional[bool]) – Defaults to True if the model is stored on CUDA and False otherwise. If False, the weights are held in memory twice (full and reduced) for some time, until garbage collection removes the full weights from the cache. If True, the cache is emptied before the reduced weights are loaded onto the model's device, which avoids the temporarily higher memory footprint.

  • optimizer (Optional[Optimizer]) – Defaults to None. If provided, the optimizer parameters will be updated to use the reduced embeddings instead of the old pointer. It is crucial to provide the optimizer if it was created before reducing the model.

Return type:

Iterator[None]
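
The reduce-then-restore pattern behind such a context manager can be sketched with `contextlib`. The toy `reduced_vocab` below is hypothetical and operates on a plain dictionary instead of a model and tokenizer; restoring inside `finally` is an assumption of this sketch, not a confirmed detail of the library.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, List


@contextmanager
def reduced_vocab(vocab: Dict[str, int], used: List[str]) -> Iterator[None]:
    """Temporarily shrink `vocab` in place and restore it on exit."""
    saved = dict(vocab)  # keep the full mapping for later
    vocab.clear()
    # Re-number the kept tokens contiguously, starting at 0.
    vocab.update({token: new_id for new_id, token in enumerate(used)})
    try:
        yield
    finally:
        # Assumption of this sketch: restore even if the body raised an error.
        vocab.clear()
        vocab.update(saved)


vocab = {"[PAD]": 0, "hello": 1, "world": 2, "unused": 3}
with reduced_vocab(vocab, ["[PAD]", "hello"]):
    size_inside = len(vocab)  # only the kept tokens are visible here
size_after = len(vocab)       # the full mapping is back
```

Inside the `with` block only the kept tokens exist; after it, the original mapping is in place again, mirroring the "train reduced, save full" flow of the example above.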

transformer_smaller_training_vocab.reduce_train_vocab_and_context(model, tokenizer, texts, empty_cuda_cache=None, optimizer=None)

Reduce the vocabulary given a set of texts.

Reduces the vocabulary of a model and a tokenizer by checking which tokens are used in the texts and discarding all unused tokens.

Parameters:
  • model (PreTrainedModel) – The transformers model to reduce

  • tokenizer (PreTrainedTokenizer) – The tokenizer that belongs to the transformers model

  • texts (Sequence[Union[str, List[str], Tuple[str, str], Tuple[List[str], List[str]]]]) – A sequence of texts, pre-tokenized texts, text pairs or pre-tokenized text pairs. Usually the full training and validation data used for training. The model and tokenizer vocabulary will be reduced to only the tokens found in those texts.

  • empty_cuda_cache (Optional[bool]) – Defaults to True if the model is stored on CUDA and False otherwise. If False, the weights are held in memory twice (full and reduced) for some time, until garbage collection removes the full weights from the cache. If True, the cache is emptied before the reduced weights are loaded onto the model's device, which avoids the temporarily higher memory footprint.

  • optimizer (Optional[Optimizer]) – Defaults to None. If provided, the optimizer parameters will be updated to use the reduced embeddings instead of the old pointer. It is crucial to provide the optimizer if it was created before reducing the model.

Return type:

Tuple[List[int], Dict[str, int], Tensor]

Returns:

All information required to restore the original vocabulary after training, consisting of:

  • used_tokens (List[int]) - The ids of all tokens that will be kept in the vocabulary.

  • saved_vocab (Dict[str, int]) - The original vocabulary to recreate the tokenizer.

  • saved_embeddings (Tensor) - The original embedding weights.
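
To make the first two returned values concrete, here is a minimal sketch on a toy vocabulary. It assumes whitespace splitting in place of a real tokenizer and ignores special tokens, which a real tokenizer would presumably need to keep; the function `reduce_vocab` and all sample data are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def reduce_vocab(
    vocab: Dict[str, int], texts: Sequence[str]
) -> Tuple[List[int], Dict[str, int], Dict[str, int]]:
    """Return (used_tokens, saved_vocab, small_vocab) for whitespace-split texts."""
    # Collect every word that occurs in the texts (toy stand-in for tokenization).
    seen = {word for text in texts for word in text.split()}
    # Original ids of the tokens that are actually used, in ascending order.
    used_tokens = sorted(vocab[word] for word in seen if word in vocab)
    id_to_token = {idx: token for token, idx in vocab.items()}
    # Re-number the kept tokens contiguously so they index a smaller embedding matrix.
    small_vocab = {id_to_token[old_id]: new_id for new_id, old_id in enumerate(used_tokens)}
    return used_tokens, dict(vocab), small_vocab


full_vocab = {"a": 0, "cat": 1, "dog": 2, "sat": 3, "zebra": 4}
used_tokens, saved_vocab, small_vocab = reduce_vocab(full_vocab, ["cat sat", "dog sat"])
```

Here `used_tokens` holds the original ids of the kept tokens, `saved_vocab` is an untouched copy of the full mapping, and `small_vocab` shows the re-numbered vocabulary a reduced model would train on.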