pytext.models.embeddings package

Submodules

pytext.models.embeddings.char_embedding module

class pytext.models.embeddings.char_embedding.CharacterEmbedding(num_embeddings: int, embed_dim: int, out_channels: int, kernel_sizes: List[int], *args, **kwargs)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase

Module for character-aware CNN embeddings for tokens. It uses convolution followed by max-pooling over character embeddings to obtain an embedding vector for each token.

Implementation is loosely based on https://arxiv.org/abs/1508.06615 but does not implement the Highway Network described in the paper.

Parameters:
  • num_embeddings (int) – Total number of characters (vocabulary size).
  • embed_dim (int) – Size of embedding vector.
  • out_channels (int) – Number of output channels.
  • kernel_sizes (List[int]) – Sizes of the convolution kernels.
char_embed

nn.Embedding – Character embedding table.

convs

nn.ModuleList – Convolution layers that operate on character embeddings.

embedding_dim

int – Dimension of the final token embedding produced.

Config

alias of pytext.config.field_config.CharFeatConfig

forward(chars: torch.Tensor) → torch.Tensor[source]

Given a batch of sentences such that tokens are broken into character ids, produce token embedding vectors for each sentence in the batch.

Parameters:
  • chars (torch.Tensor) – Batch of sentences where each token is broken into characters. Dimension: batch size X maximum sentence length X maximum word length.
Returns:

Embedded batch of sentences. Dimension: batch size X maximum sentence length X token embedding size. Token embedding size = out_channels * len(self.convs).

Return type:

torch.Tensor
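
A minimal usage sketch of the shape contract above (illustrative values, not part of the original docs; assumes the constructor accepts the arguments exactly as documented):

# Illustrative sketch: build a CharacterEmbedding directly and check shapes.
import torch
from pytext.models.embeddings.char_embedding import CharacterEmbedding

char_emb = CharacterEmbedding(
    num_embeddings=128,   # character vocabulary size
    embed_dim=16,         # per-character embedding size
    out_channels=32,      # output channels per convolution
    kernel_sizes=[2, 3],  # one convolution per kernel size
)

# chars: batch size X maximum sentence length X maximum word length
chars = torch.randint(0, 128, (8, 20, 12))
tokens = char_emb(chars)
# Token embedding size = out_channels * len(kernel_sizes) = 64
assert tokens.shape == (8, 20, 64)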

classmethod from_config(config: pytext.config.field_config.CharFeatConfig, metadata: pytext.fields.field.FieldMeta)[source]

Factory method to construct an instance of CharacterEmbedding from the module’s config object and the field’s metadata object.

Parameters:
  • config (CharFeatConfig) – Configuration object specifying all the parameters of CharacterEmbedding.
  • metadata (FieldMeta) – Object containing this field’s metadata.
Returns:

An instance of CharacterEmbedding.

Return type:

type

pytext.models.embeddings.dict_embedding module

class pytext.models.embeddings.dict_embedding.DictEmbedding(num_embeddings: int, embed_dim: int, pooling_type: pytext.config.module_config.PoolingType, *args, **kwargs)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase, torch.nn.modules.sparse.Embedding

Module for dictionary feature embeddings for tokens. Dictionary features are also known as gazetteer features. These are per-token discrete features that the module learns embeddings for. Example: for the utterance "Order coffee from Starbucks", the dictionary features could be

[
    {"tokenIdx": 1, "features": {"drink/beverage": 0.8, "music/song": 0.2}},
    {"tokenIdx": 3, "features": {"store/coffee_shop": 1.0}}
]

Thus, for a given token there can be more than one dictionary feature, each of which has a confidence score. The final embedding for a token is the weighted average of the dictionary embeddings followed by a pooling operation, such that the module produces an embedding vector per token.
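
The weighted-average-plus-pooling step can be pictured with plain PyTorch operations. This is a conceptual sketch only, not the module's actual implementation; names and shapes are illustrative:

# Conceptual sketch: weight dictionary feature embeddings by their confidence
# scores, then pool them into a single vector for the token.
import torch
import torch.nn as nn

embed_dim = 4
table = nn.Embedding(10, embed_dim)   # dictionary feature embedding table

# Two dictionary features for one token, each with a confidence score,
# e.g. "drink/beverage" (0.8) and "music/song" (0.2).
feat_ids = torch.tensor([3, 7])
weights = torch.tensor([0.8, 0.2])

weighted = table(feat_ids) * weights.unsqueeze(-1)  # (2, embed_dim)
token_embedding = weighted.mean(dim=0)              # (embed_dim,) after pooling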

Parameters:
  • num_embeddings (int) – Total number of dictionary features (vocabulary size).
  • embed_dim (int) – Size of embedding vector.
  • pooling_type (PoolingType) – Type of pooling for combining the dictionary feature embeddings.
pooling_type

PoolingType – Type of pooling for combining the dictionary feature embeddings.

Config

alias of pytext.config.field_config.DictFeatConfig

forward(feats: torch.Tensor, weights: torch.Tensor, lengths: torch.Tensor) → torch.Tensor[source]

Given a batch of sentences containing dictionary feature ids per token, produce token embedding vectors for each sentence in the batch.

Parameters:
  • feats (torch.Tensor) – Batch of sentences with dictionary feature ids.
  • weights (torch.Tensor) – Batch of sentences with weights for the dictionary features.
  • lengths (torch.Tensor) – Batch of sentences with the number of dictionary features per token.
Returns:

Embedded batch of sentences. Dimension: batch size X maximum sentence length X token embedding size. Token embedding size = embed_dim passed to the constructor.

Return type:

torch.Tensor

classmethod from_config(config: pytext.config.field_config.DictFeatConfig, metadata: pytext.fields.field.FieldMeta)[source]

Factory method to construct an instance of DictEmbedding from the module’s config object and the field’s metadata object.

Parameters:
  • config (DictFeatConfig) – Configuration object specifying all the parameters of DictEmbedding.
  • metadata (FieldMeta) – Object containing this field’s metadata.
Returns:

An instance of DictEmbedding.

Return type:

type

pytext.models.embeddings.embedding_base module

class pytext.models.embeddings.embedding_base.EmbeddingBase(embedding_dim: int)[source]

Bases: pytext.models.module.Module

Base class for token level embedding modules.

Parameters:embedding_dim (int) – Size of embedding vector.
num_emb_modules

int – Number of ways to embed a token.

embedding_dim

int – Size of embedding vector.

Config

alias of pytext.config.component.ComponentMeta.__new__.<locals>.Config

get_param_groups_for_optimizer() → List[Dict[str, torch.nn.parameter.Parameter]][source]

Organize module parameters into param_groups (or layers), so the optimizer and / or schedulers can have custom behavior per layer.

pytext.models.embeddings.embedding_list module

class pytext.models.embeddings.embedding_list.EmbeddingList(embeddings: Iterable[pytext.models.embeddings.embedding_base.EmbeddingBase], concat: bool)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase, torch.nn.modules.container.ModuleList

A token can be embedded in more than one way, and this module wraps a list of sub-embeddings so that their output tensors can either be concatenated into a single Tensor or returned as a tuple of Tensors for downstream modules.

Parameters:
  • embeddings (Iterable[EmbeddingBase]) – A sequence of embedding modules to embed a token.
  • concat (bool) – Whether to concatenate the embedding vectors emitted from the embeddings modules.
num_emb_modules

int – Number of flattened embeddings in embeddings, e.g., ((e1, e2), e3) has 3 in total.

input_start_indices

List[int] – List of indices of the sub-embeddings in the embedding list.

concat

bool – Whether to concatenate the embedding vectors emitted from embeddings modules.

embedding_dim

Total embedding size; can be a single int or a tuple of ints, depending on the concat setting.

Config

alias of pytext.config.component.ComponentMeta.__new__.<locals>.Config

forward(*emb_input) → Union[torch.Tensor, Tuple[torch.Tensor]][source]

Get embeddings from all sub-embeddings and either concatenate them into one Tensor or return them in a tuple.

Parameters:*emb_input (type) – Sequence of token-level embeddings to combine. The number of inputs should match the number of configured embeddings. Each input is either a Tensor or a tuple of Tensors.
Returns:

If concat is True, a single Tensor is returned by concatenating all embeddings. Otherwise all embeddings are returned in a tuple.

Return type:

Union[torch.Tensor, Tuple[torch.Tensor]]

get_param_groups_for_optimizer() → List[Dict[str, torch.nn.parameter.Parameter]][source]

Organize child embedding parameters into param_groups (or layers), so the optimizer and / or schedulers can have custom behavior per layer. The param_groups from each child embedding are aligned at the first (lowest) param_group.
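
A minimal sketch of combining a word and a character embedding (illustrative values, not part of the original docs; assumes the constructors accept the arguments exactly as documented in this package):

import torch
from pytext.models.embeddings.char_embedding import CharacterEmbedding
from pytext.models.embeddings.embedding_list import EmbeddingList
from pytext.models.embeddings.word_embedding import WordEmbedding

word_emb = WordEmbedding(
    num_embeddings=10000, embedding_dim=100, embeddings_weight=None,
    init_range=[-1, 1], unk_token_idx=0, mlp_layer_dims=[],
)
char_emb = CharacterEmbedding(
    num_embeddings=128, embed_dim=16, out_channels=32, kernel_sizes=[2, 3],
)
emb_list = EmbeddingList(embeddings=[word_emb, char_emb], concat=True)

word_ids = torch.randint(0, 10000, (8, 20))    # batch size X sentence length
char_ids = torch.randint(0, 128, (8, 20, 12))  # batch X sent len X word len
combined = emb_list(word_ids, char_ids)
# With concat=True, embedding_dim is the sum of the parts: 100 + 32 * 2 = 164
assert combined.shape == (8, 20, 164)

# Parameter groups (name -> Parameter dicts) for per-layer optimizer/scheduler
# behavior, one group per layer as described above.
param_groups = emb_list.get_param_groups_for_optimizer()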

pytext.models.embeddings.pretrained_model_embedding module

class pytext.models.embeddings.pretrained_model_embedding.PretrainedModelEmbedding(embedding_dim: int)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase

Module for providing token embeddings from a pretrained model.

Config

alias of pytext.config.field_config.PretrainedModelEmbeddingConfig

forward(embedding: torch.Tensor) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_config(config: pytext.config.field_config.PretrainedModelEmbeddingConfig, *args, **kwargs)[source]

pytext.models.embeddings.word_embedding module

class pytext.models.embeddings.word_embedding.WordEmbedding(num_embeddings: int, embedding_dim: int, embeddings_weight: torch.Tensor, init_range: List[int], unk_token_idx: int, mlp_layer_dims: List[int])[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase

A word embedding wrapper module around torch.nn.Embedding with options to initialize the word embedding weights and add MLP layers acting on each word.

Note: Embedding weights for the UNK token are always initialized to zeros.

Parameters:
  • num_embeddings (int) – Total number of words/tokens (vocabulary size).
  • embedding_dim (int) – Size of embedding vector.
  • embeddings_weight (torch.Tensor) – Pretrained weights to initialize the embedding table with.
  • init_range (List[int]) – Range of uniform distribution to initialize the weights with if embeddings_weight is None.
  • unk_token_idx (int) – Index of UNK token in the word vocabulary.
  • mlp_layer_dims (List[int]) – List of layer dimensions (if any) to add on top of the embedding lookup.
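
A minimal usage sketch based on the parameters above (assumption, not stated in the original docs: when mlp_layer_dims is non-empty, the final token embedding size equals the last MLP layer dimension):

import torch
from pytext.models.embeddings.word_embedding import WordEmbedding

word_emb = WordEmbedding(
    num_embeddings=10000,    # vocabulary size
    embedding_dim=100,       # lookup-table embedding size
    embeddings_weight=None,  # no pretrained weights; init_range is used instead
    init_range=[-1, 1],
    unk_token_idx=0,         # this embedding row is zeroed out
    mlp_layer_dims=[64],     # one MLP layer on top of the lookup
)

tokens = torch.randint(0, 10000, (8, 20))  # batch size X sentence length
out = word_emb(tokens)
assert out.shape == (8, 20, 64)            # last MLP layer dim (assumed)
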
Config

alias of pytext.config.field_config.WordFeatConfig

forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

freeze()[source]
classmethod from_config(config: pytext.config.field_config.WordFeatConfig, metadata: pytext.fields.field.FieldMeta)[source]

Factory method to construct an instance of WordEmbedding from the module’s config object and the field’s metadata object.

Parameters:
  • config (WordFeatConfig) – Configuration object specifying all the parameters of WordEmbedding.
  • metadata (FieldMeta) – Object containing this field’s metadata.
Returns:

An instance of WordEmbedding.

Return type:

type

Module contents

class pytext.models.embeddings.EmbeddingBase(embedding_dim: int)[source]

Bases: pytext.models.module.Module

Base class for token level embedding modules.

Parameters:embedding_dim (int) – Size of embedding vector.
num_emb_modules

int – Number of ways to embed a token.

embedding_dim

int – Size of embedding vector.

Config

alias of pytext.config.component.ComponentMeta.__new__.<locals>.Config

get_param_groups_for_optimizer() → List[Dict[str, torch.nn.parameter.Parameter]][source]

Organize module parameters into param_groups (or layers), so the optimizer and / or schedulers can have custom behavior per layer.

class pytext.models.embeddings.EmbeddingList(embeddings: Iterable[pytext.models.embeddings.embedding_base.EmbeddingBase], concat: bool)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase, torch.nn.modules.container.ModuleList

A token can be embedded in more than one way, and this module wraps a list of sub-embeddings so that their output tensors can either be concatenated into a single Tensor or returned as a tuple of Tensors for downstream modules.

Parameters:
  • embeddings (Iterable[EmbeddingBase]) – A sequence of embedding modules to embed a token.
  • concat (bool) – Whether to concatenate the embedding vectors emitted from the embeddings modules.
num_emb_modules

int – Number of flattened embeddings in embeddings, e.g., ((e1, e2), e3) has 3 in total.

input_start_indices

List[int] – List of indices of the sub-embeddings in the embedding list.

concat

bool – Whether to concatenate the embedding vectors emitted from embeddings modules.

embedding_dim

Total embedding size; can be a single int or a tuple of ints, depending on the concat setting.

Config

alias of pytext.config.component.ComponentMeta.__new__.<locals>.Config

forward(*emb_input) → Union[torch.Tensor, Tuple[torch.Tensor]][source]

Get embeddings from all sub-embeddings and either concatenate them into one Tensor or return them in a tuple.

Parameters:*emb_input (type) – Sequence of token-level embeddings to combine. The number of inputs should match the number of configured embeddings. Each input is either a Tensor or a tuple of Tensors.
Returns:

If concat is True, a single Tensor is returned by concatenating all embeddings. Otherwise all embeddings are returned in a tuple.

Return type:

Union[torch.Tensor, Tuple[torch.Tensor]]

get_param_groups_for_optimizer() → List[Dict[str, torch.nn.parameter.Parameter]][source]

Organize child embedding parameters into param_groups (or layers), so the optimizer and / or schedulers can have custom behavior per layer. The param_groups from each child embedding are aligned at the first (lowest) param_group.

class pytext.models.embeddings.WordEmbedding(num_embeddings: int, embedding_dim: int, embeddings_weight: torch.Tensor, init_range: List[int], unk_token_idx: int, mlp_layer_dims: List[int])[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase

A word embedding wrapper module around torch.nn.Embedding with options to initialize the word embedding weights and add MLP layers acting on each word.

Note: Embedding weights for the UNK token are always initialized to zeros.

Parameters:
  • num_embeddings (int) – Total number of words/tokens (vocabulary size).
  • embedding_dim (int) – Size of embedding vector.
  • embeddings_weight (torch.Tensor) – Pretrained weights to initialize the embedding table with.
  • init_range (List[int]) – Range of uniform distribution to initialize the weights with if embeddings_weight is None.
  • unk_token_idx (int) – Index of UNK token in the word vocabulary.
  • mlp_layer_dims (List[int]) – List of layer dimensions (if any) to add on top of the embedding lookup.
Config

alias of pytext.config.field_config.WordFeatConfig

forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

freeze()[source]
classmethod from_config(config: pytext.config.field_config.WordFeatConfig, metadata: pytext.fields.field.FieldMeta)[source]

Factory method to construct an instance of WordEmbedding from the module’s config object and the field’s metadata object.

Parameters:
  • config (WordFeatConfig) – Configuration object specifying all the parameters of WordEmbedding.
  • metadata (FieldMeta) – Object containing this field’s metadata.
Returns:

An instance of WordEmbedding.

Return type:

type

class pytext.models.embeddings.DictEmbedding(num_embeddings: int, embed_dim: int, pooling_type: pytext.config.module_config.PoolingType, *args, **kwargs)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase, torch.nn.modules.sparse.Embedding

Module for dictionary feature embeddings for tokens. Dictionary features are also known as gazetteer features. These are per-token discrete features that the module learns embeddings for. Example: for the utterance "Order coffee from Starbucks", the dictionary features could be

[
    {"tokenIdx": 1, "features": {"drink/beverage": 0.8, "music/song": 0.2}},
    {"tokenIdx": 3, "features": {"store/coffee_shop": 1.0}}
]

Thus, for a given token there can be more than one dictionary feature, each of which has a confidence score. The final embedding for a token is the weighted average of the dictionary embeddings followed by a pooling operation, such that the module produces an embedding vector per token.

Parameters:
  • num_embeddings (int) – Total number of dictionary features (vocabulary size).
  • embed_dim (int) – Size of embedding vector.
  • pooling_type (PoolingType) – Type of pooling for combining the dictionary feature embeddings.
pooling_type

PoolingType – Type of pooling for combining the dictionary feature embeddings.

Config

alias of pytext.config.field_config.DictFeatConfig

forward(feats: torch.Tensor, weights: torch.Tensor, lengths: torch.Tensor) → torch.Tensor[source]

Given a batch of sentences containing dictionary feature ids per token, produce token embedding vectors for each sentence in the batch.

Parameters:
  • feats (torch.Tensor) – Batch of sentences with dictionary feature ids.
  • weights (torch.Tensor) – Batch of sentences with weights for the dictionary features.
  • lengths (torch.Tensor) – Batch of sentences with the number of dictionary features per token.
Returns:

Embedded batch of sentences. Dimension: batch size X maximum sentence length X token embedding size. Token embedding size = embed_dim passed to the constructor.

Return type:

torch.Tensor

classmethod from_config(config: pytext.config.field_config.DictFeatConfig, metadata: pytext.fields.field.FieldMeta)[source]

Factory method to construct an instance of DictEmbedding from the module’s config object and the field’s metadata object.

Parameters:
  • config (DictFeatConfig) – Configuration object specifying all the parameters of DictEmbedding.
  • metadata (FieldMeta) – Object containing this field’s metadata.
Returns:

An instance of DictEmbedding.

Return type:

type

class pytext.models.embeddings.CharacterEmbedding(num_embeddings: int, embed_dim: int, out_channels: int, kernel_sizes: List[int], *args, **kwargs)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase

Module for character-aware CNN embeddings for tokens. It uses convolution followed by max-pooling over character embeddings to obtain an embedding vector for each token.

Implementation is loosely based on https://arxiv.org/abs/1508.06615 but does not implement the Highway Network described in the paper.

Parameters:
  • num_embeddings (int) – Total number of characters (vocabulary size).
  • embed_dim (int) – Size of embedding vector.
  • out_channels (int) – Number of output channels.
  • kernel_sizes (List[int]) – Sizes of the convolution kernels.
char_embed

nn.Embedding – Character embedding table.

convs

nn.ModuleList – Convolution layers that operate on character embeddings.

embedding_dim

int – Dimension of the final token embedding produced.

Config

alias of pytext.config.field_config.CharFeatConfig

forward(chars: torch.Tensor) → torch.Tensor[source]

Given a batch of sentences such that tokens are broken into character ids, produce token embedding vectors for each sentence in the batch.

Parameters:
  • chars (torch.Tensor) – Batch of sentences where each token is broken into characters. Dimension: batch size X maximum sentence length X maximum word length.
Returns:

Embedded batch of sentences. Dimension: batch size X maximum sentence length X token embedding size. Token embedding size = out_channels * len(self.convs).

Return type:

torch.Tensor

classmethod from_config(config: pytext.config.field_config.CharFeatConfig, metadata: pytext.fields.field.FieldMeta)[source]

Factory method to construct an instance of CharacterEmbedding from the module’s config object and the field’s metadata object.

Parameters:
  • config (CharFeatConfig) – Configuration object specifying all the parameters of CharacterEmbedding.
  • metadata (FieldMeta) – Object containing this field’s metadata.
Returns:

An instance of CharacterEmbedding.

Return type:

type

class pytext.models.embeddings.PretrainedModelEmbedding(embedding_dim: int)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase

Module for providing token embeddings from a pretrained model.

Config

alias of pytext.config.field_config.PretrainedModelEmbeddingConfig

forward(embedding: torch.Tensor) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_config(config: pytext.config.field_config.PretrainedModelEmbeddingConfig, *args, **kwargs)[source]