pytext.data package

Submodules

pytext.data.bptt_lm_data_handler module

class pytext.data.bptt_lm_data_handler.BPTTLanguageModelDataHandler(bptt_len: int, **kwargs)[source]

Bases: pytext.data.data_handler.DataHandler

BPTTLanguageModelDataHandler treats data as a single document, concatenating all tokens together. BPTTIterator arranges the dataset into columns of batch size and subdivides the source data into chunks of length bptt_len. It enables the hidden state of the i-th batch to be carried over to the (i+1)-th batch.

Parameters:bptt_len (int) – Input sequence length over which to backpropagate.
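
As a rough illustration of the BPTT layout (a conceptual sketch, not PyText's actual implementation), the snippet below arranges a flat token stream into batch-size columns and slices it into chunks of length bptt_len; tokens, batch_size, and bptt_len are made-up names for the example.

# Conceptual sketch of BPTT batching; illustrative only, not PyText's implementation.
tokens = list(range(20))          # a flat stream of token ids (the whole corpus concatenated)
batch_size, bptt_len = 2, 3

# Arrange the stream into `batch_size` columns of equal length.
n_rows = len(tokens) // batch_size
columns = [tokens[i * n_rows:(i + 1) * n_rows] for i in range(batch_size)]

# Slice each column into chunks of length bptt_len; the hidden state computed on
# chunk i is what gets carried over to chunk i + 1 during training.
for start in range(0, n_rows - 1, bptt_len):
    inputs = [col[start:start + bptt_len] for col in columns]
    targets = [col[start + 1:start + bptt_len + 1] for col in columns]
    print(inputs, targets)
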
Config[source]

alias of BPTTLanguageModelDataHandler.Config

classmethod from_config(config: pytext.data.bptt_lm_data_handler.BPTTLanguageModelDataHandler.Config, feature_config: pytext.config.field_config.FeatureConfig, label_config: pytext.config.field_config.WordLabelConfig, **kwargs)[source]

Factory method to construct an instance of BPTTLanguageModelDataHandler from the module’s config object and feature config object.

Parameters:
  • config (BPTTLanguageModelDataHandler.Config) – Configuration object specifying all the parameters of BPTTLanguageModelDataHandler.
  • feature_config (FeatureConfig) – Configuration object specifying all the parameters of all input features.
Returns:

An instance of BPTTLanguageModelDataHandler.

Return type:

type

get_test_iter(file_path: str, batch_size: int) → pytext.data.data_handler.BatchIterator[source]

Get test data iterator from test data file.

Parameters:
  • file_path (str) – Path to test data file.
  • batch_size (int) – Batch size
Returns:

An instance of BatchIterator to iterate over the supplied test data file.

Return type:

BatchIterator

init_feature_metadata(train_data: torchtext.data.dataset.Dataset, eval_data: torchtext.data.dataset.Dataset, test_data: torchtext.data.dataset.Dataset)[source]

Prepares the metadata for the language model features.

init_target_metadata(train_data: torchtext.data.dataset.Dataset, eval_data: torchtext.data.dataset.Dataset, test_data: torchtext.data.dataset.Dataset)[source]

Prepares the metadata for the language model target.

preprocess(data: List[Dict[str, Any]])[source]

Preprocess the raw data to create TorchText.Example objects; this is the second step in the whole processing pipeline.

Returns:data (Generator[Dict[str, Any]])

preprocess_row(row_data: Dict[str, Any]) → List[str][source]

Preprocess steps for a single input row.

Parameters:row_data (Dict[str, Any]) – Dict representing the input row and columns.
Returns:List of tokens.
Return type:List[str]

pytext.data.compositional_data_handler module

class pytext.data.compositional_data_handler.CompositionalDataHandler(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, **kwargs)[source]

Bases: pytext.data.data_handler.DataHandler

Config[source]

alias of CompositionalDataHandler.Config

FULL_FEATURES = ['word_feat', 'dict_feat', 'action_idx_feature']
classmethod from_config(config: pytext.data.compositional_data_handler.CompositionalDataHandler.Config, feature_config: pytext.config.field_config.FeatureConfig, *args, **kwargs)[source]
preprocess_row(row_data: Dict[str, Any]) → Dict[str, Any][source]

Preprocess steps for a single input row; subclasses should override it.

pytext.data.contextual_intent_slot_data_handler module

class pytext.data.contextual_intent_slot_data_handler.ContextualIntentSlotModelDataHandler(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, **kwargs)[source]

Bases: pytext.data.joint_data_handler.JointModelDataHandler

Data handler that builds the pipeline to process data and generate tensors to be consumed by ContextualIntentSlotModel. Columns of the input data include:

  1. Doc label for intent classification.
  2. Word label for slot tagging of the last utterance.
  3. A sequence of utterances (e.g., a dialog).
  4. Optional dictionary feature contained in the last utterance.
  5. Optional doc weight that stands for the weight of the intent task in the joint loss.
  6. Optional word weight that stands for the weight of the slot task in the joint loss.
raw_columns

columns to read from data source. In case of files, the order should match the data stored in that file. Raw columns include

[
    RawData.DOC_LABEL,
    RawData.WORD_LABEL,
    RawData.TEXT,
    RawData.DICT_FEAT (Optional),
    RawData.DOC_WEIGHT (Optional),
    RawData.WORD_WEIGHT (Optional),
]
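
For illustration, one raw row keyed by these columns might look like the following dict (hypothetical values only; the last three columns are optional):

from pytext.data.contextual_intent_slot_data_handler import RawData

# Hypothetical raw row keyed by the constants of the RawData class documented below;
# all values are illustrative, and the last three entries are optional.
row = {
    RawData.DOC_LABEL: "intent_label",
    RawData.WORD_LABEL: "slot annotation of the last utterance",
    RawData.TEXT: "the sequence of utterances",
    RawData.DICT_FEAT: "",
    RawData.DOC_WEIGHT: "1.0",
    RawData.WORD_WEIGHT: "1.0",
}
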
labels

doc labels and word labels

features

embeddings generated from sequences of utterances and dictionary features of the last utterance

extra_fields

doc weights, word weights, etc.

Config[source]

alias of ContextualIntentSlotModelDataHandler.Config

classmethod from_config(config: pytext.data.contextual_intent_slot_data_handler.ContextualIntentSlotModelDataHandler.Config, feature_config: pytext.config.contextual_intent_slot.ModelInputConfig, target_config: List[Union[pytext.config.field_config.DocLabelConfig, pytext.config.field_config.WordLabelConfig]], **kwargs)[source]

Factory method to construct an instance of ContextualIntentSlotModelDataHandler object from the module’s config, model input config and target config.

Parameters:
  • config (Config) – Configuration object specifying all the parameters of ContextualIntentSlotModelDataHandler.
  • feature_config (ModelInputConfig) – Configuration object specifying model input.
  • target_config (TargetConfig) – Configuration object specifying target.
Returns:

An instance of ContextualIntentSlotModelDataHandler.

Return type:

type

preprocess_row(row_data: Dict[str, Any]) → Dict[str, Any][source]

Preprocess steps for a single input row: 1. apply tokenization to the sequence of utterances; 2. process dictionary features to align with the last utterance; 3. align word labels with the last utterance.

Parameters:row_data (Dict[str, Any]) – Dict of one row of data with column names as keys. Keys include "doc_label", "word_label", "text", "dict_feat", "word_weight" and "doc_weight".
Returns:Preprocessed dict of one row of data, including:
  • "seq_word_feat" (list of list of string) – tokenized words of the sequence of utterances
  • "word_feat" (list of string) – tokenized words of the last utterance
  • "raw_word_label" (string) – raw word label
  • "token_range" (list of tuple) – token ranges of word labels; each tuple contains the start position index and the end position index
  • "utterance" (list of string) – raw utterances
  • "word_label" (list of string) – list of labels of words in the last utterance
  • "doc_label" (string) – doc label for intent classification
  • "word_weight" (float) – weight of the word label
  • "doc_weight" (float) – weight of the document label
  • "dict_feat" (tuple, optional) – tuple of three lists: the label of each word, the weight of each feature, and the length of each feature
Return type:Dict[str, Any]
class pytext.data.contextual_intent_slot_data_handler.RawData[source]

Bases: object

DICT_FEAT = 'dict_feat'
DOC_LABEL = 'doc_label'
DOC_WEIGHT = 'doc_weight'
TEXT = 'text'
WORD_LABEL = 'word_label'
WORD_WEIGHT = 'word_weight'

pytext.data.data_handler module

class pytext.data.data_handler.BatchIterator(batches, processor, include_input=True, include_target=True, include_context=True, is_train=True, num_batches=-1)[source]

Bases: object

BatchIterator is a wrapper around a TorchText.Iterator that provides the flexibility to map batched data into a tuple of (input, target, context) and to perform additional steps such as handling distributed training.

Parameters:
  • batches (Iterator[TorchText.Batch]) – iterator of TorchText.Batch, which shuffles/batches the data in __iter__ and returns a batch of data in __next__
  • processor – function to run after getting batched data from the TorchText.Iterator; it should define how to map the data into (input, target, context)
  • include_input (bool) – whether input data should be returned, default is true
  • include_target (bool) – whether target data should be returned, default is true
  • include_context (bool) – whether context data should be returned, default is true
  • is_train (bool) – whether the batch data is for training
  • num_batches (int) – total number of batches to generate. This parameter exists for distributed training: PyTorch's distributed backend requires all parallel workers to have the same number of batches, so we work around it by adding dummy batches at the end
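
A minimal consumption sketch, assuming batch_iter is a BatchIterator obtained from one of the DataHandler get_*_iter methods documented below:

# Illustrative only: each iteration yields an (input, target, context) tuple.
for model_input, target, context in batch_iter:
    # model_input feeds the model's forward pass, target feeds the loss,
    # and context carries any extra per-batch data.
    pass
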
class pytext.data.data_handler.CommonMetadata[source]

Bases: object

class pytext.data.data_handler.DataHandler(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, **kwargs)[source]

Bases: pytext.config.component.Component

DataHandler is the central place to prepare data for model training/testing. The class is responsible for:

  • Defining the pipeline to process data and generate batches of tensors to be consumed by the model. Each batch is an (input, target, extra_data) tuple, in which input can be fed directly into the model.
  • Initializing the global context, such as building the vocab and loading pretrained embeddings. The context is stored as metadata, and functions are provided to serialize/deserialize the metadata.

The data processing pipeline contains the following steps:

  • Read data from file into a list of raw data examples.
  • Convert each row of raw data to a TorchText Example. This logic happens in the process_row function and will:
    • Invoke the featurizer, which contains data processing steps to apply at both training and inference time, e.g. tokenization.
    • Use the raw data and the results from the featurizer to do any preprocessing.
  • Generate a TorchText.Dataset that contains the list of Examples; the Dataset also has a list of TorchText.Fields, which define how to do padding and numericalization while batching data.
  • Return a BatchIterator which will give a tuple of (input, target, context) tensors for each iteration. By default the tensors have a 1:1 mapping to the TorchText.Field fields, but this behavior can be overridden by the _input_from_batch, _target_from_batch, and _context_from_batch functions.
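
A hedged usage sketch of this pipeline, assuming handler is an already-constructed DataHandler subclass whose configured train/eval/test paths point at valid files (the column name used for prediction is hypothetical):

# Illustrative pipeline sketch; `handler` is an already-constructed DataHandler subclass.
handler.init_metadata()                # build vocab and other metadata from the configured paths
train_iter = handler.get_train_iter()  # BatchIterator over training batches
eval_iter = handler.get_eval_iter()    # BatchIterator over evaluation batches

# In-memory rows (keyed by the handler's raw_columns; "text" here is a hypothetical
# column name) can also be batched directly for prediction.
predict_iter = handler.get_predict_iter([{"text": "set an alarm for 7 am"}])
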
raw_columns

List[str] – columns to read from data source. The order should match the data stored in that file.

featurizer

Featurizer – perform data preprocessing that should be shared between training and inference

features

Dict[str, Field] – a dict of name -> field that is used to process data as model input

labels

Dict[str, Field] – a dict of name -> field that is used to process data as training target

extra_fields

Dict[str, Field] – fields that process any extra data used neither as model input nor target. This is None by default

text_feature_name

str – name of the text field, used to define the default sort key of data

shuffle

bool – if the dataset should be shuffled, true by default

sort_within_batch

bool – if data within same batch should be sorted, true by default

train_path

str – path of training data file

eval_path

str – path of evaluation data file

test_path

str – path of test data file

train_batch_size

int – training batch size, 128 by default

eval_batch_size

int – evaluation batch size, 128 by default

test_batch_size

int – test batch size, 128 by default

max_seq_len

int – maximum length of tokens to keep in sequence

pass_index

bool – if the original index of data in the batch should be passed along to downstream steps, default is true

Config[source]

alias of DataHandler.Config

gen_dataset(data: List[Dict[str, Any]], include_label_fields: bool = True) → torchtext.data.dataset.Dataset[source]

Generate a torchtext Dataset from raw in-memory data.

Returns:dataset (TorchText.Dataset)

gen_dataset_from_path(path: str, include_label_fields: bool = True, use_cache: bool = True) → torchtext.data.dataset.Dataset[source]

Generate a dataset from a file.

Returns:dataset (TorchText.Dataset)

get_eval_iter()[source]
get_predict_iter(data: List[Dict[str, Any]])[source]
get_test_iter()[source]
get_test_iter_from_path(test_path: str, batch_size: int) → pytext.data.data_handler.BatchIterator[source]
get_test_iter_from_raw_data(test_data: List[Dict[str, Any]], batch_size: int, rank: int = 0, world_size: int = 1) → pytext.data.data_handler.BatchIterator[source]
get_train_iter(rank: int = 0, world_size: int = 1)[source]
get_train_iter_from_path(train_path: str, batch_size: int, rank: int = 0, world_size: int = 1) → pytext.data.data_handler.BatchIterator[source]

Generate data batch iterator for training data. See _get_train_iter() for details

Parameters:
  • train_path (str) – file path of training data
  • batch_size (int) – batch size
  • rank (int) – used for distributed training; the rank of the current GPU. Don't set it to anything but 0 for non-distributed training.
  • world_size (int) – used for distributed training; the total number of GPUs.
get_train_iter_from_raw_data(train_data: List[Dict[str, Any]], batch_size: int, rank: int = 0, world_size: int = 1) → pytext.data.data_handler.BatchIterator[source]
init_feature_metadata(train_data: torchtext.data.dataset.Dataset, eval_data: torchtext.data.dataset.Dataset, test_data: torchtext.data.dataset.Dataset)[source]
init_metadata()[source]

Initialize metadata using data from configured path

init_metadata_from_path(train_path, eval_path, test_path)[source]

Initialize metadata using data from file

init_metadata_from_raw_data(*data)[source]

Initialize metadata using in memory data

init_target_metadata(train_data: torchtext.data.dataset.Dataset, eval_data: torchtext.data.dataset.Dataset, test_data: torchtext.data.dataset.Dataset)[source]
load_metadata(metadata: pytext.data.data_handler.CommonMetadata)[source]

Load previously saved metadata

load_vocab(vocab_file, vocab_size, lowercase_tokens: bool = False)[source]

Loads items into a set from a file containing one item per line. Items are added to the set from the top of the file to the bottom, so the items in the file should be ordered by preference (if any); e.g., it makes sense to order tokens in descending order of their frequency in the corpus.

Parameters:
  • vocab_file (str) – vocab file to load
  • vocab_size (int) – maximum tokens to load, will only load the first n if the actual vocab size is larger than this parameter
  • lowercase_tokens (bool) – if the tokens should be lowercased
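
A small sketch of the expected file layout and of the call, assuming a hypothetical vocab.txt and an already-constructed handler; whether the loaded set is returned or stored on the handler is not shown here:

# vocab.txt is assumed to contain one token per line, ordered by descending
# corpus frequency, e.g.:
#   the
#   to
#   is
#   weather
# Load at most the first 20000 tokens, lowercasing them.
handler.load_vocab("vocab.txt", vocab_size=20000, lowercase_tokens=True)
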
metadata_to_save()[source]

Metadata to save; pretrained_embeds_weight should be excluded.

preprocess(data: List[Dict[str, Any]])[source]

Preprocess the raw data to create TorchText.Example objects; this is the second step in the whole processing pipeline.

Returns:data (Generator[Dict[str, Any]])

preprocess_row(row_data: Dict[str, Any]) → Dict[str, Any][source]

Preprocess steps for a single input row; subclasses should override it.

static read_from_file(file_name: str, columns_to_use: Union[Dict[str, int], List[str]]) → List[Dict[str, Any]][source]

Read data from a CSV file. The input file is required to have tab-separated columns.

Parameters:
  • file_name (str) – csv file name
  • columns_to_use (Union[Dict[str, int], List[str]]) – either a list of column names or a dict of column name -> column index in the file
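
A usage sketch of this static reader, assuming a hypothetical tab-separated train.tsv whose first two columns hold a doc label and the text:

from pytext.data.data_handler import DataHandler

# Columns can be selected by name, in file order...
rows = DataHandler.read_from_file("train.tsv", ["doc_label", "text"])
# ...or mapped explicitly to column indices in the file.
rows = DataHandler.read_from_file("train.tsv", {"doc_label": 0, "text": 1})
# Each returned row is a dict keyed by the requested column names.
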
sort_key(example: torchtext.data.example.Example) → Any[source]

How to sort data in every batch; the default behavior is to sort by the length of the input text.

Parameters:example (Example) – one torchtext example

pytext.data.disjoint_multitask_data_handler module

class pytext.data.disjoint_multitask_data_handler.DisjointMultitaskDataHandler(config: pytext.data.disjoint_multitask_data_handler.DisjointMultitaskDataHandler.Config, data_handlers: Dict[str, pytext.data.data_handler.DataHandler], *args, **kwargs)[source]

Bases: pytext.data.data_handler.DataHandler

Wrapper for doing multitask training using multiple data handlers. Takes a dictionary of data handlers, does round robin over their iterators using RoundRobinBatchIterator.

Parameters:
  • config (Config) – Configuration object of type DisjointMultitaskDataHandler.Config.
  • data_handlers (Dict[str, DataHandler]) – Data handlers to do roundrobin over.
  • *args (type) – Extra arguments to be passed down to sub data handlers.
  • **kwargs (type) – Extra arguments to be passed down to sub data handlers.
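
A hedged construction sketch, assuming config is a DisjointMultitaskDataHandler.Config and intent_handler / slot_handler are already-built sub data handlers (the task names are arbitrary):

from pytext.data.disjoint_multitask_data_handler import DisjointMultitaskDataHandler

# Illustrative only; the task names "intent" and "slots" are hypothetical.
multitask_handler = DisjointMultitaskDataHandler(
    config, {"intent": intent_handler, "slots": slot_handler}
)
multitask_handler.init_metadata()
train_iters = multitask_handler.get_train_iter()  # a tuple of BatchIterators
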
data_handlers

type – Data handlers to do roundrobin over.

epoch_size

Optional[int] – Size of epoch in number of batches. If not set, do a single pass over the training data.

Config[source]

alias of DisjointMultitaskDataHandler.Config

get_eval_iter() → pytext.data.data_handler.BatchIterator[source]
get_test_iter() → pytext.data.data_handler.BatchIterator[source]
get_train_iter(rank: int = 0, world_size: int = 1) → Tuple[pytext.data.data_handler.BatchIterator, ...][source]
init_metadata()[source]

Initialize metadata using data from configured path

load_metadata(metadata)[source]

Load previously saved metadata

metadata_to_save()[source]

Metadata to save; pretrained_embeds_weight should be excluded.

class pytext.data.disjoint_multitask_data_handler.RoundRobinBatchIterator(iterators: Dict[str, pytext.data.data_handler.BatchIterator], epoch_size: Optional[int] = None)[source]

Bases: pytext.data.data_handler.BatchIterator

We take a dictionary of BatchIterators and do round robin over them in a cycle. If epoch_size is specified, each iterator is also wrapped in a cycle so that they never run out. Otherwise, a single pass is done over each iterator at each epoch. Iterators that run out are filtered out. Currently there is no re-shuffling of data; the data order is the same at each epoch.

E.g., Iterator 1: [A, B, C, D], Iterator 2: [a, b]

Case 1, epoch size is set:
Output: [A, a, B, b, C, a, D, b, A, …]
Here, tasks with less data are effectively upsampled and data is balanced across tasks.

Case 2, epoch size is not set:
Output: [A, a, B, b, C, D, A, a, B, b, …]
Parameters:
  • iterators (Dict[str, BatchIterator]) – Iterators to do round-robin over.
  • epoch_size (Optional[int]) – Size of epoch in number of batches. If not set, do a single pass over the training data.
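
For intuition, a small self-contained sketch of this interleaving logic in plain Python (this is not the class's actual implementation):

from itertools import cycle, islice

def round_robin(iterators, epoch_size=None):
    """Yield batches from the iterators in round-robin order (conceptual sketch).

    Assumes every iterator yields at least one item.
    """
    if epoch_size is not None:
        # Wrap every iterator in a cycle so none of them runs out, then stop at epoch_size.
        cycled = {name: cycle(it) for name, it in iterators.items()}
        sources = cycle(cycled.values())
        yield from islice((next(it) for it in sources), epoch_size)
    else:
        # Single pass: drop iterators as they are exhausted.
        live = {name: iter(it) for name, it in iterators.items()}
        while live:
            for name, it in list(live.items()):
                try:
                    yield next(it)
                except StopIteration:
                    del live[name]

# Matches the example above:
print(list(round_robin({"t1": "ABCD", "t2": "ab"})))                # ['A', 'a', 'B', 'b', 'C', 'D']
print(list(round_robin({"t1": "ABCD", "t2": "ab"}, epoch_size=8)))  # ['A', 'a', 'B', 'b', 'C', 'a', 'D', 'b']
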
iterators

type – Iterators to do roundrobin over.

epoch_size

type – Size of epoch in number of batches.

classmethod cycle(iterator)[source]

pytext.data.doc_classification_data_handler module

class pytext.data.doc_classification_data_handler.DocClassificationDataHandler(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, **kwargs)[source]

Bases: pytext.data.data_handler.DataHandler

The DocClassificationDataHandler prepares the data for document classification. Each sentence is read line by line with its label as the target.

Config[source]

alias of DocClassificationDataHandler.Config

classmethod from_config(config: pytext.data.doc_classification_data_handler.DocClassificationDataHandler.Config, model_input_config: pytext.config.doc_classification.ModelInputConfig, target_config: pytext.config.field_config.DocLabelConfig, **kwargs)[source]

Factory method to construct an instance of DocClassificationDataHandler from the module’s config object and feature config object.

Parameters:
  • config (DocClassificationDataHandler.Config) – Configuration object specifying all the parameters of DocClassificationDataHandler.
  • model_input_config (ModelInputConfig) – Configuration object specifying all the parameters of the model config.
  • target_config (TargetConfig) – Configuration object specifying all the parameters of the target.
Returns:

An instance of DocClassificationDataHandler.

Return type:

type

preprocess_row(row_data: Dict[str, Any]) → Dict[str, Any][source]

Preprocess steps for a single input row; subclasses should override it.

class pytext.data.doc_classification_data_handler.RawData[source]

Bases: object

DICT_FEAT = 'dict_feat'
DOC_LABEL = 'doc_label'
TEXT = 'text'
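
For illustration, one raw row for document classification might look like this (hypothetical values; dict_feat is optional):

from pytext.data.doc_classification_data_handler import RawData

# Hypothetical raw row keyed by the constants above.
row = {
    RawData.DOC_LABEL: "alarm/set_alarm",    # the target label
    RawData.TEXT: "set an alarm for 7 am",   # the sentence to classify
    RawData.DICT_FEAT: "",                   # optional dictionary features
}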

pytext.data.joint_data_handler module

class pytext.data.joint_data_handler.JointModelDataHandler(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, **kwargs)[source]

Bases: pytext.data.data_handler.DataHandler

Config[source]

alias of JointModelDataHandler.Config

featurize(row_data: Dict[str, Any])[source]
classmethod from_config(config: pytext.data.joint_data_handler.JointModelDataHandler.Config, feature_config: pytext.config.field_config.FeatureConfig, label_configs: Union[pytext.config.field_config.DocLabelConfig, pytext.config.field_config.WordLabelConfig, List[Union[pytext.config.field_config.DocLabelConfig, pytext.config.field_config.WordLabelConfig]]], **kwargs)[source]
preprocess_row(row_data: Dict[str, Any]) → Dict[str, Any][source]

Preprocess steps for a single input row; subclasses should override it.

pytext.data.kd_doc_classification_data_handler module

class pytext.data.kd_doc_classification_data_handler.KDDocClassificationDataHandler(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, **kwargs)[source]

Bases: pytext.data.doc_classification_data_handler.DocClassificationDataHandler

The KDDocClassificationDataHandler prepares the data for knowledge distillation document classification. Each sentence is read line by line with its label as the target.

Config[source]

alias of KDDocClassificationDataHandler.Config

classmethod from_config(config: pytext.data.kd_doc_classification_data_handler.KDDocClassificationDataHandler.Config, model_input_config: pytext.config.kd_doc_classification.ModelInputConfig, target_config: pytext.config.field_config.DocLabelConfig, **kwargs)[source]

Factory method to construct an instance of KDDocClassificationDataHandler from the module’s config object and feature config object.

Parameters:
  • config (KDDocClassificationDataHandler.Config) – Configuration object specifying all the parameters of KDDocClassificationDataHandler.
  • model_input_config (ModelInputConfig) – Configuration object specifying all the parameters of the model config.
  • target_config (TargetConfig) – Configuration object specifying all the parameters of the target.
Returns:

An instance of KDDocClassificationDataHandler.

Return type:

type

init_target_metadata(train_data: torchtext.data.dataset.Dataset, eval_data: torchtext.data.dataset.Dataset, test_data: torchtext.data.dataset.Dataset)[source]
preprocess_row(row_data: Dict[str, Any]) → Dict[str, Any][source]

Preprocess steps for a single input row; subclasses should override it.

class pytext.data.kd_doc_classification_data_handler.RawData[source]

Bases: object

DICT_FEAT = 'dict_feat'
DOC_LABEL = 'doc_label'
TARGET_LABELS = 'target_labels'
TARGET_PROBS = 'target_probs'
TEXT = 'text'

pytext.data.language_model_data_handler module

class pytext.data.language_model_data_handler.LanguageModelDataHandler(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, **kwargs)[source]

Bases: pytext.data.data_handler.DataHandler

The LanguageModelDataHandler reads input sentences one line at a time and prepares the input and the target for language modeling. Each sentence is assumed to be independent of any other sentence.
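
As a conceptual illustration of the per-sentence setup (standard next-token language modeling; the exact special tokens PyText inserts, if any, are not shown):

# Conceptual sketch only: for each independent sentence the input is the token
# sequence and the target is the same sequence shifted by one position.
tokens = ["set", "an", "alarm", "for", "7", "am"]
lm_input, lm_target = tokens[:-1], tokens[1:]
print(lm_input)   # ['set', 'an', 'alarm', 'for', '7']
print(lm_target)  # ['an', 'alarm', 'for', '7', 'am']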

Config[source]

alias of LanguageModelDataHandler.Config

classmethod from_config(config: pytext.data.language_model_data_handler.LanguageModelDataHandler.Config, feature_config: pytext.config.field_config.FeatureConfig, *args, **kwargs)[source]

Factory method to construct an instance of LanguageModelDataHandler from the module’s config object and feature config object.

Parameters:
  • config (LanguageModelDataHandler.Config) – Configuration object specifying all the parameters of LanguageModelDataHandler.
  • feature_config (FeatureConfig) – Configuration object specifying all the parameters of all input features.
Returns:

An instance of LanguageModelDataHandler.

Return type:

type

init_target_metadata(train_data: torchtext.data.dataset.Dataset, eval_data: torchtext.data.dataset.Dataset, test_data: torchtext.data.dataset.Dataset)[source]

Prepares the metadata for the language model target.

preprocess_row(row_data: Dict[str, Any]) → Dict[str, Any][source]

Preprocess steps for a single input row.

Parameters:row_data (Dict[str, Any]) – Dict representing the input row and columns.
Returns:Dictionary with feature names as keys and feature values.
Return type:Dict[str, Any]

pytext.data.pair_classification_data_handler module

class pytext.data.pair_classification_data_handler.PairClassificationDataHandler(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, **kwargs)[source]

Bases: pytext.data.data_handler.DataHandler

Config[source]

alias of PairClassificationDataHandler.Config

classmethod from_config(config: pytext.data.pair_classification_data_handler.PairClassificationDataHandler.Config, feature_config: pytext.config.pair_classification.ModelInputConfig, target_config: pytext.config.field_config.DocLabelConfig, **kwargs)[source]
preprocess_row(row_data: Dict[str, Any]) → Dict[str, Any][source]

Preprocess steps for a single input row; subclasses should override it.

sort_key(example) → Any[source]

How to sort data in every batch; the default behavior is to sort by the length of the input text.

Parameters:example (Example) – one torchtext example

class pytext.data.pair_classification_data_handler.RawData[source]

Bases: object

DOC_LABEL = 'doc_label'
TEXT1 = 'text1'
TEXT2 = 'text2'

pytext.data.seq_data_handler module

class pytext.data.seq_data_handler.SeqModelDataHandler(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, **kwargs)[source]

Bases: pytext.data.joint_data_handler.JointModelDataHandler

Config[source]

alias of SeqModelDataHandler.Config

FULL_FEATURES = ['word_feat']
classmethod from_config(config: pytext.data.seq_data_handler.SeqModelDataHandler.Config, feature_config: pytext.config.field_config.FeatureConfig, label_config: pytext.config.field_config.DocLabelConfig, **kwargs)[source]
preprocess_row(row_data: Dict[str, Any]) → Dict[str, Any][source]

Preprocess steps for a single input row; subclasses should override it.

Module contents

class pytext.data.BPTTLanguageModelDataHandler(bptt_len: int, **kwargs)[source]

Bases: pytext.data.data_handler.DataHandler

BPTTLanguageModelDataHandler treats data as a single document, concatenating all tokens together. BPTTIterator arranges the dataset into columns of batch size and subdivides the source data into chunks of length bptt_len. It enables the hidden state of the i-th batch to be carried over to the (i+1)-th batch.

Parameters:bptt_len (int) – Input sequence length over which to backpropagate.
Config[source]

alias of BPTTLanguageModelDataHandler.Config

classmethod from_config(config: pytext.data.bptt_lm_data_handler.BPTTLanguageModelDataHandler.Config, feature_config: pytext.config.field_config.FeatureConfig, label_config: pytext.config.field_config.WordLabelConfig, **kwargs)[source]

Factory method to construct an instance of BPTTLanguageModelDataHandler from the module’s config object and feature config object.

Parameters:
  • config (BPTTLanguageModelDataHandler.Config) – Configuration object specifying all the parameters of BPTTLanguageModelDataHandler.
  • feature_config (FeatureConfig) – Configuration object specifying all the parameters of all input features.
Returns:

An instance of BPTTLanguageModelDataHandler.

Return type:

type

get_test_iter(file_path: str, batch_size: int) → pytext.data.data_handler.BatchIterator[source]

Get test data iterator from test data file.

Parameters:
  • file_path (str) – Path to test data file.
  • batch_size (int) – Batch size
Returns:

An instance of BatchIterator to iterate over the supplied test data file.

Return type:

BatchIterator

init_feature_metadata(train_data: torchtext.data.dataset.Dataset, eval_data: torchtext.data.dataset.Dataset, test_data: torchtext.data.dataset.Dataset)[source]

Prepares the metadata for the language model features.

init_target_metadata(train_data: torchtext.data.dataset.Dataset, eval_data: torchtext.data.dataset.Dataset, test_data: torchtext.data.dataset.Dataset)[source]

Prepares the metadata for the language model target.

preprocess(data: List[Dict[str, Any]])[source]

Preprocess the raw data to create TorchText.Example objects; this is the second step in the whole processing pipeline.

Returns:data (Generator[Dict[str, Any]])

preprocess_row(row_data: Dict[str, Any]) → List[str][source]

Preprocess steps for a single input row.

Parameters:row_data (Dict[str, Any]) – Dict representing the input row and columns.
Returns:List of tokens.
Return type:List[str]
class pytext.data.CompositionalDataHandler(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, **kwargs)[source]

Bases: pytext.data.data_handler.DataHandler

Config[source]

alias of CompositionalDataHandler.Config

FULL_FEATURES = ['word_feat', 'dict_feat', 'action_idx_feature']
classmethod from_config(config: pytext.data.compositional_data_handler.CompositionalDataHandler.Config, feature_config: pytext.config.field_config.FeatureConfig, *args, **kwargs)[source]
preprocess_row(row_data: Dict[str, Any]) → Dict[str, Any][source]

Preprocess steps for a single input row; subclasses should override it.

class pytext.data.ContextualIntentSlotModelDataHandler(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, **kwargs)[source]

Bases: pytext.data.joint_data_handler.JointModelDataHandler

Data handler that builds the pipeline to process data and generate tensors to be consumed by ContextualIntentSlotModel. Columns of the input data include:

  1. Doc label for intent classification.
  2. Word label for slot tagging of the last utterance.
  3. A sequence of utterances (e.g., a dialog).
  4. Optional dictionary feature contained in the last utterance.
  5. Optional doc weight that stands for the weight of the intent task in the joint loss.
  6. Optional word weight that stands for the weight of the slot task in the joint loss.
raw_columns

columns to read from data source. In case of files, the order should match the data stored in that file. Raw columns include

[
    RawData.DOC_LABEL,
    RawData.WORD_LABEL,
    RawData.TEXT,
    RawData.DICT_FEAT (Optional),
    RawData.DOC_WEIGHT (Optional),
    RawData.WORD_WEIGHT (Optional),
]
labels

doc labels and word labels

features

embeddings generated from sequences of utterances and dictionary features of the last utterance

extra_fields

doc weights, word weights, etc.

Config[source]

alias of ContextualIntentSlotModelDataHandler.Config

classmethod from_config(config: pytext.data.contextual_intent_slot_data_handler.ContextualIntentSlotModelDataHandler.Config, feature_config: pytext.config.contextual_intent_slot.ModelInputConfig, target_config: List[Union[pytext.config.field_config.DocLabelConfig, pytext.config.field_config.WordLabelConfig]], **kwargs)[source]

Factory method to construct an instance of ContextualIntentSlotModelDataHandler object from the module’s config, model input config and target config.

Parameters:
  • config (Config) – Configuration object specifying all the parameters of ContextualIntentSlotModelDataHandler.
  • feature_config (ModelInputConfig) – Configuration object specifying model input.
  • target_config (TargetConfig) – Configuration object specifying target.
Returns:

An instance of ContextualIntentSlotModelDataHandler.

Return type:

type

preprocess_row(row_data: Dict[str, Any]) → Dict[str, Any][source]

Preprocess steps for a single input row: 1. apply tokenization to the sequence of utterances; 2. process dictionary features to align with the last utterance; 3. align word labels with the last utterance.

Parameters:row_data (Dict[str, Any]) – Dict of one row of data with column names as keys. Keys include "doc_label", "word_label", "text", "dict_feat", "word_weight" and "doc_weight".
Returns:Preprocessed dict of one row of data, including:
  • "seq_word_feat" (list of list of string) – tokenized words of the sequence of utterances
  • "word_feat" (list of string) – tokenized words of the last utterance
  • "raw_word_label" (string) – raw word label
  • "token_range" (list of tuple) – token ranges of word labels; each tuple contains the start position index and the end position index
  • "utterance" (list of string) – raw utterances
  • "word_label" (list of string) – list of labels of words in the last utterance
  • "doc_label" (string) – doc label for intent classification
  • "word_weight" (float) – weight of the word label
  • "doc_weight" (float) – weight of the document label
  • "dict_feat" (tuple, optional) – tuple of three lists: the label of each word, the weight of each feature, and the length of each feature
Return type:Dict[str, Any]
class pytext.data.BatchIterator(batches, processor, include_input=True, include_target=True, include_context=True, is_train=True, num_batches=-1)[source]

Bases: object

BatchIterator is a wrapper around a TorchText.Iterator that provides the flexibility to map batched data into a tuple of (input, target, context) and to perform additional steps such as handling distributed training.

Parameters:
  • batches (Iterator[TorchText.Batch]) – iterator of TorchText.Batch, which shuffles/batches the data in __iter__ and returns a batch of data in __next__
  • processor – function to run after getting batched data from the TorchText.Iterator; it should define how to map the data into (input, target, context)
  • include_input (bool) – whether input data should be returned, default is true
  • include_target (bool) – whether target data should be returned, default is true
  • include_context (bool) – whether context data should be returned, default is true
  • is_train (bool) – whether the batch data is for training
  • num_batches (int) – total number of batches to generate. This parameter exists for distributed training: PyTorch's distributed backend requires all parallel workers to have the same number of batches, so we work around it by adding dummy batches at the end
class pytext.data.CommonMetadata[source]

Bases: object

class pytext.data.DataHandler(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, **kwargs)[source]

Bases: pytext.config.component.Component

DataHandler is the central place to prepare data for model training/testing. The class is responsible for:

  • Defining the pipeline to process data and generate batches of tensors to be consumed by the model. Each batch is an (input, target, extra_data) tuple, in which input can be fed directly into the model.
  • Initializing the global context, such as building the vocab and loading pretrained embeddings. The context is stored as metadata, and functions are provided to serialize/deserialize the metadata.

The data processing pipeline contains the following steps:

  • Read data from file into a list of raw data examples.
  • Convert each row of raw data to a TorchText Example. This logic happens in the process_row function and will:
    • Invoke the featurizer, which contains data processing steps to apply at both training and inference time, e.g. tokenization.
    • Use the raw data and the results from the featurizer to do any preprocessing.
  • Generate a TorchText.Dataset that contains the list of Examples; the Dataset also has a list of TorchText.Fields, which define how to do padding and numericalization while batching data.
  • Return a BatchIterator which will give a tuple of (input, target, context) tensors for each iteration. By default the tensors have a 1:1 mapping to the TorchText.Field fields, but this behavior can be overridden by the _input_from_batch, _target_from_batch, and _context_from_batch functions.
raw_columns

List[str] – columns to read from data source. The order should match the data stored in that file.

featurizer

Featurizer – perform data preprocessing that should be shared between training and inference

features

Dict[str, Field] – a dict of name -> field that is used to process data as model input

labels

Dict[str, Field] – a dict of name -> field that is used to process data as training target

extra_fields

Dict[str, Field] – fields that process any extra data used neither as model input nor target. This is None by default

text_feature_name

str – name of the text field, used to define the default sort key of data

shuffle

bool – if the dataset should be shuffled, true by default

sort_within_batch

bool – if data within same batch should be sorted, true by default

train_path

str – path of training data file

eval_path

str – path of evaluation data file

test_path

str – path of test data file

train_batch_size

int – training batch size, 128 by default

eval_batch_size

int – evaluation batch size, 128 by default

test_batch_size

int – test batch size, 128 by default

max_seq_len

int – maximum length of tokens to keep in sequence

pass_index

bool – if the original index of data in the batch should be passed along to downstream steps, default is true

Config[source]

alias of DataHandler.Config

gen_dataset(data: List[Dict[str, Any]], include_label_fields: bool = True) → torchtext.data.dataset.Dataset[source]

Generate a torchtext Dataset from raw in-memory data.

Returns:dataset (TorchText.Dataset)

gen_dataset_from_path(path: str, include_label_fields: bool = True, use_cache: bool = True) → torchtext.data.dataset.Dataset[source]

Generate a dataset from a file.

Returns:dataset (TorchText.Dataset)

get_eval_iter()[source]
get_predict_iter(data: List[Dict[str, Any]])[source]
get_test_iter()[source]
get_test_iter_from_path(test_path: str, batch_size: int) → pytext.data.data_handler.BatchIterator[source]
get_test_iter_from_raw_data(test_data: List[Dict[str, Any]], batch_size: int, rank: int = 0, world_size: int = 1) → pytext.data.data_handler.BatchIterator[source]
get_train_iter(rank: int = 0, world_size: int = 1)[source]
get_train_iter_from_path(train_path: str, batch_size: int, rank: int = 0, world_size: int = 1) → pytext.data.data_handler.BatchIterator[source]

Generate data batch iterator for training data. See _get_train_iter() for details

Parameters:
  • train_path (str) – file path of training data
  • batch_size (int) – batch size
  • rank (int) – used for distributed training; the rank of the current GPU. Don't set it to anything but 0 for non-distributed training.
  • world_size (int) – used for distributed training; the total number of GPUs.
get_train_iter_from_raw_data(train_data: List[Dict[str, Any]], batch_size: int, rank: int = 0, world_size: int = 1) → pytext.data.data_handler.BatchIterator[source]
init_feature_metadata(train_data: torchtext.data.dataset.Dataset, eval_data: torchtext.data.dataset.Dataset, test_data: torchtext.data.dataset.Dataset)[source]
init_metadata()[source]

Initialize metadata using data from configured path

init_metadata_from_path(train_path, eval_path, test_path)[source]

Initialize metadata using data from file

init_metadata_from_raw_data(*data)[source]

Initialize metadata using in memory data

init_target_metadata(train_data: torchtext.data.dataset.Dataset, eval_data: torchtext.data.dataset.Dataset, test_data: torchtext.data.dataset.Dataset)[source]
load_metadata(metadata: pytext.data.data_handler.CommonMetadata)[source]

Load previously saved metadata

load_vocab(vocab_file, vocab_size, lowercase_tokens: bool = False)[source]

Loads items into a set from a file containing one item per line. Items are added to the set from the top of the file to the bottom, so the items in the file should be ordered by preference (if any); e.g., it makes sense to order tokens in descending order of their frequency in the corpus.

Parameters:
  • vocab_file (str) – vocab file to load
  • vocab_size (int) – maximum tokens to load, will only load the first n if the actual vocab size is larger than this parameter
  • lowercase_tokens (bool) – if the tokens should be lowercased
metadata_to_save()[source]

Metadata to save; pretrained_embeds_weight should be excluded.

preprocess(data: List[Dict[str, Any]])[source]

Preprocess the raw data to create TorchText.Example objects; this is the second step in the whole processing pipeline.

Returns:data (Generator[Dict[str, Any]])

preprocess_row(row_data: Dict[str, Any]) → Dict[str, Any][source]

Preprocess steps for a single input row; subclasses should override it.

static read_from_file(file_name: str, columns_to_use: Union[Dict[str, int], List[str]]) → List[Dict[str, Any]][source]

Read data from a CSV file. The input file is required to have tab-separated columns.

Parameters:
  • file_name (str) – csv file name
  • columns_to_use (Union[Dict[str, int], List[str]]) – either a list of column names or a dict of column name -> column index in the file
sort_key(example: torchtext.data.example.Example) → Any[source]

How to sort data in every batch; the default behavior is to sort by the length of the input text.

Parameters:example (Example) – one torchtext example

class pytext.data.JointModelDataHandler(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, **kwargs)[source]

Bases: pytext.data.data_handler.DataHandler

Config[source]

alias of JointModelDataHandler.Config

featurize(row_data: Dict[str, Any])[source]
classmethod from_config(config: pytext.data.joint_data_handler.JointModelDataHandler.Config, feature_config: pytext.config.field_config.FeatureConfig, label_configs: Union[pytext.config.field_config.DocLabelConfig, pytext.config.field_config.WordLabelConfig, List[Union[pytext.config.field_config.DocLabelConfig, pytext.config.field_config.WordLabelConfig]]], **kwargs)[source]
preprocess_row(row_data: Dict[str, Any]) → Dict[str, Any][source]

Preprocess steps for a single input row; subclasses should override it.

class pytext.data.LanguageModelDataHandler(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, **kwargs)[source]

Bases: pytext.data.data_handler.DataHandler

The LanguageModelDataHandler reads input sentences one line at a time and prepares the input and the target for language modeling. Each sentence is assumed to be independent of any other sentence.

Config[source]

alias of LanguageModelDataHandler.Config

classmethod from_config(config: pytext.data.language_model_data_handler.LanguageModelDataHandler.Config, feature_config: pytext.config.field_config.FeatureConfig, *args, **kwargs)[source]

Factory method to construct an instance of LanguageModelDataHandler from the module’s config object and feature config object.

Parameters:
  • config (LanguageModelDataHandler.Config) – Configuration object specifying all the parameters of LanguageModelDataHandler.
  • feature_config (FeatureConfig) – Configuration object specifying all the parameters of all input features.
Returns:

An instance of LanguageModelDataHandler.

Return type:

type

init_target_metadata(train_data: torchtext.data.dataset.Dataset, eval_data: torchtext.data.dataset.Dataset, test_data: torchtext.data.dataset.Dataset)[source]

Prepares the metadata for the language model target.

preprocess_row(row_data: Dict[str, Any]) → Dict[str, Any][source]

Preprocess steps for a single input row.

Parameters:row_data (Dict[str, Any]) – Dict representing the input row and columns.
Returns:Dictionary with feature names as keys and feature values.
Return type:Dict[str, Any]
class pytext.data.PairClassificationDataHandler(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, **kwargs)[source]

Bases: pytext.data.data_handler.DataHandler

Config[source]

alias of PairClassificationDataHandler.Config

classmethod from_config(config: pytext.data.pair_classification_data_handler.PairClassificationDataHandler.Config, feature_config: pytext.config.pair_classification.ModelInputConfig, target_config: pytext.config.field_config.DocLabelConfig, **kwargs)[source]
preprocess_row(row_data: Dict[str, Any]) → Dict[str, Any][source]

Preprocess steps for a single input row; subclasses should override it.

sort_key(example) → Any[source]

How to sort data in every batch; the default behavior is to sort by the length of the input text.

Parameters:example (Example) – one torchtext example

class pytext.data.SeqModelDataHandler(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, **kwargs)[source]

Bases: pytext.data.joint_data_handler.JointModelDataHandler

Config[source]

alias of SeqModelDataHandler.Config

FULL_FEATURES = ['word_feat']
classmethod from_config(config: pytext.data.seq_data_handler.SeqModelDataHandler.Config, feature_config: pytext.config.field_config.FeatureConfig, label_config: pytext.config.field_config.DocLabelConfig, **kwargs)[source]
preprocess_row(row_data: Dict[str, Any]) → Dict[str, Any][source]

Preprocess steps for a single input row; subclasses should override it.

class pytext.data.DocClassificationDataHandler(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, **kwargs)[source]

Bases: pytext.data.data_handler.DataHandler

The DocClassificationDataHandler prepares the data for document classification. Each sentence is read line by line with its label as the target.

Config[source]

alias of DocClassificationDataHandler.Config

classmethod from_config(config: pytext.data.doc_classification_data_handler.DocClassificationDataHandler.Config, model_input_config: pytext.config.doc_classification.ModelInputConfig, target_config: pytext.config.field_config.DocLabelConfig, **kwargs)[source]

Factory method to construct an instance of DocClassificationDataHandler from the module’s config object and feature config object.

Parameters:
  • config (DocClassificationDataHandler.Config) – Configuration object specifying all the parameters of DocClassificationDataHandler.
  • model_input_config (ModelInputConfig) – Configuration object specifying all the parameters of the model config.
  • target_config (TargetConfig) – Configuration object specifying all the parameters of the target.
Returns:

An instance of DocClassificationDataHandler.

Return type:

type

preprocess_row(row_data: Dict[str, Any]) → Dict[str, Any][source]

Preprocess steps for a single input row; subclasses should override it.

class pytext.data.RawData[source]

Bases: object

DICT_FEAT = 'dict_feat'
DOC_LABEL = 'doc_label'
TEXT = 'text'
class pytext.data.DisjointMultitaskDataHandler(config: pytext.data.disjoint_multitask_data_handler.DisjointMultitaskDataHandler.Config, data_handlers: Dict[str, pytext.data.data_handler.DataHandler], *args, **kwargs)[source]

Bases: pytext.data.data_handler.DataHandler

Wrapper for doing multitask training using multiple data handlers. Takes a dictionary of data handlers, does round robin over their iterators using RoundRobinBatchIterator.

Parameters:
  • config (Config) – Configuration object of type DisjointMultitaskDataHandler.Config.
  • data_handlers (Dict[str, DataHandler]) – Data handlers to do roundrobin over.
  • *args (type) – Extra arguments to be passed down to sub data handlers.
  • **kwargs (type) – Extra arguments to be passed down to sub data handlers.
data_handlers

type – Data handlers to do roundrobin over.

epoch_size

Optional[int] – Size of epoch in number of batches. If not set, do a single pass over the training data.

Config[source]

alias of DisjointMultitaskDataHandler.Config

get_eval_iter() → pytext.data.data_handler.BatchIterator[source]
get_test_iter() → pytext.data.data_handler.BatchIterator[source]
get_train_iter(rank: int = 0, world_size: int = 1) → Tuple[pytext.data.data_handler.BatchIterator, ...][source]
init_metadata()[source]

Initialize metadata using data from configured path

load_metadata(metadata)[source]

Load previously saved metadata

metadata_to_save()[source]

Metadata to save; pretrained_embeds_weight should be excluded.

class pytext.data.KDDocClassificationDataHandler(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, **kwargs)[source]

Bases: pytext.data.doc_classification_data_handler.DocClassificationDataHandler

The KDDocClassificationDataHandler prepares the data for knowledge distillation document classification. Each sentence is read line by line with its label as the target.

Config[source]

alias of KDDocClassificationDataHandler.Config

classmethod from_config(config: pytext.data.kd_doc_classification_data_handler.KDDocClassificationDataHandler.Config, model_input_config: pytext.config.kd_doc_classification.ModelInputConfig, target_config: pytext.config.field_config.DocLabelConfig, **kwargs)[source]

Factory method to construct an instance of KDDocClassificationDataHandler from the module’s config object and feature config object.

Parameters:
  • config (KDDocClassificationDataHandler.Config) – Configuration object specifying all the parameters of KDDocClassificationDataHandler.
  • model_input_config (ModelInputConfig) – Configuration object specifying all the parameters of the model config.
  • target_config (TargetConfig) – Configuration object specifying all the parameters of the target.
Returns:

An instance of KDDocClassificationDataHandler.

Return type:

type

init_target_metadata(train_data: torchtext.data.dataset.Dataset, eval_data: torchtext.data.dataset.Dataset, test_data: torchtext.data.dataset.Dataset)[source]
preprocess_row(row_data: Dict[str, Any]) → Dict[str, Any][source]

Preprocess steps for a single input row; subclasses should override it.