matchzoo.dataloader

Package Contents

class matchzoo.dataloader.Dataset(data_pack: mz.DataPack, mode='point', num_dup: int = 1, num_neg: int = 1, callbacks: typing.List[BaseCallback] = None)

Bases: torch.utils.data.Dataset

Dataset that is built from a data pack.

Parameters:
  • data_pack – DataPack to build the dataset.
  • mode – One of “point”, “pair”, and “list”. (default: “point”)
  • num_dup – Number of duplications per instance, only effective when mode is “pair”. (default: 1)
  • num_neg – Number of negative samples per instance, only effective when mode is “pair”. (default: 1)
  • callbacks – Callbacks. See matchzoo.dataloader.callbacks for more details.

Examples

>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data(stage='train')
>>> preprocessor = mz.preprocessors.CDSSMPreprocessor()
>>> data_processed = preprocessor.fit_transform(data_pack)
>>> dataset_point = mz.dataloader.Dataset(data_processed, mode='point')
>>> len(dataset_point)
100
>>> dataset_pair = mz.dataloader.Dataset(
...     data_processed, mode='pair', num_neg=2)
>>> len(dataset_pair)
5
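
Individual instances can be inspected by indexing the dataset. A minimal sketch, assuming __getitem__ (documented below) returns the unpacked (x, y) pair for the selected instances, where x is a dict keyed by field name and y holds the labels; in pair mode each index is expected to map to one positive together with its num_neg sampled negatives:

>>> x, y = dataset_point[0]    # one point-wise instance (assumed (x, y) return)
>>> x, y = dataset_pair[0]     # one positive plus its num_neg sampled negatives
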
data_pack

data_pack getter.

callbacks

callbacks getter.

num_neg

num_neg getter.

num_dup

num_dup getter.

mode

mode getter.

index_pool

index_pool getter.

__len__(self)

Get the total number of instances.

__getitem__(self, item: int)

Get the instance(s) at the given index.

Parameters:
  • item – the index of the instance.

_handle_callbacks_on_batch_data_pack(self, batch_data_pack)

_handle_callbacks_on_batch_unpacked(self, x, y)

get_index_pool(self)

Set the _index_pool attribute.

Here _index_pool records the indices of all the instances.

sample(self)

Resample the instances from the data pack.

shuffle(self)

Shuffle the instances.

sort(self)

Sort the instances by length_right.
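
sample, shuffle and sort are the per-epoch maintenance hooks that DataLoader.init_epoch (below) drives through its resample, shuffle and sort flags. A sketch of calling them directly on the datasets built above:

>>> dataset_pair.sample()     # re-draw the negative samples (pair mode)
>>> dataset_pair.shuffle()    # randomize instance order
>>> dataset_point.sort()      # order instances by length_right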

classmethod _reorganize_pair_wise(cls, relation: pd.DataFrame, num_dup: int = 1, num_neg: int = 1)

Re-organize the data pack into pair-wise format.

class matchzoo.dataloader.DataLoader(dataset: data.Dataset, batch_size: int = 32, device: typing.Optional[torch.device] = None, stage='train', resample: bool = True, shuffle: bool = False, sort: bool = True, callback: BaseCallback = None, pin_memory: bool = False, timeout: int = 0, num_workers: int = 0, worker_init_fn=None)

Bases: object

DataLoader that loads batches of data from a Dataset.

Parameters:
  • dataset – The Dataset object to load data from.
  • batch_size – Batch size. (default: 32)
  • device – An instance of torch.device specifying which device the Variables are going to be created on.
  • stage – One of “train”, “dev”, and “test”. (default: “train”)
  • resample – Whether to resample data between epochs. Only effective when the dataset's mode is “pair”. (default: True)
  • shuffle – Whether to shuffle data between epochs. (default: False)
  • sort – Whether to sort data according to length_right. (default: True)
  • callback – BaseCallback. See matchzoo.engine.base_callback.BaseCallback for more details.
  • pin_memory – If set to True, tensors will be copied into pinned memory. (default: False)
  • timeout – The timeout value for collecting a batch from workers. (default: 0)
  • num_workers – The number of subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 0)
  • worker_init_fn – If not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading. (default: None)

Examples

>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data(stage='train')
>>> preprocessor = mz.preprocessors.CDSSMPreprocessor()
>>> data_processed = preprocessor.fit_transform(data_pack)
>>> dataset = mz.dataloader.Dataset(data_processed, mode='point')
>>> padding_callback = mz.dataloader.callbacks.CDSSMPadding()
>>> dataloader = mz.dataloader.DataLoader(
...     dataset, stage='train', callback=padding_callback)
>>> len(dataloader)
4
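
Iterating over the loader yields one batch at a time. A sketch, assuming each batch unpacks into a dict of padded tensors keyed by field name (e.g. 'text_left') together with a label tensor:

>>> for batch_x, batch_y in dataloader:    # doctest: +SKIP
...     # batch_x assumed to be a dict of padded tensors, batch_y the labels
...     print(batch_x['text_left'].shape, batch_y.shape)
...     break
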
id_left

id_left getter.

label

label getter.

__len__(self)

Get the total number of batches.

init_epoch(self)

Resample, shuffle or sort the dataset for a new epoch.

__iter__(self)

Iteration.

_handle_callbacks_on_batch_unpacked(self, x, y)
class matchzoo.dataloader.DataLoaderBuilder(**kwargs)

Bases: object

DataLoader builder. In essence, a wrapped partial function.

Example

>>> import matchzoo as mz
>>> padding_callback = mz.dataloader.callbacks.CDSSMPadding()
>>> builder = mz.dataloader.DataLoaderBuilder(
...     stage='train', callback=padding_callback
... )
>>> data_pack = mz.datasets.toy.load_data()
>>> preprocessor = mz.preprocessors.CDSSMPreprocessor()
>>> data_processed = preprocessor.fit_transform(data_pack)
>>> dataset = mz.dataloader.Dataset(data_processed, mode='point')
>>> dataloader = builder.build(dataset)
>>> type(dataloader)
<class 'matchzoo.dataloader.dataloader.DataLoader'>
build(self, dataset, **kwargs)

Build a DataLoader.

Parameters:
  • dataset – Dataset to build upon.
  • kwargs – Additional keyword arguments to override the keyword arguments passed in __init__.
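
Keyword arguments supplied at build time take precedence over those stored by the builder, so one builder can produce loaders for different stages or batch sizes. Continuing the example above (batch_size and stage are ordinary DataLoader parameters, used here purely for illustration):

>>> dev_loader = builder.build(dataset, stage='dev', batch_size=64)
>>> type(dev_loader)
<class 'matchzoo.dataloader.dataloader.DataLoader'>
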
class matchzoo.dataloader.DatasetBuilder(**kwargs)

Bases: object

Dataset builder. In essence, a wrapped partial function.

Example

>>> import matchzoo as mz
>>> builder = mz.dataloader.DatasetBuilder(
...     mode='point'
... )
>>> data = mz.datasets.toy.load_data()
>>> gen = builder.build(data)
>>> type(gen)
<class 'matchzoo.dataloader.dataset.Dataset'>
build(self, data_pack, **kwargs)

Build a Dataset.

Parameters:
  • data_pack – DataPack to build upon.
  • kwargs – Additional keyword arguments to override the keyword arguments passed in __init__.
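
As with DataLoaderBuilder, keyword arguments given to build override those from __init__. Continuing the example above, the same builder can also emit a pair-wise dataset (mode and num_neg are the Dataset parameters documented earlier):

>>> pair_set = builder.build(data, mode='pair', num_neg=2)
>>> type(pair_set)
<class 'matchzoo.dataloader.dataset.Dataset'>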