matchzoo

Package Contents

matchzoo.USER_DIR
matchzoo.USER_DATA_DIR
matchzoo.USER_TUNED_MODELS_DIR
matchzoo.__version__ = 0.0.1
class matchzoo.DataPack(relation:pd.DataFrame, left:pd.DataFrame, right:pd.DataFrame)

Bases: object

MatchZoo DataPack data structure; stores data frames and context.

DataPack is a MatchZoo native data structure that most MatchZoo data handling processes build upon. A DataPack consists of three parts: left, right and relation, each of which is a pandas.DataFrame.

Parameters:
  • relation – Store the relation between the left document and the right document using their ids.
  • left – Store the content or features for id_left.
  • right – Store the content or features for id_right.

Example

>>> left = [
...     ['qid1', 'query 1'],
...     ['qid2', 'query 2']
... ]
>>> right = [
...     ['did1', 'document 1'],
...     ['did2', 'document 2']
... ]
>>> relation = [['qid1', 'did1', 1], ['qid2', 'did2', 1]]
>>> relation_df = pd.DataFrame(relation)
>>> left = pd.DataFrame(left)
>>> right = pd.DataFrame(right)
>>> dp = DataPack(
...     relation=relation_df,
...     left=left,
...     right=right,
... )
>>> len(dp)
2
class FrameView(data_pack:'DataPack')

Bases: object

FrameView.

__getitem__(self, index:typing.Union[int, slice, np.array])

Slicer.

__call__(self)
Returns:A full copy. Equivalent to frame[:].
DATA_FILENAME = data.dill
has_label

True if the label column exists, False otherwise.

Type:return
frame

View the data pack as a pandas.DataFrame.

The returned data frame is created by merging the left data frame, the right data frame and the relation data frame. Use [] to access an item or a slice of items.

Returns:A matchzoo.DataPack.FrameView instance.

Example

>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> type(data_pack.frame)
<class 'matchzoo.data_pack.data_pack.DataPack.FrameView'>
>>> frame_slice = data_pack.frame[0:5]
>>> type(frame_slice)
<class 'pandas.core.frame.DataFrame'>
>>> list(frame_slice.columns)
['id_left', 'text_left', 'id_right', 'text_right', 'label']
>>> full_frame = data_pack.frame()
>>> len(full_frame) == len(data_pack)
True
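
The merge that frame performs can be pictured with a minimal sketch. This is not MatchZoo's implementation: plain dictionaries stand in for the three data frames, and merge_frame is a hypothetical helper named here only for illustration. The column names follow the doctest output above.

```python
# Illustrative sketch only: plain dicts stand in for the left and right
# data frames, and a list of tuples stands in for the relation data frame.
left = {'qid1': 'query 1', 'qid2': 'query 2'}
right = {'did1': 'document 1', 'did2': 'document 2'}
relation = [('qid1', 'did1', 1), ('qid2', 'did2', 0)]


def merge_frame(relation, left, right):
    """Join each relation row with the left and right text it points to."""
    return [
        {
            'id_left': id_left,
            'text_left': left[id_left],
            'id_right': id_right,
            'text_right': right[id_right],
            'label': label,
        }
        for id_left, id_right, label in relation
    ]


frame = merge_frame(relation, left, right)
print(frame[0])
```

Each merged row carries both ids, both texts and the label, which is why a slice of the frame exposes the five columns listed in the example.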
relation

relation getter.

left

Get left() of DataPack.

right

Get right() of DataPack.

__len__(self)

Get the number of rows in the DataPack object.

unpack(self)

Unpack the data for training.

The return value can be fed directly to model.fit or model.fit_generator.

Returns:A tuple of (X, y). y is None if self has no label.

Example

>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> X, y = data_pack.unpack()
>>> type(X)
<class 'dict'>
>>> sorted(X.keys())
['id_left', 'id_right', 'text_left', 'text_right']
>>> type(y)
<class 'numpy.ndarray'>
>>> X, y = data_pack.drop_label().unpack()
>>> type(y)
<class 'NoneType'>
__getitem__(self, index:typing.Union[int, slice, np.array])

Get specific item(s) as a new DataPack.

The returned DataPack will be a copy of the subset of the original DataPack.

Parameters:index – Index of the item(s) to get.
Returns:An instance of DataPack.
copy(self)
Returns:A deep copy.
save(self, dirpath:typing.Union[str, Path])

Save the DataPack object.

A saved DataPack is represented as a directory containing a DataPack object (transformed user input as features and context). The object is saved by pickle.

Parameters:dirpath – directory path of the saved DataPack.
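
The directory-based layout can be sketched with the standard library alone. This is an assumption-laden illustration rather than MatchZoo's code: save_object and load_object are hypothetical helpers, the payload is a plain dict instead of a DataPack, and only the file name is taken from the DATA_FILENAME attribute listed above.

```python
import pickle
import tempfile
from pathlib import Path

DATA_FILENAME = 'data.dill'  # file name taken from the attribute listed above


def save_object(obj, dirpath):
    """Write obj into dirpath/DATA_FILENAME, creating the directory if needed."""
    dirpath = Path(dirpath)
    # Creating missing directories is a choice of this sketch, not a
    # statement about how DataPack.save treats existing paths.
    dirpath.mkdir(parents=True, exist_ok=True)
    with open(dirpath / DATA_FILENAME, 'wb') as f:
        pickle.dump(obj, f)


def load_object(dirpath):
    """Read back the object stored by save_object."""
    with open(Path(dirpath) / DATA_FILENAME, 'rb') as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'my_data_pack'
    payload = {'relation': [('qid1', 'did1', 1)]}
    save_object(payload, target)
    restored = load_object(target)
```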
_optional_inplace(func)

Decorator that adds an inplace keyword argument to a method.

Decorate any method that modifies inplace to make that inplace change optional.
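
A decorator of this kind might look like the following sketch. It is written from the description above, not copied from MatchZoo: optional_inplace and the Box class are hypothetical names, and the choice to return None for in-place calls mirrors what the doctests in this section suggest.

```python
import copy
import functools


def optional_inplace(func):
    """Add an `inplace` keyword argument to a method that mutates self."""
    @functools.wraps(func)
    def wrapper(self, *args, inplace=False, **kwargs):
        # Work on self when inplace is requested, otherwise on a deep copy.
        target = self if inplace else copy.deepcopy(self)
        func(target, *args, **kwargs)
        # Return the modified copy only when the call was not in place.
        return None if inplace else target
    return wrapper


class Box:
    """Hypothetical container used only to exercise the decorator."""

    def __init__(self, items):
        self.items = list(items)

    @optional_inplace
    def clear(self):
        self.items.clear()


box = Box([1, 2, 3])
emptied = box.clear()              # box is untouched, emptied is a copy
snapshot = list(box.items)
result = box.clear(inplace=True)   # box itself is modified
```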

shuffle(self)

Shuffle the data pack by shuffling the relation column.

Parameters:inplace – True to modify inplace, False to return a modified copy. (default: False)

Example

>>> import matchzoo as mz
>>> import numpy.random
>>> numpy.random.seed(0)
>>> data_pack = mz.datasets.toy.load_data()
>>> orig_ids = data_pack.relation['id_left']
>>> shuffled = data_pack.shuffle()
>>> (shuffled.relation['id_left'] != orig_ids).any()
True
drop_label(self)

Remove label column from the data pack.

Parameters:inplace – True to modify inplace, False to return a modified copy. (default: False)

Example

>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> data_pack.has_label
True
>>> data_pack.drop_label(inplace=True)
>>> data_pack.has_label
False
append_text_length(self, verbose=1)

Append length_left and length_right columns.

Parameters:
  • inplace – True to modify inplace, False to return a modified copy. (default: False)
  • verbose – Verbosity.

Example

>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> 'length_left' in data_pack.frame[0].columns
False
>>> new_data_pack = data_pack.append_text_length(verbose=0)
>>> 'length_left' in new_data_pack.frame[0].columns
True
>>> 'length_left' in data_pack.frame[0].columns
False
>>> data_pack.append_text_length(inplace=True, verbose=0)
>>> 'length_left' in data_pack.frame[0].columns
True
apply_on_text(self, func:typing.Callable, mode:str='both', rename:typing.Optional[str]=None, verbose:int=1)

Apply func to text columns based on mode.

Parameters:
  • func – The function to apply.
  • mode – One of “both”, “left” and “right”.
  • rename – If set, use new names for results instead of replacing the original columns. To set rename in “both” mode, use a tuple of str, e.g. (“text_left_new_name”, “text_right_new_name”).
  • inplace – True to modify inplace, False to return a modified copy. (default: False)
  • verbose – Verbosity.

Examples

>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> frame = data_pack.frame

To apply len on the left text and add the result as ‘length_left’:

>>> data_pack.apply_on_text(len, mode='left',
...                         rename='length_left',
...                         inplace=True,
...                         verbose=0)
>>> list(frame[0].columns) # noqa: E501
['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'label']

To do the same to the right text:

>>> data_pack.apply_on_text(len, mode='right',
...                         rename='length_right',
...                         inplace=True,
...                         verbose=0)
>>> list(frame[0].columns) # noqa: E501
['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'length_right', 'label']

To do the same to both texts at the same time:

>>> data_pack.apply_on_text(len, mode='both',
...                         rename=('extra_left', 'extra_right'),
...                         inplace=True,
...                         verbose=0)
>>> list(frame[0].columns) # noqa: E501
['id_left', 'text_left', 'length_left', 'extra_left', 'id_right', 'text_right', 'length_right', 'extra_right', 'label']

To suppress outputs:

>>> data_pack.apply_on_text(len, mode='both', verbose=0,
...                         inplace=True)
_apply_on_text_right(self, func, rename, verbose=1)
_apply_on_text_left(self, func, rename, verbose=1)
_apply_on_text_both(self, func, rename, verbose=1)
matchzoo.load_data_pack(dirpath:typing.Union[str, Path]) → DataPack

Load a DataPack. The reverse function of save().

Parameters:dirpath – directory path of the saved DataPack.
Returns:a DataPack instance.
matchzoo.chain_transform(units:typing.List[Unit]) → typing.Callable

Compose unit transformations into a single function.

Parameters:units – List of matchzoo.StatelessUnit.
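
Composing units into one callable can be sketched as below. The sketch assumes that each unit exposes a transform method; Lowercase, Tokenize and chain are hypothetical stand-ins written for this illustration, not MatchZoo classes.

```python
class Lowercase:
    """Hypothetical stateless unit: lowercases a string."""

    def transform(self, text):
        return text.lower()


class Tokenize:
    """Hypothetical stateless unit: splits a string on whitespace."""

    def transform(self, text):
        return text.split()


def chain(units):
    """Compose the units' transform methods, applied from first to last."""
    def wrapped(value):
        for unit in units:
            value = unit.transform(value)
        return value
    return wrapped


pipeline = chain([Lowercase(), Tokenize()])
tokens = pipeline('Hello MatchZoo World')
print(tokens)
```

A single callable of this shape is convenient to pass wherever one function is expected, for example as the func argument of apply_on_text.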
matchzoo.load_preprocessor(dirpath:typing.Union[str, Path]) → 'mz.DataPack'

Load the fitted context. The reverse function of save().

Parameters:dirpath – directory path of the saved preprocessor.
Returns:a DSSMPreprocessor instance.
class matchzoo.Param(name:str, value:typing.Any=None, hyper_space:typing.Optional[SpaceType]=None, validator:typing.Optional[typing.Callable[[typing.Any], bool]]=None, desc:typing.Optional[str]=None)

Bases: object

Parameter class.

Basic usages with a name and value:

>>> param = Param('my_param', 10)
>>> param.name
'my_param'
>>> param.value
10

Use with a validator to make sure the parameter always keeps a valid value.

>>> param = Param(
...     name='my_param',
...     value=5,
...     validator=lambda x: 0 < x < 20
... )
>>> param.validator  # doctest: +ELLIPSIS
<function <lambda> at 0x...>
>>> param.value
5
>>> param.value = 10
>>> param.value
10
>>> param.value = -1
Traceback (most recent call last):
    ...
ValueError: Validator not satifised.
The validator's definition is as follows:
validator=lambda x: 0 < x < 20

Use with a hyper space. Setting up a hyper space for a parameter makes the parameter tunable in a matchzoo.engine.Tuner.

>>> from matchzoo.engine.hyper_spaces import quniform
>>> param = Param(
...     name='positive_num',
...     value=1,
...     hyper_space=quniform(low=1, high=5)
... )
>>> param.hyper_space  # doctest: +ELLIPSIS
<matchzoo.engine.hyper_spaces.quniform object at ...>
>>> from hyperopt.pyll.stochastic import sample
>>> hyperopt_space = param.hyper_space.convert(param.name)
>>> samples = [sample(hyperopt_space) for _ in range(64)]
>>> set(samples) == {1, 2, 3, 4, 5}
True

The boolean value of a Param instance is True only when the value is not None. This is because some falsy values, such as zero or an empty list, are valid parameter values. In other words, the boolean value answers the question “is the parameter value filled?”.

>>> param = Param('dropout')
>>> if param:
...     print('OK')
>>> param = Param('dropout', 0)
>>> if param:
...     print('OK')
OK

If the value is set as a number, a _pre_assignment_hook is initialized as a data type converter to keep the data type of the parameter consistent. This conversion supports Python built-in numbers, numpy numbers, and any number that inherits from numbers.Number.

>>> param = Param('float_param', 0.5)
>>> param.value = 10
>>> param.value
10.0
>>> type(param.value)
<class 'float'>
name

Name of the parameter.

Type:return
value

Value of the parameter.

Type:return
hyper_space

Hyper space of the parameter.

Type:return
validator

Validator of the parameter.

Type:return
desc

Parameter description.

Type:return
_infer_pre_assignment_hook(self)
_validate(self, value)
__bool__(self)
Returns:False when the value is None, True otherwise.
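
The validator, the numeric type conversion and the boolean behaviour described for Param can be pictured together in a small sketch. MiniParam is a hypothetical stand-in written from the descriptions above; it is not MatchZoo's implementation and its error message is its own.

```python
import numbers


class MiniParam:
    """Hypothetical stand-in for Param; not MatchZoo's implementation."""

    def __init__(self, name, value=None, validator=None):
        self.name = name
        self._validator = validator
        # If the initial value is a number, remember its type as a converter.
        self._hook = type(value) if isinstance(value, numbers.Number) else None
        self._value = None
        if value is not None:  # skip checks when no initial value is given
            self.value = value

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        if self._hook is not None and new_value is not None:
            new_value = self._hook(new_value)  # keep the original numeric type
        if self._validator is not None and not self._validator(new_value):
            raise ValueError('Validator not satisfied.')
        self._value = new_value

    def __bool__(self):
        # "Filled" means "not None", so 0 and [] still count as filled.
        return self._value is not None


param = MiniParam('float_param', 0.5, validator=lambda x: 0 <= x <= 100)
param.value = 10               # converted to float by the hook
try:
    param.value = -1           # rejected by the validator
    rejected = False
except ValueError:
    rejected = True

empty = MiniParam('dropout')
zero = MiniParam('dropout', 0)
```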
set_default(self, val, verbose=1)

Set the default value. Has no effect if the parameter already has a value.

Parameters:
  • val – Default value to set.
  • verbose – Verbosity.
reset(self)

Set the parameter’s value to None, which means “not set”.

This method bypasses validator.

Example

>>> import matchzoo as mz
>>> param = mz.Param(
...     name='str', validator=lambda x: isinstance(x, str))
>>> param.value = 'hello'
>>> param.value = None
Traceback (most recent call last):
    ...
ValueError: Validator not satifised.
The validator's definition is as follows:
name='str', validator=lambda x: isinstance(x, str))
>>> param.reset()
>>> param.value is None
True
class matchzoo.ParamTable

Bases: object

Parameter table class.

Example

>>> params = ParamTable()
>>> params.add(Param('ham', 'Parma Ham'))
>>> params.add(Param('egg', 'Over Easy'))
>>> params['ham']
'Parma Ham'
>>> params['egg']
'Over Easy'
>>> print(params)
ham                           Parma Ham
egg                           Over Easy
>>> params.add(Param('egg', 'Sunny side Up'))
Traceback (most recent call last):
    ...
ValueError: Parameter named egg already exists.
To re-assign parameter egg value, use `params["egg"] = value` instead.
hyper_space

Hyper space of the table, a valid hyperopt graph.

Type:return
add(self, param:Param)
Parameters:param – parameter to add.
get(self, key)
Returns:The parameter in the table named key.
set(self, key, param:Param)

Set key to parameter param.

to_frame(self)

Convert the parameter table into a pandas data frame.

Returns:A pandas.DataFrame.

Example

>>> import matchzoo as mz
>>> table = mz.ParamTable()
>>> table.add(mz.Param(name='x', value=10, desc='my x'))
>>> table.add(mz.Param(name='y', value=20, desc='my y'))
>>> table.to_frame()
  Name Description  Value Hyper-Space
0    x        my x     10        None
1    y        my y     20        None
__getitem__(self, key:str)
Returns:The value of the parameter in the table named key.
__setitem__(self, key:str, value:typing.Any)

Set the value of the parameter named key.

Parameters:
  • key – Name of the parameter.
  • value – New value of the parameter to set.
__str__(self)
Returns:Pretty formatted parameter table.
__iter__(self)
Returns:An iterator that iterates over all parameter instances.
completed(self)
Returns:True if all params are filled, False otherwise.

Example

>>> import matchzoo
>>> model = matchzoo.models.DenseBaseline()
>>> model.params.completed()
False
keys(self)
Returns:Parameter table keys.
__contains__(self, item)
Returns:True if parameter in parameters.
update(self, other:dict)

Update self.

Update self with the key/value pairs from other, overwriting existing keys. Notice that this does not add new keys to self.

This method is usually used by models to obtain useful information from a preprocessor’s context.

Parameters:other – The dictionary used for the update.
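
The difference from dict.update is that unknown keys are ignored. A minimal sketch of that behaviour, using plain dictionaries and hypothetical key names (input_shapes, task, vocab_size) chosen only for illustration:

```python
def update_existing(params, other):
    """Overwrite values for keys already in params; ignore unknown keys."""
    for key, value in other.items():
        if key in params:
            params[key] = value


# Hypothetical parameter table and preprocessor context.
params = {'input_shapes': None, 'task': 'ranking'}
context = {'input_shapes': [(30,), (30,)], 'vocab_size': 1000}

update_existing(params, context)
print(params)
```

Here input_shapes is filled from the context, task keeps its value, and vocab_size is not added because the table never declared it.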

Example

>>> import matchzoo as mz
>>> model = mz.models.DenseBaseline()
>>> prpr = model.get_default_preprocessor()
>>> _ = prpr.fit(mz.datasets.toy.load_data(), verbose=0)
>>> model.params.update(prpr.context)
class matchzoo.Embedding(data:dict, output_dim:int)

Bases: object

Embedding class.

Examples

>>> import matchzoo as mz
>>> train_raw = mz.datasets.toy.load_data()
>>> pp = mz.preprocessors.NaivePreprocessor()
>>> train = pp.fit_transform(train_raw, verbose=0)
>>> vocab_unit = mz.build_vocab_unit(train, verbose=0)
>>> term_index = vocab_unit.state['term_index']
>>> embed_path = mz.datasets.embeddings.EMBED_RANK

To load from a file:

>>> embedding = mz.embedding.load_from_file(embed_path)
>>> matrix = embedding.build_matrix(term_index)
>>> matrix.shape[0] == len(term_index)
True

To build your own:

>>> data = {'A':[0, 1], 'B':[2, 3]}
>>> embedding = mz.Embedding(data, 2)
>>> matrix = embedding.build_matrix({'A': 2, 'B': 1, '_PAD': 0})
>>> matrix.shape == (3, 2)
True
build_matrix(self, term_index:typing.Union[dict, mz.preprocessors.units.Vocabulary.TermIndex], initializer=lambda: np.random.uniform(-0.2, 0.2))

Build a matrix using term_index.

Parameters:
  • term_index – A dict or TermIndex to build with.
  • initializer – A callable that returns a default value for missing terms in data. (default: a random uniform distribution in range (-0.2, 0.2)).
Returns:

A matrix.
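
How term_index and initializer interact can be sketched without numpy. This is an illustration under stated assumptions, not the library's code: the function below is a free-standing hypothetical helper, it uses nested lists instead of an array, and it assumes the indices in term_index run from 0 to len(term_index) - 1.

```python
import random


def build_matrix(data, output_dim, term_index,
                 initializer=lambda: random.uniform(-0.2, 0.2)):
    """Return one row per index in term_index; rows are lists of floats."""
    # Start with every cell drawn from the initializer.
    matrix = [[initializer() for _ in range(output_dim)]
              for _ in range(len(term_index))]
    # Overwrite the rows of terms that have a known vector in data.
    for term, index in term_index.items():
        if term in data:
            matrix[index] = list(data[term])
    return matrix


data = {'A': [0, 1], 'B': [2, 3]}
matrix = build_matrix(data, 2, {'A': 2, 'B': 1, '_PAD': 0})
```

Rows for 'A' and 'B' come from data, while the row for '_PAD', which has no entry in data, keeps its initializer values.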

matchzoo.build_unit_from_data_pack(unit:StatefulUnit, data_pack:mz.DataPack, mode:str='both', flatten:bool=True, verbose:int=1) → StatefulUnit

Build a StatefulUnit from a DataPack object.

Parameters:
  • unit – StatefulUnit object to be built.
  • data_pack – The input DataPack object.
  • mode – One of ‘left’, ‘right’, and ‘both’, to determine the source data for building the VocabularyUnit.
  • flatten – Flatten the datapack or not. True to organize the DataPack text as a list, and False to organize DataPack text as a list of lists.
  • verbose – Verbosity.
Returns:

A built StatefulUnit object.
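
The effect of flatten on the text handed to the unit can be shown with two tokenized texts. The variable names are hypothetical, and the sketch only illustrates the two shapes described above.

```python
from itertools import chain

# Two already-tokenized texts, as they might appear in a text column.
texts = [['deep', 'learning'], ['text', 'matching', 'models']]

# flatten=True: one flat list of tokens across all texts.
flattened = list(chain.from_iterable(texts))

# flatten=False: one list per text, i.e. a list of lists.
nested = [list(tokens) for tokens in texts]

print(flattened)
print(nested)
```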

matchzoo.build_vocab_unit(data_pack:DataPack, mode:str='both', verbose:int=1) → Vocabulary

Build a preprocessor.units.Vocabulary given data_pack.

The data_pack should be preprocessed beforehand, and each item in the text_left and text_right columns of the data_pack should be a list of tokens.

Parameters:
  • data_pack – The DataPack to build vocabulary upon.
  • mode – One of ‘left’, ‘right’, and ‘both’, to determine the source data for building the VocabularyUnit.
  • verbose – Verbosity.
Returns:A built vocabulary unit.
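
The role of mode and the shape of a term index can be sketched as follows. Both helpers are hypothetical and written only for illustration; reserving index 0 for a '_PAD' entry is an assumption borrowed from the Embedding example on this page, not a statement about how Vocabulary assigns indices.

```python
def select_texts(left_texts, right_texts, mode='both'):
    """Pick the token lists that feed the vocabulary, depending on mode."""
    if mode == 'left':
        return list(left_texts)
    if mode == 'right':
        return list(right_texts)
    if mode == 'both':
        return list(left_texts) + list(right_texts)
    raise ValueError(f"mode must be 'left', 'right' or 'both', got {mode!r}")


def build_term_index(token_lists, pad='_PAD'):
    """Map each distinct token to an integer, keeping 0 for padding."""
    term_index = {pad: 0}
    for tokens in token_lists:
        for token in tokens:
            if token not in term_index:
                term_index[token] = len(term_index)
    return term_index


left_texts = [['what', 'is', 'matchzoo']]
right_texts = [['matchzoo', 'is', 'a', 'toolkit']]

term_index = build_term_index(select_texts(left_texts, right_texts, mode='both'))
left_only = build_term_index(select_texts(left_texts, right_texts, mode='left'))
```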