`matchzoo`¶

Subpackages¶

Submodules¶

matchzoo.version

Package Contents¶

Classes¶

`DataPack`	MatchZoo `DataPack` data structure; stores data frames and context.
`Param`	Parameter class.
`ParamTable`	Parameter table class.
`Embedding`	Embedding class.

Functions¶

`load_data_pack`(dirpath: typing.Union[str, Path]) → DataPack	Load a `DataPack`. The reverse function of `save()`.
`chain_transform`(units: typing.List[Unit]) → typing.Callable	Compose unit transformations into a single function.
`load_preprocessor`(dirpath: typing.Union[str, Path]) → ‘mz.DataPack’	Load the fitted context. The reverse function of `save()`.
`build_unit_from_data_pack`(unit: StatefulUnit, data_pack: mz.DataPack, mode: str = ‘both’, flatten: bool = True, verbose: int = 1) → StatefulUnit	Build a `StatefulUnit` from a `DataPack` object.
`build_vocab_unit`(data_pack: DataPack, mode: str = ‘both’, verbose: int = 1) → Vocabulary	Build a `preprocessor.units.Vocabulary` given data_pack.

matchzoo.USER_DIR¶

matchzoo.USER_DATA_DIR¶

matchzoo.USER_TUNED_MODELS_DIR¶

matchzoo.__version__ = 1.1.1¶

class matchzoo.DataPack(relation: pd.DataFrame, left: pd.DataFrame, right: pd.DataFrame)¶

Bases: object

MatchZoo DataPack data structure; stores data frames and context.

DataPack is a MatchZoo native data structure that most MatchZoo data handling processes build upon. A DataPack consists of three parts: left, right and relation, each of which is a pandas.DataFrame.

Parameters

relation – Stores the relation between left documents and right documents using ids.
left – Store the content or features for id_left.
right – Store the content or features for id_right.

Example

>>> import pandas as pd
>>> from matchzoo import DataPack
>>> left = [
...     ['qid1', 'query 1'],
...     ['qid2', 'query 2']
... ]
>>> right = [
...     ['did1', 'document 1'],
...     ['did2', 'document 2']
... ]
>>> relation = [['qid1', 'did1', 1], ['qid2', 'did2', 1]]
>>> relation_df = pd.DataFrame(relation)
>>> left = pd.DataFrame(left)
>>> right = pd.DataFrame(right)
>>> dp = DataPack(
...     relation=relation_df,
...     left=left,
...     right=right,
... )
>>> len(dp)
2

class FrameView(data_pack: DataPack)¶

Bases: object

FrameView.

__getitem__(self, index: typing.Union[int, slice, np.array]) → pd.DataFrame¶: Slicer.

__call__(self)¶

Returns: A full copy. Equivalent to frame[:].

DATA_FILENAME = data.dill¶

property has_label(self) → bool¶

Returns: True if the label column exists, False otherwise.

__len__(self) → int¶: Get the number of rows in the DataPack object.

property frame(self) → ’DataPack.FrameView’¶

View the data pack as a pandas.DataFrame.

The returned data frame is created by merging the left, right and relation data frames. Use [] to access an item or a slice of items.

Returns: A matchzoo.DataPack.FrameView instance.

Example

>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> type(data_pack.frame)
<class 'matchzoo.data_pack.data_pack.DataPack.FrameView'>
>>> frame_slice = data_pack.frame[0:5]
>>> type(frame_slice)
<class 'pandas.core.frame.DataFrame'>
>>> list(frame_slice.columns)
['id_left', 'text_left', 'id_right', 'text_right', 'label']
>>> full_frame = data_pack.frame()
>>> len(full_frame) == len(data_pack)
True
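
The merge behind `frame` can be pictured with plain Python. This is a minimal sketch under assumptions, not MatchZoo's implementation: the helper name `merge_frame` and the dict-based tables are hypothetical stand-ins for the three pandas data frames.

```python
# Hypothetical stand-ins for the left, right and relation tables.
left = {'qid1': 'query 1', 'qid2': 'query 2'}
right = {'did1': 'document 1', 'did2': 'document 2'}
relation = [
    {'id_left': 'qid1', 'id_right': 'did1', 'label': 1},
    {'id_left': 'qid2', 'id_right': 'did2', 'label': 1},
]


def merge_frame(relation, left, right):
    """Join each relation row with the texts it refers to (illustrative only)."""
    rows = []
    for rel in relation:
        rows.append({
            'id_left': rel['id_left'],
            'text_left': left[rel['id_left']],
            'id_right': rel['id_right'],
            'text_right': right[rel['id_right']],
            'label': rel['label'],
        })
    return rows


full_frame = merge_frame(relation, left, right)
# One merged row per relation row, mirroring len(frame()) == len(data_pack).
print(len(full_frame) == len(relation))
print(list(full_frame[0].keys()))
```

Each relation row contributes exactly one merged row, which is why the length of the full frame matches the length of the data pack.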

unpack(self) → typing.Tuple[typing.Dict[str, np.array], typing.Optional[np.array]]¶

Unpack the data for training.

The return value can be fed directly to model.fit or model.fit_generator.

Returns: A tuple of (X, y). y is None if self has no label.

Example

>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> X, y = data_pack.unpack()
>>> type(X)
<class 'dict'>
>>> sorted(X.keys())
['id_left', 'id_right', 'text_left', 'text_right']
>>> type(y)
<class 'numpy.ndarray'>
>>> X, y = data_pack.drop_label().unpack()
>>> type(y)
<class 'NoneType'>
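
The shape of the `(X, y)` pair can be sketched without the library. The function `unpack_rows` and the sample rows below are hypothetical, and plain lists stand in for the numpy arrays that `unpack` returns.

```python
# Hypothetical merged rows, one dict per relation entry.
rows = [
    {'id_left': 'qid1', 'text_left': 'query 1',
     'id_right': 'did1', 'text_right': 'document 1', 'label': 1},
    {'id_left': 'qid2', 'text_left': 'query 2',
     'id_right': 'did2', 'text_right': 'document 2', 'label': 0},
]


def unpack_rows(rows):
    """Split rows into a column dict X and a label list y (None if unlabeled)."""
    has_label = 'label' in rows[0]
    names = [name for name in rows[0] if name != 'label']
    X = {name: [row[name] for row in rows] for name in names}
    y = [row['label'] for row in rows] if has_label else None
    return X, y


X, y = unpack_rows(rows)
print(sorted(X.keys()))  # ['id_left', 'id_right', 'text_left', 'text_right']
print(y)                 # [1, 0]

# Without a label column, y comes back as None.
unlabeled = [{k: v for k, v in row.items() if k != 'label'} for row in rows]
X2, y2 = unpack_rows(unlabeled)
print(y2)                # None
```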

__getitem__(self, index: typing.Union[int, slice, np.array]) → ’DataPack’¶

Get specific item(s) as a new DataPack.

The returned DataPack will be a copy of the subset of the original DataPack.

Parameters: index – Index of the item(s) to get.
Returns: An instance of DataPack.

property relation(self)¶: relation getter.

property left(self) → pd.DataFrame¶: Get the left data frame of the DataPack.

property right(self) → pd.DataFrame¶: Get the right data frame of the DataPack.

copy(self) → ’DataPack’¶

Returns: A deep copy.

save(self, dirpath: typing.Union[str, Path])¶

Save the DataPack object.

A saved DataPack is represented as a directory containing a DataPack object (transformed user input as features and context), which is saved by pickle.

Parameters: dirpath – directory path of the saved DataPack.
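
The directory-based round trip of `save()` and `load_data_pack()` can be sketched as follows. This is an illustration under assumptions, not the library's code: `save_object` and `load_object` are hypothetical helpers, the standard `pickle` module is used as the serializer, and only the file name `data.dill` is taken from the `DATA_FILENAME` attribute above.

```python
import pickle
import tempfile
from pathlib import Path

DATA_FILENAME = 'data.dill'  # file name taken from the DATA_FILENAME attribute


def save_object(obj, dirpath):
    """Write obj into dirpath/DATA_FILENAME, creating the directory if needed."""
    dirpath = Path(dirpath)
    dirpath.mkdir(parents=True, exist_ok=True)
    with open(dirpath / DATA_FILENAME, 'wb') as f:
        pickle.dump(obj, f)


def load_object(dirpath):
    """Read back whatever save_object stored in dirpath."""
    with open(Path(dirpath) / DATA_FILENAME, 'rb') as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    payload = {'relation': [('qid1', 'did1', 1)]}  # hypothetical content
    save_object(payload, Path(tmp) / 'my_pack')
    restored = load_object(Path(tmp) / 'my_pack')

print(restored == payload)  # True
```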

_optional_inplace(func)¶

Decorator that adds an inplace keyword argument to a method.

Decorate any method that modifies inplace to make that inplace change optional.
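
One way such a decorator could be written is sketched below. It is an assumption about the general technique, not the library's source: `optional_inplace` and the `Bag` class are hypothetical names, and the copy is made with `copy.deepcopy`.

```python
import copy
import functools


def optional_inplace(func):
    """Give a mutating method an `inplace` keyword (default False)."""
    @functools.wraps(func)
    def wrapper(self, *args, inplace=False, **kwargs):
        # Work on self when inplace, otherwise on a deep copy.
        target = self if inplace else copy.deepcopy(self)
        func(target, *args, **kwargs)
        # Return the modified copy only when not working in place.
        return None if inplace else target
    return wrapper


class Bag:
    """Hypothetical container used only to exercise the decorator."""

    def __init__(self, items):
        self.items = list(items)

    @optional_inplace
    def drop_empty(self):
        self.items = [item for item in self.items if item]


bag = Bag(['a', '', 'b'])
cleaned = bag.drop_empty()        # returns a modified copy
print(cleaned.items, bag.items)   # ['a', 'b'] ['a', '', 'b']
bag.drop_empty(inplace=True)      # modifies bag itself and returns None
print(bag.items)                  # ['a', 'b']
```

The decorated method body only ever mutates its target, so the choice between copy and original lives in one place.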

drop_empty(self)¶

Process empty data by removing corresponding rows.

Parameters: inplace – True to modify inplace, False to return a modified copy. (default: False)

shuffle(self)¶

Shuffle the data pack by shuffling the relation column.

Parameters: inplace – True to modify inplace, False to return a modified copy. (default: False)

Example

>>> import matchzoo as mz
>>> import numpy.random
>>> numpy.random.seed(0)
>>> data_pack = mz.datasets.toy.load_data()
>>> orig_ids = data_pack.relation['id_left']
>>> shuffled = data_pack.shuffle()
>>> (shuffled.relation['id_left'] != orig_ids).any()
True

drop_label(self)¶

Remove label column from the data pack.

Parameters: inplace – True to modify inplace, False to return a modified copy. (default: False)

Example

>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> data_pack.has_label
True
>>> data_pack.drop_label(inplace=True)
>>> data_pack.has_label
False

append_text_length(self, verbose=1)¶

Append length_left and length_right columns.

Parameters

inplace – True to modify inplace, False to return a modified copy. (default: False)
verbose – Verbosity.

Example

>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> 'length_left' in data_pack.frame[0].columns
False
>>> new_data_pack = data_pack.append_text_length(verbose=0)
>>> 'length_left' in new_data_pack.frame[0].columns
True
>>> 'length_left' in data_pack.frame[0].columns
False
>>> data_pack.append_text_length(inplace=True, verbose=0)
>>> 'length_left' in data_pack.frame[0].columns
True

apply_on_text(self, func: typing.Callable, mode: str = 'both', rename: typing.Optional[str] = None, verbose: int = 1)¶

Apply func to text columns based on mode.

Parameters

func – The function to apply.
mode – One of “both”, “left” and “right”.
rename – If set, use new names for results instead of replacing the original columns. To set rename in “both” mode, use a tuple of str, e.g. (“text_left_new_name”, “text_right_new_name”).
inplace – True to modify inplace, False to return a modified copy. (default: False)
verbose – Verbosity.

Examples

>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> frame = data_pack.frame

To apply len on the left text and add the result as ‘length_left’:

>>> data_pack.apply_on_text(len, mode='left',
...                         rename='length_left',
...                         inplace=True,
...                         verbose=0)
>>> list(frame[0].columns) # noqa: E501
['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'label']

To do the same to the right text:

>>> data_pack.apply_on_text(len, mode='right',
...                         rename='length_right',
...                         inplace=True,
...                         verbose=0)
>>> list(frame[0].columns) # noqa: E501
['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'length_right', 'label']

To do the same to both texts at the same time:

>>> data_pack.apply_on_text(len, mode='both',
...                         rename=('extra_left', 'extra_right'),
...                         inplace=True,
...                         verbose=0)
>>> list(frame[0].columns) # noqa: E501
['id_left', 'text_left', 'length_left', 'extra_left', 'id_right', 'text_right', 'length_right', 'extra_right', 'label']

To suppress outputs:

>>> data_pack.apply_on_text(len, mode='both', verbose=0,
...                         inplace=True)

_apply_on_text_right(self, func, rename, verbose=1)¶

_apply_on_text_left(self, func, rename, verbose=1)¶

_apply_on_text_both(self, func, rename, verbose=1)¶

matchzoo.load_data_pack(dirpath: typing.Union[str, Path]) → DataPack ¶

Load a DataPack. The reverse function of save().

Parameters: dirpath – directory path of the saved DataPack.
Returns: a DataPack instance.

matchzoo.chain_transform(units: typing.List[Unit]) → typing.Callable¶

Compose unit transformations into a single function.

Parameters: units – List of matchzoo.StatelessUnit.
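
Composition of this kind can be sketched as below. The two unit classes and the helper `chain` are hypothetical, and the sketch assumes each unit exposes a `transform` method; it is not the library's implementation.

```python
import functools


class Lowercase:
    """Hypothetical stateless unit that lowercases text."""

    def transform(self, text):
        return text.lower()


class Tokenize:
    """Hypothetical stateless unit that splits on whitespace."""

    def transform(self, text):
        return text.split()


def chain(units):
    """Return a function that applies each unit's transform in order."""
    def wrapper(arg):
        return functools.reduce(
            lambda value, unit: unit.transform(value), units, arg)
    return wrapper


pipeline = chain([Lowercase(), Tokenize()])
print(pipeline('Hello World'))  # ['hello', 'world']
```

The resulting callable takes a single argument, which makes it suitable for passing to a method such as `apply_on_text`.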

matchzoo.load_preprocessor(dirpath: typing.Union[str, Path]) → ’mz.DataPack’¶

Load the fitted context. The reverse function of save().

Parameters: dirpath – directory path of the saved model.
Returns: a DSSMPreprocessor instance.

class matchzoo.Param(name: str, value: typing.Any = None, hyper_space: typing.Optional[SpaceType] = None, validator: typing.Optional[typing.Callable[[typing.Any], bool]] = None, desc: typing.Optional[str] = None)¶

Bases: object

Parameter class.

Basic usages with a name and value:

>>> param = Param('my_param', 10)
>>> param.name
'my_param'
>>> param.value
10

Use with a validator to make sure the parameter always keeps a valid value.

>>> param = Param(
...     name='my_param',
...     value=5,
...     validator=lambda x: 0 < x < 20
... )
>>> param.validator  
<function <lambda> at 0x...>
>>> param.value
5
>>> param.value = 10
>>> param.value
10
>>> param.value = -1
Traceback (most recent call last):
    ...
ValueError: Validator not satifised.
The validator's definition is as follows:
validator=lambda x: 0 < x < 20

Use with a hyper space. Setting up a hyper space for a parameter makes the parameter tunable in a matchzoo.engine.Tuner.

>>> from matchzoo.engine.hyper_spaces import quniform
>>> param = Param(
...     name='positive_num',
...     value=1,
...     hyper_space=quniform(low=1, high=5)
... )
>>> param.hyper_space  
<matchzoo.engine.hyper_spaces.quniform object at ...>
>>> from hyperopt.pyll.stochastic import sample
>>> hyperopt_space = param.hyper_space.convert(param.name)
>>> samples = [sample(hyperopt_space) for _ in range(64)]
>>> set(samples) == {1, 2, 3, 4, 5}
True

The boolean value of a Param instance is True only when the value is not None. This is because some falsy values, such as zero or an empty list, are valid parameter values. In other words, the boolean value indicates whether the parameter value has been filled.

>>> param = Param('dropout')
>>> if param:
...     print('OK')
>>> param = Param('dropout', 0)
>>> if param:
...     print('OK')
OK
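
The truthiness rule can be reproduced in a few lines. `TinyParam` is a hypothetical stand-in reduced to this one behavior, not the real class.

```python
class TinyParam:
    """Hypothetical stand-in for Param, reduced to the truthiness rule."""

    def __init__(self, name, value=None):
        self.name = name
        self.value = value

    def __bool__(self):
        # Only None counts as "not filled"; 0, '' and [] are valid values.
        return self.value is not None


print(bool(TinyParam('dropout')))     # False
print(bool(TinyParam('dropout', 0)))  # True
print(bool(TinyParam('layers', [])))  # True
```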

A _pre_assignment_hook is initialized as a data type converter if the value is set to a number, to keep the data type of the parameter consistent. This conversion supports Python built-in numbers, numpy numbers, and any number that inherits from numbers.Number.

>>> param = Param('float_param', 0.5)
>>> param.value = 10
>>> param.value
10.0
>>> type(param.value)
<class 'float'>
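
A type-preserving hook of this kind might look like the sketch below. `TypedParam` and its `_hook` attribute are hypothetical; the sketch only assumes that the initial numeric type is remembered and applied to later assignments.

```python
import numbers


class TypedParam:
    """Hypothetical stand-in for Param showing only the type-keeping hook."""

    def __init__(self, name, value=None):
        self.name = name
        self._value = value
        self._hook = None
        if isinstance(value, numbers.Number):
            # Remember the initial numeric type and convert later values to it.
            self._hook = type(value)

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        if self._hook is not None:
            new_value = self._hook(new_value)
        self._value = new_value


param = TypedParam('float_param', 0.5)
param.value = 10
print(param.value, type(param.value))  # 10.0 <class 'float'>
```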

property name(self) → str¶

Returns: Name of the parameter.

property value(self) → typing.Any¶

Returns: Value of the parameter.

property hyper_space(self) → SpaceType¶

Returns: Hyper space of the parameter.

property validator(self) → typing.Callable[[typing.Any], bool]¶

Returns: Validator of the parameter.

property desc(self) → str¶

Returns: Parameter description.

_infer_pre_assignment_hook(self)¶

_validate(self, value)¶

__bool__(self)¶

Returns: False when the value is None, True otherwise.

set_default(self, val, verbose=1)¶

Set the default value; has no effect if the parameter already has a value.

Parameters

val – Default value to set.
verbose – Verbosity.

reset(self)¶

Set the parameter’s value to None, which means “not set”.

This method bypasses validator.

Example

>>> import matchzoo as mz
>>> param = mz.Param(
...     name='str', validator=lambda x: isinstance(x, str))
>>> param.value = 'hello'
>>> param.value = None
Traceback (most recent call last):
    ...
ValueError: Validator not satifised.
The validator's definition is as follows:
name='str', validator=lambda x: isinstance(x, str))
>>> param.reset()
>>> param.value is None
True

class matchzoo.ParamTable¶

Bases: object

Parameter table class.

Example

>>> params = ParamTable()
>>> params.add(Param('ham', 'Parma Ham'))
>>> params.add(Param('egg', 'Over Easy'))
>>> params['ham']
'Parma Ham'
>>> params['egg']
'Over Easy'
>>> print(params)
ham                           Parma Ham
egg                           Over Easy
>>> params.add(Param('egg', 'Sunny side Up'))
Traceback (most recent call last):
    ...
ValueError: Parameter named egg already exists.
To re-assign parameter egg value, use `params["egg"] = value` instead.

add(self, param: Param)¶

Parameters: param – parameter to add.

get(self, key) → Param ¶

Returns: The parameter in the table named key.

set(self, key, param: Param)¶: Set key to parameter param.

property hyper_space(self) → dict¶

Returns: Hyper space of the table, a valid hyperopt graph.

to_frame(self) → pd.DataFrame¶

Convert the parameter table into a pandas data frame.

Returns: A pandas.DataFrame.

Example

>>> import matchzoo as mz
>>> table = mz.ParamTable()
>>> table.add(mz.Param(name='x', value=10, desc='my x'))
>>> table.add(mz.Param(name='y', value=20, desc='my y'))
>>> table.to_frame()
  Name Description  Value Hyper-Space
0    x        my x     10        None
1    y        my y     20        None

__getitem__(self, key: str) → typing.Any¶

Returns: The value of the parameter in the table named key.

__setitem__(self, key: str, value: typing.Any)¶

Set the value of the parameter named key.

Parameters

key – Name of the parameter.
value – New value of the parameter to set.

__str__(self)¶

Returns: Pretty formatted parameter table.

__iter__(self) → typing.Iterator¶

Returns: An iterator that iterates over all parameter instances.

completed(self, exclude: typing.Optional[list] = None) → bool¶

Check if all params are filled.

Parameters: exclude – List of names of parameters to exclude from the check.
Returns: True if all params are filled, False otherwise.

Example

>>> import matchzoo
>>> model = matchzoo.models.DenseBaseline()
>>> model.params.completed(
...     exclude=['task', 'out_activation_func', 'embedding',
...              'embedding_input_dim', 'embedding_output_dim']
... )
True

keys(self) → collections.abc.KeysView¶

Returns: Parameter table keys.

__contains__(self, item)¶

Returns: True if parameter in parameters.

update(self, other: dict)¶

Update self.

Update self with the key/value pairs from other, overwriting existing keys. Notice that this does not add new keys to self.

This method is usually used by models to obtain useful information from a preprocessor’s context.

Parameters: other – The dictionary used to update self.

Example

>>> import matchzoo as mz
>>> model = mz.models.DenseBaseline()
>>> prpr = model.get_default_preprocessor()
>>> _ = prpr.fit(mz.datasets.toy.load_data(), verbose=0)
>>> model.params.update(prpr.context)
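
The "overwrite existing keys, never add new ones" rule can be illustrated with plain dictionaries. The helper `update_existing` and the key names are hypothetical and chosen only for the example.

```python
def update_existing(params, other):
    """Overwrite values for keys already in params; ignore unknown keys."""
    for key, value in other.items():
        if key in params:
            params[key] = value


# Hypothetical parameter names and context entries.
params = {'vocab_size': None, 'embedding_dim': 100}
context = {'vocab_size': 30000, 'unrelated_entry': 'ignored'}
update_existing(params, context)
print(params)  # {'vocab_size': 30000, 'embedding_dim': 100}
```

This differs from `dict.update`, which would also insert `unrelated_entry`.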

class matchzoo.Embedding(data: dict, output_dim: int)¶

Bases: object

Embedding class.

Examples

>>> import matchzoo as mz
>>> train_raw = mz.datasets.toy.load_data()
>>> pp = mz.preprocessors.NaivePreprocessor()
>>> train = pp.fit_transform(train_raw, verbose=0)
>>> vocab_unit = mz.build_vocab_unit(train, verbose=0)
>>> term_index = vocab_unit.state['term_index']
>>> embed_path = mz.datasets.embeddings.EMBED_RANK

To load from a file:

>>> embedding = mz.embedding.load_from_file(embed_path)
>>> matrix = embedding.build_matrix(term_index)
>>> matrix.shape[0] == len(term_index)
True

To build your own:

>>> data = {'A':[0, 1], 'B':[2, 3]}
>>> embedding = mz.Embedding(data, 2)
>>> matrix = embedding.build_matrix({'A': 2, 'B': 1, '_PAD': 0})
>>> matrix.shape == (3, 2)
True

build_matrix(self, term_index: typing.Union[dict, mz.preprocessors.units.Vocabulary.TermIndex]) → np.ndarray¶

Build a matrix using term_index.

Parameters

term_index – A dict or TermIndex to build with.
initializer – A callable that returns a default value for missing terms in data. (default: a random uniform distribution in range (-0.2, 0.2)).

Returns

A matrix.
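
The matrix construction described above can be sketched in plain Python. This is an illustration under assumptions: the standalone function `build_matrix` is hypothetical, and nested lists stand in for the numpy array that the method returns.

```python
import random


def build_matrix(data, output_dim, term_index,
                 initializer=lambda: random.uniform(-0.2, 0.2)):
    """Return a len(term_index) x output_dim matrix as a list of rows."""
    # Start with every cell drawn from the initializer.
    matrix = [[initializer() for _ in range(output_dim)]
              for _ in range(len(term_index))]
    # Overwrite the rows of terms that have a known vector in data.
    for term, index in term_index.items():
        if term in data:
            matrix[index] = list(data[term])
    return matrix


data = {'A': [0, 1], 'B': [2, 3]}
matrix = build_matrix(data, 2, {'A': 2, 'B': 1, '_PAD': 0})
print(len(matrix), len(matrix[0]))  # 3 2
print(matrix[2], matrix[1])         # [0, 1] [2, 3]
```

The row for `'_PAD'` has no entry in `data`, so it keeps the values drawn from the initializer.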

matchzoo.build_unit_from_data_pack(unit: StatefulUnit, data_pack: mz.DataPack, mode: str = 'both', flatten: bool = True, verbose: int = 1) → StatefulUnit¶

Build a StatefulUnit from a DataPack object.

Parameters

unit – StatefulUnit object to be built.
data_pack – The input DataPack object.
mode – One of ‘left’, ‘right’, and ‘both’, to determine the source data for building the VocabularyUnit.
flatten – Flatten the DataPack or not. True to organize the DataPack text as a list, and False to organize the DataPack text as a list of lists.
verbose – Verbosity.

Returns

A built StatefulUnit object.
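
The difference between the two `flatten` settings can be shown on a small input. The token lists below are hypothetical, and the snippet only illustrates the two data layouts, not how the function gathers text from a DataPack.

```python
from itertools import chain

# Hypothetical tokenized texts, one list of tokens per document.
texts = [['hello', 'world'], ['hello', 'matchzoo']]

# flatten=True: all tokens in one list.
flat = list(chain.from_iterable(texts))
# flatten=False: one inner list per text.
nested = [list(tokens) for tokens in texts]

print(flat)    # ['hello', 'world', 'hello', 'matchzoo']
print(nested)  # [['hello', 'world'], ['hello', 'matchzoo']]
```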

matchzoo.build_vocab_unit(data_pack: DataPack, mode: str = 'both', verbose: int = 1) → Vocabulary¶

Build a preprocessor.units.Vocabulary given data_pack.

The data_pack should be preprocessed beforehand, and each item in the text_left and text_right columns of the data_pack should be a list of tokens.

Parameters

data_pack – The DataPack to build vocabulary upon.
mode – One of ‘left’, ‘right’, and ‘both’, to determine the source data for building the VocabularyUnit.
verbose – Verbosity.

Returns

A built vocabulary unit.
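
Why the input must already be tokenized becomes clear from a sketch of vocabulary building. The helper `build_term_index` is hypothetical; the real unit may reserve indices for padding or out-of-vocabulary terms, which this sketch does not.

```python
def build_term_index(token_lists):
    """Assign each distinct token an integer index in first-seen order."""
    term_index = {}
    for tokens in token_lists:
        for token in tokens:
            if token not in term_index:
                term_index[token] = len(term_index)
    return term_index


# Hypothetical preprocessed texts: each item is a list of tokens.
token_lists = [['hello', 'world'], ['hello', 'matchzoo']]
term_index = build_term_index(token_lists)
print(term_index)  # {'hello': 0, 'world': 1, 'matchzoo': 2}
```

If the items were raw strings instead of token lists, the inner loop would iterate over single characters and produce a character-level vocabulary.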
