`matchzoo`¶

Subpackages¶

Submodules¶

matchzoo.version

Package Contents¶

matchzoo.USER_DIR¶

matchzoo.USER_DATA_DIR¶

matchzoo.USER_TUNED_MODELS_DIR¶

matchzoo.__version__ = 0.0.1¶

class matchzoo.DataPack(relation:pd.DataFrame, left:pd.DataFrame, right:pd.DataFrame)¶

Bases: object

MatchZoo DataPack data structure that stores data frames and context.

DataPack is a MatchZoo native data structure that most MatchZoo data handling processes build upon. A DataPack consists of three parts: left, right and relation, each of which is a pandas.DataFrame.

Parameters:	relation – Stores the relation between left and right documents, using ids. left – Stores the content or features for id_left. right – Stores the content or features for id_right.

Example

>>> import pandas as pd
>>> from matchzoo import DataPack
>>> left = [
...     ['qid1', 'query 1'],
...     ['qid2', 'query 2']
... ]
>>> right = [
...     ['did1', 'document 1'],
...     ['did2', 'document 2']
... ]
>>> relation = [['qid1', 'did1', 1], ['qid2', 'did2', 1]]
>>> relation_df = pd.DataFrame(relation)
>>> left = pd.DataFrame(left)
>>> right = pd.DataFrame(right)
>>> dp = DataPack(
...     relation=relation_df,
...     left=left,
...     right=right,
... )
>>> len(dp)
2

class FrameView(data_pack:'DataPack')¶

Bases: object

FrameView.

__getitem__(self, index:typing.Union[int, slice, np.array])¶: Slicer.

__call__(self)¶

Returns:	A full copy. Equivalent to frame[:].

DATA_FILENAME = data.dill¶

has_label¶

True if the label column exists, False otherwise.

Type:	return

frame¶

View the data pack as a pandas.DataFrame.

The returned data frame is created by merging the left data frame, the right data frame and the relation data frame. Use [] to access an item or a slice of items.

Returns:	A `matchzoo.DataPack.FrameView` instance.

Example

>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> type(data_pack.frame)
<class 'matchzoo.data_pack.data_pack.DataPack.FrameView'>
>>> frame_slice = data_pack.frame[0:5]
>>> type(frame_slice)
<class 'pandas.core.frame.DataFrame'>
>>> list(frame_slice.columns)
['id_left', 'text_left', 'id_right', 'text_right', 'label']
>>> full_frame = data_pack.frame()
>>> len(full_frame) == len(data_pack)
True
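
The merge that `frame` describes can be pictured with plain pandas. The following is a minimal sketch, not MatchZoo's implementation: it assumes the ids are ordinary columns named `id_left` and `id_right`, and `merge_frame` is a hypothetical helper introduced only for illustration.

```python
import pandas as pd

# Hypothetical toy tables; column names follow the example output above.
left = pd.DataFrame({"id_left": ["qid1", "qid2"],
                     "text_left": ["query 1", "query 2"]})
right = pd.DataFrame({"id_right": ["did1", "did2"],
                      "text_right": ["document 1", "document 2"]})
relation = pd.DataFrame({"id_left": ["qid1", "qid1", "qid2"],
                         "id_right": ["did1", "did2", "did2"],
                         "label": [1, 0, 1]})


def merge_frame(relation, left, right):
    """Attach left and right content to every relation row."""
    # how="left" keeps one output row per relation row, in relation order.
    merged = relation.merge(left, on="id_left", how="left")
    merged = merged.merge(right, on="id_right", how="left")
    return merged[["id_left", "text_left", "id_right", "text_right", "label"]]


frame = merge_frame(relation, left, right)
print(frame)
```

Because the relation table drives the join, a query that appears in several relation rows has its text repeated in each of them.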

relation¶: Get the relation data frame of the DataPack.

left¶: Get the left data frame of the DataPack.

right¶: Get the right data frame of the DataPack.

__len__(self)¶: Get the number of rows in the `DataPack` object.

unpack(self)¶

Unpack the data for training.

The return value can be fed directly to model.fit or model.fit_generator.

Returns:	A tuple of (X, y). y is None if self has no label.

Example

>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> X, y = data_pack.unpack()
>>> type(X)
<class 'dict'>
>>> sorted(X.keys())
['id_left', 'id_right', 'text_left', 'text_right']
>>> type(y)
<class 'numpy.ndarray'>
>>> X, y = data_pack.drop_label().unpack()
>>> type(y)
<class 'NoneType'>

__getitem__(self, index:typing.Union[int, slice, np.array])¶

Get specific item(s) as a new DataPack.

The returned DataPack will be a copy of the subset of the original DataPack.

Parameters:	index – Index of the item(s) to get.
Returns:	An instance of `DataPack`.

copy(self)¶

Returns:	A deep copy.

save(self, dirpath:typing.Union[str, Path])¶

Save the DataPack object.

A saved DataPack is represented as a directory containing the serialized DataPack object (transformed user input as features and context). The object is saved with pickle.

Parameters:	dirpath – directory path of the saved `DataPack`.

_optional_inplace(func)¶

Decorator that adds an inplace keyword argument to a method.

Decorate any method that modifies inplace to make that inplace change optional.
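
A decorator of this kind could look roughly like the sketch below. It is an illustration under assumptions, not the library's code: `optional_inplace` and the `Bag` class are hypothetical, and copying via `copy.deepcopy` is a choice made here for self-containment.

```python
import copy
import functools


def optional_inplace(func):
    """Add an `inplace` keyword argument to a mutating method."""
    @functools.wraps(func)
    def wrapper(self, *args, inplace=False, **kwargs):
        # Mutate either the object itself or a deep copy of it.
        target = self if inplace else copy.deepcopy(self)
        func(target, *args, **kwargs)
        # In-place calls return nothing; otherwise return the modified copy.
        return None if inplace else target
    return wrapper


class Bag:
    """Hypothetical container used only for this illustration."""

    def __init__(self, items):
        self.items = list(items)

    @optional_inplace
    def drop_last(self):
        self.items.pop()


bag = Bag([1, 2, 3])
smaller = bag.drop_last()        # bag is untouched, smaller is a modified copy
print(bag.items, smaller.items)
```

The decorated method itself is always written as if it mutated `self`; the wrapper decides which object plays that role.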

shuffle(self)¶

Shuffle the data pack by shuffling the relation column.

Parameters:	inplace – True to modify inplace, False to return a modified copy. (default: False)

Example

>>> import matchzoo as mz
>>> import numpy.random
>>> numpy.random.seed(0)
>>> data_pack = mz.datasets.toy.load_data()
>>> orig_ids = data_pack.relation['id_left']
>>> shuffled = data_pack.shuffle()
>>> (shuffled.relation['id_left'] != orig_ids).any()
True
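
Shuffling only the relation table is enough to reorder the whole data pack, since left and right content is looked up by id. A minimal pandas sketch of that idea follows; `shuffle_relation` and its `seed` argument are hypothetical and are not part of the MatchZoo API.

```python
import pandas as pd

relation = pd.DataFrame({
    "id_left": ["qid1", "qid1", "qid2", "qid3"],
    "id_right": ["did1", "did2", "did2", "did3"],
    "label": [1, 0, 1, 0],
})


def shuffle_relation(relation, seed=None):
    """Return the relation rows in random order with a fresh 0..n-1 index."""
    shuffled = relation.sample(frac=1, random_state=seed)
    return shuffled.reset_index(drop=True)


shuffled = shuffle_relation(relation, seed=0)
print(shuffled)
```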

drop_label(self)¶

Remove label column from the data pack.

Parameters:	inplace – True to modify inplace, False to return a modified copy. (default: False)

Example

>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> data_pack.has_label
True
>>> data_pack.drop_label(inplace=True)
>>> data_pack.has_label
False

append_text_length(self, verbose=1)¶

Append length_left and length_right columns.

Parameters:	inplace – True to modify inplace, False to return a modified copy. (default: False) verbose – Verbosity.

Example

>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> 'length_left' in data_pack.frame[0].columns
False
>>> new_data_pack = data_pack.append_text_length(verbose=0)
>>> 'length_left' in new_data_pack.frame[0].columns
True
>>> 'length_left' in data_pack.frame[0].columns
False
>>> data_pack.append_text_length(inplace=True, verbose=0)
>>> 'length_left' in data_pack.frame[0].columns
True

apply_on_text(self, func:typing.Callable, mode:str='both', rename:typing.Optional[str]=None, verbose:int=1)¶

Apply func to text columns based on mode.

Parameters:

func – The function to apply.
mode – One of “both”, “left” and “right”.
rename – If set, use new names for results instead of replacing the original columns. To set rename in “both” mode, use a tuple of str, e.g. (“text_left_new_name”, “text_right_new_name”).
inplace – True to modify inplace, False to return a modified copy. (default: False)
verbose – Verbosity.

Examples

>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> frame = data_pack.frame

To apply len on the left text and add the result as ‘length_left’:

>>> data_pack.apply_on_text(len, mode='left',
...                         rename='length_left',
...                         inplace=True,
...                         verbose=0)
>>> list(frame[0].columns) # noqa: E501
['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'label']

To do the same to the right text:

>>> data_pack.apply_on_text(len, mode='right',
...                         rename='length_right',
...                         inplace=True,
...                         verbose=0)
>>> list(frame[0].columns) # noqa: E501
['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'length_right', 'label']

To do the same to both texts at the same time:

>>> data_pack.apply_on_text(len, mode='both',
...                         rename=('extra_left', 'extra_right'),
...                         inplace=True,
...                         verbose=0)
>>> list(frame[0].columns) # noqa: E501
['id_left', 'text_left', 'length_left', 'extra_left', 'id_right', 'text_right', 'length_right', 'extra_right', 'label']

To suppress outputs:

>>> data_pack.apply_on_text(len, mode='both', verbose=0,
...                         inplace=True)

_apply_on_text_right(self, func, rename, verbose=1)¶

_apply_on_text_left(self, func, rename, verbose=1)¶

_apply_on_text_both(self, func, rename, verbose=1)¶

matchzoo.load_data_pack(dirpath:typing.Union[str, Path]) → DataPack¶

Load a DataPack. This is the reverse of DataPack.save().

Parameters:	dirpath – directory path of the saved `DataPack`.
Returns:	a `DataPack` instance.
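
The save and load pair amounts to a round trip through a directory. The sketch below illustrates that pattern with the standard-library `pickle` module and a plain dict standing in for a DataPack. `save_object` and `load_object` are hypothetical helpers, and refusing to reuse an existing directory is an assumption made here, not documented behavior.

```python
import pickle
import tempfile
from pathlib import Path

DATA_FILENAME = "data.dill"  # file name taken from DataPack.DATA_FILENAME


def save_object(obj, dirpath):
    """Serialize obj into dirpath/DATA_FILENAME, creating the directory."""
    dirpath = Path(dirpath)
    # Assumption: an existing directory is treated as an error.
    dirpath.mkdir(parents=True, exist_ok=False)
    with open(dirpath / DATA_FILENAME, "wb") as handle:
        pickle.dump(obj, handle)


def load_object(dirpath):
    """Reverse of save_object: read the object back from the directory."""
    with open(Path(dirpath) / DATA_FILENAME, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "my_pack"
    save_object({"relation": [["qid1", "did1", 1]]}, target)
    restored = load_object(target)

print(restored)
```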

matchzoo.chain_transform(units:typing.List[Unit]) → typing.Callable¶

Compose unit transformations into a single function.

Parameters:	units – List of `matchzoo.StatelessUnit`.
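
Composing units into one function is ordinary function composition. The sketch below assumes that each unit exposes a `transform` method and that units are applied in list order. `Lowercase`, `Tokenize` and `chain` are hypothetical names used only for illustration.

```python
import functools


class Lowercase:
    """Hypothetical stateless unit."""

    def transform(self, text):
        return text.lower()


class Tokenize:
    """Hypothetical stateless unit that splits on whitespace."""

    def transform(self, text):
        return text.split()


def chain(units):
    """Compose the units' transform methods, applied left to right."""
    def composed(value):
        return functools.reduce(
            lambda acc, unit: unit.transform(acc), units, value)
    return composed


pipeline = chain([Lowercase(), Tokenize()])
tokens = pipeline("Hello MatchZoo World")
print(tokens)
```

A single callable of this shape is convenient to pass to a method such as apply_on_text, which expects one function.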

matchzoo.load_preprocessor(dirpath:typing.Union[str, Path]) → 'mz.DataPack'¶

Load a fitted preprocessor together with its context. This is the reverse of the preprocessor's save().

Parameters:	dirpath – directory path of the saved preprocessor.
Returns:	a preprocessor instance, for example a `DSSMPreprocessor`.

class matchzoo.Param(name:str, value:typing.Any=None, hyper_space:typing.Optional[SpaceType]=None, validator:typing.Optional[typing.Callable[[typing.Any], bool]]=None, desc:typing.Optional[str]=None)¶

Bases: object

Parameter class.

Basic usage with a name and a value:

>>> from matchzoo import Param
>>> param = Param('my_param', 10)
>>> param.name
'my_param'
>>> param.value
10

Use with a validator to make sure the parameter always keeps a valid value.

>>> param = Param(
...     name='my_param',
...     value=5,
...     validator=lambda x: 0 < x < 20
... )
>>> param.validator  # doctest: +ELLIPSIS
<function <lambda> at 0x...>
>>> param.value
5
>>> param.value = 10
>>> param.value
10
>>> param.value = -1
Traceback (most recent call last):
    ...
ValueError: Validator not satifised.
The validator's definition is as follows:
validator=lambda x: 0 < x < 20

Use with a hyper space. Setting up a hyper space for a parameter makes the parameter tunable in a matchzoo.engine.Tuner.

>>> from matchzoo.engine.hyper_spaces import quniform
>>> param = Param(
...     name='positive_num',
...     value=1,
...     hyper_space=quniform(low=1, high=5)
... )
>>> param.hyper_space  # doctest: +ELLIPSIS
<matchzoo.engine.hyper_spaces.quniform object at ...>
>>> from hyperopt.pyll.stochastic import sample
>>> hyperopt_space = param.hyper_space.convert(param.name)
>>> samples = [sample(hyperopt_space) for _ in range(64)]
>>> set(samples) == {1, 2, 3, 4, 5}
True

The boolean value of a Param instance is True only when the value is not None, because some falsy defaults such as zero or an empty list are valid parameter values. In other words, the boolean value answers the question “has the parameter value been filled?”.

>>> param = Param('dropout')
>>> if param:
...     print('OK')
>>> param = Param('dropout', 0)
>>> if param:
...     print('OK')
OK

A _pre_assignment_hook is initialized as a data type converter if the value is set to a number, which keeps the data type of the parameter consistent. This conversion supports Python built-in numbers, NumPy numbers, and any number that inherits from numbers.Number.

>>> param = Param('float_param', 0.5)
>>> param.value = 10
>>> param.value
10.0
>>> type(param.value)
<class 'float'>
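
The interplay between the type-preserving hook and the validator can be sketched in a few lines. `MiniParam` below is a hypothetical, stripped-down stand-in for Param written for illustration. Excluding `bool` from the conversion and the wording of the error message are assumptions of this sketch.

```python
import numbers


class MiniParam:
    """Hypothetical, stripped-down stand-in for Param."""

    def __init__(self, name, value=None, validator=None):
        self.name = name
        self._validator = validator
        # Infer a converter from the initial value's type if it is numeric.
        # Assumption: bool is excluded so True/False are never coerced.
        is_number = (isinstance(value, numbers.Number)
                     and not isinstance(value, bool))
        self._hook = type(value) if is_number else None
        self._value = None
        if value is not None:
            self.value = value

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        # Convert first, then validate the converted value.
        if self._hook is not None and new_value is not None:
            new_value = self._hook(new_value)
        if self._validator is not None and not self._validator(new_value):
            raise ValueError("Validator not satisfied.")
        self._value = new_value


param = MiniParam("float_param", 0.5)
param.value = 10
print(type(param.value).__name__, param.value)
```

Running the hook before the validator means the validator always sees a value of the parameter's own type.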

name¶

Name of the parameter.

Type:	return

value¶

Value of the parameter.

Type:	return

hyper_space¶

Hyper space of the parameter.

Type:	return

validator¶

Validator of the parameter.

Type:	return

desc¶

Parameter description.

Type:	return

_infer_pre_assignment_hook(self)¶

_validate(self, value)¶

__bool__(self)¶

Returns:	False when the value is None, True otherwise.

set_default(self, val, verbose=1)¶

Set the default value. Has no effect if the parameter already has a value.

Parameters:	val – Default value to set. verbose – Verbosity.

reset(self)¶

Set the parameter’s value to None, which means “not set”.

This method bypasses validator.

Example

>>> import matchzoo as mz
>>> param = mz.Param(
...     name='str', validator=lambda x: isinstance(x, str))
>>> param.value = 'hello'
>>> param.value = None
Traceback (most recent call last):
    ...
ValueError: Validator not satifised.
The validator's definition is as follows:
name='str', validator=lambda x: isinstance(x, str))
>>> param.reset()
>>> param.value is None
True

class matchzoo.ParamTable¶

Bases: object

Parameter table class.

Example

>>> from matchzoo import Param, ParamTable
>>> params = ParamTable()
>>> params.add(Param('ham', 'Parma Ham'))
>>> params.add(Param('egg', 'Over Easy'))
>>> params['ham']
'Parma Ham'
>>> params['egg']
'Over Easy'
>>> print(params)
ham                           Parma Ham
egg                           Over Easy
>>> params.add(Param('egg', 'Sunny side Up'))
Traceback (most recent call last):
    ...
ValueError: Parameter named egg already exists.
To re-assign parameter egg value, use `params["egg"] = value` instead.

hyper_space¶

Hyper space of the table, a valid hyperopt graph.

Type:	return

add(self, param:Param)¶

Parameters:	param – parameter to add.

get(self, key)¶

Returns:	The parameter in the table named key.

set(self, key, param:Param)¶: Set key to parameter param.

to_frame(self)¶

Convert the parameter table into a pandas data frame.

Returns:	A pandas.DataFrame.

Example

>>> import matchzoo as mz
>>> table = mz.ParamTable()
>>> table.add(mz.Param(name='x', value=10, desc='my x'))
>>> table.add(mz.Param(name='y', value=20, desc='my y'))
>>> table.to_frame()
  Name Description  Value Hyper-Space
0    x        my x     10        None
1    y        my y     20        None

__getitem__(self, key:str)¶

Returns:	The value of the parameter in the table named key.

__setitem__(self, key:str, value:typing.Any)¶

Set the value of the parameter named key.

Parameters:	key – Name of the parameter. value – New value of the parameter to set.

__str__(self)¶

Returns:	Pretty formatted parameter table.

__iter__(self)¶

Returns:	An iterator that iterates over all parameter instances.

completed(self)¶

Returns:	True if all params are filled, False otherwise.

Example

>>> import matchzoo
>>> model = matchzoo.models.DenseBaseline()
>>> model.params.completed()
False

keys(self)¶

Returns:	Parameter table keys.

__contains__(self, item)¶

Returns:	True if parameter in parameters.

update(self, other:dict)¶

Update self.

Update self with the key/value pairs from other, overwriting existing keys. Notice that this does not add new keys to self.

This method is usually used by models to obtain useful information from a preprocessor’s context.

Parameters:	other – The dictionary used to update self.
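
The difference from `dict.update` is that unknown keys are ignored rather than added. A minimal sketch with plain dictionaries follows. `update_existing` and the key names are hypothetical and chosen only for illustration.

```python
def update_existing(table, other):
    """Overwrite values for keys already in table; ignore unknown keys."""
    for key, value in other.items():
        if key in table:
            table[key] = value


# Hypothetical parameter names and context entries.
params = {"input_shapes": None, "embedding_dim": 100}
context = {"input_shapes": [(30,), (30,)], "vocab_size": 5000}

update_existing(params, context)
print(params)
```

This keeps the set of parameters fixed by the model while still letting a preprocessor context fill in the values it knows about.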

Example

>>> import matchzoo as mz
>>> model = mz.models.DenseBaseline()
>>> prpr = model.get_default_preprocessor()
>>> _ = prpr.fit(mz.datasets.toy.load_data(), verbose=0)
>>> model.params.update(prpr.context)

class matchzoo.Embedding(data:dict, output_dim:int)¶

Bases: object

Embedding class.

Examples

>>> import matchzoo as mz
>>> train_raw = mz.datasets.toy.load_data()
>>> pp = mz.preprocessors.NaivePreprocessor()
>>> train = pp.fit_transform(train_raw, verbose=0)
>>> vocab_unit = mz.build_vocab_unit(train, verbose=0)
>>> term_index = vocab_unit.state['term_index']
>>> embed_path = mz.datasets.embeddings.EMBED_RANK

To load from a file:

>>> embedding = mz.embedding.load_from_file(embed_path)
>>> matrix = embedding.build_matrix(term_index)
>>> matrix.shape[0] == len(term_index)
True

To build your own:

>>> data = {'A':[0, 1], 'B':[2, 3]}
>>> embedding = mz.Embedding(data, 2)
>>> matrix = embedding.build_matrix({'A': 2, 'B': 1, '_PAD': 0})
>>> matrix.shape == (3, 2)
True

build_matrix(self, term_index:typing.Union[dict, mz.preprocessors.units.Vocabulary.TermIndex], initializer=lambda: np.random.uniform(-0.2, 0.2))¶

Build a matrix using term_index.

Parameters:	term_index – A dict or TermIndex to build with. initializer – A callable that returns a default value for missing terms in data. (default: a random uniform distribution in the range (-0.2, 0.2))
Returns:	A matrix.
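
The matrix construction can be pictured as filling every cell from the initializer and then overwriting the rows of known terms. The NumPy sketch below is a standalone function for illustration, not the method itself. It assumes the indices in `term_index` run from 0 to `len(term_index) - 1`, and it uses a constant initializer so the result is deterministic.

```python
import numpy as np


def build_matrix(data, output_dim, term_index, initializer=lambda: 0.0):
    """Put data[term] in row term_index[term]; other cells use initializer."""
    matrix = np.empty((len(term_index), output_dim))
    # Fill every cell from the initializer first.
    for position in np.ndindex(*matrix.shape):
        matrix[position] = initializer()
    # Overwrite the rows of terms that have a known vector.
    for term, row in term_index.items():
        if term in data:
            matrix[row] = data[term]
    return matrix


data = {"A": [0, 1], "B": [2, 3]}
matrix = build_matrix(data, 2, {"A": 2, "B": 1, "_PAD": 0})
print(matrix)
```

Here the `_PAD` row keeps the initializer values because `_PAD` has no entry in `data`.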

matchzoo.build_unit_from_data_pack(unit:StatefulUnit, data_pack:mz.DataPack, mode:str='both', flatten:bool=True, verbose:int=1) → StatefulUnit¶

Build a StatefulUnit from a DataPack object.

Parameters:

unit – StatefulUnit object to be built.
data_pack – The input DataPack object.
mode – One of ‘left’, ‘right’, and ‘both’, to determine the source data for building the unit.
flatten – Flatten the datapack or not. True to organize the DataPack text as a list, and False to organize DataPack text as a list of lists.
verbose – Verbosity.

Returns:

A built StatefulUnit object.
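
The effect of `mode` and `flatten` on the data handed to the unit can be sketched with plain lists. `collect_corpus` below is a hypothetical helper that assumes each text is already a list of tokens.

```python
from itertools import chain

text_left = [["query", "one"], ["query", "two"]]
text_right = [["document", "one"], ["document", "two"]]


def collect_corpus(text_left, text_right, mode="both", flatten=True):
    """Gather token lists from the selected side(s), optionally flattened."""
    if mode not in ("left", "right", "both"):
        raise ValueError(f"Unknown mode: {mode!r}")
    corpus = []
    if mode in ("left", "both"):
        corpus.extend(text_left)
    if mode in ("right", "both"):
        corpus.extend(text_right)
    # flatten=True yields one list of tokens; False keeps a list of lists.
    return list(chain.from_iterable(corpus)) if flatten else corpus


flat = collect_corpus(text_left, text_right, mode="left")
nested = collect_corpus(text_left, text_right, mode="left", flatten=False)
print(flat)
print(nested)
```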

matchzoo.build_vocab_unit(data_pack:DataPack, mode:str='both', verbose:int=1) → Vocabulary¶

Build a preprocessors.units.Vocabulary given data_pack.

The data_pack should be preprocessed beforehand, and each item in text_left and text_right columns of the data_pack should be a list of tokens.

Parameters:	data_pack – The `DataPack` to build vocabulary upon. mode – One of ‘left’, ‘right’, and ‘both’, to determine the source data for building the VocabularyUnit. verbose – Verbosity.
Returns:	A built vocabulary unit.
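
A vocabulary unit boils down to a mapping from tokens to integer indices. The sketch below shows one way such a mapping could be built from tokenized text. The reserved indices 0 and 1 and the `<PAD>` and `<OOV>` names are assumptions of this illustration, and MatchZoo's Vocabulary may number terms differently.

```python
def build_term_index(token_lists, pad="<PAD>", oov="<OOV>"):
    """Map each distinct token to an integer, reserving 0 and 1."""
    term_index = {pad: 0, oov: 1}
    for tokens in token_lists:
        for token in tokens:
            # setdefault keeps the first index assigned to a token.
            term_index.setdefault(token, len(term_index))
    return term_index


term_index = build_term_index([["query", "one"], ["document", "one"]])

# Unknown tokens fall back to the out-of-vocabulary index.
ids = [term_index.get(tok, term_index["<OOV>"]) for tok in ["query", "unseen"]]
print(term_index)
print(ids)
```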
