matchzoo¶
Subpackages¶
matchzoo.auto
matchzoo.data_pack
matchzoo.dataloader
matchzoo.datasets
matchzoo.embedding
matchzoo.engine
matchzoo.losses
matchzoo.metrics
matchzoo.models
matchzoo.models.anmm
matchzoo.models.arci
matchzoo.models.arcii
matchzoo.models.bert
matchzoo.models.bimpm
matchzoo.models.cdssm
matchzoo.models.conv_knrm
matchzoo.models.dense_baseline
matchzoo.models.diin
matchzoo.models.drmm
matchzoo.models.drmmtks
matchzoo.models.dssm
matchzoo.models.duet
matchzoo.models.esim
matchzoo.models.hbmp
matchzoo.models.knrm
matchzoo.models.match_pyramid
matchzoo.models.match_srnn
matchzoo.models.matchlstm
matchzoo.models.mvlstm
matchzoo.models.parameter_readme_generator
matchzoo.modules
matchzoo.modules.attention
matchzoo.modules.bert_module
matchzoo.modules.character_embedding
matchzoo.modules.dense_net
matchzoo.modules.dropout
matchzoo.modules.gaussian_kernel
matchzoo.modules.matching
matchzoo.modules.matching_tensor
matchzoo.modules.semantic_composite
matchzoo.modules.spatial_gru
matchzoo.modules.stacked_brnn
matchzoo.preprocessors
matchzoo.preprocessors.units
matchzoo.preprocessors.units.character_index
matchzoo.preprocessors.units.digit_removal
matchzoo.preprocessors.units.frequency_filter
matchzoo.preprocessors.units.lemmatization
matchzoo.preprocessors.units.lowercase
matchzoo.preprocessors.units.matching_histogram
matchzoo.preprocessors.units.ngram_letter
matchzoo.preprocessors.units.punc_removal
matchzoo.preprocessors.units.stateful_unit
matchzoo.preprocessors.units.stemming
matchzoo.preprocessors.units.stop_removal
matchzoo.preprocessors.units.tokenize
matchzoo.preprocessors.units.truncated_length
matchzoo.preprocessors.units.unit
matchzoo.preprocessors.units.vocabulary
matchzoo.preprocessors.units.word_exact_match
matchzoo.preprocessors.units.word_hashing
matchzoo.preprocessors.basic_preprocessor
matchzoo.preprocessors.bert_preprocessor
matchzoo.preprocessors.build_unit_from_data_pack
matchzoo.preprocessors.build_vocab_unit
matchzoo.preprocessors.chain_transform
matchzoo.preprocessors.naive_preprocessor
matchzoo.tasks
matchzoo.trainers
matchzoo.utils
Submodules¶
Package Contents¶
Classes¶
DataPack | Matchzoo DataPack data structure, store dataframe and context.
Param | Parameter class.
ParamTable | Parameter table class.
Embedding | Embedding class.
Functions¶
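load_data_pack | Load a DataPack.
chain_transform | Compose unit transformations into a single function.
load_preprocessor | Load the fitted context. The reverse function of save().
build_unit_from_data_pack | Build a StatefulUnit from a DataPack object.
build_vocab_unit | Build a preprocessor.units.Vocabulary given data_pack.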
-
matchzoo.USER_DIR¶
-
matchzoo.USER_DATA_DIR¶
-
matchzoo.USER_TUNED_MODELS_DIR¶
-
matchzoo.__version__ = 1.1.1¶
-
class
matchzoo.DataPack(relation: pd.DataFrame, left: pd.DataFrame, right: pd.DataFrame)¶ Bases: object
Matchzoo DataPack data structure, store dataframe and context.
DataPack is a MatchZoo native data structure that most MatchZoo data handling processes build upon. A DataPack consists of three parts: left, right and relation, each of which is a pandas.DataFrame.
- Parameters
relation – Store the relation between the left document and the right document using ids.
left – Store the content or features for id_left.
right – Store the content or features for id_right.
Example
>>> import pandas as pd
>>> from matchzoo import DataPack
>>> left = [
...     ['qid1', 'query 1'],
...     ['qid2', 'query 2']
... ]
>>> right = [
...     ['did1', 'document 1'],
...     ['did2', 'document 2']
... ]
>>> relation = [['qid1', 'did1', 1], ['qid2', 'did2', 1]]
>>> relation_df = pd.DataFrame(relation)
>>> left = pd.DataFrame(left)
>>> right = pd.DataFrame(right)
>>> dp = DataPack(
...     relation=relation_df,
...     left=left,
...     right=right,
... )
>>> len(dp)
2
-
class
FrameView(data_pack: DataPack)¶ Bases: object
FrameView.
-
__getitem__(self, index: typing.Union[int, slice, np.array]) → pd.DataFrame¶ Slicer.
-
__call__(self)¶ - Returns
A full copy. Equivalent to frame[:].
-
-
DATA_FILENAME = data.dill¶
-
property
has_label(self) → bool¶ - Returns
True if the label column exists, False otherwise.
-
__len__(self) → int¶ Get the number of rows in the DataPack object.
-
property
frame(self) → ’DataPack.FrameView’¶ View the data pack as a pandas.DataFrame.
Returned data frame is created by merging the left data frame, the right data frame and the relation data frame. Use [] to access an item or a slice of items.
- Returns
A matchzoo.DataPack.FrameView instance.
Example
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> type(data_pack.frame)
<class 'matchzoo.data_pack.data_pack.DataPack.FrameView'>
>>> frame_slice = data_pack.frame[0:5]
>>> type(frame_slice)
<class 'pandas.core.frame.DataFrame'>
>>> list(frame_slice.columns)
['id_left', 'text_left', 'id_right', 'text_right', 'label']
>>> full_frame = data_pack.frame()
>>> len(full_frame) == len(data_pack)
True
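The merge behind frame can be pictured with plain Python dictionaries. This is a minimal sketch, assuming each relation row is joined to its left and right rows by id; the helper merge_rows and the sample data are hypothetical and not part of MatchZoo, which works on pandas.DataFrame objects instead.

```python
# Hypothetical stand-ins for the three DataPack parts (not MatchZoo code).
left = {'qid1': 'query 1', 'qid2': 'query 2'}
right = {'did1': 'document 1', 'did2': 'document 2'}
relation = [('qid1', 'did1', 1), ('qid2', 'did2', 0)]


def merge_rows(relation, left, right):
    """Join each relation row with the matching left and right text."""
    rows = []
    for id_left, id_right, label in relation:
        rows.append({
            'id_left': id_left,
            'text_left': left[id_left],
            'id_right': id_right,
            'text_right': right[id_right],
            'label': label,
        })
    return rows


frame = merge_rows(relation, left, right)
print(list(frame[0].keys()))
```

Each merged row carries the same five fields that the frame example lists as columns.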
-
unpack(self) → typing.Tuple[typing.Dict[str, np.array], typing.Optional[np.array]]¶ Unpack the data for training.
The return value can be fed directly to model.fit or model.fit_generator.
- Returns
A tuple of (X, y). y is None if self has no label.
Example
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> X, y = data_pack.unpack()
>>> type(X)
<class 'dict'>
>>> sorted(X.keys())
['id_left', 'id_right', 'text_left', 'text_right']
>>> type(y)
<class 'numpy.ndarray'>
>>> X, y = data_pack.drop_label().unpack()
>>> type(y)
<class 'NoneType'>
-
__getitem__(self, index: typing.Union[int, slice, np.array]) → ’DataPack’¶ Get specific item(s) as a new DataPack.
The returned DataPack will be a copy of the subset of the original DataPack.
- Parameters
index – Index of the item(s) to get.
- Returns
An instance of
DataPack.
-
property
relation(self)¶ relation getter.
-
copy(self) → ’DataPack’¶ - Returns
A deep copy.
-
save(self, dirpath: typing.Union[str, Path])¶ Save the DataPack object.
A saved DataPack is represented as a directory with a DataPack object (transformed user input as features and context); it will be saved by pickle.
- Parameters
dirpath – directory path of the saved DataPack.
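The directory-plus-pickle layout described for save can be sketched with the standard library. This is only an illustration: save_object and load_object are hypothetical helpers, the stdlib pickle module stands in for whatever serializer MatchZoo actually uses, and the file name is taken from DataPack.DATA_FILENAME.

```python
import pickle
import tempfile
from pathlib import Path

DATA_FILENAME = 'data.dill'  # file name taken from DataPack.DATA_FILENAME


def save_object(obj, dirpath):
    """Hypothetical helper: pickle obj into dirpath/DATA_FILENAME."""
    dirpath = Path(dirpath)
    dirpath.mkdir(parents=True, exist_ok=True)
    with open(dirpath / DATA_FILENAME, 'wb') as f:
        pickle.dump(obj, f)


def load_object(dirpath):
    """Hypothetical helper: the reverse of save_object."""
    with open(Path(dirpath) / DATA_FILENAME, 'rb') as f:
        return pickle.load(f)


# Round-trip a small made-up payload through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'my_pack'
    save_object({'relation': [('qid1', 'did1', 1)]}, target)
    restored = load_object(target)
```

The load helper mirrors how load_data_pack is described as the reverse function of save().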
-
_optional_inplace(func)¶ Decorator that adds an inplace keyword argument to a method.
Decorate any method that modifies inplace to make that inplace change optional.
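One way such a decorator could work is shown below. This is a hedged sketch and not MatchZoo's implementation: optional_inplace and the toy Rows class are hypothetical, and the sketch assumes that the non-inplace path operates on a deep copy and returns it.

```python
import copy
import functools


def optional_inplace(func):
    """Hypothetical decorator: add an `inplace` keyword to a mutating method."""
    @functools.wraps(func)
    def wrapper(self, *args, inplace=False, **kwargs):
        target = self if inplace else copy.deepcopy(self)
        func(target, *args, **kwargs)
        # Return the modified copy only when the original was left untouched.
        return None if inplace else target
    return wrapper


class Rows:
    """Toy container used only to exercise the decorator."""

    def __init__(self, items):
        self.items = list(items)

    @optional_inplace
    def drop_empty(self):
        self.items = [item for item in self.items if item]


rows = Rows(['a', '', 'b'])
cleaned = rows.drop_empty()              # returns a modified copy
snapshot = list(rows.items)              # the original is still unchanged here
result = rows.drop_empty(inplace=True)   # modifies rows itself, returns None
```

The decorated method body only has to mutate self; the choice between copying and mutating lives in one place.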
-
drop_empty(self)¶ Process empty data by removing corresponding rows.
- Parameters
inplace – True to modify inplace, False to return a modified copy. (default: False)
-
shuffle(self)¶ Shuffle the data pack by shuffling the relation column.
- Parameters
inplace – True to modify inplace, False to return a modified copy. (default: False)
Example
>>> import matchzoo as mz
>>> import numpy.random
>>> numpy.random.seed(0)
>>> data_pack = mz.datasets.toy.load_data()
>>> orig_ids = data_pack.relation['id_left']
>>> shuffled = data_pack.shuffle()
>>> (shuffled.relation['id_left'] != orig_ids).any()
True
-
drop_label(self)¶ Remove label column from the data pack.
- Parameters
inplace – True to modify inplace, False to return a modified copy. (default: False)
Example
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> data_pack.has_label
True
>>> data_pack.drop_label(inplace=True)
>>> data_pack.has_label
False
-
append_text_length(self, verbose=1)¶ Append length_left and length_right columns.
- Parameters
inplace – True to modify inplace, False to return a modified copy. (default: False)
verbose – Verbosity.
Example
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> 'length_left' in data_pack.frame[0].columns
False
>>> new_data_pack = data_pack.append_text_length(verbose=0)
>>> 'length_left' in new_data_pack.frame[0].columns
True
>>> 'length_left' in data_pack.frame[0].columns
False
>>> data_pack.append_text_length(inplace=True, verbose=0)
>>> 'length_left' in data_pack.frame[0].columns
True
-
apply_on_text(self, func: typing.Callable, mode: str = 'both', rename: typing.Optional[str] = None, verbose: int = 1)¶ Apply func to text columns based on mode.
- Parameters
func – The function to apply.
mode – One of “both”, “left” and “right”.
rename – If set, use new names for results instead of replacing the original columns. To set rename in “both” mode, use a tuple of str, e.g. (“text_left_new_name”, “text_right_new_name”).
inplace – True to modify inplace, False to return a modified copy. (default: False)
verbose – Verbosity.
- Examples::
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> frame = data_pack.frame
- To apply len on the left text and add the result as ‘length_left’:
>>> data_pack.apply_on_text(len, mode='left',
...                         rename='length_left',
...                         inplace=True,
...                         verbose=0)
>>> list(frame[0].columns) # noqa: E501
['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'label']
- To do the same to the right text:
>>> data_pack.apply_on_text(len, mode='right',
...                         rename='length_right',
...                         inplace=True,
...                         verbose=0)
>>> list(frame[0].columns) # noqa: E501
['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'length_right', 'label']
- To do the same to both texts at the same time:
>>> data_pack.apply_on_text(len, mode='both',
...                         rename=('extra_left', 'extra_right'),
...                         inplace=True,
...                         verbose=0)
>>> list(frame[0].columns) # noqa: E501
['id_left', 'text_left', 'length_left', 'extra_left', 'id_right', 'text_right', 'length_right', 'extra_right', 'label']
- To suppress outputs:
>>> data_pack.apply_on_text(len, mode='both', verbose=0,
...                         inplace=True)
-
_apply_on_text_right(self, func, rename, verbose=1)¶
-
_apply_on_text_left(self, func, rename, verbose=1)¶
-
_apply_on_text_both(self, func, rename, verbose=1)¶
-
matchzoo.load_data_pack(dirpath: typing.Union[str, Path]) → DataPack¶ Load a DataPack. The reverse function of save().
- Parameters
dirpath – directory path of the saved model.
- Returns
a DataPack instance.
-
matchzoo.chain_transform(units: typing.List[Unit]) → typing.Callable¶ Compose unit transformations into a single function.
- Parameters
units – List of matchzoo.StatelessUnit.
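Composing transformations into a single function can be sketched as a left-to-right fold. This is an illustration under assumptions, not the library's code: compose is a hypothetical helper, and plain callables stand in for the transform step of stateless units.

```python
import functools


def compose(transforms):
    """Hypothetical helper: apply each callable in order, left to right."""
    def composed(value):
        return functools.reduce(lambda acc, fn: fn(acc), transforms, value)
    return composed


# Plain functions stand in for the transform step of stateless units.
pipeline = compose([str.lower, str.split])
tokens = pipeline('Hello MatchZoo World')
print(tokens)
```

An empty list of transforms yields the identity function in this sketch, so the input comes back unchanged.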
-
matchzoo.load_preprocessor(dirpath: typing.Union[str, Path]) → ’mz.DataPack’¶ Load the fitted context. The reverse function of save().
- Parameters
dirpath – directory path of the saved model.
- Returns
a DSSMPreprocessor instance.
-
class
matchzoo.Param(name: str, value: typing.Any = None, hyper_space: typing.Optional[SpaceType] = None, validator: typing.Optional[typing.Callable[[typing.Any], bool]] = None, desc: typing.Optional[str] = None)¶ Bases: object
Parameter class.
Basic usages with a name and value:
>>> param = Param('my_param', 10)
>>> param.name
'my_param'
>>> param.value
10
Use with a validator to make sure the parameter always keeps a valid value.
>>> param = Param(
...     name='my_param',
...     value=5,
...     validator=lambda x: 0 < x < 20
... )
>>> param.validator
<function <lambda> at 0x...>
>>> param.value
5
>>> param.value = 10
>>> param.value
10
>>> param.value = -1
Traceback (most recent call last):
    ...
ValueError: Validator not satifised.
The validator's definition is as follows:
validator=lambda x: 0 < x < 20
Use with a hyper space. Setting up a hyper space for a parameter makes the parameter tunable in a matchzoo.engine.Tuner.
>>> from matchzoo.engine.hyper_spaces import quniform
>>> param = Param(
...     name='positive_num',
...     value=1,
...     hyper_space=quniform(low=1, high=5)
... )
>>> param.hyper_space
<matchzoo.engine.hyper_spaces.quniform object at ...>
>>> from hyperopt.pyll.stochastic import sample
>>> hyperopt_space = param.hyper_space.convert(param.name)
>>> samples = [sample(hyperopt_space) for _ in range(64)]
>>> set(samples) == {1, 2, 3, 4, 5}
True
The boolean value of a Param instance is True only when the value is not None. This is because some default falsy values like zero or an empty list are valid parameter values. In other words, the boolean value answers the question “is the parameter value filled?”.
>>> param = Param('dropout')
>>> if param:
...     print('OK')
>>> param = Param('dropout', 0)
>>> if param:
...     print('OK')
OK
A _pre_assignment_hook is initialized as a data type converter if the value is set as a number, to keep the data type of the parameter consistent. This conversion supports Python built-in numbers, numpy numbers, and any number that inherits numbers.Number.
>>> param = Param('float_param', 0.5)
>>> param.value = 10
>>> param.value
10.0
>>> type(param.value)
<class 'float'>
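The three behaviours described above (validation on assignment, truthiness meaning "filled", and numeric type locking) can be combined in one small class. This is a simplified, hypothetical stand-in named MiniParam, not the actual Param implementation; in particular, the way the hook is chosen here is an assumption.

```python
import numbers


class MiniParam:
    """Hypothetical, simplified stand-in for matchzoo.Param."""

    def __init__(self, name, value=None, validator=None):
        self.name = name
        self._validator = validator
        self._hook = None
        self._value = None
        if value is not None:
            self.value = value
        # Assumption: lock numeric parameters to the type of their first value.
        if isinstance(self._value, numbers.Number):
            self._hook = type(self._value)

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        if self._hook is not None:
            new_value = self._hook(new_value)
        if self._validator is not None and not self._validator(new_value):
            raise ValueError('validator rejected %r' % (new_value,))
        self._value = new_value

    def __bool__(self):
        # "Filled" means not None, so 0 and [] still count as set.
        return self._value is not None


dropout = MiniParam('dropout', 0.5, validator=lambda x: 0 <= x < 1)
dropout.value = 0  # converted to float by the hook, then validated
```

Because truthiness checks only for None, the parameter still counts as filled after being set to zero.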
-
property
name(self) → str¶ - Returns
Name of the parameter.
-
property
value(self) → typing.Any¶ - Returns
Value of the parameter.
-
property
hyper_space(self) → SpaceType¶ - Returns
Hyper space of the parameter.
-
property
validator(self) → typing.Callable[[typing.Any], bool]¶ - Returns
Validator of the parameter.
-
property
desc(self) → str¶ - Returns
Parameter description.
-
_infer_pre_assignment_hook(self)¶
-
_validate(self, value)¶
-
__bool__(self)¶ - Returns
False when the value is None, True otherwise.
-
set_default(self, val, verbose=1)¶ Set the default value. Has no effect if the parameter already has a value.
- Parameters
val – Default value to set.
verbose – Verbosity.
-
reset(self)¶ Set the parameter’s value to None, which means “not set”.
This method bypasses validator.
Example
>>> import matchzoo as mz
>>> param = mz.Param(
...     name='str', validator=lambda x: isinstance(x, str))
>>> param.value = 'hello'
>>> param.value = None
Traceback (most recent call last):
    ...
ValueError: Validator not satifised.
The validator's definition is as follows:
name='str', validator=lambda x: isinstance(x, str))
>>> param.reset()
>>> param.value is None
True
-
class
matchzoo.ParamTable¶ Bases: object
Parameter table class.
Example
>>> params = ParamTable()
>>> params.add(Param('ham', 'Parma Ham'))
>>> params.add(Param('egg', 'Over Easy'))
>>> params['ham']
'Parma Ham'
>>> params['egg']
'Over Easy'
>>> print(params)
ham                           Parma Ham
egg                           Over Easy
>>> params.add(Param('egg', 'Sunny side Up'))
Traceback (most recent call last):
    ...
ValueError: Parameter named egg already exists.
To re-assign parameter egg value, use `params["egg"] = value` instead.
-
property
hyper_space(self) → dict¶ - Returns
Hyper space of the table, a valid hyperopt graph.
-
to_frame(self) → pd.DataFrame¶ Convert the parameter table into a pandas data frame.
- Returns
A pandas.DataFrame.
Example
>>> import matchzoo as mz
>>> table = mz.ParamTable()
>>> table.add(mz.Param(name='x', value=10, desc='my x'))
>>> table.add(mz.Param(name='y', value=20, desc='my y'))
>>> table.to_frame()
  Name Description  Value Hyper-Space
0    x        my x     10        None
1    y        my y     20        None
-
__getitem__(self, key: str) → typing.Any¶ - Returns
The value of the parameter in the table named key.
-
__setitem__(self, key: str, value: typing.Any)¶ Set the value of the parameter named key.
- Parameters
key – Name of the parameter.
value – New value of the parameter to set.
-
__str__(self)¶ - Returns
Pretty formatted parameter table.
-
__iter__(self) → typing.Iterator¶ - Returns
An iterator that iterates over all parameter instances.
-
completed(self, exclude: typing.Optional[list] = None) → bool¶ Check if all params are filled.
- Parameters
exclude – List of names of parameters that are excluded from the check.
- Returns
True if all params are filled, False otherwise.
Example
>>> import matchzoo
>>> model = matchzoo.models.DenseBaseline()
>>> model.params.completed(
...     exclude=['task', 'out_activation_func', 'embedding',
...              'embedding_input_dim', 'embedding_output_dim']
... )
True
-
keys(self) → collections.abc.KeysView¶ - Returns
Parameter table keys.
-
__contains__(self, item)¶ - Returns
True if parameter in parameters.
-
update(self, other: dict)¶ Update self.
Update self with the key/value pairs from other, overwriting existing keys. Notice that this does not add new keys to self.
This method is usually used by models to obtain useful information from a preprocessor’s context.
- Parameters
other – The dictionary used to update self.
Example
>>> import matchzoo as mz
>>> model = mz.models.DenseBaseline()
>>> prpr = model.get_default_preprocessor()
>>> _ = prpr.fit(mz.datasets.toy.load_data(), verbose=0)
>>> model.params.update(prpr.context)
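The "overwrite existing keys, never add new ones" rule differs from dict.update, so a small sketch may help. The helper update_existing and all key names and values below are made up for illustration; a plain dict stands in for the parameter table.

```python
def update_existing(params, other):
    """Hypothetical helper: copy values from other only for keys params has."""
    for key, value in other.items():
        if key in params:
            params[key] = value
    return params


# Made-up parameter names and values, for illustration only.
params = {'embedding_input_dim': None, 'mlp_num_units': 64}
context = {'embedding_input_dim': 285, 'vocab_size': 285}
update_existing(params, context)
print(params)
```

Keys found only in the context are ignored, so a preprocessor context can carry extra entries without polluting the table.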
-
class
matchzoo.Embedding(data: dict, output_dim: int)¶ Bases: object
Embedding class.
- Examples::
>>> import matchzoo as mz
>>> train_raw = mz.datasets.toy.load_data()
>>> pp = mz.preprocessors.NaivePreprocessor()
>>> train = pp.fit_transform(train_raw, verbose=0)
>>> vocab_unit = mz.build_vocab_unit(train, verbose=0)
>>> term_index = vocab_unit.state['term_index']
>>> embed_path = mz.datasets.embeddings.EMBED_RANK
- To load from a file:
>>> embedding = mz.embedding.load_from_file(embed_path)
>>> matrix = embedding.build_matrix(term_index)
>>> matrix.shape[0] == len(term_index)
True
- To build your own:
>>> data = {'A':[0, 1], 'B':[2, 3]}
>>> embedding = mz.Embedding(data, 2)
>>> matrix = embedding.build_matrix({'A': 2, 'B': 1, '_PAD': 0})
>>> matrix.shape == (3, 2)
True
-
build_matrix(self, term_index: typing.Union[dict, mz.preprocessors.units.Vocabulary.TermIndex]) → np.ndarray¶ Build a matrix using term_index.
- Parameters
term_index – A dict or TermIndex to build with.
initializer – A callable that returns a default value for missing terms in data. (default: a random uniform distribution in range (-0.2, 0.2)).
- Returns
A matrix.
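The idea of filling a matrix from a term index, with random values for terms missing from data, can be sketched without any dependencies. This is a hypothetical illustration rather than the library's implementation: build_matrix_sketch is an invented name, plain lists stand in for the NumPy array, and the sketch assumes indices are contiguous and start at 0.

```python
import random


def build_matrix_sketch(data, term_index, output_dim,
                        low=-0.2, high=0.2, seed=0):
    """Hypothetical sketch: one row per index, known terms copied from data."""
    rng = random.Random(seed)
    # Assumption: indices are contiguous and start at 0.
    size = max(term_index.values()) + 1
    # Start from random rows so terms missing from data still get a vector.
    matrix = [[rng.uniform(low, high) for _ in range(output_dim)]
              for _ in range(size)]
    for term, index in term_index.items():
        if term in data:
            matrix[index] = list(data[term])
    return matrix


data = {'A': [0, 1], 'B': [2, 3]}
matrix = build_matrix_sketch(data, {'A': 2, 'B': 1, '_PAD': 0}, output_dim=2)
```

With the same inputs as the "build your own" example, the result has three rows of two values each, and only the '_PAD' row keeps its random initial values.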
-
matchzoo.build_unit_from_data_pack(unit: StatefulUnit, data_pack: mz.DataPack, mode: str = 'both', flatten: bool = True, verbose: int = 1) → StatefulUnit¶ Build a StatefulUnit from a DataPack object.
- Parameters
unit – StatefulUnit object to be built.
data_pack – The input DataPack object.
mode – One of ‘left’, ‘right’, and ‘both’, to determine the source data for building the VocabularyUnit.
flatten – Flatten the datapack or not. True to organize the DataPack text as a list, and False to organize DataPack text as a list of lists.
verbose – Verbosity.
- Returns
A built StatefulUnit object.
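The difference between the two flatten settings can be shown on made-up tokenized text. This sketch only illustrates the two shapes described above (a single list versus a list of lists); the sample data is hypothetical and no MatchZoo code is involved.

```python
from itertools import chain

# Tokenized texts, one list of tokens per document (made-up data).
texts = [['deep', 'text', 'matching'], ['text', 'retrieval']]

# flatten=True: every token from every document in one list.
flattened = list(chain.from_iterable(texts))

# flatten=False: keep one inner list per document.
nested = [list(tokens) for tokens in texts]

print(flattened)
print(nested)
```

The flat form suits units that only need term counts, while the nested form keeps document boundaries for units that need per-document statistics.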
-
matchzoo.build_vocab_unit(data_pack: DataPack, mode: str = 'both', verbose: int = 1) → Vocabulary¶ Build a preprocessor.units.Vocabulary given data_pack.
The data_pack should be preprocessed beforehand, and each item in text_left and text_right columns of the data_pack should be a list of tokens.
- Parameters
data_pack – The DataPack to build vocabulary upon.
mode – One of ‘left’, ‘right’, and ‘both’, to determine the source data for building the VocabularyUnit.
verbose – Verbosity.
- Returns
A built vocabulary unit.