matchzoo
Subpackages
matchzoo.auto
matchzoo.data_pack
matchzoo.dataloader
matchzoo.datasets
matchzoo.embedding
matchzoo.engine
matchzoo.losses
matchzoo.metrics
matchzoo.models
matchzoo.modules
matchzoo.preprocessors
matchzoo.preprocessors.units
matchzoo.preprocessors.units.character_index
matchzoo.preprocessors.units.digit_removal
matchzoo.preprocessors.units.frequency_filter
matchzoo.preprocessors.units.lemmatization
matchzoo.preprocessors.units.lowercase
matchzoo.preprocessors.units.matching_histogram
matchzoo.preprocessors.units.ngram_letter
matchzoo.preprocessors.units.punc_removal
matchzoo.preprocessors.units.stateful_unit
matchzoo.preprocessors.units.stemming
matchzoo.preprocessors.units.stop_removal
matchzoo.preprocessors.units.tokenize
matchzoo.preprocessors.units.truncated_length
matchzoo.preprocessors.units.unit
matchzoo.preprocessors.units.vocabulary
matchzoo.preprocessors.units.word_exact_match
matchzoo.preprocessors.units.word_hashing
matchzoo.preprocessors.basic_preprocessor
matchzoo.preprocessors.bert_preprocessor
matchzoo.preprocessors.build_unit_from_data_pack
matchzoo.preprocessors.build_vocab_unit
matchzoo.preprocessors.cdssm_preprocessor
matchzoo.preprocessors.chain_transform
matchzoo.preprocessors.diin_preprocessor
matchzoo.preprocessors.dssm_preprocessor
matchzoo.preprocessors.naive_preprocessor
matchzoo.tasks
matchzoo.trainers
matchzoo.utils
Submodules
Package Contents
-
matchzoo.
USER_DIR
-
matchzoo.
USER_DATA_DIR
-
matchzoo.
USER_TUNED_MODELS_DIR
-
matchzoo.
__version__
= 0.0.1
-
class
matchzoo.
DataPack
(relation:pd.DataFrame, left:pd.DataFrame, right:pd.DataFrame)¶ Bases:
object
MatchZoo DataPack data structure that stores data frames and context.
DataPack is a MatchZoo native data structure that most MatchZoo data handling processes build upon. A DataPack consists of three parts: left, right and relation, each of which is a pandas.DataFrame.
Parameters:
- relation – Stores the relation between left documents and right documents using ids.
- left – Stores the content or features for id_left.
- right – Stores the content or features for id_right.
Example
>>> import pandas as pd
>>> from matchzoo import DataPack
>>> left = [
...     ['qid1', 'query 1'],
...     ['qid2', 'query 2']
... ]
>>> right = [
...     ['did1', 'document 1'],
...     ['did2', 'document 2']
... ]
>>> relation = [['qid1', 'did1', 1], ['qid2', 'did2', 1]]
>>> relation_df = pd.DataFrame(relation)
>>> left = pd.DataFrame(left)
>>> right = pd.DataFrame(right)
>>> dp = DataPack(
...     relation=relation_df,
...     left=left,
...     right=right,
... )
>>> len(dp)
2
-
class
FrameView
(data_pack:'DataPack')¶ Bases:
object
FrameView.
-
__getitem__
(self, index:typing.Union[int, slice, np.array])¶ Slicer.
-
__call__
(self) Returns: A full copy. Equivalent to frame[:].
-
-
DATA_FILENAME
= data.dill¶
-
has_label
True if label column exists, False otherwise.
Type: return
-
frame
View the data pack as a pandas.DataFrame.
The returned data frame is created by merging the left data frame, the right data frame and the relation data frame. Use [] to access an item or a slice of items.
Returns: A matchzoo.DataPack.FrameView instance.
Example
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> type(data_pack.frame)
<class 'matchzoo.data_pack.data_pack.DataPack.FrameView'>
>>> frame_slice = data_pack.frame[0:5]
>>> type(frame_slice)
<class 'pandas.core.frame.DataFrame'>
>>> list(frame_slice.columns)
['id_left', 'text_left', 'id_right', 'text_right', 'label']
>>> full_frame = data_pack.frame()
>>> len(full_frame) == len(data_pack)
True
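The merge of left, right and relation can be illustrated without pandas. The following is a minimal sketch using plain dictionaries, not MatchZoo's implementation; `build_frame_rows` and the sample data are hypothetical names chosen for this illustration.

```python
# Hypothetical stand-ins for the three parts of a DataPack.
left = {"qid1": "query 1", "qid2": "query 2"}
right = {"did1": "document 1", "did2": "document 2"}
relation = [("qid1", "did1", 1), ("qid2", "did2", 1)]


def build_frame_rows(relation, left, right):
    """Join each relation row with the matching left and right text."""
    rows = []
    for id_left, id_right, label in relation:
        rows.append({
            "id_left": id_left,
            "text_left": left[id_left],
            "id_right": id_right,
            "text_right": right[id_right],
            "label": label,
        })
    return rows


rows = build_frame_rows(relation, left, right)
print(list(rows[0].keys()))
```

Each relation row contributes exactly one merged row, which is why the length of the full frame equals the length of the data pack.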
-
relation
¶ relation getter.
-
__len__
(self) Get number of rows in the DataPack object.
-
unpack
(self)¶ Unpack the data for training.
The return value can be directly fed to model.fit or model.fit_generator.
Returns: A tuple of (X, y). y is None if self has no label.
Example
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> X, y = data_pack.unpack()
>>> type(X)
<class 'dict'>
>>> sorted(X.keys())
['id_left', 'id_right', 'text_left', 'text_right']
>>> type(y)
<class 'numpy.ndarray'>
>>> X, y = data_pack.drop_label().unpack()
>>> type(y)
<class 'NoneType'>
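The (X, y) split can be sketched on plain row dictionaries. This is an illustration under the assumption that every non-label column becomes a feature; the standalone `unpack` function and the sample rows are hypothetical and use lists rather than numpy arrays.

```python
def unpack(rows):
    """Split row dicts into a feature dict X and a label list y."""
    columns = list(rows[0].keys())
    # Every column except the label becomes a feature list in X.
    X = {name: [row[name] for row in rows]
         for name in columns if name != "label"}
    # y is None when there is no label column.
    y = [row["label"] for row in rows] if "label" in columns else None
    return X, y


labeled = [
    {"id_left": "qid1", "text_left": "query 1",
     "id_right": "did1", "text_right": "document 1", "label": 1},
]
unlabeled = [{k: v for k, v in row.items() if k != "label"} for row in labeled]

X, y = unpack(labeled)
X_only, y_none = unpack(unlabeled)
```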
-
__getitem__
(self, index:typing.Union[int, slice, np.array]) Get specific item(s) as a new DataPack.
The returned DataPack will be a copy of the subset of the original DataPack.
Parameters: index – Index of the item(s) to get.
Returns: An instance of DataPack.
-
copy
(self)¶ Returns: A deep copy.
-
save
(self, dirpath:typing.Union[str, Path]) Save the DataPack object.
A saved DataPack is represented as a directory containing a DataPack object (transformed user input as features and context), which is saved by pickle.
Parameters: dirpath – directory path of the saved DataPack.
-
_optional_inplace
(func)¶ Decorator that adds inplace key word argument to a method.
Decorate any method that modifies inplace to make that inplace change optional.
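A decorator of this kind might look like the sketch below. It is an assumption about the general technique, not MatchZoo's source: `optional_inplace` and the `Box` class are hypothetical, and the choice to return None when `inplace=True` follows the doctests in this section, where in-place calls print nothing.

```python
import copy
import functools


def optional_inplace(func):
    """Add an `inplace` keyword argument to a method that mutates `self`."""

    @functools.wraps(func)
    def wrapper(self, *args, inplace=False, **kwargs):
        # Work on a deep copy unless the caller asks for in-place changes.
        target = self if inplace else copy.deepcopy(self)
        func(target, *args, **kwargs)
        # Return the modified copy, or None when modifying in place.
        return None if inplace else target

    return wrapper


class Box:
    """Hypothetical container used only for this illustration."""

    def __init__(self, items):
        self.items = list(items)

    @optional_inplace
    def clear(self):
        self.items.clear()


box = Box([1, 2, 3])
emptied = box.clear()             # returns a modified copy
unchanged = list(box.items)       # the original is untouched
result = box.clear(inplace=True)  # modifies box itself
```

This explains why methods such as shuffle, drop_label and append_text_length accept an inplace argument that does not appear in their signatures.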
-
shuffle
(self)¶ Shuffle the data pack by shuffling the relation column.
Parameters: inplace – True to modify inplace, False to return a modified copy. (default: False)
Example
>>> import matchzoo as mz
>>> import numpy.random
>>> numpy.random.seed(0)
>>> data_pack = mz.datasets.toy.load_data()
>>> orig_ids = data_pack.relation['id_left']
>>> shuffled = data_pack.shuffle()
>>> (shuffled.relation['id_left'] != orig_ids).any()
True
-
drop_label
(self)¶ Remove label column from the data pack.
Parameters: inplace – True to modify inplace, False to return a modified copy. (default: False)
Example
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> data_pack.has_label
True
>>> data_pack.drop_label(inplace=True)
>>> data_pack.has_label
False
-
append_text_length
(self, verbose=1)¶ Append length_left and length_right columns.
Parameters: - inplace – True to modify inplace, False to return a modified copy. (default: False)
- verbose – Verbosity.
Example
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> 'length_left' in data_pack.frame[0].columns
False
>>> new_data_pack = data_pack.append_text_length(verbose=0)
>>> 'length_left' in new_data_pack.frame[0].columns
True
>>> 'length_left' in data_pack.frame[0].columns
False
>>> data_pack.append_text_length(inplace=True, verbose=0)
>>> 'length_left' in data_pack.frame[0].columns
True
-
apply_on_text
(self, func:typing.Callable, mode:str='both', rename:typing.Optional[str]=None, verbose:int=1)¶ Apply func to text columns based on mode.
Parameters: - func – The function to apply.
- mode – One of “both”, “left” and “right”.
- rename – If set, use new names for results instead of replacing the original columns. To set rename in “both” mode, use a tuple of str, e.g. (“text_left_new_name”, “text_right_new_name”).
- inplace – True to modify inplace, False to return a modified copy. (default: False)
- verbose – Verbosity.
Examples:
>>> import matchzoo as mz
>>> data_pack = mz.datasets.toy.load_data()
>>> frame = data_pack.frame
To apply len on the left text and add the result as ‘length_left’:
>>> data_pack.apply_on_text(len, mode='left',
...                         rename='length_left',
...                         inplace=True,
...                         verbose=0)
>>> list(frame[0].columns) # noqa: E501
['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'label']
To do the same to the right text:
>>> data_pack.apply_on_text(len, mode='right',
...                         rename='length_right',
...                         inplace=True,
...                         verbose=0)
>>> list(frame[0].columns) # noqa: E501
['id_left', 'text_left', 'length_left', 'id_right', 'text_right', 'length_right', 'label']
To do the same to both texts at the same time:
>>> data_pack.apply_on_text(len, mode='both',
...                         rename=('extra_left', 'extra_right'),
...                         inplace=True,
...                         verbose=0)
>>> list(frame[0].columns) # noqa: E501
['id_left', 'text_left', 'length_left', 'extra_left', 'id_right', 'text_right', 'length_right', 'extra_right', 'label']
To suppress outputs:
>>> data_pack.apply_on_text(len, mode='both', verbose=0,
...                         inplace=True)
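The effect of rename can be sketched on plain row dictionaries. This is a simplified illustration for a single column, not the library's code; the standalone `apply_on_text` function and its `column` argument are hypothetical.

```python
def apply_on_text(rows, func, column, rename=None):
    """Apply `func` to one text column of every row, returning new rows."""
    target = rename if rename is not None else column
    # Copy each row so the input list is left unchanged.
    return [{**row, target: func(row[column])} for row in rows]


rows = [{"text_left": "query one", "text_right": "a longer document"}]

# With `rename`, the original column is kept and a new one is added.
with_length = apply_on_text(rows, len, "text_left", rename="length_left")

# Without `rename`, the original column is replaced.
upper = apply_on_text(rows, str.upper, "text_right")
```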
-
_apply_on_text_right
(self, func, rename, verbose=1)¶
-
_apply_on_text_left
(self, func, rename, verbose=1)¶
-
_apply_on_text_both
(self, func, rename, verbose=1)¶
-
matchzoo.
load_data_pack
(dirpath:typing.Union[str, Path]) → DataPack Load a DataPack. The reverse function of save().
Parameters: dirpath – directory path of the saved model.
Returns: a DataPack instance.
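The save and load round trip can be sketched with the standard library. This is an illustration only: it uses pickle on a plain dictionary, and the file name `data.pkl` and the helper names `save` and `load` are assumptions made for this sketch rather than MatchZoo's own.

```python
import pickle
import tempfile
from pathlib import Path

DATA_FILENAME = "data.pkl"  # hypothetical file name for this sketch


def save(obj, dirpath):
    """Pickle `obj` into a file inside `dirpath`, creating the directory."""
    dirpath = Path(dirpath)
    dirpath.mkdir(parents=True, exist_ok=True)
    with open(dirpath / DATA_FILENAME, "wb") as f:
        pickle.dump(obj, f)


def load(dirpath):
    """Reverse of `save`: read the pickled object back from `dirpath`."""
    with open(Path(dirpath) / DATA_FILENAME, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    original = {"relation": [("qid1", "did1", 1)]}
    save(original, Path(tmp) / "my_pack")
    restored = load(Path(tmp) / "my_pack")
```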
-
matchzoo.
chain_transform
(units:typing.List[Unit]) → typing.Callable¶ Compose unit transformations into a single function.
Parameters: units – List of matchzoo.StatelessUnit.
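Composing transformations into one callable can be sketched as follows. This assumes the units are applied in list order and uses plain functions in place of unit objects; `chain` is a hypothetical name, not the library function.

```python
import functools


def chain(functions):
    """Compose callables left to right into a single function."""

    def chained(value):
        # Feed the output of each function into the next one.
        return functools.reduce(lambda acc, fn: fn(acc), functions, value)

    return chained


# Plain functions standing in for unit transformations.
lowercase = str.lower
tokenize = str.split

transform = chain([lowercase, tokenize])
tokens = transform("Hello World")
```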
-
matchzoo.
load_preprocessor
(dirpath:typing.Union[str, Path]) → 'mz.DataPack' Load the fitted context. The reverse function of save().
Parameters: dirpath – directory path of the saved model.
Returns: a DSSMPreprocessor instance.
-
class
matchzoo.
Param
(name:str, value:typing.Any=None, hyper_space:typing.Optional[SpaceType]=None, validator:typing.Optional[typing.Callable[[typing.Any], bool]]=None, desc:typing.Optional[str]=None)¶ Bases:
object
Parameter class.
Basic usage with a name and value:
>>> param = Param('my_param', 10)
>>> param.name
'my_param'
>>> param.value
10
Use with a validator to make sure the parameter always keeps a valid value.
>>> param = Param(
...     name='my_param',
...     value=5,
...     validator=lambda x: 0 < x < 20
... )
>>> param.validator  # doctest: +ELLIPSIS
<function <lambda> at 0x...>
>>> param.value
5
>>> param.value = 10
>>> param.value
10
>>> param.value = -1
Traceback (most recent call last):
    ...
ValueError: Validator not satifised.
The validator's definition is as follows:
validator=lambda x: 0 < x < 20
Use with a hyper space. Setting up a hyper space for a parameter makes the parameter tunable in a matchzoo.engine.Tuner.
>>> from matchzoo.engine.hyper_spaces import quniform
>>> param = Param(
...     name='positive_num',
...     value=1,
...     hyper_space=quniform(low=1, high=5)
... )
>>> param.hyper_space  # doctest: +ELLIPSIS
<matchzoo.engine.hyper_spaces.quniform object at ...>
>>> from hyperopt.pyll.stochastic import sample
>>> hyperopt_space = param.hyper_space.convert(param.name)
>>> samples = [sample(hyperopt_space) for _ in range(64)]
>>> set(samples) == {1, 2, 3, 4, 5}
True
The boolean value of a Param instance is only True when the value is not None. This is because some default falsy values like zero or an empty list are valid parameter values. In other words, the boolean value indicates whether the parameter value is filled.
>>> param = Param('dropout')
>>> if param:
...     print('OK')
>>> param = Param('dropout', 0)
>>> if param:
...     print('OK')
OK
A _pre_assignment_hook is initialized as a data type converter if the value is set as a number, to keep the data type of the parameter consistent. This conversion supports Python built-in numbers, numpy numbers, and any number that inherits numbers.Number.
>>> param = Param('float_param', 0.5)
>>> param.value = 10
>>> param.value
10.0
>>> type(param.value)
<class 'float'>
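The interplay of the validator, the boolean value and the type-converting hook can be sketched in a small stand-alone class. `SimpleParam` is a hypothetical, simplified illustration and does not reproduce the real Param class; in particular, remembering the type of the initial numeric value is an assumption about how such a hook could work.

```python
import numbers


class SimpleParam:
    """Hypothetical, simplified parameter holder for illustration only."""

    def __init__(self, name, value=None, validator=None):
        self.name = name
        self._validator = validator
        self._hook = None
        # Remember the numeric type of the initial value for later casts.
        if isinstance(value, numbers.Number) and not isinstance(value, bool):
            self._hook = type(value)
        self._value = None
        if value is not None:
            self.value = value

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        # Convert first, then validate the converted value.
        if self._hook is not None:
            new_value = self._hook(new_value)
        if self._validator is not None and not self._validator(new_value):
            raise ValueError("Validator not satisfied.")
        self._value = new_value

    def __bool__(self):
        # Filled means "not None", so 0 and [] still count as filled.
        return self._value is not None


param = SimpleParam("float_param", 0.5)
param.value = 10  # stored as 10.0 because the initial value was a float
```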
-
name
¶ Name of the parameter.
Type: return
-
value
¶ Value of the parameter.
Type: return
-
hyper_space
¶ Hyper space of the parameter.
Type: return
-
validator
¶ Validator of the parameter.
Type: return
-
desc
¶ Parameter description.
Type: return
-
_infer_pre_assignment_hook
(self)¶
-
_validate
(self, value)¶
-
__bool__
(self)¶ Returns: False when the value is None, True otherwise.
-
set_default
(self, val, verbose=1) Set the default value. Has no effect if the parameter already has a value.
Parameters: - val – Default value to set.
- verbose – Verbosity.
-
reset
(self)¶ Set the parameter’s value to None, which means “not set”.
This method bypasses validator.
Example
>>> import matchzoo as mz
>>> param = mz.Param(
...     name='str', validator=lambda x: isinstance(x, str))
>>> param.value = 'hello'
>>> param.value = None
Traceback (most recent call last):
    ...
ValueError: Validator not satifised.
The validator's definition is as follows:
name='str', validator=lambda x: isinstance(x, str))
>>> param.reset()
>>> param.value is None
True
-
-
class
matchzoo.
ParamTable
¶ Bases:
object
Parameter table class.
Example
>>> params = ParamTable()
>>> params.add(Param('ham', 'Parma Ham'))
>>> params.add(Param('egg', 'Over Easy'))
>>> params['ham']
'Parma Ham'
>>> params['egg']
'Over Easy'
>>> print(params)
ham Parma Ham
egg Over Easy
>>> params.add(Param('egg', 'Sunny side Up'))
Traceback (most recent call last):
    ...
ValueError: Parameter named egg already exists.
To re-assign parameter egg value, use `params["egg"] = value` instead.
-
hyper_space
¶ Hyper space of the table, a valid hyperopt graph.
Type: return
-
add
(self, param:Param)¶ Parameters: param – parameter to add.
-
get
(self, key)¶ Returns: The parameter in the table named key.
-
set
(self, key, param:Param)¶ Set key to parameter param.
-
to_frame
(self)¶ Convert the parameter table into a pandas data frame.
Returns: A pandas.DataFrame.
Example
>>> import matchzoo as mz
>>> table = mz.ParamTable()
>>> table.add(mz.Param(name='x', value=10, desc='my x'))
>>> table.add(mz.Param(name='y', value=20, desc='my y'))
>>> table.to_frame()
  Name Description  Value Hyper-Space
0    x        my x     10        None
1    y        my y     20        None
-
__getitem__
(self, key:str)¶ Returns: The value of the parameter in the table named key.
-
__setitem__
(self, key:str, value:typing.Any)¶ Set the value of the parameter named key.
Parameters: - key – Name of the parameter.
- value – New value of the parameter to set.
-
__str__
(self)¶ Returns: Pretty formatted parameter table.
-
__iter__
(self) Returns: An iterator that iterates over all parameter instances.
-
completed
(self) Returns: True if all params are filled, False otherwise.
Example
>>> import matchzoo
>>> model = matchzoo.models.DenseBaseline()
>>> model.params.completed()
False
-
keys
(self)¶ Returns: Parameter table keys.
-
__contains__
(self, item)¶ Returns: True if parameter in parameters.
-
update
(self, other:dict)¶ Update self.
Update self with the key/value pairs from other, overwriting existing keys. Notice that this does not add new keys to self.
This method is usually used by models to obtain useful information from a preprocessor’s context.
Parameters: other – The dictionary used to update.
Example
>>> import matchzoo as mz
>>> model = mz.models.DenseBaseline()
>>> prpr = model.get_default_preprocessor()
>>> _ = prpr.fit(mz.datasets.toy.load_data(), verbose=0)
>>> model.params.update(prpr.context)
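The rule that update overwrites existing keys but never adds new ones can be sketched on plain dictionaries. The function `update_existing` and the parameter and context names below are hypothetical and chosen only for illustration.

```python
def update_existing(table, other):
    """Overwrite values for keys already in `table`; ignore the rest."""
    for key, value in other.items():
        if key in table:
            table[key] = value
    return table


# Hypothetical parameter names and context, for illustration only.
params = {"vocab_size": None, "embedding_dim": 100}
context = {"vocab_size": 285, "filter_low_freq": 2}

update_existing(params, context)
```

Here `vocab_size` is filled from the context, while `filter_low_freq` is ignored because the table has no parameter with that name.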
-
-
class
matchzoo.
Embedding
(data:dict, output_dim:int)¶ Bases:
object
Embedding class.
Examples:
>>> import matchzoo as mz
>>> train_raw = mz.datasets.toy.load_data()
>>> pp = mz.preprocessors.NaivePreprocessor()
>>> train = pp.fit_transform(train_raw, verbose=0)
>>> vocab_unit = mz.build_vocab_unit(train, verbose=0)
>>> term_index = vocab_unit.state['term_index']
>>> embed_path = mz.datasets.embeddings.EMBED_RANK
To load from a file:
>>> embedding = mz.embedding.load_from_file(embed_path)
>>> matrix = embedding.build_matrix(term_index)
>>> matrix.shape[0] == len(term_index)
True
To build your own:
>>> data = {'A':[0, 1], 'B':[2, 3]}
>>> embedding = mz.Embedding(data, 2)
>>> matrix = embedding.build_matrix({'A': 2, 'B': 1, '_PAD': 0})
>>> matrix.shape == (3, 2)
True
-
build_matrix
(self, term_index:typing.Union[dict, mz.preprocessors.units.Vocabulary.TermIndex], initializer=lambda: np.random.uniform(-0.2, 0.2))¶ Build a matrix using term_index.
Parameters: - term_index – A dict or TermIndex to build with.
- initializer – A callable that returns a default value for missing terms in data. (default: a random uniform distribution in range (-0.2, 0.2)).
Returns: A matrix.
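How term_index and initializer interact can be sketched with the standard library. This is a simplified illustration that returns a list of lists instead of a numpy array and assumes the indices in term_index run from 0 to len(term_index) - 1; the standalone `build_matrix` function is hypothetical.

```python
import random


def build_matrix(data, output_dim, term_index,
                 initializer=lambda: random.uniform(-0.2, 0.2)):
    """Return one row per index in `term_index`, as a list of lists."""
    # Start with initializer values so missing terms get a default row.
    matrix = [[initializer() for _ in range(output_dim)]
              for _ in range(len(term_index))]
    # Overwrite the rows of terms that have a known vector in `data`.
    for term, index in term_index.items():
        if term in data:
            matrix[index] = list(data[term])
    return matrix


data = {"A": [0, 1], "B": [2, 3]}
matrix = build_matrix(data, 2, {"A": 2, "B": 1, "_PAD": 0})
```

The row for `_PAD` is not in `data`, so it keeps the values produced by the initializer.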
-
matchzoo.
build_unit_from_data_pack
(unit:StatefulUnit, data_pack:mz.DataPack, mode:str='both', flatten:bool=True, verbose:int=1) → StatefulUnit Build a StatefulUnit from a DataPack object.
Parameters:
- unit – StatefulUnit object to be built.
- data_pack – The input DataPack object.
- mode – One of ‘left’, ‘right’, and ‘both’, to determine the source data for building the VocabularyUnit.
- flatten – Flatten the datapack or not. True to organize the DataPack text as a list, and False to organize DataPack text as a list of lists.
- verbose – Verbosity.
Returns: A built StatefulUnit object.
-
matchzoo.
build_vocab_unit
(data_pack:DataPack, mode:str='both', verbose:int=1) → Vocabulary Build a preprocessor.units.Vocabulary given data_pack.
The data_pack should be preprocessed beforehand, and each item in text_left and text_right columns of the data_pack should be a list of tokens.
Parameters:
- data_pack – The DataPack to build vocabulary upon.
- mode – One of ‘left’, ‘right’, and ‘both’, to determine the source data for building the VocabularyUnit.
- verbose – Verbosity.
Returns: A built vocabulary unit.
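Building a term index from tokenized text can be sketched as follows. This is an illustration of the general idea, not the Vocabulary unit itself; `build_term_index` is a hypothetical helper, and reserving index 0 for a `_PAD` token is an assumption made for this sketch.

```python
def build_term_index(token_lists, pad_token="_PAD"):
    """Map each distinct token to an integer index, reserving 0 for padding."""
    term_index = {pad_token: 0}
    for tokens in token_lists:
        for token in tokens:
            # Assign the next free index the first time a token is seen.
            if token not in term_index:
                term_index[token] = len(term_index)
    return term_index


# Already-tokenized text_left and text_right entries (as with mode='both').
text_left = [["how", "are", "you"]]
text_right = [["you", "are", "fine"]]
term_index = build_term_index(text_left + text_right)
```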