matchzoo.preprocessors
Subpackages
matchzoo.preprocessors.units
matchzoo.preprocessors.units.character_index
matchzoo.preprocessors.units.digit_removal
matchzoo.preprocessors.units.frequency_filter
matchzoo.preprocessors.units.lemmatization
matchzoo.preprocessors.units.lowercase
matchzoo.preprocessors.units.matching_histogram
matchzoo.preprocessors.units.ngram_letter
matchzoo.preprocessors.units.punc_removal
matchzoo.preprocessors.units.stateful_unit
matchzoo.preprocessors.units.stemming
matchzoo.preprocessors.units.stop_removal
matchzoo.preprocessors.units.tokenize
matchzoo.preprocessors.units.truncated_length
matchzoo.preprocessors.units.unit
matchzoo.preprocessors.units.vocabulary
matchzoo.preprocessors.units.word_exact_match
matchzoo.preprocessors.units.word_hashing
Submodules
matchzoo.preprocessors.basic_preprocessor
matchzoo.preprocessors.bert_preprocessor
matchzoo.preprocessors.build_unit_from_data_pack
matchzoo.preprocessors.build_vocab_unit
matchzoo.preprocessors.cdssm_preprocessor
matchzoo.preprocessors.chain_transform
matchzoo.preprocessors.diin_preprocessor
matchzoo.preprocessors.dssm_preprocessor
matchzoo.preprocessors.naive_preprocessor
Package Contents
-
class matchzoo.preprocessors.DSSMPreprocessor(with_word_hashing: bool = True)
Bases: matchzoo.engine.base_preprocessor.BasePreprocessor
DSSM Model preprocessor.
-
with_word_hashing
with_word_hashing getter.
-
fit(self, data_pack: DataPack, verbose: int = 1)
Fit pre-processing context for transformation.
Parameters: - data_pack – data_pack to be preprocessed.
- verbose – Verbosity.
Returns: DSSMPreprocessor instance.
-
transform(self, data_pack: DataPack, verbose: int = 1)
Apply transformation on data, create tri-letter representation.
Parameters: - data_pack – Inputs to be preprocessed.
- verbose – Verbosity.
Returns: Transformed data as DataPack object.
-
classmethod _default_units(cls)
Prepare needed process units.
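The tri-letter representation mentioned under transform can be illustrated with a small standalone sketch. It does not call MatchZoo; the helper names `letter_trigrams` and `tri_letter_tokens` are hypothetical, and the `#` boundary marker and whitespace tokenization are assumptions made for illustration only.

```python
from typing import List


def letter_trigrams(word: str) -> List[str]:
    """Split one word into letter trigrams.

    The word is wrapped in '#' boundary markers first (an assumption
    of this sketch), then a window of three characters slides over it.
    """
    padded = f"#{word}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def tri_letter_tokens(text: str) -> List[str]:
    """Lowercase the text, split on whitespace, and expand each token
    into its letter trigrams."""
    tokens = text.lower().split()
    return [gram for token in tokens for gram in letter_trigrams(token)]


print(letter_trigrams("good"))      # ['#go', 'goo', 'ood', 'od#']
print(tri_letter_tokens("Hi you"))  # ['#hi', 'hi#', '#yo', 'you', 'ou#']
```

Because the trigram inventory is much smaller than a word vocabulary, unseen words still map onto known trigrams, which is the usual motivation for this kind of hashing.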
-
-
class matchzoo.preprocessors.NaivePreprocessor
Bases: matchzoo.engine.base_preprocessor.BasePreprocessor
Naive preprocessor.
Example
>>> import matchzoo as mz
>>> train_data = mz.datasets.toy.load_data()
>>> test_data = mz.datasets.toy.load_data(stage='test')
>>> preprocessor = mz.preprocessors.NaivePreprocessor()
>>> train_data_processed = preprocessor.fit_transform(train_data,
...                                                   verbose=0)
>>> type(train_data_processed)
<class 'matchzoo.data_pack.data_pack.DataPack'>
>>> test_data_transformed = preprocessor.transform(test_data,
...                                                verbose=0)
>>> type(test_data_transformed)
<class 'matchzoo.data_pack.data_pack.DataPack'>
-
fit(self, data_pack: DataPack, verbose: int = 1)
Fit pre-processing context for transformation.
Parameters: - data_pack – data_pack to be preprocessed.
- verbose – Verbosity.
Returns: NaivePreprocessor instance.
-
transform(self, data_pack: DataPack, verbose: int = 1)
Apply transformation on data, create truncated length representation.
Parameters: - data_pack – Inputs to be preprocessed.
- verbose – Verbosity.
Returns: Transformed data as DataPack object.
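The fit/transform contract described above (fit returns the preprocessor instance, transform returns new data) can be sketched independently of MatchZoo. `ToyPreprocessor` below is a hypothetical stand-in that works on plain lists of strings rather than a DataPack, and the context keys it stores are assumptions for illustration.

```python
from typing import Dict, List


class ToyPreprocessor:
    """Hypothetical stand-in for a preprocessor; not a MatchZoo class."""

    def __init__(self) -> None:
        # State learned during fit lives in a context dictionary.
        self.context: Dict[str, object] = {}

    def fit(self, texts: List[str]) -> "ToyPreprocessor":
        """Build a term index from the training texts and return self."""
        vocab = sorted({tok for text in texts for tok in text.lower().split()})
        # Index 0 is reserved for out-of-vocabulary terms.
        self.context["term_index"] = {tok: i + 1 for i, tok in enumerate(vocab)}
        self.context["vocab_size"] = len(vocab) + 1
        return self

    def transform(self, texts: List[str]) -> List[List[int]]:
        """Map each text to a list of term indices using the fitted context."""
        term_index = self.context["term_index"]
        return [[term_index.get(tok, 0) for tok in text.lower().split()]
                for text in texts]

    def fit_transform(self, texts: List[str]) -> List[List[int]]:
        """Fit on the texts, then transform the same texts."""
        return self.fit(texts).transform(texts)


pre = ToyPreprocessor().fit(["a b", "b c"])
print(pre.context["vocab_size"])  # 4
print(pre.transform(["c a d"]))   # [[3, 1, 0]]
```

Returning the instance from fit is what allows the chained call in fit_transform, and fitting only on training data keeps test-time terms such as "d" mapped to the reserved index.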
-
-
class matchzoo.preprocessors.BasicPreprocessor(truncated_mode: str = 'pre', truncated_length_left: int = 30, truncated_length_right: int = 30, filter_mode: str = 'df', filter_low_freq: float = 1, filter_high_freq: float = float('inf'), remove_stop_words: bool = False)
Bases: matchzoo.engine.base_preprocessor.BasePreprocessor
Basic preprocessor helper.
Parameters: - truncated_mode – String, mode used by TruncatedLength. Can be 'pre' or 'post'.
- truncated_length_left – Integer, maximum length of left in the data_pack.
- truncated_length_right – Integer, maximum length of right in the data_pack.
- filter_mode – String, mode used by FrequencyFilter. Can be 'df', 'cf', or 'idf'.
- filter_low_freq – Float, lower bound value used by FrequencyFilter.
- filter_high_freq – Float, upper bound value used by FrequencyFilter.
- remove_stop_words – Bool, whether to apply the StopRemoval unit.
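The effect of truncated_mode can be sketched with a standalone helper. The function `truncate` is hypothetical and does not call MatchZoo; it assumes the convention used by Keras `pad_sequences`, where 'pre' drops tokens from the start of the sequence and 'post' drops them from the end.

```python
from typing import List


def truncate(tokens: List[int], length: int, mode: str = "pre") -> List[int]:
    """Cut a token sequence down to at most `length` items.

    Assumed convention: 'pre' removes items from the beginning
    (keeping the tail), 'post' removes items from the end
    (keeping the head).
    """
    if mode not in ("pre", "post"):
        raise ValueError(f"mode must be 'pre' or 'post', got {mode!r}")
    if length <= 0:
        raise ValueError("length must be positive")
    if len(tokens) <= length:
        return list(tokens)  # already short enough, return a copy
    return tokens[-length:] if mode == "pre" else tokens[:length]


print(truncate([1, 2, 3, 4, 5], 3, mode="pre"))   # [3, 4, 5]
print(truncate([1, 2, 3, 4, 5], 3, mode="post"))  # [1, 2, 3]
```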
Example
>>> import matchzoo as mz
>>> train_data = mz.datasets.toy.load_data('train')
>>> test_data = mz.datasets.toy.load_data('test')
>>> preprocessor = mz.preprocessors.BasicPreprocessor(
...     truncated_length_left=10,
...     truncated_length_right=20,
...     filter_mode='df',
...     filter_low_freq=2,
...     filter_high_freq=1000,
...     remove_stop_words=True
... )
>>> preprocessor = preprocessor.fit(train_data, verbose=0)
>>> preprocessor.context['vocab_size']
226
>>> processed_train_data = preprocessor.transform(train_data,
...                                               verbose=0)
>>> type(processed_train_data)
<class 'matchzoo.data_pack.data_pack.DataPack'>
>>> test_data_transformed = preprocessor.transform(test_data,
...                                                verbose=0)
>>> type(test_data_transformed)
<class 'matchzoo.data_pack.data_pack.DataPack'>
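To make the 'df' filter mode used in the example more concrete, here is a standalone sketch of document-frequency filtering. It does not call MatchZoo; the helper names are hypothetical, and treating the lower bound as inclusive and the upper bound as exclusive is an assumption of this sketch.

```python
from collections import Counter
from typing import List


def document_frequency(docs: List[List[str]]) -> Counter:
    """Count, for each term, the number of documents that contain it."""
    df: Counter = Counter()
    for doc in docs:
        df.update(set(doc))  # each document contributes at most 1 per term
    return df


def filter_by_df(docs: List[List[str]],
                 low: float = 1,
                 high: float = float("inf")) -> List[List[str]]:
    """Keep only terms whose document frequency lies in [low, high).

    The inclusive/exclusive choice of bounds is an assumption here.
    """
    df = document_frequency(docs)
    keep = {term for term, count in df.items() if low <= count < high}
    return [[term for term in doc if term in keep] for doc in docs]


docs = [["a", "b", "a"], ["a", "c"], ["b", "d"]]
print(filter_by_df(docs, low=2))  # [['a', 'b', 'a'], ['a'], ['b']]
```

Raising the lower bound removes rare terms, which is consistent with a vocabulary that shrinks as filter_low_freq increases.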
-
fit(self, data_pack: DataPack, verbose: int = 1)
Fit pre-processing context for transformation.
Parameters: - data_pack – data_pack to be preprocessed.
- verbose – Verbosity.
Returns: BasicPreprocessor instance.
-
transform(self, data_pack: DataPack, verbose: int = 1)
Apply transformation on data, create truncated length representation.
Parameters: - data_pack – Inputs to be preprocessed.
- verbose – Verbosity.
Returns: Transformed data as DataPack object.
-
class matchzoo.preprocessors.CDSSMPreprocessor(truncated_mode: str = 'pre', truncated_length_left: int = 10, truncated_length_right: int = 40, with_word_hashing: bool = True)
Bases: matchzoo.engine.base_preprocessor.BasePreprocessor
CDSSM Model preprocessor.
-
with_word_hashing
with_word_hashing getter.
-
fit(self, data_pack: DataPack, verbose: int = 1)
Fit pre-processing context for transformation.
Parameters: - data_pack – data_pack to be preprocessed.
- verbose – Verbosity.
Returns: CDSSMPreprocessor instance.
-
transform(self, data_pack: DataPack, verbose: int = 1)
Apply transformation on data, create letter-ngram representation.
Parameters: - data_pack – Inputs to be preprocessed.
- verbose – Verbosity.
Returns: Transformed data as DataPack object.
-
classmethod _default_units(cls)
Prepare needed process units.
-
-
class matchzoo.preprocessors.DIINPreprocessor(truncated_mode: str = 'pre', truncated_length_left: int = 30, truncated_length_right: int = 50)
Bases: matchzoo.engine.base_preprocessor.BasePreprocessor
DIIN Model preprocessor.
-
fit(self, data_pack: DataPack, verbose: int = 1)
Fit pre-processing context for transformation.
Parameters: - data_pack – data_pack to be preprocessed.
- verbose – Verbosity.
Returns: DIINPreprocessor instance.
-
transform(self, data_pack: DataPack, verbose: int = 1)
Apply transformation on data.
Parameters: - data_pack – Inputs to be preprocessed.
- verbose – Verbosity.
Returns: Transformed data as DataPack object.
-
-
class matchzoo.preprocessors.BertPreprocessor(mode: str = 'bert-base-uncased')
Bases: matchzoo.engine.base_preprocessor.BasePreprocessor
Basic preprocessor helper.
Parameters: mode – String, name of the pretrained model to use. For the supported values, see https://huggingface.co/pytorch-transformers/pretrained_models.html.
-
fit(self, data_pack: DataPack, verbose: int = 1)
A tokenizer is all that BertPreprocessor needs.
-
transform(self, data_pack: DataPack, verbose: int = 1)
Apply transformation on data.
Parameters: - data_pack – Inputs to be preprocessed.
- verbose – Verbosity.
Returns: Transformed data as DataPack object.
-
-
matchzoo.preprocessors.list_available() → list