matchzoo.preprocessors.basic_preprocessor
Basic Preprocessor.
Module Contents
Classes
BasicPreprocessor | Basic preprocessor helper.
class matchzoo.preprocessors.basic_preprocessor.BasicPreprocessor(truncated_mode: str = 'pre', truncated_length_left: int = None, truncated_length_right: int = None, filter_mode: str = 'df', filter_low_freq: float = 1, filter_high_freq: float = float('inf'), remove_stop_words: bool = False, ngram_size: typing.Optional[int] = None)
Bases: matchzoo.engine.base_preprocessor.BasePreprocessor
Basic preprocessor helper.
- Parameters
truncated_mode – String, mode used by TruncatedLength. Can be 'pre' or 'post'.
truncated_length_left – Integer, maximum length of left in the data_pack.
truncated_length_right – Integer, maximum length of right in the data_pack.
filter_mode – String, mode used by FrequenceFilterUnit. Can be 'df', 'cf', or 'idf'.
filter_low_freq – Float, lower bound value used by FrequenceFilterUnit.
filter_high_freq – Float, upper bound value used by FrequenceFilterUnit.
remove_stop_words – Bool, whether to apply the StopRemovalUnit unit.
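To make the filter_mode, filter_low_freq, and filter_high_freq parameters more concrete, here is a minimal pure-Python sketch of document-frequency ('df') filtering. It is not MatchZoo's implementation: the helper name df_filter and the inclusive treatment of both bounds are assumptions made for illustration only.

```python
from collections import Counter
from typing import List


def df_filter(docs: List[List[str]], low: float = 1,
              high: float = float('inf')) -> List[List[str]]:
    """Keep tokens whose document frequency lies in [low, high].

    Hypothetical helper; the inclusive bounds are an assumption.
    """
    # Count each token at most once per document (document frequency).
    df = Counter(token for doc in docs for token in set(doc))
    # Drop tokens that are too rare or too common, preserving order.
    return [[t for t in doc if low <= df[t] <= high] for doc in docs]


docs = [['a', 'b', 'a'], ['a', 'c'], ['b', 'a']]
filtered = df_filter(docs, low=2)  # 'c' appears in only one document
```

With low=2, the token 'c' is removed because it occurs in a single document, while 'a' and 'b' are kept.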
Example
>>> import matchzoo as mz
>>> train_data = mz.datasets.toy.load_data('train')
>>> test_data = mz.datasets.toy.load_data('test')
>>> preprocessor = mz.preprocessors.BasicPreprocessor(
...     truncated_length_left=10,
...     truncated_length_right=20,
...     filter_mode='df',
...     filter_low_freq=2,
...     filter_high_freq=1000,
...     remove_stop_words=True
... )
>>> preprocessor = preprocessor.fit(train_data, verbose=0)
>>> preprocessor.context['vocab_size']
226
>>> processed_train_data = preprocessor.transform(train_data,
...                                               verbose=0)
>>> type(processed_train_data)
<class 'matchzoo.data_pack.data_pack.DataPack'>
>>> test_data_transformed = preprocessor.transform(test_data,
...                                                verbose=0)
>>> type(test_data_transformed)
<class 'matchzoo.data_pack.data_pack.DataPack'>
fit(self, data_pack: DataPack, verbose: int = 1)
Fit pre-processing context for transformation.
- Parameters
data_pack – data_pack to be preprocessed.
verbose – Verbosity.
- Returns
The fitted BasicPreprocessor instance.
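The fit step follows the familiar fit-then-transform pattern: state learned from the training data is stored in a context and reused when transforming other data. A rough sketch of that pattern, using a hypothetical TinyPreprocessor class rather than the MatchZoo API (the 'term_index' key and the reserved index 0 are assumptions of this sketch):

```python
from typing import Dict, List


class TinyPreprocessor:
    """Hypothetical stand-in illustrating fit/transform; not MatchZoo code."""

    def __init__(self) -> None:
        self.context: Dict[str, object] = {}

    def fit(self, docs: List[List[str]]) -> 'TinyPreprocessor':
        # Build a token -> index mapping from the training data only.
        vocab = sorted({t for doc in docs for t in doc})
        # Index 0 is reserved for unseen tokens (an assumption of this sketch).
        self.context['term_index'] = {t: i + 1 for i, t in enumerate(vocab)}
        self.context['vocab_size'] = len(vocab) + 1
        return self  # returning self allows: prep = prep.fit(data)

    def transform(self, docs: List[List[str]]) -> List[List[int]]:
        # Reuse the fitted mapping; tokens not seen during fit map to 0.
        index = self.context['term_index']
        return [[index.get(t, 0) for t in doc] for doc in docs]


prep = TinyPreprocessor().fit([['hello', 'world'], ['hello']])
encoded = prep.transform([['world', 'unseen']])
```

Because fit returns the instance itself, the fitted object can be reassigned or chained, as in the example above where the result of fit is assigned back to preprocessor.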
transform(self, data_pack: DataPack, verbose: int = 1) → DataPack
Apply transformation on data, create truncated length representation.
- Parameters
data_pack – Inputs to be preprocessed.
verbose – Verbosity.
- Returns
Transformed data as a DataPack object.
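The "truncated length representation" can be illustrated with a small sketch of the 'pre' and 'post' modes. This assumes 'pre' drops tokens from the beginning of a sequence and 'post' drops them from the end, which is the convention in Keras-style sequence utilities; check the TruncatedLength unit before relying on it. The function name truncate is hypothetical.

```python
from typing import List


def truncate(tokens: List[int], length: int, mode: str = 'pre') -> List[int]:
    """Cut a token sequence down to at most `length` items.

    Hypothetical helper; the meaning of 'pre' and 'post' is an assumption.
    """
    if mode not in ('pre', 'post'):
        raise ValueError(f"mode must be 'pre' or 'post', got {mode!r}")
    if length <= 0:
        return []
    if len(tokens) <= length:
        return list(tokens)  # already short enough, return a copy
    # 'pre' keeps the tail (drops the start); 'post' keeps the head.
    return tokens[-length:] if mode == 'pre' else tokens[:length]


tail = truncate([1, 2, 3, 4, 5], 3, mode='pre')
head = truncate([1, 2, 3, 4, 5], 3, mode='post')
```

Under this assumption, truncated_length_left and truncated_length_right would play the role of length for the left and right texts respectively.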