matchzoo.preprocessors.basic_preprocessor
Basic Preprocessor.
Module Contents
-
class matchzoo.preprocessors.basic_preprocessor.BasicPreprocessor(truncated_mode: str = 'pre', truncated_length_left: int = 30, truncated_length_right: int = 30, filter_mode: str = 'df', filter_low_freq: float = 1, filter_high_freq: float = float('inf'), remove_stop_words: bool = False)
Bases: matchzoo.engine.base_preprocessor.BasePreprocessor
Basic preprocessor helper.
Parameters:
- truncated_mode – String, mode used by TruncatedLength. Can be ‘pre’ or ‘post’.
- truncated_length_left – Integer, maximum length of left in the data_pack.
- truncated_length_right – Integer, maximum length of right in the data_pack.
- filter_mode – String, mode used by FrequenceFilterUnit. Can be ‘df’, ‘cf’, or ‘idf’.
- filter_low_freq – Float, lower bound value used by FrequenceFilterUnit.
- filter_high_freq – Float, upper bound value used by FrequenceFilterUnit.
- remove_stop_words – Bool, whether to apply the StopRemovalUnit unit.
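To make the two truncation modes concrete, here is a minimal pure-Python sketch. It is not MatchZoo's TruncatedLength implementation; the helper name truncate is hypothetical, and it assumes the Keras-style convention in which ‘pre’ drops tokens from the front of the sequence and ‘post’ drops them from the end.

```python
def truncate(tokens, length, mode="pre"):
    """Cut a token list down to at most `length` items.

    Assumption: 'pre' removes items from the beginning (keeps the tail),
    'post' removes items from the end (keeps the head).
    """
    if mode not in ("pre", "post"):
        raise ValueError(f"mode must be 'pre' or 'post', got {mode!r}")
    if length <= 0:
        return []
    if len(tokens) <= length:
        # Already short enough: return an unchanged copy.
        return list(tokens)
    if mode == "pre":
        return list(tokens[-length:])
    return list(tokens[:length])


tokens = [1, 2, 3, 4, 5, 6, 7]
kept_pre = truncate(tokens, 3, mode="pre")    # keeps the last three items
kept_post = truncate(tokens, 3, mode="post")  # keeps the first three items
```

Under this assumption, truncated_length_left and truncated_length_right would play the role of length for the left and right texts respectively.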
Example
>>> import matchzoo as mz
>>> train_data = mz.datasets.toy.load_data('train')
>>> test_data = mz.datasets.toy.load_data('test')
>>> preprocessor = mz.preprocessors.BasicPreprocessor(
...     truncated_length_left=10,
...     truncated_length_right=20,
...     filter_mode='df',
...     filter_low_freq=2,
...     filter_high_freq=1000,
...     remove_stop_words=True
... )
>>> preprocessor = preprocessor.fit(train_data, verbose=0)
>>> preprocessor.context['vocab_size']
226
>>> processed_train_data = preprocessor.transform(train_data,
...                                               verbose=0)
>>> type(processed_train_data)
<class 'matchzoo.data_pack.data_pack.DataPack'>
>>> test_data_transformed = preprocessor.transform(test_data,
...                                                verbose=0)
>>> type(test_data_transformed)
<class 'matchzoo.data_pack.data_pack.DataPack'>
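The example above passes filter_mode='df' with a lower and an upper bound. As a rough illustration of what document-frequency filtering means, the sketch below counts in how many documents each term appears and keeps only terms inside the bounds. The function names are hypothetical, and the treatment of the bounds (inclusive lower, exclusive upper) is an assumption, not something this page confirms about FrequenceFilterUnit.

```python
from collections import Counter


def fit_df_vocab(docs, low=1, high=float("inf")):
    """Return the terms whose document frequency lies in [low, high).

    Assumption: the lower bound is inclusive and the upper bound exclusive.
    """
    df = Counter()
    for tokens in docs:
        # A term counts once per document, however often it repeats.
        df.update(set(tokens))
    return {term for term, count in df.items() if low <= count < high}


def filter_tokens(tokens, vocab):
    """Drop every token that is not in the fitted vocabulary."""
    return [tok for tok in tokens if tok in vocab]


docs = [["a", "b", "c"], ["a", "b"], ["a", "d", "d"]]
vocab = fit_df_vocab(docs, low=2)  # 'c' and 'd' each occur in one document only
filtered = [filter_tokens(doc, vocab) for doc in docs]
```

Note that "d" occurs twice in the collection but in a single document, so a document-frequency bound of 2 removes it, whereas a collection-frequency bound of 2 would keep it.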
-
fit(self, data_pack: DataPack, verbose: int = 1)
Fit pre-processing context for transformation.
Parameters: - data_pack – data_pack to be preprocessed.
- verbose – Integer, verbosity level.
Returns: The fitted BasicPreprocessor instance.
-
transform(self, data_pack: DataPack, verbose: int = 1)
Apply the fitted transformation to the data and create a truncated-length representation.
Parameters: - data_pack – Inputs to be preprocessed.
- verbose – Integer, verbosity level.
Returns: Transformed data as a DataPack object.
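The fit and transform methods follow the usual two-step pattern: fit learns state from the training data and returns the preprocessor itself, and transform applies that state to any data. The toy class below sketches the pattern on plain strings. It is not MatchZoo code: ToyPreprocessor, its term_index entry, and the choice to reserve index 0 for unknown terms are all assumptions made for illustration.

```python
class ToyPreprocessor:
    """Hypothetical stand-in showing the fit-then-transform contract."""

    def __init__(self, max_length=5):
        self.max_length = max_length  # assumed to be a positive integer
        self.context = {}             # state learned by fit()

    def fit(self, texts):
        vocab = sorted({tok for text in texts for tok in text.lower().split()})
        # Assumption: index 0 is reserved for terms not seen during fit().
        self.context["term_index"] = {t: i + 1 for i, t in enumerate(vocab)}
        self.context["vocab_size"] = len(vocab) + 1
        return self  # returning self allows `p = p.fit(data)`

    def transform(self, texts):
        if "term_index" not in self.context:
            raise RuntimeError("call fit() before transform()")
        index = self.context["term_index"]
        result = []
        for text in texts:
            ids = [index.get(tok, 0) for tok in text.lower().split()]
            # Keep at most the last max_length ids ('pre'-style truncation).
            result.append(ids[-self.max_length:])
        return result


toy = ToyPreprocessor(max_length=5).fit(["the cat sat", "the dog ran"])
encoded = toy.transform(["the cat flew"])  # 'flew' was not seen during fit()
```

Fitting on the training split only and then transforming both splits, as the class example does, keeps the vocabulary independent of the test data.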