matchzoo.preprocessors.units.frequency_filter

Module Contents

class matchzoo.preprocessors.units.frequency_filter.FrequencyFilter(low:float=0, high:float=float('inf'), mode:str='df')

Bases: matchzoo.preprocessors.units.stateful_unit.StatefulUnit

Frequency filter unit. Keeps the tokens whose frequency statistic, selected by mode, lies in the range [low, high).

Parameters:
  • low – Lower bound, inclusive.
  • high – Upper bound, exclusive.
  • mode – One of tf (term frequency), df (document frequency), or idf (inverse document frequency).
Examples:
>>> import matchzoo as mz
To filter based on term frequency (tf):
>>> tf_filter = mz.preprocessors.units.FrequencyFilter(
...     low=2, mode='tf')
>>> tf_filter.fit([['A', 'B', 'B'], ['C', 'C', 'C']])
>>> tf_filter.transform(['A', 'B', 'C'])
['B', 'C']
To filter based on document frequency (df):
>>> df_filter = mz.preprocessors.units.FrequencyFilter(
...     low=2, mode='df')
>>> df_filter.fit([['A', 'B'], ['B', 'C']])
>>> df_filter.transform(['A', 'B', 'C'])
['B']
To filter based on inverse document frequency (idf):
>>> idf_filter = mz.preprocessors.units.FrequencyFilter(
...     low=1.2, mode='idf')
>>> idf_filter.fit([['A', 'B'], ['B', 'C', 'D']])
>>> idf_filter.transform(['A', 'B', 'C'])
['A', 'C']
fit(self, list_of_tokens:typing.List[typing.List[str]])

Fit on list_of_tokens by computing the statistic selected by mode for every token and recording the tokens whose value falls within [low, high).
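
The selection step can be sketched without MatchZoo installed. The helper below is hypothetical (select_valid_terms is not part of the library) and assumes only what the parameter list states: low is inclusive and high is exclusive.

```python
from collections import Counter
from typing import Mapping, Set


def select_valid_terms(stats: Mapping[str, float], low: float = 0,
                       high: float = float('inf')) -> Set[str]:
    """Hypothetical helper: keep tokens whose statistic v satisfies low <= v < high."""
    return {token for token, value in stats.items() if low <= value < high}


# Term-frequency counts for the corpus used in the tf example: A=1, B=2, C=3.
corpus = [['A', 'B', 'B'], ['C', 'C', 'C']]
stats = Counter(token for doc in corpus for token in doc)

valid = select_valid_terms(stats, low=2)            # B (2) and C (3) pass
capped = select_valid_terms(stats, low=2, high=3)   # C is dropped: 3 is not < 3
```

Because the upper bound is exclusive, a token whose statistic equals high is filtered out, as the second call illustrates.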

transform(self, input_:list)

Transform a list of tokens by filtering out unwanted words.
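
As a rough illustration, the filtering step amounts to a membership test against the terms recorded during fit. The function name filter_tokens is hypothetical, and the sketch assumes that input order and repeated tokens are preserved; the examples above show order being kept, but they do not cover repeats.

```python
from typing import Iterable, List, Set


def filter_tokens(tokens: Iterable[str], valid_terms: Set[str]) -> List[str]:
    """Hypothetical helper: return the tokens found in valid_terms, in input order."""
    return [token for token in tokens if token in valid_terms]


# 'A' is not a valid term, so it is removed; both occurrences of 'B' survive.
kept = filter_tokens(['A', 'B', 'C', 'B'], {'B', 'C'})
```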

classmethod _tf(cls, list_of_tokens:list)
classmethod _df(cls, list_of_tokens:list)
classmethod _idf(cls, list_of_tokens:list)
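
The three private helpers are undocumented here, so the following is only a sketch of what each statistic could look like. The function names are hypothetical. The idf formula is an assumption: it is the smoothed form log((1 + N) / (1 + df)) + 1, also used by scikit-learn's TfidfTransformer with smooth_idf=True, and it is consistent with the idf example above, but this page does not confirm that FrequencyFilter uses exactly this definition.

```python
import math
from collections import Counter
from typing import Dict, List


def term_frequency(docs: List[List[str]]) -> Counter:
    """Count every occurrence of each token across all documents."""
    return Counter(token for doc in docs for token in doc)


def document_frequency(docs: List[List[str]]) -> Counter:
    """Count the number of documents in which each token appears at least once."""
    return Counter(token for doc in docs for token in set(doc))


def inverse_document_frequency(docs: List[List[str]]) -> Dict[str, float]:
    """Smoothed idf (assumed formula): log((1 + N) / (1 + df)) + 1."""
    num_docs = len(docs)
    df = document_frequency(docs)
    return {token: math.log((1 + num_docs) / (1 + count)) + 1
            for token, count in df.items()}


docs = [['A', 'B'], ['B', 'C', 'D']]
tf = term_frequency(docs)
df = document_frequency(docs)
idf = inverse_document_frequency(docs)
```

Under this assumed formula, B appears in both documents and gets an idf of exactly 1, while A, C, and D each get about 1.41, which is why a filter with low=1.2 in idf mode would keep the rarer tokens and drop B.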