matchzoo.preprocessors.units

Package Contents

Classes

Unit

Process unit that does not preserve state (i.e. does not need fit).

DigitRemoval

Process unit to remove digits.

FrequencyFilter

Frequency filter unit.

Lemmatization

Process unit for token lemmatization.

Lowercase

Process unit for text lower case.

MatchingHistogram

MatchingHistogramUnit Class.

NgramLetter

Process unit for n-letter generation.

PuncRemoval

Process unit for removing punctuation.

StatefulUnit

Unit with inner state.

Stemming

Process unit for token stemming.

StopRemoval

Process unit to remove stop words.

Tokenize

Process unit for text tokenization.

Vocabulary

Vocabulary class.

WordHashing

Word-hashing layer for DSSM-based models.

CharacterIndex

CharacterIndexUnit for DIIN model.

WordExactMatch

WordExactUnit Class.

TruncatedLength

TruncatedLengthUnit Class.

Functions

list_available() → list

class matchzoo.preprocessors.units.Unit

Process unit that does not preserve state (i.e. does not need fit).

abstract transform(self, input_: typing.Any)

Abstract base method; needs to be implemented in subclasses.

class matchzoo.preprocessors.units.DigitRemoval

Bases: matchzoo.preprocessors.units.unit.Unit

Process unit to remove digits.

transform(self, input_: list) → list

Remove digits from list of tokens.

Parameters

input_ – list of tokens to be filtered.

Return tokens

list of tokens without digits.
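The transform above can be sketched without MatchZoo as follows; `remove_digits` is a hypothetical stand-in for `DigitRemoval.transform`, assuming a token counts as a digit when `str.isdigit()` holds for it:

```python
# Minimal sketch of DigitRemoval.transform (an assumption, not the
# library's actual code): drop tokens made up entirely of digits.
def remove_digits(tokens: list) -> list:
    """Return the tokens with purely numeric tokens filtered out."""
    return [t for t in tokens if not t.isdigit()]

print(remove_digits(['hello', '42', 'world', '3']))  # ['hello', 'world']
```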

class matchzoo.preprocessors.units.FrequencyFilter(low: float = 0, high: float = float('inf'), mode: str = 'df')

Bases: matchzoo.preprocessors.units.stateful_unit.StatefulUnit

Frequency filter unit.

Parameters
  • low – Lower bound, inclusive.

  • high – Upper bound, exclusive.

  • mode – One of tf (term frequency), df (document frequency), and idf (inverse document frequency).

Examples
>>> import matchzoo as mz
To filter based on term frequency (tf):
>>> tf_filter = mz.preprocessors.units.FrequencyFilter(
...     low=2, mode='tf')
>>> tf_filter.fit([['A', 'B', 'B'], ['C', 'C', 'C']])
>>> tf_filter.transform(['A', 'B', 'C'])
['B', 'C']
To filter based on document frequency (df):
>>> tf_filter = mz.preprocessors.units.FrequencyFilter(
...     low=2, mode='df')
>>> tf_filter.fit([['A', 'B'], ['B', 'C']])
>>> tf_filter.transform(['A', 'B', 'C'])
['B']
To filter based on inverse document frequency (idf):
>>> idf_filter = mz.preprocessors.units.FrequencyFilter(
...     low=1.2, mode='idf')
>>> idf_filter.fit([['A', 'B'], ['B', 'C', 'D']])
>>> idf_filter.transform(['A', 'B', 'C'])
['A', 'C']
fit(self, list_of_tokens: typing.List[typing.List[str]])

Fit list_of_tokens by calculating mode states.

transform(self, input_: list) → list

Transform a list of tokens by filtering out unwanted words.

classmethod _tf(cls, list_of_tokens: list) → dict
classmethod _df(cls, list_of_tokens: list) → dict
classmethod _idf(cls, list_of_tokens: list) → dict
class matchzoo.preprocessors.units.Lemmatization

Bases: matchzoo.preprocessors.units.unit.Unit

Process unit for token lemmatization.

transform(self, input_: list) → list

Lemmatize a sequence of tokens.

Parameters

input_ – list of tokens to be lemmatized.

Return tokens

list of lemmatized tokens.
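For illustration, a tiny hand-written lookup table can stand in for the NLP lemmatizer this unit relies on (the real unit delegates to a proper lemmatizer; `_LEMMAS` and `lemmatize` below are invented for this sketch):

```python
# Illustrative stand-in for Lemmatization.transform. The lookup table
# is a toy assumption; it only shows the shape of the transformation.
_LEMMAS = {'running': 'run', 'ran': 'run', 'geese': 'goose'}

def lemmatize(tokens: list) -> list:
    """Map each token to its base form, leaving unknown tokens as-is."""
    return [_LEMMAS.get(t, t) for t in tokens]

print(lemmatize(['ran', 'fast']))  # ['run', 'fast']
```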

class matchzoo.preprocessors.units.Lowercase

Bases: matchzoo.preprocessors.units.unit.Unit

Process unit for text lower case.

transform(self, input_: list) → list

Convert list of tokens to lower case.

Parameters

input_ – list of tokens.

Return tokens

lower-cased list of tokens.
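A minimal sketch of this transform, assuming it simply maps `str.lower` over the tokens (`to_lowercase` is a hypothetical stand-in for `Lowercase.transform`):

```python
# Sketch of Lowercase.transform: lower-case every token in the list.
def to_lowercase(tokens: list) -> list:
    return [t.lower() for t in tokens]

print(to_lowercase(['Hello', 'WORLD']))  # ['hello', 'world']
```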

class matchzoo.preprocessors.units.MatchingHistogram(bin_size: int = 30, embedding_matrix=None, normalize=True, mode: str = 'LCH')

Bases: matchzoo.preprocessors.units.unit.Unit

MatchingHistogramUnit Class.

Parameters
  • bin_size – The number of bins of the matching histogram.

  • embedding_matrix – The word embedding matrix applied to calculate the matching histogram.

  • normalize – Boolean, normalize the embedding or not.

  • mode – The type of the histogram; it should be one of ‘CH’, ‘NG’, or ‘LCH’.

Examples

>>> embedding_matrix = np.array([[1.0, -1.0], [1.0, 2.0], [1.0, 3.0]])
>>> text_left = [0, 1]
>>> text_right = [1, 2]
>>> histogram = MatchingHistogram(3, embedding_matrix, True, 'CH')
>>> histogram.transform([text_left, text_right])
[[3.0, 1.0, 1.0], [1.0, 2.0, 2.0]]
_normalize_embedding(self)

Normalize the embedding matrix.

transform(self, input_: list) → list

Transform the input text.

class matchzoo.preprocessors.units.NgramLetter(ngram: int = 3, reduce_dim: bool = True)

Bases: matchzoo.preprocessors.units.unit.Unit

Process unit for n-letter generation.

Triletters are used in the DSSM model. This unit is expected to run before the vocabulary is created.

Examples

>>> triletter = NgramLetter()
>>> rv = triletter.transform(['hello', 'word'])
>>> len(rv)
9
>>> rv
['#he', 'hel', 'ell', 'llo', 'lo#', '#wo', 'wor', 'ord', 'rd#']
>>> triletter = NgramLetter(reduce_dim=False)
>>> rv = triletter.transform(['hello', 'word'])
>>> len(rv)
2
>>> rv
[['#he', 'hel', 'ell', 'llo', 'lo#'], ['#wo', 'wor', 'ord', 'rd#']]
transform(self, input_: list) → list

Transform token into tri-letter.

For example, word should be represented as #wo, wor, ord and rd#.

Parameters

input_ – list of tokens to be transformed.

Return n_letters

generated n_letters.

class matchzoo.preprocessors.units.PuncRemoval

Bases: matchzoo.preprocessors.units.unit.Unit

Process unit for removing punctuation.

_MATCH_PUNC
transform(self, input_: list) → list

Remove punctuation from a list of tokens.

Parameters

input_ – list of tokens.

Return rv

tokens without punctuation.
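A sketch of the idea, assuming a token is removed when it consists entirely of punctuation characters (`remove_punctuation` is a hypothetical stand-in; the real unit matches against a punctuation pattern):

```python
import string

# Minimal sketch of PuncRemoval.transform: drop tokens that consist
# entirely of ASCII punctuation characters (an assumption for this demo).
def remove_punctuation(tokens: list) -> list:
    return [t for t in tokens
            if not all(ch in string.punctuation for ch in t)]

print(remove_punctuation(['hello', ',', 'world', '!']))  # ['hello', 'world']
```

Mixed tokens such as `"don't"` are kept, since only fully punctuational tokens match.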

class matchzoo.preprocessors.units.StatefulUnit

Bases: matchzoo.preprocessors.units.unit.Unit

Unit with inner state.

Usually needs to be fit before transforming. All information gathered in the fit phase will be stored in its context.

property state(self)

Get current context. Same as unit.context.

Deprecated since v2.2.0, and will be removed in the future. Use unit.context instead.

property context(self)

Get current context. Same as unit.state.

abstract fit(self, input_: typing.Any)

Abstract base method; needs to be implemented in subclasses.
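The fit/transform contract can be sketched without MatchZoo as follows; `StatefulSketch` and `TokenCounter` are hypothetical classes invented to show the pattern (fit gathers state into the context, transform reads it back):

```python
# Sketch of the StatefulUnit pattern, not the library's actual code.
class StatefulSketch:
    """Bare-bones stand-in for a stateful unit."""

    def __init__(self):
        self._context = {}

    @property
    def context(self):
        """State gathered during fit."""
        return self._context

    def fit(self, input_):
        raise NotImplementedError

    def transform(self, input_):
        raise NotImplementedError


class TokenCounter(StatefulSketch):
    """Hypothetical subclass: keep only tokens seen at least twice in fit."""

    def fit(self, list_of_tokens):
        counts = {}
        for tokens in list_of_tokens:
            for token in tokens:
                counts[token] = counts.get(token, 0) + 1
        self._context['counts'] = counts

    def transform(self, tokens):
        counts = self._context['counts']
        return [t for t in tokens if counts.get(t, 0) >= 2]


counter = TokenCounter()
counter.fit([['a', 'b'], ['b', 'c']])
print(counter.transform(['a', 'b', 'c']))  # ['b']
```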

class matchzoo.preprocessors.units.Stemming(stemmer='porter')

Bases: matchzoo.preprocessors.units.unit.Unit

Process unit for token stemming.

Parameters

stemmer – stemmer to use, porter or lancaster.

transform(self, input_: list) → list

Reduce inflected words to their stem, base, or root form.

Parameters

input_ – list of strings to be stemmed.
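As an illustrative stand-in for the Porter or Lancaster stemmer this unit delegates to, the naive suffix stripping below (`naive_stem`, invented for this sketch) only hints at what stemming does:

```python
# Toy stemmer, NOT the Porter/Lancaster algorithm: strip a few common
# suffixes when the remaining stem would still be long enough.
def naive_stem(tokens: list) -> list:
    out = []
    for t in tokens:
        for suffix in ('ing', 'ed', 's'):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[:-len(suffix)]
                break
        out.append(t)
    return out

print(naive_stem(['jumping', 'jumped', 'jumps']))  # ['jump', 'jump', 'jump']
```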

class matchzoo.preprocessors.units.StopRemoval(lang: str = 'english')

Bases: matchzoo.preprocessors.units.unit.Unit

Process unit to remove stop words.

Example

>>> unit = StopRemoval()
>>> unit.transform(['a', 'the', 'test'])
['test']
>>> type(unit.stopwords)
<class 'list'>
transform(self, input_: list) → list

Remove stopwords from list of tokenized tokens.

Parameters
  • input_ – list of tokenized tokens.

  • lang – language code for stopwords.

Return tokens

list of tokenized tokens without stopwords.

property stopwords(self) → list

Get stopwords based on language.

Params lang

language code.

Returns

list of stop words.

class matchzoo.preprocessors.units.Tokenize

Bases: matchzoo.preprocessors.units.unit.Unit

Process unit for text tokenization.

transform(self, input_: str) → list

Process input data from raw terms to list of tokens.

Parameters

input_ – raw textual input.

Return tokens

tokenized tokens as a list.
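A minimal sketch of what this transform produces; the real unit typically uses an NLP tokenizer, and the regex below (`tokenize`, an assumption for this demo) merely stands in for it:

```python
import re

# Sketch of Tokenize.transform: split raw text into word tokens and
# standalone punctuation marks.
def tokenize(text: str) -> list:
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```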

class matchzoo.preprocessors.units.Vocabulary(pad_value: str = '<PAD>', oov_value: str = '<OOV>')

Bases: matchzoo.preprocessors.units.stateful_unit.StatefulUnit

Vocabulary class.

Parameters
  • pad_value – The string value for the padding position.

  • oov_value – The string value for the out-of-vocabulary terms.

Examples

>>> vocab = Vocabulary(pad_value='[PAD]', oov_value='[OOV]')
>>> vocab.fit(['A', 'B', 'C', 'D', 'E'])
>>> term_index = vocab.state['term_index']
>>> term_index  
{'[PAD]': 0, '[OOV]': 1, 'D': 2, 'A': 3, 'B': 4, 'C': 5, 'E': 6}
>>> index_term = vocab.state['index_term']
>>> index_term  
{0: '[PAD]', 1: '[OOV]', 2: 'D', 3: 'A', 4: 'B', 5: 'C', 6: 'E'}
>>> term_index['out-of-vocabulary-term']
1
>>> index_term[0]
'[PAD]'
>>> index_term[42]
Traceback (most recent call last):
    ...
KeyError: 42
>>> a_index = term_index['A']
>>> c_index = term_index['C']
>>> vocab.transform(['C', 'A', 'C']) == [c_index, a_index, c_index]
True
>>> vocab.transform(['C', 'A', '[OOV]']) == [c_index, a_index, 1]
True
>>> indices = vocab.transform(list('ABCDDZZZ'))
>>> ' '.join(vocab.state['index_term'][i] for i in indices)
'A B C D D [OOV] [OOV] [OOV]'
class TermIndex

Bases: dict

Map term to index.

__missing__(self, key)

Map out-of-vocabulary terms to index 1.

fit(self, tokens: list)

Build a TermIndex and an IndexTerm.

transform(self, input_: list) → list

Transform a list of tokens to corresponding indices.

class matchzoo.preprocessors.units.WordHashing(term_index: dict)

Bases: matchzoo.preprocessors.units.unit.Unit

Word-hashing layer for DSSM-based models.

The input of WordHashingUnit should be a list of word sub-letter lists extracted from one document. The output is the word-hashing representation of this document.

NgramLetterUnit and VocabularyUnit are two essential prerequisites of WordHashingUnit.

Examples

>>> letters = [['#te', 'tes','est', 'st#'], ['oov']]
>>> word_hashing = WordHashing(
...     term_index={
...      '_PAD': 0, 'OOV': 1, 'st#': 2, '#te': 3, 'est': 4, 'tes': 5
...      })
>>> hashing = word_hashing.transform(letters)
>>> hashing[0]
[0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
>>> hashing[1]
[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
transform(self, input_: list) → list

Transform list of letters into word hashing layer.

Parameters

input_ – list of tri-letters generated by NgramLetterUnit.

Returns

Word hashing representation of tri-letters.

class matchzoo.preprocessors.units.CharacterIndex(char_index: dict)

Bases: matchzoo.preprocessors.units.unit.Unit

CharacterIndexUnit for DIIN model.

The input of CharacterIndexUnit should be a list of word character lists extracted from a text. The output is the character index representation of this text.

NgramLetterUnit and VocabularyUnit are two essential prerequisites of CharacterIndexUnit.

Examples

>>> input_ = [['#', 'a', '#'],['#', 'o', 'n', 'e', '#']]
>>> character_index = CharacterIndex(
...     char_index={
...      '<PAD>': 0, '<OOV>': 1, 'a': 2, 'n': 3, 'e':4, '#':5})
>>> index = character_index.transform(input_)
>>> index
[[5, 2, 5], [5, 1, 3, 4, 5]]
transform(self, input_: list) → list

Transform list of characters to corresponding indices.

Parameters

input_ – list of characters generated by NgramLetterUnit.

Returns

character index representation of a text.

class matchzoo.preprocessors.units.WordExactMatch(match: str, to_match: str)

Bases: matchzoo.preprocessors.units.unit.Unit

WordExactUnit Class.

Process unit to get a binary match list of two word index lists. The word index list is the word representation of a text.

Examples

>>> import pandas
>>> input_ = pandas.DataFrame({
...  'text_left':[[1, 2, 3],[4, 5, 7, 9]],
...  'text_right':[[5, 3, 2, 7],[2, 3, 5]]}
... )
>>> left_word_exact_match = WordExactMatch(
...     match='text_left', to_match='text_right'
... )
>>> left_out = input_.apply(left_word_exact_match.transform, axis=1)
>>> left_out[0]
[0, 1, 1]
>>> left_out[1]
[0, 1, 0, 0]
>>> right_word_exact_match = WordExactMatch(
...     match='text_right', to_match='text_left'
... )
>>> right_out = input_.apply(right_word_exact_match.transform, axis=1)
>>> right_out[0]
[0, 1, 1, 0]
>>> right_out[1]
[0, 0, 1]
transform(self, input_) → list

Transform two word index lists into a binary match list.

Parameters

input_ – a DataFrame including the ‘match’ column and the ‘to_match’ column.

Returns

a binary match result list of two word index lists.

class matchzoo.preprocessors.units.TruncatedLength(text_length: int, truncate_mode: str = 'pre')

Bases: matchzoo.preprocessors.units.unit.Unit

TruncatedLengthUnit Class.

Process unit to truncate the text that exceeds the set length.

Examples

>>> from matchzoo.preprocessors.units import TruncatedLength
>>> truncatedlen = TruncatedLength(3)
>>> truncatedlen.transform(list(range(1, 6))) == [3, 4, 5]
True
>>> truncatedlen.transform(list(range(2))) == [0, 1]
True
transform(self, input_: list) → list

Truncate the text that exceeds the specified maximum length.

Parameters

input_ – list of tokenized tokens.

Return tokens

list of tokens truncated to text_length if the original length exceeds text_length.
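The doctest above uses the default truncate_mode. Assuming ‘pre’ drops tokens from the beginning and ‘post’ drops them from the end, a sketch of both modes (`truncate` is a hypothetical stand-in for `TruncatedLength.transform`):

```python
# Sketch of pre/post truncation, an assumption about truncate_mode:
# 'pre' keeps the tail of the sequence, 'post' keeps the head.
def truncate(tokens: list, text_length: int, mode: str = 'pre') -> list:
    if len(tokens) <= text_length:
        return tokens
    return tokens[-text_length:] if mode == 'pre' else tokens[:text_length]

print(truncate([1, 2, 3, 4, 5], 3, 'pre'))   # [3, 4, 5]
print(truncate([1, 2, 3, 4, 5], 3, 'post'))  # [1, 2, 3]
```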

matchzoo.preprocessors.units.list_available() → list