matchzoo.preprocessors.units
Submodules
matchzoo.preprocessors.units.character_index
matchzoo.preprocessors.units.digit_removal
matchzoo.preprocessors.units.frequency_filter
matchzoo.preprocessors.units.lemmatization
matchzoo.preprocessors.units.lowercase
matchzoo.preprocessors.units.matching_histogram
matchzoo.preprocessors.units.ngram_letter
matchzoo.preprocessors.units.punc_removal
matchzoo.preprocessors.units.stateful_unit
matchzoo.preprocessors.units.stemming
matchzoo.preprocessors.units.stop_removal
matchzoo.preprocessors.units.tokenize
matchzoo.preprocessors.units.truncated_length
matchzoo.preprocessors.units.unit
matchzoo.preprocessors.units.vocabulary
matchzoo.preprocessors.units.word_exact_match
matchzoo.preprocessors.units.word_hashing
Package Contents
Classes
Unit | Process unit that does not persist state (i.e. does not need fit).
DigitRemoval | Process unit to remove digits.
FrequencyFilter | Frequency filter unit.
Lemmatization | Process unit for token lemmatization.
Lowercase | Process unit for lowercasing text.
MatchingHistogram | MatchingHistogramUnit Class.
NgramLetter | Process unit for n-letter generation.
PuncRemoval | Process unit for removing punctuation.
StatefulUnit | Unit with inner state.
Stemming | Process unit for token stemming.
StopRemoval | Process unit to remove stop words.
Tokenize | Process unit for text tokenization.
Vocabulary | Vocabulary class.
WordHashing | Word-hashing layer for DSSM-based models.
CharacterIndex | CharacterIndexUnit for DIIN model.
WordExactMatch | WordExactUnit Class.
TruncatedLength | TruncatedLengthUnit Class.
Functions
list_available() |
-
class
matchzoo.preprocessors.units.
Unit
¶ Process unit that does not persist state (i.e. does not need fit).
-
abstract
transform
(self, input_: typing.Any)¶ Abstract base method; must be implemented in a subclass.
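The stateless contract described above can be illustrated with a small standalone sketch. It does not import MatchZoo: the `Unit` class here is a stand-in written with `abc`, and `LowercaseSketch`, `DropDigitTokensSketch`, and `chain_transform` are hypothetical names introduced only for illustration. The digit rule (drop tokens made up entirely of digits) is an assumption.

```python
import abc
import typing


class Unit(abc.ABC):
    """Stand-in for the abstract base class described above (not the library class)."""

    @abc.abstractmethod
    def transform(self, input_: typing.Any):
        """Subclasses must override this method."""


class LowercaseSketch(Unit):
    """Hypothetical stateless unit: lower-case every token."""

    def transform(self, input_: list) -> list:
        return [token.lower() for token in input_]


class DropDigitTokensSketch(Unit):
    """Hypothetical stateless unit; assumes all-digit tokens are dropped."""

    def transform(self, input_: list) -> list:
        return [token for token in input_ if not token.isdigit()]


def chain_transform(units: list, tokens: list) -> list:
    """Apply each unit's transform in order; no fitting is required."""
    for unit in units:
        tokens = unit.transform(tokens)
    return tokens


result = chain_transform(
    [LowercaseSketch(), DropDigitTokensSketch()],
    ['Hello', '42', 'World'],
)
print(result)
```

Because no state is kept, the same unit instance can be reused on any input without a prior fit step.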
-
-
class
matchzoo.preprocessors.units.
DigitRemoval
¶ Bases:
matchzoo.preprocessors.units.unit.Unit
Process unit to remove digits.
-
transform
(self, input_: list) → list¶ Remove digits from list of tokens.
- Parameters
input_ – list of tokens to be filtered.
- Return tokens
list of tokens without digits.
-
-
class
matchzoo.preprocessors.units.
FrequencyFilter
(low: float = 0, high: float = float('inf'), mode: str = 'df')¶ Bases:
matchzoo.preprocessors.units.stateful_unit.StatefulUnit
Frequency filter unit.
- Parameters
low – Lower bound, inclusive.
high – Upper bound, exclusive.
mode – One of tf (term frequency), df (document frequency), and idf (inverse document frequency).
- Examples::
>>> import matchzoo as mz
- To filter based on term frequency (tf):
>>> tf_filter = mz.preprocessors.units.FrequencyFilter(
...     low=2, mode='tf')
>>> tf_filter.fit([['A', 'B', 'B'], ['C', 'C', 'C']])
>>> tf_filter.transform(['A', 'B', 'C'])
['B', 'C']
- To filter based on document frequency (df):
>>> df_filter = mz.preprocessors.units.FrequencyFilter(
...     low=2, mode='df')
>>> df_filter.fit([['A', 'B'], ['B', 'C']])
>>> df_filter.transform(['A', 'B', 'C'])
['B']
- To filter based on inverse document frequency (idf):
>>> idf_filter = mz.preprocessors.units.FrequencyFilter(
...     low=1.2, mode='idf')
>>> idf_filter.fit([['A', 'B'], ['B', 'C', 'D']])
>>> idf_filter.transform(['A', 'B', 'C'])
['A', 'C']
-
fit
(self, list_of_tokens: typing.List[typing.List[str]])¶ Fit list_of_tokens by calculating mode states.
-
transform
(self, input_: list) → list¶ Transform a list of tokens by filtering out unwanted words.
-
classmethod
_tf
(cls, list_of_tokens: list) → dict¶
-
classmethod
_df
(cls, list_of_tokens: list) → dict¶
-
classmethod
_idf
(cls, list_of_tokens: list) → dict¶
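The three statistics can be sketched in plain Python to show how the filter's bounds apply. This is a standalone illustration, not the library's code: the function names are hypothetical, and the smoothed idf formula `log((1 + N) / (1 + df)) + 1` is an assumption chosen because it reproduces the idf example above.

```python
import collections
import math


def term_frequency(list_of_tokens):
    """Count every occurrence of each term across all documents."""
    stats = collections.Counter()
    for tokens in list_of_tokens:
        stats.update(tokens)
    return stats


def document_frequency(list_of_tokens):
    """Count the number of documents each term appears in."""
    stats = collections.Counter()
    for tokens in list_of_tokens:
        stats.update(set(tokens))
    return stats


def inverse_document_frequency(list_of_tokens):
    """Assumed smoothed idf: log((1 + N) / (1 + df)) + 1."""
    num_docs = len(list_of_tokens)
    return {
        term: math.log((1 + num_docs) / (1 + df)) + 1
        for term, df in document_frequency(list_of_tokens).items()
    }


def keep_in_range(tokens, stats, low=0, high=float('inf')):
    """Keep tokens whose statistic lies in [low, high); unseen tokens are dropped."""
    return [t for t in tokens if t in stats and low <= stats[t] < high]


tf_kept = keep_in_range(
    ['A', 'B', 'C'], term_frequency([['A', 'B', 'B'], ['C', 'C', 'C']]), low=2)
df_kept = keep_in_range(
    ['A', 'B', 'C'], document_frequency([['A', 'B'], ['B', 'C']]), low=2)
idf_kept = keep_in_range(
    ['A', 'B', 'C'],
    inverse_document_frequency([['A', 'B'], ['B', 'C', 'D']]),
    low=1.2)
print(tf_kept, df_kept, idf_kept)
```

Note that the lower bound is inclusive and the upper bound exclusive, matching the parameter description.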
-
class
matchzoo.preprocessors.units.
Lemmatization
¶ Bases:
matchzoo.preprocessors.units.unit.Unit
Process unit for token lemmatization.
-
transform
(self, input_: list) → list¶ Lemmatize a sequence of tokens.
- Parameters
input_ – list of tokens to be lemmatized.
- Return tokens
list of lemmatized tokens.
-
-
class
matchzoo.preprocessors.units.
Lowercase
¶ Bases:
matchzoo.preprocessors.units.unit.Unit
Process unit for lowercasing text.
-
transform
(self, input_: list) → list¶ Convert list of tokens to lower case.
- Parameters
input_ – list of tokens.
- Return tokens
lower-cased list of tokens.
-
-
class
matchzoo.preprocessors.units.
MatchingHistogram
(bin_size: int = 30, embedding_matrix=None, normalize=True, mode: str = 'LCH')¶ Bases:
matchzoo.preprocessors.units.unit.Unit
MatchingHistogramUnit Class.
- Parameters
bin_size – The number of bins of the matching histogram.
embedding_matrix – The word embedding matrix applied to calculate the matching histogram.
normalize – Boolean, normalize the embedding or not.
mode – The type of the histogram; it should be one of ‘CH’, ‘NG’, or ‘LCH’.
Examples
>>> import numpy as np
>>> from matchzoo.preprocessors.units import MatchingHistogram
>>> embedding_matrix = np.array([[1.0, -1.0], [1.0, 2.0], [1.0, 3.0]])
>>> text_left = [0, 1]
>>> text_right = [1, 2]
>>> histogram = MatchingHistogram(3, embedding_matrix, True, 'CH')
>>> histogram.transform([text_left, text_right])
[[3.0, 1.0, 1.0], [1.0, 2.0, 2.0]]
-
_normalize_embedding
(self)¶ Normalize the embedding matrix.
-
transform
(self, input_: list) → list¶ Transform the input text.
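A count-based histogram of this kind can be sketched without NumPy. The following is an illustration under stated assumptions, not the library's implementation: similarities are cosine values between normalized embeddings, each bin starts at 1.0 (inferred from the example output above, where each row sums to the bin count plus the number of right-text terms), and the helper names are hypothetical. The clamp on the bin index is a safety guard added by this sketch.

```python
import math


def normalize_rows(matrix):
    """Scale every embedding vector to unit L2 norm."""
    normalized = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        normalized.append([x / norm for x in row])
    return normalized


def count_histogram(text_left, text_right, embedding_matrix, bin_size):
    """Count-based histogram with one row per left-text term."""
    embeddings = normalize_rows(embedding_matrix)
    histogram = []
    for left_id in text_left:
        # Assumption: every bin starts at 1.0.
        bins = [1.0] * bin_size
        for right_id in text_right:
            similarity = sum(
                a * b
                for a, b in zip(embeddings[left_id], embeddings[right_id])
            )
            # Map a similarity in [-1, 1] onto a bin index in [0, bin_size - 1].
            bin_index = int((similarity + 1.0) / 2.0 * (bin_size - 1))
            # Guard against floating-point drift just outside [-1, 1].
            bin_index = min(max(bin_index, 0), bin_size - 1)
            bins[bin_index] += 1.0
        histogram.append(bins)
    return histogram


# Axis-aligned vectors keep the cosine values exact (1, 0 and -1).
embedding_matrix = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]
histogram = count_histogram([0, 1], [0, 2], embedding_matrix, bin_size=3)
print(histogram)
```

With these inputs, the first left term matches one right term exactly and opposes the other, while the second left term is orthogonal to both, so its counts all land in the middle bin.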
-
class
matchzoo.preprocessors.units.
NgramLetter
(ngram: int = 3, reduce_dim: bool = True)¶ Bases:
matchzoo.preprocessors.units.unit.Unit
Process unit for n-letter generation.
Triletter is used in DSSMModel. This processor is expected to execute before Vocab has been created.
Examples
>>> from matchzoo.preprocessors.units import NgramLetter
>>> triletter = NgramLetter()
>>> rv = triletter.transform(['hello', 'word'])
>>> len(rv)
9
>>> rv
['#he', 'hel', 'ell', 'llo', 'lo#', '#wo', 'wor', 'ord', 'rd#']
>>> triletter = NgramLetter(reduce_dim=False)
>>> rv = triletter.transform(['hello', 'word'])
>>> len(rv)
2
>>> rv
[['#he', 'hel', 'ell', 'llo', 'lo#'], ['#wo', 'wor', 'ord', 'rd#']]
-
transform
(self, input_: list) → list¶ Transform tokens into tri-letters.
For example, word should be represented as #wo, wor, ord and rd#.
- Parameters
input_ – list of tokens to be transformed.
- Return n_letters
generated n_letters.
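The n-letter generation described above amounts to padding each token with `#` and sliding a fixed-width window over it. A minimal standalone sketch follows; `letter_ngrams` and `ngram_letters` are hypothetical names, and this is not the library's code.

```python
def letter_ngrams(token, ngram=3):
    """Pad the token with '#' on both sides and slide a window of width ngram."""
    padded = '#' + token + '#'
    return [padded[i:i + ngram] for i in range(len(padded) - ngram + 1)]


def ngram_letters(tokens, ngram=3, reduce_dim=True):
    """Generate n-letters per token; flatten into one list when reduce_dim is True."""
    per_token = [letter_ngrams(token, ngram) for token in tokens]
    if reduce_dim:
        return [gram for grams in per_token for gram in grams]
    return per_token


flat = ngram_letters(['hello', 'word'])
nested = ngram_letters(['hello', 'word'], reduce_dim=False)
print(flat)
print(nested)
```

The flattened form is convenient for building a tri-letter vocabulary, while the nested form keeps the word boundaries.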
-
-
class
matchzoo.preprocessors.units.
PuncRemoval
¶ Bases:
matchzoo.preprocessors.units.unit.Unit
Process unit for removing punctuation.
-
_MATCH_PUNC
¶
-
transform
(self, input_: list) → list¶ Remove punctuation from list of tokens.
- Parameters
input_ – list of tokens.
- Return rv
tokens without punctuation.
-
-
class
matchzoo.preprocessors.units.
StatefulUnit
¶ Bases:
matchzoo.preprocessors.units.unit.Unit
Unit with inner state.
Usually needs to be fit before transforming. All information gathered in the fit phase will be stored in its context.
-
property
state
(self)¶ Get current context. Same as unit.context.
Deprecated since v2.2.0, and will be removed in the future. Use unit.context instead.
-
property
context
(self)¶ Get current context. Same as unit.state.
-
abstract
fit
(self, input_: typing.Any)¶ Abstract base method; must be implemented in a subclass.
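The fit-then-transform pattern can be illustrated with a small standalone sketch. It does not use MatchZoo: `StatefulSketch` is a stand-in base class, and `MaxLengthSketch` and its `max_length` context key are hypothetical, invented only to show information gathered during fit being read back during transform.

```python
import abc
import typing


class StatefulSketch(abc.ABC):
    """Stand-in for a unit that stores what it learns in a context dict."""

    def __init__(self):
        self._context = {}

    @property
    def context(self) -> dict:
        """Everything gathered during fit lives here."""
        return self._context

    @abc.abstractmethod
    def fit(self, input_: typing.Any):
        """Subclasses gather state here."""

    @abc.abstractmethod
    def transform(self, input_: typing.Any):
        """Subclasses use the gathered state here."""


class MaxLengthSketch(StatefulSketch):
    """Hypothetical unit: remember the longest document seen during fit."""

    def fit(self, input_: list):
        self._context['max_length'] = max(len(tokens) for tokens in input_)

    def transform(self, input_: list) -> list:
        # Cut any new document down to the longest length seen in fit.
        return input_[:self._context['max_length']]


unit = MaxLengthSketch()
unit.fit([['a'], ['b', 'c']])
shortened = unit.transform(['x', 'y', 'z'])
print(unit.context, shortened)
```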
-
-
class
matchzoo.preprocessors.units.
Stemming
(stemmer='porter')¶ Bases:
matchzoo.preprocessors.units.unit.Unit
Process unit for token stemming.
- Parameters
stemmer – stemmer to use, porter or lancaster.
-
transform
(self, input_: list) → list¶ Reduce inflected words to their word stem, base or root form.
- Parameters
input_ – list of strings to be stemmed.
-
class
matchzoo.preprocessors.units.
StopRemoval
(lang: str = 'english')¶ Bases:
matchzoo.preprocessors.units.unit.Unit
Process unit to remove stop words.
Example
>>> from matchzoo.preprocessors.units import StopRemoval
>>> unit = StopRemoval()
>>> unit.transform(['a', 'the', 'test'])
['test']
>>> type(unit.stopwords)
<class 'list'>
-
transform
(self, input_: list) → list¶ Remove stopwords from list of tokenized tokens.
- Parameters
input_ – list of tokenized tokens.
lang – language code for stopwords.
- Return tokens
list of tokenized tokens without stopwords.
-
property
stopwords
(self) → list¶ Get stopwords based on language.
- Params lang
language code.
- Returns
list of stop words.
-
-
class
matchzoo.preprocessors.units.
Tokenize
¶ Bases:
matchzoo.preprocessors.units.unit.Unit
Process unit for text tokenization.
-
transform
(self, input_: str) → list¶ Process input data from raw terms to list of tokens.
- Parameters
input_ – raw textual input.
- Return tokens
tokenized tokens as a list.
-
-
class
matchzoo.preprocessors.units.
Vocabulary
(pad_value: str = '<PAD>', oov_value: str = '<OOV>')¶ Bases:
matchzoo.preprocessors.units.stateful_unit.StatefulUnit
Vocabulary class.
- Parameters
pad_value – The string value for the padding position.
oov_value – The string value for the out-of-vocabulary terms.
Examples
>>> from matchzoo.preprocessors.units import Vocabulary
>>> vocab = Vocabulary(pad_value='[PAD]', oov_value='[OOV]')
>>> vocab.fit(['A', 'B', 'C', 'D', 'E'])
>>> term_index = vocab.state['term_index']
>>> term_index
{'[PAD]': 0, '[OOV]': 1, 'D': 2, 'A': 3, 'B': 4, 'C': 5, 'E': 6}
>>> index_term = vocab.state['index_term']
>>> index_term
{0: '[PAD]', 1: '[OOV]', 2: 'D', 3: 'A', 4: 'B', 5: 'C', 6: 'E'}

>>> term_index['out-of-vocabulary-term']
1
>>> index_term[0]
'[PAD]'
>>> index_term[42]
Traceback (most recent call last):
    ...
KeyError: 42
>>> a_index = term_index['A']
>>> c_index = term_index['C']
>>> vocab.transform(['C', 'A', 'C']) == [c_index, a_index, c_index]
True
>>> vocab.transform(['C', 'A', '[OOV]']) == [c_index, a_index, 1]
True
>>> indices = vocab.transform(list('ABCDDZZZ'))
>>> ' '.join(vocab.state['index_term'][i] for i in indices)
'A B C D D [OOV] [OOV] [OOV]'
-
class
TermIndex
¶ Bases:
dict
Map term to index.
-
__missing__
(self, key)¶ Map out-of-vocabulary terms to index 1.
-
-
transform
(self, input_: list) → list¶ Transform a list of tokens to corresponding indices.
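The out-of-vocabulary behaviour rests on a `dict` subclass whose `__missing__` hook returns index 1. A minimal standalone sketch of that idea is shown below; `fit_vocabulary` is a hypothetical helper, and sorting the terms is a choice of this sketch to make the indices deterministic (the ordering in the example above differs).

```python
class TermIndex(dict):
    """Map term to index; unknown terms fall back to the OOV index."""

    def __missing__(self, key):
        # Called by dict lookup for absent keys; the key is not inserted.
        return 1


def fit_vocabulary(tokens, pad_value='<PAD>', oov_value='<OOV>'):
    """Build term-to-index and index-to-term maps, reserving 0 and 1."""
    term_index = TermIndex({pad_value: 0, oov_value: 1})
    index_term = {0: pad_value, 1: oov_value}
    # Assumption of this sketch: terms are indexed in sorted order.
    for index, term in enumerate(sorted(set(tokens)), start=2):
        term_index[term] = index
        index_term[index] = term
    return {'term_index': term_index, 'index_term': index_term}


context = fit_vocabulary(list('ABCDE'), pad_value='[PAD]', oov_value='[OOV]')
indices = [context['term_index'][term] for term in 'ABCDDZZZ']
decoded = ' '.join(context['index_term'][i] for i in indices)
print(indices)
print(decoded)
```

Since `__missing__` only returns a value, looking up an unknown term does not grow the vocabulary.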
-
class
matchzoo.preprocessors.units.
WordHashing
(term_index: dict)¶ Bases:
matchzoo.preprocessors.units.unit.Unit
Word-hashing layer for DSSM-based models.
The input of WordHashingUnit should be a list of word sub-letter lists extracted from one document. The output is the word-hashing representation of this document. NgramLetterUnit and VocabularyUnit are two essential prerequisites of WordHashingUnit.
Examples
>>> from matchzoo.preprocessors.units import WordHashing
>>> letters = [['#te', 'tes','est', 'st#'], ['oov']]
>>> word_hashing = WordHashing(
...     term_index={
...         '_PAD': 0, 'OOV': 1, 'st#': 2, '#te': 3, 'est': 4, 'tes': 5
...     })
>>> hashing = word_hashing.transform(letters)
>>> hashing[0]
[0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
>>> hashing[1]
[0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
-
transform
(self, input_: list) → list¶ Transform list of letters into word hashing layer.
- Parameters
input_ – list of tri_letters generated by NgramLetterUnit.
- Returns
Word hashing representation of tri-letters.
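Word hashing can be read as a per-word count vector over the tri-letter vocabulary. The sketch below is a standalone illustration rather than the library's implementation; the function name and the `oov_index` parameter are hypothetical, and accumulating all unknown tri-letters in the slot at index 1 is an assumption of this sketch.

```python
import collections


def word_hashing(letters_per_word, term_index, oov_index=1):
    """Turn each word's tri-letters into a count vector over term_index."""
    hashed = []
    for letters in letters_per_word:
        row = [0.0] * len(term_index)
        for letter, count in collections.Counter(letters).items():
            # Unknown tri-letters are counted in the OOV slot (assumption).
            row[term_index.get(letter, oov_index)] += float(count)
        hashed.append(row)
    return hashed


term_index = {'_PAD': 0, 'OOV': 1, 'st#': 2, '#te': 3, 'est': 4, 'tes': 5}
hashing = word_hashing([['#te', 'tes', 'est', 'st#'], ['oov']], term_index)
print(hashing)
```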
-
-
class
matchzoo.preprocessors.units.
CharacterIndex
(char_index: dict)¶ Bases:
matchzoo.preprocessors.units.unit.Unit
CharacterIndexUnit for DIIN model.
The input of CharacterIndexUnit should be a list of word character lists extracted from a text. The output is the character index representation of this text. NgramLetterUnit and VocabularyUnit are two essential prerequisites of CharacterIndexUnit.
Examples
>>> from matchzoo.preprocessors.units import CharacterIndex
>>> input_ = [['#', 'a', '#'],['#', 'o', 'n', 'e', '#']]
>>> character_index = CharacterIndex(
...     char_index={
...         '<PAD>': 0, '<OOV>': 1, 'a': 2, 'n': 3, 'e':4, '#':5})
>>> index = character_index.transform(input_)
>>> index
[[5, 2, 5], [5, 1, 3, 4, 5]]
-
transform
(self, input_: list) → list¶ Transform list of characters to corresponding indices.
- Parameters
input_ – list of characters generated by NgramLetterUnit.
- Returns
character index representation of a text.
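The character lookup can be sketched as a nested dictionary lookup with a fallback. This is a standalone illustration; the function name and the `oov_index` parameter are hypothetical, and mapping unknown characters to index 1 is inferred from the example above, where 'o' is absent from the index.

```python
def character_index(words, char_index, oov_index=1):
    """Replace each character with its index; unknown characters map to oov_index."""
    return [[char_index.get(char, oov_index) for char in word] for word in words]


char_index = {'<PAD>': 0, '<OOV>': 1, 'a': 2, 'n': 3, 'e': 4, '#': 5}
index = character_index(
    [['#', 'a', '#'], ['#', 'o', 'n', 'e', '#']], char_index)
print(index)
```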
-
-
class
matchzoo.preprocessors.units.
WordExactMatch
(match: str, to_match: str)¶ Bases:
matchzoo.preprocessors.units.unit.Unit
WordExactUnit Class.
Process unit to get a binary match list of two word index lists. The word index list is the word representation of a text.
Examples
>>> import pandas
>>> input_ = pandas.DataFrame({
...     'text_left':[[1, 2, 3],[4, 5, 7, 9]],
...     'text_right':[[5, 3, 2, 7],[2, 3, 5]]}
... )
>>> left_word_exact_match = WordExactMatch(
...     match='text_left', to_match='text_right'
... )
>>> left_out = input_.apply(left_word_exact_match.transform, axis=1)
>>> left_out[0]
[0, 1, 1]
>>> left_out[1]
[0, 1, 0, 0]
>>> right_word_exact_match = WordExactMatch(
...     match='text_right', to_match='text_left'
... )
>>> right_out = input_.apply(right_word_exact_match.transform, axis=1)
>>> right_out[0]
[0, 1, 1, 0]
>>> right_out[1]
[0, 0, 1]
-
transform
(self, input_) → list¶ Transform two word index lists into a binary match list.
- Parameters
input_ – a dataframe that includes the ‘match’ column and the ‘to_match’ column.
- Returns
a binary match result list of two word index lists.
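The binary match list marks, for each word index in one text, whether it also occurs in the other. A minimal sketch follows, using plain dictionaries in place of DataFrame rows; `exact_match` is a hypothetical helper, not the library's method.

```python
def exact_match(row, match, to_match):
    """Return 1 for each index in row[match] that also appears in row[to_match]."""
    targets = set(row[to_match])
    return [1 if term in targets else 0 for term in row[match]]


rows = [
    {'text_left': [1, 2, 3], 'text_right': [5, 3, 2, 7]},
    {'text_left': [4, 5, 7, 9], 'text_right': [2, 3, 5]},
]
left_out = [exact_match(row, 'text_left', 'text_right') for row in rows]
right_out = [exact_match(row, 'text_right', 'text_left') for row in rows]
print(left_out)
print(right_out)
```

The output always has the same length as the `match` column, so it can be fed to a model alongside that text.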
-
-
class
matchzoo.preprocessors.units.
TruncatedLength
(text_length: int, truncate_mode: str = 'pre')¶ Bases:
matchzoo.preprocessors.units.unit.Unit
TruncatedLengthUnit Class.
Process unit to truncate the text that exceeds the set length.
Examples
>>> from matchzoo.preprocessors.units import TruncatedLength
>>> truncatedlen = TruncatedLength(3)
>>> truncatedlen.transform(list(range(1, 6))) == [3, 4, 5]
True
>>> truncatedlen.transform(list(range(2))) == [0, 1]
True
-
transform
(self, input_: list) → list¶ Truncate the text that exceeds the specified maximum length.
- Parameters
input_ – list of tokenized tokens.
- Return tokens
list of tokens truncated to text_length if the original length is larger than text_length.
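Truncation can be sketched with list slicing. In this standalone illustration, 'pre' drops tokens from the front, as the example above shows. The 'post' branch (dropping from the end) is an assumption, since the source only demonstrates the default mode. The function name is hypothetical, and the sketch assumes `text_length` is positive.

```python
def truncate(tokens, text_length, truncate_mode='pre'):
    """Cut tokens down to text_length; shorter inputs are returned unchanged."""
    if len(tokens) <= text_length:
        return tokens
    if truncate_mode == 'pre':
        # Drop tokens from the front, keeping the last text_length tokens.
        return tokens[-text_length:]
    if truncate_mode == 'post':
        # Assumed behaviour: drop tokens from the end instead.
        return tokens[:text_length]
    raise ValueError(f'unknown truncate_mode: {truncate_mode!r}')


pre_result = truncate(list(range(1, 6)), 3)
short_result = truncate(list(range(2)), 3)
post_result = truncate(list(range(1, 6)), 3, truncate_mode='post')
print(pre_result, short_result, post_result)
```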
-
-
matchzoo.preprocessors.units.
list_available
() → list¶