Text Count Vectorizer¶
Convert a collection of text documents to a matrix of token counts
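The parameters below mirror scikit-learn's CountVectorizer, so, assuming that backend (an assumption based on the parameter names; the corpus here is illustrative), a minimal sketch of what "a matrix of token counts" means could look like this:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "the cat sat on the mat",
        "the dog sat on the log",
    ]

    # Learn the vocabulary and transform the documents into a sparse count matrix:
    # one row per document, one column per token, each cell holding a count.
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names_out())  # requires scikit-learn >= 1.0
    print(counts.toarray())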
Documentation¶
Attributes¶
- stop_words_
Terms that were ignored because they either:
  - occurred in too many documents (max_df)
  - occurred in too few documents (min_df)
  - were cut off by feature selection (max_features).
This is only available if no vocabulary was given.
- vocabulary_
A mapping of terms to feature indices.
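Assuming the fitted model exposes these attributes the same way scikit-learn's CountVectorizer does (an assumption based on the attribute names above), they can be inspected roughly like this:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["apple banana", "apple cherry", "apple durian"]

    # min_df=2 prunes terms that appear in fewer than two documents;
    # pruned terms are collected in stop_words_ instead of vocabulary_.
    vectorizer = CountVectorizer(min_df=2)
    vectorizer.fit(docs)

    print(vectorizer.vocabulary_)   # {'apple': 0} -> mapping of term to feature index
    print(vectorizer.stop_words_)   # {'banana', 'cherry', 'durian'} -> ignored terms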
Definition¶
Output ports¶
- model : model
Model
Configuration¶
- Analyzer (analyzer)
Whether the features should be made of word n-grams or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.
Changed in version 0.21: Since v0.21, if input is ‘filename’ or ‘file’, the data is first read from the file and then passed to the given callable analyzer.
- Binary (binary)
If True, all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.
- Decoding error behavior (decode_error)
Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given encoding. By default, it is ‘strict’, meaning that a UnicodeDecodeError will be raised. Other values are ‘ignore’ and ‘replace’.
- Encoding (encoding)
If bytes or files are given to analyze, this encoding is used to decode.
- Lowercase (lowercase)
Convert all characters to lowercase before tokenizing.
- Maximum document frequency (max_df)
When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.
- Maximum features (max_features)
If not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus. Otherwise, all features are used.
This parameter is ignored if vocabulary is not None.
- Minimum document frequency (min_df)
When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.
- N-gram range (ngram_range)
The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable (see the sketch after this list for how this and the other options fit together).
- Stop words (stop_words)
If ‘english’, a built-in stop word list for English is used. There are several known issues with ‘english’ and you should consider an alternative (see stop_words).
If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.
If None, no stop words will be used. In this case, setting max_df to a higher value, such as in the range (0.7, 1.0), can automatically detect and filter stop words based on intra corpus document frequency of terms.
- Strip accents (strip_accents)
Remove accents and perform other character normalization during the preprocessing step. ‘ascii’ is a fast method that only works on characters that have a direct ASCII mapping. ‘unicode’ is a slightly slower method that works on any characters. None (default) does nothing.
Both ‘ascii’ and ‘unicode’ use NFKD normalization from unicodedata.normalize().
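Taken together, and assuming these options are forwarded unchanged to scikit-learn's CountVectorizer (the values and corpus below are only illustrative), a configuration could translate to something like:

    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(
        analyzer="word",        # word n-grams ('char'/'char_wb' for character n-grams)
        ngram_range=(1, 2),     # unigrams and bigrams
        lowercase=True,         # lowercase before tokenizing
        stop_words="english",   # built-in English stop word list
        max_df=0.9,             # drop terms in more than 90% of documents
        min_df=2,               # drop terms in fewer than 2 documents
        max_features=1000,      # keep at most the 1000 most frequent terms
        binary=False,           # True would clip all non-zero counts to 1
        strip_accents=None,     # 'ascii' or 'unicode' to normalize accents
    )

    docs = [
        "the cat sat on the mat",
        "the cat ate the fish",
        "a dog chased the cat",
        "a dog sat on the log",
    ]
    counts = vectorizer.fit_transform(docs)  # sparse document-term count matrix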
Implementation¶
- class node_text.CountVectorizer