.. _`Text Count Vectorizer`:

.. _`org.sysess.sympathy.machinelearning.count_vectorizer`:

Text Count Vectorizer
~~~~~~~~~~~~~~~~~~~~~

.. image:: count_vectorizer.svg
   :width: 48

Convert a collection of text documents to a matrix of token counts.

:Configuration:
    - *encoding*
        If bytes or files are given to analyze, this encoding is used to
        decode.
    - *decode_error*
        Instruction on what to do if a byte sequence is given to analyze
        that contains characters not of the given `encoding`. By default,
        it is 'strict', meaning that a UnicodeDecodeError will be raised.
        Other values are 'ignore' and 'replace'.
    - *strip_accents*
        Remove accents during the preprocessing step. 'ascii' is a fast
        method that only works on characters that have a direct ASCII
        mapping. 'unicode' is a slightly slower method that works on any
        character. None (default) does nothing.
    - *analyzer*
        Whether the feature should be made of word or character n-grams.
        Option 'char_wb' creates character n-grams only from text inside
        word boundaries; n-grams at the edges of words are padded with
        space. If a callable is passed, it is used to extract the sequence
        of features out of the raw, unprocessed input.
    - *ngram_range*
        The lower and upper boundary of the range of n-values for
        different n-grams to be extracted. All values of n such that
        min_n <= n <= max_n will be used.
    - *stop_words*
        If 'english', a built-in stop word list for English is used. If a
        list, that list is assumed to contain stop words, all of which
        will be removed from the resulting tokens. Only applies if
        ``analyzer == 'word'``. If None, no stop words will be used.
        max_df can be set to a value in the range [0.7, 1.0) to
        automatically detect and filter stop words based on intra-corpus
        document frequency of terms.
    - *lowercase*
        Convert all characters to lowercase before tokenizing.
    - *max_df*
        When building the vocabulary, ignore terms that have a document
        frequency strictly higher than the given threshold
        (corpus-specific stop words). If float, the parameter represents
        a proportion of documents; if integer, absolute counts. This
        parameter is ignored if vocabulary is not None.
    - *min_df*
        When building the vocabulary, ignore terms that have a document
        frequency strictly lower than the given threshold. This value is
        also called cut-off in the literature. If float, the parameter
        represents a proportion of documents; if integer, absolute
        counts. This parameter is ignored if vocabulary is not None.
    - *max_features*
        If not None, build a vocabulary that only considers the top
        max_features terms ordered by term frequency across the corpus.
        This parameter is ignored if vocabulary is not None.
    - *binary*
        If True, all non-zero counts are set to 1. This is useful for
        discrete probabilistic models that model binary events rather
        than integer counts.

:Attributes:
    - *vocabulary_*
        A mapping of terms to feature indices.
    - *stop_words_*
        Terms that were ignored because they either:

        - occurred in too many documents (`max_df`)
        - occurred in too few documents (`min_df`)
        - were cut off by feature selection (`max_features`).

        This is only available if no vocabulary was given.

:Inputs:

:Outputs:
    **model** : model
        Model
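The options above mirror the parameters of scikit-learn's
``sklearn.feature_extraction.text.CountVectorizer``, which this node appears
to wrap. As a minimal sketch of the equivalent call in plain scikit-learn
(assuming scikit-learn is installed; the node itself is configured through
its parameter dialog, not through code):

.. code-block:: python

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "the cat sat on the mat",
        "the dog sat on the log",
    ]

    # Configuration corresponding to the options listed above.
    vectorizer = CountVectorizer(
        encoding="utf-8",
        decode_error="strict",
        strip_accents=None,
        analyzer="word",
        ngram_range=(1, 2),   # unigrams and bigrams
        stop_words="english",
        lowercase=True,
        max_df=1.0,           # float: proportion of documents
        min_df=1,             # integer: absolute document count
        max_features=None,
        binary=False,
    )

    # Rows are documents, columns are tokens; entries are token counts.
    counts = vectorizer.fit_transform(docs)
    print(counts.toarray())
    print(vectorizer.get_feature_names_out())  # scikit-learn >= 1.0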
.. automodule:: node_text

.. class:: CountVectorizer
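A short sketch of the fitted attributes documented above, again using
scikit-learn's ``CountVectorizer`` directly (the corpus and threshold are
illustrative):

.. code-block:: python

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat", "the dog sat", "one bird flew"]

    # max_df=0.5: ignore terms occurring in more than half of the documents,
    # e.g. "the" and "sat" here, which each appear in two of three documents.
    vectorizer = CountVectorizer(max_df=0.5).fit(docs)

    print(vectorizer.vocabulary_)   # term -> feature (column) index
    print(vectorizer.stop_words_)   # terms dropped by max_df/min_df/max_features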