Text Count Vectorizer

Some of the docstrings for this module have been automatically extracted from the scikit-learn library and are covered by their respective licenses.

class node_text.CountVectorizer

Convert a collection of text documents to a matrix of token counts.
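
Because these docstrings are extracted from scikit-learn, the node's behaviour follows sklearn.feature_extraction.text.CountVectorizer. A minimal sketch using the scikit-learn class directly (an illustrative assumption; the node itself is configured through the options listed below):

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "the cat sat on the mat",
        "the dog sat on the log",
    ]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)  # sparse document-term count matrix

    print(vectorizer.get_feature_names_out())
    # ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
    print(X.toarray())
    # [[1 0 0 1 1 1 2]
    #  [0 1 1 0 1 1 2]]

Each row is one document and each column counts one vocabulary term ("the" occurs twice in both documents, hence the trailing 2s).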

Configuration:
  • encoding

    If bytes or files are given to analyze, this encoding is used to decode.

  • decode_error

    Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given encoding. By default, it is ‘strict’, meaning that a UnicodeDecodeError will be raised. Other values are ‘ignore’ and ‘replace’.

  • strip_accents

    Remove accents during the preprocessing step. ‘ascii’ is a fast method that only works on characters that have a direct ASCII mapping. ‘unicode’ is a slightly slower method that works on any characters. None (default) does nothing.

  • analyzer

    Whether features should be made of word or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

    If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.

  • ngram_range

    The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.

  • stop_words

    If ‘english’, a built-in stop word list for English is used.

    If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.

    If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra-corpus document frequency of terms.

  • lowercase

    Convert all characters to lowercase before tokenizing.

  • max_df

    When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents; if integer, an absolute document count. This parameter is ignored if vocabulary is not None.

  • min_df

    When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called the cut-off in the literature. If float, the parameter represents a proportion of documents; if integer, an absolute document count. This parameter is ignored if vocabulary is not None.

  • max_features

    If not None, build a vocabulary that considers only the top max_features terms ordered by term frequency across the corpus.

    This parameter is ignored if vocabulary is not None.

  • binary

    If True, all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts. Several of these options are combined in the sketch after this list.
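
A sketch combining several of the options above, again using the underlying scikit-learn class directly; the corpus and parameter values are illustrative assumptions:

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "the quick brown fox",
        "the lazy brown dog",
        "the quick red fox",
    ]

    vectorizer = CountVectorizer(
        analyzer="word",       # word n-grams (the default)
        ngram_range=(1, 2),    # extract unigrams and bigrams
        stop_words="english",  # built-in English list removes "the"
        min_df=1,              # keep terms occurring in at least one document
        max_df=1.0,            # no upper document-frequency cutoff
        binary=False,          # integer counts; True would clip them to 0/1
    )
    X = vectorizer.fit_transform(corpus)
    print(vectorizer.get_feature_names_out())
    # Unigrams plus bigrams built from the remaining (non-stop) tokens,
    # e.g. 'brown', 'brown fox', 'quick brown', ...

    # 'char_wb' builds character n-grams inside word boundaries,
    # padding word edges with a space:
    char_vec = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))
    char_vec.fit(["fox"])
    print(char_vec.get_feature_names_out())
    # [' fo' 'fox' 'ox ']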

Attributes:
  • vocabulary_

    A mapping of terms to feature indices.

  • stop_words_

    Terms that were ignored because they either:

    • occurred in too many documents (max_df)
    • occurred in too few documents (min_df)
    • were cut off by feature selection (max_features).

    This is only available if no vocabulary was given; both attributes are shown in the sketch below.
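
A brief sketch of both attributes after fitting, with min_df used to push low-frequency terms into stop_words_ (the corpus is an illustrative assumption):

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "apple banana apple",
        "banana cherry",
        "apple durian",
    ]

    # min_df=2 drops terms that occur in fewer than two documents.
    vectorizer = CountVectorizer(min_df=2)
    vectorizer.fit(corpus)

    print(vectorizer.vocabulary_)   # {'apple': 0, 'banana': 1}
    print(vectorizer.stop_words_)   # {'cherry', 'durian'} (a set; order varies)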

Inputs:

Outputs:
  • model : model

    The fitted model.