Corpus
-
class orangecontrib.text.corpus.Corpus(X=None, Y=None, metas=None, domain=None, text_features=None)
Internal class for storing a corpus.
-
__init__(X=None, Y=None, metas=None, domain=None, text_features=None)
Parameters:
- X (numpy.ndarray) – attributes
- Y (numpy.ndarray) – class variables
- metas (numpy.ndarray) – meta attributes; e.g. text
- domain (Orange.data.Domain) – the domain for this Corpus
- text_features (list) – meta attributes that are used for text mining. Infer them if None.
-
dictionary
corpora.Dictionary: A token-to-id mapper.
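To illustrate what a token-to-id mapper does, here is a minimal plain-Python sketch. It does not use the corpora.Dictionary class itself: the helper name build_token_ids is hypothetical, and assigning ids in order of first appearance is an assumption made for the example, not documented behavior.

```python
def build_token_ids(tokenized_docs):
    """Assign each distinct token an integer id in order of first appearance.

    Hypothetical helper for illustration only; not part of the Corpus API.
    """
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token_to_id:
                # The next free id is simply the current size of the mapping.
                token_to_id[token] = len(token_to_id)
    return token_to_id


example_docs = [["human", "machine", "interface"], ["machine", "learning"]]
token_ids = build_token_ids(example_docs)
print(token_ids)
```

A repeated token such as "machine" keeps the id it received the first time it was seen, so every token maps to exactly one id.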
-
documents
Returns: a list of strings representing documents, created by joining selected text features.
-
documents_from_features(feats)
Parameters: feats (list) – A list of features to join.
Returns: a list of strings constructed by joining feats.
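The idea of building one string per document by joining selected features can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: join_features, its arguments, and the single-space separator are all hypothetical choices made for the example.

```python
def join_features(rows, feature_names, selected, sep=" "):
    """Join the selected columns of each row into one document string.

    Hypothetical helper; the separator is an assumption for this sketch.
    """
    # Look up the column position of every selected feature once.
    indices = [feature_names.index(name) for name in selected]
    return [sep.join(str(row[i]) for i in indices) for row in rows]


feature_names = ["title", "author", "body"]
rows = [
    ["Moby Dick", "Melville", "Call me Ishmael."],
    ["Emma", "Austen", "Handsome, clever, and rich."],
]
joined = join_features(rows, feature_names, ["title", "body"])
print(joined)
```

Only the columns named in the selection contribute to the result, which mirrors how the selected text features determine the content of each document.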
-
extend_attributes(X, feature_names, var_attrs=None)
Append features to corpus.
Parameters:
- X (numpy.ndarray or scipy.sparse.csr_matrix) – Features to append
- feature_names (list) – List of strings containing feature names
- var_attrs (dict) – Additional attributes appended to variable.attributes.
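Appending features amounts to adding new columns to the attribute matrix together with one name per new column. The sketch below shows that idea with plain nested lists instead of numpy or scipy arrays; append_features and its consistency checks are hypothetical and are not taken from the library.

```python
def append_features(X, names, new_X, new_names):
    """Return a copy of X with the columns of new_X appended, plus all names.

    Hypothetical helper using nested lists; the real method works on arrays.
    """
    if len(X) != len(new_X):
        raise ValueError("X and new_X must have the same number of rows")
    if any(len(row) != len(new_names) for row in new_X):
        raise ValueError("each new row needs one value per new feature name")
    # Concatenate each existing row with the matching row of new columns.
    extended = [list(old) + list(new) for old, new in zip(X, new_X)]
    return extended, list(names) + list(new_names)


X = [[1.0, 0.0], [0.0, 2.0]]
names = ["length", "digits"]
new_X = [[3.0], [4.0]]
extended_X, extended_names = append_features(X, names, new_X, ["bow_cat"])
print(extended_X, extended_names)
```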
-
extend_corpus(metadata, Y)
Append documents to corpus.
Parameters:
- metadata (numpy.ndarray) – Meta data
- Y (numpy.ndarray) – Class variables
-
ngrams
generator: Ngram representations of documents.
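As a rough illustration of what an n-gram representation looks like, the following sketch yields n-grams from already tokenized documents. The function names, the tuple output, and the default of bigrams are assumptions for the example; the property itself may configure and format n-grams differently.

```python
def token_ngrams(tokens, n):
    """Yield successive n-grams (as tuples) from a list of tokens."""
    for start in range(len(tokens) - n + 1):
        yield tuple(tokens[start:start + n])


def corpus_ngrams(tokenized_docs, n=2):
    """Yield, for each document, the list of its n-grams.

    Hypothetical generator for illustration; not the Corpus.ngrams code.
    """
    for doc in tokenized_docs:
        yield list(token_ngrams(doc, n))


tokenized = [["the", "quick", "brown", "fox"], ["hello"]]
bigrams = list(corpus_ngrams(tokenized, n=2))
print(bigrams)
```

A document shorter than n produces an empty list, because there is no window of n consecutive tokens to take.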
-
set_text_features(feats)
Select which meta-attributes to include when mining text.
Parameters: feats (list) – list of text features to include.
-
store_tokens(tokens, dictionary=None)
Parameters: tokens (list) – List of lists containing tokens.
-
tokens
np.ndarray: A list of lists containing tokens. If tokens are not yet present, the default preprocessor is run and the resulting tokens are saved.
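The compute-on-first-access behavior described for tokens can be sketched with a cached property. The class below is a hypothetical stand-in, not the Corpus class: LazyTokens, its lowercase-and-split "default preprocessor", and the preprocess_calls counter are all invented for illustration.

```python
class LazyTokens:
    """Hypothetical holder that tokenizes documents only when first asked."""

    def __init__(self, documents):
        self.documents = documents
        self._tokens = None
        self.preprocess_calls = 0  # counts runs of the default preprocessor

    def _default_preprocess(self):
        # Stand-in preprocessor: lowercase and split on whitespace.
        self.preprocess_calls += 1
        return [doc.lower().split() for doc in self.documents]

    def store_tokens(self, tokens):
        """Save tokens so later accesses skip preprocessing."""
        self._tokens = tokens

    @property
    def tokens(self):
        if self._tokens is None:
            self.store_tokens(self._default_preprocess())
        return self._tokens


holder = LazyTokens(["Hello World", "Text mining"])
first = holder.tokens
second = holder.tokens
print(first, holder.preprocess_calls)
```

Calling store_tokens explicitly before the first access would bypass the default preprocessor entirely, which is the usual reason to expose both a setter-style method and a lazy property.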
-