Corpus

class orangecontrib.text.corpus.Corpus(X=None, Y=None, metas=None, domain=None, text_features=None)[source]

Internal class for storing a corpus.

__init__(X=None, Y=None, metas=None, domain=None, text_features=None)[source]
Parameters:
  • X (numpy.ndarray) – attributes
  • Y (numpy.ndarray) – class variables
  • metas (numpy.ndarray) – meta attributes; e.g. text
  • domain (Orange.data.Domain) – the domain for this Corpus
  • text_features (list) – meta attributes that are used for text mining. Infer them if None.
static __new__(*args, **kwargs)[source]

Bypass Table.__new__.

copy()[source]

Return a copy of the table.

dictionary

corpora.Dictionary: A token to id mapper.
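
As a rough illustration of what a token-to-id mapper does, the sketch below builds one in plain Python. It is not the gensim or Orange implementation: the function name `build_token_ids` is hypothetical, and assigning ids in order of first appearance is an assumption made for the example.

```python
# Hypothetical stand-in for a token-to-id mapper; not the library's own code.
def build_token_ids(tokenized_docs):
    """Assign each distinct token an integer id in order of first appearance."""
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token_to_id:
                # The next free id equals the number of tokens seen so far.
                token_to_id[token] = len(token_to_id)
    return token_to_id


example_tokens = [["human", "computer", "interface"], ["computer", "survey"]]
example_ids = build_token_ids(example_tokens)
```

A repeated token such as "computer" keeps the id it received on first sight, so every document can be re-expressed as a sequence of stable integer ids.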

documents
Returns: a list of strings, one per document, created by joining
the selected text features.
documents_from_features(feats)[source]
Parameters: feats (list) – A list of features to join.

Returns: a list of strings constructed by joining feats.
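
The joining step can be pictured with a small plain-Python sketch. It does not use the Corpus class: the helper `join_features`, the dict-based rows, and the single-space separator are all assumptions chosen for illustration.

```python
# Hypothetical sketch of joining selected features into document strings.
def join_features(rows, feats, sep=" "):
    """Return one string per row, built from the fields named in feats."""
    # The separator is an assumption; the library may join differently.
    return [sep.join(str(row[f]) for f in feats) for row in rows]


rows = [
    {"title": "Moby Dick", "body": "Call me Ishmael.", "year": 1851},
    {"title": "Emma", "body": "Handsome, clever, and rich.", "year": 1815},
]
docs = join_features(rows, ["title", "body"])
```

Fields not named in `feats` (here `year`) do not contribute to the resulting document strings.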

extend_attributes(X, feature_names, var_attrs=None)[source]

Append features to corpus.

Parameters:
  • X (numpy.ndarray or scipy.sparse.csr_matrix) – Features to append
  • feature_names (list) – List of strings containing feature names
  • var_attrs (dict) – Additional attributes appended to variable.attributes.
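
Conceptually, appending features widens the feature matrix by one column per new name. The sketch below shows that idea with nested lists standing in for NumPy or SciPy matrices. The helper `append_columns` and its shape checks are assumptions for illustration, not the behaviour of `extend_attributes` itself.

```python
# Hypothetical sketch: lists of rows stand in for the numeric matrix X.
def append_columns(X, names, new_X, new_names):
    """Return a widened copy of X plus the extended list of column names."""
    if len(X) != len(new_X):
        raise ValueError("new features must have one row per document")
    if any(len(row) != len(new_names) for row in new_X):
        raise ValueError("each new row needs one value per feature name")
    widened = [old + new for old, new in zip(X, new_X)]
    return widened, names + new_names


X = [[1.0, 0.0], [0.0, 2.0]]
names = ["apple", "pear"]
wide_X, wide_names = append_columns(X, names, [[3.0], [4.0]], ["plum"])
```

The original `X` and `names` are left unchanged in this sketch; whether the library modifies the corpus in place is not shown here.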
extend_corpus(metadata, Y)[source]

Append documents to corpus.

Parameters:
  • metadata (numpy.ndarray) – Meta data
  • Y (numpy.ndarray) – Class variables
has_tokens()[source]

Return whether the corpus has been preprocessed.

ngrams

generator: Ngram representations of documents.
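
For readers unfamiliar with n-grams, here is a minimal generator over tokenized documents. The name `iter_ngrams`, the default `n=2`, and the space separator are assumptions; the property above may use a different range and joining convention.

```python
# Hypothetical sketch of lazily producing n-grams, one list per document.
def iter_ngrams(tokenized_docs, n=2, sep=" "):
    """Yield, for each document, the list of its contiguous n-token windows."""
    for doc in tokenized_docs:
        # A document shorter than n yields an empty list.
        yield [sep.join(doc[i:i + n]) for i in range(len(doc) - n + 1)]


bigrams = list(iter_ngrams([["a", "b", "c"], ["d"]], n=2))
```

Because the function is a generator, documents are processed only as the caller iterates, which keeps memory use low for large corpora.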

set_text_features(feats)[source]

Select which meta-attributes to include when mining text.

Parameters: feats (list) – List of text features to include.
store_tokens(tokens, dictionary=None)[source]
Parameters: tokens (list) – List of lists containing tokens.
tokens

np.ndarray: A list of lists containing tokens. If tokens are not yet present, accessing this property runs the default preprocessor and stores the resulting tokens.
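
The compute-on-first-access behaviour described for tokens can be sketched with a small stand-alone class. `LazyTokens` is hypothetical, and its lowercase-and-split preprocessor is only a placeholder for whatever the default preprocessor actually does.

```python
# Hypothetical sketch of lazy tokenization; not the Corpus implementation.
class LazyTokens:
    def __init__(self, documents, preprocessor=None):
        self.documents = documents
        # Placeholder default: lowercase, then split on whitespace.
        self._preprocessor = preprocessor or (lambda text: text.lower().split())
        self._tokens = None
        self.runs = 0  # counts preprocessing passes, for illustration only

    def has_tokens(self):
        """Report whether tokens have already been stored."""
        return self._tokens is not None

    def store_tokens(self, tokens):
        """Save a list of token lists, one per document."""
        self._tokens = tokens

    @property
    def tokens(self):
        # Preprocess only on first access, then reuse the stored result.
        if self._tokens is None:
            self.runs += 1
            self.store_tokens([self._preprocessor(d) for d in self.documents])
        return self._tokens


lazy = LazyTokens(["Hello World", "Foo bar"])
before = lazy.has_tokens()
first = lazy.tokens
second = lazy.tokens
```

In this sketch the second access reuses the stored tokens, so the preprocessor runs once regardless of how often the property is read.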