Text

class niaarm.text.Corpus(documents=None)

Bases: object

The text corpus class.

Parameters:

documents (Optional[list[Document]]) – List of documents. If None, an empty list will be created. Default: None.

append(document)

Add a document to the corpus.

Parameters:

document (Document) – Document to append.

classmethod from_directory(path, encoding='utf-8', language='english', remove_stopwords=True, lowercase=True)

Construct corpus from a directory containing plain text files.

Parameters:
  • path (str) – Path to directory.

  • encoding (str) – Encoding of the files. Default: ‘utf-8’.

  • language (str) – Language of the files. Default: ‘english’.

  • remove_stopwords (bool) – If True, remove stopwords from text. Default: True.

  • lowercase (bool) – If True, make text lowercase. Default: True.

Returns:

The constructed corpus.

Return type:

Corpus
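As a rough illustration of what loading a directory of plain text files involves, the sketch below collects file contents with the standard library. The helper name read_text_files is hypothetical and not part of niaarm. Whether from_directory recurses into subdirectories or filters by file extension is not stated here, so this version simply reads every regular file directly inside the directory.

```python
import tempfile
from pathlib import Path


def read_text_files(path, encoding="utf-8"):
    """Return the contents of every regular file directly inside `path`, sorted by file name.

    Hypothetical helper for illustration only; not a niaarm function.
    """
    files = sorted(p for p in Path(path).iterdir() if p.is_file())
    return [p.read_text(encoding=encoding) for p in files]


# Demonstration on a temporary directory holding two small documents.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("First document.", encoding="utf-8")
    Path(tmp, "b.txt").write_text("Second document.", encoding="utf-8")
    texts = read_text_files(tmp)
```

A list of strings obtained this way has the shape that from_list expects for its lst parameter.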

classmethod from_list(lst, language='english', remove_stopwords=True, lowercase=True)

Construct corpus from a list of strings.

Parameters:
  • lst (list[str]) – List of documents as strings.

  • language (str) – Language of the documents. Default: ‘english’.

  • remove_stopwords (bool) – If True, remove stopwords from text. Default: True.

  • lowercase (bool) – If True, make text lowercase. Default: True.

Returns:

The constructed corpus.

Return type:

Corpus

terms()

Get a list of unique terms in the corpus.

Returns:

List of unique terms in the corpus.

Return type:

list[str]

tf_idf_matrix(smooth=True, norm=2)

Get the tf-idf weights matrix as a pandas DataFrame.

Parameters:
  • smooth (bool) – Smooth idf by adding one to the numerator and the denominator to prevent division-by-zero errors. Default: True.

  • norm (int) – Order of the norm to normalize the matrix with. Default: 2.

Returns:

The tf-idf matrix.

Return type:

pd.DataFrame
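The effect of the smooth and norm parameters can be sketched in plain Python. The formulas below are an assumption, modelled on the common convention idf = ln((1 + n) / (1 + df)) + 1 for smoothed idf; the exact expressions used by niaarm are not given in this reference and may differ. The function name tf_idf_rows and the pre-tokenized input are likewise illustrative.

```python
import math
from collections import Counter


def tf_idf_rows(docs, smooth=True, norm=2):
    """Return (terms, rows), where rows[i][j] is the weight of terms[j] in docs[i].

    Assumed formulas (not confirmed by the niaarm documentation):
      tf  = occurrences of the term in the document / tokens in the document
      idf = ln((1 + n) / (1 + df)) + 1   if smooth
            ln(n / df) + 1               otherwise
    Each row is then divided by its p-norm, with p = norm.
    """
    terms = sorted({t for doc in docs for t in doc})
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for doc in docs for t in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = []
        for t in terms:
            tf = counts[t] / len(doc) if doc else 0.0
            if smooth:
                idf = math.log((1 + n) / (1 + df[t])) + 1
            else:
                idf = math.log(n / df[t]) + 1
            row.append(tf * idf)
        length = sum(abs(w) ** norm for w in row) ** (1 / norm)
        rows.append([w / length if length else 0.0 for w in row])
    return terms, rows


# Three tiny, already tokenized documents.
docs = [["rule", "mining", "text"], ["text", "corpus"], ["rule", "rule", "corpus"]]
terms, rows = tf_idf_rows(docs)
```

With norm=2 every non-empty row has unit Euclidean length, and a term that occurs in fewer documents receives a larger weight than an equally frequent but more widespread term.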

class niaarm.text.Document(text, language='english', remove_stopwords=True, lowercase=True)

Bases: object

A text document class.

Parameters:
  • text (str) – Document text.

  • language (str) – Document language. Used for tokenization and stopword removal. Default: ‘english’.

  • remove_stopwords (bool) – If True, remove stopwords from text. Default: True.

  • lowercase (bool) – If True, make text lowercase. Default: True.

frequency(term)

Get the frequency of a term.

Parameters:

term (str) – Term to get frequency of.

Returns:

Frequency of the term.

Return type:

float

classmethod from_file(path, encoding='utf-8', language='english', remove_stopwords=True, lowercase=True)

Construct document from a plain text file.

Parameters:
  • path (str) – Path to file.

  • encoding (str) – Encoding of the file. Default: ‘utf-8’.

  • language (str) – Language of the file. Default: ‘english’.

  • remove_stopwords (bool) – If True, remove stopwords from text. Default: True.

  • lowercase (bool) – If True, make text lowercase. Default: True.

Returns:

The constructed document.

Return type:

Document

class niaarm.text.NiaARTM(max_terms, terms, transactions, metrics, threshold=0, logging=False)

Bases: NiaARM

Representation of Association Rule Text Mining as an optimization problem.

The implementation is composed of ideas found in the following paper:

  • I. Fister, S. Deb, I. Fister, "Population-based metaheuristics for Association Rule Text Mining", in Proceedings of the 2020 4th International Conference on Intelligent Systems, Metaheuristics & Swarm Intelligence, New York, NY, USA, Mar. 2020, pp. 19–23. doi: 10.1145/3396474.3396493.

Parameters:
  • max_terms (int) – Maximum number of terms in an association rule.

  • terms (list[str]) – List of unique terms in the corpus.

  • transactions (pandas.DataFrame) – The tf-idf matrix.

  • metrics (Union[Dict[str, float], Sequence[str]]) – Metrics to take into account when computing the fitness. Metrics can either be passed as a Dict of pairs {‘metric_name’: <weight of metric>} or a sequence of metrics as strings, in which case, the weights of the metrics will be set to 1.

  • threshold (Optional[float]) – Threshold of tf-idf weights. If a weight is less than or equal to the threshold, the term is not included in the transaction. Default: 0.

  • logging (bool) – Enable logging of fitness improvements. Default: False.

rules

A list of mined text rules.

Type:

RuleList
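The two less obvious parameters, metrics and threshold, can be illustrated with a minimal sketch. Both helper names are hypothetical, and the metric names 'support' and 'confidence' are used only as examples. The code mirrors the behaviour described above rather than reproducing niaarm's implementation.

```python
def normalize_metrics(metrics):
    """Map a dict or a sequence of metric names to a {name: weight} dict.

    A sequence of names gets weight 1 for every metric, as described for
    the metrics parameter. Hypothetical helper, not a niaarm function.
    """
    if isinstance(metrics, dict):
        return dict(metrics)
    return {name: 1.0 for name in metrics}


def terms_in_transaction(weights, threshold=0.0):
    """Keep only the terms whose tf-idf weight is strictly above the threshold.

    A weight less than or equal to the threshold excludes the term.
    """
    return {term for term, weight in weights.items() if weight > threshold}


weights_from_names = normalize_metrics(("support", "confidence"))
weights_from_dict = normalize_metrics({"support": 0.5})
kept = terms_in_transaction({"a": 0.0, "b": 0.3, "c": 0.1}, threshold=0.1)
```

Note that with the default threshold of 0, a term with a tf-idf weight of exactly 0 in a document is treated as absent from that transaction.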

class niaarm.text.TextRule(antecedent, consequent, fitness=0.0, transactions=None, threshold=0)

Bases: Rule

Class representing a text association rule.

The class contains all the metrics in Rule, except for amplitude, which returns nan.

Parameters:
  • antecedent (list[str]) – A list of antecedent terms of the text rule.

  • consequent (list[str]) – A list of consequent terms of the text rule.

  • fitness (Optional[float]) – Fitness value of the text rule. Default: 0.0.

  • transactions (Optional[pandas.DataFrame]) – The tf-idf matrix as a pandas DataFrame. Default: None.

  • threshold (Optional[float]) – Threshold of tf-idf weights. If a weight is less than or equal to the threshold, the term is not included in the transaction. Default: 0.

aws

The sum of tf-idf values for all the terms in the rule.
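One plausible reading of this attribute, shown here as a sketch only, is that the weights of the rule's antecedent and consequent terms are summed over every document in the tf-idf matrix. Whether niaarm sums over all documents or over some subset is an assumption, and the dict-of-dicts layout and the name aggregated_weight are hypothetical stand-ins for the DataFrame.

```python
def aggregated_weight(tfidf, rule_terms):
    """Sum the tf-idf weights of `rule_terms` over all documents.

    `tfidf` maps each document name to a {term: weight} dict (hypothetical
    layout); a term missing from a document contributes 0.
    """
    return sum(
        weights.get(term, 0.0)
        for weights in tfidf.values()
        for term in rule_terms
    )


tfidf = {
    "doc1": {"rule": 0.5, "text": 0.25},
    "doc2": {"text": 0.75, "corpus": 0.5},
}
# Rule terms are the antecedent and consequent terms taken together.
total = aggregated_weight(tfidf, ["rule", "text"])
```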

See also

niaarm.rule.Rule