Text

class niaarm.text.Corpus(documents=None)

Bases: object

The text corpus class.

Parameters: documents (Optional[list[Document]]) – List of documents. If None, an empty list will be created. Default: None.

append(document)

Add a document to the corpus.

Parameters: document (Document) – Document to append.

classmethod from_directory(path, encoding='utf-8', language='english', remove_stopwords=True, lowercase=True)

Construct corpus from a directory containing plain text files.

Parameters:

path (str) – Path to directory.
encoding (str) – Encoding of the files. Default: ‘utf-8’.
language (str) – Language of the files. Default: ‘english’.
remove_stopwords (bool) – If True, remove stopwords from text. Default: True.
lowercase (bool) – If True, make text lowercase. Default: True.

Returns:

The constructed corpus.

Return type:

Corpus

classmethod from_list(lst, language='english', remove_stopwords=True, lowercase=True)

Construct corpus from a list of strings.

Parameters:

lst (list[str]) – List of documents as strings.
language (str) – Language of the documents. Default: ‘english’.
remove_stopwords (bool) – If True, remove stopwords from text. Default: True.
lowercase (bool) – If True, make text lowercase. Default: True.

Returns:

The constructed corpus.

Return type:

Corpus

terms()

Get a list of unique terms in the corpus.

Returns: List of unique terms in the corpus.
Return type: list[str]

tf_idf_matrix(smooth=True, norm=2)

Get the tf-idf weights matrix as a pandas DataFrame.

Parameters:

smooth (bool) – Smooth idf by adding one to the numerator and the denominator to prevent division by 0 errors. Default: True.
norm (int) – Order of the norm to normalize the matrix with. Default: 2.

Returns:

The tf-idf matrix.

Return type:

pd.DataFrame
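
The smooth and norm parameters above can be illustrated with a minimal, library-independent sketch. It assumes the common smoothed formulation idf = ln((n + 1) / (df + 1)) + 1 and relative term frequencies; whether niaarm uses exactly this formula is an assumption, and the function name tf_idf is hypothetical. It returns plain lists rather than a pandas DataFrame.

```python
import math


def tf_idf(docs, smooth=True, norm=2):
    """Hypothetical helper: docs is a list of token lists.

    Returns (terms, matrix), where matrix has one row per document
    and one column per unique term.
    """
    terms = sorted({t for d in docs for t in d})
    n = len(docs)
    s = 1 if smooth else 0  # added to numerator and denominator when smoothing
    # Document frequency: number of documents that contain each term.
    df = {t: sum(t in d for d in docs) for t in terms}
    matrix = []
    for d in docs:
        row = []
        for t in terms:
            tf = d.count(t) / len(d) if d else 0.0  # assumed: relative frequency
            idf = math.log((n + s) / (df[t] + s)) + 1
            row.append(tf * idf)
        # Scale the row so that its p-norm (p = norm) equals 1.
        length = sum(abs(w) ** norm for w in row) ** (1 / norm)
        matrix.append([w / length if length else 0.0 for w in row])
    return terms, matrix


terms, matrix = tf_idf([["data", "mining", "data"], ["text", "mining"]])
```

With norm=2 every non-empty row has unit Euclidean length, so weights are comparable across documents of different sizes.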

class niaarm.text.Document(text, language='english', remove_stopwords=True, lowercase=True)

Bases: object

A text document class.

Parameters:

text (str) – Document text.
language (str) – Document language. Used for tokenization and stopword removal. Default: ‘english’.
remove_stopwords (bool) – If True, remove stopwords from text. Default: True.
lowercase (bool) – If True, make text lowercase. Default: True.

frequency(term)

Get the frequency of a term.

Parameters: term (str) – Term to get frequency of.
Returns: Frequency of the term.
Return type: float
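
Since the return type is float, one plausible reading is that frequency is relative: the number of occurrences of the term divided by the number of tokens in the document. The sketch below shows that interpretation only; it is an assumption, not a confirmed description of the method, and relative_frequency is a hypothetical name.

```python
from collections import Counter


def relative_frequency(tokens, term):
    """Hypothetical helper: share of tokens equal to term.

    Returns 0.0 for an empty document or an absent term.
    """
    if not tokens:
        return 0.0
    return Counter(tokens)[term] / len(tokens)


freq = relative_frequency(["rule", "mining", "rule", "text"], "rule")
```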

classmethod from_file(path, encoding='utf-8', language='english', remove_stopwords=True, lowercase=True)

Construct document from a plain text file.

Parameters:

path (str) – Path to file.
encoding (str) – Encoding of the file. Default: ‘utf-8’.
language (str) – Language of the file. Default: ‘english’.
remove_stopwords (bool) – If True, remove stopwords from text. Default: True.
lowercase (bool) – If True, make text lowercase. Default: True.

Returns:

The constructed document.

Return type:

Document

class niaarm.text.NiaARTM(max_terms, terms, transactions, metrics, threshold=0, logging=False)

Bases: NiaARM

Representation of Association Rule Text Mining as an optimization problem.

The implementation is composed of ideas found in the following paper:

I. Fister, S. Deb, I. Fister, “Population-based metaheuristics for Association Rule Text Mining”, in Proceedings of the 2020 4th International Conference on Intelligent Systems, Metaheuristics & Swarm Intelligence, New York, NY, USA, Mar. 2020, pp. 19–23. doi: 10.1145/3396474.3396493.

Parameters:

max_terms (int) – Maximum number of terms in an association rule.
terms (list[str]) – List of unique terms in the corpus.
transactions (pandas.DataFrame) – The tf-idf matrix.
metrics (Union[Dict[str, float], Sequence[str]]) – Metrics to take into account when computing the fitness. Metrics can either be passed as a Dict of pairs {‘metric_name’: <weight of metric>} or a sequence of metrics as strings, in which case, the weights of the metrics will be set to 1.
threshold (Optional[float]) – Threshold of tf-idf weights. If a weight is less than or equal to the threshold, the term is not included in the transaction. Default: 0.
logging (bool) – Enable logging of fitness improvements. Default: False.
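
The two accepted forms of the metrics argument can be reduced to a single name-to-weight mapping. The following is a minimal sketch of that normalization, not the library's code; normalize_metrics is a hypothetical helper and the metric names are used only as examples.

```python
def normalize_metrics(metrics):
    """Hypothetical helper: return a dict mapping metric name to weight.

    A dict is copied as-is; a sequence of names gets weight 1 for each entry.
    """
    if isinstance(metrics, dict):
        return dict(metrics)
    return {name: 1.0 for name in metrics}


weighted = normalize_metrics({"support": 0.7, "confidence": 0.3})
unweighted = normalize_metrics(["support", "confidence"])
```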

rules

A list of mined text rules.

Type: RuleList

class niaarm.text.TextRule(antecedent, consequent, fitness=0.0, transactions=None, threshold=0)

Bases: Rule

Class representing a text association rule.

The class contains all the metrics in Rule, except for amplitude, which returns nan.

Parameters:

antecedent (list[str]) – A list of antecedent terms of the text rule.
consequent (list[str]) – A list of consequent terms of the text rule.
fitness (Optional[float]) – Fitness value of the text rule.
transactions (Optional[pandas.DataFrame]) – The tf-idf matrix as a pandas DataFrame.
threshold (Optional[float]) – Threshold of tf-idf weights. If a weight is less than or equal to the threshold, the term is not included in the transaction. Default: 0.
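
The threshold rule described above (a term is dropped when its weight is less than or equal to the threshold) can be sketched on a single row of tf-idf weights. This illustration uses a plain dict in place of a DataFrame row, and transaction_terms is a hypothetical helper rather than part of the library.

```python
def transaction_terms(weights, threshold=0.0):
    """Hypothetical helper: terms whose tf-idf weight is strictly above threshold.

    weights maps each term to its tf-idf weight in one document.
    """
    return [term for term, w in weights.items() if w > threshold]


row = {"data": 0.8, "text": 0.0, "rule": 0.3}
present = transaction_terms(row)         # default threshold of 0
stricter = transaction_terms(row, 0.3)   # a weight equal to the threshold is excluded
```

With the default threshold of 0, a term counts as present in a document exactly when its tf-idf weight is positive.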

aws: The sum of tf-idf values for all the terms in the rule.