Text
- class niaarm.text.Corpus(documents=None)
Bases: object
The text corpus class.
- Parameters:
documents (list[Document] | None) – List of documents. If None, an empty list will be created. Default: None.
- append(document)
Add a document to the corpus.
- Parameters:
document (Document) – Document to append.
- classmethod from_directory(path, encoding='utf-8', language='english', remove_stopwords=True, lowercase=True)
Construct corpus from a directory containing plain text files.
- Parameters:
path (str) – Path to directory.
encoding (str) – Encoding of the files. Default: ‘utf-8’.
language (str) – Language of the files. Default: ‘english’.
remove_stopwords (bool) – Remove stopwords from text. Default: True.
lowercase (bool) – Make text lowercase. Default: True.
- Returns:
The constructed corpus.
- Return type:
Corpus
- classmethod from_list(lst, language='english', remove_stopwords=True, lowercase=True)
Construct corpus from a list of strings.
- Parameters:
lst (list[str]) – List of documents as strings.
language (str) – Language of the documents. Default: ‘english’.
remove_stopwords (bool) – Remove stopwords from text. Default: True.
lowercase (bool) – Make text lowercase. Default: True.
- Returns:
The constructed corpus.
- Return type:
Corpus
- terms()
Get a list of unique terms in the corpus.
- Returns:
List of unique terms in the corpus.
- Return type:
list[str]
- tf_idf_matrix(smooth=True, norm=2)
Get the tf-idf weights matrix as a pandas DataFrame.
- Parameters:
smooth (bool) – Apply smoothing. Default: True.
norm (int) – Order of the norm to normalize the matrix with. Default: 2.
- Returns:
The tf-idf matrix.
- Return type:
pd.DataFrame
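To illustrate what the smooth and norm parameters control, here is a minimal pure-Python sketch of a tf-idf computation. It is not niaarm's implementation: the helper name tf_idf_rows is hypothetical, the smoothed idf formula ln((1 + N) / (1 + df)) + 1 is assumed (the variant documented for scikit-learn's TfidfTransformer), and plain lists stand in for the pandas DataFrame that tf_idf_matrix returns.

```python
import math
from collections import Counter


def tf_idf_rows(docs, smooth=True, norm=2):
    """Return (terms, rows): one row of tf-idf weights per tokenized document.

    Hypothetical helper for illustration; the idf formula is an assumption.
    """
    terms = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for doc in docs for t in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = []
        for t in terms:
            tf = counts[t] / len(doc) if doc else 0.0
            if smooth:
                # Smoothing acts as if one extra document contained every term.
                idf = math.log((1 + n_docs) / (1 + df[t])) + 1
            else:
                idf = math.log(n_docs / df[t]) + 1
            row.append(tf * idf)
        # Scale the row so that its p-norm (p = norm) equals 1.
        length = sum(abs(w) ** norm for w in row) ** (1 / norm)
        rows.append([w / length if length else 0.0 for w in row])
    return terms, rows


docs = [["rule", "mining", "text"], ["text", "corpus"], ["rule", "rule", "text"]]
terms, rows = tf_idf_rows(docs)
```

With norm=2 every non-empty row has unit Euclidean length, so weights are comparable across documents of different sizes.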
- class niaarm.text.Document(text, language='english', remove_stopwords=True, lowercase=True)
Bases: object
A text document class.
- Parameters:
text (str) – Document text.
language (str) – Document language. Default: ‘english’.
remove_stopwords (bool) – Remove stopwords from text. Default: True.
lowercase (bool) – Make text lowercase. Default: True.
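The effect of the lowercase and remove_stopwords options can be sketched in plain Python. This is only an illustration under stated assumptions: the preprocess function, the regex tokenizer, and the tiny STOPWORDS table are all hypothetical and are not how niaarm tokenizes text or sources its stopword lists.

```python
import re

# Tiny illustrative stopword table (an assumption, not the library's list).
STOPWORDS = {"english": {"the", "a", "an", "of", "in", "is", "and", "to"}}


def preprocess(text, language="english", remove_stopwords=True, lowercase=True):
    """Split text into word tokens, optionally lowercasing and dropping stopwords."""
    if lowercase:
        text = text.lower()
    # Keep runs of letters only; digits, underscores and punctuation split tokens.
    tokens = re.findall(r"[^\W\d_]+", text)
    if remove_stopwords:
        stop = STOPWORDS.get(language, set())
        # Compare case-insensitively so "The" is dropped even when lowercase=False.
        tokens = [t for t in tokens if t.lower() not in stop]
    return tokens


tokens = preprocess("The mining of Rules in a Text corpus")
```

Turning either flag off keeps more of the raw text: lowercase=False preserves capitalization, and remove_stopwords=False keeps words such as "the".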
- frequency(term)
Get the frequency of a term.
- Parameters:
term (str) – Term to get frequency of.
- Returns:
Frequency of the term.
- Return type:
float
- classmethod from_file(path, encoding='utf-8', language='english', remove_stopwords=True, lowercase=True)
Construct document from a plain text file.
- Parameters:
path (str) – Path to file.
encoding (str) – Encoding of the file. Default: ‘utf-8’.
language (str) – Language of the file. Default: ‘english’.
remove_stopwords (bool) – Remove stopwords from text. Default: True.
lowercase (bool) – Make text lowercase. Default: True.
- Returns:
The constructed document.
- Return type:
Document
- class niaarm.text.NiaARTM(max_terms, terms, transactions, metrics, threshold=0, logging=False)
Bases: NiaARM
Representation of Association Rule Text Mining as an optimization problem.
The implementation is composed of ideas found in the following paper:
I. Fister, S. Deb, I. Fister, "Population-based metaheuristics for Association Rule Text Mining", in Proceedings of the 2020 4th International Conference on Intelligent Systems, Metaheuristics & Swarm Intelligence, New York, NY, USA, Mar. 2020, pp. 19–23. doi: 10.1145/3396474.3396493.
- Parameters:
max_terms (int) – Maximum number of terms in association rule.
terms (list[str]) – List of unique terms in the corpus.
transactions (pandas.DataFrame) – The tf-idf matrix.
metrics (dict[str, float] | Sequence[str]) – Metrics to take into account when computing the fitness. Metrics can either be passed as a dict of {<metric name>: <weight>} or a sequence of metrics as strings, in which case, the weights will default to 1.
threshold (float | None) – Threshold of tf-idf weights. If a weight is less than or equal to the threshold, the term is not included in the transaction. Default: 0.
logging (bool) – Enable logging of fitness improvements. Default: False.
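The two accepted forms of the metrics argument can be pictured as resolving to the same mapping of name to weight. The sketch below shows that idea only; metric_weights is a hypothetical helper, not part of niaarm, and the metric names are used purely as examples.

```python
from collections.abc import Mapping


def metric_weights(metrics):
    """Return a {metric name: weight} dict from either accepted form.

    Hypothetical helper for illustration; not niaarm's code.
    """
    if isinstance(metrics, Mapping):
        # Explicit weights: keep them as given.
        return {name: float(weight) for name, weight in metrics.items()}
    # A plain sequence of names: every metric gets the default weight 1.
    return {name: 1.0 for name in metrics}


equal = metric_weights(["support", "confidence"])
weighted = metric_weights({"support": 2, "confidence": 0.5})
```

Passing a dict is the way to emphasize one metric over another; a plain sequence treats all listed metrics equally.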
- class niaarm.text.TextRule(antecedent, consequent, fitness=0.0, transactions=None, threshold=0)
Bases: Rule
Class representing a text association rule.
- Parameters:
antecedent (list[str]) – A list of antecedent terms of the text rule.
consequent (list[str]) – A list of consequent terms of the text rule.
fitness (float | None) – Fitness value of the text rule.
transactions (pandas.DataFrame | None) – The tf-idf matrix.
threshold (float | None) – Threshold of tf-idf weights. If a weight is less than or equal to the threshold, the term is not included in the transaction. Default: 0.
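The threshold rule described above (a term is excluded when its weight is less than or equal to the threshold) amounts to a strict greater-than test. A minimal sketch, assuming one row of the tf-idf matrix is available as a dict; terms_in_transaction and the sample weights are hypothetical.

```python
def terms_in_transaction(weights, threshold=0):
    """Return the terms whose tf-idf weight is strictly above the threshold."""
    return [term for term, w in weights.items() if w > threshold]


# Made-up weights for a single document (one row of a tf-idf matrix).
row = {"rule": 0.42, "mining": 0.0, "text": 0.05}
present = terms_in_transaction(row)
strict = terms_in_transaction(row, threshold=0.05)
```

With the default threshold of 0, any term with a positive weight counts as present; a weight exactly equal to the threshold is excluded.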
- aws
The sum of tf-idf values for all the terms in the rule.
See also