Text
- class niaarm.text.Corpus(documents=None)
Bases: object
The text corpus class.
- Parameters:
documents (Optional[list[Document]]) – List of documents. If None, an empty list will be created. Default: None.
- append(document)
Add a document to the corpus.
- Parameters:
document (Document) – Document to append.
- classmethod from_directory(path, encoding='utf-8', language='english', remove_stopwords=True, lowercase=True)
Construct corpus from a directory containing plain text files.
- Parameters:
path (str) – Path to directory.
encoding (str) – Encoding of the files. Default: ‘utf-8’.
language (str) – Language of the files. Default: ‘english’.
remove_stopwords (bool) – If True, remove stopwords from text. Default: True.
lowercase (bool) – If True, make text lowercase. Default: True.
- Returns:
The constructed corpus.
- Return type:
Corpus
- classmethod from_list(lst, language='english', remove_stopwords=True, lowercase=True)
Construct corpus from a list of strings.
- Parameters:
lst (list[str]) – List of documents as strings.
language (str) – Language of the documents. Default: ‘english’.
remove_stopwords (bool) – If True, remove stopwords from text. Default: True.
lowercase (bool) – If True, make text lowercase. Default: True.
- Returns:
The constructed corpus.
- Return type:
Corpus
- terms()
Get a list of unique terms in the corpus.
- Returns:
List of unique terms in the corpus.
- Return type:
list[str]
- tf_idf_matrix(smooth=True, norm=2)
Get the tf-idf weights matrix as a pandas DataFrame.
- Parameters:
smooth (bool) – Smooth idf by adding one to the numerator and the denominator to prevent division by 0 errors. Default: True.
norm (int) – Order of the norm to normalize the matrix with. Default: 2.
- Returns:
The tf-idf matrix.
- Return type:
pd.DataFrame
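The smoothing and normalization options above can be illustrated with a minimal, self-contained sketch. It is not niaarm's implementation: it assumes the common convention idf = ln((n + 1) / (df + 1)) + 1 for smoothed idf, raw term counts as tf, and row-wise normalization, none of which the text above confirms. The function name tf_idf_rows and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_rows(docs, smooth=True, norm=2):
    """Return (terms, rows); rows[i][j] is the weight of terms[j] in docs[i].

    docs is a list of token lists. Illustrative sketch only.
    """
    terms = sorted({t for doc in docs for t in doc})
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {t: sum(1 for doc in docs if t in doc) for t in terms}
    s = 1 if smooth else 0
    # Assumed convention: add s to numerator and denominator, then add 1.
    idf = {t: math.log((n + s) / (df[t] + s)) + 1 for t in terms}

    rows = []
    for doc in docs:
        counts = Counter(doc)
        # Raw counts are used as term frequency (an assumption).
        row = [counts[t] * idf[t] for t in terms]
        if norm is not None:
            # Normalize each row by its p-norm, with p = norm.
            length = sum(abs(w) ** norm for w in row) ** (1 / norm)
            if length > 0:
                row = [w / length for w in row]
        rows.append(row)
    return terms, rows


docs = [["rule", "mining", "text"], ["text", "corpus"], ["text", "rule", "rule"]]
terms, rows = tf_idf_rows(docs)
```

With norm=2 every row has unit Euclidean length, so documents of different lengths become comparable.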
- class niaarm.text.Document(text, language='english', remove_stopwords=True, lowercase=True)
Bases: object
A text document class.
- Parameters:
text (str) – Document text.
language (str) – Document language. Used for tokenization and stopword removal. Default: ‘english’.
remove_stopwords (bool) – If True, remove stopwords from text. Default: True.
lowercase (bool) – If True, make text lowercase. Default: True.
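To make the effect of the lowercase and remove_stopwords flags concrete, here is a rough sketch of that kind of preprocessing. It is not the library's tokenizer: the regular-expression tokenization, the tiny STOPWORDS set and the preprocess function are all hypothetical stand-ins.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to"}


def preprocess(text, remove_stopwords=True, lowercase=True):
    """Split text into word tokens, optionally lowercasing and filtering."""
    if lowercase:
        text = text.lower()
    # Naive tokenization: runs of ASCII letters (an assumption).
    tokens = re.findall(r"[A-Za-z]+", text)
    if remove_stopwords:
        # Compare case-insensitively so "The" is dropped even without lowercasing.
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return tokens


tokens = preprocess("The Mining of Association Rules in Text")
cased = preprocess("The Mining of Association Rules in Text", lowercase=False)
```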
- frequency(term)
Get the frequency of a term.
- Parameters:
term (str) – Term to get frequency of.
- Returns:
Frequency of the term.
- Return type:
float
- classmethod from_file(path, encoding='utf-8', language='english', remove_stopwords=True, lowercase=True)
Construct document from a plain text file.
- Parameters:
path (str) – Path to file.
encoding (str) – Encoding of the file. Default: ‘utf-8’.
language (str) – Language of the file. Default: ‘english’.
remove_stopwords (bool) – If True, remove stopwords from text. Default: True.
lowercase (bool) – If True, make text lowercase. Default: True.
- Returns:
The constructed document.
- Return type:
Document
- class niaarm.text.NiaARTM(max_terms, terms, transactions, metrics, threshold=0, logging=False)
Bases: NiaARM
Representation of Association Rule Text Mining as an optimization problem.
The implementation is composed of ideas found in the following paper:
I. Fister, S. Deb, I. Fister, "Population-based metaheuristics for Association Rule Text Mining", in Proceedings of the 2020 4th International Conference on Intelligent Systems, Metaheuristics & Swarm Intelligence, New York, NY, USA, Mar. 2020, pp. 19–23. doi: 10.1145/3396474.3396493.
- Parameters:
max_terms (int) – Maximum number of terms in an association rule.
terms (list[str]) – List of unique terms in the corpus.
transactions (pandas.DataFrame) – The tf-idf matrix.
metrics (Union[Dict[str, float], Sequence[str]]) – Metrics to take into account when computing the fitness. Metrics can either be passed as a Dict of pairs {‘metric_name’: <weight of metric>} or a sequence of metrics as strings, in which case, the weights of the metrics will be set to 1.
threshold (Optional[float]) – Threshold of tf-idf weights. If a weight is less than or equal to the threshold, the term is not included in the transaction. Default: 0.
logging (bool) – Enable logging of fitness improvements. Default: False.
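The two accepted forms of the metrics argument can be reduced to a single name-to-weight mapping. The helper below is a hypothetical sketch of that reduction, not the library's code; the metric names in the usage lines are only illustrative.

```python
from collections.abc import Mapping


def metric_weights(metrics):
    """Return a {metric_name: weight} dict for either accepted input form."""
    if isinstance(metrics, Mapping):
        # Explicit weights were given: keep them, coerced to float.
        return {name: float(weight) for name, weight in metrics.items()}
    # A plain sequence of names: every metric gets weight 1.
    return {name: 1.0 for name in metrics}


from_dict = metric_weights({"support": 0.5, "confidence": 2})
from_list = metric_weights(["support", "confidence"])
```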
- class niaarm.text.TextRule(antecedent, consequent, fitness=0.0, transactions=None, threshold=0)
Bases: Rule
Class representing a text association rule.
The class contains all the metrics in Rule, except for amplitude, which returns nan.
- Parameters:
antecedent (list[str]) – A list of antecedent terms of the text rule.
consequent (list[str]) – A list of consequent terms of the text rule.
fitness (Optional[float]) – Fitness value of the text rule.
transactions (Optional[pandas.DataFrame]) – The tf-idf matrix as a pandas DataFrame.
threshold (Optional[float]) – Threshold of tf-idf weights. If a weight is less than or equal to the threshold, the term is not included in the transaction. Default: 0.
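The threshold rule described above (a term belongs to a transaction only if its tf-idf weight is strictly greater than the threshold) can be sketched as follows. The helper names, the sample weights and the support calculation (the fraction of documents containing every term of the rule) are illustrative assumptions, not code taken from the library.

```python
def transaction_terms(weights, threshold=0.0):
    """Return the terms of one document whose weight exceeds the threshold.

    weights maps each term to its tf-idf weight in that document.
    """
    return {term for term, w in weights.items() if w > threshold}


def rule_support(rows, antecedent, consequent, threshold=0.0):
    """Fraction of documents that contain all terms of the rule."""
    rule_terms = set(antecedent) | set(consequent)
    hits = sum(1 for row in rows if rule_terms <= transaction_terms(row, threshold))
    return hits / len(rows)


# Hypothetical tf-idf weights for three documents.
rows = [
    {"text": 0.5, "rule": 0.7, "mining": 0.0},
    {"text": 0.9, "rule": 0.0, "mining": 0.4},
    {"text": 0.3, "rule": 0.6, "mining": 0.2},
]

loose = rule_support(rows, ["text"], ["rule"], threshold=0.0)
strict = rule_support(rows, ["text"], ["rule"], threshold=0.3)
```

Raising the threshold removes weakly weighted terms from transactions, so the support of a rule can only stay the same or decrease.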
- aws
The sum of tf-idf values for all the terms in the rule.
See also