| Title: | Topic Modeling with 'BERTopic' |
|---|---|
| Description: | Interface to the Python package 'BERTopic' <https://maartengr.github.io/BERTopic/index.html> for transformer-based topic modeling. Provides R wrappers to fit BERTopic models, transform new documents, update and reduce topics, extract topic- and document-level information, and generate interactive visualizations. 'Python' backends and dependencies are managed via the 'reticulate' package. |
| Authors: | Biying Zhou [aut, cre] |
| Maintainer: | Biying Zhou <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-05-27 09:23:41 UTC |
| Source: | https://github.com/feng-ji-lab/bertopic |
Coerce to data.frame
## S3 method for class 'bertopic_r'
as.data.frame(x, ...)

| Argument | Description |
|---|---|
| x | A "bertopic_r" model. |
| ... | Unused. |

A data.frame equal to bertopic_topics().
Extract the document-topic probabilities as a matrix. If probabilities were not computed during fitting, returns NULL (with a warning).
bertopic_as_document_topic_matrix(model, sparse = TRUE, prefix = TRUE)

| Argument | Description |
|---|---|
| model | A "bertopic_r" model object. |
| sparse | Logical; if TRUE and Matrix is available, returns a sparse matrix. |
| prefix | Logical; if TRUE, prefix columns as topic ids. |

A matrix or sparse Matrix of size n_docs x n_topics, or NULL.
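A minimal sketch of requesting a dense matrix and handling the NULL case. It assumes the R package is installed under the name `bertopic` (inferred from the repository name, not stated in this manual) and uses a synthetic corpus built only for illustration. The fitting call is skipped when the package or the Python backend is unavailable.

```r
# Synthetic corpus (illustrative only): 4 subjects x 30 endings = 120 documents.
docs <- as.vector(outer(
  c("The football match", "The election result", "The pasta recipe", "The new phone"),
  paste("was discussed in report", 1:30),
  paste
))

# Assumption: the package is installed as "bertopic"; skip quietly otherwise.
if (requireNamespace("bertopic", quietly = TRUE) && bertopic::bertopic_available()) {
  model <- bertopic::bertopic_fit(docs)
  dtm <- bertopic::bertopic_as_document_topic_matrix(model, sparse = FALSE)
  if (is.null(dtm)) {
    message("No probabilities were computed during fitting.")
  } else {
    print(dim(dtm))  # n_docs x n_topics
  }
}
```

Checking for NULL before using the result avoids downstream errors when the model was fitted without probabilities.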
Checks whether the active Python (as initialized by reticulate) can import the key modules needed for BERTopic.
bertopic_available()
Logical scalar.
## Not run:
bertopic_available()
## End(Not run)
Use BERTopic.find_topics() to retrieve the closest topics for a query
string. Augments topic IDs/scores with topic labels when available.
bertopic_find_topics(model, query_text, top_n = 5L)

| Argument | Description |
|---|---|
| model | A "bertopic_r" model. |
| query_text | A length-1 character query. |
| top_n | Number of nearest topics to return. |

A tibble with columns topic, score, and label.
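As a hedged illustration, the sketch below queries a fitted model and filters the returned tibble by score. The package name `bertopic`, the synthetic corpus, and the 0.3 cutoff are assumptions made for the example, not values taken from this manual.

```r
# Synthetic corpus (illustrative only).
docs <- as.vector(outer(
  c("The football match", "The election result", "The pasta recipe", "The new phone"),
  paste("was discussed in report", 1:30),
  paste
))
query <- "cheap mobile phone offers"

# Assumption: the package is installed as "bertopic"; skip quietly otherwise.
if (requireNamespace("bertopic", quietly = TRUE) && bertopic::bertopic_available()) {
  model <- bertopic::bertopic_fit(docs)
  hits <- bertopic::bertopic_find_topics(model, query_text = query, top_n = 3L)
  # Keep only matches above an arbitrary, illustrative similarity cutoff.
  print(hits[hits$score > 0.3, ])
}
```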
A high-level wrapper around Python 'BERTopic'. Python dependencies are checked at runtime.
bertopic_fit(text, embeddings = NULL, ...)

| Argument | Description |
|---|---|
| text | Character vector of documents. |
| embeddings | Optional numeric matrix (n_docs x dim). If supplied, passed through to Python. |
| ... | Additional arguments forwarded to |

An S3 object of class "bertopic_r" containing:
.py: the underlying Python model (reticulate object)
topics: integer vector of topic assignments
probs: numeric matrix/data frame of topic probabilities (if available)
## Not run:
if (reticulate::py_module_available("bertopic")) {
  m <- bertopic_fit(c("a doc", "another doc"))
  print(class(m))
}
## End(Not run)
Retrieve document-level information for the provided documents.
bertopic_get_document_info(model, docs)

| Argument | Description |
|---|---|
| model | A "bertopic_r" model. |
| docs | Character vector of documents to query (required). |

A tibble with document-level information.
Retrieve representative documents for a given topic using
BERTopic.get_representative_docs(). Falls back across signature variants.
bertopic_get_representative_docs(model, topic_id, top_n = 5L)

| Argument | Description |
|---|---|
| model | A "bertopic_r" model. |
| topic_id | Integer topic id. |
| top_n | Number of representative documents to return. |

A tibble with columns rank and document. If scores are available
in the current BERTopic version, a score column is included.
Does the model have a usable embedding model?
bertopic_has_embedding_model(model)

| Argument | Description |
|---|---|
| model | A "bertopic_r" model. |

Logical; TRUE if embedding_model is present and not None.
Load a BERTopic model from disk that was saved with bertopic_save().
bertopic_load(path)

| Argument | Description |
|---|---|
| path | Path used in bertopic_save(). |

A "bertopic_r" object with the loaded Python model.
Wrapper over Python reduce_topics, compatible with multiple signatures.
bertopic_reduce_topics(
  model,
  nr_topics = "auto",
  representation_model = NULL,
  docs = NULL
)

| Argument | Description |
|---|---|
| model | A "bertopic_r" model. |
| nr_topics | Target number (integer) or "auto". |
| representation_model | Optional Python representation model. |
| docs | Optional character vector of training docs (used if required by backend). |

The input model (invisibly).
Save a fitted BERTopic model to disk. Depending on the serialization method, this may produce either a single file (e.g., *.pkl / *.pt / *.safetensors) or a directory bundle. The function does not pre-create the target path; it only ensures the parent directory exists and lets BERTopic decide the layout.
bertopic_save(
  model,
  path,
  serialization = c("pickle", "safetensors", "pt"),
  save_embedding_model = FALSE,
  overwrite = FALSE
)

| Argument | Description |
|---|---|
| model | A "bertopic_r" model. |
| path | Destination path (file or directory, as required by BERTopic). |
| serialization | One of "pickle", "safetensors", or "pt". Default "pickle". |
| save_embedding_model | Logical; whether to include the embedding model. Default FALSE. |
| overwrite | Logical; if TRUE and the target exists, it will be replaced. |

Invisibly returns the normalized path.
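The save and load functions can be combined into a round trip. The following is a sketch under assumptions: the package is installed as `bertopic`, the corpus is synthetic, and the target path under `tempdir()` is a placeholder. Because `save_embedding_model` defaults to FALSE, the sketch re-attaches an embedding model after loading, using the identifier this manual gives as an example.

```r
# Synthetic corpus (illustrative only).
docs <- as.vector(outer(
  c("The football match", "The election result", "The pasta recipe", "The new phone"),
  paste("was discussed in report", 1:30),
  paste
))
# Placeholder target; bertopic_save() is documented to create the parent directory.
model_path <- file.path(tempdir(), "bertopic-demo", "model.pkl")

# Assumption: the package is installed as "bertopic"; skip quietly otherwise.
if (requireNamespace("bertopic", quietly = TRUE) && bertopic::bertopic_available()) {
  model <- bertopic::bertopic_fit(docs)
  saved <- bertopic::bertopic_save(
    model, model_path,
    serialization = "pickle",
    overwrite = TRUE
  )
  restored <- bertopic::bertopic_load(saved)
  # The embedding model was not saved, so set one before transforming new text.
  if (!bertopic::bertopic_has_embedding_model(restored)) {
    restored <- bertopic::bertopic_set_embedding_model(restored, "all-MiniLM-L6-v2")
  }
}
```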
Runs a quick end-to-end smoke test:
Report Python path/version.
Verify that bertopic is importable and report its version.
Minimal round trip: fit -> transform -> save -> load.
bertopic_self_check()
A named list with fields:
Logical.
Logical.
Logical.
Character vector of diagnostic messages.
## Not run:
bertopic_self_check()
## End(Not run)
Summarize Python/BERTopic session info
bertopic_session_info()
A named list containing paths, versions, and module availability:
Path of the active Python.
Path to libpython, if any.
Python version string.
Whether NumPy is available.
NumPy version string (if available).
A data.frame with availability for key modules.
## Not run:
bertopic_session_info()
## End(Not run)
Set a new embedding model on a fitted BERTopic instance. This enables
transform() after loading when the embedding model was not saved.
bertopic_set_embedding_model(model, embedding_model)

| Argument | Description |
|---|---|
| model | A "bertopic_r" model. |
| embedding_model | Either a character identifier (e.g., "all-MiniLM-L6-v2") or a Python embedding model object (e.g., a SentenceTransformer instance). |

The input model (invisibly).
Set custom labels for topics. Accepts a named character vector or a
data.frame with columns topic and label.
bertopic_set_topic_labels(model, labels)

| Argument | Description |
|---|---|
| model | A "bertopic_r" model. |
| labels | A named character vector (names are topic ids) or a data.frame. |

The input model (invisibly).
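The two accepted label formats can be built from the same information. The sketch below uses a hypothetical helper, `make_labels()`, that is not part of the package, and derives topic ids from the `topics` component of the fitted object. The package name `bertopic` and the synthetic corpus are assumptions.

```r
# Hypothetical helper (not part of the package): one label per distinct topic id,
# returned as a named character vector whose names are the topic ids.
make_labels <- function(topic_ids) {
  ids <- sort(unique(topic_ids))
  stats::setNames(paste("Topic", ids), ids)
}

# The same labels in data.frame form, with columns topic and label.
labels_to_df <- function(labels) {
  data.frame(
    topic = as.integer(names(labels)),
    label = unname(labels),
    stringsAsFactors = FALSE
  )
}

# Synthetic corpus (illustrative only).
docs <- as.vector(outer(
  c("The football match", "The election result", "The pasta recipe", "The new phone"),
  paste("was discussed in report", 1:30),
  paste
))

# Assumption: the package is installed as "bertopic"; skip quietly otherwise.
if (requireNamespace("bertopic", quietly = TRUE) && bertopic::bertopic_available()) {
  model <- bertopic::bertopic_fit(docs)
  labels <- make_labels(model$topics)
  bertopic::bertopic_set_topic_labels(model, labels)
  # Equivalent call with the data.frame form:
  # bertopic::bertopic_set_topic_labels(model, labels_to_df(labels))
}
```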
Get top terms for a topic
bertopic_topic_terms(model, topic_id, top_n = 10L)

| Argument | Description |
|---|---|
| model | A "bertopic_r" model. |
| topic_id | Integer topic id. |
| top_n | Number of top terms to return. |

A tibble with columns term and weight.
Get topic info as a tibble
bertopic_topics(model)

| Argument | Description |
|---|---|
| model | A "bertopic_r" object returned by |

A tibble with topic-level information from Python get_topic_info().
Wrapper for Python BERTopic.topics_over_time(). Returns a tibble and
attaches the original Python dataframe in the "_py" attribute for use in
visualization.
bertopic_topics_over_time(
  model,
  docs,
  timestamps,
  nr_bins = NULL,
  datetime_format = NULL
)

| Argument | Description |
|---|---|
| model | A "bertopic_r" model. |
| docs | Character vector of documents. |
| timestamps | A vector of timestamps (Date, POSIXt, or character). |
| nr_bins | Optional number of temporal bins. |
| datetime_format | Optional strftime-style format if timestamps are strings. |

A tibble with topics-over-time data; attribute "_py" stores the
original Python dataframe.
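A hedged sketch of computing topics over time and passing the result to the matching visualization function. The daily timestamps, the bin count, the output path, and the package name `bertopic` are assumptions chosen for illustration.

```r
# Synthetic corpus (illustrative only).
docs <- as.vector(outer(
  c("The football match", "The election result", "The pasta recipe", "The new phone"),
  paste("was discussed in report", 1:30),
  paste
))
# One Date per document, as required by the timestamps argument.
timestamps <- seq(as.Date("2024-01-01"), by = "day", length.out = length(docs))
out_html <- file.path(tempdir(), "topics_over_time.html")

# Assumption: the package is installed as "bertopic"; skip quietly otherwise.
if (requireNamespace("bertopic", quietly = TRUE) && bertopic::bertopic_available()) {
  model <- bertopic::bertopic_fit(docs)
  tot <- bertopic::bertopic_topics_over_time(model, docs, timestamps, nr_bins = 6L)
  # The tibble carries the original Python dataframe in its "_py" attribute.
  bertopic::bertopic_visualize_topics_over_time(model, tot, top_n = 5L, file = out_html)
}
```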
Transform new documents with a fitted BERTopic model
bertopic_transform(model, new_text, embeddings = NULL)

| Argument | Description |
|---|---|
| model | A "bertopic_r" model from |
| new_text | Character vector of new documents. |
| embeddings | Optional numeric matrix for new documents. |

A list with topics and probs for the new documents.
Call Python BERTopic.update_topics() to recompute topic representations.
bertopic_update_topics(model, text)

| Argument | Description |
|---|---|
| model | A "bertopic_r" model. |
| text | Character vector of training documents used in |

The input model (invisibly), updated in place on the Python side.
Visualize a topic barchart
bertopic_visualize_barchart(model, topic_id = NULL, file = NULL)

| Argument | Description |
|---|---|
| model | A "bertopic_r" model. |
| topic_id | Integer topic id. If NULL, a set of top topics is shown. |
| file | Optional HTML output path. |

A barchart.
Wrapper around Python BERTopic.visualize_distribution(). This function
takes a single document's topic probability vector (e.g., one row from
probs) and returns an interactive Plotly figure as HTML or writes it
to disk.
bertopic_visualize_distribution(
  model,
  probs,
  min_probability = NULL,
  custom_labels = FALSE,
  title = NULL,
  width = NULL,
  height = NULL,
  file = NULL
)

| Argument | Description |
|---|---|
| model | A "bertopic_r" model. |
| probs | Numeric vector of topic probabilities for a single document. |
| min_probability | Optional numeric scalar. If provided, only probabilities greater than this value are visualized (forwarded to |
| custom_labels | Logical or character scalar. If logical, whether to use custom topic labels as set via |
| title | Optional character plot title. |
| width, height | Optional integer figure width/height in pixels. |
| file | Optional HTML output path. If NULL, an htmltools::HTML object is returned. |

If file is NULL, an htmltools::HTML object. Otherwise, the
normalized file path is returned invisibly.
Visualize embedded documents
bertopic_visualize_documents(model, docs = NULL, file = NULL)

| Argument | Description |
|---|---|
| model | A "bertopic_r" model. |
| docs | Optional character vector of documents to visualize. |
| file | Optional HTML output path. |

An HTML file.
Visualize topic similarity heatmap
bertopic_visualize_heatmap(model, file = NULL)

| Argument | Description |
|---|---|
| model | A "bertopic_r" model. |
| file | Optional HTML output path. |

An HTML file output.
Wrapper around Python BERTopic.visualize_hierarchical_documents().
This function visualizes documents and their topics in 2D at different
levels of a hierarchical topic structure.
bertopic_visualize_hierarchical_documents(
  model,
  docs,
  hierarchical_topics,
  topics = NULL,
  embeddings = NULL,
  reduced_embeddings = NULL,
  sample = NULL,
  hide_annotations = FALSE,
  hide_document_hover = TRUE,
  nr_levels = 10L,
  level_scale = c("linear", "log"),
  custom_labels = FALSE,
  title = NULL,
  width = NULL,
  height = NULL,
  file = NULL
)

| Argument | Description |
|---|---|
| model | A "bertopic_r" model. |
| docs | Character vector of documents used in |
| hierarchical_topics | A data frame or Python object as returned by |
| topics | Optional integer vector of topic IDs to visualize. |
| embeddings | Optional numeric matrix of document embeddings. |
| reduced_embeddings | Optional numeric matrix of 2D reduced embeddings. |
| sample | Optional numeric (0–1) or integer controlling subsampling of documents per topic (forwarded to Python). |
| hide_annotations | Logical; if TRUE, hide cluster labels in the plot. |
| hide_document_hover | Logical; if TRUE, hide document text on hover to speed up rendering. |
| nr_levels | Integer; number of hierarchy levels to display. |
| level_scale | Character, either "linear" or "log", controlling how hierarchy distances are scaled across levels. |
| custom_labels | Logical or character scalar controlling label behavior (forwarded to Python). |
| title | Optional character plot title. |
| width, height | Optional integer figure width/height in pixels. |
| file | Optional HTML output path. If NULL, an htmltools::HTML object is returned. |

If file is NULL, an htmltools::HTML object. Otherwise, the
normalized file path is returned invisibly.
Visualize hierarchical clustering of topics
bertopic_visualize_hierarchy(model, file = NULL)

| Argument | Description |
|---|---|
| model | A "bertopic_r" model. |
| file | Optional HTML output path. |

An HTML file output.
Visualize term rank evolution
bertopic_visualize_term_rank(model, file = NULL)

| Argument | Description |
|---|---|
| model | A "bertopic_r" model. |
| file | Optional HTML output path. |

No output. An HTML file will be saved.
Visualize topic map
bertopic_visualize_topics(model, file = NULL)

| Argument | Description |
|---|---|
| model | A "bertopic_r" model. |
| file | Optional HTML output path. If NULL, returns htmltools::HTML. |

An HTML file.
Visualize topics over time
bertopic_visualize_topics_over_time(
  model,
  topics_over_time,
  top_n = 10L,
  file = NULL
)

| Argument | Description |
|---|---|
| model | A "bertopic_r" model. |
| topics_over_time | A tibble returned by bertopic_topics_over_time(). |
| top_n | Number of topics to display. |
| file | Optional HTML output path. |

An HTML object.
Wrapper around Python BERTopic.visualize_topics_per_class(). This
visualizes how topics are distributed across a set of classes, using the
output of Python topics_per_class(docs, classes).
bertopic_visualize_topics_per_class(
  model,
  topics_per_class,
  top_n_topics = 10L,
  topics = NULL,
  normalize_frequency = FALSE,
  custom_labels = FALSE,
  title = NULL,
  width = NULL,
  height = NULL,
  file = NULL
)

| Argument | Description |
|---|---|
| model | A "bertopic_r" model. |
| topics_per_class | A data frame or Python object as returned by Python topics_per_class(). |
| top_n_topics | Integer; number of most frequent topics to display. |
| topics | Optional integer vector of topic IDs to include. |
| normalize_frequency | Logical; whether to normalize each topic's frequency within classes. |
| custom_labels | Logical or character scalar controlling label behavior (forwarded to Python). |
| title | Optional character plot title. |
| width, height | Optional integer figure width/height in pixels. |
| file | Optional HTML output path. If NULL, an htmltools::HTML object is returned. |

If file is NULL, an htmltools::HTML object. Otherwise, the
normalized file path is returned invisibly.
Coefficients (top terms) for BERTopic
## S3 method for class 'bertopic_r'
coef(object, top_n = 10L, ...)

| Argument | Description |
|---|---|
| object | A "bertopic_r" model. |
| top_n | Number of terms per topic. |
| ... | Unused. |

A data.frame with columns topic, term, weight.
Fortify method for ggplot2
fortify.bertopic_r(model, data, ...)

| Argument | Description |
|---|---|
| model | A "bertopic_r" model. |
| data | Ignored. |
| ... | Unused. |

A data.frame of document-topic assignments.
Tries Conda first (recommended). If Conda is unavailable, falls back to virtualenv. On success, prints which route was used.
install_py_deps(
  envname = "r-bertopic",
  python_version = "3.10",
  python = NULL,
  reinstall = FALSE,
  validate = TRUE,
  verbose = TRUE
)

| Argument | Description |
|---|---|
| envname | Character. Environment name (both routes). Default "r-bertopic". |
| python_version | Character. Python version for Conda route, e.g. "3.10". |
| python | Optional path to python for virtualenv route. |
| reinstall | Logical. Recreate the environment if it exists (route-specific). |
| validate | Logical. Attempt to validate imports if reticulate is not already initialized to another Python. |
| verbose | Logical. Print progress. |

Invisibly, the path to the selected Python interpreter.
Creates (or reuses) a Conda environment with a pinned Python toolchain,
installs the scientific stack + PyTorch (CPU) + sentence-transformers, then
installs bertopic==0.16.0 via pip. Optionally validates imports.
install_py_deps_conda(
  envname = "r-bertopic",
  python_version = "3.10",
  reinstall = FALSE,
  validate = TRUE,
  verbose = TRUE
)

| Argument | Description |
|---|---|
| envname | Character. Conda environment name. Default "r-bertopic". |
| python_version | Character. Python version to use, e.g. "3.10". |
| reinstall | Logical. If |
| validate | Logical. If |
| verbose | Logical. Print progress messages. |

Invisibly returns the path to the Python executable inside the env.
## Not run:
install_py_deps_conda(envname = "r-bertopic", python_version = "3.10")
## End(Not run)
Creates (or reuses) a virtualenv and installs bertopic==0.16.0
plus required dependencies via pip. Optionally validates imports.
install_py_deps_venv(
  envname = "r-bertopic",
  python = NULL,
  reinstall = FALSE,
  validate = TRUE,
  verbose = TRUE
)

| Argument | Description |
|---|---|
| envname | Character. Virtualenv name. Default "r-bertopic". |
| python | Character. Path to a Python executable to create the venv with. If |
| reinstall | Logical. If |
| validate | Logical. If |
| verbose | Logical. Print progress messages. |

Invisibly returns the path to the Python executable inside the venv.
## Not run:
install_py_deps_venv(envname = "r-bertopic")
## End(Not run)
Predict method for BERTopic models
## S3 method for class 'bertopic_r'
predict(
  object,
  newdata,
  type = c("both", "class", "prob"),
  embeddings = NULL,
  ...
)

| Argument | Description |
|---|---|
| object | A "bertopic_r" model. |
| newdata | Character vector of new documents. |
| type | One of "class", "prob", or "both". |
| embeddings | Optional numeric matrix of embeddings. |
| ... | Reserved for future arguments. |

Depending on type, an integer vector, a matrix/data frame, or a list.
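The three `type` values can be compared side by side. This is a sketch only: the package name `bertopic`, the synthetic training corpus, and the two new documents are assumptions, and the exact shape of the probability output is whatever the installed version returns.

```r
# Synthetic training corpus (illustrative only).
docs <- as.vector(outer(
  c("The football match", "The election result", "The pasta recipe", "The new phone"),
  paste("was discussed in report", 1:30),
  paste
))
new_docs <- c("The football match ended late", "The new phone has a better camera")

# Assumption: the package is installed as "bertopic"; skip quietly otherwise.
if (requireNamespace("bertopic", quietly = TRUE) && bertopic::bertopic_available()) {
  model <- bertopic::bertopic_fit(docs)
  classes <- predict(model, new_docs, type = "class")  # topic assignments only
  probs <- predict(model, new_docs, type = "prob")     # probabilities only
  both <- predict(model, new_docs, type = "both")      # list holding both pieces
  str(both)
}
```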
Print method for bertopic_r
## S3 method for class 'bertopic_r'
print(x, ...)

| Argument | Description |
|---|---|
| x | A "bertopic_r" object. |
| ... | Unused. |

No return value. Output will be printed.
Set random seed for R and Python backends
set_bertopic_seed(seed)

| Argument | Description |
|---|---|
| seed | Integer seed. |

No return value. The seed will be changed.
A cleaned subset of the UCI SMS Spam Collection, suitable for quick examples and tests in this package. Each row is an SMS message labeled as "ham" or "spam".
sms_spam
A data frame with two columns:
Character, either "ham" or "spam".
Character, the SMS message content (UTF-8).
This dataset is included for educational/demo purposes. If you use it in publications, please cite the original authors and the UCI repository page.
UCI Machine Learning Repository: SMS Spam Collection. Dataset page: https://archive.ics.uci.edu/dataset/228/sms+spam+collection Original citation: Almeida, T.A., Hidalgo, J.M.G., & Yamakami, A. (2011). Contributions to the Study of SMS Spam Filtering: New Collection and Results.
data(sms_spam)
head(sms_spam)
Summary for BERTopic models
## S3 method for class 'bertopic_r'
summary(object, ...)

| Argument | Description |
|---|---|
| object | A "bertopic_r" model. |
| ... | Unused. |

Invisibly returns a named list of summary fields.
If a Conda env with the given name exists, prefer Conda; otherwise try a virtualenv with the same name. Stops if neither exists.
use_bertopic(envname = "r-bertopic")

| Argument | Description |
|---|---|
| envname | Character. Environment name. Default "r-bertopic". |

Invisibly, the Python executable path.
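A possible one-time setup flow is sketched below: install the Python dependencies, bind the session to the environment, then confirm availability. The `run_setup` switch is a hypothetical guard added so the block does nothing until it is flipped on, and the package name `bertopic` is an assumption. Calling `use_bertopic()` in a fresh R session may be safer, since it is documented to stop when reticulate is already initialized to a different Python.

```r
envname <- "r-bertopic"
# Hypothetical switch: set to TRUE to actually create the environment.
run_setup <- FALSE

# Assumption: the package is installed as "bertopic"; skip quietly otherwise.
if (run_setup && requireNamespace("bertopic", quietly = TRUE)) {
  # Tries Conda first and falls back to virtualenv, as described for install_py_deps().
  py <- bertopic::install_py_deps(envname = envname, python_version = "3.10")
  message("Python interpreter: ", py)
  # Bind this session to the environment that was just created.
  bertopic::use_bertopic(envname)
  print(bertopic::bertopic_available())
}
```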
Sets RETICULATE_PYTHON to the environment's Python and initializes
reticulate. If reticulate is already initialized to a different
Python, this stops with an informative error.
use_bertopic_condaenv(envname = "r-bertopic", required = TRUE)

| Argument | Description |
|---|---|
| envname | Character. Conda env name (default "r-bertopic"). |
| required | Logical. Kept for API symmetry; unused. |

Invisibly returns the Python executable path in the env.
## Not run:
use_bertopic_condaenv("r-bertopic")
## End(Not run)
Sets RETICULATE_PYTHON to the Python inside the given virtualenv and
initializes reticulate. If reticulate is already initialized to a
different Python, this stops with an informative error.
use_bertopic_virtualenv(envname = "r-bertopic", required = TRUE)

| Argument | Description |
|---|---|
| envname | Character. Virtualenv name (default "r-bertopic"). |
| required | Logical. Kept for API symmetry; unused. |

Invisibly returns the Python executable path in the venv.
## Not run:
use_bertopic_virtualenv("r-bertopic")
## End(Not run)