Modern NLP work in R has long been split between elegant data tooling and pragmatic compromises. If you wanted a fast baseline classifier or robust word vectors with subword information, you usually left R, called out to a Python or C++ binary, and stitched the results back together with temp files. That’s brittle and slow for day-to-day analysis.
fastrtext is a thin, native R interface to fastText, the C++ library by Facebook/Meta for efficient text classification and word representations. It brings the fastText CLI into memory in R—train, load, and predict without dropping to the shell or juggling intermediate files.
Why fastText here, now
- Strong baselines on CPU. fastText’s “bag of tricks” architecture trains quickly and sets a solid baseline for many real-world text tasks, especially when GPUs aren’t an option. Its subword (character n-gram) models also handle rare words and morphology gracefully.
- Multilingual reach. Pre-trained vectors exist for well over a hundred languages, which matters when projects span markets.
- Small models, if you need them. Quantization is supported upstream for tight deployment footprints (a quick sketch follows).
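As an illustration of that last point, here is a hedged sketch of quantizing a trained supervised model from R. It assumes execute() forwards the upstream quantize subcommand the same way it forwards supervised (it routes arguments to the same CLI entry point), and it reuses train_txt and model_path from the classifier example below:
# shrink <model_path>.bin into a much smaller <model_path>.ftz
execute(c("quantize", "-input", train_txt, "-output", model_path,
          "-qnorm", "-retrain", "-epoch", "1", "-cutoff", "50000"))
m_small <- load_model(paste0(model_path, ".ftz"))  # assumes .ftz paths load like .bin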
What fastrtext adds for R users
- One function to do what the CLI does. execute() mirrors the CLI (supervised, skipgram, cbow, the usual flags, etc.), but you stay in R.
- In-memory prediction. Load a model with load_model() and call predict() on character vectors, with no disk round-trips.
- Usable vector ops. Retrieve the dictionary, extract word vectors, and get nearest neighbors straight from R (a short sketch follows).
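Concretely, once a model is loaded (the examples below build both a classifier and a skipgram model), the helper calls look like this; mv stands in for any loaded model:
words <- head(get_dictionary(mv), 5)  # peek at the model's vocabulary
vecs <- get_word_vectors(mv, words)   # matrix with one row per word
dim(vecs)                             # n_words x embedding dimension
get_nn(mv, words[1], 5)               # 5 nearest neighbors of the first word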
Five-minute classifier (supervised)
library(fastrtext)
# toy data shipped with the package
data("train_sentences"); data("test_sentences")
# fastText expects "__label__<class> <text>"
train_txt <- tempfile()
writeLines(
paste0("__label__", train_sentences$class.text, " ",
tolower(train_sentences$text)),
train_txt
)
# train a small model via the CLI-compatible API (writes <model_path>.bin to disk)
model_path <- tempfile()
execute(c("supervised",
"-input", train_txt,
"-output", model_path,
"-dim", "50", "-epoch", "15",
"-wordNgrams", "2", "-verbose", "1"))
m <- load_model(model_path)
# top-1 prediction; simplify = TRUE gives a named numeric vector of
# probabilities whose names are the predicted labels
p <- predict(m, tolower(test_sentences$text), k = 1, simplify = TRUE)
# quick accuracy check
acc <- mean(names(p) == test_sentences$class.text)
acc
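One note on the prediction API: with k > 1 and the default simplify = FALSE, predict() returns one named probability vector per sentence, which is handy for multi-label data or for inspecting ranked candidates:
# top-3 candidate labels per sentence, as a list of named numeric vectors
p3 <- predict(m, tolower(test_sentences$text), k = 3)
p3[[1]]  # probabilities of the three most likely labels for sentence 1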
The point isn’t to beat a BiLSTM on a benchmark; it’s to give you a respectable, production-serviceable baseline in minutes, all in R.
Word vectors in two calls (unsupervised)
library(fastrtext)
data("train_sentences")
corpus_txt <- tempfile()
writeLines(tolower(train_sentences$text), corpus_txt)
vec_model <- tempfile()
execute(c("skipgram", "-input", corpus_txt, "-output", vec_model))
mv <- load_model(vec_model)
get_nn(mv, "time", 10) # 10 nearest neighbors by cosine similarity
If you have a domain corpus (support emails, product titles, clinical notes), these vectors are often a sensible starting point for downstream similarity or lightweight classifiers.
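As a minimal sketch of that idea, here is an averaged-vector similarity between two sentences; sentence_vec and cosine are ad-hoc helpers defined here, not part of fastrtext:
# hypothetical helper: average the word vectors of a whitespace-tokenized sentence
sentence_vec <- function(model, s) {
  toks <- unlist(strsplit(tolower(s), "\\s+"))
  toks <- intersect(toks, get_dictionary(model))  # keep in-vocabulary tokens
  colMeans(get_word_vectors(model, toks))
}
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))
a <- sentence_vec(mv, "response time of the system")
b <- sentence_vec(mv, "the system answered quickly")
cosine(a, b)  # crude similarity score for the two sentences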
R first, without giving up fastText
- The API mirrors the CLI (print_help(), training flags, quantization options), so docs and tutorials from upstream carry over cleanly.
- Small helper affordances (Hamming loss, dictionary access) make quick iteration comfortable in RStudio and notebooks (sketched below).
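A minimal sketch of those affordances, reusing the supervised model m from the classifier example (each sentence there carries a single true label):
preds <- predict(m, tolower(test_sentences$text), k = 1)
get_hamming_loss(as.list(test_sentences$class.text), preds)  # 0 means perfect
print_help()  # the full upstream flag reference, without leaving R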
Where to start
- Repo: github.com/pommedeterresautee/fastrtext
- Docs: package site with “Supervised” and “Unsupervised” quickstarts.
- Upstream background: fastText website + papers (Bojanowski et al. on subword vectors; Joulin et al. on efficient text classification).