fastrtext — fastText for R, without the papercuts

Posted on: February 15, 2020

Modern NLP work in R has long been split between elegant data tooling and pragmatic compromises. If you wanted a fast baseline classifier, or robust word vectors with subword information, you usually left R, called out to a Python or C++ binary, and stitched results back with temp files. That’s brittle and slow for day-to-day analysis.

fastrtext is a thin, native R interface to fastText, the C++ library by Facebook/Meta for efficient text classification and word representations. It exposes the fastText command set inside the R session, so you can train, load, and predict without dropping to the shell. Training still reads its input from a text file and writes the model to disk, but prediction and vector lookups work directly on R objects.

Why fastText here, now

What fastrtext adds for R users

Five-minute classifier (supervised)

library(fastrtext)

# toy data shipped with the package
data("train_sentences"); data("test_sentences")

# fastText expects "__label__<class> <text>"
train_txt <- tempfile()
writeLines(
  paste0("__label__", train_sentences$class.text, " ",
         tolower(train_sentences$text)),
  train_txt
)

# train a small model through the CLI-compatible API (the model is written to model_path with a .bin extension)
model_path <- tempfile()
execute(c("supervised",
          "-input",  train_txt,
          "-output", model_path,
          "-dim", "50", "-epoch", "15",
          "-wordNgrams", "2", "-verbose", "1"))

m <- load_model(model_path)

# top-1 prediction; with simplify = TRUE the result is a named numeric vector
# (values = probabilities, names = predicted labels)
p <- predict(m, tolower(test_sentences$text), k = 1, simplify = TRUE)

# quick accuracy check
acc <- mean(names(p) == test_sentences$class.text)
acc
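
The training file above assumes every text fits on a single line, because fastText reads one example per line. Real data often contains embedded newlines or labels with spaces. The following is a minimal sketch of a cleaning step in base R; `to_fasttext_lines` is a hypothetical helper name, not a fastrtext function, and the inputs are toy values invented for illustration.

```r
# Hypothetical helper (not part of fastrtext): build training lines of the
# form "__label__<class> <text>", one example per line.
to_fasttext_lines <- function(labels, texts) {
  stopifnot(length(labels) == length(texts))
  # A label must be a single token, so replace whitespace inside it.
  # Note: this changes the label string, so compare predictions against
  # the cleaned labels, not the raw ones.
  labels <- gsub("[[:space:]]+", "_", trimws(as.character(labels)))
  # Collapse any run of whitespace (newlines, tabs, repeated spaces)
  # into a single space so each example stays on one line.
  texts <- gsub("[[:space:]]+", " ", trimws(tolower(as.character(texts))))
  paste0("__label__", labels, " ", texts)
}

# Toy input, invented for illustration
train_lines <- to_fasttext_lines(
  labels = c("AIMX", "own method"),
  texts  = c("We propose\na new model.", "  Results\tare shown below.  ")
)
print(train_lines)
```

The resulting character vector can be passed to `writeLines()` exactly as in the example above.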

The point isn’t to beat a BiLSTM on a benchmark; it’s to give you a respectable, production‑tolerant baseline in minutes, all in R.
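
A single accuracy number hides which classes the baseline confuses. As a rough sketch, a confusion table and per-class recall can be computed in base R. The vectors here are toy values invented for illustration; in the classifier example, `predicted` would be `names(p)` and `truth` would be `test_sentences$class.text`.

```r
# Toy vectors, invented for illustration (not real model output).
truth     <- c("AIMX", "OWNX", "OWNX", "MISC", "AIMX", "OWNX")
predicted <- c("AIMX", "OWNX", "MISC", "MISC", "OWNX", "OWNX")

# Share one set of levels so the table stays square even if a class
# is never predicted.
classes <- sort(unique(c(truth, predicted)))
confusion <- table(
  truth     = factor(truth, levels = classes),
  predicted = factor(predicted, levels = classes)
)
print(confusion)

# Per-class recall: correct predictions divided by true examples per class.
recall <- diag(confusion) / rowSums(confusion)
print(round(recall, 2))

# Overall accuracy, equivalent to mean(predicted == truth).
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```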

Word vectors in two calls (unsupervised)

library(fastrtext)

data("train_sentences")
corpus_txt <- tempfile()
writeLines(tolower(train_sentences$text), corpus_txt)

vec_model <- tempfile()
execute(c("skipgram", "-input", corpus_txt, "-output", vec_model))

mv <- load_model(vec_model)
get_nn(mv, "time", 10)  # nearest neighbors by cosine distance

If you have a domain corpus (support emails, product titles, clinical notes), these vectors are often a sensible baseline for downstream similarity or lightweight classifiers.
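
For the downstream-similarity case, one common approach is pairwise cosine similarity over the word vectors. The sketch below uses a toy matrix invented for illustration; with a trained model, the matrix would presumably come from fastrtext's `get_word_vectors(mv, words)`, which returns one row per word. `cosine_sim` is a hypothetical helper name.

```r
# Hypothetical helper: cosine similarity between every pair of rows.
cosine_sim <- function(m) {
  norms <- sqrt(rowSums(m^2))  # L2 norm of each row
  unit <- m / norms            # recycling divides row i by norms[i]
  unit %*% t(unit)             # dot products of unit-length rows
}

# Toy 3-dimensional vectors, invented for illustration. With a real model,
# replace this with the matrix of vectors for the words of interest.
vecs <- rbind(
  time   = c(1, 0, 1),
  clock  = c(2, 0, 2),
  banana = c(0, 3, 0)
)

sims <- cosine_sim(vecs)
print(round(sims, 2))
```

Rows that point in the same direction score 1 regardless of length, and orthogonal rows score 0, which is why the measure is a common default for comparing embeddings.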

R first, without giving up fastText

Where to start