Draw a collection of documents
Details
Standard pattern for generating a simulated DTM suitable for tmfast():
set.seed(42)
theta = rdirichlet(n_docs, alpha = 1, k = n_topics)
phi = rdirichlet(n_topics, alpha = 0.1, k = vocab_size)
corpus = draw_corpus(rep(doc_length, n_docs), theta, phi)
model = tmfast(corpus, n = n_topics)
alpha = 1 for theta draws each document's topic mixture uniformly from the
simplex; alpha = 0.1 for phi gives sparse, topic-specific word distributions.
doc_length should be large enough that the full vocabulary is likely to appear
(50–200 words per document is typical for a small simulated example).
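To make the roles of the two alpha values and of doc_length concrete, the following base-R sketch mimics the generative process described above without using the package. It is an illustration under stated assumptions, not the implementation of draw_corpus() or rdirichlet(): the helper rdirichlet_sketch() and the object names counts, coverage and corpus_sketch are hypothetical, and the word counts are drawn from a single multinomial per document with probabilities theta %*% phi.

```r
# Hypothetical stand-in for a symmetric Dirichlet sampler (NOT the package's
# rdirichlet()): normalize independent Gamma(alpha, 1) draws row by row.
rdirichlet_sketch = function(n, alpha, k) {
  g = matrix(rgamma(n * k, shape = alpha), nrow = n, ncol = k)
  g / rowSums(g)
}

set.seed(42)
n_docs = 30
n_topics = 3
vocab_size = 20
doc_length = 50

# alpha = 1: topic mixtures spread uniformly over the simplex
theta = rdirichlet_sketch(n_docs, 1, n_topics)
# alpha = 0.1: each topic puts most of its mass on a few words
phi = rdirichlet_sketch(n_topics, 0.1, vocab_size)

# Per-document word probabilities: mix the topic-word rows by theta
word_probs = theta %*% phi

# One multinomial draw of doc_length tokens per document;
# apply() returns vocab_size x n_docs, so transpose to n_docs x vocab_size
counts = t(apply(word_probs, 1, function(p) rmultinom(1, doc_length, p)))

# Share of the vocabulary that appears at least once in the whole corpus
coverage = mean(colSums(counts) > 0)

# Long format with one row per nonzero (doc, word) pair
idx = which(counts > 0, arr.ind = TRUE)
corpus_sketch = data.frame(doc = idx[, "row"], word = idx[, "col"],
                           n = counts[idx])
corpus_sketch = corpus_sketch[order(corpus_sketch$doc, corpus_sketch$word), ]
```

Raising doc_length or n_docs in this sketch pushes coverage toward 1, while lowering the phi alpha concentrates each topic on fewer words and makes full coverage harder to reach.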
See also
Other generators:
journal_specific(),
peak_alpha(),
rdirichlet()
Examples
# \donttest{
set.seed(42)
theta = rdirichlet(30, 1, k = 3)
phi = rdirichlet(3, 0.1, k = 20)
corpus = draw_corpus(rep(50L, 30), theta, phi)
head(corpus)
#> # A tibble: 6 × 3
#>     doc  word     n
#>   <int> <int> <int>
#> 1     1     1     3
#> 2     1     2    16
#> 3     1    12     1
#> 4     1    15     6
#> 5     1    17     3
#> 6     1    19    17
# }