Draw a collection of documents

Usage

draw_corpus(N, theta, phi)

Arguments

N

Vector of document lengths (number of word tokens to draw for each document), one entry per document

theta

Topic distributions for all documents, \(n \times k\) matrix (one row per document, one column per topic)

phi

Word distributions for all topics, \(k \times v\) matrix (one row per topic, one column per vocabulary word)

Value

Document-term matrix in long format: a tibble with columns doc, word, and n, where n is the number of times word occurs in doc
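Some downstream tools expect a wide documents-by-words matrix rather than the long doc/word/n layout. The following is a minimal base-R sketch of that reshaping, not part of the package. It retypes the six rows printed in Examples as a plain data frame; the names `long`, `dtm`, and `vocab_size` are illustrative assumptions.

```r
# First six rows of the Examples output, retyped as a plain data frame
long = data.frame(
  doc  = c(1L, 1L, 1L, 1L, 1L, 1L),
  word = c(1L, 2L, 12L, 15L, 17L, 19L),
  n    = c(3L, 16L, 1L, 6L, 3L, 17L)
)

# Assumed vocabulary size; matches k = 20 in the Examples call for phi
vocab_size = 20L

# Fix the factor levels so words with zero count still get a column
long$word = factor(long$word, levels = seq_len(vocab_size))

# Cross-tabulate counts into a documents-by-words matrix
dtm = unclass(xtabs(n ~ doc + word, data = long))
dim(dtm)
```

Words that never occur keep a column of zeros because the factor levels span the whole vocabulary; without that step the matrix would only have columns for observed words.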

Details

Standard pattern for generating a simulated document-term matrix (DTM) suitable for tmfast():

set.seed(42)
theta  = rdirichlet(n_docs,   alpha = 1,   k = n_topics)
phi    = rdirichlet(n_topics, alpha = 0.1, k = vocab_size)
corpus = draw_corpus(rep(doc_length, n_docs), theta, phi)
model  = tmfast(corpus, n = n_topics)

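Here n_docs, n_topics, vocab_size, and doc_length are placeholders to be set beforehand. With alpha = 1 for theta, each document's topic mixture is drawn uniformly from the simplex, so no topic is favored a priori. With alpha = 0.1 for phi, each topic's word distribution is sparse, concentrating its mass on a few topic-specific words. doc_length should be large enough that the full vocabulary is likely to appear (50–200 words per document is typical for a small simulated example).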
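The generative process behind this pattern can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: `rdirichlet_sketch()` and `draw_corpus_sketch()` are hypothetical stand-ins for rdirichlet() and draw_corpus(), and the sketch returns a plain data frame rather than a tibble.

```r
# Hypothetical helper: symmetric Dirichlet draws via normalized gamma variates.
# Defined here only so the sketch is self-contained.
rdirichlet_sketch = function(n, alpha, k) {
  g = matrix(rgamma(n * k, shape = alpha), nrow = n, ncol = k)
  g / rowSums(g)
}

# Hypothetical stand-in for draw_corpus(): two-stage multinomial sampling.
draw_corpus_sketch = function(N, theta, phi) {
  docs = lapply(seq_along(N), function(i) {
    # Stage 1: split the N[i] tokens of document i across topics
    z = rmultinom(1, size = N[i], prob = theta[i, ])[, 1]
    # Stage 2: for each topic, spread its tokens over the vocabulary
    counts = integer(ncol(phi))
    for (topic in seq_along(z)) {
      if (z[topic] > 0) {
        counts = counts + rmultinom(1, size = z[topic], prob = phi[topic, ])[, 1]
      }
    }
    # Keep only words that actually occur, in long doc/word/n form
    keep = which(counts > 0)
    data.frame(doc = rep(i, length(keep)), word = keep, n = counts[keep])
  })
  do.call(rbind, docs)
}

set.seed(42)
theta  = rdirichlet_sketch(30, alpha = 1,   k = 3)
phi    = rdirichlet_sketch(3,  alpha = 0.1, k = 20)
corpus = draw_corpus_sketch(rep(50L, 30), theta, phi)
head(corpus)
```

Because the sketch uses its own sampler, its draws will not match the package's output for the same seed. The structure is the same, though: the counts in n sum to the requested length within each document.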

See also

Other generators: journal_specific(), peak_alpha(), rdirichlet()

Examples

# \donttest{
set.seed(42)
theta  = rdirichlet(30, 1, k = 3)
phi    = rdirichlet(3, 0.1, k = 20)
corpus = draw_corpus(rep(50L, 30), theta, phi)
head(corpus)
#> # A tibble: 6 × 3
#>     doc  word     n
#>   <int> <int> <int>
#> 1     1     1     3
#> 2     1     2    16
#> 3     1    12     1
#> 4     1    15     6
#> 5     1    17     3
#> 6     1    19    17
# }