Compare topic-word distributions using Hellinger distance
Source: R/compare_betas.R
Computes pairwise Hellinger distances between topics from one or two fitted models. Tokens missing from a beta dataframe are filled with probability 0 before comparison, so the two models do not need to share the same vocabulary.
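For reference, the Hellinger distance between two discrete distributions P = (p_w) and Q = (q_w) over a vocabulary V is conventionally written as below. This is the standard normalized form, bounded in [0, 1]. The page does not state which normalization compare_betas uses, so treat this as an assumption, although the values in the Examples all fall within that range. The second equality holds only when both distributions sum to 1.

```latex
H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{w \in V} \left( \sqrt{p_w} - \sqrt{q_w} \right)^2}
        = \sqrt{1 - \sum_{w \in V} \sqrt{p_w \, q_w}}
```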
Arguments
- beta1
Tidy beta dataframe with columns token, topic, and beta, as returned by tidy(model, matrix = 'beta').
- beta2
Optional second tidy beta dataframe in the same format. If NULL (default), pairwise distances among the topics in beta1 are returned.
- vocab
Character vector of vocabulary tokens used to align the column space of both matrices. Tokens in beta1 or beta2 that are not in vocab are dropped; tokens in vocab absent from a beta are filled with probability 0.
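The alignment that vocab describes can be sketched in base R as follows. The helper align_beta is hypothetical and is not the package's implementation; it only illustrates dropping out-of-vocabulary tokens and zero-filling missing ones when a tidy beta dataframe is pivoted into a topics × vocab matrix.

```r
# Hypothetical helper (an assumption, not part of the package): build a
# topics x vocab matrix from a tidy beta dataframe.
align_beta <- function(beta, vocab) {
  topics <- unique(as.character(beta$topic))
  # Start from all zeros so tokens in vocab that a topic lacks stay at 0
  mat <- matrix(0, nrow = length(topics), ncol = length(vocab),
                dimnames = list(topics, vocab))
  # Tokens not in vocab are dropped
  keep <- beta$token %in% vocab
  idx <- cbind(match(as.character(beta$topic)[keep], topics),
               match(beta$token[keep], vocab))
  mat[idx] <- beta$beta[keep]
  mat
}

beta <- data.frame(
  token = c("a", "b", "z", "a", "c"),
  topic = c("t1", "t1", "t1", "t2", "t2"),
  beta  = c(0.6, 0.3, 0.1, 0.5, 0.5)
)
aligned <- align_beta(beta, vocab = c("a", "b", "c"))
aligned
```

In this sketch, dropping token z leaves row t1 summing to 0.9 rather than 1. Whether compare_betas renormalizes rows after dropping tokens is not stated on this page.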
Value
Numeric matrix of Hellinger distances. Dimensions are k1 × k1 when
beta2 = NULL, or k1 × k2 when two beta dataframes are supplied, where
k1 and k2 are the number of topics in each model.
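As a rough illustration of how a k1 × k2 distance matrix of this shape can arise, the sketch below computes normalized Hellinger distances between the rows of two topics × vocab matrices in base R. The function hellinger_matrix is hypothetical; it assumes each row sums to 1 and uses the Bhattacharyya-coefficient form sqrt(1 - sum(sqrt(p * q))), which may differ from what compare_betas does internally.

```r
# Hypothetical sketch (an assumption, not the package's code): pairwise
# Hellinger distances between the rows of m1 and the rows of m2.
hellinger_matrix <- function(m1, m2 = m1) {
  # Bhattacharyya coefficient sum_w sqrt(p_w * q_w) for every topic pair
  bc <- sqrt(m1) %*% t(sqrt(m2))
  # Clamp at 0 in case rounding pushes a coefficient slightly above 1
  sqrt(pmax(1 - bc, 0))
}

m <- rbind(
  t1 = c(1, 0),      # all mass on the first token
  t2 = c(0, 1),      # all mass on the second token
  t3 = c(0.5, 0.5)   # mass split evenly
)
d <- hellinger_matrix(m)
d
```

Distributions with disjoint support (t1 and t2) are at the maximum distance of 1, and identical distributions are at distance 0. In a form like this, rounding in 1 - bc can leave a diagonal entry very slightly above 0.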
Examples
set.seed(42)
vocab = letters[1:5]
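# Draw k random topic-word distributions and reshape them into tidy beta format.
# rdirichlet() is assumed to be in scope; the example does not show its source.
make_beta = function(k) {
  rdirichlet(k, rep(1, length(vocab))) |>
    tibble::as_tibble(.name_repair = ~vocab) |>
    dplyr::mutate(topic = paste0('t', dplyr::row_number())) |>
    tidyr::pivot_longer(-topic, names_to = 'token', values_to = 'beta')
}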
beta1 = make_beta(3)
beta2 = make_beta(4)
compare_betas(beta1, vocab = vocab)
#> 3 x 3 Matrix of class "dgeMatrix"
#>           t1           t2        t3
#> t1 0.0000000 5.934778e-01 0.2699147
#> t2 0.5934778 1.825012e-08 0.4786181
#> t3 0.2699147 4.786181e-01 0.0000000
compare_betas(beta1, beta2, vocab = vocab)
#> 3 x 4 Matrix of class "dgeMatrix"
#>           t1        t2        t3        t4
#> t1 0.3477229 0.3744788 0.3449200 0.6172048
#> t2 0.4991166 0.3211290 0.4987743 0.4733966
#> t3 0.2468361 0.2657232 0.2429489 0.4208010