
Computes pairwise Hellinger distances between the topics of one or two fitted topic models. Tokens missing from a beta dataframe are filled with probability 0 before comparison, so the two models need not share the same vocabulary.

Usage

compare_betas(beta1, beta2 = NULL, vocab)

Arguments

beta1

Tidy beta dataframe with columns token, topic, and beta, as returned by tidy(model, matrix = 'beta').

beta2

Optional second tidy beta dataframe in the same format. If NULL (default), pairwise distances among the topics in beta1 are returned.

vocab

Character vector of vocabulary tokens that defines the shared column space of both topic-by-token matrices. Tokens in beta1 or beta2 that are not in vocab are dropped; tokens in vocab that are absent from a beta dataframe are filled with probability 0.
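
The alignment described for vocab can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: align_beta() is a hypothetical helper, and whether compare_betas() renormalizes rows after dropping tokens is not stated here, so the sketch leaves them as they are.

```r
# Hypothetical helper (not part of the package): reshape a tidy beta
# dataframe into a topics x vocab matrix, dropping tokens outside vocab
# and zero-filling tokens that a topic does not list.
align_beta <- function(beta, vocab) {
  topic <- as.character(beta$topic)
  token <- as.character(beta$token)
  topics <- unique(topic)

  # Start from an all-zero matrix so absent tokens keep probability 0
  m <- matrix(0, nrow = length(topics), ncol = length(vocab),
              dimnames = list(topics, vocab))

  # Keep only rows whose token is in the vocabulary
  keep <- token %in% vocab

  # Fill observed probabilities by (topic, token) name
  m[cbind(topic[keep], token[keep])] <- beta$beta[keep]
  m
}

# Toy input: token "z" is outside vocab; "c" is missing from t1, "b" from t2
beta <- data.frame(
  token = c("a", "b", "z", "a", "c"),
  topic = c("t1", "t1", "t1", "t2", "t2"),
  beta  = c(0.6, 0.3, 0.1, 0.5, 0.5)
)
aligned <- align_beta(beta, vocab = c("a", "b", "c"))
aligned
```

In this toy input, row t1 no longer sums to 1 once "z" is dropped, which is why the choice of vocab matters when comparing models.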

Value

Numeric matrix of Hellinger distances. Dimensions are k1 × k1 when beta2 = NULL, or k1 × k2 when two beta dataframes are supplied, where k1 and k2 are the number of topics in each model.
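
For reference, the Hellinger distance between two discrete distributions p and q is sqrt(sum((sqrt(p) - sqrt(q))^2) / 2), which lies in [0, 1]. The sketch below computes the full distance matrix in base R. It assumes both inputs are already topics x tokens matrices over the same columns; hellinger() is a hypothetical name, and this is not necessarily how compare_betas() computes its result.

```r
# Hypothetical helper (not part of the package): pairwise Hellinger
# distances between the rows of m1 and the rows of m2.
hellinger <- function(m1, m2 = m1) {
  s1 <- sqrt(m1)
  s2 <- sqrt(m2)

  # Squared Euclidean distance between square-rooted rows:
  # |a|^2 + |b|^2 - 2 * a.b, where |sqrt(p)|^2 is just sum(p)
  d2 <- outer(rowSums(m1), rowSums(m2), "+") - 2 * s1 %*% t(s2)

  # Clip tiny negative values caused by floating-point rounding
  sqrt(pmax(d2, 0) / 2)
}

m <- rbind(
  t1 = c(a = 1,   b = 0),
  t2 = c(a = 0,   b = 1),
  t3 = c(a = 0.5, b = 0.5)
)
d <- hellinger(m)
d  # t1 and t2 have disjoint support, so their distance is 1
```

With this kind of matrix formulation, diagonal entries can come out as very small positive numbers rather than exactly 0 because of rounding.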

Examples

set.seed(42)
vocab = letters[1:5]
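# rdirichlet() is called without a namespace; it is assumed to come from an
# attached package (gtools and MCMCpack each export one with this signature).
make_beta = function(k) {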
  rdirichlet(k, rep(1, length(vocab))) |>
    tibble::as_tibble(.name_repair = ~vocab) |>
    dplyr::mutate(topic = paste0('t', dplyr::row_number())) |>
    tidyr::pivot_longer(-topic, names_to = 'token', values_to = 'beta')
}
beta1 = make_beta(3)
beta2 = make_beta(4)
compare_betas(beta1, vocab = vocab)
#> 3 x 3 Matrix of class "dgeMatrix"
#>           t1           t2        t3
#> t1 0.0000000 5.934778e-01 0.2699147
#> t2 0.5934778 1.825012e-08 0.4786181
#> t3 0.2699147 4.786181e-01 0.0000000
compare_betas(beta1, beta2, vocab = vocab)
#> 3 x 4 Matrix of class "dgeMatrix"
#>           t1        t2        t3        t4
#> t1 0.3477229 0.3744788 0.3449200 0.6172048
#> t2 0.4991166 0.3211290 0.4987743 0.4733966
#> t3 0.2468361 0.2657232 0.2429489 0.4208010