Compare topic-word distributions using Hellinger distance

Computes pairwise Hellinger distances between the topic-word distributions of one or two fitted models. Tokens in vocab that are missing from a beta dataframe are filled with probability 0 before comparison, so the two models need not share the same vocabulary.
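
For reference, the standard definition of the Hellinger distance between two discrete distributions over a vocabulary V is given below. The symbols P, Q, p_w, q_w, and V are notation introduced here for illustration. That compare_betas uses this normalization is an assumption, though one consistent with the example output, where every value lies in [0, 1].

```latex
H(P, Q)
  = \frac{1}{\sqrt{2}} \sqrt{\sum_{w \in V} \left(\sqrt{p_w} - \sqrt{q_w}\right)^2}
  = \sqrt{1 - \sum_{w \in V} \sqrt{p_w \, q_w}}
```

The second form holds only when both distributions sum to 1 over V. It is also sensitive to floating-point rounding when P = Q, which may explain a near-zero rather than exactly zero diagonal entry such as 1.825012e-08 in the example output.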

Usage

compare_betas(beta1, beta2 = NULL, vocab)

Arguments

beta1: Tidy beta dataframe with columns token, topic, and beta, as returned by tidy(model, matrix = 'beta').
beta2: Optional second tidy beta dataframe in the same format. If NULL (default), pairwise distances among the topics in beta1 are returned.
vocab: Character vector of vocabulary tokens used to align the column space of both matrices. Tokens in beta1 or beta2 that are not in vocab are dropped; tokens in vocab that are absent from a beta dataframe are filled with probability 0.

Value

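Numeric matrix of Hellinger distances, with rows indexed by the topics of beta1 and columns by the topics of beta2 (or of beta1 again when beta2 = NULL). Dimensions are k1 × k1 when beta2 = NULL, or k1 × k2 when two beta dataframes are supplied, where k1 and k2 are the numbers of topics in beta1 and beta2. As the examples show, the result may print as a "dgeMatrix" from the Matrix package rather than as a base R matrix.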
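
The alignment and distance steps described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: hellinger_sketch and to_matrix are hypothetical names, the result is a dense base matrix, and each topic is assumed to sum to 1 over vocab.

```r
# Hypothetical helper, not part of the package: a dense base-R sketch.
hellinger_sketch = function(beta1, beta2 = NULL, vocab) {
  # Spread a tidy beta dataframe into a topics x vocab matrix,
  # filling tokens absent from the dataframe with probability 0.
  to_matrix = function(beta) {
    topics = as.character(unique(beta$topic))
    m = matrix(0, nrow = length(topics), ncol = length(vocab),
               dimnames = list(topics, vocab))
    keep = beta$token %in% vocab  # tokens outside vocab are dropped
    idx = cbind(as.character(beta$topic[keep]),
                as.character(beta$token[keep]))
    m[idx] = beta$beta[keep]
    m
  }
  p = to_matrix(beta1)
  q = if (is.null(beta2)) p else to_matrix(beta2)
  # Bhattacharyya coefficients between every pair of topics
  bc = sqrt(p) %*% t(sqrt(q))
  # Assumes rows sum to 1; clamp small negatives caused by rounding
  sqrt(pmax(1 - bc, 0))
}

vocab = c('a', 'b', 'c')
beta1 = data.frame(
  token = c('a', 'b', 'a', 'c'),
  topic = c('t1', 't1', 't2', 't2'),
  beta  = c(0.5, 0.5, 0.5, 0.5)
)
beta2 = data.frame(
  token = c('a', 'b'),
  topic = c('u1', 'u1'),
  beta  = c(0.5, 0.5)
)
d_within  = hellinger_sketch(beta1, vocab = vocab)         # 2 x 2
d_between = hellinger_sketch(beta1, beta2, vocab = vocab)  # 2 x 1
```

Here t1 and t2 share half their mass on token a, so their Bhattacharyya coefficient is 0.5 and their distance is sqrt(0.5). Clamping with pmax() is a design choice of this sketch to keep rounding error from producing NaN; the package function may handle this differently.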

Examples

set.seed(42)
vocab = letters[1:5]
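# Draw k random topic-word distributions over vocab and reshape them
# into tidy beta format. rdirichlet() is not in base R; it is assumed
# to be attached already.
make_beta = function(k) {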
  rdirichlet(k, rep(1, length(vocab))) |>
    tibble::as_tibble(.name_repair = ~vocab) |>
    dplyr::mutate(topic = paste0('t', dplyr::row_number())) |>
    tidyr::pivot_longer(-topic, names_to = 'token', values_to = 'beta')
}
beta1 = make_beta(3)
beta2 = make_beta(4)
# Distances among the topics of a single model
compare_betas(beta1, vocab = vocab)
#> 3 x 3 Matrix of class "dgeMatrix"
#>           t1           t2        t3
#> t1 0.0000000 5.934778e-01 0.2699147
#> t2 0.5934778 1.825012e-08 0.4786181
#> t3 0.2699147 4.786181e-01 0.0000000
# Distances between the topics of two models (rows: beta1, columns: beta2)
compare_betas(beta1, beta2, vocab = vocab)
#> 3 x 4 Matrix of class "dgeMatrix"
#>           t1        t2        t3        t4
#> t1 0.3477229 0.3744788 0.3449200 0.6172048
#> t2 0.4991166 0.3211290 0.4987743 0.4733966
#> t3 0.2468361 0.2657232 0.2429489 0.4208010