Calculates Hellinger distance between rows of one or two matrices or tidied topic model dataframes.
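For reference, the textbook definition of the Hellinger distance between two discrete probability distributions \(p\) and \(q\) over \(k\) categories is given below. This page does not state which normalization the function uses, so the \(1/\sqrt{2}\) factor (which bounds the distance in \([0, 1]\)) is an assumption; the second equality holds only when both \(p\) and \(q\) sum to 1.

\[
H(p, q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2} = \sqrt{1 - \sum_{i=1}^{k} \sqrt{p_i \, q_i}}
\]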
Usage
hellinger(topics1, ...)
# S3 method for class 'Matrix'
hellinger(topics1, topics2 = NULL, ...)
# S3 method for class 'matrix'
hellinger(...)
# S3 method for class 'data.frame'
hellinger(
  topics1,
  id1 = "document",
  cat1 = "topic",
  prob1 = "prob",
  topics2 = NULL,
  id2 = "document",
  cat2 = "topic",
  prob2 = "prob",
  df = FALSE,
  ...
)
Arguments
- topics1
First matrix (\(n_1 \times k\)), base R matrix, or tidied topic model dataframe.
- ...
Not used; required for S3 method compatibility.
- topics2
Optional second matrix (\(n_2 \times k\)) or dataframe of the same type as topics1. When NULL (default), pairwise distances within topics1 are returned.
- id1
Unit identifier column in topics1 (data.frame method only).
- cat1
Category identifier column in topics1 (data.frame method only).
- prob1
Probability value column in topics1 (data.frame method only).
- id2
Unit identifier column in topics2 (data.frame method only).
- cat2
Category identifier column in topics2 (data.frame method only).
- prob2
Probability value column in topics2 (data.frame method only).
- df
Should the function return the matrix of Hellinger distances (default) or a tidy dataframe? (data.frame method only)
Value
Matrix of size \(n_1 \times n_1\) or \(n_1 \times n_2\) (Matrix/matrix methods), or a matrix or tidy dataframe of Hellinger distances (data.frame method).
Examples
# Matrix / matrix method
set.seed(2022-06-09)  # evaluated as arithmetic, so the seed is 2007
topics1 = rdirichlet(3, rep(5, 5))
topics2 = rdirichlet(3, rep(5, 5))
hellinger(topics1)
#>           [,1]      [,2]         [,3]
#> [1,] 0.0000000 0.3067419 2.668745e-01
#> [2,] 0.3067419 0.0000000 1.230902e-01
#> [3,] 0.2668745 0.1230902 1.053671e-08
hellinger(topics1, topics2)
#>           [,1]      [,2]       [,3]
#> [1,] 0.2361547 0.2632094 0.33018266
#> [2,] 0.1777705 0.1060308 0.12270871
#> [3,] 0.1296687 0.1732766 0.08788004
# data.frame method
set.seed(2022-06-09)
topics1 = rdirichlet(3, rep(5, 5)) |>
    tibble::as_tibble(rownames = 'doc_id',
                      .name_repair = tmfast:::make_colnames) |>
    dplyr::mutate(doc_id = stringr::str_c('doc_', doc_id)) |>
    tidyr::pivot_longer(tidyselect::starts_with('V'),
                        names_to = 'topic',
                        values_to = 'gamma')
topics2 = rdirichlet(3, rep(5, 5)) |>
    tibble::as_tibble(rownames = 'doc_id',
                      .name_repair = tmfast:::make_colnames) |>
    dplyr::mutate(doc_id = stringr::str_c('doc_', as.integer(doc_id) + 5)) |>
    tidyr::pivot_longer(tidyselect::starts_with('V'),
                        names_to = 'topic',
                        values_to = 'gamma')
hellinger(topics1, doc_id, prob1 = 'gamma', df = TRUE)
#> # A tibble: 9 × 3
#>   doc_id document         dist
#>   <chr>  <chr>           <dbl>
#> 1 doc_1  doc_1    0
#> 2 doc_1  doc_2    0.307
#> 3 doc_1  doc_3    0.267
#> 4 doc_2  doc_1    0.307
#> 5 doc_2  doc_2    0
#> 6 doc_2  doc_3    0.123
#> 7 doc_3  doc_1    0.267
#> 8 doc_3  doc_2    0.123
#> 9 doc_3  doc_3    0.0000000105
hellinger(topics1, doc_id, prob1 = 'gamma',
          topics2 = topics2, id2 = doc_id, prob2 = 'gamma')
#>           doc_6     doc_7      doc_8
#> doc_1 0.2361547 0.2632094 0.33018266
#> doc_2 0.1777705 0.1060308 0.12270871
#> doc_3 0.1296687 0.1732766 0.08788004
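
The distances above can be cross-checked by hand with a few lines of base R. The sketch below is not the package's implementation: `hellinger_sketch` is a hypothetical helper, and it assumes every row is a probability vector (non-negative, summing to 1), in which case the Hellinger distance equals the square root of one minus the Bhattacharyya coefficient.

```r
# Hypothetical helper, not part of tmfast: Hellinger distance between the rows
# of p and the rows of q, assuming every row is a probability vector.
hellinger_sketch = function(p, q = p) {
    # Bhattacharyya coefficient for every pair of rows (n1 x n2 matrix)
    bc = sqrt(p) %*% t(sqrt(q))
    # Clamp at 0 so floating-point error cannot put a negative under sqrt()
    sqrt(pmax(1 - bc, 0))
}

# Three hand-made distributions over three categories
p = rbind(c(0.50, 0.50, 0.0),
          c(0.00, 0.00, 1.0),
          c(0.25, 0.25, 0.5))

d = hellinger_sketch(p)
d
```

Rows 1 and 2 of `p` share no support, so their distance is the maximum value of 1, while each row is at distance 0 from itself. The clamp with `pmax()` is a design choice in this sketch: without it, rounding error can leave a tiny nonzero value on the diagonal, similar to the `1.053671e-08` entry in the matrix output above.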