Hellinger distances — hellinger • tmfast

Calculates pairwise Hellinger distances between the rows of one or two matrices, or between the units of one or two tidied topic model dataframes.
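For orientation, here is a minimal base R sketch of the standard Hellinger distance, \(H(p, q) = \frac{1}{\sqrt{2}} \lVert \sqrt{p} - \sqrt{q} \rVert_2\), applied to every pair of rows. It assumes the package follows this standard definition. The helper `hellinger_sketch` is hypothetical and is not part of tmfast, and the package's own implementation may differ.

```r
# Hypothetical helper, not part of tmfast: Hellinger distances between the
# rows of p (n1 x k) and the rows of q (n2 x k), standard definition assumed.
hellinger_sketch <- function(p, q = p) {
    sp <- sqrt(p)
    sq <- sqrt(q)
    # Squared Euclidean distance between every row of sp and every row of sq,
    # expanded as |a|^2 + |b|^2 - 2 a.b
    d2 <- outer(rowSums(sp^2), rowSums(sq^2), `+`) - 2 * tcrossprod(sp, sq)
    # Clamp tiny negative values caused by floating-point rounding
    sqrt(pmax(d2, 0) / 2)
}

# Three toy distributions over three categories, one per row
p <- rbind(c(1, 0, 0),
           c(0, 1, 0),
           c(0.5, 0.5, 0))
d <- hellinger_sketch(p)
d
```

Rows with disjoint support (rows 1 and 2 here) are at the maximum distance of 1. Because the expanded form subtracts nearly equal numbers, self-distances computed this way can come out as very small nonzero values rather than exactly 0.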

Usage

hellinger(topics1, ...)

# S3 method for class 'Matrix'
hellinger(topics1, topics2 = NULL, ...)

# S3 method for class 'matrix'
hellinger(...)

# S3 method for class 'data.frame'
hellinger(
  topics1,
  id1 = "document",
  cat1 = "topic",
  prob1 = "prob",
  topics2 = NULL,
  id2 = "document",
  cat2 = "topic",
  prob2 = "prob",
  df = FALSE,
  ...
)

Arguments

topics1: First set of distributions: a Matrix object or base R matrix (\(n_1 \times k\), one distribution per row), or a tidied topic model dataframe.
...: Not used; required for S3 method compatibility.
topics2: Optional second matrix (\(n_2 \times k\)) or dataframe of the same type as topics1. When NULL (default), pairwise distances within topics1 are returned.
id1: Unit identifier column in topics1 (data.frame method only).
cat1: Category identifier column in topics1 (data.frame method only).
prob1: Probability value column in topics1 (data.frame method only).
id2: Unit identifier column in topics2 (data.frame method only).
cat2: Category identifier column in topics2 (data.frame method only).
prob2: Probability value column in topics2 (data.frame method only).
df: If FALSE (default), return the matrix of Hellinger distances; if TRUE, return a tidy dataframe (data.frame method only).
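As a rough illustration of what the identifier, category, and probability columns describe, the sketch below reshapes a small tidy dataframe into a unit-by-category matrix in base R. The data and the name `tidy` are made up for this example, and this is not necessarily how the data.frame method reshapes its input internally.

```r
# Made-up tidy data: one row per (document, topic) pair
tidy <- data.frame(
    document = rep(c("doc_1", "doc_2"), each = 3),
    topic    = rep(c("V1", "V2", "V3"), times = 2),
    prob     = c(0.2, 0.3, 0.5, 0.6, 0.1, 0.3)
)

# Units become rows, categories become columns, probabilities fill the cells
wide <- tapply(tidy$prob, list(tidy$document, tidy$topic), sum)
wide
```

Each row of `wide` is one unit's probability distribution over the categories, which is the layout the Matrix/matrix methods expect.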

Value

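Matrix/matrix methods: a matrix of Hellinger distances of size \(n_1 \times n_1\) (when topics2 is NULL) or \(n_1 \times n_2\). data.frame method: a matrix of Hellinger distances (df = FALSE) or a tidy dataframe with one row per pair of units (df = TRUE).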

Examples

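# Matrix / matrix method
library(tmfast)  # provides hellinger() and rdirichlet()
set.seed(2022-06-09)  # note: parsed as arithmetic, i.e. set.seed(2007)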
topics1 = rdirichlet(3, rep(5, 5))
topics2 = rdirichlet(3, rep(5, 5))
hellinger(topics1)
#>           [,1]      [,2]         [,3]
#> [1,] 0.0000000 0.3067419 2.668745e-01
#> [2,] 0.3067419 0.0000000 1.230902e-01
#> [3,] 0.2668745 0.1230902 1.053671e-08
# The 1.05e-08 on the diagonal is floating-point rounding; the exact self-distance is 0.
hellinger(topics1, topics2)
#>           [,1]      [,2]       [,3]
#> [1,] 0.2361547 0.2632094 0.33018266
#> [2,] 0.1777705 0.1060308 0.12270871
#> [3,] 0.1296687 0.1732766 0.08788004

# data.frame method
set.seed(2022-06-09)  # same seed as above, so the same draws in tidy form
topics1 = rdirichlet(3, rep(5, 5)) |>
    tibble::as_tibble(rownames = 'doc_id',
                      .name_repair = tmfast:::make_colnames) |>
    dplyr::mutate(doc_id = stringr::str_c('doc_', doc_id)) |>
    tidyr::pivot_longer(tidyselect::starts_with('V'),
                        names_to = 'topic',
                        values_to = 'gamma')
topics2 = rdirichlet(3, rep(5, 5)) |>
    tibble::as_tibble(rownames = 'doc_id',
                      .name_repair = tmfast:::make_colnames) |>
    dplyr::mutate(doc_id = stringr::str_c('doc_', as.integer(doc_id) + 5)) |>
    tidyr::pivot_longer(tidyselect::starts_with('V'),
                        names_to = 'topic',
                        values_to = 'gamma')
# Pairwise distances within topics1, returned as a tidy dataframe
hellinger(topics1, doc_id, prob1 = 'gamma', df = TRUE)
#> # A tibble: 9 × 3
#>   doc_id document         dist
#>   <chr>  <chr>           <dbl>
#> 1 doc_1  doc_1    0           
#> 2 doc_1  doc_2    0.307       
#> 3 doc_1  doc_3    0.267       
#> 4 doc_2  doc_1    0.307       
#> 5 doc_2  doc_2    0           
#> 6 doc_2  doc_3    0.123       
#> 7 doc_3  doc_1    0.267       
#> 8 doc_3  doc_2    0.123       
#> 9 doc_3  doc_3    0.0000000105
# Distances between topics1 and topics2, returned as a matrix (df = FALSE by default)
hellinger(topics1, doc_id, prob1 = 'gamma',
          topics2 = topics2, id2 = doc_id, prob2 = 'gamma')
#>           doc_6     doc_7      doc_8
#> doc_1 0.2361547 0.2632094 0.33018266
#> doc_2 0.1777705 0.1060308 0.12270871
#> doc_3 0.1296687 0.1732766 0.08788004