
Calculates Hellinger distance between rows of one or two matrices or tidied topic model dataframes.
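
For reference, the Hellinger distance between two discrete probability distributions \(p\) and \(q\) over \(k\) categories is commonly defined as follows. This is the standard textbook form, normalized to lie in \([0, 1]\); that the function uses this normalization is an assumption, not something stated on this page.

\[
H(p, q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2} = \sqrt{1 - \sum_{i=1}^{k} \sqrt{p_i q_i}}
\]

The second equality holds when \(p\) and \(q\) each sum to 1.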

Usage

hellinger(topics1, ...)

# S3 method for class 'Matrix'
hellinger(topics1, topics2 = NULL, ...)

# S3 method for class 'matrix'
hellinger(...)

# S3 method for class 'data.frame'
hellinger(
  topics1,
  id1 = "document",
  cat1 = "topic",
  prob1 = "prob",
  topics2 = NULL,
  id2 = "document",
  cat2 = "topic",
  prob2 = "prob",
  df = FALSE,
  ...
)

Arguments

topics1

First set of distributions: a Matrix or base R matrix (\(n_1 \times k\), one row per unit), or a tidied topic model dataframe.

...

Not used; required for S3 method compatibility.

topics2

Optional second matrix (\(n_2 \times k\)) or dataframe of the same type as topics1. When NULL (default), pairwise distances within topics1 are returned.

id1

Unit identifier column in topics1 (data.frame method only).

cat1

Category identifier column in topics1 (data.frame method only).

prob1

Probability value column in topics1 (data.frame method only).

id2

Unit identifier column in topics2 (data.frame method only).

cat2

Category identifier column in topics2 (data.frame method only).

prob2

Probability value column in topics2 (data.frame method only).

df

If FALSE (default), return a matrix of Hellinger distances; if TRUE, return a tidy dataframe (data.frame method only).
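
The id, cat, and prob columns presumably correspond to the rows, columns, and cell values of the matrix form. The sketch below shows that correspondence in base R. The column names match the defaults above, but the data are invented for illustration and this is not the package's internal code.

```r
# Hypothetical long-format data: one row per (document, topic) pair
tidy <- data.frame(
    document = rep(c("doc_1", "doc_2"), each = 3),
    topic    = rep(c("V1", "V2", "V3"), times = 2),
    prob     = c(0.2, 0.3, 0.5,
                 0.6, 0.1, 0.3)
)

# Pivot to a documents x topics matrix; each (document, topic) cell
# holds the matching prob value
wide <- with(tidy, tapply(prob, list(document, topic), sum))

dim(wide)             # 2 documents by 3 topics
wide["doc_2", "V1"]   # the prob value for doc_2, topic V1
```

Each row of `wide` is then one unit's probability distribution over categories, which is the shape the Matrix/matrix methods expect.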

Value

Matrix/matrix methods: a matrix of Hellinger distances, of size \(n_1 \times n_1\) when topics2 is NULL and \(n_1 \times n_2\) otherwise. data.frame method: the same matrix when df = FALSE, or a tidy dataframe of pairwise distances when df = TRUE.
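
To make the shape of the returned matrix concrete, here is a minimal base R sketch of a pairwise Hellinger computation. `hellinger_ref` is a hypothetical helper written for this illustration, not the package's implementation, and it assumes every row of its inputs sums to 1.

```r
# Hypothetical reference implementation (not tmfast's code): Hellinger
# distances between the rows of P (n1 x k) and the rows of Q (n2 x k).
hellinger_ref <- function(P, Q = P) {
    # Bhattacharyya coefficients: sum_i sqrt(p_i * q_i) for every row pair
    bc <- sqrt(P) %*% t(sqrt(Q))
    # Clamp at 0 so rounding error cannot lead to sqrt() of a negative number
    sqrt(pmax(1 - bc, 0))
}

P <- rbind(c(1.0, 0.0, 0.0),
           c(0.5, 0.5, 0.0),
           c(0.0, 0.0, 1.0))
D <- hellinger_ref(P)

dim(D)     # n1 x n1 when only one matrix is supplied
D[1, 3]    # distributions with disjoint support are at the maximum distance
```

In a computation of this form, the distance from a row to itself can come out as a tiny positive number instead of exactly 0 because of floating-point roundoff. The value 1.053671e-08 on the diagonal of the first example output below is consistent with that.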

Examples

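# Matrix / matrix method
# Assumes tmfast is attached; rdirichlet() is assumed to come from tmfast as well
# Note: 2022-06-09 is evaluated as arithmetic, so the seed is 2007
set.seed(2022-06-09)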
topics1 = rdirichlet(3, rep(5, 5))
topics2 = rdirichlet(3, rep(5, 5))
hellinger(topics1)
#>           [,1]      [,2]         [,3]
#> [1,] 0.0000000 0.3067419 2.668745e-01
#> [2,] 0.3067419 0.0000000 1.230902e-01
#> [3,] 0.2668745 0.1230902 1.053671e-08
hellinger(topics1, topics2)
#>           [,1]      [,2]       [,3]
#> [1,] 0.2361547 0.2632094 0.33018266
#> [2,] 0.1777705 0.1060308 0.12270871
#> [3,] 0.1296687 0.1732766 0.08788004

# data.frame method
set.seed(2022-06-09)
topics1 = rdirichlet(3, rep(5, 5)) |>
    tibble::as_tibble(rownames = 'doc_id',
                      .name_repair = tmfast:::make_colnames) |>
    dplyr::mutate(doc_id = stringr::str_c('doc_', doc_id)) |>
    tidyr::pivot_longer(tidyselect::starts_with('V'),
                        names_to = 'topic',
                        values_to = 'gamma')
topics2 = rdirichlet(3, rep(5, 5)) |>
    tibble::as_tibble(rownames = 'doc_id',
                      .name_repair = tmfast:::make_colnames) |>
    dplyr::mutate(doc_id = stringr::str_c('doc_', as.integer(doc_id) + 5)) |>
    tidyr::pivot_longer(tidyselect::starts_with('V'),
                        names_to = 'topic',
                        values_to = 'gamma')
hellinger(topics1, id1 = doc_id, prob1 = 'gamma', df = TRUE)
#> # A tibble: 9 × 3
#>   doc_id document         dist
#>   <chr>  <chr>           <dbl>
#> 1 doc_1  doc_1    0           
#> 2 doc_1  doc_2    0.307       
#> 3 doc_1  doc_3    0.267       
#> 4 doc_2  doc_1    0.307       
#> 5 doc_2  doc_2    0           
#> 6 doc_2  doc_3    0.123       
#> 7 doc_3  doc_1    0.267       
#> 8 doc_3  doc_2    0.123       
#> 9 doc_3  doc_3    0.0000000105
hellinger(topics1, id1 = doc_id, prob1 = 'gamma',
          topics2 = topics2, id2 = doc_id, prob2 = 'gamma')
#>           doc_6     doc_7      doc_8
#> doc_1 0.2361547 0.2632094 0.33018266
#> doc_2 0.1777705 0.1060308 0.12270871
#> doc_3 0.1296687 0.1732766 0.08788004