Calculates \(\log_2 n \times \delta H\) for each term: the base-2 log of the term's total number of occurrences, \(n\), times its information gain, \(\delta H\), relative to the uniform distribution. I prefer this to methods such as TF-IDF for vocabulary selection.
Examples
# \donttest{
if (requireNamespace("janeaustenr", quietly = TRUE)) {
    library(dplyr)
    library(tidytext)
    # One row per (book, term) pair, with the term's count in column n
    austen_df = janeaustenr::austen_books() |>
        unnest_tokens(term, text, token = 'words') |>
        mutate(author = 'Jane Austen') |>
        count(author, book, term)
    # Rank terms by ndH, with book as the document column
    ndH(austen_df, book, term, n)
}
#> # A tibble: 14,520 × 5
#>    term           H    dH     n   ndH
#>    <chr>      <dbl> <dbl> <int> <dbl>
#>  1 emma      0.0141  2.57   787  24.7
#>  2 elinor    0       2.58   623  24.0
#>  3 crawford  0       2.58   493  23.1
#>  4 marianne  0       2.58   492  23.1
#>  5 weston    0       2.58   389  22.2
#>  6 darcy     0       2.58   373  22.1
#>  7 fanny     0.323   2.26   862  22.1
#>  8 edmund    0       2.58   364  22.0
#>  9 knightley 0       2.58   356  21.9
#> 10 harriet   0.0873  2.50   419  21.8
#> # ℹ 14,510 more rows
# }
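To make the columns in the output above concrete, here is a minimal base-R sketch of how the statistic might be computed by hand. It is not the package's implementation. It assumes that H is the entropy (in bits) of a term's distribution across documents and that dH is \(\log_2 K - H\), where \(K\) is the number of documents. That reading is consistent with the output above, where H + dH is roughly \(\log_2 6 \approx 2.58\) for the six books. The data frame `toy` and the helper `entropy_bits()` are hypothetical names used only for illustration.

```r
# Toy counts (hypothetical data): three documents, three terms
toy <- data.frame(
  doc  = c("a", "b", "c", "a", "a", "b"),
  term = c("the", "the", "the", "emma", "ship", "ship"),
  n    = c(8, 8, 8, 16, 4, 4)
)

# K, the number of documents; log2(K) is the entropy of the uniform distribution
n_docs <- length(unique(toy$doc))

# Entropy (in bits) of one term's distribution across documents
entropy_bits <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]              # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

# Per-term count vectors, one list element per term
counts_by_term <- split(toy$n, toy$term)

result <- data.frame(
  term = names(counts_by_term),
  H    = vapply(counts_by_term, entropy_bits, numeric(1)),
  n    = vapply(counts_by_term, sum, numeric(1))
)
result$dH  <- log2(n_docs) - result$H     # information gain vs. uniform (assumed definition)
result$ndH <- log2(result$n) * result$dH  # log total occurrences times information gain
result <- result[order(-result$ndH), ]

print(result, row.names = FALSE)
```

In this toy example, "emma" occurs in only one document and ranks first, while "the" is spread evenly across all three documents and gets an ndH of approximately 0, even though it has the highest total count. This is the behavior that makes the statistic useful for vocabulary selection: frequent terms are rewarded only if they are also unevenly distributed.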