Calculates \(\log_2 n \times \delta H\) for each term: the base-2 log of the term's total occurrence count, multiplied by its information gain relative to the uniform distribution. I prefer this to methods such as TF-IDF for vocabulary selection.
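To make the formula concrete, here is a minimal base-R sketch of the calculation on toy data. It is not the package's implementation. It assumes that \(H\) is the entropy (in bits) of a term's counts across documents and that \(\delta H = \log_2 K - H\), where \(K\) is the number of documents, which is how the `H` and `dH` columns in the example output appear to relate. The data frame `counts` and the helper `entropy_bits()` are hypothetical names used only for illustration.

```r
# Toy counts (hypothetical data): one row per (document, term) pair
counts <- data.frame(
  doc  = c("a", "b", "a", "b"),
  term = c("the", "the", "emma", "darcy"),
  n    = c(10, 10, 8, 4)
)

# Number of documents K; the uniform distribution over them has entropy log2(K)
n_docs <- length(unique(counts$doc))

# Entropy in bits of a vector of positive counts
# (assumes zero counts are simply absent from the data)
entropy_bits <- function(n) {
  p <- n / sum(n)
  -sum(p * log2(p))
}

# Per-term entropy across documents and per-term total count
H     <- tapply(counts$n, counts$term, entropy_bits)
total <- tapply(counts$n, counts$term, sum)

# Assumed definitions: dH = log2(K) - H, ndH = log2(n) * dH
sketch <- data.frame(
  term = names(H),
  H    = as.vector(H),
  dH   = log2(n_docs) - as.vector(H),
  n    = as.vector(total)
)
sketch$ndH <- log2(sketch$n) * sketch$dH

# Highest-scoring terms first
sketch <- sketch[order(-sketch$ndH), ]
sketch
```

In this toy case "the" is spread evenly over both documents, so its information gain and score are zero, while "emma" and "darcy" each occur in a single document and are ranked by their total counts.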
Examples
library(tidyverse)
#> Warning: package ‘tibble’ was built under R version 4.5.2
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.4 ✔ readr 2.1.5
#> ✔ forcats 1.0.0 ✔ stringr 1.5.1
#> ✔ ggplot2 3.5.2 ✔ tibble 3.3.1
#> ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
#> ✔ purrr 1.1.0
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ℹ Use the conflicted package to force all conflicts to become errors
library(tidytext)
library(janeaustenr)
# Count how often each term occurs in each book
austen_df = austen_books() |>
    unnest_tokens(term, text, token = 'words') |>
    mutate(author = 'Jane Austen') |>
    count(author, book, term)

# Score each term by log2(n) × dH across books
ndH(austen_df, book, term, n)
#> # A tibble: 14,520 × 5
#>    term           H    dH     n   ndH
#>    <chr>      <dbl> <dbl> <int> <dbl>
#>  1 emma      0.0141  2.57   787  24.7
#>  2 elinor    0       2.58   623  24.0
#>  3 crawford  0       2.58   493  23.1
#>  4 marianne  0       2.58   492  23.1
#>  5 weston    0       2.58   389  22.2
#>  6 darcy     0       2.58   373  22.1
#>  7 fanny     0.323   2.26   862  22.1
#>  8 edmund    0       2.58   364  22.0
#>  9 knightley 0       2.58   356  21.9
#> 10 harriet   0.0873  2.50   419  21.8
#> # ℹ 14,510 more rows