Calculates \(\log_2 n \times \delta H\) for each term: the base-2 logarithm of the term's total occurrence count \(n\), multiplied by its information gain \(\delta H\) relative to the uniform distribution. I prefer this to methods such as TF-IDF for vocabulary selection.
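The quantities involved can be written out as follows. This is a reading inferred from the example output rather than a definition stated by the source: it assumes \(H\) is the entropy of a term's distribution across documents and that the uniform baseline is \(\log_2 D\) for \(D\) documents. The symbols \(n_{d,t}\), \(n_t\), \(p(d \mid t)\) and \(D\) are introduced here for illustration.

\[
n_t = \sum_{d} n_{d,t}, \qquad
p(d \mid t) = \frac{n_{d,t}}{n_t}, \qquad
H_t = -\sum_{d} p(d \mid t) \log_2 p(d \mid t)
\]

\[
\delta H_t = \log_2 D - H_t, \qquad
\mathrm{ndH}_t = \log_2 n_t \times \delta H_t
\]

This reading agrees with the Austen example, where \(D = 6\) books gives \(\log_2 6 \approx 2.585\). For "emma", \(H + \delta H \approx 0.0141 + 2.57 \approx 2.585\), and \(\log_2 787 \times 2.57 \approx 9.62 \times 2.57 \approx 24.7\), matching the reported value.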
- dataf
Tidy document-term matrix
- doc_col
Column of dataf with document IDs
- term_col
Column of dataf with terms
- count_col
Column of dataf with document-term counts
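library(dplyr)
library(tidytext)
library(janeaustenr)
# One row per (book, term) pair, with its count in column n
austen_df = austen_books() |>
    unnest_tokens(term, text, token = 'words') |>
    mutate(author = 'Jane Austen') |>
    count(author, book, term)
# Rank terms by log2(n) * dH, treating each book as a document
ndH(austen_df, book, term, n)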
#> # A tibble: 14,520 × 5
#>    term           H    dH     n   ndH
#>    <chr>      <dbl> <dbl> <int> <dbl>
#>  1 emma      0.0141  2.57   787  24.7
#>  2 elinor    0       2.58   623  24.0
#>  3 crawford  0       2.58   493  23.1
#>  4 marianne  0       2.58   492  23.1
#>  5 weston    0       2.58   389  22.2
#>  6 darcy     0       2.58   373  22.1
#>  7 fanny     0.323   2.26   862  22.1
#>  8 edmund    0       2.58   364  22.0
#>  9 knightley 0       2.58   356  21.9
#> 10 harriet   0.0873  2.50   419  21.8
#> # ℹ 14,510 more rows
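To make the columns above concrete, here is a minimal sketch that recomputes them by hand on a toy data frame. It is not the package's implementation: `ndH_sketch()` and `toy_df` are hypothetical names, and the sketch assumes that H is the entropy of each term's distribution across documents and that dH is log2 of the number of documents minus H, which is consistent with the output above but not confirmed by the source.

```r
library(dplyr)

# Toy tidy document-term matrix (made-up counts, two documents)
toy_df = tibble(
    doc  = c('a', 'b', 'a', 'b'),
    term = c('cat', 'cat', 'dog', 'the'),
    n    = c(4, 4, 8, 1)
)

# Hypothetical helper: an assumed reading of the calculation,
# not the package's own code
ndH_sketch = function(dataf, doc_col, term_col, count_col) {
    # Number of documents sets the uniform-distribution baseline
    n_docs = n_distinct(pull(dataf, {{ doc_col }}))
    dataf |>
        group_by({{ term_col }}) |>
        # Share of each term's occurrences falling in each document
        mutate(p = {{ count_col }} / sum({{ count_col }})) |>
        summarize(H = -sum(ifelse(p > 0, p * log2(p), 0)),
                  n = sum({{ count_col }})) |>
        # Information gain relative to uniform, weighted by log count
        mutate(dH = log2(n_docs) - H,
               ndH = log2(n) * dH) |>
        select({{ term_col }}, H, dH, n, ndH) |>
        arrange(desc(ndH))
}

ndH_sketch(toy_df, doc, term, n)
```

On this toy data, "dog" (8 occurrences, all in one document) scores 3, while "cat" (8 occurrences split evenly across both documents) and "the" (a single occurrence) both score 0. Under the assumed definition, a term needs to be both frequent and concentrated in few documents to rank highly.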