Calculates \(\log_2 n \times \delta H\) for each term: the base-2 log of the term's total occurrence count \(n\), multiplied by its information gain \(\delta H\) relative to the uniform distribution. I prefer this for vocabulary selection over methods such as TF-IDF.
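The definition can be spelled out as a worked equation. This is a reading inferred from the column names and values in the example output, not taken from the package source; the symbols \(c_{t,d}\) (count of term \(t\) in document \(d\)) and \(D\) (number of documents) are introduced here for illustration only.

\[
\begin{aligned}
n_t &= \sum_{d} c_{t,d}, \qquad p_{d \mid t} = \frac{c_{t,d}}{n_t} \\
H_t &= -\sum_{d} p_{d \mid t} \log_2 p_{d \mid t} \\
\delta H_t &= \log_2 D - H_t \\
\mathrm{ndH}_t &= \log_2 n_t \times \delta H_t
\end{aligned}
\]

Under this reading, a term confined to a single document has \(H_t = 0\) and the maximum gain \(\log_2 D\). With the six Austen novels, \(\log_2 6 \approx 2.58\), so a term such as "elinor" with \(n = 623\) scores \(\log_2 623 \times 2.58 \approx 24.0\), which agrees with the example output.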
Arguments
- dataf
Tidy document-term matrix
- doc_col
Column of dataf with document IDs
- term_col
Column of dataf with terms
- count_col
Column of dataf with document-term counts
Examples
library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.2 ✔ readr 2.1.4
#> ✔ forcats 1.0.0 ✔ stringr 1.5.0
#> ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
#> ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
#> ✔ purrr 1.0.2
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(janeaustenr)
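# Build a tidy document-term matrix: one row per (book, term) with its count n
austen_df = austen_books() |>
    unnest_tokens(term, text, token = 'words') |>
    mutate(author = 'Jane Austen') |>
    count(author, book, term)
# Treat each book as a document and rank terms by ndH
ndH(austen_df, book, term, n)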
#> # A tibble: 14,520 × 5
#> term H dH n ndH
#> <chr> <dbl> <dbl> <int> <dbl>
#> 1 emma 0.0141 2.57 787 24.7
#> 2 elinor 0 2.58 623 24.0
#> 3 crawford 0 2.58 493 23.1
#> 4 marianne 0 2.58 492 23.1
#> 5 weston 0 2.58 389 22.2
#> 6 darcy 0 2.58 373 22.1
#> 7 fanny 0.323 2.26 862 22.1
#> 8 edmund 0 2.58 364 22.0
#> 9 knightley 0 2.58 356 21.9
#> 10 harriet 0.0873 2.50 419 21.8
#> # ℹ 14,510 more rows
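To make the H, dH, n and ndH columns concrete, here is a minimal sketch that recomputes them by hand with dplyr on a toy table. It assumes H is the entropy of each term's distribution across documents and dH is the gap to the uniform maximum, which is consistent with the output above but is not taken from the package source. The names `toy_df`, `n_docs` and `manual_ndH` are hypothetical.

```r
library(dplyr)

# Hypothetical toy document-term counts: two documents, three terms.
# Zero counts are simply absent, as in the output of count().
toy_df <- tibble::tribble(
  ~doc, ~term,   ~count,
  "a",  "the",   10,
  "b",  "the",   10,
  "a",  "emma",   8,
  "b",  "darcy",  4
)

# Number of documents; log2(n_docs) is the entropy of the uniform distribution
n_docs <- n_distinct(toy_df$doc)

manual_ndH <- toy_df |>
  group_by(term) |>
  # Share of each term's occurrences that falls in each document
  mutate(p = count / sum(count)) |>
  summarize(
    H = -sum(p * log2(p)),  # entropy of the term across documents (assumed definition)
    n = sum(count)          # total occurrences of the term
  ) |>
  mutate(
    dH = log2(n_docs) - H,  # information gain relative to uniform
    ndH = log2(n) * dH
  ) |>
  arrange(desc(ndH))

manual_ndH
```

In this toy table "the" is spread evenly over both documents, so its dH and ndH are 0, while "emma" and "darcy" each sit in a single document and score 3 and 2 respectively. That is the behavior that makes the measure useful for vocabulary selection: frequent but evenly spread terms drop to the bottom.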