Information gain (uniform distribution)

Calculates \(\log_2 n \times \delta H\) for each term: the log of the term's total occurrence count \(n\), multiplied by its information gain \(\delta H\) relative to the uniform distribution over documents. I prefer this for vocabulary selection over methods such as TF-IDF.
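
The quantity can be written out explicitly. The notation below is an assumption rather than the package's own: \(c_{t,d}\) is the count of term \(t\) in document \(d\), and \(D\) is the set of documents. The definition of \(\delta H\) as \(\log_2 |D| - H\) is inferred from the example output on this page, where a term found in only one of the six books has \(H = 0\) and \(\delta H \approx 2.58 \approx \log_2 6\).

```latex
\[
n_t = \sum_{d \in D} c_{t,d}, \qquad
p(d \mid t) = \frac{c_{t,d}}{n_t}
\]
\[
H_t = -\sum_{d \in D} p(d \mid t) \log_2 p(d \mid t), \qquad
\delta H_t = \log_2 |D| - H_t
\]
\[
\mathrm{ndH}_t = \log_2 n_t \times \delta H_t
\]
```

For instance, the row for "elinor" in the example output has \(n = 623\) and \(\delta H = 2.58\), so \(\log_2 623 \times 2.58 \approx 9.28 \times 2.58 \approx 24.0\), which matches the reported `ndH`.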

Usage

ndH(dataf, doc_col, term_col, count_col)

Arguments

dataf: Tidy document-term matrix
doc_col: Column of dataf with document IDs
term_col: Column of dataf with terms
count_col: Column of dataf with document-term counts

Value

Dataframe with columns

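- the column passed as `term_col`, term
- `H`, entropy of the term's distribution over documents (appears in the example output below)
- `dH`, information gain relative to uniform distribution over documents
- `n`, total count of term occurrence
- `ndH`, \(\log_2 n \times \delta H\)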
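
As a rough illustration of how these columns relate to one another, here is a minimal sketch using dplyr. `ndH_sketch` and the `toy` data are hypothetical names made up for this example. The sketch is not the package's implementation, and it assumes the uniform baseline is \(\log_2\) of the number of distinct documents in `dataf`.

```r
library(dplyr)

# Hypothetical re-implementation for illustration only; not the package's code.
ndH_sketch <- function(dataf, doc_col, term_col, count_col) {
  # Assumption: the uniform baseline is log2 of the number of distinct documents
  n_docs <- dataf |>
    distinct({{ doc_col }}) |>
    nrow()

  dataf |>
    group_by({{ term_col }}) |>
    # Share of each term's occurrences that falls in each document
    mutate(p = {{ count_col }} / sum({{ count_col }})) |>
    summarise(
      # Entropy in bits; zero counts contribute nothing
      H = -sum(ifelse(p > 0, p * log2(p), 0)),
      n = sum({{ count_col }}),
      .groups = "drop"
    ) |>
    mutate(dH = log2(n_docs) - H, ndH = log2(n) * dH) |>
    select({{ term_col }}, H, dH, n, ndH) |>
    arrange(desc(ndH))
}

# Toy data: "x" is spread evenly over both documents, "y" occurs only in "a"
toy <- data.frame(
  doc  = c("a", "b", "a"),
  term = c("x", "x", "y"),
  n    = c(2L, 2L, 4L)
)
res <- ndH_sketch(toy, doc, term, n)
print(res)
```

In the toy data, "x" is evenly spread, so its entropy equals the uniform baseline and `dH` is 0, while "y" is concentrated in one document and gets `dH` of 1 bit. Both terms occur four times, so `ndH` separates them purely on concentration.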

Examples

# \donttest{
if (requireNamespace("janeaustenr", quietly = TRUE)) {
  library(dplyr)
  library(tidytext)
  austen_df = janeaustenr::austen_books() |>
      unnest_tokens(term, text, token = 'words') |>
      mutate(author = 'Jane Austen') |>
      count(author, book, term)
  ndH(austen_df, book, term, n)
}
#> # A tibble: 14,520 × 5
#>    term           H    dH     n   ndH
#>    <chr>      <dbl> <dbl> <int> <dbl>
#>  1 emma      0.0141  2.57   787  24.7
#>  2 elinor    0       2.58   623  24.0
#>  3 crawford  0       2.58   493  23.1
#>  4 marianne  0       2.58   492  23.1
#>  5 weston    0       2.58   389  22.2
#>  6 darcy     0       2.58   373  22.1
#>  7 fanny     0.323   2.26   862  22.1
#>  8 edmund    0       2.58   364  22.0
#>  9 knightley 0       2.58   356  21.9
#> 10 harriet   0.0873  2.50   419  21.8
#> # ℹ 14,510 more rows
# }