Calculates \(\log_2 n \times \delta H\) for each term: the base-2 log of the term's total occurrence count, multiplied by its information gain relative to the uniform distribution over documents. I prefer this to methods such as TF-IDF for vocabulary selection.
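
Spelled out, the quantity appears to be the following. This is a reconstruction from the description above and the example output (with six books, `H + dH` comes to about \(\log_2 6 \approx 2.58\)), not a formula quoted from the package source. The symbols \(c_{t,d}\), \(p_{t,d}\), and \(D\) are introduced here for illustration only.

\[
\begin{aligned}
n_t &= \sum_{d} c_{t,d}, \qquad p_{t,d} = \frac{c_{t,d}}{n_t} \\
H_t &= -\sum_{d} p_{t,d} \log_2 p_{t,d} \\
\delta H_t &= \log_2 D - H_t \\
\mathrm{ndH}_t &= \log_2 n_t \times \delta H_t
\end{aligned}
\]

Here \(c_{t,d}\) is the count of term \(t\) in document \(d\) and \(D\) is the number of documents. A term confined to a single document has \(H_t = 0\) and so the maximum \(\delta H_t = \log_2 D\); a term spread evenly over all documents has \(\delta H_t = 0\).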

Usage

ndH(dataf, doc_col, term_col, count_col)

Arguments

dataf

Tidy document-term matrix

doc_col

Column of dataf with document IDs

term_col

Column of dataf with terms

count_col

Column of dataf with document-term counts

Value

Dataframe with columns

- `{{ term_col }}`, term
- `H`, entropy (in bits) of the term's distribution over documents
- `dH`, information gain relative to uniform distribution over documents
- `n`, total count of term occurrence
- `ndH`, \(\log_2 n \times \delta H\)
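
For readers who want to see how columns like these could be derived, here is a minimal sketch using dplyr. It is an illustrative re-implementation under the assumption that `dH` is \(\log_2\) of the number of documents minus the entropy of the term's distribution over documents; `ndH_sketch()` is a hypothetical name and this is not the package's own source.

```r
library(dplyr)

# Hypothetical re-implementation for illustration; not the package source.
# Assumes dH = log2(number of documents) - entropy of the term over documents.
ndH_sketch <- function(dataf, doc_col, term_col, count_col) {
    # The uniform distribution over n_docs documents has entropy log2(n_docs)
    n_docs <- dataf |>
        distinct({{ doc_col }}) |>
        nrow()

    dataf |>
        # Drop zero counts so that p * log2(p) is always defined
        filter({{ count_col }} > 0) |>
        group_by({{ term_col }}) |>
        # Share of each term's occurrences that falls in each document
        mutate(p = {{ count_col }} / sum({{ count_col }})) |>
        summarize(
            H = -sum(p * log2(p)),      # entropy in bits
            n = sum({{ count_col }}),   # total occurrences of the term
            .groups = "drop"
        ) |>
        mutate(
            dH = log2(n_docs) - H,      # information gain vs. uniform
            ndH = log2(n) * dH
        ) |>
        select({{ term_col }}, H, dH, n, ndH) |>
        arrange(desc(ndH))
}

# Toy data: two documents, three terms
toy <- data.frame(
    doc  = c("a", "b", "a", "b"),
    term = c("x", "x", "y", "z"),
    n    = c(2, 2, 4, 1)
)
res <- ndH_sketch(toy, doc, term, n)
```

In the toy data, `x` is split evenly across both documents, so its `dH` is 0; `y` occurs four times in a single document, so its `dH` is 1 and its `ndH` is \(\log_2 4 \times 1 = 2\); `z` occurs only once, so its `ndH` is 0 despite being confined to one document.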

Examples

library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.2     ✔ readr     2.1.4
#> ✔ forcats   1.0.0     ✔ stringr   1.5.0
#> ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
#> ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
#> ✔ purrr     1.0.2
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(janeaustenr)
austen_df = austen_books() |>
    unnest_tokens(term, text, token = 'words') |>
    mutate(author = 'Jane Austen') |>
    count(author, book, term)
ndH(austen_df, book, term, n)
#> # A tibble: 14,520 × 5
#>    term           H    dH     n   ndH
#>    <chr>      <dbl> <dbl> <int> <dbl>
#>  1 emma      0.0141  2.57   787  24.7
#>  2 elinor    0       2.58   623  24.0
#>  3 crawford  0       2.58   493  23.1
#>  4 marianne  0       2.58   492  23.1
#>  5 weston    0       2.58   389  22.2
#>  6 darcy     0       2.58   373  22.1
#>  7 fanny     0.323   2.26   862  22.1
#>  8 edmund    0       2.58   364  22.0
#>  9 knightley 0       2.58   356  21.9
#> 10 harriet   0.0873  2.50   419  21.8
#> # ℹ 14,510 more rows