Information gain (length-proportional distribution)

An alternative to ndH() that measures information gain relative to a distribution over documents that is proportional to document length, rather than uniform. With the uniform distribution, when document lengths differ dramatically (e.g., over a few orders of magnitude), high-ndH terms tend to be distinctive terms from very long documents. With the length-proportional distribution, high information-gain terms are more likely to come from shorter documents. Informal testing suggests this approach performs better than ndH() in that setting.
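The effect of the baseline can be illustrated with a small base-R sketch. It assumes that "information gain" here means the Kullback-Leibler divergence (in bits) of a term's distribution over documents from the baseline distribution; the helper `info_gain()` and the toy document lengths are hypothetical and are not part of the package.

```r
# Hypothetical helper: KL divergence (in bits) of p from baseline q
info_gain = function(p, q) {
    keep = p > 0    # treat 0 * log2(0) as 0
    sum(p[keep] * log2(p[keep] / q[keep]))
}

# Toy corpus: three documents whose lengths (in tokens) span orders of magnitude
doc_len = c(long = 100000, medium = 1000, short = 100)
uniform = rep(1 / length(doc_len), length(doc_len))
by_length = doc_len / sum(doc_len)

# p(document | term) for two terms, each occurring in a single document
term_in_long = c(1, 0, 0)
term_in_short = c(0, 0, 1)

# Uniform baseline: both terms gain the same log2(3) bits
info_gain(term_in_long, uniform)
info_gain(term_in_short, uniform)

# Length-proportional baseline: the short-document term gains far more
info_gain(term_in_long, by_length)
info_gain(term_in_short, by_length)
```

Under the uniform baseline the two terms tie on information gain, so a ranking that also weights by log count favors the long document, where counts are larger. The length-proportional baseline offsets this by treating a term that is confined to a long document as less surprising.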

Usage

ndR(dataf, doc_col, term_col, count_col)

Arguments

dataf: Tidy document-term matrix
doc_col: Column of dataf with document IDs
term_col: Column of dataf with terms
count_col: Column of dataf with document-term counts

Value

Dataframe with columns

- `{{ term_col }}`, term
- `n`, total count of term occurrence
- `dR`, information gain relative to length-proportional distribution over documents
- `ndR`, log2(n) × dR
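
The columns above could be computed roughly as follows. This is a minimal sketch, not the package's implementation: it assumes dR is the Kullback-Leibler divergence (in bits) of p(document | term) from each document's share of all tokens, and that each document-term pair appears in at most one row. The function name `ndR_sketch()` and the intermediate columns `baseline` and `p` are hypothetical.

```r
library(dplyr)

# Hypothetical re-implementation for illustration only
ndR_sketch = function(dataf, doc_col, term_col, count_col) {
    dataf |>
        filter({{ count_col }} > 0) |>
        # Baseline: each document's share of all tokens in the corpus
        group_by({{ doc_col }}) |>
        mutate(baseline = sum({{ count_col }})) |>
        ungroup() |>
        mutate(baseline = baseline / sum({{ count_col }})) |>
        # p(document | term), then divergence from the baseline
        group_by({{ term_col }}) |>
        mutate(p = {{ count_col }} / sum({{ count_col }})) |>
        summarize(n = sum({{ count_col }}),
                  dR = sum(p * log2(p / baseline))) |>
        mutate(ndR = log2(n) * dR) |>
        arrange(desc(ndR))
}

# Toy tidy document-term matrix (made-up counts)
toy_df = data.frame(doc = c('a', 'a', 'b'),
                    term = c('x', 'y', 'x'),
                    n = c(3L, 1L, 4L))
ndR_sketch(toy_df, doc, term, n)
```

Because ndR multiplies dR by log2(n), a term that occurs only once gets ndR = 0 regardless of how concentrated it is.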

Examples

library(tidyverse)
library(tidytext)
library(janeaustenr)
austen_df = austen_books() |>
    unnest_tokens(term, text, token = 'words') |>
    mutate(author = 'Jane Austen') |>
    count(author, book, term)
ndR(austen_df, book, term, n)
#> # A tibble: 14,520 × 4
#>    term          n    dR   ndR
#>    <chr>     <int> <dbl> <dbl>
#>  1 elliot      254  3.12  24.9
#>  2 tilney      196  3.22  24.5
#>  3 anne        467  2.76  24.5
#>  4 elinor      623  2.60  24.1
#>  5 wentworth   191  3.12  23.6
#>  6 marianne    492  2.60  23.2
#>  7 thorpe      126  3.22  22.5
#>  8 morland     125  3.22  22.4
#>  9 allen       116  3.22  22.1
#> 10 darcy       373  2.57  21.9
#> # ℹ 14,510 more rows