An alternative to ndH() that uses information gain relative to a distribution of documents that is proportional to length. With the uniform distribution and dramatic differences in document lengths (eg, over a few orders of magnitude), high-ndH terms tend to be distinctive terms from very long documents. With the length-proportional distribution, high information-gain terms are more likely to come from shorter documents. Informal testing suggests this approach performs better than the uniform distribution used by ndH() when documents have widely varying lengths.
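The effect of the length-proportional baseline can be illustrated with a small sketch. This is not the package's implementation: it assumes "information gain" means the Kullback-Leibler divergence (in bits) between the distribution of a term across documents and the baseline distribution over documents, and the names `toy_df`, `baseline`, and `sketch_gain` are hypothetical.

```r
library(dplyr)

# Toy tidy document-term matrix: one long document and one short one
toy_df = tribble(
  ~doc,    ~term,    ~n,
  'long',  'common', 900,
  'long',  'rare',   100,
  'short', 'common', 9,
  'short', 'niche',  1
)

# Baseline distribution over documents, proportional to document length
baseline = toy_df |>
  group_by(doc) |>
  summarize(len = sum(n)) |>
  mutate(q = len / sum(len)) |>
  select(doc, q)

# Assumed definition: divergence of p(doc | term) from the baseline q(doc), in bits
sketch_gain = toy_df |>
  inner_join(baseline, by = 'doc') |>
  group_by(term) |>
  mutate(p = n / sum(n)) |>
  summarize(total = sum(n), gain = sum(p * log2(p / q)))

sketch_gain
```

Under these assumptions, `niche` (found only in the short document) gains about 6.66 bits, `rare` (found only in the long document) gains about 0.014 bits, and `common` gains essentially nothing because it is spread in proportion to document length. Against a uniform baseline over the two documents, `rare` and `niche` would both gain exactly 1 bit, so the much larger count for `rare` would favor the long document's term.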
Arguments
- dataf
Tidy document-term matrix
- doc_col
Column of dataf with document IDs
- term_col
Column of dataf with terms
- count_col
Column of dataf with document-term counts
Examples
library(tidyverse)
library(tidytext)
library(janeaustenr)
austen_df = austen_books() |>
    unnest_tokens(term, text, token = 'words') |>
    mutate(author = 'Jane Austen') |>
    count(author, book, term)
ndR(austen_df, book, term, n)
#> # A tibble: 14,520 × 4
#>    term          n    dR   ndR
#>    <chr>     <int> <dbl> <dbl>
#>  1 elliot      254  3.12  24.9
#>  2 tilney      196  3.22  24.5
#>  3 anne        467  2.76  24.5
#>  4 elinor      623  2.60  24.1
#>  5 wentworth   191  3.12  23.6
#>  6 marianne    492  2.60  23.2
#>  7 thorpe      126  3.22  22.5
#>  8 morland     125  3.22  22.4
#>  9 allen       116  3.22  22.1
#> 10 darcy       373  2.57  21.9
#> # ℹ 14,510 more rows