TF-IDF – Introduction to Text Analytics with R Part 5
Part 5, on TF-IDF, includes specific coverage of:
• Discussion of how the document-term frequency matrix representation can be improved:
– How to deal with documents of unequal lengths.
– What to do about terms that are very common across documents.
• Introduction of the mighty term frequency-inverse document frequency (TF-IDF) to implement these improvements:
– TF for dealing with documents of unequal lengths.
– IDF for dealing with terms that appear frequently across documents.
• Implementation of TF-IDF using R functions and applying them to document-term frequency matrices.
• Data cleaning of matrices post weighting/transformation.
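The steps in the outline above can be sketched in a few lines of base R. This is a minimal illustration rather than the code from the video: the toy matrix, the function names (term_frequency, inverse_doc_freq), and the choice of a base-10 logarithm are all assumptions made for the example.

```r
# Toy document-term frequency matrix: rows are documents, columns are terms.
# The counts are invented purely for illustration.
dtm <- matrix(
  c(2, 1, 0,
    4, 0, 2,
    0, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("free", "call", "now"))
)

# TF: divide each document's counts by that document's total count,
# so long and short documents become comparable.
term_frequency <- function(row) {
  row / sum(row)
}

# IDF: log of (number of documents / number of documents containing the term),
# so terms that appear in many documents are weighted down.
# (Assumption: base-10 log; other bases only rescale the weights.)
inverse_doc_freq <- function(col) {
  corpus_size <- length(col)
  doc_count <- sum(col > 0)
  log10(corpus_size / doc_count)
}

# apply() over rows returns a terms-by-documents matrix, so transpose it back.
tf <- t(apply(dtm, 1, term_frequency))
idf <- apply(dtm, 2, inverse_doc_freq)

# Multiply each term column of tf by that term's IDF weight.
tfidf <- sweep(tf, 2, idf, `*`)

# Cleaning after the transformation: doc3 contains no terms, so its TF is
# 0 / 0 = NaN. Replace every incomplete row with zeros.
incomplete <- !complete.cases(tfidf)
tfidf[incomplete, ] <- 0

print(round(tfidf, 3))
```

In this toy matrix, doc1 and doc2 get the same TF for "free" (2/3 and 4/6) even though doc2 uses the word twice as often, which is the length normalization at work. The empty doc3 shows why a cleaning pass is needed once the weights have been applied.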
Kaggle Spam Data Set
The data and R code here
Introduction to Text Analytics with R