TF-IDF – Introduction to Text Analytics with R Part 5
This part of the series covers TF-IDF, specifically:
• Discussion of how the document-term frequency matrix representation can be improved:
– How to deal with documents of unequal lengths.
– What to do about terms that are very common across documents.
• Introduction of the mighty term frequency-inverse document frequency (TF-IDF) to implement these improvements:
– TF for dealing with documents of unequal lengths.
– IDF for dealing with terms that appear frequently across documents.
• Implementation of TF-IDF using R functions and applying them to document-term frequency matrices.
• Data cleaning of matrices post weighting/transformation.
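
The topics above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the code from the series: the toy matrix, the function names (term_frequency, inverse_doc_freq) and the choice of a base-10 logarithm are all hypothetical. It shows TF normalising each document by its length, IDF down-weighting terms that occur in many documents, and a cleanup step for documents that end up with no terms at all.

```r
# Toy document-term frequency matrix (made-up counts for illustration).
# Rows are documents, columns are terms. doc3 is deliberately empty.
dtm <- matrix(
  c(2, 1, 0,
    1, 0, 3,
    0, 0, 0,
    4, 2, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("the", "free", "call"))
)

# TF: divide each count by the document's total count, so long and
# short documents become comparable. (Hypothetical helper name.)
term_frequency <- function(row) {
  row / sum(row)
}

# IDF: log of (number of documents / number of documents containing
# the term). Base 10 is an assumption; other bases are also used.
inverse_doc_freq <- function(col) {
  corpus_size <- length(col)
  doc_count <- sum(col > 0)
  log10(corpus_size / doc_count)
}

# apply() over rows returns one column per document, so transpose back
# to the documents-by-terms layout.
tf <- t(apply(dtm, 1, term_frequency))

# One IDF weight per term (column).
idf <- apply(dtm, 2, inverse_doc_freq)

# TF-IDF: scale every term column of the TF matrix by its IDF weight.
tfidf <- sweep(tf, 2, idf, `*`)

# Post-transformation cleaning: an empty document gives 0 / 0 = NaN in
# the TF step. Find those rows and reset them to zero.
incomplete <- which(!complete.cases(tfidf))
tfidf[incomplete, ] <- 0

print(round(tfidf, 3))
```

Resetting incomplete rows to zero is one possible cleanup choice; dropping those documents is another, depending on what the downstream model expects.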
The Kaggle dataset can be found here
The data and R code used in this series are available here