TF-IDF – Introduction to Text Analytics with R Part 5

This part of the series covers TF-IDF, including:

• Discussion of how the document-term frequency matrix representation can be improved:
– How to deal with documents of unequal lengths.
– What to do about terms that are very common across documents.
• Introduction of the mighty term frequency-inverse document frequency (TF-IDF) to implement these improvements:
– TF for dealing with documents of unequal lengths.
– IDF for dealing with terms that appear frequently across documents.
• Implementation of TF-IDF using R functions and applying them to document-term frequency matrices.
• Data cleaning of matrices post weighting/transformation.
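
The steps listed above can be sketched in base R roughly as follows. This is a minimal illustration, not the code used in the series: the toy matrix, the function names (term_frequency, inverse_doc_frequency), and the choice of log10(N / df) as the IDF variant are all assumptions made for this example. Other IDF variants (natural log, smoothing) are equally common.

```r
# Toy document-term frequency matrix (hypothetical data):
# rows are documents, columns are terms, cells are raw counts.
dtm <- matrix(
  c(2, 1, 0,
    4, 0, 4,
    0, 0, 0),   # doc3 lost all of its terms during preprocessing
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("free", "call", "now"))
)

# TF: divide each row by its total, so long and short documents
# are put on the same scale. An empty document yields 0/0 = NaN.
term_frequency <- function(m) m / rowSums(m)

# IDF (one common variant, assumed here): log10 of corpus size over
# the number of documents containing the term. Terms that appear in
# many documents get a weight close to zero.
inverse_doc_frequency <- function(m) log10(nrow(m) / colSums(m > 0))

tf  <- term_frequency(dtm)
idf <- inverse_doc_frequency(dtm)

# TF-IDF: scale each term column of the TF matrix by that term's IDF.
tfidf <- sweep(tf, 2, idf, `*`)

# Post-weighting cleanup: rows for empty documents are all NaN,
# so replace them with zeros before modeling.
incomplete <- rowSums(is.na(tfidf)) > 0
tfidf[incomplete, ] <- 0

print(round(tfidf, 3))
```

The cleanup step matters because a document whose terms were all removed during preprocessing has a row sum of zero, and most modeling functions will not accept the resulting NaN values.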

Kaggle Dataset:
Kaggle Spam Data Set

The data and R code are available here

Full Series:
Introduction to Text Analytics with R

More Data Science Material:
[Video] Subset, Reshape, and Summarize Data – Introduction to dplyr Part 2
[Blog] Text Analytics: Make Text Machine-Readable


About The Author
- Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.
