TF-IDF – Introduction to Text Analytics with R
TF-IDF includes specific coverage of:
• Discussion of how the document-term frequency matrix representation can be improved:
– How to deal with documents of unequal lengths.
– What to do about terms that are very common across documents.
•Introduction of the mighty term frequency-inverse document frequency (TF-IDF) to implement these improvements:
-TF for dealing with documents of unequal lengths.
-IDF for dealing with terms that appear frequently across documents.
• Implementation of TF-IDF using R functions and applying TF-IDF to document-term frequency matrices.
• Data cleaning of matrices post TF-IDF weighting/transformation.
Kaggle Dataset can be found here
The data and R code used in this series is available here
About the Series
This data science tutorial introduces the viewer to the exciting world of text analytics with R programming. As exemplified by the popularity of blogging and social media, textual data if far from dead – it is increasing exponentially! Not surprisingly, knowledge of text analytics is a critical skill for data scientists if this wealth of information is to be harvested and incorporated into data products. This data science training provides introductory coverage of the following tools and techniques:
– Tokenization, stemming, and n-grams
– The bag-of-words and vector space models
– Feature engineering for textual data (e.g. cosine similarity between documents)
– Feature extraction using singular value decomposition (SVD)
– Training classification models using textual data
– Evaluating accuracy of the trained classification models