Introduction to Text Analytics with R – Part 6: N-grams
This data science tutorial introduces the viewer to the exciting world of text analytics with R programming. As exemplified by the popularity of blogging and social media, textual data is far from dead – it is increasing exponentially! Not surprisingly, knowledge of text analytics is a critical skill for data scientists if this wealth of information is to be harvested and incorporated into data products.
In Part 6 of this video series, we:
• Validate the effectiveness of TF-IDF in improving model accuracy.
• Introduce the concept of N-grams as an extension to the bag-of-words model to allow for word ordering.
• Discuss the trade-offs involved in using N-grams and how Text Analytics suffers from the “Curse of Dimensionality”.
• Illustrate how quickly Text Analytics can strain the limits of your computer hardware.
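To make the N-gram idea and its cost concrete, here is a minimal base R sketch. The toy corpus and the helper names `tokenize` and `make_ngrams` are hypothetical and are not taken from the series' code, which may use different packages and data. The sketch counts how many distinct features bigrams add on top of unigrams.

```r
# Hypothetical toy corpus; not the data set used in the video series.
docs <- c("the cat sat on the mat",
          "the dog sat on the log",
          "the cat chased the dog")

# Split one document into lowercase word tokens on whitespace.
tokenize <- function(text) {
  strsplit(tolower(text), "\\s+")[[1]]
}

# Build n-grams by joining each run of n consecutive tokens with "_".
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

tokens_list <- lapply(docs, tokenize)

# Distinct features under a plain bag-of-words model.
unigrams <- unique(unlist(tokens_list))

# Distinct features contributed by bigrams, which retain local word order.
bigrams <- unique(unlist(lapply(tokens_list, make_ngrams, n = 2)))

length(unigrams)  # 8 distinct words
length(bigrams)   # 10 distinct bigrams
```

Even on three short sentences, adding bigrams more than doubles the number of columns in the document-term matrix (8 to 18). On a real corpus the growth is far steeper, which is where the dimensionality and hardware concerns come from.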
This data science training provides introductory coverage of the following tools and techniques:
– Tokenization, stemming, and n-grams
– The bag-of-words and vector space models
– Feature engineering for textual data (e.g. cosine similarity between documents)
– Feature extraction using singular value decomposition (SVD)
– Training classification models using textual data
– Evaluating accuracy of the trained classification models
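As a rough illustration of two techniques from the list above, the following base R sketch computes cosine similarity between documents and reduces a document-term matrix with SVD. The matrix `dtm`, its terms, and the helper `cosine_sim` are hypothetical, and base R's `svd()` is used here only for simplicity. The series itself may rely on other packages.

```r
# Hypothetical document-term matrix of raw counts (rows are documents).
dtm <- matrix(c(2, 1, 0, 0,
                1, 2, 0, 0,
                0, 0, 3, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("doc1", "doc2", "doc3"),
                              c("free", "win", "meeting", "lunch")))

# Cosine similarity: dot product divided by the product of vector lengths.
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

cosine_sim(dtm["doc1", ], dtm["doc2", ])  # 0.8, shared vocabulary
cosine_sim(dtm["doc1", ], dtm["doc3", ])  # 0, no terms in common

# Feature extraction with SVD: keep only the k largest singular values
# and project each document onto the corresponding directions.
k <- 2
decomp <- svd(dtm)
reduced <- decomp$u[, 1:k, drop = FALSE] %*% diag(decomp$d[1:k], nrow = k)

dim(reduced)  # 3 documents by 2 extracted features
```

The reduced matrix has one row per document and only `k` columns, so it can stand in for the much wider document-term matrix when training a classifier.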
The data and R code used in this series are available via the public GitHub: