Introduction to Text Analytics with R – Part 2: Text Analytics Fundamentals

This data science tutorial introduces the viewer to the exciting world of text analytics with R programming. As the popularity of blogging and social media shows, textual data is far from dead – the volume of it keeps growing rapidly! Not surprisingly, knowledge of text analytics is a critical skill for data scientists if this wealth of information is to be harvested and incorporated into data products. This data science training provides introductory coverage of the following tools and techniques:

– Tokenization, stemming, and n-grams
– The bag-of-words and vector space models
– Feature engineering for textual data (e.g. cosine similarity between documents)
– Feature extraction using singular value decomposition (SVD)
– Training classification models using textual data
– Evaluating accuracy of the trained classification models
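
To make two of the ideas above concrete, here is a minimal sketch of a bag-of-words matrix and cosine similarity between documents. It uses base R only, and the three messages are made up for illustration (they are not taken from the SMS dataset). The helper name cosine_similarity is hypothetical, and the code shown in the videos may well differ.

```r
# Toy corpus: made-up messages, not rows from the SMS spam dataset
docs <- c(
  "free prize call now",
  "call me now",
  "are we meeting for lunch"
)

# Tokenization: lower-case each document and split it on whitespace
tokens <- strsplit(tolower(docs), "\\s+")

# Bag-of-words: one row per document, one column per vocabulary term,
# each cell holding how often the term occurs in the document
vocab <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))

# Cosine similarity between two term-count vectors (hypothetical helper)
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

sim_12 <- cosine_similarity(dtm[1, ], dtm[2, ])  # share "call" and "now"
sim_13 <- cosine_similarity(dtm[1, ], dtm[3, ])  # share no terms

print(dtm)
print(c(sim_12 = sim_12, sim_13 = sim_13))
```

Word order is discarded in this representation; only term counts survive. That is why the first two messages score as similar and the third, which shares no words with the first, scores zero.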

Part 2 of this video series includes specific coverage of:

– The importance of splitting data into training and test datasets
– Stratified sampling of imbalanced data using the caret package
– Representing text data for the purposes of machine learning
– Introduction to tokenization, stop words, and stemming
– The bag-of-words model for text analytics
– Text analytics considerations for data pre-processing
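
As a rough illustration of the stratified split mentioned above, the sketch below samples 70% of the rows within each class so that the training and test sets keep roughly the same class proportions. It deliberately uses base R rather than the caret package to stay dependency-free; caret's createDataPartition function serves a similar purpose. The label vector, the 70/30 ratio, and the seed are all assumptions made for this example.

```r
set.seed(32984)  # arbitrary seed, only for reproducibility of the sketch

# Toy imbalanced labels (made up): 87 "ham" and 13 "spam" messages
labels <- factor(c(rep("ham", 87), rep("spam", 13)))

# Stratified sampling: draw 70% of the row indices separately within each class
train_idx <- unlist(
  lapply(split(seq_along(labels), labels), function(idx) {
    idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
  }),
  use.names = FALSE
)

train_labels <- labels[train_idx]
test_labels  <- labels[-train_idx]

# Class proportions should be close to each other in both sets
print(prop.table(table(train_labels)))
print(prop.table(table(test_labels)))
```

A purely random split of imbalanced data can leave the minority class under-represented in one of the two sets; sampling within each class avoids that.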

Kaggle Dataset:
https://www.kaggle.com/uciml/sms-spam-collection-dataset

The data and R code used in this series are available in the public GitHub repository:
https://github.com/datasciencedojo/IntroToTextAnalyticsWithR


About The Author
- Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.
