Data Pipelines – Introduction to Text Analytics with R Part 3
November 26, 2013 3:43 am
In this installment of our introduction to text analytics, data pipelines, we cover:
– Exploration of textual data for pre-processing “gotchas”
– Using the quanteda package for text analytics
– Creation of a prototypical text analytics pre-processing pipeline, including (but not limited to): tokenization, lower casing, stop word removal, and stemming.
– Creation of a document-feature matrix (term frequencies per document) used to train machine learning models
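As a rough illustration of the pipeline steps listed above, here is a minimal sketch using quanteda. The two sample messages are made up and stand in for the text column of the spam data set, and the option choices (for example, which token types to drop) are assumptions rather than the exact settings used in the video. Argument names follow recent quanteda releases and may differ in older versions.

```r
library(quanteda)

# Hypothetical sample messages standing in for the spam data set's text column.
texts <- c(
  "WINNER!! You have won a FREE prize, call now",
  "Are we still meeting for lunch today?"
)

# 1. Tokenize into words, dropping punctuation, numbers, and symbols.
toks <- tokens(texts, what = "word",
               remove_punct = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE)

# 2. Lower-case every token.
toks <- tokens_tolower(toks)

# 3. Remove English stop words using quanteda's built-in list.
toks <- tokens_remove(toks, stopwords("en"))

# 4. Stem the remaining tokens.
toks <- tokens_wordstem(toks, language = "english")

# 5. Build the matrix of token counts per document (a quanteda dfm).
#    Tokens are already lower-cased, so skip that step here.
toks_dfm <- dfm(toks, tolower = FALSE)

# Rows are documents, columns are the distinct stemmed tokens.
dim(toks_dfm)
```

Converting the result with `as.matrix(toks_dfm)` gives a plain numeric matrix that can be turned into a data frame for model training, although this becomes memory-hungry on larger corpora.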
Kaggle Dataset:
Kaggle Spam Data Set
The data and R code here
Full Series:
Introduction to Text Analytics with R