Data Pipelines – Introduction to Text Analytics with R Part 3

In this next installment of Introduction to Text Analytics with R, covering data pipelines, we cover:

– Exploration of textual data for pre-processing “gotchas”
– Using the quanteda package for text analytics
– Creation of a prototypical text analytics pre-processing pipeline, including (but not limited to): tokenization, lowercasing, stop word removal, and stemming
– Creation of a document-feature matrix (DFM) used to train machine learning models (a minimal quanteda sketch of these steps follows this list)
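
To make these steps concrete, here is a minimal sketch of the pipeline using quanteda. It assumes the Kaggle spam data have been saved locally as spam.csv with the first two columns renamed to Label and Text; the file name and column names are assumptions for illustration, and the full walkthrough in the video covers each step in more detail.

```r
library(quanteda)

# Load the raw data (file name and column names are assumed for illustration).
spam_raw <- read.csv("spam.csv", stringsAsFactors = FALSE)
spam_raw <- spam_raw[, 1:2]
names(spam_raw) <- c("Label", "Text")

# Explore the text for pre-processing "gotchas": class balance,
# incomplete records, and the distribution of message lengths.
table(spam_raw$Label)
sum(!complete.cases(spam_raw))
spam_raw$TextLength <- nchar(spam_raw$Text)
summary(spam_raw$TextLength)

# Pre-processing pipeline: tokenize, lowercase, remove stop words, and stem.
train_tokens <- tokens(spam_raw$Text, what = "word",
                       remove_punct = TRUE, remove_numbers = TRUE,
                       remove_symbols = TRUE)
train_tokens <- tokens_tolower(train_tokens)
train_tokens <- tokens_remove(train_tokens, stopwords("english"))
train_tokens <- tokens_wordstem(train_tokens, language = "english")

# Create the document-feature matrix (DFM) used to train machine learning models.
train_dfm <- dfm(train_tokens)
dim(train_dfm)
```

Each tokens_* function returns a new tokens object, so the same steps can also be chained with pipes; the sparse DFM produced at the end is the representation handed off to model training later in the series.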

Kaggle Dataset:
Kaggle Spam Data Set

The data and R code are available here

Full Series:
Introduction to Text Analytics with R

More Data Science Material:
[Video] What is a Data Engineer?
[Blog]  The 4 Pillars of Data Democratization

About The Author
Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.
