Introduction to Text Analytics with R – Part 11: Our First Test
This data science tutorial introduces the viewer to the exciting world of text analytics with R programming. As exemplified by the popularity of blogging and social media, textual data is far from dead – it is increasing exponentially! Not surprisingly, knowledge of text analytics is a critical skill for data scientists if this wealth of information is to be harvested and incorporated into data products. This data science training provides introductory coverage of the following tools and techniques:
– Tokenization, stemming, and n-grams
– The bag-of-words and vector space models
– Feature engineering for textual data (e.g. cosine similarity between documents)
– Feature extraction using singular value decomposition (SVD)
– Training classification models using textual data
– Evaluating accuracy of the trained classification models
Part 11 of this video series includes specific coverage of:
– Pre-processing new, unseen textual data to allow for predictions from our trained model.
– The importance of caching the IDF values calculated from the training data set, so that the same term weights are applied when computing TF-IDF for new, unseen, pre-processed data.
– Performing SVD projections of new, unseen, pre-processed textual data into the latent semantic space.
– Creating predictions and evaluating model effectiveness in the context of accuracy, sensitivity, and specificity.
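The steps listed above can be sketched in base R. This is a minimal illustration, not the code from the video series: the toy count matrices, the helper names (term_frequency, tf_idf, project_to_latent, classification_metrics), the log10 form of IDF, the choice of k = 2, and the "pos"/"neg" labels are all assumptions made for the example. It also assumes documents are stored as rows and that the training features are the left singular vectors, so a new document is projected by multiplying by the right singular vectors and the inverse singular values.

```r
# Hypothetical toy document-term count matrices (rows = documents).
terms <- c("free", "call", "now")
train_counts <- matrix(c(2, 0, 1,
                         0, 1, 1,
                         1, 1, 0,
                         0, 2, 1),
                       nrow = 4, byrow = TRUE,
                       dimnames = list(paste0("train", 1:4), terms))
# New, unseen data must already be restricted to the training vocabulary,
# with columns in the same order.
test_counts <- matrix(c(1, 0, 1,
                        0, 3, 0),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(paste0("test", 1:2), terms))

# Term frequency: each row divided by its document length.
# (A document with no known terms would give NaN and needs separate handling.)
term_frequency <- function(counts) counts / rowSums(counts)

# IDF is computed once, from the training data only, and cached.
train_idf <- log10(nrow(train_counts) / colSums(train_counts > 0))

# TF-IDF: scale each term column by the cached IDF weight.
tf_idf <- function(counts, idf) sweep(term_frequency(counts), 2, idf, `*`)

train_tfidf <- tf_idf(train_counts, train_idf)
test_tfidf  <- tf_idf(test_counts, train_idf)   # reuses the training IDF

# SVD of the training matrix: train_tfidf = U D V'.
k <- 2                                # assumed number of latent dimensions
s <- svd(train_tfidf)
v_k <- s$v[, 1:k, drop = FALSE]
d_k <- s$d[1:k]

# Project documents into the latent space: X V_k D_k^{-1}.
project_to_latent <- function(tfidf, v, d) {
  tfidf %*% v %*% diag(1 / d, nrow = length(d))
}

train_latent <- project_to_latent(train_tfidf, v_k, d_k)  # equals U[, 1:k]
test_latent  <- project_to_latent(test_tfidf, v_k, d_k)

# Accuracy, sensitivity, and specificity from actual vs. predicted labels.
classification_metrics <- function(actual, predicted, positive) {
  tp <- sum(actual == positive & predicted == positive)
  tn <- sum(actual != positive & predicted != positive)
  fp <- sum(actual != positive & predicted == positive)
  fn <- sum(actual == positive & predicted != positive)
  c(accuracy    = (tp + tn) / length(actual),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

# Hypothetical labels, standing in for real model predictions.
actual    <- c("pos", "pos", "neg", "neg", "neg")
predicted <- c("pos", "neg", "neg", "neg", "pos")
metrics <- classification_metrics(actual, predicted, positive = "pos")
```

Recomputing IDF on the test data would weight terms differently from what the model saw during training, which is why train_idf is passed to both tf_idf calls. Projecting the training matrix through project_to_latent recovers the first k left singular vectors, which is a quick sanity check that new documents land in the same space as the training features.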
The data and R code used in this series are available via the public