R tutorial: In this video tutorial you will learn how to write standard web scraping commands in R, filter recent data based on time differences, analyze or summarize key information in the text, and send an email alert of the …
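The time-diff filtering step can be sketched in base R. This assumes the scraped items have already been collected into a data frame with a POSIXct timestamp column; the `title` and `timestamp` names are hypothetical, and the scraping itself (typically done with a package such as rvest) is omitted.

```r
# Hypothetical result of a scraping run: one row per scraped item.
posts <- data.frame(
  title     = c("old post", "recent post"),
  timestamp = as.POSIXct(c("2020-01-01 08:00:00", "2020-01-02 09:30:00"),
                         tz = "UTC")
)

# Fixed "current" time so the example is reproducible.
now <- as.POSIXct("2020-01-02 10:00:00", tz = "UTC")

# Keep only items published within the last 2 hours.
recent <- posts[difftime(now, posts$timestamp, units = "hours") <= 2, ]
recent$title  # only "recent post" survives the filter
```

In a real alerting script, `now` would be `Sys.time()` and `recent` would feed the summarization and email steps.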
In this conclusion to Text Analytics with R we cover topics such as:
– Optimizing our model for the best generalization on new/unseen data.
– Discussion of the sensitivity/specificity trade-off of our optimized model.
– Potential next steps regarding feature …
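The sensitivity/specificity trade-off discussed above can be illustrated with a small base-R sketch: the predicted probabilities and labels below are hypothetical, but they show how raising the decision threshold trades sensitivity for specificity.

```r
# Hypothetical predicted probabilities and true labels (1 = positive class).
probs  <- c(0.95, 0.80, 0.60, 0.40, 0.20, 0.05)
labels <- c(1, 1, 0, 1, 0, 0)

# Sensitivity and specificity at a given probability threshold.
metrics <- function(threshold) {
  pred <- as.integer(probs >= threshold)
  sens <- sum(pred == 1 & labels == 1) / sum(labels == 1)  # true positive rate
  spec <- sum(pred == 0 & labels == 0) / sum(labels == 0)  # true negative rate
  c(sensitivity = sens, specificity = spec)
}

metrics(0.7)  # sensitivity 2/3, specificity 1
metrics(0.3)  # sensitivity 1,   specificity 2/3
```

Optimizing a model for generalization includes choosing where on this curve the business problem needs it to sit.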
Your First Test includes specific coverage of:
– Pre-processing new, unseen textual data to allow for predictions from our trained model.
– The importance of caching the IDF values calculated from the training data set to TF-IDF new, unseen, pre-processed …
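The IDF-caching point can be sketched in base R with a toy document-term matrix: the IDF weights are computed once from the training data and then reused, unchanged, to TF-IDF any new document. The two-term vocabulary here is hypothetical.

```r
# Toy training document-term matrix: 3 documents x 2 terms.
train_dtm <- matrix(c(1, 0,
                      1, 1,
                      0, 2), nrow = 3, byrow = TRUE,
                    dimnames = list(NULL, c("data", "science")))

# Cache IDF from the TRAINING data only: log(N / document frequency).
idf <- log(nrow(train_dtm) / colSums(train_dtm > 0))

# Term frequency normalized by document length.
tf <- function(dtm) dtm / pmax(rowSums(dtm), 1)

train_tfidf <- sweep(tf(train_dtm), 2, idf, `*`)

# A new, unseen document mapped to the same vocabulary: note it is weighted
# with the cached training IDF, never with IDF re-fit on the new data.
new_dtm   <- matrix(c(2, 1), nrow = 1,
                    dimnames = list(NULL, c("data", "science")))
new_tfidf <- sweep(tf(new_dtm), 2, idf, `*`)
```

Re-computing IDF on new data would put the new vectors in a different space than the one the model was trained in.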
Cosine Similarity includes specific coverage of:
– How cosine similarity is used to measure similarity between documents in vector space.
– The mathematics behind cosine similarity.
– Using cosine similarity in text analytics feature engineering.
– Evaluation of the effectiveness …
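The mathematics is compact enough to show directly: cosine similarity is the dot product of two document vectors divided by the product of their norms, i.e. the cosine of the angle between them in vector space. A minimal base-R version, with hypothetical TF-IDF vectors:

```r
# Cosine of the angle between two document vectors.
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

doc1 <- c(1, 2, 0)   # hypothetical TF-IDF vectors
doc2 <- c(2, 4, 0)   # same direction as doc1, different magnitude
doc3 <- c(0, 0, 3)   # no terms in common with doc1

cosine_sim(doc1, doc2)  # 1: identical orientation, length ignored
cosine_sim(doc1, doc3)  # 0: orthogonal documents
```

Because only the angle matters, cosine similarity is insensitive to document length, which is exactly why it suits document comparison.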
Model Metrics includes specific coverage of:
– The importance of metrics beyond accuracy for building effective models.
– Coverage of sensitivity and specificity and their importance for building effective binary classification models.
– The importance of feature engineering for building …
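Sensitivity and specificity fall straight out of a 2x2 confusion matrix. The sketch below uses base R and hypothetical spam/ham labels; `caret::confusionMatrix` reports the same quantities in the course's workflow.

```r
# Hypothetical predictions and ground truth for a binary classifier.
pred  <- factor(c("spam", "spam", "ham", "ham", "spam"),
                levels = c("spam", "ham"))
truth <- factor(c("spam", "ham",  "ham", "spam", "spam"),
                levels = c("spam", "ham"))

cm <- table(Predicted = pred, Actual = truth)

sensitivity <- cm["spam", "spam"] / sum(cm[, "spam"])  # TP / (TP + FN)
specificity <- cm["ham",  "ham"]  / sum(cm[, "ham"])   # TN / (TN + FP)
accuracy    <- sum(diag(cm)) / sum(cm)
```

Here accuracy alone (3/5) hides that the model catches only 2 of 3 actual spam messages, which is the point of looking beyond accuracy.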
SVD with R includes specific coverage of:
– Use of the irlba package to perform truncated SVD.
– How to project a TF-IDF document vector into the SVD semantic space (i.e., LSA).
– Comparison of model performance between a single …
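The LSA projection step can be sketched with base R's `svd()`; the course uses `irlba::irlba` to compute a truncated SVD efficiently on large TF-IDF matrices, but the projection math is identical. The toy matrix here is random, purely for illustration.

```r
set.seed(42)
tfidf <- matrix(runif(20), nrow = 5)    # toy: 5 documents x 4 terms

k   <- 2                                # semantic dimensions to keep
dec <- svd(tfidf)
V_k <- dec$v[, 1:k]                     # term loadings
d_k <- dec$d[1:k]                       # singular values

# Training documents in the k-dimensional semantic space (equals U_k).
docs_lsa <- tfidf %*% V_k %*% diag(1 / d_k)

# Project a new TF-IDF document vector into the SAME semantic space.
new_doc <- matrix(runif(4), nrow = 1)
new_lsa <- new_doc %*% V_k %*% diag(1 / d_k)
dim(new_lsa)  # 1 x 2
```

As with IDF caching, `V_k` and `d_k` come from the training data and are reused for every new document.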
Part 7 of this video series includes specific coverage of LSA, VSM, & SVD:
– The trade-offs of expanding the text analytics feature space with n-grams.
– How bag-of-words representations map to the vector space model (VSM).
– Usage of …
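The bag-of-words-to-VSM mapping can be shown in a few lines of base R: each document becomes a row vector of term counts over a shared vocabulary, so the corpus is a matrix in term space. The tiny corpus below is hypothetical.

```r
# Two toy documents, already tokenized.
docs <- list(c("the", "cat", "sat"),
             c("the", "dog", "sat", "sat"))

# Shared vocabulary defines the axes of the vector space.
vocab <- sort(unique(unlist(docs)))

# Document-term matrix: one row per document, one column per term.
dtm <- t(sapply(docs, function(tokens) {
  as.integer(table(factor(tokens, levels = vocab)))
}))
colnames(dtm) <- vocab
dtm
#      cat dog sat the
# [1,]   1   0   1   1
# [2,]   0   1   2   1
```

Each row is a point in 4-dimensional term space; adding n-gram columns expands that space, which is the trade-off the video discusses.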
N-grams includes specific coverage of:
• Validate the effectiveness of TF-IDF in improving model accuracy.
• Introduce the concept of N-grams as an extension to the bag-of-words model to allow for word ordering.
• Discuss the trade-offs involved in using N-grams …
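N-gram construction is just a sliding window over the token stream. A minimal base-R generator (packages such as quanteda provide this via `tokens_ngrams()`, but the idea is the same):

```r
# Build n-grams by sliding a window of size n over the tokens.
ngrams <- function(tokens, n = 2) {
  if (length(tokens) < n) return(character(0))
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"))
}

ngrams(c("the", "cat", "sat", "down"), n = 2)
# "the_cat" "cat_sat" "sat_down"
```

Each bigram becomes an extra column in the document-term matrix, preserving some word order at the cost of a much larger feature space.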
TF-IDF includes specific coverage of:
• Discussion of how the document-term frequency matrix representation can be improved:
– How to deal with documents of unequal lengths.
– What to do about terms that are very common across documents.
• Introduction of …
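Both improvements can be seen in a toy base-R example: dividing counts by document length handles unequal lengths, and IDF zeroes out terms that occur in every document. The two-term vocabulary is hypothetical.

```r
# Toy counts: "the" appears everywhere, "lexicon" only in document 1.
counts <- matrix(c(10, 1,
                    2, 0), nrow = 2, byrow = TRUE,
                 dimnames = list(NULL, c("the", "lexicon")))

# Fix 1: length-normalized term frequency.
tf <- counts / rowSums(counts)

# Fix 2: inverse document frequency down-weights ubiquitous terms.
idf <- log(nrow(counts) / colSums(counts > 0))
idf["the"]  # 0: a term in every document carries no discriminating signal

tfidf <- sweep(tf, 2, idf, `*`)
tfidf[, "the"]  # zeroed out entirely by IDF
```

The rarer term "lexicon" keeps a positive weight, so the representation now emphasizes what distinguishes documents.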
We are now ready to build our first model in RStudio. Our model building will cover:
– Correcting column names derived from tokenization to ensure smooth model training.
– Using caret to set up stratified cross …
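The column-name fix can be sketched with base R's `make.names()`: tokenization produces terms (emoticons, numbers, punctuation) that are not valid R identifiers, which breaks formula-based model training until the names are repaired. The example columns are hypothetical; the stratified folds themselves come from caret (e.g. `createMultiFolds()`) in the course workflow.

```r
# Document-term matrix whose columns came straight from tokenization.
dtm <- matrix(0, nrow = 2, ncol = 3,
              dimnames = list(NULL, c(":)", "1st", "data")))

# Repair the names into syntactically valid R identifiers.
colnames(dtm) <- make.names(colnames(dtm))
colnames(dtm)  # every column is now a legal identifier; "1st" becomes "X1st"
```

Without this step, converting the matrix to a data frame and training with a formula interface fails or silently mangles names.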