In the first part of our Introduction to R series, we take you through how to download the R Programming Language for Mac, Windows or Linux. R is a language for statistical computing and graphics and has several built-in statistical and structural tools for programming. We will also cover how to download RStudio, which
Here we introduce some common benefits and disadvantages to using the R language. R is designed for statistical analysis, and has some great built-in functions and libraries. But it is also an older program, with an older language. Watch to learn more about the benefits and disadvantages of R programming! Download R Here Download
In this Introduction to R tutorial, we dive into R’s interface, functions, and variables. We will take you through setting up your interface in R, and how to assign basic functions and variables to your program. Download R Here Download RStudio Here
In this introduction to R tutorial, we go deeper into how R functions, and introduce data types and the 5 atomic classes. R has three number types, a character type, and a logical type. Download R Here Download RStudio Here
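As a rough illustration of the five atomic classes mentioned above, the sketch below creates one value of each and asks R for its class. The variable names are our own and are not taken from the video.

```r
# One value for each of the five atomic classes
x_numeric   <- 3.14      # double-precision number
x_integer   <- 42L       # the L suffix requests an integer
x_complex   <- 2 + 3i    # complex number
x_character <- "hello"   # character string
x_logical   <- TRUE      # logical (boolean) value

# Ask R which class each value belongs to
classes <- sapply(
  list(x_numeric, x_integer, x_complex, x_character, x_logical),
  class
)
print(classes)
```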
In this introduction to R programming tutorial, we introduce the basics of vectors in the R programming language. A vector is the most basic compound object in R. Every object in R is represented by a vector, so understanding how to create and use vectors is an essential step to working with R. Download R
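A minimal sketch of creating and using a vector follows. The values are made up for illustration; the point is that `c()` builds a vector, indexing starts at 1, and arithmetic is applied element by element.

```r
# Build a numeric vector with c()
v <- c(10, 20, 30, 40)

second  <- v[2]        # indexing is 1-based, so this is 20
large   <- v[v > 15]   # logical indexing keeps 20, 30, 40
doubled <- v * 2       # arithmetic is applied to every element

# Even a single number is a vector of length 1
single <- 5
print(is.vector(single))
print(length(single))
```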
In this introduction to R programming tutorial, we dive deeper into the basics of vectors in the R programming language. You will learn how to work with character vectors and how to use them for encoding categorical data. Download R Here Download RStudio Here
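One common way to encode categorical data from a character vector is a factor. The sketch below assumes a made-up `sizes` vector and an ordering of levels that we chose ourselves.

```r
# A character vector holding a categorical variable
sizes <- c("small", "large", "medium", "small", "large")

# Convert to a factor, stating the levels explicitly so the order is under our control
size_factor <- factor(sizes, levels = c("small", "medium", "large"))

level_names <- levels(size_factor)      # the distinct categories
codes       <- as.integer(size_factor)  # the integer code stored for each element
counts      <- table(size_factor)       # how often each category occurs
print(counts)
```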
In this Introduction to R tutorial, we introduce the basics of matrices in the R programming language. In this first of two videos on the Basics of Matrices, you will learn the different ways to create and modify matrices, how to index and assign your matrix, and how to identify the dimensions of your
In this Introduction to R tutorial, we introduce the basics of matrices in the R programming language. In this second of two videos on the Basics of Matrices, you will learn more about naming and indexing the rows and columns of your matrices, and how to name those columns and rows in a way that
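The matrix operations described in these two videos can be sketched roughly as follows. The row and column names (`r1`, `a`, and so on) are placeholders of our own.

```r
# Create a 2 x 3 matrix; R fills it column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
dims <- dim(m)            # number of rows, then number of columns

# Index by position: row 2, column 3
by_position <- m[2, 3]    # 6

# Name the rows and columns, then index by name
rownames(m) <- c("r1", "r2")
colnames(m) <- c("a", "b", "c")
by_name <- m["r1", "b"]   # 3

# Modify the matrix by binding a new named row to it
m2 <- rbind(m, r3 = c(7, 8, 9))
print(m2)
```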
In this Introduction to R tutorial, we introduce data frames in the R programming language. Data frames are one of R’s core data structures: a collection of equal-length vectors (columns) that can store any kind of data you want in R! Download R Here Download RStudio Here
In this Introduction to R tutorial, we teach you how to work with lists in R. Lists are like data frames, except that their elements do not have to be vectors of equal length. We will show you how to construct a list and create names for your different list elements. Download R Here Download RStudio Here
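As a small sketch, a data frame can be built from vectors of the same length, one per column. The `people` data below is invented for illustration.

```r
# Each argument becomes a column; all columns must have the same length
people <- data.frame(
  name   = c("Ana", "Ben", "Cy"),
  age    = c(31, 25, 40),
  member = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

n_rows <- nrow(people)
n_cols <- ncol(people)

# Columns are accessed with $, and rows can be filtered with a logical condition
ages     <- people$age
over_30  <- people[people$age > 30, "name"]
print(over_30)
```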
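A minimal sketch of constructing a named list whose elements have different lengths and types; the element names are hypothetical.

```r
# Elements of a list may differ in both type and length
record <- list(
  id     = 7L,
  tags   = c("a", "b", "c"),
  scores = c(0.5, 0.9)
)

element_lengths <- lengths(record)       # length of each element: 1, 3, 2
second_score    <- record[["scores"]][2] # [[ ]] extracts the element itself

# New named elements can be added after construction
record$note <- "added later"
print(names(record))
```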
In this Introduction to R tutorial, we will continue explaining the data functions and types you can use in R. These functions and types include coercion, booleans, IS, and casting characters. Download R Here Download RStudio Here
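To make coercion and type checks concrete, here is a short sketch using base R's `as.*` and `is.*` functions. The example values are our own.

```r
# Mixing types in c() coerces everything to the most general type, here character
mixed <- c(1, "two", TRUE)

# Explicit coercion (casting) with as.* functions
num_from_text <- as.numeric("3.5")
int_from_bool <- as.integer(TRUE)   # TRUE becomes 1

# Type checks with is.* functions
check_numeric   <- is.numeric(3.5)
check_character <- is.character("a")

# Text that is not a number becomes NA (R normally warns about this)
bad <- suppressWarnings(as.numeric("abc"))

# Logicals are coerced to 0/1 in arithmetic, so sum() counts the TRUE values
true_count <- sum(c(TRUE, FALSE, TRUE))
print(mixed)
```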
Here we explain the missing values of the R programming language: NA, NaN, and NULL. By the end of this tutorial, you will understand the differences between these three kinds of missing values, and how to handle them. Download R Here Download RStudio Here
In this Introduction to R tutorial, we show you how to use third-party packages for the R programming language. You will learn about CRAN, the Comprehensive R Archive Network, which hosts a repository of thousands of R packages. You can install any package by name from that repository using the function install.packages.
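The differences between the three can be sketched as below: NA marks a missing value, NaN is the result of an undefined numeric operation, and NULL is the absence of an object. The vector `x` is made up.

```r
x <- c(1, NA, 3)

# NA propagates through calculations unless it is removed explicitly
mean_with_na    <- mean(x)                # NA
mean_without_na <- mean(x, na.rm = TRUE)  # 2

# NaN comes from undefined arithmetic; is.na() is TRUE for NaN as well
nan_value  <- 0 / 0
nan_checks <- c(is.nan(nan_value), is.na(nan_value), is.nan(NA))  # TRUE TRUE FALSE

# NULL has length zero and vanishes when combined into a vector
null_length     <- length(NULL)           # 0
combined_length <- length(c(1, NULL, 3))  # 2
print(is.null(NULL))
```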
In this Introduction to R tutorial, we introduce the built-in interfaces such as reading and writing text data. R has several built-in interfaces for text data reading and writing, and understanding how to utilize these will add valuable tools to your R tool set. Download R Here Download RStudio Here
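A minimal sketch of a write-then-read round trip with base R's text interfaces. It writes to a temporary file so nothing on disk is overwritten; the `scores` data is invented.

```r
scores <- data.frame(
  name  = c("Ana", "Ben"),
  score = c(90, 85),
  stringsAsFactors = FALSE
)

# Write the data frame to a temporary CSV file
path <- tempfile(fileext = ".csv")
write.csv(scores, path, row.names = FALSE)

# Read it back as a data frame, and also as raw lines of text
loaded    <- read.csv(path, stringsAsFactors = FALSE)
raw_lines <- readLines(path)   # one header line plus one line per row

unlink(path)                   # remove the temporary file
print(loaded)
```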
In this Introduction to R Tutorial, we dive deeper into the basics of the R language, and introduce you to control structures such as “if” statements. Download R Here Download RStudio Here
In this Introduction to R tutorial, we introduce the built-in functions for data exploration and alteration in R and RStudio. Functions covered include str, summary, mean, and median. Download R Here Download RStudio Here
In this Introduction to R tutorial, we continue with our introduction to the basic features of R, and show off the “apply functions” in the R programming language. Apply functions take in an array, data frame, vector or matrix, and apply a function to every column or row. Download R Here Download RStudio Here
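As a small sketch of an "if" statement in practice, the hypothetical `classify` function below picks a label for a number, and a `for` loop applies it to several values.

```r
# Return a label depending on the sign of x
classify <- function(x) {
  if (x < 0) {
    "negative"
  } else if (x == 0) {
    "zero"
  } else {
    "positive"
  }
}

# Apply the function to several values with a for loop
labels <- character(0)
for (value in c(-2, 0, 5)) {
  labels <- c(labels, classify(value))
}
print(labels)
```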
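A quick sketch of these exploration functions on a small, made-up data frame:

```r
df <- data.frame(
  height = c(150, 160, 170, 180, 200),
  group  = c("a", "a", "b", "b", "b"),
  stringsAsFactors = FALSE
)

str(df)                      # compact view of column types and first values
print(summary(df$height))    # minimum, quartiles, mean, and maximum

avg <- mean(df$height)       # 172
mid <- median(df$height)     # 170
```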
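The sketch below shows three members of the apply family in base R. In `apply`, the second argument selects rows (1) or columns (2); `sapply` and `lapply` walk over the elements of a vector or list.

```r
m <- matrix(1:6, nrow = 2)   # rows are (1, 3, 5) and (2, 4, 6)

row_sums  <- apply(m, 1, sum)    # one result per row: 9 and 12
col_means <- apply(m, 2, mean)   # one result per column: 1.5, 3.5, 5.5

# sapply returns a simplified vector; lapply always returns a list
squares <- sapply(1:3, function(i) i^2)
lens    <- lapply(list(a = 1:3, b = letters), length)
print(lens)
```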
In this Introduction to R tutorial, we introduce plotting packages in the R programming language. We will cover the three most common packages for plotting in R, how to access them, and how to utilize them. Download R Here Download RStudio Here
In the last video of our Introduction to R series, we finish explaining the basics of data exploration and visualization in R. We cover how to use lattice to generate histograms and how to work with ggplot. Download R Here Download RStudio Here
dplyr is a great tool to perform data manipulation. It makes your data analysis process a lot more efficient. Even better, it’s fairly simple to learn and start applying immediately to your work! Oftentimes, with just a few elegant lines of code, your data becomes that much easier to dissect and analyze. For these
We go over some basic functions of dplyr including the mighty group_by and summarize combo that makes dividing up datasets a breeze, as well as arrange, select, and filter that help get the data in a cleaner and more organized format. Group-by aggregation is one of the most powerful, yet simple, tools you can use
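To give a feel for group-by aggregation, here is a sketch on an invented `sales` table. It computes the per-group totals with base R's `tapply` so that it runs anywhere, and then repeats the same aggregation with dplyr's `group_by` and `summarize` only if dplyr happens to be installed.

```r
sales <- data.frame(
  region = c("east", "west", "east", "west"),
  amount = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Base R: split amount by region and sum each group
totals <- tapply(sales$amount, sales$region, sum)
print(totals)

# The same aggregation with dplyr, run only when the package is available
if (requireNamespace("dplyr", quietly = TRUE)) {
  by_region <- dplyr::summarize(
    dplyr::group_by(sales, region),
    total = sum(amount)
  )
  print(by_region)
}
```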
We introduce functions that make it easy to find overlapping and distinct values from two different data sources, intersect and setdiff. These two functions let you see the shared and unique elements from different vectors, making it easy to spot commonalities and differences. After watching this video, you’ll walk away feeling more empowered to tackle
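A minimal sketch of the two functions on plain vectors follows. It uses the base R versions, which behave this way for vectors; the ID values and variable names are made up.

```r
crm_ids <- c(101, 102, 103, 104)
web_ids <- c(103, 104, 105)

shared   <- intersect(crm_ids, web_ids)  # values found in both: 103, 104
only_crm <- setdiff(crm_ids, web_ids)    # in the first but not the second: 101, 102
only_web <- setdiff(web_ids, crm_ids)    # argument order matters: 105
print(shared)
```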
In this final tutorial of our Introduction to dplyr series, we will cover ways to do feature engineering both with dplyr and base R. You’ll learn how to impute missing values as well as create new values based on existing columns. In addition, we’ll go over four different ways to combine […]
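As a base R sketch of the two ideas named here, the code below imputes a missing value with the column median and then derives a new column from existing ones. Median imputation and the column names are our own choices for illustration, not necessarily what the video uses.

```r
df <- data.frame(
  age  = c(20, NA, 40),
  fare = c(10, 20, 30)
)

# Impute: replace missing ages with the median of the observed ages (30 here)
df$age[is.na(df$age)] <- median(df$age, na.rm = TRUE)

# Engineer a new feature from two existing columns
df$fare_per_year <- df$fare / df$age
print(df)
```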
The overview of this video series provides an introduction to text analytics as a whole and what is to be expected throughout the instruction. It also includes specific coverage of: – Overview of the spam dataset used throughout the series – Loading the data and initial data cleaning – Some initial data analysis, feature engineering,
Text analytics fundamentals covers: – The importance of splitting data into training and test datasets – Stratified sampling of imbalanced data using the caret package – Representing text data for the purposes of machine learning – Introduction to tokenization, stop words, and stemming – The bag-of-words model for text analytics – Text analytics considerations
In our next installment of introduction to text analytics, data pipelines, we cover: – Exploration of textual data for pre-processing “gotchas” – Using the quanteda package for text analytics – Creation of a prototypical text analytics pre-processing pipeline, including: tokenization, lower casing, stop word removal, and stemming. – Creation of […]
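To illustrate tokenization, stop word removal, and the bag-of-words model, here is a dependency-free sketch in base R. The three documents and the tiny stop word list are invented, and stemming is left out to keep the example short.

```r
docs <- c("Free prize now", "Call me now", "Free call")

# Tokenize: lower-case the text and split on anything that is not a letter
tokens <- strsplit(tolower(docs), "[^a-z]+")

# Remove stop words (a hypothetical, very small list) and empty strings
stop_words <- c("me", "now")
tokens <- lapply(tokens, function(tok) tok[nzchar(tok) & !(tok %in% stop_words)])

# Bag of words: count how often each vocabulary term appears in each document
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm) <- vocab   # rows are documents, columns are terms
print(dtm)
```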
We are now ready to build our first model in RStudio and to do that, we cover: – Correcting column names derived from tokenization to ensure smooth model training. – Using caret to set up stratified cross validation. – Using the doSNOW package to accelerate caret machine learning training by using multiple CPUs in parallel.
TF-IDF includes specific coverage of: • Discussion of how the document-term frequency matrix representation can be improved: – How to deal with documents of unequal lengths. – What to do about terms that are very common across documents. • Introduction of the mighty term frequency-inverse document frequency to implement these improvements: – TF for dealing with […]
N-grams includes specific coverage of: • Validate the effectiveness of TF-IDF in improving model accuracy. • Introduce the concept of N-grams as an extension to the bag-of-words model to allow for word ordering. • Discuss the trade-offs involved in using N-grams and how Text Analytics suffers from the “Curse of Dimensionality”. • Illustrate how quickly Text
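One common variant of TF-IDF can be sketched in a few lines of base R. Here TF is a term's count divided by the document's length, and IDF is the base-10 log of the number of documents divided by the number of documents containing the term; the series may use a different formulation, and the small matrix is made up.

```r
# Document-term count matrix: rows are documents, columns are terms
dtm <- matrix(
  c(0, 1, 1,
    1, 0, 0,
    1, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("call", "free", "prize"))
)

# TF: normalize each row by document length, which handles unequal lengths
tf <- dtm / rowSums(dtm)

# IDF: down-weight terms that appear in many documents
idf <- log10(nrow(dtm) / colSums(dtm > 0))

# TF-IDF: multiply every column of tf by that term's IDF
tfidf <- sweep(tf, 2, idf, `*`)
print(round(tfidf, 3))
```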
Part 7 of this video series includes specific coverage of: – The trade-offs of expanding the text analytics feature space with n-grams. – How bag-of-words representations map to the vector space model. – Usage of the dot product between document vectors as a proxy for correlation. – Latent semantic analysis as a means […]
SVD with R includes specific coverage of: – Use of the irlba package to perform truncated SVD. – How to project a TF-IDF document vector into the SVD semantic space. – Comparison of model performance between a single decision tree and the mighty random forest. – Exploration of random forest tuning using the […]
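The projection step can be sketched with base R's `svd` standing in for irlba (an assumption made to avoid a package dependency; irlba computes only the leading singular vectors, while `svd` computes all of them and we keep the first `k`). For a term-by-document matrix A = U Σ Vᵀ, one common formulation projects a document vector d as Σ⁻¹ Uᵀ d. The small matrix below is invented.

```r
# Term-document matrix: 4 terms (rows) by 3 documents (columns)
term_doc <- matrix(
  c(1, 0, 1,
    0, 1, 1,
    1, 1, 0,
    0, 0, 1),
  nrow = 4, byrow = TRUE
)

s <- svd(term_doc)
k <- 2                 # number of latent dimensions to keep
u_k <- s$u[, 1:k]      # leading left singular vectors (term space)
d_k <- s$d[1:k]        # leading singular values

# Project a document vector into the k-dimensional semantic space
project <- function(doc) as.vector((t(u_k) %*% doc) / d_k)

# Projecting a training document reproduces its row of V, which is a useful sanity check
first_doc_projected <- project(term_doc[, 1])
print(first_doc_projected)
print(s$v[1, 1:k])
```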
Model Metrics includes specific coverage of: – The importance of metrics beyond accuracy for building effective models. – Coverage of sensitivity and specificity and their importance for building effective binary classification models. – The importance of feature engineering for building the most effective models. – How to identify if an engineered feature is likely to
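As a sketch of why accuracy alone can mislead, the code below computes sensitivity and specificity by hand for a made-up set of spam predictions. Treating "spam" as the positive class is our assumption.

```r
actual    <- c("spam", "spam", "ham", "ham", "ham",  "spam", "ham", "ham")
predicted <- c("spam", "ham",  "ham", "ham", "spam", "spam", "ham", "ham")

# Cells of the confusion matrix, with "spam" as the positive class
tp <- sum(actual == "spam" & predicted == "spam")  # true positives
fn <- sum(actual == "spam" & predicted == "ham")   # false negatives
tn <- sum(actual == "ham"  & predicted == "ham")   # true negatives
fp <- sum(actual == "ham"  & predicted == "spam")  # false positives

sensitivity <- tp / (tp + fn)              # share of actual spam that was caught
specificity <- tn / (tn + fp)              # share of actual ham that was kept
accuracy    <- (tp + tn) / length(actual)  # share of all predictions that were right
print(c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy))
```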
Cosine Similarity includes specific coverage of: – How cosine similarity is used to measure similarity between documents in vector space. – The mathematics behind cosine similarity. – Using cosine similarity in text analytics feature engineering. – Evaluation of the effectiveness of the cosine similarity feature. The data and R code used in this series is
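The mathematics can be sketched directly: the cosine similarity of two vectors is their dot product divided by the product of their lengths. The function name and the three small document vectors below are our own.

```r
# Cosine similarity: dot product divided by the product of the vector norms
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

d1 <- c(1, 1, 0)
d2 <- c(1, 0, 0)
d3 <- c(0, 0, 2)

same_doc   <- cosine_similarity(d1, d1)  # identical direction gives 1
no_overlap <- cosine_similarity(d1, d3)  # no shared terms gives 0
partial    <- cosine_similarity(d1, d2)  # one shared term gives 1 / sqrt(2)
print(c(same_doc, no_overlap, partial))
```

Because the result depends only on direction, a long document and a short one with the same mix of terms score as highly similar.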
Your First Test includes specific coverage of: – Pre-processing new, unseen textual data to allow for predictions from our trained model. – The importance of caching the IDF values calculated from the training data set to TF-IDF new, unseen, pre-processed data. – Performing SVD projections of new, unseen, pre-processed textual data into the latent semantic
This video concludes our Introduction to Text Analytics with R and covers: – Optimizing our model for the best generalizability on new/unseen data. – Discussion of the sensitivity/specificity tradeoff of our optimized model. – Potential next steps regarding feature engineering and algorithm selection for additional gains in effectiveness. – For those that are interested, a
As an introduction to Kaggle and your first Kaggle submission, we will explain what Kaggle is, how to create a Kaggle account, and how to submit your model to the Kaggle competition. Here is a link to the data set being used: Titanic Data Set
In this tutorial we will show you how to complete the Titanic Kaggle competition in Azure ML. It is helpful to have prior knowledge of Azure ML Studio, as well as have an Azure account. Check out the experiment here: Kaggle Titanic Experiment
As part of the Titanic Kaggle Competition in R, you need to create a model out of the Titanic data set and submit it. We will show you how you can begin by using RStudio. You can watch Part Two of this series here. Check out the data set we use here: Titanic Data Set Download RStudio
In part two of using RStudio for Data Science Dojo’s Titanic Kaggle competition, we will show you more advanced cleaning functions for your model. If you have not seen part one, you can view it here. Check out the data set we use here: Titanic Data Set Download RStudio here: Download RStudio
AI For Social Good
February 4, 2019
It’s not hard to see machine learning and artificial intelligence in nearly every app we use – from any website we visit, to any mobile device we carry, to any goods or services we use. Where there are commercial applications, data scientists are all over it. What we don’t typically see, however, is how AI
NLP 101 + Chatbots
November 20, 2018
Learn the basics of natural language processing: the components of NLP, enterprise applications of NLP, and finally build a simple FAQ Chatbot. About the Speaker: Chris Shei is the technical evangelist for Jet.com where he explores trending tech and helps Jet’s engineering org build stronger relationships with the external tech […]
Introduction to Data Visualization with ggplot2
June 22, 2018
The R programming language is experiencing rapid increases in popularity and wide adoption across industries. This popularity is due, in part, to R’s rich and powerful data visualization capabilities. While tools like Excel, Power BI, and Tableau are often the go-to solutions for data visualizations, none of these tools can compete with R in terms
Data Manipulation with dplyr
March 19, 2018
dplyr is a great tool to perform data manipulation. It makes your data analysis process a lot more efficient. Even better, it’s fairly simple to learn and start applying immediately to your work! Oftentimes, with just a few elegant lines of code, your data becomes that much easier to dissect and analyze. For these
Building a Business Case for your Machine Learning Idea
January 15, 2018
This presentation will discuss building a business case for your machine learning idea. In this talk, our presenter, Neeti Gupta, will provide a 10-step checklist with examples for the audience to build their own business model. This 10-step business checklist is a synthesis of the speaker’s real world experience evaluating companies that have built a
Ethical Dimensions of Data Science
December 15, 2017
From distorting experiments with systemic bias to imposing human ethics on machine learning models, data scientists have far more to worry about than the raw numbers in their spreadsheet. Join Raja Iqbal on an exploration of data science’s past evils and how we can pave the way to a brighter future.
Feature Engineering for Bot Detection
October 27, 2017
According to some estimates, bots constitute close to 50% of the overall traffic. In this introductory talk to Feature Engineering for Bot Detection, we will cover various aspects of feature engineering for bot detection of automated web traffic. We will start with understanding the impact of bots on an online business and various types of
Online Experimentation and A/B Testing
October 16, 2017
In this meetup, we provide a quick introduction to online experimentation and A/B testing. To keep the tutorial self-contained, we will first give an overview of stats fundamentals needed to understand A/B testing. We then explain how A/B testing is done in an online business. We will conclude by mentioning some of the pitfalls that
Building Robust Machine Learning Models
October 13, 2017
Modern machine learning libraries make model building look deceptively easy. An unnecessary emphasis on tools like R, Python, SparkML, and techniques like deep learning is prevalent. Relying on tools and techniques while ignoring the fundamentals is the wrong approach to model building. Real-world machine learning requires hard work, discipline and […]
Introduction to Data Visualization with R and ggplot2
August 18, 2017
The R programming language is experiencing rapid increases in popularity and wide adoption across industries. This popularity is due, in part, to R’s rich and powerful data visualization capabilities. While tools like Excel, Power BI, and Tableau are often the go-to solutions for data visualizations, none of these tools can compete with R in terms
Storytelling with PowerBI
February 10, 2014
Storytelling is a cornerstone of the human experience. Though many elements of stories have remained the same throughout history, we have developed better tools and mediums for telling them, such as printed books, movies, and comics. This has changed storytelling styles—and perhaps most importantly, the impact of those stories. Today the best stories are often
Introduction to Machine Learning with R and caret
February 10, 2014
The R programming language is experiencing rapid increases in popularity and wide adoption across industries. This popularity is due, in part, to R’s huge collection of open source machine learning algorithms. If you are a data scientist working with R, the caret package is a must-have tool in your […]
Business Data Analysis with Excel
January 20, 2014
Lecture Starts at: 8:25 Business data presents a challenge for the data analyst. Business data is often aggregated, recorded over time, and tends to exhibit autocorrelation. Additionally, and most problematically, the amount of business data is usually quite limited. These characteristics lead to a situation where many of the tools in the analyst’s tool belt
Introduction to R Programming for Excel Users
January 8, 2014
R programming is rapidly becoming a valuable skill for data professionals of all stripes and a must-have skill for aspiring data scientists. Adding R programming to your data analyst skillset allows you to leverage powerful data visualizations, statistical analyses, and even machine learning in your daily work. In this presentation, Dave Langer illustrates how your
Introduction to Event Log Mining with R
January 8, 2014
Event logs are everywhere and represent a prime source of Big Data. Event log sources run the gamut from e-commerce web servers to devices participating in globally distributed Internet of Things architectures. Even Enterprise Resource Planning systems produce event logs! Given the rich and varied data contained in event logs, mining these assets […]
Intro to R Visualizations in Microsoft Power BI
January 8, 2014
Microsoft’s Power BI is a powerful technology for quickly creating rich visualizations. Power BI has many practical uses for the modern data professional including executive dashboards, operational dashboards, and visualizations for data exploration/analysis. Microsoft has also extended Power BI with support for incorporating R visualizations into Power BI projects, enabling a myriad of data visualization
Scale R to Big Data Using Hadoop and Spark
January 8, 2014
R is currently one of the most popular data science languages in the world. However, it’s always had constraints around scaling out to big data. What happens when you expand beyond a couple gigabytes of data? You packed up your data and you used something else; Python, Java, or Mahout to name a few. Now
Nearly 100 years after Einstein predicted the existence of gravitational waves, the Laser Interferometer Gravitational-Wave Observatory (LIGO) astounded the world by successfully detecting these waves. Detection was made possible by the advancement of laser technology and data processing techniques. Being able to distinguish the gravitational waves from the background noise was key to verifying […]
At this meetup, presenter Craig Guarraci speaks about how to Make Sense of Unstructured Text With Python, MS Cognitive Services & PowerBI: – In this presentation we’ll take a broad look at industry research to see how text analytics and sentiment analysis are used – We’ll look at difficulties associated with sentiment analysis – Review
Building Real-Time Sentiment Pipeline for Live Tweets
January 8, 2014
At this Data Science Dojo meetup, Phuc Duong talks about Building a Real-Time Sentiment Pipeline for Live Tweets Using Python, R, & Azure. Supplementary material can be found here: https://github.com/gokul180288/meetup
In this 90-minute video tutorial, we will cover an overview of solving a simple predictive analytics problem. We will use R for feature exploration, visualization, and predictive modeling, together with Azure ML. We will be using the Titanic data set for our exercise. You will see the end-to-end process of building a predictive model.