Data Mining Fundamentals Part 1 – Basic Vocabulary

This data mining fundamentals series is jam-packed with all the background information, technical terminology, and basic knowledge that you will need to hit the ground running. Starting off this video series, we cover what data is and the basic vocabulary associated with it.

The purpose of this particular webinar
is to give you some basic vocabulary and a very
basic understanding of a number of important topics
regarding data science fundamentals.
So a lot of this talk is a vocabulary lesson.
So it’s really important that you guys
make sure you understand all the terms that I’m introducing
and all the ways that they’re used.
We’re going to be covering a lot of material
over the next couple of hours.
So it is pretty aggressively paced,
but we should be able to get through all of it.
All right.
So you see on your screen here the topics
that we’re going to be covering.
So we’re going to start by talking about data and data
types and laying some groundwork for all
the things we’ll be talking about over the course
of the Boot Camp.
Then we’re going to talk about data quality and data
preprocessing, which are very connected things.
And, finally, we’re going to talk
about some similarity and dissimilarity metrics
and also some data exploration and visualization.
So we’ll cover data exploration and visualization very briefly
here.
We’re going to talk about it a lot more
next week in the introduction to our webinar.
So without further ado, then, let’s start with data and data
types.
So “what is data?” is a very fundamental question
that we can ask.
And here’s where our vocabulary lessons start.
So data is a collection of objects
that are defined by attributes.
So attributes are the properties or characteristics
of our objects.
So look at the entries in our table here.
Not all data can be represented nicely in a table,
but a lot of it can be.
So in this case, a data object is a row,
and a data attribute is a column.
So we think of the attributes as being
properties of the objects.
So the eye color of a person, the temperature,
whether someone filed for a tax refund
in the next year, what their taxable income was,
those are all attributes of our data objects.
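
To make the row and column picture concrete, here is a minimal sketch in Python. The table, its attribute names, and all of its values are made up for illustration and are not taken from the slide. Each dictionary stands for one data object (a row), and each key stands for one attribute (a column).

```python
# Hypothetical toy data set: every value here is invented for illustration.
# Each dict is one data object (a row); each key is one attribute (a column).
data = [
    {"id": 1, "eye_color": "brown", "refund": True, "taxable_income": 125_000},
    {"id": 2, "eye_color": "blue", "refund": False, "taxable_income": 100_000},
    {"id": 3, "eye_color": "green", "refund": False, "taxable_income": 70_000},
]


def column(rows, attribute):
    """Return the values of one attribute across all data objects."""
    return [row[attribute] for row in rows]


first_object = data[0]                  # one row: a single data object
eye_colors = column(data, "eye_color")  # one column: a single attribute

print(first_object)
print(eye_colors)
```

Reading across a dictionary gives you one object with all of its attributes. Reading one key across every dictionary gives you one attribute for all of the objects.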
So one of the struggles people sometimes
have in getting into data science
is that because data science is a synthesis of probably three
or four completely distinct fields all coming together
in one way, there are a lot of different terms
for the same things in a lot of cases.
So this is our first encounter with that.
And it’s going to show up again.
So attribute is sort of a decent name for these ideas.
But they’re also called variables and fields
and characteristics and features and predictors.
And if you’ve got tabular data, they’ll
be called columns sometimes.
So all of those different names all
refer to essentially the same thing.
They’re all attributes.
They’re a property or characteristic of our object.
Similarly, our objects are, basically,
collections of attributes.
It’s kind of a circular definition.
But it’s what we’ve got.
So each object is defined by its exact attribute values.
And objects– we’ll use the term data objects
throughout this talk, but in general, objects
have a lot of different names.
You’ll see them called records and points and cases, samples,
entities, entries, instances,
and many more.
You’ll also see a set of data called a data set.
But sometimes it will be called a table.
And sometimes you’ll just hear, oh, yeah,
we have our data, referring to the set as a whole.
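
The idea that an object is defined by its attribute values can also be sketched in code. This is only an illustration under assumptions: the `TaxRecord` class, its field names, and the values are hypothetical. A Python dataclass compares two objects by their attribute values, so two records with identical values count as equal, and a plain list of records plays the role of the data set.

```python
from dataclasses import dataclass, fields


# Hypothetical record type: the class name and field names are assumptions
# chosen to echo the attributes mentioned in the talk.
@dataclass(frozen=True)
class TaxRecord:
    eye_color: str
    refund: bool
    taxable_income: int


a = TaxRecord("brown", True, 125_000)
b = TaxRecord("brown", True, 125_000)
c = TaxRecord("blue", False, 100_000)

# The attributes (a.k.a. variables, fields, features, columns) of the record.
attribute_names = [f.name for f in fields(TaxRecord)]

# The data set (a.k.a. table, or just "our data") is a collection of objects.
data_set = [a, b, c]

print(attribute_names)
print(a == b)  # same attribute values, so the objects compare equal
print(a == c)  # different attribute values, so they do not
```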

Topics:
– Data and Data Types
– Data Quality
– Data Preprocessing
– Similarity and Dissimilarity
– Data Exploration and Visualization

Part 2:
Data Attributes

Complete Series:
Data Mining Fundamentals


About The Author
- Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.
