Data sets can be categorized into three types: record, ordered, and graph. In this tutorial, we will give you examples of when you would want to use each type of data set.
All right, we can move on to data set classification.
There are a lot of different types of data sets.
And they require different approaches to analysis.
The pre-processing steps, the modeling steps,
pretty much everything that you do
with these different types of data sets
is going to be different.
The kinds of models you use, the kinds of visualizations
you construct, the kind of cleaning that
is proper for that kind of data.
Understanding the structure of your data at the beginning
is very important for not wasting time and not
producing incorrect results.
And it’s in this step, understanding the structure
of your data that things like domain knowledge
tend to be very important.
But there are still, certainly, categories
that tend to be similar no matter what domain they’re in.
So we’ll talk about these three different types
of data sets, records, graphs, and ordered data sets,
in a little bit more detail coming up here.
So record data is data that consists
of a collection of records, each of which
consists of a fixed set of attributes.
So in this particular data set, which I use in several places,
every data object has one tax ID, a value for whether they
asked for a refund, marital status,
whether they’re single, married, or divorced,
a taxable income field, and whether they
cheated on their taxes or not.
So that’s, sort of, the structure of this data set.
So any data, which consists of this kind of collection
of records, which consists of a fixed set of attributes,
you almost always represent this kind
of data as a table, whether a database
table, or a spreadsheet, or something like that.
And it’s the most common kind of data.
So a lot of people will, if you talk about data or data sets,
this is what they visualize, entirely, is record data.
So it’s, sort of, your most common and, sort of,
fundamental kind of data set.
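To make the idea of record data concrete, here is a minimal sketch of a collection of records that all share one fixed set of attributes, printed as a table. The class name `TaxRecord`, the field names, and every row value are assumptions made up for illustration; they are not the actual data set from the talk.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TaxRecord:
    """One data object; every record has exactly these attributes."""
    tax_id: int
    refund: bool
    marital_status: str  # "Single", "Married", or "Divorced"
    taxable_income: float
    cheated: bool


# Made-up rows, for illustration only.
records = [
    TaxRecord(1, True, "Single", 125_000.0, False),
    TaxRecord(2, False, "Married", 100_000.0, False),
    TaxRecord(3, False, "Divorced", 95_000.0, True),
]

# The fixed attribute set becomes the table's column headers.
columns = [f.name for f in fields(TaxRecord)]
print(columns)

# Each record becomes one row of the table.
rows = [[getattr(record, name) for name in columns] for record in records]
for row in rows:
    print(row)
```

Because every record has the same attributes, every row has the same number of cells, which is what lets this kind of data sit naturally in a database table or a spreadsheet.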
So within record data, there are a few useful subsets.
So this record data, with the tax data,
has some categorical values and then one ordinal variable.
So tax ID is ordinal, right?
It’s really more of a nominal variable, when
you think about it, because ordering doesn’t necessarily mean anything.
Right, sure, it takes numbers but 10
is not meaningfully greater than five.
There’s no ordering implied here.
So tax ID is a nominal field.
Nominal categorical field.
Tax refund is a categorical field, marital status also,
taxable income is a continuous field.
So most data that you encounter has mixed data types like this.
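The classification just described can be written down as a small sketch that tags each attribute with its kind and then splits the columns into categorical and numeric groups. The dictionary name `ATTRIBUTE_KINDS`, the field names, and the helper `columns_of_kind` are hypothetical, chosen here only for illustration.

```python
# Attribute kinds as described in the talk. The field names are assumed.
ATTRIBUTE_KINDS = {
    "tax_id": "nominal",          # stored as a number, but only an identifier
    "refund": "categorical",
    "marital_status": "categorical",
    "taxable_income": "continuous",
}


def columns_of_kind(kinds, wanted):
    """Return the attribute names whose kind is one of `wanted`."""
    return [name for name, kind in kinds.items() if kind in wanted]


categorical_columns = columns_of_kind(ATTRIBUTE_KINDS, {"nominal", "categorical"})
numeric_columns = columns_of_kind(ATTRIBUTE_KINDS, {"continuous"})

print("categorical:", categorical_columns)
print("numeric:", numeric_columns)

# A data set is "mixed" when it has at least one column of each group.
is_mixed = bool(categorical_columns) and bool(numeric_columns)
print("mixed data types:", is_mixed)
```

Note that `tax_id` lands in the categorical group even though its values are numbers: the storage type of a column does not decide what kind of attribute it is.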
You have some categorical, some numeric,
and that’s, sort of, your traditional type of record data.
If, on the other hand, your record data consists entirely
of numeric attributes, so this is entirely continuous,
entirely interval or ratio variables,
then we can think of it as a mathematical matrix rather than a table.
So we would have an m by n matrix.
There are m rows, one for each data object
and n columns, one for each attribute.
And this is nice because we can think of these data objects
as points in a multi-dimensional space,
where each attribute is represented by a dimension.
And that allows us to use a number of numeric techniques,
specifically those involving distance, which
not only make some algorithms easier,
but which some algorithms require.
There are a number of algorithms that
require you to have data matrix data, all numeric data.
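As a rough sketch of why an all-numeric data matrix is convenient, the snippet below treats each row as a point and computes the Euclidean distance between every pair of rows. The matrix values are made up for illustration, and Euclidean distance is just one possible choice of distance; the talk does not name a specific one.

```python
import math

# Made-up data matrix: m = 3 data objects (rows), n = 2 numeric attributes (columns).
data_matrix = [
    [0.0, 0.0],
    [3.0, 4.0],
    [6.0, 8.0],
]

m = len(data_matrix)     # number of data objects
n = len(data_matrix[0])  # number of attributes, i.e. dimensions


def pairwise_distances(matrix):
    """Return an m x m table of Euclidean distances between rows."""
    return [[math.dist(a, b) for b in matrix] for a in matrix]


distances = pairwise_distances(data_matrix)
for row in distances:
    print(row)
```

This only works because every attribute is numeric. With a categorical column such as marital status in the mix, the distance between two rows would not be defined without first deciding how to encode or compare the categories.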