We continue our discussion of data attributes and identifying the subsets of attribute classification. These subsets include: categorical, nominal, ordinal, interval and ratio.
All right, so within these two sort
of big categories of attributes, we
have some subsets that are also important to think about.
And one of the most important of these
is the distinction between categorical attributes
and non-categorical attributes.
So categorical attributes are discrete attributes
that specifically have a finite set of values
that they are allowed to take.
So for instance, there are several examples here.
And within categorical, there are two useful subsets.
So categorical attributes are any attributes
that have only a finite set of values.
If that finite set of values has a natural ordering,
so this is something like rankings or grades or clothing
sizes, we call that an ordinal attribute.
So ordinal means that it has an order, pretty straightforward.
And ordinal attributes are nice, because we
can code them as integers and maintain
the ordering between them.
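
To make the integer coding concrete, here is a minimal sketch using clothing sizes. The size labels, the `SIZE_ORDER` list, and the `encode_sizes` helper are all made up for illustration; the only point is that the integer codes keep the same order as the labels.

```python
# Hypothetical clothing sizes, listed from smallest to largest.
SIZE_ORDER = ["XS", "S", "M", "L", "XL"]

# Give each size an integer code that preserves the ordering.
SIZE_CODE = {size: rank for rank, size in enumerate(SIZE_ORDER)}


def encode_sizes(sizes):
    """Map ordinal labels to integer codes (raises KeyError on unknown labels)."""
    return [SIZE_CODE[s] for s in sizes]


observed = ["M", "XS", "XL", "S"]
codes = encode_sizes(observed)
print(codes)  # [2, 0, 4, 1]

# Comparisons on the codes agree with the natural ordering of the labels.
print(SIZE_CODE["S"] < SIZE_CODE["L"])  # True
```

Note that the codes only carry the order. Nothing here says the gap between S and M is the same as the gap between L and XL.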
So we don’t have to treat them particularly
specially, but most categorical variables
are what we call nominal categorical variables.
So nominal attributes have no inherent ordering to them.
So eye color, zip codes, ID numbers, hair color,
whether someone is married or not, or divorced,
or living with a partner.
There’s no way you can say oh yes, blue
should have a value of 5, and green should have a value of 2
because I don’t like green eyes.
There’s no ordering that you can put into those variables.
So nominal attributes in particular
we have to be careful about handling.
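
One common way to handle a nominal attribute without inventing an order is one-hot (indicator) encoding. The lecture does not name a specific technique here, so take this as an assumed illustration rather than the method being taught; the `one_hot` helper and the eye color values are hypothetical.

```python
def one_hot(values):
    """Return (categories, rows): one 0/1 indicator column per distinct category."""
    # Sorting is only there to make the column order reproducible;
    # it does not imply any ranking of the categories.
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


eye_colors = ["blue", "green", "brown", "blue"]
categories, rows = one_hot(eye_colors)

print(categories)  # ['blue', 'brown', 'green']
for row in rows:
    print(row)
# [1, 0, 0]
# [0, 0, 1]
# [0, 1, 0]
# [1, 0, 0]
```

Each category gets its own column, so no color ends up "bigger" than another the way it would if blue were coded as 5 and green as 2.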
Other useful variable types, ones that
allow us to treat them specially in ways that are useful,
are the interval and ratio variables on the continuous side.
You can certainly have intervals or ratios that are discrete,
but for the most part, you see them as real, or as continuous.
An interval variable is one
where the difference between two values
is constant and meaningful.
So for instance, with temperature, say,
temperature in Celsius, a temperature of 100 degrees
and a temperature of 90 degrees have the same difference
between them as a temperature of 80 degrees
and a temperature of 90 degrees.
So interval variables are basically continuous variables
that have a nice metric we can assign them
that gives us some nice handling.
Something like the decibel scale, on the other hand,
is much harder to handle as an interval,
because if you’re
thinking about the actual intensity of the sound,
the decibel scale is a logarithmic scale.
So the difference in intensity between three decibels and four decibels
is smaller than the difference between 13 and 14 decibels.
So that’s an example of a continuous variable that
isn’t an interval variable.
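
The decibel point can be checked with a little arithmetic. This sketch assumes the standard definition of decibels for sound intensity, where the intensity relative to a reference level is 10 ** (dB / 10); the function name `db_to_intensity_ratio` is made up for illustration.

```python
def db_to_intensity_ratio(db):
    """Intensity relative to the reference level, using ratio = 10 ** (dB / 10)."""
    return 10 ** (db / 10)


# Celsius behaves like an interval scale: equal steps are equal differences.
print(100 - 90 == 90 - 80)  # True

# On the decibel scale, a one-decibel step is not a fixed step in intensity.
low_gap = db_to_intensity_ratio(4) - db_to_intensity_ratio(3)
high_gap = db_to_intensity_ratio(14) - db_to_intensity_ratio(13)

print(low_gap < high_gap)            # True
print(round(high_gap / low_gap, 6))  # 10.0
```

So the step from 13 to 14 decibels covers ten times as much intensity as the step from 3 to 4, even though both are "one decibel" apart.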