Data Attributes – Data Mining Fundamentals Part 2

Data attributes are the values, or “metadata,” that describe a piece of data and characteristically set it apart from other data, such as ID, age, or location. By the end of this tutorial, you will understand the different kinds of attribute classifications and when you should use each.

So we have objects, and we have attributes.
So each attribute has a set of values which
the objects can draw from.
So each object is defined
by a set of attribute values.
And each attribute we can think of as being
defined by the set of values that it can hold.
So we can have the same attribute mapped
to different sets of values.
Height can be measured in meters or feet,
temperature can be measured in Celsius, Kelvin, or Fahrenheit,
lots of other sorts of things like that.
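As a minimal sketch of the same attribute mapped to different value sets, the Python below converts one temperature reading and one height reading between units. The function names and sample readings are hypothetical, chosen only for illustration; the conversion constants are the standard ones.

```python
# Hypothetical helper names; only the conversion constants are standard.
def celsius_to_fahrenheit(c: float) -> float:
    """Map a Celsius temperature onto the Fahrenheit scale."""
    return c * 9.0 / 5.0 + 32.0


def celsius_to_kelvin(c: float) -> float:
    """Map a Celsius temperature onto the Kelvin scale."""
    return c + 273.15


def meters_to_feet(m: float) -> float:
    """Map a height in meters onto feet (1 ft = 0.3048 m)."""
    return m / 0.3048


# Made-up readings: the attribute stays the same, only its value set changes.
temperature_c = 25.0
height_m = 1.8

print(celsius_to_fahrenheit(temperature_c))  # 77.0
print(celsius_to_kelvin(temperature_c))
print(meters_to_feet(height_m))
```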
And different attributes will often
be mapped to the same set of values.
ID numbers and age are both usually given
as integer values.
Temperature and height are both often
given as floating point values, as decimal values.
So the properties of our attributes
can also be different.
Height, for instance, has a pretty practical maximum
and minimum value, as does something like age,
whereas ID number has no real limit.
It’s whatever the people who created the dataset
define it to be.
And that gets into an interesting question:
who defines which value set a given attribute uses?
And the answer to that is essentially, we do, right?
The people who create the dataset do, the people who
hand us the data, the data engineers or the Twitter API
or other APIs that we’re calling in order to get the data
will have some definition of it.
But we can set that ourselves too.
We can change our attributes to be mapped
to different sets of values.
And we’ll use that in a variety of places.
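One way to picture re-mapping an attribute to a different value set is to turn an integer age into a small set of labels. This is only a sketch: the function name, the group labels, and the cut-offs at 18 and 65 are assumptions made for illustration, not something the tutorial prescribes.

```python
def age_to_group(age: int) -> str:
    """Map an integer age onto a small set of labels.

    The cut-offs (18 and 65) are hypothetical, chosen only for illustration.
    """
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"


# Made-up ages, re-mapped from integers to three possible labels.
ages = [4, 17, 18, 42, 65, 90]
age_groups = [age_to_group(a) for a in ages]
print(age_groups)
```

The attribute is still "age" in both cases; what changed is the set of values it is allowed to take.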
All right, so we know that
attributes have these attribute values.
So it’s useful to talk about attributes
as being part of different classes, different types
of attributes that we’re going to end up
having to handle differently as we get into the actual data
mining and modeling processes.
So there are two fundamental types
of attributes, discrete attributes
and continuous attributes.
So discrete attributes have either a finite or countably
infinite set of values.
For those of you who don’t know, the term countably infinite
basically means the values can be numbered off with the integers.
If you can number your attribute’s values with integers,
then the set is countably infinite, or finite
if you only need a limited set of integers.
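A small sketch of "turning an attribute into integers": each distinct value gets its own integer code. The city names are made-up sample data, and sorting before numbering is just one arbitrary but repeatable way to assign the codes.

```python
# Made-up sample values for a discrete "location" attribute.
locations = ["Seattle", "Austin", "Seattle", "Chicago"]

# Give every distinct value an integer code. Sorting first makes the
# assignment repeatable; the particular order carries no meaning.
code_of = {value: code for code, value in enumerate(sorted(set(locations)))}
encoded = [code_of[value] for value in locations]

print(code_of)
print(encoded)
```

Because there are only three distinct values here, this attribute is finite: three integers are enough to cover it.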
So good examples of these are zip codes,
click counts, and
word counts in a collection of documents.
We could, in theory, have as many clicks as we want.
That’s a countably infinite set,
but the values are always going to be integers.
So we have a countably infinite set of values there.
Usually we represent these as integer variables.
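For instance, word counts over a collection of documents come out as integers. The sketch below uses `collections.Counter` from the Python standard library; the three short documents are made up for illustration.

```python
from collections import Counter

# Made-up documents, used only to illustrate a word-count attribute.
documents = ["the cat sat", "the dog sat down", "the end"]

word_counts = Counter()
for doc in documents:
    # Split on whitespace and add this document's words to the tally.
    word_counts.update(doc.split())

print(word_counts["the"])  # 3
print(word_counts["sat"])  # 2
```

However many documents are added, every count stays a whole number, which is what makes this a discrete attribute.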
And binary attributes are a pretty special case
of discrete attributes that we end up
having to handle differently in some cases.
Binary attributes have only two values.
And we might call those yes or no, dead or alive, 1 or 0.
And those kinds of columns are sort of a special case.
In some contexts, we really like them, they make things easier.
In other contexts, they can be problematic, which
is true of pretty much everything.
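A binary attribute such as yes/no is commonly stored as 1/0. The sketch below shows one way to do that; the survey responses and the choice of which label maps to 1 are assumptions made for illustration.

```python
# Made-up yes/no responses for a binary attribute.
responses = ["yes", "no", "yes", "yes"]

# Which label becomes 1 is a convention we choose, not a property of the data.
mapping = {"yes": 1, "no": 0}
encoded = [mapping[r] for r in responses]

print(encoded)                      # [1, 0, 1, 1]
print(sum(encoded))                 # 3 -> number of "yes" responses
print(sum(encoded) / len(encoded))  # 0.75 -> share of "yes" responses
```

This is one of the ways binary columns make things easier: with a 1/0 encoding, a sum is a count and a mean is a proportion.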
The other big class of attributes that we
see is continuous attributes.
So in this case, we have real numbers
as our attribute values.
There’s no limitation to just integers.
So temperature, height, weight, oxygen level, taxable income,
all these things have real numbers
as their attribute values.
They can theoretically take any value at all.
Now in practice of course, we have
to put these things into a computer,
and computers can only store and represent
a finite number of digits.
So generally speaking, these attributes
are usually represented as floating point variables.
So floating-point variables, for those of you
who are further removed from your programming classes,
are essentially just variables that
hold an approximation of a real number, one that can have a decimal part,
the floating point being the decimal point in the number.
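A short sketch of what the finite number of digits means in practice, using Python's built-in `float` and the standard-library function `math.isclose`. The sample values are arbitrary.

```python
import math

# 0.1, 0.2 and 0.3 cannot be stored exactly in binary floating point,
# so the sum carries a tiny rounding error.
total = 0.1 + 0.2

print(total == 0.3)              # False
print(math.isclose(total, 0.3))  # True
```

For continuous attributes it is therefore usually safer to compare values within a tolerance than to test them for exact equality.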


About The Author
- Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.

