N-grams in Minutes

In this quick tutorial, we learn that machines can not only make sense of words but also make sense of words in their context. N-grams are one way to help machines understand a word in its context by looking at words in pairs. We go over what n-grams are and give some examples of how you can use them in natural language processing. By looking at pairs of words, we capture the broader context of each word and can then train machines to learn these language cues and gain a better understanding of the real meaning of the text.

Welcome to this short introduction to n-grams. In our video on natural language processing, we explain how machines can not only make sense of words but also make sense of words in their context. N-grams are one way to help machines understand a word in its context and so get a better understanding of the word's meaning. For example, compare "we need to book our tickets soon" with "we need to read this book soon". In the former, 'book' is used as a verb and is therefore about the action of planning a trip somewhere. In the latter, 'book' is used as a noun and therefore refers to a literal book or object.
How do I know this?
How can we tell the difference between the verb 'book' and the noun 'book'? We take into account the context of the sentence, and we do this innately, as we humans have been attuned to language cues since we were born. Machines, on the other hand, have to learn these cues by looking at the surrounding context of the target word. Think of it as a context window made up of the word before and the word after. This is what n-grams look at: what came before the target word 'book' and what came after, to determine whether the word is used as a noun, a verb, or in another context.
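As a sketch, that before/after context window could be extracted in a few lines of Python (a toy illustration; the function name is our own):

```python
def context_window(text, target):
    """Return (before-word, after-word) pairs for each occurrence of target."""
    words = text.split()
    return [
        (words[i - 1] if i > 0 else None,
         words[i + 1] if i + 1 < len(words) else None)
        for i, word in enumerate(words)
        if word == target
    ]

print(context_window("we need to book our tickets soon", "book"))  # [('to', 'our')]
print(context_window("we need to read this book soon", "book"))    # [('this', 'soon')]
```

The two sentences from the example produce different windows around 'book', which is exactly the signal a machine can learn from.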
'This book', 'a book', 'your book', 'my book', 'his book', and 'her book' are all examples of bigrams where the before-word indicates that 'book' is used as a noun. The 'n' in n-grams is just the number of words you want to look at.
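To illustrate, a toy rule along these lines might treat 'book' as a noun whenever the before-word in the bigram is a determiner or possessive. The word list and function name here are our own, and this is nowhere near a full part-of-speech tagger:

```python
# Determiners and possessives whose presence before 'book' suggests a noun.
NOUN_SIGNALS = {"this", "a", "your", "my", "his", "her", "the"}

def book_is_noun(bigram):
    """Toy heuristic: a determiner/possessive before 'book' suggests a noun."""
    before, target = bigram.split()
    return target == "book" and before in NOUN_SIGNALS

print(book_is_noun("this book"))  # True
print(book_is_noun("to book"))    # False
```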
Bigrams are pairs of two words that occur together, produced by sliding a window over the text and looking at each before-word and after-word. For example, "read this book soon" is split up into 'read this', 'this book', and 'book soon'. We could train a machine to learn that when the pairs 'read this' and 'this book' occur in a text, the text is mostly discussing a literal book. You can also extend the context window to make your n-grams trigrams, looking at three words at a time: 'read this book', 'this book soon'. But bear in mind that the longer your context window, the harder it is to find word sequences that appear frequently throughout the text, because longer sequences are increasingly unique.
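That sliding window can be written in a few lines of Python (a minimal sketch; the function name is our own):

```python
def ngrams(text, n):
    """Slide a window of size n over the words of text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("read this book soon", 2))  # ['read this', 'this book', 'book soon']
print(ngrams("read this book soon", 3))  # ['read this book', 'this book soon']
```

Notice how the trigram list is shorter than the bigram list: as n grows, each sequence becomes rarer, which is the sparsity problem mentioned above.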
I recommend taking the Goldilocks approach to n-grams: not too long, not too short, just right. And by just right I mean looking at pairs of two words, as the before-word and after-word are probably all the context you need to capture the meaning of the text. N-grams are also useful when trying to capture words used in a negative context, and vice versa.
For example, in "the staff were not friendly, terrible really", the bigrams 'not friendly' and 'friendly terrible' give us enough context to know that the word 'friendly' is used in a negative context. In isolation, the word 'friendly' is positive; when we look at the before- and after-words, 'not' and 'terrible' cancel out the positive meaning, reversing it to a negative one. Another example is capturing sarcasm, such as "that's funny… not". When 'funny not' occurs, it also cancels out 'funny' and reverses it to the exact opposite meaning. By looking at n-grams, or pairs of words, we capture the broader context of words and can then train machines to learn these language cues and gain a better understanding of the real meaning of the text. N-grams are a fairly simple yet effective approach to capturing the context and meaning of words in natural language processing. And that sums up
n-grams for you. Thanks for watching! If you found this video useful, give us a like, or check out our other videos at Data Science Dojo tutorials.
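The negation and sarcasm examples above can be sketched as a toy bigram check. The sentiment word lists and function name are our own assumptions, not a real sentiment-analysis library:

```python
POSITIVE = {"friendly", "funny"}
NEGATORS = {"not", "terrible"}

def negated_positives(text):
    """Return positive words whose before- or after-word reverses their meaning."""
    cleaned = text.lower().replace(",", " ").replace("...", " ").replace("…", " ")
    words = cleaned.split()
    hits = []
    for i, word in enumerate(words):
        if word in POSITIVE:
            before = words[i - 1] if i > 0 else ""
            after = words[i + 1] if i + 1 < len(words) else ""
            # A negator on either side of the window flips the polarity.
            if before in NEGATORS or after in NEGATORS:
                hits.append(word)
    return hits

print(negated_positives("the staff were not friendly, terrible really"))  # ['friendly']
print(negated_positives("that's funny… not"))                             # ['funny']
```

A real system would learn these word lists from data rather than hard-coding them, but the before-word/after-word window is the same idea.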

Next video:
Introduction to Natural Language Processing

Previous video:
One vs. One versus One vs. All

Recommended Data Science Material:
[Video] NLP 101 + Chatbots
[Video] Introduction to Text Analytics in R
[Blog] Natural Language Processing with R Programming Books

 


Rebecca Merrett
About The Author
- Rebecca holds a bachelor's degree in information and media from the University of Technology Sydney and a postgraduate diploma in mathematics and statistics from the University of Southern Queensland. She has a background in technical writing for game development and has written for tech publications.

