NATURAL LANGUAGE PROCESSING (NLP) Basics with spaCy (Part 1)

Abdulquadri Ayodeji Oshoare
6 min read · Mar 2, 2020

In this article you’ll learn the basics of Natural Language Processing (NLP), such as tokenization, lemmatization, stop words, part-of-speech (POS) tagging, and more, using the spaCy library.

Text data, as we all know, can be a rich source of information if it is efficiently mined and processed.

According to one statistic 👇

In 2019, 26 billion texts were sent each day by 27 million people in the US.

You wouldn’t want to start a journey without knowing your destination, would you? 😎 Let’s talk a bit about what NLP is and what it is used for before we move into the spaCy library.

What is NLP?

NLP is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

NLP can simply be said to be the ability of a computer to read and understand human languages.

Uses of NLP

  • Classifying emails as legitimate or spam
  • Sentiment analysis, e.g. of movie reviews
  • Understanding text commands, e.g. ‘Hey Google, play this song’

Now that we know what NLP is and what it is used for, let’s dive into the spaCy library.

Why spaCy?

spaCy is an open-source NLP library for Python developed by Explosion AI in 2015. It was designed to handle NLP tasks efficiently. spaCy is fast and much easier to use than many alternatives; however, it doesn’t include pre-built models for sentiment analysis.

The figure below shows a speed comparison of spaCy with other NLP libraries.

Figure 1. Speed comparison of spaCy with other NLP libraries

Getting Started

pip install -U spacy

To install the English language model:

python -m spacy download en_core_web_sm

It may take a while, so you can grab a cup of coffee ☕️ or go do some other work while the English model is installing. When it has been successfully installed, it will show “linking successful”.
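If you want to confirm that the model was installed correctly, spaCy also ships a validate command you can run from the same environment:

python -m spacy validate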

Let’s go ahead and import spaCy.

# import spaCy and load the English language model
import spacy
nlp = spacy.load("en_core_web_sm")

Note: the English language model must be loaded alongside spaCy to allow spaCy to perform language processing in English.

Tokenization

Tokenization is the process of breaking down a text into components or pieces. These pieces are called tokens. When performing tokenization we should note the following: prefixes, suffixes, infixes, and exceptions. We don’t have to worry too much about that, because spaCy has a built-in tokenizer that does the job for us.

Let’s get our hands dirty with some code.

Code:

"""A doc is a sequence of tokens, a doc can be an article, a newspaper or a tweet
'u' is a unicode string, it is the industry standard for handling text."""
doc = nlp(u"NLP isn't the same as Neuro-linguistic programming.")for token in doc:
print(f"{token.text}")

Output:

Figure 2. Tokenized output

As we can see, spaCy is smart enough to distinguish between prefixes, infixes, and suffixes.
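For reference, with spaCy’s default English tokenizer the loop above should print something like this (note how “isn’t” is split into “is” and “n’t”, and the hyphen is separated out):

NLP
is
n't
the
same
as
Neuro
-
linguistic
programming
.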

Stemming

Stemming is the process of reducing words to their root form (stem), mapping a group of related words to the same stem even if the stem itself is not a valid word in the language.

The most common algorithm for stemming English is Porter’s algorithm, which was developed in 1980.

Figure 3. Rules used by Porter’s stemmer
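To make the idea concrete, here is a minimal sketch using the PorterStemmer from NLTK (note: this uses the separate nltk package, which is not part of spaCy):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "runs", "easily", "fairly"]:
    # the resulting stem is not always a valid English word
    print(f"{word:10} --> {stemmer.stem(word)}")

Porter’s rules map “easily” to “easili” and “fairly” to “fairli”, which are not valid English words; this is exactly the shortcoming lemmatization addresses.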

Although spaCy doesn’t include a stemmer, it uses a better approach, called lemmatization, to reduce words effectively to their root forms.

Lemmatization

Lemmatization is the method of reducing a word to its root or base form (lemma). Unlike stemming, which follows an algorithm to chop parts off words, lemmatization takes it a step further by looking at the surrounding text to determine a given word’s part of speech.

Figure 4. Lemmatization

The lemma of “running” is “run”, while the lemma of “eating” is “eat”. Again, spaCy does this under the hood, so you don’t need to worry.

Note the difference between “lemma” and “lemma_” in the figure above: “lemma_” returns the lemma as text, while “lemma” returns the hash value of the lemma.
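In case the figure doesn’t render, here is a minimal sketch of the same idea (the example sentence is just an illustration):

doc3 = nlp(u"I am running because I love eating")
for token in doc3:
    # lemma is an integer hash; lemma_ is the human-readable string
    print(f"{token.text:10} {token.lemma:22} {token.lemma_}")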

If you’re confused 😵 and want to know more about the differences between lemmatization and stemming, click here.

Part-of-Speech tagging

Part-of-speech tagging, or POS tagging, is the marking of a word in a text or document as a particular part of speech (noun, verb, adjective) based on its definition and context. For additional knowledge on POS tagging, click here.

spaCy can parse and tag a given text or document. spaCy uses a statistical model, which enables it to predict which tag or label most likely applies in a given context.

Code:

import spacy
nlp = spacy.load("en_core_web_sm")

doc2 = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc2:
    # note the use of the underscore: pos_ and tag_ return text, pos and tag return hash values
    print(f"{token.text:10} {token.pos_:10} {token.tag_:10} {spacy.explain(token.pos_):15} {spacy.explain(token.tag_)}")

Output:

Figure 5. Part-of-speech tagging
  • Text: The original word text.
  • POS: The simple, coarse-grained part-of-speech tag.
  • Tag: The detailed, fine-grained part-of-speech tag.

Stop words

Stop words are commonly used words such as ‘I’, ‘you’, and ‘anyone’ that appear so often in a document that, unlike nouns, verbs, or modifiers, they add little meaning of their own.

spaCy has a built-in list of over 300 English stop words that can be used to filter out unnecessary words in a document. spaCy also allows a user to add custom stop words that are not in the built-in list.

Code:

# To get the list of spaCy's built-in stop words
print(nlp.Defaults.stop_words)

Output:

Figure 6. Stop words

Note: caution should be taken when removing stop words from a document; a sentence made up largely of stop words would lose its meaning if they were removed.
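Since custom stop words were mentioned above, here is a minimal sketch of how to add one (the word “btw” is just an illustrative choice):

# add a custom stop word to spaCy's default set
nlp.Defaults.stop_words.add("btw")
nlp.vocab["btw"].is_stop = True

# any word can be checked with the is_stop flag
print(nlp.vocab["btw"].is_stop)    # True
print(nlp.vocab["apple"].is_stop)  # False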

Visualizing Parts-of-speech and Named Entities

spaCy comes with a built-in dependency visualizer called displacy, which can be used to visualize the syntactic dependency (relationships) between tokens and the entities contained in a text.

displacy can render either inside a Jupyter notebook or in the browser.

Code:

from spacy import displacy

doc4 = nlp("The Bus arrives by noon")
displacy.render(doc4, style='dep')

Output:

Figure 7. Visualizing syntactic dependency using displacy
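If you’re working from a plain Python script rather than a notebook, displacy.serve starts a local web server and shows the same visualization in your browser (by default at http://localhost:5000):

# outside Jupyter, serve the visualization in the browser instead of rendering inline
displacy.serve(doc4, style='dep')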

We can play around by changing the style argument from ‘dep’ to ‘ent’ to visualize named entities using displacy.

Code:

doc5 = nlp("Almost four months after promising $1 million of his own money to veterans' causes, Donald Trump moved to fulfil that pledge Monday evening - promising the entire sum to a single charity as he came under intense media scrutiny.")
displacy.render(doc5, style='ent')

Output:

Figure 8. Visualizing named entities using displacy
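Named entities are covered in depth in Part 2, but as a quick preview, the entities displacy highlights are also available programmatically via doc.ents:

for ent in doc5.ents:
    # each entity carries its text, a label, and spaCy can explain that label
    print(f"{ent.text:20} {ent.label_:10} {spacy.explain(ent.label_)}")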

Conclusion

We have come to the end of the first part of NLP basics with the spaCy library. In the next part, we will go over Named Entity Recognition (NER) and sentence segmentation.

I hope you found this tutorial interesting. Please share and remember to comment with your suggestions or feedback.

Don’t forget to follow me for posts on Data Science and AI.

Cheers!!!

