NLP: Basic Pattern Matching using Python's spaCy library

spaCy is a free, open-source library for Natural Language Processing (NLP) in Python.

Some spaCy features include:

Non-destructive tokenization
Pretrained word vectors
Named entity recognition
POS tagging

You can read more about spaCy and its features here.

This post covers basic pattern matching using spaCy.

Pattern matching is the process of checking whether a given sequence of tokens or phrases exists within a document. (A token is a unit of text, such as a word or a punctuation mark.)
More about Pattern Matching here.
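
To make the idea concrete, here is a minimal pure-Python sketch of token-level, case-insensitive phrase matching. It only illustrates the concept: the whitespace tokenizer and the helper name find_phrase are assumptions made for this example, and spaCy's own tokenizer and PhraseMatcher work differently and more efficiently.

```python
def find_phrase(tokens, phrase):
    """Return (start, end) token index pairs where phrase occurs in tokens.

    Hypothetical helper for illustration; comparison is case-insensitive.
    """
    lowered = [t.lower() for t in tokens]
    target = [p.lower() for p in phrase]
    n = len(target)
    if n == 0:
        return []
    return [
        (i, i + n)
        for i in range(len(lowered) - n + 1)
        if lowered[i:i + n] == target
    ]


# Naive whitespace tokenization (spaCy's tokenizer also splits off punctuation).
tokens = "Lorem ipsum dolor sit amet".split()
hits = find_phrase(tokens, ["lorem", "ipsum"])
print(hits)  # [(0, 2)]
```

The end index is exclusive, which is the same convention spaCy uses for its match results later in this post.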

STEP 1 - Importing required libraries and functions

import spacy
from spacy.matcher import PhraseMatcher

NOTE:
Make sure you have installed all the required libraries before importing them. Installation instructions for spaCy can be found here.
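
If you are not sure whether everything is installed, a quick check like the one below can help. This is a sketch using only the standard library; the helper name is_installed is made up for this example. spaCy itself is usually installed with pip install spacy, and the English model with python -m spacy download en_core_web_sm.

```python
from importlib.util import find_spec


def is_installed(package_name):
    """Return True if the named top-level package can be imported."""
    return find_spec(package_name) is not None


# Report whether spaCy and the small English model are available.
for name in ("spacy", "en_core_web_sm"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```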

STEP 2 - Loading the model

model=spacy.load("en_core_web_sm")

spaCy provides support for multiple languages. Here we are loading the English model.
More details about available models here.
More details about the English model here.
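
spacy.load raises an OSError when the requested model is not installed. The sketch below shows one possible way to handle that. Falling back to a blank English pipeline is an assumption made for this example; it works for this post because case-insensitive phrase matching only needs the tokenizer, but a blank pipeline has no tagger, parser, or named entity recognizer.

```python
import spacy

try:
    # Full pretrained English pipeline (tagger, parser, NER, ...).
    model = spacy.load("en_core_web_sm")
except OSError:
    # Assumption: a tokenizer-only pipeline is an acceptable fallback here,
    # because PhraseMatcher with attr='LOWER' only needs tokens.
    model = spacy.blank("en")

print(model.lang, model.pipe_names)
```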

STEP 3 - Creating PhraseMatcher

pmatcher=PhraseMatcher(model.vocab,attr='LOWER')

PhraseMatcher lets you efficiently match a large list of terms.
The pmatcher uses the vocabulary of the English model (which we loaded in STEP 2).
attr='LOWER' makes the matching case-insensitive by comparing the lowercase form of each token. More details about PhraseMatcher and its parameters here.
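
To see what attr changes, the sketch below compares a matcher with the default setting (exact token text) against one with attr='LOWER'. It uses spacy.blank("en"), a tokenizer-only pipeline, so that it runs without downloading a model; the match IDs Exact and Lower are names chosen for this example.

```python
import spacy
from spacy.matcher import PhraseMatcher

model = spacy.blank("en")  # tokenizer-only pipeline, enough for this demo

exact = PhraseMatcher(model.vocab)                # default: exact token text
lower = PhraseMatcher(model.vocab, attr="LOWER")  # lowercase token text

pattern = model.make_doc("Consectetur")
exact.add("Exact", [pattern])
lower.add("Lower", [pattern])

doc = model("Lorem ipsum dolor sit amet, consectetur adipiscing elit.")
exact_hits = [doc[s:e].text for _, s, e in exact(doc)]
lower_hits = [doc[s:e].text for _, s, e in lower(doc)]
print(exact_hits)  # the capitalized pattern does not match exactly
print(lower_hits)  # the lowercase comparison finds "consectetur"
```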

STEP 4 - Creating a list of terms to match in data

items=['lorem ipsum','Integer','venenatis','Consectetur','Apple']
terms=[]
for i in items:
  terms.append(model(i))
pmatcher.add('TermsList',terms)

items contains the list of words and phrases to match in the given text.
The PhraseMatcher needs each term as a document (Doc) object. We create these by calling the model on every entry in items.
The resulting document objects are stored in a new list called terms.
Next, we add this list to our PhraseMatcher (TermsList here is our match_id).
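
Calling the full model on each term runs every pipeline component, which is more work than the matcher needs. A lighter alternative, sketched below, is model.make_doc, which only tokenizes. The example uses a blank English pipeline so it runs without a downloaded model; that choice is an assumption for this sketch, not part of the steps above.

```python
import spacy
from spacy.matcher import PhraseMatcher

model = spacy.blank("en")
pmatcher = PhraseMatcher(model.vocab, attr="LOWER")

items = ["lorem ipsum", "Integer", "venenatis", "Consectetur", "Apple"]
# make_doc only tokenizes; no tagger, parser, or NER is run on the terms.
terms = [model.make_doc(item) for item in items]
pmatcher.add("TermsList", terms)

print(len(pmatcher))            # number of match IDs (rules) added
print("TermsList" in pmatcher)  # whether a rule with this ID exists
```

Note that len(pmatcher) counts match IDs, not individual terms, so all five terms above count as one rule.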

STEP 5 - Searching in the text

text=model("Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec nunc metus, iaculis a convallis fringilla, venenatis nec metus. Integer volutpat laoreet sollicitudin. Nunc vestibulum dolor sit amet arcu iaculis aliquet.")
matches=pmatcher(text)

We create a document from the text that is to be searched.
Then we use the PhraseMatcher to find the positions of the terms in that text.
matches is a list of (match_id, start, end) tuples, one per match. start and end are token indices (not character offsets), and end is exclusive.
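
If you would rather work with the matched text directly, the matcher can return Span objects instead of tuples. The sketch below assumes spaCy v3 or later, where the as_spans=True argument is available; the shortened text and term list are made up for this example.

```python
import spacy
from spacy.matcher import PhraseMatcher

model = spacy.blank("en")
pmatcher = PhraseMatcher(model.vocab, attr="LOWER")
pmatcher.add(
    "TermsList",
    [model.make_doc("lorem ipsum"), model.make_doc("venenatis")],
)

text = model("Lorem ipsum dolor sit amet, venenatis nec metus.")
# Assumes spaCy v3+: as_spans=True returns Span objects instead of tuples.
spans = pmatcher(text, as_spans=True)
for span in spans:
    # label_ holds the match_id as a readable string.
    print(span.label_, span.start, span.end, span.text)
```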

STEP 6 - Printing

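print('Matches:',matches)
# All terms were added under one match_id, so look it up once (if anything matched).
if matches:
  match_id=matches[0][0]
  print(model.vocab.strings[match_id],':',sep='')
# Each match is a (match_id, start, end) tuple; start and end are token indices.
for _,start,end in matches:
  print(text[start:end])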

The code above prints every matching term found in the text. 'Apple' is in the terms list but does not occur in the text, so it produces no match. Output:

Matches: [(16033226493888527015, 0, 2), (16033226493888527015, 6, 7), (16033226493888527015, 19, 20), (16033226493888527015, 23, 24)]
TermsList:
Lorem ipsum
consectetur
venenatis
Integer
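
The printing code above looks up the match_id only once, which works because every term was added under the same ID. If you register several term lists, each match carries its own ID. The sketch below shows one way to handle that and also reports character offsets. The IDs LatinTerms and OtherTerms and the shortened text are hypothetical, chosen only for this example.

```python
import spacy
from spacy.matcher import PhraseMatcher

model = spacy.blank("en")
pmatcher = PhraseMatcher(model.vocab, attr="LOWER")
# Two hypothetical term lists, each registered under its own match ID.
pmatcher.add(
    "LatinTerms",
    [model.make_doc("lorem ipsum"), model.make_doc("venenatis")],
)
pmatcher.add("OtherTerms", [model.make_doc("Integer")])

text = model("Lorem ipsum dolor sit amet. Integer venenatis nec metus.")
results = []
for match_id, start, end in pmatcher(text):
    label = model.vocab.strings[match_id]  # hash -> original string
    span = text[start:end]                 # token indices -> Span
    results.append((label, span.text, span.start_char, span.end_char))
    print(f"{label}: {span.text!r} at characters {span.start_char}-{span.end_char}")
```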

Thank you for reading!
I would love to connect with you on LinkedIn.
Contact me at nisargkapkar00@gmail.com.
