spaCy is a free, open-source library for Natural Language Processing (NLP) in Python.
Some spaCy features include:
- Non-destructive tokenization
- Pretrained word vectors
- Named entity recognition
- Part-of-speech (POS) tagging
You can read more about spaCy and its features here.
This post covers basic pattern matching with spaCy.
Pattern matching is the process of checking whether a given set of tokens or phrases exists within a document. A token is a unit of text, such as a word or a punctuation mark.
More about Pattern Matching here.
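To make the idea concrete, here is a minimal sketch in plain Python that looks for a phrase inside a list of tokens. It is only an illustration of the concept, not how spaCy implements matching; the function name find_phrase and the whitespace tokenization are assumptions made for this example.

```python
def find_phrase(tokens, phrase):
    """Return (start, end) token positions where phrase occurs in tokens.

    Hypothetical helper for illustration; comparison is case-insensitive.
    """
    n = len(phrase)
    if n == 0:
        return []
    lowered = [t.lower() for t in tokens]
    target = [p.lower() for p in phrase]
    # Slide a window of n tokens over the text and compare each window to the phrase
    return [(i, i + n) for i in range(len(lowered) - n + 1) if lowered[i:i + n] == target]


# Naive whitespace tokenization, good enough for this sketch
tokens = "Lorem ipsum dolor sit amet".split()
hits = find_phrase(tokens, ["lorem", "ipsum"])
print(hits)
```

Each hit is a pair of token positions, which is the same kind of result the PhraseMatcher returns later in this post.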
STEP 1 - Importing required libraries and functions
import spacy
from spacy.matcher import PhraseMatcher
NOTE:
Make sure you have installed all the required libraries before importing them. Installation instructions for spaCy can be found here.
STEP 2 - Loading the model
model = spacy.load("en_core_web_sm")
spaCy provides support for multiple languages. Here we are loading the English model.
More details about available models here.
More details about the English model here.
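If the English model has not been downloaded, spacy.load raises an OSError. The sketch below, which assumes spaCy v3, falls back to a blank English pipeline in that case. A blank pipeline only tokenizes text, which should be enough for the case-insensitive phrase matching in this post, but treat the fallback as an assumption rather than a recommended setup.

```python
import spacy

try:
    # Pretrained English pipeline (has to be downloaded separately)
    model = spacy.load("en_core_web_sm")
except OSError:
    # Assumption: a tokenizer-only pipeline is an acceptable stand-in for this tutorial
    model = spacy.blank("en")

doc = model("Lorem ipsum dolor sit amet")
print([token.text for token in doc])
```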
STEP 3 - Creating PhraseMatcher
pmatcher = PhraseMatcher(model.vocab, attr='LOWER')
PhraseMatcher lets you efficiently match large lists of terms.
The pmatcher uses the vocabulary of the English model (which we loaded in STEP 2).
attr='LOWER' makes the matching case-insensitive by comparing the lowercase form of each token.
More details about PhraseMatcher and its parameters here.
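To see what attr='LOWER' changes, the sketch below compares a default PhraseMatcher (which matches on the exact token text) with a LOWER one. It assumes spaCy v3 and uses a blank English pipeline so that no model download is needed; the match key 'FRUIT' and the sample sentence are made up for this example.

```python
import spacy
from spacy.matcher import PhraseMatcher

model = spacy.blank("en")  # tokenizer-only pipeline, used here to avoid a model download

exact = PhraseMatcher(model.vocab)                # default: match on exact token text
lower = PhraseMatcher(model.vocab, attr="LOWER")  # match on lowercased token text

pattern = model.make_doc("Apple")
exact.add("FRUIT", [pattern])
lower.add("FRUIT", [pattern])

doc = model.make_doc("I ate an apple and an APPLE.")

# Sort by start position so the printed order follows the text
exact_hits = [doc[s:e].text for _, s, e in sorted(exact(doc), key=lambda m: m[1])]
lower_hits = [doc[s:e].text for _, s, e in sorted(lower(doc), key=lambda m: m[1])]

print("exact:", exact_hits)
print("lower:", lower_hits)
```

The exact matcher finds nothing because neither "apple" nor "APPLE" is spelled exactly like the pattern "Apple", while the LOWER matcher finds both.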
STEP 4 - Creating a list of terms to match in data
items = ['lorem ipsum', 'Integer', 'venenatis', 'Consectetur', 'Apple']
terms = []
for item in items:
    # Convert each term into a document object
    terms.append(model(item))
pmatcher.add('TermsList', terms)
items contains the words and phrases to look for in the text.
The PhraseMatcher needs each term as a document object, so we call the model on every entry in items.
These document objects are stored in a new list called terms.
Finally, we add this list to our PhraseMatcher under the key TermsList, which serves as the match_id.
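Calling the full model on every term runs the whole pipeline (tagger, parser and so on), which is more work than phrase matching on LOWER needs. A possible alternative, sketched below under the assumption of spaCy v3, is model.make_doc, which only tokenizes. The blank pipeline is used here just to keep the example self-contained.

```python
import spacy
from spacy.matcher import PhraseMatcher

model = spacy.blank("en")  # assumption: stand-in for the pipeline loaded in STEP 2
pmatcher = PhraseMatcher(model.vocab, attr="LOWER")

items = ["lorem ipsum", "Integer", "venenatis", "Consectetur", "Apple"]

# make_doc runs only the tokenizer, which is all that LOWER matching requires
terms = [model.make_doc(item) for item in items]
pmatcher.add("TermsList", terms)

print(len(terms), "TermsList" in pmatcher)
```

For a handful of terms the difference is negligible; it starts to matter when the term list grows to thousands of entries.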
STEP 5 - Searching in the text
text = model("Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec nunc metus, iaculis a convallis fringilla, venenatis nec metus. Integer volutpat laoreet sollicitudin. Nunc vestibulum dolor sit amet arcu iaculis aliquet.")
matches = pmatcher(text)
We create a document from the text that is to be searched.
Then we use the PhraseMatcher to find the positions of the terms in that text.
matches is a list of (match_id, start, end) tuples, one per match. start and end are token positions in the document, not character offsets.
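The sketch below shows one way to unpack such a tuple: the match_id hash is turned back into its string through the vocabulary, and the token positions are used to slice the document. It assumes spaCy v3 and a blank English pipeline; the shortened sample text and the results list are made up for this example.

```python
import spacy
from spacy.matcher import PhraseMatcher

model = spacy.blank("en")  # tokenizer-only pipeline, used to keep the sketch self-contained
pmatcher = PhraseMatcher(model.vocab, attr="LOWER")
pmatcher.add("TermsList", [model.make_doc("lorem ipsum"), model.make_doc("venenatis")])

text = model("Lorem ipsum dolor sit amet, venenatis nec metus.")

results = []
for match_id, start, end in pmatcher(text):
    span = text[start:end]                  # slice by token positions
    label = model.vocab.strings[match_id]   # hash -> 'TermsList'
    # start_char / end_char give character offsets if those are needed instead
    results.append((label, start, end, span.text, span.start_char, span.end_char))

results.sort(key=lambda r: r[1])  # order by start token
for row in results:
    print(row)
```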
STEP 6 - Printing
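print('Matches:', matches)
# Every match here shares the same match_id, so we read it from the first match
match_id = matches[0][0]
print(model.vocab.strings[match_id], ':', sep='')
for _, start, end in matches:
    # start and end are token positions in text
    print(text[start:end])
The code above prints every matching term found in the text. Note that Consectetur matches consectetur because the matching is case-insensitive, while Apple does not appear in the text and therefore produces no match. Output: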
Matches: [(16033226493888527015, 0, 2), (16033226493888527015, 6, 7), (16033226493888527015, 19, 20), (16033226493888527015, 23, 24)]
TermsList:
Lorem ipsum
consectetur
venenatis
Integer
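Reading matches[0][0] raises an IndexError when nothing matches. A slightly more defensive variant is sketched below, assuming spaCy v3, where the matcher accepts as_spans=True and returns Span objects labelled with the match_id. The blank pipeline, the shortened text and the found list are assumptions made for this example.

```python
import spacy
from spacy.matcher import PhraseMatcher

model = spacy.blank("en")  # tokenizer-only pipeline, used to keep the sketch self-contained
pmatcher = PhraseMatcher(model.vocab, attr="LOWER")
pmatcher.add("TermsList", [model.make_doc("lorem ipsum"), model.make_doc("Consectetur")])

text = model("Lorem ipsum dolor sit amet, consectetur adipiscing elit.")

# as_spans=True returns Span objects instead of (match_id, start, end) tuples
spans = sorted(pmatcher(text, as_spans=True), key=lambda s: s.start)
found = [(span.label_, span.text) for span in spans]

if not found:
    print("No matches found")
for label, phrase in found:
    print(label, ":", phrase)

# A text without any of the terms simply yields an empty list
no_hits = pmatcher(model("nothing to see here"), as_spans=True)
print(len(no_hits))
```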
Thank you for reading!
I would love to connect with you on LinkedIn.
Contact me at nisargkapkar00@gmail.com.