Python String Matching Without Complex RegEx Syntax
Learn how to use human-friendly programmable regular expressions for complex Python string matching.
I have a love-hate relationship with regular expressions (RegEx), especially in Python. I love that they let you extract or match strings without writing several helper functions, which makes them more flexible than plain string search methods.
What I don’t like is how hard RegEx patterns are to learn and understand. I can handle simple string matching, such as extracting all alphanumeric characters or cleaning text for NLP tasks. Things get harder when it comes to extracting IP addresses, emails, and IDs from junk text: you have to write a complex RegEx pattern to extract the required item.
To make complex RegEx tasks simple, we will learn about a Python package called PRegEx. We will also look at a few examples of extracting dates and emails from a long string of text.
Getting Started with PRegEx
PRegEx is a higher-level API built on top of the `re` module. It lets you build regular expressions from readable Python objects instead of raw pattern strings, which makes them easier for any programmer to understand and remember. You also don’t have to group patterns or escape metacharacters yourself, and the patterns are modular.
You can install the library using pip.
pip install pregex
To test the functionality of PRegEx, we will use modified sample code from the documentation.
In the example below, we extract either an HTTP URL or an IPv4 address with a port number. We don’t have to create complex logic for it. We can use the built-in classes `HttpUrl` and `IPv4`.
- Create a port number using AnyDigit(). The first digit of the port should not be zero, and the next three digits can be any digit.
- Use Either() to combine the two alternatives, so the pattern matches either an HTTP URL or an IP address with a port number.
from pregex.core.pre import Pregex
from pregex.core.classes import AnyDigit
from pregex.core.operators import Either
from pregex.meta.essentials import HttpUrl, IPv4
port_number = (AnyDigit() - '0') + 3 * AnyDigit()
pre = Either(
    HttpUrl(capture_domain=True, is_extensible=True),
    IPv4(is_extensible=True) + ':' + port_number
)
We will test it on a short multi-line string in which the targets are surrounded by filler characters.
text = """IPV4--192.168.1.1:8000--
address--https://www.abid.works--
website--https://www.kdnuggets.com--text"""
Before we extract the matching string, let’s look at the RegEx pattern.
regex_pattern = pre.get_pattern()
print(regex_pattern)
Output
As we can see, the generated pattern is hard to read, let alone understand. This is where PRegEx shines: it provides a human-friendly API for performing complex regular expression tasks.
(?:https?:\/\/)?(?:www\.)?(?:[a-z\dA-Z][a-z\-\dA-Z]{,61}[a-z\dA-Z]\.)*([a-z\dA-Z][a-z\-\dA-Z]{,61}[a-z\dA-Z])\.[a-z]{2,6}(?::\d{1,4})?(?:\/[!-.0-~]+)*\/?(?:(?<=[!-\/\[-`{-~:-@])|(?<=\w))|(?:(?:\d|[1-9]\d|1\d{2}|2(?:[0-4]\d|5[0-5]))\.){3}(?:\d|[1-9]\d|1\d{2}|2(?:[0-4]\d|5[0-5])):[1-9]\d{3}
Much like `re.findall`, `.get_matches(text)` returns a list of every matching substring, so we will use it to extract the required strings.
results = pre.get_matches(text)
print(results)
Output
We have extracted both the IP address with port number and two web URLs.
['192.168.1.1:8000', 'https://www.abid.works', 'https://www.kdnuggets.com']
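To see how the readable building blocks map onto the generated pattern, the IPv4-plus-port branch of the output above can be run directly with the standard `re` module. This is only a sketch: the pattern pieces are copied from the printed pattern, while the names `OCTET`, `PORT`, and `IP_WITH_PORT` are made up for illustration and are not part of PRegEx.

```python
import re

# One IPv4 octet (0-255), copied from the pattern printed by get_pattern().
OCTET = r"(?:\d|[1-9]\d|1\d{2}|2(?:[0-4]\d|5[0-5]))"
# Port as built above: a non-zero first digit followed by three more digits.
PORT = r"[1-9]\d{3}"
# Three "octet." groups, a final octet, a colon, and the port.
IP_WITH_PORT = re.compile(rf"(?:{OCTET}\.){{3}}{OCTET}:{PORT}")

text = """IPV4--192.168.1.1:8000--
address--https://www.abid.works--
website--https://www.kdnuggets.com--text"""

print(IP_WITH_PORT.findall(text))  # ['192.168.1.1:8000']
```

Because the port is exactly four digits with a non-zero first digit, this pattern only accepts ports from 1000 to 9999. A string such as `192.168.1.1:80` would not match.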
Example 1: Date Format
Let’s look at a couple of examples where we can understand the full potential of PRegEx.
In this example, we will be extracting certain kinds of date patterns from the text below.
text = """
04-15-2023
2023-08-15
06-20-2023
06/24/2023
"""
By using Exactly() and AnyDigit(), we will create the day, month, and year parts of the date. The day and month have two digits each, whereas the year has four. They are separated by hyphens (“-”).
After creating the pattern, we will run `get_matches` to extract the matching strings.
from pregex.core.classes import AnyDigit
from pregex.core.quantifiers import Exactly
day_or_month = Exactly(AnyDigit(), 2)
year = Exactly(AnyDigit(), 4)
pre = (
    day_or_month +
    "-" +
    day_or_month +
    "-" +
    year
)
results = pre.get_matches(text)
print(results)
Output
['04-15-2023', '06-20-2023']
Let’s look at the RegEx pattern by using the `get_pattern()` method.
regex_pattern = pre.get_pattern()
print(regex_pattern)
Output
As we can see, it has a simple RegEx syntax.
\d{2}-\d{2}-\d{4}
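As a quick cross-check, the pattern string printed above can be passed to the standard `re` module, which should return the same two dates. This is a minimal sketch; the variable name `date_pattern` is chosen here for illustration.

```python
import re

text = """
04-15-2023
2023-08-15
06-20-2023
06/24/2023
"""

# The pattern string printed by pre.get_pattern() in the example above.
date_pattern = re.compile(r"\d{2}-\d{2}-\d{4}")

print(date_pattern.findall(text))  # ['04-15-2023', '06-20-2023']
```

The year-first date is skipped because it does not end in four digits after the second hyphen, and the last date is skipped because its separator is a slash rather than a hyphen.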
Example 2: Email Extraction
The second example is a bit more complex: we will extract valid email addresses from junk text.
text = """
user1@abid.works
editorial@@kdnuggets.com
lover@python.gg.
editorial1@kdnuggets.com
"""
- Create a user pattern with `OneOrMore()`. We will use `AnyButFrom()` to exclude “@” and the space character from the allowed characters.
- Similar to the user pattern, create a company pattern that additionally excludes the “.” character.
- For the domain, use `AtLeast()` to require two or more characters other than "@", space, and full stop, and wrap it in `MatchAtLineEnd()` so that the domain must sit at the end of a line. This is why an address followed by a trailing full stop is rejected.
- Combine all three to create the final pattern: user@company.domain.
from pregex.core.classes import AnyButFrom
from pregex.core.quantifiers import OneOrMore, AtLeast
from pregex.core.assertions import MatchAtLineEnd
user = OneOrMore(AnyButFrom("@", ' '))
company = OneOrMore(AnyButFrom("@", ' ', '.'))
domain = MatchAtLineEnd(AtLeast(AnyButFrom("@", ' ', '.'), 2))
pre = (
    user +
    "@" +
    company +
    '.' +
    domain
)
results = pre.get_matches(text)
print(results)
Output
As we can see, PRegEx has identified the two valid email addresses.
['user1@abid.works', 'editorial1@kdnuggets.com']
Note: both code examples are modified versions of work by The PyCoach.
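For comparison, the same user@company.domain idea from Example 2 can be approximated with the standard `re` module. This is a hand-written sketch, not the pattern PRegEx generates: it swaps the space character for `\s` (any whitespace), because a negated class such as `[^@ ]` also matches newline characters and could let a match run across lines. The name `email_pattern` is illustrative.

```python
import re

text = """
user1@abid.works
editorial@@kdnuggets.com
lover@python.gg.
editorial1@kdnuggets.com
"""

email_pattern = re.compile(
    r"[^@\s]+"       # user: one or more characters except "@" and whitespace
    r"@"
    r"[^@\s.]+"      # company: additionally excludes "."
    r"\."
    r"[^@\s.]{2,}$"  # domain: at least two characters, then end of line
    , re.MULTILINE   # make "$" match at the end of every line
)

print(email_pattern.findall(text))
# ['user1@abid.works', 'editorial1@kdnuggets.com']
```

The address with a doubled "@" fails because the company part cannot start with "@", and the one with a trailing full stop fails because the domain is not at the end of its line.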
Conclusion
If you are a data scientist, analyst, or NLP enthusiast, PRegEx is worth trying for cleaning text and building simple matching logic. It can reduce your dependency on NLP frameworks, as much of the matching can be done with a simple API.
In this mini tutorial, we have learned about the Python package PRegEx and its use cases with examples. You can learn more by reading the official documentation or solving a Wordle problem using programmable regular expressions.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a bachelor's degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.