Regular Expressions
Chapter 11
Python for Everybody
Regular Expressions
In computing, a regular expression, also referred to as
“regex” or “regexp”, provides a concise and flexible
means for matching strings of text, such as particular
characters, words, or patterns of characters. A regular
expression is written in a formal language that can be
interpreted by a regular expression processor.
Really clever “wild card” expressions for matching
and parsing strings
Really smart “Find” or “Search”
Understanding Regular Expressions
• Very powerful and quite cryptic
• Fun once you understand them
• Regular expressions are a language unto themselves
• A language of “marker characters” - programming with characters
• It is kind of an “old school” language - compact
Regular Expression Quick Guide
^ Matches the beginning of a line
$ Matches the end of the line
. Matches any character
\s Matches whitespace
\S Matches any non-whitespace character
* Repeats the preceding character zero or more times
+ Repeats the preceding character one or more times
[aeiou] Matches a single character in the listed set
[^XYZ] Matches a single character not in the listed set
{m,n} Repeats the preceding character between m and n times
[a-z] Matches any lowercase letter
[A-Z] Matches any uppercase letter
[0-9] Matches any digit
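The entries in the quick guide can be tried out directly. The following is a minimal sketch, assuming Python's re.search() and re.findall() as the matching calls; the sample text is invented for illustration and does not come from the slides.

```python
import re

# Hypothetical sample text, not taken from the slides
text = 'Total: 42 apples and 7 pears'

# ^ anchors the pattern at the beginning of the line
starts_with_total = re.search('^Total', text) is not None

# $ anchors the pattern at the end of the line
ends_with_pears = re.search('pears$', text) is not None

# [0-9]+ : one or more characters from the set of digits
numbers = re.findall('[0-9]+', text)

# \S+ : runs of one or more non-whitespace characters
words = re.findall(r'\S+', text)

# [aeiou] : a single character from the listed set
vowels = re.findall('[aeiou]', 'pears')

# [^aeiou] : a single character NOT in the listed set
consonants = re.findall('[^aeiou]', 'pears')

print(starts_with_total, ends_with_pears)
print(numbers)
print(words)
print(vowels, consonants)
```

Each pattern is passed as the first argument and the text to scan as the second.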
The Regular Expression Module
• Before you can use regular expressions in your program, you must
import the library using “import re”
• You can use [Link]() to see if a string matches a regular expression,
similar to using the find() method for strings
• You can use [Link]() to extract portions of a string that match your
regular expression, similar to a combination of find() and slicing:
var[5:10]
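The comparison in the bullets above can be sketched as follows, assuming the two library calls being described are re.search() and re.findall(). The sample line is hypothetical.

```python
import re

# Hypothetical sample line for illustration
line = 'Date: 2008-01-05 time 09:14'

# re.search() answers "is there a match?" (a match object, or None)
found = re.search('[0-9]+', line)
has_digits = found is not None

# str.find() gives a position (or -1), which we then slice from
pos = line.find('2008')
sliced = line[pos:pos + 4]

# re.findall() locates and extracts in one step, returning a list
all_numbers = re.findall('[0-9]+', line)

print(has_digits)
print(sliced)
print(all_numbers)
```

With find() we must already know what we are looking for and how long it is; with the regular expression we only describe its shape.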
Using [Link]() Like find()
# String-method version
hand = open('[Link]')
for line in hand:
    line = [Link]()
    if [Link]('From:') >= 0:
        print(line)

# Regular-expression version
import re
hand = open('[Link]')
for line in hand:
    line = [Link]()
    if [Link]('From:', line):
        print(line)
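A runnable sketch of the same comparison, assuming the regular-expression call is re.search() and the line cleanup is rstrip(). A short in-memory list of hypothetical lines stands in for the mailbox file so the example needs no input file.

```python
import re

# Hypothetical stand-in for the lines of a mailbox file
lines = [
    'From: alice@example.org\n',
    'Subject: meeting notes\n',
    'Reply sent From: the help desk\n',
]

via_find = []
via_regex = []
for line in lines:
    line = line.rstrip()
    # find() returns the position of the text, or -1 if it is absent
    if line.find('From:') >= 0:
        via_find.append(line)
    # re.search() returns a match object (truthy) or None (falsy)
    if re.search('From:', line):
        via_regex.append(line)

print(via_find == via_regex)
print(via_regex)
```

Both tests pick the same lines here, because the pattern 'From:' contains no special characters.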
Using [Link]() Like startswith()
# String-method version
hand = open('[Link]')
for line in hand:
    line = [Link]()
    if [Link]('From:'):
        print(line)

# Regular-expression version
import re
hand = open('[Link]')
for line in hand:
    line = [Link]()
    if [Link]('^From:', line):
        print(line)
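The effect of the ^ character can be sketched like this, again assuming re.search() is the matching call. The two sample lines are hypothetical.

```python
import re

# Hypothetical sample lines
lines = [
    'From: alice@example.org',
    'Reply sent From: the help desk',
]

# Without ^ the text may appear anywhere in the line
unanchored = [ln for ln in lines if re.search('From:', ln)]

# With ^ the text must appear at the start of the line
anchored = [ln for ln in lines if re.search('^From:', ln)]

# str.startswith() makes the same selection
prefix = [ln for ln in lines if ln.startswith('From:')]

print(len(unanchored), len(anchored))
print(anchored == prefix)
```

Only the first line survives once the match is anchored.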
We fine-tune what is matched by adding special characters to the string
Wild-Card Characters
• The dot character matches any character
• If you add the asterisk character, the preceding character is matched “any number of
times”
X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-DSPAM-Confidence: 0.8475
X-Content-Type-Message-Body: text/plain
^X.*:
^ Match the start of the line
. Match any character
* Many times
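To see what ^X.*: actually matches, here is a small sketch that assumes re.search() and uses the match object's group() method to show the matched text. The four X- lines are the ones on the slide; the last line is a hypothetical non-matching line added for contrast.

```python
import re

headers = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Result: Innocent',
    'X-DSPAM-Confidence: 0.8475',
    'X-Content-Type-Message-Body: text/plain',
    'Received: by mail.example.org',  # hypothetical, does not start with X
]

matched = []
for line in headers:
    m = re.search('^X.*:', line)
    if m:
        # group() returns the part of the line the pattern covered
        matched.append(m.group())

for text in matched:
    print(text)
```

Each of the four X- lines matches from the X up to and including the colon.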
Fine-Tuning Your Match
Depending on how “clean” your data is and the purpose of your
application, you may want to narrow your match down a bit
X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-Plane is behind schedule: two weeks
X-: Very short
^X.*:
^ Match the start of the line
. Match any character
* Many times
X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-: Very Short
X-Plane is behind schedule: two weeks
^X-\S+:
^ Match the start of the line
\S Match any non-whitespace character
+ One or more times
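The difference between the loose and the narrowed pattern can be checked on the four sample lines. This is a sketch that assumes re.search() as the matching call; the variable names are illustrative.

```python
import re

lines = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Result: Innocent',
    'X-: Very Short',
    'X-Plane is behind schedule: two weeks',
]

# Loose: X, then anything at all, then a colon
loose = [ln for ln in lines if re.search('^X.*:', ln)]

# Narrow: X-, then one or more non-whitespace characters, then a colon
narrow = [ln for ln in lines if re.search(r'^X-\S+:', ln)]

print(len(loose))
for ln in narrow:
    print(ln)
```

The loose pattern accepts all four lines. The narrow one rejects 'X-: Very Short' (nothing between the dash and the colon) and the X-Plane sentence (a space appears before the colon).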
Matching and Extracting Data
• [Link]() returns a True/False depending on whether the string
matches the regular expression
• If we actually want the matching strings to be extracted, we use
[Link]()
>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = [Link]('[0-9]+',x)
>>> print(y)
['2', '19', '42']
[0-9]+ One or more digits
Matching and Extracting Data
When we use [Link](), it returns a list of zero or more sub-strings that
match the regular expression
>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = [Link]('[0-9]+',x)
>>> print(y)
['2', '19', '42']
>>> y = [Link]('[AEIOU]+',x)
>>> print(y)
[]
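A sketch of the two behaviours side by side, assuming the calls in question are re.search() and re.findall(). Strictly speaking, re.search() returns a match object or None rather than True or False, but the result can be used directly in an if test.

```python
import re

x = 'My 2 favorite numbers are 19 and 42'

# search: a match object for the first match, or None if there is none
m = re.search('[0-9]+', x)
first = m.group() if m else None

# No uppercase vowels in x, so search gives None
nothing = re.search('[AEIOU]+', x)

# findall: every non-overlapping match, as a list of strings
everything = re.findall('[0-9]+', x)

# The extracted pieces are strings; convert them before doing arithmetic
as_ints = [int(s) for s in everything]

print(first)
print(nothing)
print(everything)
print(sum(as_ints))
```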
Warning: Greedy Matching
The repeat characters (* and +) push outward in both directions (greedy)
to match the largest possible string
>>> import re
>>> x = 'From: Using the : character'
>>> y = [Link]('^F.+:', x)
>>> print(y)
['From: Using the :']
Why not 'From:' ?
^F.+:
^F First character in the match is an F
.+ One or more characters
: Last character in the match is a :
Non-Greedy Matching
Not all regular expression repeat codes are greedy! If you
add a ? character, the + and * chill out a bit...
>>> import re
>>> x = 'From: Using the : character'
>>> y = [Link]('^F.+?:', x)
>>> print(y)
['From:']
^F.+?:
^F First character in the match is an F
.+? One or more characters but not greedy
: Last character in the match is a :
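Greedy and non-greedy matching can be compared on the same string. This sketch assumes re.findall() is the call in use; the third pattern is an alternative that is not on the slides, shown only for comparison.

```python
import re

x = 'From: Using the : character'

# Greedy: .+ runs as far as it can, so the match ends at the LAST colon
greedy = re.findall('^F.+:', x)

# Non-greedy: .+? stops as soon as it can, so the match ends at the FIRST colon
lazy = re.findall('^F.+?:', x)

# Alternative (not from the slides): forbid colons inside the repeated part
no_colons = re.findall('^F[^:]+:', x)

print(greedy)
print(lazy)
print(no_colons)
```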
Fine-Tuning String Extraction
You can refine the match for [Link]() and separately determine which portion of
the match is to be extracted by using parentheses
From [Link]@[Link] Sat Jan 5 09:14:16 2008
>>> x = 'From [Link]@[Link] Sat Jan 5 09:14:16 2008'
>>> y = [Link]('\S+@\S+',x)
>>> print(y)
['[Link]@[Link]']
\S+@\S+
\S+ At least one non-whitespace character
Fine-Tuning String Extraction
Parentheses are not part of the match, but they mark where the extracted string
starts and stops
From [Link]@[Link] Sat Jan 5 09:14:16 2008
>>> y = [Link]('\S+@\S+',x)
>>> print(y)
['[Link]@[Link]']
>>> y = [Link]('^From (\S+@\S+)',x)
>>> print(y)
['[Link]@[Link]']
^From (\S+@\S+)
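The role of the parentheses is easier to see when the same pattern is run with none, one, and two pairs. This is a sketch assuming re.findall(); the address alice@example.org is a hypothetical stand-in for the one on the slide. Raw strings (r'...') are used so that Python passes \S to the regular expression unchanged.

```python
import re

# Hypothetical line in the same shape as the slide's example
x = 'From alice@example.org Sat Jan 5 09:14:16 2008'

# No parentheses: the whole match is returned
whole = re.findall(r'^From \S+@\S+', x)

# One pair: the line must still start with 'From ', but only the
# parenthesized part is returned
inner = re.findall(r'^From (\S+@\S+)', x)

# Two pairs: each match becomes a tuple with one item per pair
parts = re.findall(r'^From (\S+)@(\S+)', x)

print(whole)
print(inner)
print(parts)
```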
String Parsing Examples…
Extracting a host name - using find and string slicing
From [Link]@[Link] Sat Jan 5 09:14:16 2008
Position 21 is the at sign; position 31 is the space after the address
>>> data = 'From [Link]@[Link] Sat Jan 5 09:14:16 2008'
>>> atpos = [Link]('@')
>>> print(atpos)
21
>>> sppos = [Link](' ',atpos)
>>> print(sppos)
31
>>> host = data[atpos+1 : sppos]
>>> print(host)
[Link]
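The find-and-slice steps can be wrapped in a small function. This is a sketch only: extract_host is a hypothetical helper name, the address is invented, and the handling of missing characters is an addition that the slide does not cover.

```python
def extract_host(data):
    """Return the text between the first '@' and the next space, or None."""
    atpos = data.find('@')
    if atpos < 0:
        return None              # no at sign in the line
    sppos = data.find(' ', atpos)
    if sppos < 0:
        sppos = len(data)        # address runs to the end of the line
    return data[atpos + 1:sppos]


# Hypothetical line in the same shape as the slide's example
data = 'From alice@example.org Sat Jan 5 09:14:16 2008'

print(data.find('@'))
print(extract_host(data))
print(extract_host('no address here'))
```

Without the checks, a line with no at sign would make find() return -1 and the slice would silently produce the wrong text.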
The Double Split Pattern
Sometimes we split a line one way, and then grab one of the pieces of the
line and split that piece again
From [Link]@[Link] Sat Jan 5 09:14:16 2008
words = [Link]()
email = words[1]          # [Link]@[Link]
pieces = [Link]('@')      # ['[Link]', '[Link]']
print(pieces[1])          # '[Link]'
The Regex Version
From [Link]@[Link] Sat Jan 5 09:14:16 2008
import re
lin = 'From [Link]@[Link] Sat Jan 5 09:14:16 2008'
y = [Link]('@([^ ]*)',lin)
print(y)
['[Link]']
'@([^ ]*)'
@ Look through the string until you find an at sign
[^ ] Match non-blank character
* Match many of them
( ) Extract the non-blank characters
Even Cooler Regex Version
From [Link]@[Link] Sat Jan 5 09:14:16 2008
import re
lin = 'From [Link]@[Link] Sat Jan 5 09:14:16 2008'
y = [Link]('^From .*@([^ ]*)',lin)
print(y)
['[Link]']
'^From .*@([^ ]*)'
^From Starting at the beginning of the line, look for the string 'From '
.*@ Skip a bunch of characters, looking for an at sign
( Start extracting
[^ ] Match non-blank character
* Match many of them
) Stop extracting
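The double split and the two regular expressions can be run side by side. This sketch assumes re.findall() is the extraction call; host_by_split is a hypothetical helper name and both sample lines are invented. The second line shows why anchoring the pattern with '^From ' is worth the extra characters.

```python
import re


def host_by_split(line):
    """Double split: first on whitespace, then on the at sign."""
    words = line.split()
    email = words[1]
    pieces = email.split('@')
    return pieces[1]


# Hypothetical sample lines
lin = 'From alice@example.org Sat Jan 5 09:14:16 2008'
other = 'Subject: meet @ noon'

split_host = host_by_split(lin)
simple = re.findall('@([^ ]*)', lin)
anchored = re.findall('^From .*@([^ ]*)', lin)

# On a line that is not a From line, the simple pattern still finds
# an at sign and extracts an empty string; the anchored one finds nothing
simple_other = re.findall('@([^ ]*)', other)
anchored_other = re.findall('^From .*@([^ ]*)', other)

print(split_host, simple, anchored)
print(simple_other, anchored_other)
```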
Spam Confidence
import re
hand = open('[Link]')
numlist = list()
for line in hand:
    line = [Link]()
    stuff = [Link]('^X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) != 1 : continue
    num = float(stuff[0])
    [Link](num)
print('Maximum:', max(numlist))

Sample input line:
X-DSPAM-Confidence: 0.8475

python [Link]
Maximum: 0.9907
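The same loop can be tried without an input file. This is a sketch that assumes re.findall() for the extraction and a list append for collecting the numbers; the list of lines is a hypothetical stand-in for the file, so the maximum printed here is not the slide's result.

```python
import re

# Hypothetical stand-in for the lines of the mailbox file
lines = [
    'X-DSPAM-Result: Innocent\n',
    'X-DSPAM-Confidence: 0.8475\n',
    'X-DSPAM-Probability: 0.0000\n',
    'X-DSPAM-Confidence: 0.6178\n',
]

numlist = []
for line in lines:
    line = line.rstrip()
    # The parentheses select just the number after the header name
    stuff = re.findall(r'^X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) != 1:
        continue                 # not a confidence line, skip it
    numlist.append(float(stuff[0]))

print('Maximum:', max(numlist))
```

The length check doubles as the filter: lines that do not match produce an empty list and are skipped.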
Escape Character
If you want a special regular expression character to just behave
normally (most of the time) you prefix it with '\'
>>> import re
>>> x = 'We just received $10.00 for cookies.'
>>> y = [Link]('\$[0-9.]+',x)
>>> print(y)
['$10.00']
\$[0-9.]+
\$ A real dollar sign
[0-9.] A digit or period
+ One or more
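What the backslash changes can be seen by running the pattern with and without it. This sketch assumes re.findall() is the call in use. It writes the patterns as raw strings (r'...') because recent Python versions warn about sequences such as '\$' in ordinary string literals. re.escape() is a standard library helper, not mentioned in the slides, that adds the backslashes for you.

```python
import re

x = 'We just received $10.00 for cookies.'

# Escaped: \$ is a literal dollar sign
escaped = re.findall(r'\$[0-9.]+', x)

# Unescaped: $ means "end of line", and no digits can follow the end
unescaped = re.findall('$[0-9.]+', x)

# re.escape() backslash-escapes every special character in a literal string
literal = re.escape('$10.00')
auto = re.findall(literal, x)

print(escaped)
print(unescaped)
print(auto)
```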
Summary
• Regular expressions are a cryptic but powerful language for
matching strings and extracting elements from those strings
• Regular expressions have special characters that indicate intent
Acknowledgements / Contributions
These slides are Copyright 2010- Charles R. Severance (
...
[Link]) of the University of Michigan School of
Information and [Link] and made available under a
Creative Commons Attribution 4.0 License. Please maintain this
last slide in all copies of the document to comply with the
attribution requirements of the license. If you make a change,
feel free to add your name and organization to the list of
contributors on this page as you republish the materials.
Initial Development: Charles Severance, University of Michigan
School of Information
… Insert new Contributors and Translations here