String Manipulation
(in Python)
11/08/2022 1
Strings
Compound data type
• Strings are made up of smaller units (characters), and we
may access the whole string or its parts
• A character in Python is a string of length 1
• Both single and double quotes can be used, e.g. "fruits"
or 'fruits'
>>> name = "sachin"
>>> print(name)
sachin
>>> print(name[0])
s
Length of a string
• The length can be found using the len() function
>>> fruit = "apple"
>>> len(fruit)
5
>>> len("apple")
5
>>> fruit[5]
IndexError: string index out of range
>>> fruit[len(fruit)-1]
'e'
Traversal using while
>>> index = 0
>>> fruit = "apple"
>>> while index < len(fruit):
...     print(fruit[index])
...     index += 1
...
a
p
p
l
e
Traversal using for
>>> fruit = "apple"
>>> for c in fruit:
...     print(c)
...
a
p
p
l
e
String Slices
>>> fruit = "apple"
>>> fruit[1:3]
'pp'
>>> fruit[1:]
'pple'
>>> fruit[:4]
'appl'
>>> fruit[:]
'apple'
String Comparison
>>> "apple" < "banana"
True
>>> "mango" < "banana"
False
>>> "mango" < "mango"
False
>>> "mango" < "Mango"
False
>>> "Mango" < "mango"
True
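The last two results follow from how Python compares strings: character by character, by Unicode code point, so every uppercase ASCII letter sorts before every lowercase one. A minimal sketch of that rule (the variable names are illustrative and not from the slides):

```python
# Strings compare character by character using Unicode code points.
upper_m = ord("M")   # code point of uppercase M
lower_m = ord("m")   # code point of lowercase m

# Uppercase ASCII letters have smaller code points than lowercase ones,
# which is why "Mango" < "mango" is True.
mango_cmp = "Mango" < "mango"

# One common way to compare without regard to case is to normalise first.
same_ignoring_case = "Mango".lower() == "mango".lower()

print(upper_m, lower_m, mango_cmp, same_ignoring_case)
```

Normalising with lower() is only one option; it is shown here because it uses a string method of the kind covered later in the deck.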
Strings are immutable
>>> fruit = "Mango"
>>> dance = fruit
>>> dance
'Mango'
>>> dance[0]
'M'
>>> dance[0] = 'T'
TypeError: 'str' object does not support item assignment
>>> dance = 'T' + fruit[1:]
>>> dance
'Tango'
String methods
>>> fruit = 'mango'
>>> fruit.find("go")
3
>>> fruit.upper()
'MANGO'
String methods
• dir(fruit) lists the methods associated with the
object fruit i.e. string methods
>>> type(fruit)
<class 'str'>
• help([Link]) gives a description of
the method
• >>>help([Link])
Regular Expressions
In computing, a regular expression, also referred to as
"regex" or "regexp", provides a concise and flexible
means for matching strings of text, such as particular
characters, words, or patterns of characters. A regular
expression is written in a formal language that can be
interpreted by a regular expression processor.
[Link]
Regular Expressions
Really clever "wild card" expressions for matching and
parsing strings.
Understanding Regular Expressions
• Very powerful and quite cryptic
• Fun once you understand them
• Regular expressions are a language unto
themselves
• A language of "marker characters" -
programming with characters
• It is kind of an "old school" language -
compact
Regular Expression Quick Guide
^          Matches the beginning of a line
$          Matches the end of the line
.          Matches any character
\s         Matches whitespace
\S         Matches any non-whitespace character
*          Repeats a character zero or more times
*?         Repeats a character zero or more times (non-greedy)
+          Repeats a character one or more times
+?         Repeats a character one or more times (non-greedy)
[aeiou]    Matches a single character in the listed set
[^XYZ]     Matches a single character not in the listed set
[a-z0-9]   The set of characters can include a range
(          Indicates where string extraction is to start
)          Indicates where string extraction is to end
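A few of the entries in the guide can be tried directly with Python's re module. This is a small sketch, not part of the original slides; the sample text and variable names are made up for illustration.

```python
import re

text = "From: alice 42 apples"  # hypothetical sample line

starts_with_from = bool(re.search(r"^From:", text))  # ^ anchors at the start
ends_with_s = bool(re.search(r"s$", text))           # $ anchors at the end
words = re.findall(r"\S+", text)                     # runs of non-whitespace
vowels = re.findall(r"[aeiou]", "regex")             # one character from a set
digits = re.findall(r"[0-9]+", text)                 # a range, repeated with +

print(starts_with_from, ends_with_s)
print(words, vowels, digits)
```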
The Regular Expression Module
• Before you can use regular expressions in your
program, you must import the library using
"import re"
• You can use re.search() to see if a string matches
a regular expression, similar to using the find()
method for strings
• You can use re.findall() to extract portions of a
string that match your regular expression, similar
to a combination of find() and slicing:
var[5:10]
Using re.search() like find()

With find():

hand = open('[Link]')
for line in hand:
    line = line.rstrip()
    if line.find('From:') >= 0:
        print(line)

With re.search():

import re
hand = open('[Link]')
for line in hand:
    line = line.rstrip()
    if re.search('From:', line):
        print(line)
Using re.search() like startswith()

With startswith():

hand = open('[Link]')
for line in hand:
    line = line.rstrip()
    if line.startswith('From:'):
        print(line)

With re.search():

import re
hand = open('[Link]')
for line in hand:
    line = line.rstrip()
    if re.search('^From:', line):
        print(line)
We fine-tune what is matched by adding special characters to the string
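The two comparisons can be checked side by side without a mail file. The sketch below uses a short, made-up list of lines in place of the file; the list contents and variable names are assumptions for illustration only.

```python
import re

# Hypothetical lines standing in for the contents of a mail file.
lines = [
    "From: alice@example.org",
    "Subject: hello",
    "Reply sent From: the office",
]

# str.find() and an unanchored re.search() both look anywhere in the line.
with_find = [ln for ln in lines if ln.find("From:") >= 0]
with_search = [ln for ln in lines if re.search("From:", ln)]

# str.startswith() and a pattern anchored with ^ only look at the start.
with_startswith = [ln for ln in lines if ln.startswith("From:")]
with_anchor = [ln for ln in lines if re.search("^From:", ln)]

print(with_find == with_search, with_startswith == with_anchor)
```

The third line is picked up by find() and the plain pattern but rejected once the ^ anchor is added.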
Wild-Card Characters
• The dot character matches any character
• If you add the asterisk character, the preceding character
is matched "any number of times" (zero or more)

Pattern: ^X.*:
  ^    Match the start of the line
  .    Match any character
  *    Many times

Lines that match:
X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-DSPAM-Confidence: 0.8475
X-Content-Type-Message-Body: text/plain
Fine-Tuning Your Match
• Depending on how "clean" your data is and
the purpose of your application, you may want
to narrow your match down a bit

Pattern: ^X.*:
  ^    Match the start of the line
  .    Match any character
  *    Many times

X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-Plane is behind schedule: two weeks

All three lines match ^X.*: including the last one, which is not a mail header.
Fine-Tuning Your Match
• A narrower pattern requires a dash after the X and allows
no whitespace before the colon

Pattern: ^X-\S+:
  ^     Match the start of the line
  \S    Match any non-whitespace character
  +     One or more times

X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-Plane is behind schedule: two weeks

Only the first two lines match ^X-\S+: because the third has a space before its colon.
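The difference between the loose and the narrowed pattern can be confirmed on the three sample lines from the slides. A minimal sketch; the list and variable names are illustrative.

```python
import re

headers = [
    "X-Sieve: CMU Sieve 2.3",
    "X-DSPAM-Result: Innocent",
    "X-Plane is behind schedule: two weeks",
]

# ^X.*: accepts anything between the X and some later colon.
loose = [h for h in headers if re.search(r"^X.*:", h)]

# ^X-\S+: requires a dash, then only non-whitespace up to the colon.
tight = [h for h in headers if re.search(r"^X-\S+:", h)]

print(len(loose), len(tight))
```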
Matching and Extracting Data
• re.search() tells us whether the string matches the regular
expression: it returns a match object (treated as True) or
None (treated as False)
• If we actually want the matching strings to be
extracted, we use re.findall()

Pattern: [0-9]+    (one or more digits)

>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = re.findall('[0-9]+', x)
>>> print(y)
['2', '19', '42']
Matching and Extracting Data
• When we use re.findall() it returns a list of zero or
more sub-strings that match the regular expression
>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = re.findall('[0-9]+', x)
>>> print(y)
['2', '19', '42']
>>> y = re.findall('[AEIOU]+', x)
>>> print(y)
[]
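Because findall() returns strings, a typical next step is to convert them before doing arithmetic. The sketch below reuses the sentence from the slide; the conversion and the sum are additions for illustration.

```python
import re

x = "My 2 favorite numbers are 19 and 42"

# findall() returns a list of strings, one per non-overlapping match.
numbers = [int(s) for s in re.findall(r"[0-9]+", x)]
total = sum(numbers)

# No uppercase vowels occur in x, so the result is an empty list.
no_match = re.findall(r"[AEIOU]+", x)

print(numbers, total, no_match)
```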
Warning: Greedy Matching
• The repeat characters (* and +) push outward in both
directions (greedy) to match the largest possible string

Pattern: ^F.+:
  ^F    First character in the match is an F
  .+    One or more characters
  :     Last character in the match is a :

>>> import re
>>> x = 'From: Using the : character'
>>> y = re.findall('^F.+:', x)
>>> print(y)
['From: Using the :']

Why not 'From:'? Because .+ is greedy, the match runs to the last colon on the line.
Non-Greedy Matching
• Not all regular expression repeat codes are greedy!
If you add a ? character, the + and * chill out a bit...

Pattern: ^F.+?:
  ^F     First character in the match is an F
  .+?    One or more characters, but not greedily
  :      Last character in the match is a :

>>> import re
>>> x = 'From: Using the : character'
>>> y = re.findall('^F.+?:', x)
>>> print(y)
['From:']
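Running both forms on the same string makes the contrast concrete. A short sketch using the string from the slides; the variable names are illustrative.

```python
import re

x = "From: Using the : character"

# Greedy: .+ takes as much as it can and still leaves a colon to match.
greedy = re.findall(r"^F.+:", x)

# Non-greedy: .+? takes as little as it can before the first colon.
lazy = re.findall(r"^F.+?:", x)

print(greedy, lazy)
```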
Fine-Tuning String Extraction
• You can refine the match for re.findall() and separately determine
which portion of the match is to be extracted by using parentheses

Pattern: \S+@\S+
  \S+    At least one non-whitespace character

>>> x = 'From [Link]@[Link] Sat Jan 5 09:14:16 2008'
>>> y = re.findall(r'\S+@\S+', x)
>>> print(y)
['[Link]@[Link]']
>>> y = re.findall(r'^From (\S+@\S+)', x)
>>> print(y)
['[Link]@[Link]']
Fine-Tuning String Extraction
• Parentheses are not part of the match, but they tell
where to start and stop what string to extract

Pattern: ^From (\S+@\S+)

>>> x = 'From [Link]@[Link] Sat Jan 5 09:14:16 2008'
>>> y = re.findall(r'\S+@\S+', x)
>>> print(y)
['[Link]@[Link]']
>>> y = re.findall(r'^From (\S+@\S+)', x)
>>> print(y)
['[Link]@[Link]']
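The effect of the parentheses shows up when the same pattern is run with and without them. This sketch uses a made-up address (alice.smith@mail.example.org) rather than the one on the slide.

```python
import re

# Hypothetical line; the address is invented for this example.
x = "From alice.smith@mail.example.org Sat Jan  5 09:14:16 2008"

# Without parentheses, findall() returns the whole match.
whole = re.findall(r"^From \S+@\S+", x)

# With parentheses, findall() returns only the captured part.
captured = re.findall(r"^From (\S+@\S+)", x)

print(whole, captured)
```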
Extracting a host name using find and string slicing

>>> data = 'From [Link]@[Link] Sat Jan 5 09:14:16 2008'
>>> atpos = data.find('@')
>>> print(atpos)
21
>>> sppos = data.find(' ', atpos)
>>> print(sppos)
31
>>> host = data[atpos+1 : sppos]
>>> print(host)
[Link]
The Double Split Version
• Sometimes we split a line one way and then grab one of
the pieces of the line and split that piece again

From [Link]@[Link] Sat Jan 5 09:14:16 2008

words = data.split()
email = words[1]             # [Link]@[Link]
pieces = email.split('@')    # ['[Link]', '[Link]']
print(pieces[1])             # [Link]
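Both string-only approaches, find() with slicing and the double split, should give the same host name. The sketch below checks that on a made-up address; the address and the variable names host_by_slice and host_by_split are assumptions for illustration.

```python
# Hypothetical line; the address is invented for this example.
data = "From alice.smith@mail.example.org Sat Jan  5 09:14:16 2008"

# Approach 1: locate the at-sign, then the next space, and slice between them.
atpos = data.find("@")
sppos = data.find(" ", atpos)
host_by_slice = data[atpos + 1 : sppos]

# Approach 2: split on whitespace, take the address, split it on the at-sign.
host_by_split = data.split()[1].split("@")[1]

print(host_by_slice, host_by_split)
```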
The Regex Version

>>> import re
>>> lin = 'From [Link]@[Link] Sat Jan 5 09:14:16 2008'
>>> y = re.findall('@([^ ]*)', lin)
>>> print(y)
['[Link]']

Pattern: @([^ ]*)
  @       Look through the string until you find an at-sign
  [^ ]    Match a non-blank character
  *       Match many of them
  ( )     Extract the non-blank characters
Even Cooler Regex Version

>>> import re
>>> lin = 'From [Link]@[Link] Sat Jan 5 09:14:16 2008'
>>> y = re.findall('^From .*@([^ ]*)', lin)
>>> print(y)
['[Link]']

Pattern: ^From .*@([^ ]*)
  ^From    Starting at the beginning of the line, look for the string 'From '
  .*@      Skip a bunch of characters, looking for an at-sign
  (        Start 'extracting'
  [^ ]     Match a non-blank character
  *        Match many of them
  )        Stop 'extracting'
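When only the first match is needed, re.search() with a capture group is an alternative to findall(). The helper below is a sketch: the function name extract_host and the sample address are hypothetical, and it returns None when the line does not match.

```python
import re

def extract_host(line):
    """Return the host after the at-sign on a 'From ' line, or None."""
    match = re.search(r"^From .*@([^ ]*)", line)
    # group(1) is the text captured by the first pair of parentheses.
    return match.group(1) if match else None

# Hypothetical sample lines, invented for this example.
host = extract_host("From alice.smith@mail.example.org Sat Jan  5 09:14:16 2008")
missing = extract_host("Subject: no address here")

print(host, missing)
```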
Spam Confidence

import re
hand = open('[Link]')
numlist = list()
for line in hand:
    line = line.rstrip()
    # Capture the number that follows the header name
    stuff = re.findall('^X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) != 1:
        continue
    num = float(stuff[0])
    numlist.append(num)
print('Maximum:', max(numlist))

python [Link]
Maximum: 0.9907
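The same loop can be tried without the mail file by feeding it a few lines held in a list. The sample lines below are made up for illustration (two of the values echo numbers that appear on the slides); only the pattern comes from the program above.

```python
import re

# Hypothetical stand-in for the lines of the mail file.
sample_lines = [
    "X-DSPAM-Confidence: 0.8475\n",
    "X-DSPAM-Probability: 0.0000\n",
    "X-DSPAM-Confidence: 0.9907\n",
    "Subject: unrelated\n",
]

numlist = []
for line in sample_lines:
    line = line.rstrip()
    stuff = re.findall(r"^X-DSPAM-Confidence: ([0-9.]+)", line)
    if len(stuff) != 1:
        continue  # skip lines that are not confidence headers
    numlist.append(float(stuff[0]))

maximum = max(numlist)
print("Maximum:", maximum)
```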
Escape Character
• If you want a special regular expression character to just
behave normally (most of the time) you prefix it with '\'

Pattern: \$[0-9.]+
  \$        A real dollar sign
  [0-9.]    A digit or period
  +         At least one or more

>>> import re
>>> x = 'We just received $10.00 for cookies.'
>>> y = re.findall(r'\$[0-9.]+', x)
>>> print(y)
['$10.00']
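Leaving out the backslash changes the meaning of the pattern, since an unescaped $ is the end-of-line anchor. The sketch below compares the two and also shows re.escape(), a standard-library helper that adds the backslashes for a literal string; the variable names are illustrative.

```python
import re

x = "We just received $10.00 for cookies."

# Escaped: \$ matches a literal dollar sign.
escaped = re.findall(r"\$[0-9.]+", x)

# Unescaped: $ means end of line, so nothing can follow it and match.
unescaped = re.findall(r"$[0-9.]+", x)

# re.escape() builds a pattern that matches the given text literally.
literal = re.findall(re.escape("$10.00"), x)

print(escaped, unescaped, literal)
```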
Summary
• Regular expressions are a cryptic but
powerful language for matching strings and
extracting elements from those strings
• Regular expressions have special characters
that indicate intent