
Python String Manipulation Guide

String manipulation

Uploaded by

Rishav Dhama
Copyright
© All Rights Reserved

String Manipulation

(in Python)

11/08/2022 1
Strings

Compound data type
• Strings are made up of smaller units (characters), and we may access the whole string or its parts
• A character in Python is a string of length 1
• Both single and double quotes can be used, e.g. "fruits" or 'fruits'
>>> name = "sachin"
>>> print(name)
sachin
>>> print(name[0])
s

Length of a string
• The length of a string can be found using the len() function
>>> fruit = "apple"
>>> len(fruit)
5
>>> len("apple")
5
>>> fruit[5]
IndexError: string index out of range
>>> fruit[len(fruit)-1]
'e'
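A small sketch of the indexing rule above, with illustrative variable names that are not from the slides: valid indexes run from 0 to len(fruit) - 1, so one position past the end raises an IndexError, and -1 is a shorthand for the last position.

```python
fruit = "apple"

last_by_len = fruit[len(fruit) - 1]   # the last character sits at index len - 1
last_by_neg = fruit[-1]               # negative indexes count from the end

try:
    fruit[len(fruit)]                 # one past the end
    out_of_range = False
except IndexError:
    out_of_range = True

print(last_by_len, last_by_neg, out_of_range)
```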

Traversal using while
>>> index = 0
>>> fruit = "apple"
>>> while index < len(fruit):
...     print(fruit[index])
...     index += 1
...
a
p
p
l
e

Traversal using for

>>> fruit = "apple"
>>> for c in fruit:
...     print(c)
...
a
p
p
l
e

String Slices

>>> fruit = "apple"
>>> fruit[1:3]
'pp'
>>> fruit[1:]
'pple'
>>> fruit[:4]
'appl'
>>> fruit[:]
'apple'
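To make the slice rule explicit, here is a minimal sketch (the names middle, head and tail are illustrative): fruit[a:b] starts at index a and stops just before index b, so it holds b - a characters, and cutting a string at any position gives two pieces that join back into the original.

```python
fruit = "apple"

middle = fruit[1:3]        # characters at indexes 1 and 2
split_at = 2
head, tail = fruit[:split_at], fruit[split_at:]

print(middle)              # pp
print(len(middle))         # 2, i.e. 3 - 1
print(head + tail == fruit)
```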

String Comparison
>>> "apple" < "banana"
True
>>> "mango" < "banana"
False
>>> "mango" < "mango"
False
>>> "mango" < "Mango"
False
>>> "Mango" < "mango"
True
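The last two results follow from character codes: strings are compared character by character, and every uppercase letter has a smaller code than every lowercase letter. The sketch below shows this with ord(); less_ignoring_case is a hypothetical helper, not something from the slides, for when case should not matter.

```python
print(ord("M"), ord("m"))            # 77 109
print("Mango" < "mango")             # True, because 77 < 109

def less_ignoring_case(a, b):
    """Hypothetical helper: compare two strings without regard to case."""
    return a.lower() < b.lower()

print(less_ignoring_case("mango", "Mango"))   # False: equal once lowercased
print(less_ignoring_case("Banana", "mango"))  # True
```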

Strings are immutable
>>> fruit = "Mango"
>>> dance = fruit
>>> dance
'Mango'
>>> dance[0]
'M'
>>> dance[0] = 'T'
TypeError: 'str' object does not support item assignment
>>> dance = 'T' + fruit[1:]
>>> dance
'Tango'
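The same idea can be wrapped in a function. This is only a sketch, and replace_char is a hypothetical helper name: because a string cannot be changed in place, the function builds a new string from slices and leaves the original untouched.

```python
def replace_char(text, index, new_char):
    """Hypothetical helper: return a copy of text with one character swapped."""
    return text[:index] + new_char + text[index + 1:]

fruit = "Mango"
dance = replace_char(fruit, 0, "T")

try:
    fruit[0] = "T"                    # in-place assignment is not allowed
except TypeError as err:
    message = str(err)

print(dance, fruit)
print(message)
```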

String methods

>>> fruit = 'mango'
>>> fruit.find("go")
3
>>> fruit.upper()
'MANGO'

String methods

• dir(fruit) lists the methods associated with the object fruit, i.e. the string methods
>>> type(fruit)
<class 'str'>
• help([Link]) gives a description of the method
>>> help([Link])
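As a rough illustration of dir() and help(), the sketch below filters the dir() output down to the public method names and reads a docstring directly. The choice of find as the method to look up is an assumption made for the example.

```python
fruit = "mango"

# Public method names only (skip the double-underscore ones)
methods = [name for name in dir(fruit) if not name.startswith("_")]
print("find" in methods, "upper" in methods)

# help(str.find) shows this text interactively;
# the same description is stored in the method's docstring.
print(str.find.__doc__)
```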

Regular Expressions

In computing, a regular expression, also referred to as "regex" or "regexp", provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor.

[Link]
Regular Expressions

Really clever "wild card" expressions for matching and parsing strings.
Understanding Regular Expressions
• Very powerful and quite cryptic
• Fun once you understand them
• Regular expressions are a language unto themselves
• A language of "marker characters": programming with characters
• It is kind of an "old school" language: compact
Regular Expression Quick Guide

^         Matches the beginning of a line
$         Matches the end of the line
.         Matches any character
\s        Matches whitespace
\S        Matches any non-whitespace character
*         Repeats a character zero or more times
*?        Repeats a character zero or more times (non-greedy)
+         Repeats a character one or more times
+?        Repeats a character one or more times (non-greedy)
[aeiou]   Matches a single character in the listed set
[^XYZ]    Matches a single character not in the listed set
[a-z0-9]  The set of characters can include a range
(         Indicates where string extraction is to start
)         Indicates where string extraction is to end
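A few entries from the guide can be tried directly. The sketch below is illustrative only; the sample strings and variable names are made up for the example.

```python
import re

line = "X-DSPAM-Confidence: 0.8475"

starts_with_x = bool(re.search(r"^X", line))   # ^ anchors at the start
ends_with_5 = bool(re.search(r"5$", line))     # $ anchors at the end
words = re.findall(r"\S+", line)               # runs of non-whitespace
vowels = re.findall(r"[aeiou]", "regex")       # one character from a set
not_xyz = re.findall(r"[^XYZ]", "XaYbZ")       # one character not in the set
ranged = re.findall(r"[a-z0-9]+", "ab12-CD")   # ranges inside a set

print(starts_with_x, ends_with_5)
print(words, vowels, not_xyz, ranged)
```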
The Regular Expression Module
• Before you can use regular expressions in your program, you must import the library using "import re"
• You can use re.search() to see if a string matches a regular expression, similar to using the find() method for strings
• You can use re.findall() to extract portions of a string that match your regular expression, similar to a combination of find() and slicing: var[5:10]
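The two comparisons above can be sketched side by side. The sample string in var is invented for illustration.

```python
import re

var = "From: 2 apples and 19 pears"

# find() gives a position (or -1); re.search() gives a match object (or None)
position = var.find("apples")
found = re.search("apples", var) is not None
missing = re.search("plums", var)

# findall() returns the matching pieces themselves, like find() plus slicing
sliced = var[8:14]
numbers = re.findall("[0-9]+", var)

print(position, found, missing)   # 8 True None
print(sliced, numbers)            # apples ['2', '19']
```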
Using re.search() like find()

With find():

hand = open('[Link]')
for line in hand:
    line = line.rstrip()
    if line.find('From:') >= 0:
        print(line)

With re.search():

import re
hand = open('[Link]')
for line in hand:
    line = line.rstrip()
    if re.search('From:', line):
        print(line)
Using re.search() like startswith()

With startswith():

hand = open('[Link]')
for line in hand:
    line = line.rstrip()
    if line.startswith('From:'):
        print(line)

With re.search():

import re
hand = open('[Link]')
for line in hand:
    line = line.rstrip()
    if re.search('^From:', line):
        print(line)

We fine-tune what is matched by adding special characters to the string
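To check that each pair of loops selects the same lines without needing a file, here is a small sketch that runs on an in-memory list. The sample lines and addresses are hypothetical stand-ins for the mail file.

```python
import re

# Hypothetical sample lines standing in for the mail file
lines = [
    "From: alice@example.org",
    "Reply-To: From: bob@example.org",
    "Subject: hello",
]

with_find = [l for l in lines if l.find("From:") >= 0]
with_search = [l for l in lines if re.search("From:", l)]

with_startswith = [l for l in lines if l.startswith("From:")]
with_anchor = [l for l in lines if re.search("^From:", l)]

print(with_find == with_search)            # True
print(with_startswith == with_anchor)      # True
print(len(with_search), len(with_anchor))  # 2 1
```

The ^ is what turns the "anywhere in the line" test into a "start of the line" test.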


Wild-Card Characters

• The dot character matches any character
• If you add the asterisk character, the preceding character is matched "any number of times"

Pattern: ^X.*:
  ^   Match the start of the line
  .   Match any character
  *   Many times

Lines that match:
X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-DSPAM-Confidence: 0.8475
X-Content-Type-Message-Body: text/plain
Fine-Tuning Your Match

• Depending on how "clean" your data is and the purpose of your application, you may want to narrow your match down a bit

Pattern: ^X.*:
  ^   Match the start of the line
  .   Match any character
  *   Many times

All three lines match, including the last one, which is not a header:
X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-Plane is behind schedule: two weeks
Fine-Tuning Your Match

• Depending on how "clean" your data is and the purpose of your application, you may want to narrow your match down a bit

Pattern: ^X-\S+:
  ^    Match the start of the line
  \S   Match any non-whitespace character
  +    One or more times

Only the first two lines match:
X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-Plane is behind schedule: two weeks
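The difference between the loose and the narrowed pattern can be checked on the three sample lines. This is a sketch; the list names loose and strict are illustrative.

```python
import re

lines = [
    "X-Sieve: CMU Sieve 2.3",
    "X-DSPAM-Result: Innocent",
    "X-Plane is behind schedule: two weeks",
]

# Dot-star accepts anything, spaces included, before the colon
loose = [l for l in lines if re.search(r"^X.*:", l)]

# \S+ refuses to cross a space, so "X-Plane is behind..." drops out
strict = [l for l in lines if re.search(r"^X-\S+:", l)]

print(len(loose))    # 3
print(strict)
```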
Matching and Extracting Data

• re.search() returns a match object (treated as True) or None (treated as False), depending on whether the string matches the regular expression
• If we actually want the matching strings to be extracted, we use re.findall()

Pattern: [0-9]+   One or more digits

>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = re.findall('[0-9]+', x)
>>> print(y)
['2', '19', '42']
Matching and Extracting Data
• When we use re.findall() it returns a list of zero or more substrings that match the regular expression

>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = re.findall('[0-9]+', x)
>>> print(y)
['2', '19', '42']
>>> y = re.findall('[AEIOU]+', x)
>>> print(y)
[]
Warning: Greedy Matching

• The repeat characters (* and +) push outward in both directions (greedy) to match the largest possible string

Pattern: ^F.+:
  F    First character in the match is an F
  .+   One or more characters
  :    Last character in the match is a :

>>> import re
>>> x = 'From: Using the : character'
>>> y = re.findall('^F.+:', x)
>>> print(y)
['From: Using the :']

Why not 'From:'?
Non-Greedy Matching

• Not all regular expression repeat codes are greedy! If you add a ? character, the + and * chill out a bit...

Pattern: ^F.+?:
  F     First character in the match is an F
  .+?   One or more characters, but not greedily
  :     Last character in the match is a :

>>> import re
>>> x = 'From: Using the : character'
>>> y = re.findall('^F.+?:', x)
>>> print(y)
['From:']
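Putting the greedy and non-greedy forms next to each other makes the difference easier to see. The second sample string, with the quoted words, is made up for this sketch.

```python
import re

x = 'From: Using the : character'
greedy = re.findall('^F.+:', x)     # runs on to the last colon
lazy = re.findall('^F.+?:', x)      # stops at the first colon

s = 'say "a" and "b"'
greedy_quotes = re.findall('".*"', s)   # one long match, first quote to last
lazy_quotes = re.findall('".*?"', s)    # each quoted piece separately

print(greedy, lazy)
print(greedy_quotes, lazy_quotes)
```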
Fine Tuning String Extraction
• You can refine the match for re.findall() and separately determine which portion of the match is to be extracted by using parentheses

From [Link]@[Link] Sat Jan 5 09:14:16 2008

Pattern: \S+@\S+
  \S+   At least one non-whitespace character

>>> x = 'From [Link]@[Link] Sat Jan 5 09:14:16 2008'
>>> y = re.findall(r'\S+@\S+', x)
>>> print(y)
['[Link]@[Link]']
>>> y = re.findall(r'^From (\S+@\S+)', x)
>>> print(y)
['[Link]@[Link]']
Fine Tuning String Extraction

• Parentheses are not part of the match, but they mark where the extracted string starts and stops

From [Link]@[Link] Sat Jan 5 09:14:16 2008

Pattern: ^From (\S+@\S+)

>>> y = re.findall(r'\S+@\S+', x)
>>> print(y)
['[Link]@[Link]']
>>> y = re.findall(r'^From (\S+@\S+)', x)
>>> print(y)
['[Link]@[Link]']
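The effect of the parentheses is easier to see when the same pattern is run with none, one, and two groups. The address below is hypothetical, chosen only so the sketch is runnable.

```python
import re

# Hypothetical address, used only for illustration
x = "From alice@example.org Sat Jan  5 09:14:16 2008"

whole = re.findall(r"^From \S+@\S+", x)            # no parentheses: the whole match
captured = re.findall(r"^From (\S+@\S+)", x)       # one group: only the group
two_groups = re.findall(r"^From (\S+)@(\S+)", x)   # two groups: a tuple per match

print(whole)
print(captured)
print(two_groups)
```

With more than one group, findall() returns a list of tuples rather than a list of strings.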
Extracting a host name using find and string slicing

From [Link]@[Link] Sat Jan 5 09:14:16 2008

>>> data = 'From [Link]@[Link] Sat Jan 5 09:14:16 2008'
>>> atpos = data.find('@')
>>> print(atpos)
21
>>> sppos = data.find(' ', atpos)
>>> print(sppos)
31
>>> host = data[atpos+1 : sppos]
>>> print(host)
[Link]
The Double Split Version

• Sometimes we split a line one way and then grab one of the pieces of the line and split that piece again

From [Link]@[Link] Sat Jan 5 09:14:16 2008

line = 'From [Link]@[Link] Sat Jan 5 09:14:16 2008'
words = line.split()
email = words[1]             # [Link]@[Link]
pieces = email.split('@')    # ['[Link]', '[Link]']
print(pieces[1])             # [Link]
The Regex Version

From [Link]@[Link] Sat Jan 5 09:14:16 2008

import re
lin = 'From [Link]@[Link] Sat Jan 5 09:14:16 2008'
y = re.findall('@([^ ]*)', lin)
print(y)    # ['[Link]']

Pattern: @([^ ]*)
  @      Look through the string until you find an at-sign
  [^ ]   Match a non-blank character
  *      Match many of them
  ( )    Extract the non-blank characters


Even Cooler Regex Version

From [Link]@[Link] Sat Jan 5 09:14:16 2008

import re
lin = 'From [Link]@[Link] Sat Jan 5 09:14:16 2008'
y = re.findall('^From .*@([^ ]*)', lin)
print(y)    # ['[Link]']

Pattern: ^From .*@([^ ]*)
  ^From   Starting at the beginning of the line, look for the string 'From '
  .*@     Skip a bunch of characters, looking for an at-sign
  (       Start 'extracting'
  [^ ]    Match a non-blank character
  *       Match many of them
  )       Stop 'extracting'
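The three approaches from the last few slides (find plus slicing, the double split, and the regular expression) should agree on the host name. The sketch below runs all three on one line; the address is made up for illustration, so the positions differ from the 21 and 31 shown earlier.

```python
import re

# Hypothetical sample line (the address is invented for this example)
line = "From alice@example.org Sat Jan  5 09:14:16 2008"

# 1. find() and slicing
atpos = line.find("@")
sppos = line.find(" ", atpos)
host_slice = line[atpos + 1:sppos]

# 2. double split
host_split = line.split()[1].split("@")[1]

# 3. regular expression
host_regex = re.findall(r"^From .*@([^ ]*)", line)[0]

print(host_slice, host_split, host_regex)
```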
Spam Confidence

import re
hand = open('[Link]')
numlist = list()
for line in hand:
    line = line.rstrip()
    # Capture only the number that follows the header name
    stuff = re.findall('^X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) != 1:
        continue
    num = float(stuff[0])
    numlist.append(num)
print('Maximum:', max(numlist))

python [Link]
Maximum: 0.9907
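The same loop can be tried without the mail file. In this sketch the list of lines is hypothetical sample data, so the maximum it prints is not the 0.9907 from the real file.

```python
import re

# Hypothetical lines standing in for the mail file
lines = [
    "X-DSPAM-Confidence: 0.8475",
    "X-DSPAM-Probability: 0.0000",
    "Subject: X-DSPAM-Confidence: 0.9999 (quoted, not at line start)",
    "X-DSPAM-Confidence: 0.6178",
]

numlist = []
for line in lines:
    # ^ keeps the quoted header in the Subject line from matching
    stuff = re.findall(r"^X-DSPAM-Confidence: ([0-9.]+)", line)
    if len(stuff) != 1:
        continue
    numlist.append(float(stuff[0]))

print(numlist)
print("Maximum:", max(numlist))
```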
Escape Character

• If you want a special regular expression character to just behave normally (most of the time) you prefix it with '\'

Pattern: \$[0-9.]+
  \$       A real dollar sign
  [0-9.]   A digit or period
  +        At least one or more

>>> import re
>>> x = 'We just received $10.00 for cookies.'
>>> y = re.findall(r'\$[0-9.]+', x)
>>> print(y)
['$10.00']
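To see what the backslash changes, the sketch below runs the pattern with and without it. The last part uses re.escape() from the standard library, which is not mentioned in the slides and is included here as one possible way to escape text that is not known in advance.

```python
import re

x = 'We just received $10.00 for cookies.'

unescaped = re.findall(r'$[0-9.]+', x)    # here $ still means "end of line"
escaped = re.findall(r'\$[0-9.]+', x)     # \$ is a literal dollar sign

# re.escape() adds the backslashes for you
price = re.escape('$10.00')
found = re.findall(price, x)

print(unescaped)   # []
print(escaped)     # ['$10.00']
print(found)       # ['$10.00']
```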
Summary
• Regular expressions are a cryptic but
powerful language for matching strings and
extracting elements from those strings
• Regular expressions have special characters
that indicate intent
