Python String Manipulation Guide

String manipulation

Uploaded by

Rishav Dhama

String Manipulation

(in Python)

11/08/2022 1
Strings

Compound data type
• Strings are made up of smaller units (characters); we can access the whole string or any of its parts
• A character in Python is a string of length 1
• Both single and double quotes can be used, e.g. "fruits" or 'fruits'
>>> name = "sachin"
>>> print(name)
sachin
>>> print(name[0])
s

Length of a string
• The length can be found using the len() function
>>> fruit = "apple"
>>> len(fruit)
5
>>> len("apple")
5
>>> fruit[5]
IndexError: string index out of range
>>> fruit[len(fruit)-1]
'e'
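A minimal sketch of the indexing rule on this slide: valid indexes run from 0 to len(fruit) - 1, so the last character sits at len(fruit) - 1 and index len(fruit) is out of range. The helper name last_char is made up for illustration and is not from the slides.

```python
def last_char(text):
    """Return the last character of text, or '' if text is empty."""
    if len(text) == 0:
        return ""
    return text[len(text) - 1]

fruit = "apple"
print(len(fruit))         # 5
print(last_char(fruit))   # e

try:
    fruit[len(fruit)]     # one past the last valid index
except IndexError as err:
    print("IndexError:", err)
```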

Traversal using while
>>> index = 0
>>> fruit = "apple"
>>> while index < len(fruit):
...     print(fruit[index])
...     index += 1
...
a
p
p
l
e

Traversal using for

>>> fruit = "apple"
>>> for c in fruit:
...     print(c)
...
a
p
p
l
e
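The two traversal styles can be put side by side as a rough sketch. The function names chars_while and chars_for are illustrative only; both collect the characters into a list so the results can be compared.

```python
def chars_while(text):
    """Collect characters using an explicit index and a while loop."""
    result = []
    index = 0
    while index < len(text):
        result.append(text[index])
        index += 1
    return result

def chars_for(text):
    """Collect characters by iterating over the string directly."""
    result = []
    for c in text:
        result.append(c)
    return result

print(chars_while("apple"))                        # ['a', 'p', 'p', 'l', 'e']
print(chars_while("apple") == chars_for("apple"))  # True
```

The for version needs no index variable, which removes the chance of an off-by-one error in the loop condition.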

String Slices

>>> fruit = "apple"
>>> fruit[1:3]
'pp'
>>> fruit[1:]
'pple'
>>> fruit[:4]
'appl'
>>> fruit[:]
'apple'
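A short sketch of how the start and end indexes of a slice behave: the start index is included, the end index is excluded, and a missing index means "from the beginning" or "to the end". The variable names are illustrative.

```python
fruit = "apple"

first_two = fruit[:2]   # indexes 0 and 1
rest = fruit[2:]        # index 2 through the end
middle = fruit[1:4]     # indexes 1, 2 and 3; index 4 is excluded

print(first_two, rest, middle)     # ap ple ppl
print(first_two + rest == fruit)   # True
```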

String Comparison
>>> "apple" < "banana"
True
>>> "mango" < "banana"
False
>>> "mango" < "mango"
False
>>> "mango" < "Mango"
False
>>> "Mango" < "mango"
True
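The last two results follow from character codes: strings are compared character by character, and uppercase letters have smaller code points than lowercase ones. The sketch below shows this with ord(); comparing lowercased copies is one possible way to ignore case and is an addition here, not something the slides prescribe.

```python
print(ord("M"), ord("m"))    # 77 109
print("Mango" < "mango")     # True, because 77 < 109

def same_ignoring_case(a, b):
    """Compare two strings after converting both to lowercase."""
    return a.lower() == b.lower()

print(same_ignoring_case("Mango", "mango"))   # True
```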

Strings are immutable
>>> fruit = "Mango"
>>> dance = fruit
>>> dance
'Mango'
>>> dance[0]
'M'
>>> dance[0] = 'T'
TypeError: 'str' object does not support item assignment
>>> dance = 'T' + fruit[1:]
>>> dance
'Tango'
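A minimal sketch of working around immutability: instead of changing a character in place, build a new string from a new character plus a slice of the old one. The helper name replace_first is hypothetical.

```python
def replace_first(text, new_char):
    """Return a new string with the first character replaced."""
    return new_char + text[1:]

fruit = "Mango"
dance = replace_first(fruit, "T")
print(fruit, dance)   # Mango Tango

try:
    fruit[0] = "T"    # in-place assignment is not allowed
except TypeError as err:
    print("TypeError:", err)
```

The original string is untouched; only the name dance refers to the new value.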

String methods

>>> fruit = 'mango'
>>> fruit.find("go")
3
>>> fruit.upper()
'MANGO'
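A short sketch of calling methods on a string object, using str.find() and str.upper() as examples. find() returns the index of the first match or -1 when there is none, and upper() returns a new string rather than changing the original.

```python
fruit = "mango"

print(fruit.find("go"))    # 3
print(fruit.find("xyz"))   # -1, the substring is not present
print(fruit.upper())       # MANGO
print(fruit)               # mango, the original is unchanged
```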

String methods

• dir(fruit) lists the methods associated with the object fruit, i.e. the string methods
>>> type(fruit)
<class 'str'>
• help(fruit.find) gives a description of the method
>>> help(fruit.find)
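As a rough sketch, the output of dir() can be filtered to show only the public method names. Skipping names that start with an underscore is a convention chosen for this example, not something the slides specify.

```python
fruit = "mango"

# Keep only names that do not start with an underscore
public_methods = [name for name in dir(fruit) if not name.startswith("_")]

print(type(fruit))                    # <class 'str'>
print("find" in public_methods)       # True
print("upper" in public_methods)      # True
```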

Regular Expressions

In computing, a regular expression, also referred to as "regex" or "regexp", provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor.

[Link]
Regular Expressions

Really clever "wild card" expressions for matching and parsing strings.
Understanding Regular Expressions
• Very powerful and quite cryptic
• Fun once you understand them
• Regular expressions are a language unto themselves
• A language of "marker characters": programming with characters
• It is kind of an "old school" language: compact
Regular Expression Quick Guide
^         Matches the beginning of a line
$         Matches the end of the line
.         Matches any character
\s        Matches whitespace
\S        Matches any non-whitespace character
*         Repeats a character zero or more times
*?        Repeats a character zero or more times (non-greedy)
+         Repeats a character one or more times
+?        Repeats a character one or more times (non-greedy)
[aeiou]   Matches a single character in the listed set
[^XYZ]    Matches a single character not in the listed set
[a-z0-9]  The set of characters can include a range
(         Indicates where string extraction is to start
)         Indicates where string extraction is to end
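A few entries from the quick guide, tried out as a sketch with Python's re module. The sample strings are chosen for illustration; raw strings (r"...") are used so that backslashes reach the regex engine unchanged.

```python
import re

text = "X-DSPAM-Confidence: 0.8475"

print(bool(re.search(r"^X", text)))       # True, the line starts with X
print(bool(re.search(r"5$", text)))       # True, the line ends with 5
print(re.findall(r"[0-9]+", text))        # ['0', '8475']
print(re.findall(r"\S+", text))           # ['X-DSPAM-Confidence:', '0.8475']
print(re.findall(r"[aeiou]", "banana"))   # ['a', 'a', 'a']
print(re.findall(r"[^XYZ]", "XaYbZ"))     # ['a', 'b']
```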
The Regular Expression Module
• Before you can use regular expressions in your program, you must import the library using "import re"
• You can use re.search() to see if a string matches a regular expression, similar to using the find() method for strings
• You can use re.findall() to extract portions of a string that match your regular expression, similar to a combination of find() and slicing: var[5:10]
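The two comparisons can be sketched as follows. The sample line and address are hypothetical. Note that find() returns an index (or -1), while re.search() returns a match object (or None), so each needs its own test.

```python
import re

line = "From: alice@example.com"   # hypothetical sample line

# Testing for a match: find() versus re.search()
print(line.find("From:") >= 0)                 # True
print(re.search("From:", line) is not None)    # True

# Extracting text: find() plus slicing versus re.findall()
start = line.find("alice")
print(line[start:start + 5])                   # alice
print(re.findall("alice", line))               # ['alice']
```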
Using re.search() like find()

Using find():

hand = open('[Link]')
for line in hand:
    line = line.rstrip()
    if line.find('From:') >= 0:
        print(line)

Using re.search():

import re
hand = open('[Link]')
for line in hand:
    line = line.rstrip()
    if re.search('From:', line):
        print(line)
Using re.search() like startswith()

Using startswith():

hand = open('[Link]')
for line in hand:
    line = line.rstrip()
    if line.startswith('From:'):
        print(line)

Using re.search():

import re
hand = open('[Link]')
for line in hand:
    line = line.rstrip()
    if re.search('^From:', line):
        print(line)
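A self-contained sketch of the difference the ^ anchor makes. It uses a small in-memory list of made-up lines in place of a file, so the sample data is an assumption and not taken from the slides.

```python
import re

# Hypothetical sample lines standing in for the contents of a mail file
lines = [
    "From: alice@example.com\n",
    "Sent From: the office\n",
    "Subject: hello\n",
]

# Without an anchor, re.search() matches anywhere in the line, like find()
anywhere = [line.rstrip() for line in lines if re.search("From:", line)]

# With ^, the match must begin at the start of the line, like startswith()
at_start = [line.rstrip() for line in lines if re.search("^From:", line)]
with_startswith = [line.rstrip() for line in lines if line.startswith("From:")]

print(anywhere)                      # ['From: alice@example.com', 'Sent From: the office']
print(at_start)                      # ['From: alice@example.com']
print(at_start == with_startswith)   # True
```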

We fine-tune what is matched by adding special characters to the string

Wild-Card Characters

• The dot character matches any character
• If you add the asterisk character, the character is matched "any number of times"

^X.*:
  ^   Match the start of the line
  .   Match any character
  *   Many times

X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-DSPAM-Confidence: 0.8475
X-Content-Type-Message-Body: text/plain
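A small sketch applying ^X.*: to the header lines from this slide. The "Received:" line is a made-up extra, added only to show a line that does not match.

```python
import re

headers = [
    "X-Sieve: CMU Sieve 2.3",
    "X-DSPAM-Result: Innocent",
    "X-DSPAM-Confidence: 0.8475",
    "X-Content-Type-Message-Body: text/plain",
    "Received: from mail",   # hypothetical non-matching line
]

# ^X  start of line followed by X;  .*  any characters;  :  a literal colon
matches = [h for h in headers if re.search(r"^X.*:", h)]

print(len(matches))   # 4
print("Received: from mail" in matches)   # False
```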
Fine-Tuning Your Match

• Depending on how "clean" your data is and the purpose of your application, you may want to narrow your match down a bit

^X.*:
  ^   Match the start of the line
  .   Match any character
  *   Many times

X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-Plane is behind schedule: two weeks
Fine-Tuning Your Match

• Depending on how "clean" your data is and the purpose of your application, you may want to narrow your match down a bit

^X-\S+:
  ^    Match the start of the line
  \S   Match any non-whitespace character
  +    One or more times

X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-Plane is behind schedule: two weeks
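The effect of narrowing the pattern can be checked with a short sketch on the three sample lines. The loose pattern accepts the "X-Plane" sentence; the tighter one rejects it, because \S+ cannot cross the space before the colon is reached.

```python
import re

lines = [
    "X-Sieve: CMU Sieve 2.3",
    "X-DSPAM-Result: Innocent",
    "X-Plane is behind schedule: two weeks",
]

loose = [line for line in lines if re.search(r"^X.*:", line)]
tight = [line for line in lines if re.search(r"^X-\S+:", line)]

print(len(loose))   # 3, every line matches
print(tight)        # ['X-Sieve: CMU Sieve 2.3', 'X-DSPAM-Result: Innocent']
```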
Matching and Extracting Data

• re.search() returns a match object (truthy) or None (falsy), depending on whether the string matches the regular expression
• If we actually want the matching strings to be extracted, we use re.findall()

>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = re.findall('[0-9]+', x)
>>> print(y)
['2', '19', '42']

[0-9]+   One or more digits
Matching and Extracting Data
• When we use re.findall(), it returns a list of zero or more substrings that match the regular expression

>>> import re
>>> x = 'My 2 favorite numbers are 19 and 42'
>>> y = re.findall('[0-9]+', x)
>>> print(y)
['2', '19', '42']
>>> y = re.findall('[AEIOU]+', x)
>>> print(y)
[]
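Since re.findall() returns a list of strings, the results usually need converting before they can be used as numbers. A minimal sketch, with the int() conversion and the sum added for illustration:

```python
import re

x = "My 2 favorite numbers are 19 and 42"

# findall() returns strings; convert each one to an integer
numbers = [int(n) for n in re.findall(r"[0-9]+", x)]

print(numbers)                       # [2, 19, 42]
print(sum(numbers))                  # 63
print(re.findall(r"[AEIOU]+", x))    # [], no uppercase vowels in x
```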
Warning: Greedy Matching

• The repeat characters (* and +) push outward in both directions (greedy) to match the largest possible string

>>> import re
>>> x = 'From: Using the : character'
>>> y = re.findall('^F.+:', x)
>>> print(y)
['From: Using the :']

^F.+:
  ^F   First character in the match is an F
  .+   One or more characters
  :    Last character in the match is a :

Why not 'From:'?
Non-Greedy Matching

• Not all regular expression repeat codes are greedy! If you add a ? character, the + and * chill out a bit...

>>> import re
>>> x = 'From: Using the : character'
>>> y = re.findall('^F.+?:', x)
>>> print(y)
['From:']

^F.+?:
  ^F    First character in the match is an F
  .+?   One or more characters, but not greedily
  :     Last character in the match is a :
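Greedy and non-greedy matching can be compared directly on the same string. In this sketch the only difference between the two patterns is the extra ? after the +.

```python
import re

x = "From: Using the : character"

greedy = re.findall(r"^F.+:", x)    # .+ grabs as much as possible
lazy = re.findall(r"^F.+?:", x)     # .+? stops at the first colon

print(greedy)   # ['From: Using the :']
print(lazy)     # ['From:']
```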
Fine Tuning String Extraction
• You can refine the match for re.findall() and separately determine which portion of the match is to be extracted by using parentheses
• Parentheses are not part of the match, but they tell where to start and stop what string to extract

>>> x = 'From [Link]@[Link] Sat Jan 5 09:14:16 2008'
>>> y = re.findall(r'\S+@\S+', x)
>>> print(y)
['[Link]@[Link]']
>>> y = re.findall(r'^From (\S+@\S+)', x)
>>> print(y)
['[Link]@[Link]']

^From (\S+@\S+)
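A sketch of what the parentheses change, using a made-up address (alice@example.com is hypothetical). Without parentheses, re.findall() returns the whole match; with them, it returns only the part inside the parentheses.

```python
import re

x = "From alice@example.com Sat Jan 5 09:14:16 2008"   # hypothetical line

whole = re.findall(r"^From \S+@\S+", x)      # no parentheses: whole match
inner = re.findall(r"^From (\S+@\S+)", x)    # parentheses: extracted part only

print(whole)   # ['From alice@example.com']
print(inner)   # ['alice@example.com']
```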

The Double Split Version

• Sometimes we split a line one way and then grab one of the pieces of the line and split that piece again

line = 'From [Link]@[Link] Sat Jan 5 09:14:16 2008'
words = line.split()
email = words[1]             # '[Link]@[Link]'
pieces = email.split('@')    # ['[Link]', '[Link]']
print(pieces[1])             # [Link]
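The double split can be wrapped in a small function as a sketch. The function name host_by_split and the sample address are made up for illustration; the code assumes the address is always the second word on the line.

```python
def host_by_split(line):
    """Return the part after '@' in the second word of the line."""
    words = line.split()        # split on whitespace
    email = words[1]            # the address is assumed to be the second word
    pieces = email.split("@")   # split the address at the at-sign
    return pieces[1]

line = "From alice@example.com Sat Jan 5 09:14:16 2008"   # hypothetical line
print(host_by_split(line))   # example.com
```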
The Regex Version

import re
lin = 'From [Link]@[Link] Sat Jan 5 09:14:16 2008'
y = re.findall('@([^ ]*)', lin)
print(y)      # ['[Link]']

'@([^ ]*)'
  @   Look through the string until you find an at-sign

'@([^ ]*)'
  [^ ]   Match non-blank character
  *      Match many of them
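The same extraction can be sketched as a function built on the '@([^ ]*)' pattern. The name host_by_regex and the sample address are hypothetical; returning None when nothing is found is a choice made for this example.

```python
import re

def host_by_regex(line):
    """Return the non-blank characters after the first '@', or None."""
    found = re.findall(r"@([^ ]*)", line)
    if found:
        return found[0]
    return None

line = "From alice@example.com Sat Jan 5 09:14:16 2008"   # hypothetical line
print(host_by_regex(line))                # example.com
print(host_by_regex("no address here"))   # None
```

Unlike the double split, this version does not depend on the address being the second word of the line.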
