0% found this document useful (0 votes)
8 views78 pages

Introduction to Regular Expressions

Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views78 pages

Introduction to Regular Expressions

Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd

INTRODUCTION TO REGULAR

EXPRESSION

1 Vijay Kumar Trivedi


VIT Bhopal
Regular Languages

Regular Language
L

Regular Expression Accepts

Finite State
Machine
REGULAR EXPRESSIONS

The regular expressions over an alphabet  are all and only


the strings that can be obtained as follows:

1.  is a regular expression.
2.  is a regular expression.
3. Every element of  is a regular expression.
4. If  ,  are regular expressions, then so is ..
5. If  ,  are regular expressions, then so is +.
6. If  is a regular expression, then so is *.
7.  is a regular expression, then so is +.
8. If  is a regular expression, then so is ().
REGULAR EXPRESSION
EXAMPLES
If  = {a, b}, the following are regular expressions:



a
(a + b)*
abba + 
REGULAR EXPRESSIONS DEFINE
LANGUAGES
Define L, a semantic interpretation function for regular
expressions:

1. L() = .
2. L() = {}.
3. L(c), where c   = {c}.
4. L(.) = L(). L().
5. L( + ) = L()  L().
6. L(*) = (L())*.
7. L(+) = L(*) = L() (L())*. If L() is equal to , then
L(+) is also equal to . Otherwise L(+) is the language
that is formed by concatenating together one or more
strings drawn from L().
8. L(()) = L().
THE ROLE OF THE RULES
 Rules 1, 3, 4, 5, and 6 give the language its power
to define sets.
 Rule 8 has as its only role grouping other operators.

 Rules 2 and 7 appear to add functionality to the

regular expression language, but they don’t.

2.  is a regular expression.

7.  is a regular expression, then so is +.


ANALYZING A REGULAR
EXPRESSION
L((a + b)*b) = L((a + b)*) L(b)

= (L((a + b)))* L(b)

= (L(a)  L(b))* L(b)

= ({a}  {b})* {b}

= {a, b}* {b}.


EXAMPLE

 Regular expression r a  b * a  bb 

Lr  a, bb, aa, abb, ba, bbb,...

Costas Busch - RPI 8


EXAMPLE

 Regular expression r aa * bb * b

2n 2m
Lr  {a b b : n, m 0}

9
EXAMPLE

 Regular expression r (0  1) * 00 (0  1) *

L(r ) = { all strings with at least


two consecutive 0 }

10
EXAMPLE

 Regular expression r (1  01) * (0   )

L(r ) = { all strings without


two consecutive 0 }

11
EXAMPLES
L( a*b* ) =

L( (a + b)* ) =

L( (a + b)*a*b* ) =

L( (a + b)*abba(a + b)* ) =
GOING THE OTHER WAY

L = {w  {a, b}*: |w| is even}


GOING THE OTHER WAY

L = {w  {a, b}*: |w| is even}

(a + b) (a + b))*

(aa + ab + ba + bb)*
GOING THE OTHER WAY

L = {w  {a, b}*: |w| is even}

(a + b) (a + b))*

(aa + ab + ba + bb)*

L = {w  {a, b}*: w contains an odd number of a’s}


GOING THE OTHER WAY

L = {w  {a, b}*: |w| is even}

(a  b) (a  b))*

(aa  ab  ba  bb)*

L = {w  {a, b}*: w contains an odd number of a’s}

b* (ab*ab*)* a b*

b* a b* (ab*ab*)*
MORE REGULAR EXPRESSION
EXAMPLES

L ( (aa*) +  ) =

L ( (a + )* ) =

L = {w  {a, b}*: there is no more than one b in w}

L = {w  {a, b}* : no two consecutive letters in w are the


same}
Operator Precedence in Regular Expressions

Regular Arithmetic
Expressions Expressions

Highest Kleene star exponentiation

concatenation multiplication

Lowest union addition

a b*  c d* x y 2 + i j2
EQUIVALENT REGULAR EXPRESSIONS

 Definition:

 Regular expressions r1 r
and2


are equivalent if

L(r1 ) L(r2 )

19
EXAMPLE
L = { all strings without
two consecutive 0 }

r1 (1  01) * (0   )
r2 (1* 011*) * (0   )  1* (0   )

r1 and r2
L(r1 ) L(r2 ) L are equivalent
regular expr.
20
The Details Matter
a* + b*  (a + b)*

(ab)*  a*b*
The Details Matter
L1 = {w  {a, b}* : every a is immediately followed a b}
A regular expression for L1:

A FSM for L1:

L2 = {w  {a, b}* : every a has a matching b somewhere}

A regular expression for L2:

A FSM for L2:


Kleene’s Theorem
Finite state machines and regular expressions define the
same class of languages. To prove this, we must
show:

Theorem: Any language that can be defined with a


regular expression can be accepted by some FSM and
so is regular.

Theorem: Every regular language (i.e., every language


that can be accepted by some DFSM) can be defined
with a regular expression.
THEOREM


Languages
Generated by Regular
Regular Expressions Languages

24
Theorem - Part 1


Languages
Generated by Regular
Regular Expressions Languages

1. For any regular expression r


the language L (r ) is regular

25
Theorem - Part 2


Languages
Generated by Regular
Regular Expressions Languages

2. For any regular language L there is


a regular expression r with L ( r ) L

26
For Every Regular Expression
There is a Corresponding FSM
We’ll show this by construction. An FSM for:

:
For Every Regular Expression
There is a Corresponding FSM
We’ll show this by construction. An FSM for:

:
For Every Regular Expression
There is a Corresponding FSM
We’ll show this by construction. An FSM for:

:

A single element of :
For Every Regular Expression
There is a Corresponding FSM
We’ll show this by construction. An FSM for:

:

A single element of :
For Every Regular Expression
There is a Corresponding FSM
We’ll show this by construction. An FSM for:

:

A single element of :

 (*):
For Every Regular Expression
There is a Corresponding FSM
We’ll show this by construction. An FSM for:

:

A single element of :

 or (*):
UNION
 If  is the regular expression of r1 + r2 and
if both L(r1) and L(r2) are regular:
 NFA for
L1  L2

 
M1
 

M2 33
CONCATENATION
 If  is the regular expression of r1.r2 and if
both L(r1) and L(r2) are regular then
 NFA for L .L
1 2

M1 M2
 

34
KLEEN STAR OPERATION
If  is the regular expression r1* and if
L(r1*) is regular then:
 NFA for
  L1 *
L1 * M1
 


35
An Example
(b + ab)*

An FSM for b An FSM for a An FSM for b

An FSM for ab:


An Example
(b + ab)*

An FSM for (b + ab):


An Example
(b + ab)*

An FSM for (b + ab)*:


EXAMPLE

n w w1w2  wk
 FSM for L1* {a b} * wi  L1

n
L1 {a b}
a
 b 

 39
The Algorithm regextofsm

regextofsm(: regular expression) =

Beginning with the primitive subexpressions of  and


working outwards until an FSM for all of  has been
built do:
Construct an FSM as described above.
PROOF – PART 2

2. For any regular language L there is


a regular expression r with

L ( r ) L

Proof by construction of regular expression

41

L is regular take the
Since
 NFA
M that accepts it

L ( M ) L

Single final state

42
 From Mconstruct the equivalent
 Generalized Transition Graph

 in which transition labels are regular


expressions

Example:

M
a c a c
a, b a b
43
b b
 Another Example: a
q0 q1 a, b q2
b

b b
a
q0 q1 a  b q2
b
44
b b
 Reducing the states: a
q0 q1 a  b q2
b

bb * a b

q0 bb * (a  b) q2
Costas Busch - RPI 45
 Resulting Regular Expression:

bb * a b

q0 bb * (a  b) q2

r (bb * a ) * bb * (a  b)b *

L ( r ) L ( M ) L
Costas Busch - RPI 46
IN GENERAL
e
 Removing states:
d c
qi q qj
a b

ae * d ce * b
ce * d
qi qj
ae * b 47
Costas Busch - RPI
 The final transition graph:
r1 r4
r3
q0 qf
r2
The resulting regular expression:

r r1 * r2 (r4  r3r1 * r2 ) *

L ( r ) L ( M ) L
48
NFA TO REX
For Every FSM There is a
Corresponding Regular Expression
We’ll show this by construction.

The key idea is that we’ll allow arbitrary regular expressions


to label the transitions of an FSM.
A Simple Example

Let M be:

Suppose we rip out state 2:


An Example

1. Create a new initial state and a new, unique accepting


state, neither of which is part of a loop.
An Example, Continued

2. Remove states and arcs and replace with arcs labelled


with larger and larger regular expressions.
An Example, Continued

Remove state 3:
An Example, Continued

Remove state 2:
An Example, Continued

Remove state 1:
WHEN IT’S HARD
M=
WHEN IT’S HARD
A regular expression for M:
a (aa)*
 (aa)* b(b(aa)*b)* ba(aa)*
 [a(aa)* b  (b  a(aa)* b) (b(aa)* b)* (a  ba(aa)*b)]
[b(aa)* b  (a  b(aa)* ab) (b(aa)* b)* (a  ba(aa)*b)]*
[b(aa)*  (a  b(aa)* ab) (b(aa)* b)* ba(aa)*]
Further Modifications to M Before We Start
We require that, from every state other than the accepting state there must be
exactly one transition to every state (including itself) except the start state.
And into every state other than the start state there must be exactly one
transition from every state (including itself) except the accepting state.

1. If there is more than one transition between states p and q,


collapse them into a single transition:

becomes:
Further Modifications to M Before We Start
2. If any of the required transitions are missing, add them:

becomes
:
RIPPING OUT STATES

3. Choose a state. Rip it out. Restore functionality.

Suppose we rip state 2.


What Happens When We Rip?

Consider any pair of states p and q. Once we remove rip, how can M get
from p to q?

● It can still take the transition that went directly from p


to q, or
● It can take the transition from p to rip. Then, it can take the
transition from rip back to itself zero or more times. Then it can
take the transition from rip to q.
Defining R(p, q)
After removing rip, the new regular expression that should label
the transition from p to q is:
R(p, q) /* Go directly from p to q
 /* or
R(p, rip) /* Go from p to rip, then
R(rip, rip)* /* Go from rip back to itself
any number of times, then
R(rip, q) /* Go from rip to q

Without the comments, we have:


R = R(p, q)  R(p, rip) R(rip, rip)* R(rip, q)
Returning to Our Example

R = R(p, q)  R(p, rip) R(rip, rip)* R(p, rip)

Let rip be state 2. Then:

R (1, 3) = R(1, 3)  R(1, rip)R(rip, rip)*R(rip, 3)


= R(1, 3)  R(1, 2)R(2, 2)*R(2, 3)
=   a b* a
= ab*a
The Algorithm fsmtoregex
fsmtoregex(M: FSM) =
1. M = standardize(M: FSM).
2. Return buildregex(M).

standardize(M: FSM) =
1. Remove unreachable states from M.
2. If necessary, create a new start state.
3. If necessary, create a new accepting state.
4. If there is more than one transition between states p
and q, collapse them.
5. If any transitions are missing, create them with label
.
The Algorithm fsmtoregex
buildregex(M: FSM) =
1. If M has no accepting states then return .
2. If M has only one state, then return .
3. Until only the start and accepting states remain do:
3.1 Select some state rip of M.
3.2 For every transition from p to q, if both p
and q are not rip then do
Compute the new label R for the transition
from p to q:

R (p, q) = R(p, q)  R(p, rip) R(rip, rip)*


R(rip, q)

3.3 Remove rip and all transitions into and out of it.
4. Return the regular expression that labels the
transition from the start state to the accepting state.
A Special Case of Pattern Matching
Suppose that we want to match a pattern that is composed
of a set of keywords. Then we can write a regular
expression of the form:

(* (k1  k2  …  kn) *)+

For example, suppose we want to match:

* finite state machine 


FSM  finite state automaton*

We can use regextofsm to build an FSM. But …

We can instead use buildkeywordFSM.


{CAT, BAT, CAB}
The single keyword cat:
{CAT, BAT, CAB}
Adding bat:
{CAT, BAT, CAB}
Adding cab:
A Biology Example – BLAST
Given a protein or DNA sequence, find others that are likely
to be evolutionarily close to it.

ESGHDTTTYYNKNRYPAGWNNHHDQMFFWV

Build a DFSM that can examine thousands of other


sequences and find those that match any of the selected
patterns.
Regular Expressions in Perl
Syntax Name Description
abc Concatenation Matches a, then b, then c, where a, b, and c are any regexs
a|b|c Union (Or) Matches a or b or c, where a, b, and c are any regexs
a* Kleene star Matches 0 or more a’s, where a is any regex
a+ At least one Matches 1 or more a’s, where a is any regex
a? Matches 0 or 1 a’s, where a is any regex
a{n, m} Replication Matches at least n but no more than m a’s, where a is any regex
a*? Parsimonious Turns off greedy matching so the shortest match is selected
a+?  
. Wild card Matches any character except newline
^ Left anchor Anchors the match to the beginning of a line or string
$ Right anchor Anchors the match to the end of a line or string
[a-z] Assuming a collating sequence, matches any single character in range

[^a-z] Assuming a collating sequence, matches any single character not in range

\d Digit Matches any single digit, i.e., string in [0-9]


\D Nondigit Matches any single nondigit character, i.e., [^0-9]
\w Alphanumeric Matches any single “word” character, i.e., [a-zA-Z0-9]
\W Nonalphanumeric Matches any character in [^a-zA-Z0-9]

\s White space Matches any character in [space, tab, newline, etc.]


Regular Expressions in Perl
Syntax Name Description
\S Nonwhite space Matches any character not matched by \s
\n Newline Matches newline
\r Return Matches return
\t Tab Matches tab
\f Formfeed Matches formfeed
\b Backspace Matches backspace inside []
\b Word boundary Matches a word boundary outside []
\B Nonword boundary Matches a non-word boundary
\0 Null Matches a null character
\nnn Octal Matches an ASCII character with octal value nnn
\xnn Hexadecimal Matches an ASCII character with hexadecimal value nn
\cX Control Matches an ASCII control character
\char Quote Matches char; used to quote symbols such as . and \
(a) Store Matches a, where a is any regex, and stores the matched string in the next variable
\1 Variable Matches whatever the first parenthesized expression matched
\2 Matches whatever the second parenthesized expression matched

… For all remaining variables


Using Regular Expressions
in the Real World
Matching numbers:
-? ([0-9]+(\.[0-9]*)? | \.[0-9]+)

Matching ip addresses:
S !<emphasis> ([0-9]{1,3} (\ . [0-9] {1,3}){3}) </emphasis>
!<inet> $1 </inet>!

Finding doubled words:


\< ([A-Za-z]+) \s+ \1 \>

From Friedl, J., Mastering Regular Expressions, O’Reilly,1997.


More Regular Expressions
Identifying spam:

\badv\(?ert\)?\b

Trawl for email addresses:

\b[A-Za-z0-9_%-]+@[A-Za-z0-9_%-]+ (\.[A-Za-z]
+){1,4}\b
Using Substitution
Building a chatbot:

On input:

<phrase1> is <phrase2>

the chatbot will reply:

Why is <phrase1> <phrase2>?


Chatbot Example
<user> The food there is awful
<chatbot> Why is the food there awful?

Assume that the input text is stored in the variable $text:

$text =~
s/^([A-Za-z]+)\sis\s([A-Za-z]+)\.?$/
Why is \1 \2?/
;
Simplifying Regular Expressions
Regex’s describe sets:
● Union is commutative:    =   .
● Union is associative: (  )   =   (  ).
●  is the identity for union:    =    = .
● Union is idempotent:    = .
Concatenation:
● Concatenation is associative: () = ().
●  is the identity for concatenation:   =   = .
●  is a zero for concatenation:   =   = .
Concatenation distributes over union:
● (  )  = ( )  ( ).
●  (  ) = ( )  ( ).
Kleene star:
● * = .
● * = .

You might also like