CS 3723 Programming Languages
2. Regular Expressions
For general material about the theory of regular expressions,
unrelated to Python, see
Regular Expressions.
2.1. The Regular Expression Module,
re: Python has extensive facilities for
regular expressions, just like other common scripting languages such
as Perl or Ruby. These are provided in Python as a library module
that must be imported to use regular expressions. (See
REs
(tutorialspoint) for a detailed discussion, and
Regular Expressions for some information.)
Raw Strings:
In Python, a raw string is a string literal (enclosed in
single or double quotes) preceded by an r or R.
Such strings leave backslash characters (and all other characters)
intact, rather than treating a backslash as the start of an escape
sequence. Raw strings are mainly used in regular expressions and in
applications like specifying a Windows filename. A raw string cannot end in
a single backslash, as with r"\",
since the backslash keeps the following double quote from closing the
string (both characters stay in the string),
and so the example is not
a completed string: it is missing the final double quote. All regular expressions
in this write-up will be written as raw strings. Examples:
r'(a*)b(c*)' or
R"(a|b)abb".
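The behavior of raw strings described above can be checked directly. The following is a minimal sketch, assuming Python 3; the Windows path is a made-up example, not a file used elsewhere in these notes.

```python
# Compare an ordinary string literal with a raw string literal.
plain = "\n"   # one character: a newline
raw = r"\n"    # two characters: a backslash and the letter n
print(len(plain), len(raw))

# A Windows-style path (hypothetical file name); no doubled backslashes needed.
path = r"C:\new\table.txt"
print(path)

# A backslash before a quote keeps the quote from ending the string,
# and the backslash itself stays in the string.
quoted = r"\""
print(len(quoted))

# A raw string cannot end in a single backslash, so one workaround is to
# append an ordinary (escaped) backslash.
folder = r"C:\new" + "\\"
print(folder)
```

The last two lines show one common workaround for the trailing-backslash restriction; other approaches exist.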
2.2. Initial Examples:
Here are three initial examples to get started:
Example 1:
    import re
    r = re.compile( r"(\d\d):(\d\d) (am|pm)" )
    m = r.search( "12:45 pm" )
    g = m.groups()
    print g
Output:
    ('12', '45', 'pm')
Example 2:
    import re
    r = re.compile( r"(\d\d):(\d\d) (am|pm)" )
    while True:
        s = raw_input( "-->" )
        m = r.search( s )
        if m is not None:
            g = m.groups()
            print g
        else:
            print "No match!"
            break
Output:
    -->12:45 pm
    ('12', '45', 'pm')
    -->01:22 am
    ('01', '22', 'am')
    -->12:45pm
    No match!
Example 3:
    import re
    import sys
    r = re.compile( r"(\d\d):(\d\d) (am|pm)" )
    while True:
        s = raw_input( "-->" )
        m = r.search( s )
        if m is not None:
            j = m.lastindex
            # group(0) is the whole match; groups 1..j follow
            for i in range(0,j+1):
                g = m.group(i)
                sys.stdout.write(g)
                if i != j:
                    sys.stdout.write(", ")
            sys.stdout.write("\n")
        else:
            print "No match!"
            break
Output:
    -->12:45 pm
    12:45 pm, 12, 45, pm
    -->01:22 am
    01:22 am, 01, 22, am
    -->12:45pm
    No match!
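The three examples above are written for Python 2 (print statement, raw_input()). As a rough sketch of the same matching logic under Python 3, without interactive input, something like the following could be used. The function name parse_time and the constant TIME_RE are hypothetical, introduced only for this illustration.

```python
import re

# Same pattern as in Examples 1-3.
TIME_RE = re.compile(r"(\d\d):(\d\d) (am|pm)")

def parse_time(text):
    """Return the (hour, minute, am/pm) groups, or None if nothing matches."""
    m = TIME_RE.search(text)
    if m is None:
        return None
    return m.groups()

# The same three inputs used in the sample runs above.
for sample in ["12:45 pm", "01:22 am", "12:45pm"]:
    result = parse_time(sample)
    if result is not None:
        print(result)
    else:
        print("No match!")
```

Returning None instead of printing inside the function keeps the check for a failed match in one place, which is a design choice of this sketch rather than something the examples require.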
Items of Interest or for study:
- The examples show the function compile (from the module
re) used to take a regular expression (as a raw string)
and produce a regular expression object, named r here,
that can be used for matches. r has methods match,
search, and split among others.
These examples use search.
- In each example the method search is used to produce
match data in an object named m. The match data
object provides ways to extract information about the match.
Illustrated here are two methods and one attribute:
  - groups(), which returns a tuple holding each of
  the matches that came from ( ) parts of the regular expression.
  - group(), with an integer parameter that gives the
  group number desired. (group(0) is the entire match.)
  - lastindex, an attribute (written without parentheses) that gives the
  number of the last group that matched, or None if no group matched.
- If the match fails, the "Null object" None is returned.
Examples 2 and 3 check for None, so that the program itself
won't fail. Example 3 prints the individual groups separated
by commas. This is confusing if there are any commas in an
individual group, so the more general program below prints the
groups on separate lines. (By contrast, printing the tuple returned by
groups(), as in Examples 1 and 2, shows each group
inside quotes, and it even handles quotes themselves correctly
if any are in a group.)
- Examples 2 and 3 show the raw_input() construct of
Python 2.x (which becomes just input() in Python 3).
This construct uses the string inside parens as a prompt, and then
fetches all the characters up to a "return".
(See
Input Functions.)
- while True: is often written while 1:
- match versus search:
What Python calls search is what Perl and Ruby use by
default and call match. This is what we also will usually want to use.
Python's match only matches an initial segment of the input string,
while Python's search is willing to skip over any number of initial
characters while looking for a match. (Both search and
match allow anything at the end.)
Rule: Use search instead of match in Python regular expressions.
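The difference between match and search, and the three ways of reading a match object listed above, can be seen side by side in a short sketch (assuming Python 3; the sample sentence is made up):

```python
import re

r = re.compile(r"(\d\d):(\d\d) (am|pm)")
text = "Lunch at 12:45 pm sharp"

# match only succeeds if the pattern fits at the very start of the string.
by_match = r.match(text)
print(by_match)

# search skips over leading characters until it finds a match.
by_search = r.search(text)
print(by_search.group(0))    # the entire match
print(by_search.groups())    # tuple of the ( ) groups
print(by_search.group(1))    # a single group by number
print(by_search.lastindex)   # number of the last group that matched

# Both allow extra characters after the match.
at_start = r.match("12:45 pm sharp")
print(at_start.group(0))
```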
2.3. Debug Example:
Here is an example that uses an arbitrary regular
expression and an arbitrary string as input.
It then prints the matched patterns. All strings are printed
between "|" characters.
This is called a Debug Example because you should use it
or something similar whenever you write a Python program involving
regular expressions. These REs are error prone, so you should
first debug the RE you propose to use before going on with the
rest of the program.
Debugging Regular Expressions

Program, with Data:
    # regular.py: test regular expressions
    import re
    import sys

    def regtest(reg, dat):
        # echo the inputs between "|" delimiters
        sys.stdout.write("Inputs: RegExp: |" + reg +
            "|\n String: |" + dat + "|\n")
        r = re.compile(reg)
        # first search (not match)
        m = r.search( dat )
        sys.stdout.write("Search: ")
        if m is not None:
            j = m.lastindex
            if j is not None:
                # print group(0) (the whole match) through group(j)
                for i in range(0,j+1):
                    g = m.group(i)
                    sys.stdout.write("group(" + str(i) +
                        "):|" + g)
                    if i != j:
                        sys.stdout.write("|,\n ")
                sys.stdout.write("|\n")
            else:
                sys.stdout.write("ERR: Match, no groups\n")
        else:
            sys.stdout.write("ERR: No Match\n")
        # now try split
        s = r.split( dat )
        sys.stdout.write("Split: ")
        sys.stdout.write(str(s))
        sys.stdout.write("\n\n")

    regtest(r"(\d+)#(\d+)", "12:34post")
    regtest(r"\d+:\d+", "12:34post")
    regtest(r"(\d+):(\d+)", "12:34post")
    regtest(r"(\d+):(\d+)", "pre12:34post")
    regtest(r"(\(\d+\)):(\[\d+\])", "Time(12):[34]am")
    regtest(r"^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$",
        "19 Kart, Er J. @007 etc.")
    regtest(r"(\w+ \w+)\s+(\d+)\s+(\d+)\s+(\d+)",
        "Bruce Wayne 85 67 134")

Output:
    % python regular.py
    Inputs: RegExp: |(\d+)#(\d+)|
    String: |12:34post|
    Search: ERR: No Match
    Split: ['12:34post']

    Inputs: RegExp: |\d+:\d+|
    String: |12:34post|
    Search: ERR: Match, no groups
    Split: ['', 'post']

    Inputs: RegExp: |(\d+):(\d+)|
    String: |12:34post|
    Search: group(0):|12:34|,
    group(1):|12|,
    group(2):|34|
    Split: ['', '12', '34', 'post']

    Inputs: RegExp: |(\d+):(\d+)|
    String: |pre12:34post|
    Search: group(0):|12:34|,
    group(1):|12|,
    group(2):|34|
    Split: ['pre', '12', '34', 'post']

    Inputs: RegExp: |(\(\d+\)):(\[\d+\])|
    String: |Time(12):[34]am|
    Search: group(0):|(12):[34]|,
    group(1):|(12)|,
    group(2):|[34]|
    Split: ['Time', '(12)', '[34]', 'am']

    Inputs: RegExp: |^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$|
    String: |19 Kart, Er J. @007 etc.|
    Search: group(0):|19 Kart, Er J. @007 etc.|,
    group(1):|19|,
    group(2):|Kart|,
    group(3):|Er J. |,
    group(4):|@007|
    Split: ['', '19', 'Kart', 'Er J. ', '@007', '']

    Inputs: RegExp: |(\w+ \w+)\s+(\d+)\s+(\d+)\s+(\d+)|
    String: |Bruce Wayne 85 67 134|
    Search: group(0):|Bruce Wayne 85 67 134|,
    group(1):|Bruce Wayne|,
    group(2):|85|,
    group(3):|67|,
    group(4):|134|
    Split: ['', 'Bruce Wayne', '85', '67', '134', '']
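The program above loops up to m.lastindex, which is None when the pattern has no groups and which stops at the last group that actually took part in the match. One possible alternative, sketched here under the assumption of Python 3, is to loop up to the compiled pattern's groups attribute instead. The helper name all_groups is hypothetical and is not part of regular.py.

```python
import re

def all_groups(pattern, text):
    """Return [group(0), group(1), ...] for the first match, or None."""
    r = re.compile(pattern)
    m = r.search(text)
    if m is None:
        return None
    # r.groups is the number of ( ) groups in the pattern itself, so this
    # also covers patterns with no groups and groups that did not match.
    return [m.group(i) for i in range(r.groups + 1)]

print(all_groups(r"\d+:\d+", "12:34post"))      # a match with no groups
print(all_groups(r"(\d+):(\d+)", "12:34post"))  # two groups
print(all_groups(r"(\d+)(am|pm)?", "12"))       # second group is None
print(all_groups(r"(\d+)#(\d+)", "12:34post"))  # no match at all
```

A group that did not take part in the match comes back as None, so code that concatenates groups into a string needs to handle that case.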
Items of Interest or for study:
- This example is mostly a single function that takes as
inputs a regular expression and a string to test against the
regular expression. The function tries out search
and split in each case, leaving off Python's match
because it is not as useful.
Python's split shows what
comes before and after each match, so it also separates
multiple matches. When the regular expression contains groups,
the text matched by each group is included in the resulting list.
- The construct range(0, j+1) gives integers from 0
to j inclusive.
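As a small illustration of the two points above, the following sketch (assuming Python 3; the input string is made up so that it contains two matches) shows split with and without groups, and the integers produced by range:

```python
import re

text = "pre12:34mid56:78post"

# Without groups, split returns only the pieces between the matches.
no_groups = re.compile(r"\d+:\d+").split(text)
print(no_groups)

# With groups, the text matched by each group is placed in the list too.
with_groups = re.compile(r"(\d+):(\d+)").split(text)
print(with_groups)

# range(0, j+1) runs from 0 to j inclusive; here j is 2.
j = 2
indices = list(range(0, j + 1))
print(indices)
```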
2.4. Debug Module:
This example puts the function regtest of the previous
example into a module to allow its use in more general contexts.
First is the module containing the function. I'll use the
same name as the function for this module: regtest.py. (I don't know if
this is a good idea or not, but it works.) Then come two
separate programs that make use of this module.
Module: regtest.py
    # regtest.py: module with function regtest
    import re
    import sys

    def regtest(reg, dat, delim="|"):
        # echo the inputs between delimiter characters
        sys.stdout.write("Inputs: RegExp: " + delim + reg +
            delim + "\n String: " + delim + dat +
            delim + "\n")
        r = re.compile(reg)
        m = r.search( dat )
        sys.stdout.write("Search: ")
        if m is not None:
            j = m.lastindex
            if j is not None:
                # print group(0) (the whole match) through group(j)
                for i in range(0,j+1):
                    g = m.group(i)
                    sys.stdout.write("group(" + str(i) +
                        "):" + delim + g)
                    if i != j:
                        sys.stdout.write(delim + ",\n ")
                sys.stdout.write(delim + "\n")
            else:
                sys.stdout.write("ERR: Match, no groups\n")
        else:
            sys.stdout.write("ERR: No Match\n")
        s = r.split( dat )
        sys.stdout.write("Split: ")
        sys.stdout.write(str(s))
        sys.stdout.write("\n\n")
Program: regfixed.py
    # regfixed.py: fixed calls to regtest
    import regtest
    regtest.regtest(r"(\d+)#(\d+)", "12:34post")
    regtest.regtest(r"\d+:\d+", "12:34post")
    regtest.regtest(r"(\d+):(\d+)", "12:34post")
    regtest.regtest(r"(\d+):(\d+)", "pre12:34post")
    regtest.regtest(r"(\(\d+\)):(\[\d+\])", "Time(12):[34]am")
    regtest.regtest(r"^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$",
        "19 Kart, Er J. @007 etc.")
    regtest.regtest(r"(\w+ \w+)\s+(\d+)\s+(\d+)\s+(\d+)",
        "Bruce Wayne 85 67 134")
Output: same as before
    % python regfixed.py
    Inputs: RegExp: |(\d+)#(\d+)|
    String: |12:34post|
    Search: ERR: No Match
    Split: ['12:34post']
    ...
    (etc., same as before)
Items of Interest or for study:
- This is mostly the same as the example in Section 2.3,
except that the function has been moved into a separate file.
A file containing code like this is called a module in Python.
This file is imported into another file that uses it,
as regfixed.py does above. The main difference in this case is that in
calling the function regtest inside the other file,
you have to prefix the function name with the module name
(the file name without .py),
as with: regtest.regtest.
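The same naming rule applies to every module, not just regtest. Since regtest.py is a local file, the sketch below uses the standard module re to show the two usual forms of import (assuming Python 3; the alias re_compile is a hypothetical name chosen for this illustration).

```python
# Form 1: import the module; its functions are reached as module.function.
import re

# Form 2: import one name directly; no module prefix is needed afterwards.
from re import compile as re_compile

a = re.compile(r"\d+")   # qualified with the module name
b = re_compile(r"\d+")   # the same function under a local name

print(a.search("x42").group(0))
print(b.search("x42").group(0))
```

By analogy, a line such as "from regtest import regtest" would let a program call regtest(...) without the prefix, at the cost of hiding where the function comes from.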
Program: reginput.py (two versions)

Version 1: Program, with Run
    # reginput.py: input data, call regtest
    import regtest
    import sys # for final output
    while True:
        reg = raw_input( "RegExp-->" )
        if reg == "":
            break
        st = raw_input( "String-->" )
        regtest.regtest(reg, st)
    sys.stdout.write("That's all folks!\n")

    % python reginput.py
    RegExp-->(\d+)#(\d+)
    String-->12:34post
    Inputs: RegExp: |(\d+)#(\d+)|
    String: |12:34post|
    Search: ERR: No Match
    Split: ['12:34post']

    RegExp-->\d+:\d+
    String-->12:34post
    Inputs: RegExp: |\d+:\d+|
    String: |12:34post|
    Search: ERR: Match, no groups
    Split: ['', 'post']

    RegExp-->(\d+):(\d+)
    String-->12:34post
    Inputs: RegExp: |(\d+):(\d+)|
    String: |12:34post|
    Search: group(0):|12:34|,
    group(1):|12|,
    group(2):|34|
    Split: ['', '12', '34', 'post']

    RegExp-->(\d+):(\d+)
    String-->pre12:34post
    Inputs: RegExp: |(\d+):(\d+)|
    String: |pre12:34post|
    Search: group(0):|12:34|,
    group(1):|12|,
    group(2):|34|
    Split: ['pre', '12', '34', 'post']

    RegExp-->(return)
    That's all folks!

Version 2: String Delimiter is "$"
    # reginput.py: input data, call regtest
    import regtest
    import sys # for final output
    while True:
        reg = raw_input( "RegExp-->" )
        if reg == "":
            break
        st = raw_input( "String-->" )
        regtest.regtest(reg, st, delim="$")
    sys.stdout.write("That's all folks!\n")

    % python reginput.py
    RegExp-->(\d+)#(\d+)
    String-->12:34post
    Inputs: RegExp: $(\d+)#(\d+)$
    String: $12:34post$
    Search: ERR: No Match
    Split: ['12:34post']

    RegExp-->\d+:\d+
    String-->12:34post
    Inputs: RegExp: $\d+:\d+$
    String: $12:34post$
    Search: ERR: Match, no groups
    Split: ['', 'post']

    RegExp-->(\d+):(\d+)
    String-->12:34post
    Inputs: RegExp: $(\d+):(\d+)$
    String: $12:34post$
    Search: group(0):$12:34$,
    group(1):$12$,
    group(2):$34$
    Split: ['', '12', '34', 'post']

    RegExp-->(\d+):(\d+)
    String-->pre12:34post
    Inputs: RegExp: $(\d+):(\d+)$
    String: $pre12:34post$
    Search: group(0):$12:34$,
    group(1):$12$,
    group(2):$34$
    Split: ['pre', '12', '34', 'post']

    RegExp-->(return)
    That's all folks!
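The second version above differs from the first only in passing delim="$" by name. How a default parameter value and a named argument behave can be sketched with a much smaller function (assuming Python 3; the function wrap is hypothetical and stands in for regtest only to keep the example short):

```python
def wrap(text, delim="|"):
    """Surround text with a delimiter; delim defaults to "|"."""
    return delim + text + delim

# Third parameter left out: the default "|" is used.
print(wrap("12:34"))

# A different value given by position.
print(wrap("12:34", "$"))

# The same value given by parameter name, as in the second reginput.py.
print(wrap("12:34", delim="$"))
```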
Items of Interest or for study:
- This uses the same function as before, but the main
part of the code does interactive input of the RE and the string.
- The other relatively small change was to make the character that
delimits strings into a parameter. This is accomplished
by using a default value for a new parameter,
namely delim="|". If this third parameter is not supplied
in a call, delim gets the default value of "|".
Otherwise we can call the function with a different value for this
parameter, shown above with "$". Finally, we can use
the name of the parameter in a call, also shown above.
Many of these issues are discussed thoroughly at:
default parameter values, and calls with the parameter name.
- It turns out that values like "", [ ],
( ), 0, and None
are all treated as false when tested in a condition in Python. So the loop can
be rewritten as:
|