CS 3723 Python: Regular Expressions

CS 3723
Programming Languages

2. Regular Expressions

For general material about the theory of regular expressions, unrelated to Python, see Regular Expressions.

2.1. The Regular Expression Module, re: Python has extensive facilities for regular expressions, just like other common scripting languages such as Perl or Ruby. These are provided in Python as a library module that must be imported to use regular expressions. (See REs (tutorialspoint) for a detailed discussion, and Regular Expressions for some information.)

Raw Strings:

raw string

r"\"

r'(a*)b(c*)'

R"(a|b)abb"
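As a quick check of what the raw-string prefix does, the sketch below compares raw and ordinary string literals. It is written for Python 3 (the examples on this page use Python 2), and the variable names are purely illustrative.

```python
# In a raw string a backslash is kept as an ordinary character.
raw = r"\d+"          # backslash, d, plus sign
cooked = "\\d+"       # the same three characters without the r prefix
print(raw == cooked)  # True
print(len(r"\n"))     # 2: a backslash and the letter n
print(len("\n"))      # 1: a single newline character

# A raw string cannot end in a single backslash, so a lone backslash
# has to be written as an ordinary string literal instead.
backslash = "\\"
print(len(backslash)) # 1
```

This is why regular expressions are normally written as raw strings: the backslashes reach the re module unchanged.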

2.2. Initial Examples: Here are three initial examples to get started:

Example 1

Output

import re
r = re.compile( r"(\d\d):(\d\d) (am|pm)" )
m = r.search( "12:45 pm" )
g = m.groups()
print g

('12', '45', 'pm')

Example 2

Output

import re
r = re.compile( r"(\d\d):(\d\d) (am|pm)" )
while True:
    s= raw_input( "-->" )
    m = r.search( s )
    if m != None:
        g = m.groups()
        print g
    else:
        print "No match!"
        break

-->12:45 pm
('12', '45', 'pm')
-->01:22 am
('01', '22', 'am')
-->12:45pm
No match!

Example 3

Output

import re
import sys
r = re.compile( r"(\d\d):(\d\d) (am|pm)" )
while True:
    s= raw_input( "-->" )
    m = r.search( s )
    if m != None:
        j = m.lastindex
        for i in range(0,j+1):
            g = m.group(i)
            sys.stdout.write(g);
            if i != j:
                sys.stdout.write(", ");
        sys.stdout.write("\n")
    else:
        print "No match!"
        break

-->12:45 pm
12:45 pm, 12, 45, pm
-->01:22 am
01:22 am, 01, 22, am
-->12:45pm
No match!

Items of Interest or for study:

The examples show the function compile (from the module re) used to take a regular expression (as a raw string) and produce a regular expression object, named r here, that can be used for matches. r has methods match, search, and split among others. These examples use search.
In each example the method search is used to produce match data in an object named m. The match data object has methods and attributes to extract information about the match. Illustrated here are three of them:
- groups(), which gives each of the matches that came from ( ) parts of the regular expression.
- group(), with an integer parameter that gives the group number desired. (group(0) is the entire match.)
- lastindex, an attribute (not a method) giving the index of the last matched group. For the patterns used here, that is the highest group number.
If the match fails, the "Null object" None is returned. Examples 2 and 3 check for None, so that the program itself won't fail. Example 3 prints the individual groups separated by commas. This is confusing if there are any commas in an individual group, so the more general program below prints the groups on separate lines. (Printing the tuple returned by groups() puts each group inside quotes, and it even handles quotes themselves correctly if any are in a group.)
Examples 2 and 3 show the raw_input() construct of Python 2.x (which becomes just input() in Python 3). This construct uses the string inside parens as a prompt, and then fetches all the characters up to a "return". (See Input Functions.)
while True: is often written while 1:
match versus search: What Python calls search is what Perl and Ruby use by default and call match. This is what we also will usually want to use. Python's match only matches an initial segment of the input string, while Python's search is willing to skip over any number of initial characters while looking for a match. (Both search and match allow anything at the end.)
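The three ways of reading match data described above can be tried out in a few lines. This is only a sketch in Python 3 syntax, reusing the time pattern from the examples; the input string and variable names are made up for illustration.

```python
import re

r = re.compile(r"(\d\d):(\d\d) (am|pm)")
m = r.search("at 12:45 pm today")

all_groups = m.groups()   # tuple of the parenthesized matches
whole = m.group(0)        # the entire matched text
hour = m.group(1)         # first parenthesized group
last = m.lastindex        # an attribute, so no parentheses

print(all_groups)         # ('12', '45', 'pm')
print(whole)              # 12:45 pm
print(hour, last)         # 12 3

# A failed search returns None, so test before using the result.
failed = r.search("12:45pm")
print(failed is None)     # True
```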

Rule

Use search instead of match
in Python regular expressions.
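To see the difference behind this rule, the following sketch (Python 3 syntax, strings taken from the debug example on this page) runs both methods on the same inputs.

```python
import re

r = re.compile(r"(\d+):(\d+)")

# search skips over the leading "pre"; match insists on starting at index 0.
found = r.search("pre12:34post")
missed = r.match("pre12:34post")
initial = r.match("12:34post")

print(found.group(0))     # 12:34
print(missed)             # None
print(initial.group(0))   # 12:34

# Anchoring the pattern with ^ makes search behave like match here.
anchored = re.compile(r"^(\d+):(\d+)")
print(anchored.search("pre12:34post"))   # None
```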

2.3. Debug Example: Here is an example that uses an arbitrary regular expression and an arbitrary string as input. It then prints the matched patterns. All strings are printed between "|" characters.

This is called a Debug Example because you should use it or something similar whenever you write a Python program involving regular expressions. These REs are error prone, so you should first debug the RE you propose to use before going on with the rest of the program.

Debugging Regular Expressions

Program, with Data

Output (horizontal lines added)

# regular.py: test regular expressions
import re
import sys

def regtest(reg, dat): 
    sys.stdout.write("Inputs: RegExp:  |" + reg +
          "|\n        String:  |" + dat + "|\n")
    r = re.compile(reg)
    # first search (not match)
    m = r.search( dat )
    sys.stdout.write("Search: ")
    if m != None:
        j = m.lastindex
        if j != None:
            for i in range(0,j+1):
                g = m.group(i)
                sys.stdout.write("group(" + str(i) +
                   "):|" + g);
                if i != j:
                    sys.stdout.write("|,\n        ");
            sys.stdout.write("|\n")
        else:
            sys.stdout.write("ERR: Match, no groups\n")
    else:
        sys.stdout.write("ERR: No Match\n")
    # now try split
    s = r.split( dat )
    sys.stdout.write("Split:  ")
    sys.stdout.write(str(s))
    sys.stdout.write("\n\n")

regtest(r"(\d+)#(\d+)",    "12:34post")
regtest(r"\d+:\d+",    "12:34post")
regtest(r"(\d+):(\d+)",    "12:34post")
regtest(r"(\d+):(\d+)", "pre12:34post")
regtest(r"(\(\d+\)):(\[\d+\])", "Time(12):[34]am")
regtest(r"^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$",
      "19 Kart, Er J. @007 etc.")
regtest(r"(\w+ \w+)\s+(\d+)\s+(\d+)\s+(\d+)",
      "Bruce Wayne      85  67  134")

% python regular.py
Inputs: RegExp:  |(\d+)#(\d+)|
        String:  |12:34post|
Search: ERR: No Match
Split:  ['12:34post']
Inputs: RegExp:  |\d+:\d+|
        String:  |12:34post|
Search: ERR: Match, no groups
Split:  ['', 'post']
Inputs: RegExp:  |(\d+):(\d+)|
        String:  |12:34post|
Search: group(0):|12:34|,
        group(1):|12|,
        group(2):|34|
Split:  ['', '12', '34', 'post']
Inputs: RegExp:  |(\d+):(\d+)|
        String:  |pre12:34post|
Search: group(0):|12:34|,
        group(1):|12|,
        group(2):|34|
Split:  ['pre', '12', '34', 'post']
Inputs: RegExp:  |(\(\d+\)):(\[\d+\])|
        String:  |Time(12):[34]am|
Search: group(0):|(12):[34]|,
        group(1):|(12)|,
        group(2):|[34]|
Split:  ['Time', '(12)', '[34]', 'am']

Inputs: RegExp:  |^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$|
        String:  |19 Kart, Er J. @007 etc.|
Search: group(0):|19 Kart, Er J. @007 etc.|,
        group(1):|19|,
        group(2):|Kart|,
        group(3):|Er J. |,
        group(4):|@007|
Split:  ['', '19', 'Kart', 'Er J. ', '@007', '']
Inputs: RegExp:  |(\w+ \w+)\s+(\d+)\s+(\d+)\s+(\d+)|
        String:  |Bruce Wayne      85  67  134|
Search: group(0):|Bruce Wayne      85  67  134|,
        group(1):|Bruce Wayne|,
        group(2):|85|,
        group(3):|67|,
        group(4):|134|
Split:  ['', 'Bruce Wayne', '85', '67', '134', '']

Items of Interest or for study:

This example is mostly a single function that takes as inputs a regular expression and a string to test against it. The function tries out search and split in each case, leaving off Python's match because it is not as useful. Python's split shows what comes before and after the match, and it also splits up multiple matches.
The construct range(0, j+1) gives integers from 0 to j inclusive.
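The behavior of split is easiest to see side by side. The sketch below (Python 3 syntax) uses the strings from the runs above plus one made-up comma-separated string; the variable names are illustrative.

```python
import re

# Without groups, split returns only the text between matches.
plain = re.compile(r"\d+:\d+").split("pre12:34post")
# With groups, the captured text is interleaved with those pieces.
grouped = re.compile(r"(\d+):(\d+)").split("pre12:34post")
# Several matches in one string produce several cut points.
many = re.compile(r",\s*").split("a, b,c")

print(plain)      # ['pre', 'post']
print(grouped)    # ['pre', '12', '34', 'post']
print(many)       # ['a', 'b', 'c']

# range(0, j+1) with j = 2 covers 0, 1 and 2.
print(list(range(0, 2 + 1)))   # [0, 1, 2]
```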

2.4. Debug Module: This example puts the function regtest of the previous example into a module to allow its use in more general contexts.

First is the module containing the function. I'll use the same name, regtest.py, for this module. (I don't know if this is a good idea or not, but it works.) Then come two separate programs that make use of this module.

Module: regtest.py
# regtest.py: module with function regtest
import re
import sys

def regtest(reg, dat, delim="|"):
    sys.stdout.write("Inputs: RegExp:  " + delim + reg + delim +
          "\n        String:  " + delim + dat + delim + "\n")
    r = re.compile(reg)
    m = r.search( dat )
    sys.stdout.write("Search: ")
    if m != None:
        j = m.lastindex
        if j != None:
            for i in range(0,j+1):
                g = m.group(i)
                sys.stdout.write("group(" + str(i) +
                   "):" + delim + g);
                if i != j:
                    sys.stdout.write(delim + ",\n        ");
            sys.stdout.write(delim + "\n")
        else:
            sys.stdout.write("ERR: Match, no groups\n")
    else:
        sys.stdout.write("ERR: No Match\n")
    s = r.split( dat )
    sys.stdout.write("Split:  ")
    sys.stdout.write(str(s))
    sys.stdout.write("\n\n")

Program: regfixed.py
# regfixed.py: fixed calls to regtest
import regtest

regtest.regtest(r"(\d+)#(\d+)",    "12:34post")
regtest.regtest(r"\d+:\d+",    "12:34post")
regtest.regtest(r"(\d+):(\d+)",    "12:34post")
regtest.regtest(r"(\d+):(\d+)", "pre12:34post")
regtest.regtest(r"(\(\d+\)):(\[\d+\])", "Time(12):[34]am")
regtest.regtest(r"^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$",
      "19 Kart, Er J. @007 etc.")
regtest.regtest(r"(\w+ \w+)\s+(\d+)\s+(\d+)\s+(\d+)",
      "Bruce Wayne      85  67  134")

Output: same as before
% python regfixed.py
Inputs: RegExp:  |(\d+)#(\d+)|
        String:  |12:34post|
Search: ERR: No Match
Split:  ['12:34post']
... (etc., same as before)

Items of Interest or for study:

This is mostly the same as the example in Section 2.3, except that the function has been moved into a separate file. A file containing code like this is called a module in Python. Another file that uses the module imports it, as regfixed.py does above. The main difference is that a call to the function regtest from the other file must be prefixed with the module name, as in: regtest.regtest.

Program: reginput.py (two versions)

Program, with Run

String Delimiter is "$"

# reginput.py: input data, call regtest
import regtest
import sys # for final output

while True:
    reg = raw_input( "RegExp-->" )
    if reg == "":
        break
    st  = raw_input( "String-->" )
    regtest.regtest(reg, st)
sys.stdout.write("That's all folks!\n")
% python reginput.py
RegExp-->(\d+)#(\d+)
String-->12:34post
Inputs: RegExp:  |(\d+)#(\d+)|
        String:  |12:34post|
Search: ERR: No Match
Split:  ['12:34post']

RegExp-->\d+:\d+
String-->12:34post
Inputs: RegExp:  |\d+:\d+|
        String:  |12:34post|
Search: ERR: Match, no groups
Split:  ['', 'post']

RegExp-->(\d+):(\d+)
String-->12:34post
Inputs: RegExp:  |(\d+):(\d+)|
        String:  |12:34post|
Search: group(0):|12:34|,
        group(1):|12|,
        group(2):|34|
Split:  ['', '12', '34', 'post']

RegExp-->(\d+):(\d+)
String-->pre12:34post
Inputs: RegExp:  |(\d+):(\d+)|
        String:  |pre12:34post|
Search: group(0):|12:34|,
        group(1):|12|,
        group(2):|34|
Split:  ['pre', '12', '34', 'post']

RegExp-->(return)
That's all folks!

# reginput.py: input data, call regtest
import regtest
import sys # for final output

while True:
    reg = raw_input( "RegExp-->" )
    if reg == "":
        break
    st  = raw_input( "String-->" )
    regtest.regtest(reg, st, delim="$")
sys.stdout.write("That's all folks!\n")
% python reginput.py
RegExp-->(\d+)#(\d+)
String-->12:34post
Inputs: RegExp:  $(\d+)#(\d+)$
        String:  $12:34post$
Search: ERR: No Match
Split:  ['12:34post']

RegExp-->\d+:\d+
String-->12:34post
Inputs: RegExp:  $\d+:\d+$
        String:  $12:34post$
Search: ERR: Match, no groups
Split:  ['', 'post']

RegExp-->(\d+):(\d+)
String-->12:34post
Inputs: RegExp:  $(\d+):(\d+)$
        String:  $12:34post$
Search: group(0):$12:34$,
        group(1):$12$,
        group(2):$34$
Split:  ['', '12', '34', 'post']

RegExp-->(\d+):(\d+)
String-->pre12:34post
Inputs: RegExp:  $(\d+):(\d+)$
        String:  $pre12:34post$
Search: group(0):$12:34$,
        group(1):$12$,
        group(2):$34$
Split:  ['pre', '12', '34', 'post']

RegExp-->(return)
That's all folks!

Items of Interest or for study:

This uses the same function as before, but the main part of the code does interactive input of the RE and the string.
The other relatively small change was to make the character that delimits strings into a parameter. This is accomplished by using a default value for a new parameter, namely delim="|". If this third parameter is not used in a call, delim gets the default value of "|". Otherwise we can call the function with a different value for this parameter, shown above with "$". Finally, we can use the name of the parameter in a call, also shown above. Many of these issues are discussed thoroughly at: default parameter values, and calls with the parameter name.
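The three calling styles for a parameter with a default value can be compared on a much smaller function. The helper wrap below is hypothetical, not part of regtest.py; it only borrows the delim parameter to illustrate the idea (Python 3 syntax).

```python
def wrap(text, delim="|"):
    """Return text surrounded by the delimiter (hypothetical helper)."""
    return delim + text + delim

print(wrap("12:34"))              # |12:34|   default value used
print(wrap("12:34", "$"))         # $12:34$   positional argument
print(wrap("12:34", delim="$"))   # $12:34$   keyword argument
```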
It turns out that values like "", [ ], ( ), 0, and None are all treated as false when used as a condition in Python. So the loop can be rewritten as:

Program: reginput.py (three versions of the loop)

Program, with Run

String Delimiter is "$"

while True:
    reg = raw_input( "RegExp-->" )
    if not reg:
        break
    st  = raw_input( "String-->" )
    regtest.regtest(reg, st)

while True:
    reg = raw_input( "RegExp-->" )
    if reg:
        st  = raw_input( "String-->" )
        regtest.regtest(reg, st, "$")
    else:
        break

reg = raw_input( "RegExp-->" )
while reg:
    st  = raw_input( "String-->" )
    regtest.regtest(reg, st, delim="$")
    # next iteration
    reg = raw_input( "RegExp-->" )
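The rewritten loops rely on an empty string counting as false. A short sketch (Python 3 syntax) makes the rule concrete and also shows that "treated as false" is not the same as "equal to False".

```python
# Each of these "empty" values counts as false in a condition.
empties = ("", [], (), 0, None)
for value in empties:
    print(repr(value), bool(value))   # second column is always False

# A non-empty string counts as true, which is what the loops rely on.
print(bool(r"(\d+):(\d+)"))           # True

# Only 0 actually compares equal to False; the empty string does not.
print("" == False, 0 == False)        # False True
```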

In C or Java, you would more commonly use the style below. It doesn't work in Python 2 (or in Python 3 before version 3.8, which added the := operator) because you can't have an assignment statement buried in a while condition.

C/Java style, not Python

while ((dat = get_data()) != EOF) { do_something(); }
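Python 3.8 and later accept an assignment expression, written with :=, inside a condition, which comes close to the C idiom. The sketch below assumes Python 3.8 or newer. Its get_data() is a hypothetical stand-in that returns canned strings instead of reading from the keyboard, so the loop can run unattended.

```python
# Canned input standing in for interactive entry (an assumption for the demo).
_lines = iter([r"(\d+):(\d+)", r"\d+", ""])

def get_data():
    """Return the next canned line, or "" when the input is used up."""
    return next(_lines, "")

seen = []
while (reg := get_data()):   # assign and test in one step
    seen.append(reg)

print(len(seen))             # 2: the loop stopped at the empty string
```

The Python 2 programs on this page cannot use this form, so the loop versions shown earlier remain the ones to use there.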

2.5. Example, Transforming Class Lists: This section gives a Python program that translates a file of data that I used to get from the UTSA system for each course. Each student had a computer science email account, with the first choice for the account name: the first letter of their first name followed by up to seven letters of their last name. (Most accounts followed this pattern; in case of duplicates the system used a succession of backup patterns.) The "before" and "after" for each line looks as follows. I'm trying to illustrate REs here with this particular structure.

1st RE

Old Line: 19 Kartaltepe, Erhan J. @00777777 (extra stuff) ...
New Line: <li>Erhan J. Kartaltepe, Email: ekartalt@cs.utsa.edu
1st RE: r"^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$" is designed for the "old" lines. All three REs are used to produce the "new" lines.

Matches in the 1st RE

RE portion Meaning

^ start of line

(\d+) one or more digits, Match 1

\s+ one or more whitespace chars

([-a-zA-Z]+) one or more letters (or a hyphen), Match 2

, a comma

\s+ one or more whitespace chars

(.+) one or more of any chars up to '@', Match 3

(@\d+) '@', plus one or more digits, Match 4 (unused)

\s+ one or more whitespace chars

.* anything at all

$ end of the line
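The same breakdown can be written directly into the pattern with the re.VERBOSE flag, which ignores whitespace in the pattern and allows # comments. This is a sketch in Python 3 syntax, not the form used in the program on this page; the sample line is the "old" line shown above without its trailing dots.

```python
import re

first_re = re.compile(r"""
    ^              # start of line
    (\d+)          # one or more digits, Match 1
    \s+            # one or more whitespace chars
    ([-a-zA-Z]+)   # one or more letters (or a hyphen), Match 2
    ,              # a comma
    \s+            # one or more whitespace chars
    (.+)           # one or more of any chars up to '@', Match 3
    (@\d+)         # '@', plus one or more digits, Match 4 (unused)
    \s+            # one or more whitespace chars
    .*             # anything at all
    $              # end of the line
""", re.VERBOSE)

m = first_re.search("19 Kartaltepe, Erhan J. @00777777 (extra stuff)")
print(m.groups())   # ('19', 'Kartaltepe', 'Erhan J. ', '@00777777')
```

Note that Match 3 keeps its trailing blank, which is why the program calls strip() on it.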

Here is the Python program that does the translation. Python does not have the Perl style "$" variables. Also, since there are three matches active at the same time, this example uses the fact that one can get the matching characters of all three matches at the same time, something not possible in Perl. (In Perl, the "$" variables would overwrite one another. Of course this is not a real "problem", and you can easily get around it in Perl.) The third match is artificial, just to try out another match, since it just picks off the first character in the string. Even though this example makes use of Python's capabilities, it would be easy to structure it into simple Perl.
This particular example produces the same output with search replaced by match (three times), because in each case the search finds a match starting with the first character.

2nd and 3rd REs

Old String: Kartaltepe
New String: Kartalt
2nd RE: r"[a-zA-Z]{1,7}" fetches 1-7 letters from last name.

Old String: Erhan
New String: E
3rd RE: r"^([A-Z])" fetches first letter from first name.

Finally, E and Kartalt are made lowercase and concatenated to give ekartalt.
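The two small REs and the final concatenation can be collected into one function. The name account_name and its None return for non-matching input are assumptions made for this sketch (Python 3 syntax); the program below does the same work inline.

```python
import re

def account_name(last, first):
    """Build the first-choice account name (sketch of the rule above)."""
    m2 = re.search(r"[a-zA-Z]{1,7}", last)   # 2nd RE: 1-7 letters
    m3 = re.search(r"^([A-Z])", first)       # 3rd RE: leading capital
    if m2 is None or m3 is None:
        return None                          # pattern did not apply
    return m3.group(1).lower() + m2.group(0).lower()

print(account_name("Kartaltepe", "Erhan J. "))   # ekartalt
print(account_name("Wayne", "Bruce"))            # bwayne
```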

File Translation Using Three Regular Expressions
File: stud.test.py

#!/usr/bin/python
import re   # regular expression module
import sys  # for I/O below

sys.stdout.write("<ul>\n")
s = sys.stdin.readline() # fetch next line of input file
while s:
    # below is the main regular expression
    r = re.compile( r"^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$" )
    m = r.search( s ) # match line with RE
    r2 = re.compile( r"([a-zA-Z]{1,7})" ) # a 2nd RE: 1 to 7 letters
    m2 = r2.search(m.group(2)) # match group of 1st match with 2nd RE
    r3 = re.compile( r"^([A-Z])" ) # a 3rd RE: single uc letter
    m3 = r3.search(m.group(3))
    sys.stdout.write("<li>")
    if int(m.group(1)) < 10: # write extra blank
        sys.stdout.write(" ")
    # strip removes initial and terminal whitespace
    sys.stdout.write(m.group(1) + " " + m.group(3).strip() + " " +
        m.group(2) + ", Email: " +
        m3.group(1).lower() + m2.group(1).lower() +
        "@cs.utsa.edu" + "\n")
    s = sys.stdin.readline()
sys.stdout.write("</ul>\nTh-th-th-that's all folks!\n")

Here are the input and output files: input file (text), output file (text), output file (HTML)

2.6. Changing File Names in a Directory: This example shows a very simple systems programming task: alter the file names in a directory in a systematic way, using a regular expression. These are file names for downloaded cartoons, and I wanted to change them so they would be uniform and easier to read. Each original name has 6 digits representing the year, month, and day. I wanted to change "yymmdd" to "20yy-mm-dd", and change everything in front to "bc". Finally leave the ".gif" or ".jpg" alone. Three of the file names were already in the desired format.

Old Names (d)

% ls -1
admin.wpbcl101011.gif
admin.wpbcl101221.gif
wpbcl_c110825.gif
wpbcl_c110826.gif
wpbcl_c111010-2.gif
wpbcl111106.jpg
wpbcl_c111121.gif
wpbcl121017.gif
wpbcl130310.jpg
wpbcl130311.gif
wpbcl130312.gif
wpbcl130602.jpg
bc2013-06-13.gif
bc2013-09-08.jpg
wpbcl130909.gif
wpbcl130910.gif
bc2013-09-18.gif
wpbcl131130.gif
wpbcl131201.jpg
wpbcl131202.gif
wpbcl131230.gif
wpbcl140127.gif
wpbcl140216.jpg
conv.py
.directory

New Names (dr)

% ls -1
bc2010-10-11.gif
bc2010-12-21.gif
bc2011-08-25.gif
bc2011-08-26.gif
bc2011-10-10.gif
bc2011-11-06.jpg
bc2011-11-21.gif
bc2012-10-17.gif
bc2013-03-10.jpg
bc2013-03-11.gif
bc2013-03-12.gif
bc2013-06-02.jpg
bc2013-06-13.gif
bc2013-09-08.jpg
bc2013-09-09.gif
bc2013-09-10.gif
bc2013-09-18.gif
bc2013-11-30.gif
bc2013-12-01.jpg
bc2013-12-02.gif
bc2013-12-30.gif
bc2014-01-27.gif
bc2014-02-16.jpg
conv.py
.directory

Conversion Program (conv.py)

#!/usr/bin/python
import sys    # for sys.stdout.write(<str>)
import os     # for os.listdir(<path>)
import shutil # for shutil.move(<src>,<dst>)
import re     # for re.compile(<re>), search(<re>)
i = 1
for d in os.listdir('.'): # d = original file name
    r = re.compile(r'.*(\d\d)(\d\d)(\d\d).*(gif|jpg)')
    m = r.search( d )
    if m != None:
        dr = "bc20" + m.group(1) + "-" + m.group(2)+ \
             "-" + m.group(3) + "." + m.group(4)
        sys.stdout.write(("%2i" % i) + " Match: \n")
        sys.stdout.write(" Old: " + d + "\n");
        sys.stdout.write(" New: " + dr + "\n");
        shutil.move(d, dr) # d to dr (new file name)
    else:
        sys.stdout.write(("%2i" % i) + " None: ")
        sys.stdout.write(d + "\n")
    i += 1

Run of program, showing matches with changes
% python conv.py
 1 None: bc2013-06-13.gif
 2 None: conv.py
 3 Match:
 Old: wpbcl130310.jpg
 New: bc2013-03-10.jpg
 4 Match:
 Old: wpbcl130311.gif
 New: bc2013-03-11.gif
 5 Match:
 Old: admin.wpbcl101011.gif
 New: bc2010-10-11.gif
 6 Match:
 Old: wpbcl111106.jpg
 New: bc2011-11-06.jpg
 7 Match:
 Old: wpbcl131201.jpg
 New: bc2013-12-01.jpg
 8 Match:
 Old: wpbcl131230.gif
 New: bc2013-12-30.gif
 9 Match:
 Old: wpbcl121017.gif
 New: bc2012-10-17.gif
10 Match:
 Old: wpbcl_c111121.gif
 New: bc2011-11-21.gif
11 None: bc2013-09-08.jpg
12 None: .directory
13 Match:
 Old: wpbcl130909.gif
 New: bc2013-09-09.gif
14 Match:
 Old: wpbcl130602.jpg
 New: bc2013-06-02.jpg
15 Match:
 Old: wpbcl140216.jpg
 New: bc2014-02-16.jpg
16 Match:
 Old: wpbcl130910.gif
 New: bc2013-09-10.gif
17 Match:
 Old: admin.wpbcl101221.gif
 New: bc2010-12-21.gif
18 Match:
 Old: wpbcl131202.gif
 New: bc2013-12-02.gif
19 Match:
 Old: wpbcl130312.gif
 New: bc2013-03-12.gif
20 Match:
 Old: wpbcl140127.gif
 New: bc2014-01-27.gif
21 Match:
 Old: wpbcl_c110825.gif
 New: bc2011-08-25.gif
22 None: bc2013-09-18.gif
23 Match:
 Old: wpbcl_c110826.gif
 New: bc2011-08-26.gif
24 Match:
 Old: wpbcl_c111010-2.gif
 New: bc2011-10-10.gif
25 Match:
 Old: wpbcl131130.gif
 New: bc2013-11-30.gif

Items of Interest or for study:

Python has very extensive systems programming features: The Beazley reference has 100 pages of brief references without examples. This program uses a function listdir from the os module. The function makes a list out of the file names in a given directory. In this case the dot means the current directory, but it could be other directories. (The Python program resided in the directory to be changed, though this would not usually be the case.) Three of the file names were already in the desired form, and in these cases there was no match. There was also no match with the name of the Python program.

The function move from the shutil module allowed the program to rename various files.

The for construct iterates through the strings in the given list. The program doesn't need to set up an integer index and increment it through the elements of the array.

I made a lot of mistakes while writing this code, and it was annoying to test, because the full program changes file names. (I should have proceeded more methodically and carefully. I also edited the data above, changing the order of the initial files so they would match up after the change, and deleting some data that was the result of several mistakes.)
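One way to make this kind of program less annoying to test is to separate the name computation from the renaming, so the regular expression can be checked without touching any files. The sketch below (Python 3 syntax) is an assumption about how that could look, not the approach conv.py takes; the function name new_name is made up, and the pattern is the one from conv.py.

```python
import re

# Same pattern as conv.py, compiled once outside the loop.
PATTERN = re.compile(r'.*(\d\d)(\d\d)(\d\d).*(gif|jpg)')

def new_name(old):
    """Return the converted name, or None if the name does not match."""
    m = PATTERN.search(old)
    if m is None:
        return None
    return "bc20%s-%s-%s.%s" % m.groups()

# Dry run on a few names from the listing; nothing on disk is changed.
for name in ["admin.wpbcl101011.gif", "wpbcl_c111010-2.gif",
             "bc2013-06-13.gif", "conv.py"]:
    print(name, "->", new_name(name))
```

Once the printed pairs look right, the same function could feed shutil.move in the real loop.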

(Revision date: 2014-05-24. Please use ISO 8601, the International Standard.)