CS 3723
  Programming Languages  
  2. Regular Expressions   

For general material about the theory of regular expressions, unrelated to Python, see Regular Expressions.


2.1. The Regular Expression Module, re: Python has extensive facilities for regular expressions, just like other common scripting languages such as Perl or Ruby. These are provided in Python as a library module that must be imported to use regular expressions. (See REs (tutorialspoint) for a detailed discussion, and Regular Expressions for some information.)


    Raw Strings: In Python, a raw string is a string (enclosed in single or double quotes) preceded by an r or R. Such strings leave backslash characters (and all other characters) intact. Raw strings are mainly used in regular expressions and in applications like specifying a Windows filename. All regular expressions in this write-up will be written as raw strings. Examples: r'(a*)b(c*)' or R"(a|b)abb". A raw string cannot end in a single backslash, as with r"\": the backslash keeps the following double-quote from ending the string (both characters stay in the string), so the example is not a completed string, missing the final double-quote.
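As a quick check of the raw-string rules above, here is a small sketch comparing an ordinary string with a raw one. It is written to run under Python 3 (print as a function), which is an assumption, since the examples in these notes use Python 2. The variable names and the Windows path are made up for illustration.

```python
# Compare an ordinary string literal with a raw string literal.
plain = "\n"        # one character: a newline
raw = r"\n"         # two characters: a backslash and the letter n

print(len(plain))   # 1
print(len(raw))     # 2

# A backslash before a quote keeps the quote from ending a raw string,
# but both characters remain in the string.
quoted = r"\""
print(len(quoted))  # 2
print(quoted)       # \"

# A hypothetical Windows path, written without doubling the backslashes.
path = r"C:\temp\notes.txt"
print(path)         # C:\temp\notes.txt
```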


2.2. Initial Examples: Here are three initial examples to get started:

Example 1 (program, followed by its output)
import re
r = re.compile( r"(\d\d):(\d\d) (am|pm)" )
m = r.search( "12:45 pm" )
g = m.groups()
print g
('12', '45', 'pm')

Example 2 (program, followed by its output)
import re
r = re.compile( r"(\d\d):(\d\d) (am|pm)" )
while True:
    s= raw_input( "-->" )
    m = r.search( s )
    if m != None:
        g = m.groups()
        print g
    else:
        print "No match!"
        break
-->12:45 pm
('12', '45', 'pm')
-->01:22 am
('01', '22', 'am')
-->12:45pm
No match!

Example 3 (program, followed by its output)
import re
import sys
r = re.compile( r"(\d\d):(\d\d) (am|pm)" )
while True:
    s= raw_input( "-->" )
    m = r.search( s )
    if m != None:
        j = m.lastindex
        for i in range(0,j+1):
            g = m.group(i)
            sys.stdout.write(g)
            if i != j:
                sys.stdout.write(", ")
        sys.stdout.write("\n")
    else:
        print "No match!"
        break
-->12:45 pm
12:45 pm, 12, 45, pm
-->01:22 am
01:22 am, 01, 22, am
-->12:45pm
No match!
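The match-object calls used in the three examples above can also be tried without the input loop. The following is only a sketch, and it assumes Python 3 syntax (print as a function) rather than the Python 2 syntax of the examples. The pattern and test strings are the ones used above.

```python
import re

r = re.compile(r"(\d\d):(\d\d) (am|pm)")
m = r.search("12:45 pm")

print(m.groups())    # ('12', '45', 'pm')
print(m.group(0))    # 12:45 pm   (the entire match)
print(m.group(1))    # 12
print(m.lastindex)   # 3

# One group per line, which stays readable even if a group contains commas.
for i in range(0, m.lastindex + 1):
    print(i, m.group(i))

# A failed search returns None rather than a match object.
print(r.search("12:45pm"))   # None
```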

Items of Interest or for study:

  • The examples show the function compile (from the module re) used to take a regular expression (as a raw string) and produce a regular expression object, named r here, that can be used for matches. r has methods match, search, and split among others. These examples use search.

  • In each example the method search is used to produce match data in an object named m. The match data object has methods and attributes to extract information about the match. Illustrated here are two methods and one attribute:
    • groups(), which gives each of the matches that came from ( ) parts of the regular expression.
    • group(), with an integer parameter that gives the group number desired. (group(0) is the entire match.)
    • lastindex, an attribute (not a method) holding the integer index of the last group that matched (here the highest group number), or None if no group matched.

  • If the match fails, None (Python's "null object") is returned. Examples 2 and 3 check for None, so that the program itself won't fail. Example 3 prints the individual groups separated by commas. This is confusing if there are any commas in an individual group, so the more general program below prints the groups on separate lines. (Printing the tuple returned by groups(), as in Examples 1 and 2, shows each group inside quotes, and it even handles quotes themselves correctly if any are in a group; group() itself returns the plain string.)

  • Examples 2 and 3 show the raw_input() construct of Python 2.x (which becomes just input() in Python 3). This construct uses the string inside parens as a prompt, and then fetches all the characters up to a "return". (See Input Functions.)

  • while True: is often written while 1:

  • match versus search: What Python calls search is what Perl and Ruby use by default and call match. This is what we also will usually want to use. Python's match only matches an initial segment of the input string, while Python's search is willing to skip over any number of initial characters while looking for a match. (Both search and match allow anything at the end.)

    Rule
    Use search instead of match
    in Python regular expressions.
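The difference between match and search can be seen with a string that has extra characters in front of the time. This is a sketch in Python 3 syntax (an assumption; the notes use Python 2), and the test string is made up.

```python
import re

r = re.compile(r"(\d\d):(\d\d) (am|pm)")
text = "Lunch at 12:45 pm sharp"

# match insists that the pattern start at the first character.
print(r.match(text))             # None

# search skips over the leading characters to find the pattern.
print(r.search(text).group(0))   # 12:45 pm

# Both allow extra characters after the matched part.
print(r.match("12:45 pm sharp").group(0))   # 12:45 pm
```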


2.3. Debug Example: Here is an example that uses an arbitrary regular expression and an arbitrary string as input. It then prints the matched patterns. All strings are printed between "|" characters.

This is called a Debug Example because you should use it or something similar whenever you write a Python program involving regular expressions. These REs are error prone, so you should first debug the RE you propose to use before going on with the rest of the program.

Debugging Regular Expressions
Program, with Data, followed by its Output
# regular.py: test regular expressions
import re
import sys

def regtest(reg, dat): 
    sys.stdout.write("Inputs: RegExp:  |" + reg +
          "|\n        String:  |" + dat + "|\n")
    r = re.compile(reg)
    # first search (not match)
    m = r.search( dat )
    sys.stdout.write("Search: ")
    if m != None:
        j = m.lastindex
        if j != None:
            for i in range(0,j+1):
                g = m.group(i)
                sys.stdout.write("group(" + str(i) +
                   "):|" + g)
                if i != j:
                    sys.stdout.write("|,\n        ")
            sys.stdout.write("|\n")
        else:
            sys.stdout.write("ERR: Match, no groups\n")
    else:
        sys.stdout.write("ERR: No Match\n")
    # now try split
    s = r.split( dat )
    sys.stdout.write("Split:  ")
    sys.stdout.write(str(s))
    sys.stdout.write("\n\n")

regtest(r"(\d+)#(\d+)",    "12:34post")
regtest(r"\d+:\d+",    "12:34post")
regtest(r"(\d+):(\d+)",    "12:34post")
regtest(r"(\d+):(\d+)", "pre12:34post")
regtest(r"(\(\d+\)):(\[\d+\])", "Time(12):[34]am")
regtest(r"^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$",
      "19 Kart, Er J. @007 etc.")
regtest(r"(\w+ \w+)\s+(\d+)\s+(\d+)\s+(\d+)",
      "Bruce Wayne      85  67  134")
% python regular.py

Inputs: RegExp:  |(\d+)#(\d+)|
        String:  |12:34post|
Search: ERR: No Match
Split:  ['12:34post']

Inputs: RegExp:  |\d+:\d+|
        String:  |12:34post|
Search: ERR: Match, no groups
Split:  ['', 'post']

Inputs: RegExp:  |(\d+):(\d+)|
        String:  |12:34post|
Search: group(0):|12:34|,
        group(1):|12|,
        group(2):|34|
Split:  ['', '12', '34', 'post']

Inputs: RegExp:  |(\d+):(\d+)|
        String:  |pre12:34post|
Search: group(0):|12:34|,
        group(1):|12|,
        group(2):|34|
Split:  ['pre', '12', '34', 'post']

Inputs: RegExp:  |(\(\d+\)):(\[\d+\])|
        String:  |Time(12):[34]am|
Search: group(0):|(12):[34]|,
        group(1):|(12)|,
        group(2):|[34]|
Split:  ['Time', '(12)', '[34]', 'am']

Inputs: RegExp:  |^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$|
        String:  |19 Kart, Er J. @007 etc.|
Search: group(0):|19 Kart, Er J. @007 etc.|,
        group(1):|19|,
        group(2):|Kart|,
        group(3):|Er J. |,
        group(4):|@007|
Split:  ['', '19', 'Kart', 'Er J. ', '@007', '']

Inputs: RegExp:  |(\w+ \w+)\s+(\d+)\s+(\d+)\s+(\d+)|
        String:  |Bruce Wayne      85  67  134|
Search: group(0):|Bruce Wayne      85  67  134|,
        group(1):|Bruce Wayne|,
        group(2):|85|,
        group(3):|67|,
        group(4):|134|
Split:  ['', 'Bruce Wayne', '85', '67', '134', '']

Items of Interest or for study:

  • This example is mostly a single function that takes as inputs a regular expression and a string for input to the regular expression. The function tries out search and split in each case, leaving off Python's match because it is not as useful. Python's split shows what comes before and after the match, and it also splits up multiple matches.

  • The construct range(0, j+1) gives integers from 0 to j inclusive.
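The behavior of split, and the reason regtest checks lastindex against None, can be tried in isolation. This sketch assumes Python 3 syntax and uses the module-level functions re.split and re.search instead of a compiled object. The first two calls reuse inputs from the run above, and the comma-separated string is made up.

```python
import re

# Without groups, split returns only the text around the match.
no_groups = re.split(r"\d+:\d+", "12:34post")
print(no_groups)       # ['', 'post']

# With groups, the text of each group is included in the list.
with_groups = re.split(r"(\d+):(\d+)", "pre12:34post")
print(with_groups)     # ['pre', '12', '34', 'post']

# Several matches in one string give several pieces.
pieces = re.split(r",\s*", "a, b,c")
print(pieces)          # ['a', 'b', 'c']

# A match on a pattern with no groups has lastindex equal to None,
# which is the "Match, no groups" case in regtest.
m = re.search(r"\d+:\d+", "12:34post")
print(m.group(0))      # 12:34
print(m.lastindex)     # None
```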


2.4. Debug Module: This example puts the function regtest of the previous example into a module to allow its use in more general contexts.

First is the module containing the function. I'll use the same name for this module: regtest.py. (I don't know if this is a good idea or not, but it works.) Then come two separate programs that make use of this module.

Module: regtest.py
# regtest.py: module with function regtest
import re
import sys

def regtest(reg, dat, delim="|"): 
    sys.stdout.write("Inputs: RegExp:  " + delim + reg +
           delim + "\n        String:  " + delim + dat +
           delim + "\n")
    r = re.compile(reg)
    m = r.search( dat )
    sys.stdout.write("Search: ")
    if m != None:
        j = m.lastindex
        if j != None:
            for i in range(0,j+1):
                g = m.group(i)
                sys.stdout.write("group(" + str(i) +
                   "):" + delim + g)
                if i != j:
                    sys.stdout.write(delim + ",\n        ")
            sys.stdout.write(delim + "\n")
        else:
            sys.stdout.write("ERR: Match, no groups\n")
    else:
        sys.stdout.write("ERR: No Match\n")
    s = r.split( dat )
    sys.stdout.write("Split:  ")
    sys.stdout.write(str(s))
    sys.stdout.write("\n\n")
Program: regfixed.py
# regfixed.py: fixed calls to regtest
import regtest

regtest.regtest(r"(\d+)#(\d+)",    "12:34post")
regtest.regtest(r"\d+:\d+",    "12:34post")
regtest.regtest(r"(\d+):(\d+)",    "12:34post")
regtest.regtest(r"(\d+):(\d+)", "pre12:34post")
regtest.regtest(r"(\(\d+\)):(\[\d+\])", "Time(12):[34]am")
regtest.regtest(r"^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$",
      "19 Kart, Er J. @007 etc.")
regtest.regtest(r"(\w+ \w+)\s+(\d+)\s+(\d+)\s+(\d+)",
      "Bruce Wayne      85  67  134")
Output: same as before
% python regfixed.py
Inputs: RegExp:  |(\d+)#(\d+)|
        String:  |12:34post|
Search: ERR: No Match
Split:  ['12:34post']
...
(etc., same as before)

Items of Interest or for study:

  • This is mostly the same as the example in Section 2.3, except that the function has been separated out into a separate file. A file containing code like this is called a module in Python. This file is imported into another file that uses it, as shown in this section. The main difference in this case is that in calling the function regtest inside the other file, you have to prefix the function name with the module name, as with: regtest.regtest.

Program: reginput.py (two versions)
Program, with Run (first version uses the default delimiter; second version: String Delimiter is "$")
# reginput.py: input data, call regtest
import regtest
import sys # for final output

while True:
    reg = raw_input( "RegExp-->" )
    if reg == "":
        break
    st  = raw_input( "String-->" )
    regtest.regtest(reg, st)
sys.stdout.write("That's all folks!\n")

% python reginput.py
RegExp-->(\d+)#(\d+)
String-->12:34post
Inputs: RegExp:  |(\d+)#(\d+)|
        String:  |12:34post|
Search: ERR: No Match
Split:  ['12:34post']

RegExp-->\d+:\d+
String-->12:34post
Inputs: RegExp:  |\d+:\d+|
        String:  |12:34post|
Search: ERR: Match, no groups
Split:  ['', 'post']

RegExp-->(\d+):(\d+)
String-->12:34post
Inputs: RegExp:  |(\d+):(\d+)|
        String:  |12:34post|
Search: group(0):|12:34|,
        group(1):|12|,
        group(2):|34|
Split:  ['', '12', '34', 'post']

RegExp-->(\d+):(\d+)
String-->pre12:34post
Inputs: RegExp:  |(\d+):(\d+)|
        String:  |pre12:34post|
Search: group(0):|12:34|,
        group(1):|12|,
        group(2):|34|
Split:  ['pre', '12', '34', 'post']

RegExp-->(return)
That's all folks!
# reginput.py: input data, call regtest
import regtest
import sys # for final output

while True:
    reg = raw_input( "RegExp-->" )
    if reg == "":
        break
    st  = raw_input( "String-->" )
    regtest.regtest(reg, st, delim="$")
sys.stdout.write("That's all folks!\n")

% python reginput.py
RegExp-->(\d+)#(\d+)
String-->12:34post
Inputs: RegExp:  $(\d+)#(\d+)$
        String:  $12:34post$
Search: ERR: No Match
Split:  ['12:34post']

RegExp-->\d+:\d+
String-->12:34post
Inputs: RegExp:  $\d+:\d+$
        String:  $12:34post$
Search: ERR: Match, no groups
Split:  ['', 'post']

RegExp-->(\d+):(\d+)
String-->12:34post
Inputs: RegExp:  $(\d+):(\d+)$
        String:  $12:34post$
Search: group(0):$12:34$,
        group(1):$12$,
        group(2):$34$
Split:  ['', '12', '34', 'post']

RegExp-->(\d+):(\d+)
String-->pre12:34post
Inputs: RegExp:  $(\d+):(\d+)$
        String:  $pre12:34post$
Search: group(0):$12:34$,
        group(1):$12$,
        group(2):$34$
Split:  ['pre', '12', '34', 'post']

RegExp-->(return)
That's all folks!
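The second version above passes delim="$" to regtest. The mechanism behind this, a parameter with a default value that a call may override either by position or by name, can be seen in a much smaller function. The function show below is a hypothetical helper, not part of regtest.py, and the sketch assumes Python 3 syntax.

```python
def show(text, delim="|"):
    """Return text wrapped in the delimiter (hypothetical helper)."""
    return delim + text + delim

print(show("12:34post"))              # |12:34post|   (default delimiter)
print(show("12:34post", "$"))         # $12:34post$   (by position)
print(show("12:34post", delim="$"))   # $12:34post$   (by name)
```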

Items of Interest or for study:

  • This uses the same function as before, but the main part of the code does interactive input of the RE and the string.

  • The other relatively small change was to make the character that delimits strings into a parameter. This is accomplished by using a default value for a new parameter, namely delim="|". If this third parameter is not used in a call, delim gets the default value of "|". Otherwise we can call with a different value for this parameter, shown above with "$". Finally, we can use the name of the parameter in a call, also shown above. Many of these issues are discussed thoroughly at: default parameter values, and calls with the parameter name.

  • It turns out that constants like "", [ ], ( ), 0, and None are all treated as false when used as a condition in Python. So the loop can be rewritten as:
Program: reginput.py (three versions of the loop)
Version 1 (default string delimiter):
while True:
    reg = raw_input( "RegExp-->" )
    if not reg:
        break
    st  = raw_input( "String-->" )
    regtest.regtest(reg, st)
Version 2 (string delimiter is "$"):
while True:
    reg = raw_input( "RegExp-->" )
    if reg:
        st  = raw_input( "String-->" )
        regtest.regtest(reg, st, "$")
    else:
        break
Version 3 (string delimiter is "$"):
reg = raw_input( "RegExp-->" )
while reg:
    st  = raw_input( "String-->" )
    regtest.regtest(reg, st, delim="$")
    # next iteration
    reg = raw_input( "RegExp-->" )

 
    In C or Java, you would more commonly use the style below, which doesn't work in Python 2 (or in Python 3 before version 3.8) because an assignment statement cannot be buried in a while condition.

    C/Java style, not Python
    while ((dat = get_data()) != EOF) { 
       do_something();
    }
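For comparison, here is a sketch of the closest Python equivalent. It assumes Python 3.8 or later, where the := operator (an assignment expression) may appear inside a condition; it will not run under the Python 2 used elsewhere in these notes. The function get_data is a hypothetical stand-in for raw_input that returns canned strings, with an empty string marking the end.

```python
# Requires Python 3.8 or later.
lines = iter(["first", "second", ""])   # an empty string marks the end

def get_data():
    """Hypothetical stand-in for raw_input(): return the next canned line."""
    return next(lines)

seen = []
while (dat := get_data()):   # assign and test in one step
    seen.append(dat)

print(seen)   # ['first', 'second']
```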


2.5. Example, Transforming Class Lists: This section gives a Python program that translates a file of data that I used to get from the UTSA system for each course. Each student had a computer science email account, with the first choice for the account name: the first letter of their first name followed by up to seven letters of their last name. (Most accounts followed this pattern; in case of duplicates the system used a succession of backup patterns.) The "before" and "after" for each line look as follows. I'm trying to illustrate REs here with this particular structure.

1st RE
Old Line: 19 Kartaltepe, Erhan J.  @00777777 (extra stuff) ...
New Line: <li>Erhan J. Kartaltepe, Email: ekartalt@cs.utsa.edu

1st RE: r"^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$"
   is designed for the "old" lines.
All three REs are used to produce the "new" lines.

Matches in the 1st RE
RE portion      Meaning
^               start of line
(\d+)           one or more digits, Match 1
\s+             one or more whitespace chars
([-a-zA-Z]+)    one or more letters (or a hyphen), Match 2
,               a comma
\s+             one or more whitespace chars
(.+)            one or more of any chars up to '@', Match 3
(@\d+)          '@', plus one or more digits, Match 4 (unused)
\s+             one or more whitespace chars
.*              anything at all
$               end of the line
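The pieces listed above can be checked against the sample "old" line. This is a sketch in Python 3 syntax (an assumption; the program below is Python 2), using the 1st RE and the old line exactly as given earlier in this section.

```python
import re

old_line = "19 Kartaltepe, Erhan J.  @00777777 (extra stuff) ..."
r = re.compile(r"^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$")
m = r.search(old_line)

print(m.group(1))         # 19
print(m.group(2))         # Kartaltepe
print(repr(m.group(3)))   # 'Erhan J.  '   (trailing blanks are part of Match 3)
print(m.group(4))         # @00777777
```

Match 3 keeps the blanks in front of the '@', which is why a program using it would typically call strip() on that group.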

Here is the Python program that does the translation. Python does not have the Perl style "$" variables. Also, since there are three matches active at the same time, this example uses the fact that one can get the matching characters of all three matches at the same time, something not possible in Perl. (In Perl, the "$" variables would overwrite one another. Of course this is not a real "problem", and you can easily get around it in Perl.) The third match is artificial, just to try out another match, since it just picks off the first character in the string. Even though this example makes use of Python's capabilities, it would be easy to structure it into simple Perl.

This particular example produces the same output with search replaced by match (three times), because in each case the search finds a match starting with the first character.

2nd and 3rd REs
Old String: Kartaltepe
New String: Kartalt

2nd RE: r"([a-zA-Z]{1,7})"
 fetches 1-7 letters from last name.

Old String: Erhan
New String: E

3rd RE: r"^([A-Z])"
 fetches first letter from first name.

Finally E and Kartalt are made lowercase and concatenated to give ekartalt.
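The account-name step can be tried by itself. The sketch below assumes Python 3 syntax and applies the 2nd and 3rd REs (written with a group, as in the program that follows) to the sample names; the variable names are illustrative only.

```python
import re

last, first = "Kartaltepe", "Erhan J."

m2 = re.search(r"([a-zA-Z]{1,7})", last)   # up to 7 letters of the last name
m3 = re.search(r"^([A-Z])", first)         # first letter of the first name

account = m3.group(1).lower() + m2.group(1).lower()
print(m2.group(1))                # Kartalt
print(m3.group(1))                # E
print(account + "@cs.utsa.edu")   # ekartalt@cs.utsa.edu
```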

File Translation Using Three Regular Expressions
File: stud.test.py
#!/usr/bin/python
import re   # regular expression module
import sys  # for I/O below
sys.stdout.write("<ul>\n")
s = sys.stdin.readline() # fetch next line of input file
while s:
    # below is the main regular expression
    r = re.compile( r"^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$" )
    m = r.search( s )  # match line with RE
    
    r2 = re.compile( r"([a-zA-Z]{1,7})" )  # a 2nd RE: 1 to 7 letters
    m2 = r2.search(m.group(2))  # match group of 1st match with 2nd RE
    
    r3 = re.compile( r"^([A-Z])" )  # a 3rd RE: single uc letter
    m3 = r3.search(m.group(3))  

    sys.stdout.write("<li>")
    if int(m.group(1)) < 10:  # write extra blank
        sys.stdout.write(" ")
    # strip removes initial and terminal whitespace
    sys.stdout.write(m.group(1) + " " + m.group(3).strip() +
        " " + m.group(2) + ", Email: " + m3.group(1).lower() +
        m2.group(1).lower() + "@cs.utsa.edu" + "\n")
    s = sys.stdin.readline()
        
sys.stdout.write("</ul>\nTh-th-th-that's all folks!\n")

Here are the input and output files: input file (text),   output file (text),   output file (HTML)


2.6. Changing File Names in a Directory: This example shows a very simple systems programming task: alter the file names in a directory in a systematic way, using a regular expression. These are file names for downloaded cartoons, and I wanted to change them so they would be uniform and easier to read. Each original name has 6 digits representing the year, month, and day. I wanted to change "yymmdd" to "20yy-mm-dd", and change everything in front to "bc". Finally leave the ".gif" or ".jpg" alone. Three of the file names were already in the desired format.

Old Names (d), New Names (dr), and Conversion Program (conv.py), in that order
% ls -1
admin.wpbcl101011.gif
admin.wpbcl101221.gif
wpbcl_c110825.gif
wpbcl_c110826.gif
wpbcl_c111010-2.gif
wpbcl111106.jpg
wpbcl_c111121.gif
wpbcl121017.gif
wpbcl130310.jpg
wpbcl130311.gif
wpbcl130312.gif
wpbcl130602.jpg
bc2013-06-13.gif
bc2013-09-08.jpg
wpbcl130909.gif
wpbcl130910.gif
bc2013-09-18.gif
wpbcl131130.gif
wpbcl131201.jpg
wpbcl131202.gif
wpbcl131230.gif
wpbcl140127.gif
wpbcl140216.jpg
conv.py
.directory
% ls -1
bc2010-10-11.gif
bc2010-12-21.gif
bc2011-08-25.gif
bc2011-08-26.gif
bc2011-10-10.gif
bc2011-11-06.jpg
bc2011-11-21.gif
bc2012-10-17.gif
bc2013-03-10.jpg
bc2013-03-11.gif
bc2013-03-12.gif
bc2013-06-02.jpg
bc2013-06-13.gif
bc2013-09-08.jpg
bc2013-09-09.gif
bc2013-09-10.gif
bc2013-09-18.gif
bc2013-11-30.gif
bc2013-12-01.jpg
bc2013-12-02.gif
bc2013-12-30.gif
bc2014-01-27.gif
bc2014-02-16.jpg
conv.py
.directory
#!/usr/bin/python
import sys # for sys.stdout.write(<str>)
import os  # for os.listdir(<path>)
import shutil # for shutil.move(<src>,<dst>)
import re  # for re.compile(<re>), r.search(<str>)
i = 1
for d in os.listdir('.'): # d = original file name
    r = re.compile(r'.*(\d\d)(\d\d)(\d\d).*(gif|jpg)')
    m = r.search( d )
    if m != None:
        dr = "bc20" + m.group(1) + "-" + m.group(2)+ \
             "-" + m.group(3) + "." + m.group(4)
        sys.stdout.write(("%2i" % i) + " Match: \n")
        sys.stdout.write("     Old: " + d  + "\n")
        sys.stdout.write("     New: " + dr + "\n")
        shutil.move(d, dr) # d to dr (new file name)
    else:
        sys.stdout.write(("%2i" % i) + " None:  ")
        sys.stdout.write(d + "\n")
    i += 1
Run of program, showing matches with changes
% python conv.py
 1 None:  bc2013-06-13.gif
 2 None:  conv.py
 3 Match: 
     Old: wpbcl130310.jpg
     New: bc2013-03-10.jpg
 4 Match: 
     Old: wpbcl130311.gif
     New: bc2013-03-11.gif
 5 Match: 
     Old: admin.wpbcl101011.gif
     New: bc2010-10-11.gif
 6 Match: 
     Old: wpbcl111106.jpg
     New: bc2011-11-06.jpg
 7 Match: 
     Old: wpbcl131201.jpg
     New: bc2013-12-01.jpg
 8 Match: 
     Old: wpbcl131230.gif
     New: bc2013-12-30.gif
 9 Match: 
     Old: wpbcl121017.gif
     New: bc2012-10-17.gif
10 Match: 
     Old: wpbcl_c111121.gif
     New: bc2011-11-21.gif
11 None:  bc2013-09-08.jpg
12 None:  .directory
13 Match: 
     Old: wpbcl130909.gif
     New: bc2013-09-09.gif
14 Match: 
     Old: wpbcl130602.jpg
     New: bc2013-06-02.jpg
15 Match: 
     Old: wpbcl140216.jpg
     New: bc2014-02-16.jpg
16 Match: 
     Old: wpbcl130910.gif
     New: bc2013-09-10.gif
17 Match: 
     Old: admin.wpbcl101221.gif
     New: bc2010-12-21.gif
18 Match: 
     Old: wpbcl131202.gif
     New: bc2013-12-02.gif
19 Match: 
     Old: wpbcl130312.gif
     New: bc2013-03-12.gif
20 Match: 
     Old: wpbcl140127.gif
     New: bc2014-01-27.gif
21 Match: 
     Old: wpbcl_c110825.gif
     New: bc2011-08-25.gif
22 None:  bc2013-09-18.gif
23 Match: 
     Old: wpbcl_c110826.gif
     New: bc2011-08-26.gif
24 Match: 
     Old: wpbcl_c111010-2.gif
     New: bc2011-10-10.gif
25 Match: 
     Old: wpbcl131130.gif
     New: bc2013-11-30.gif

Items of Interest or for study:

  • Python has very extensive systems programming features: The Beazley reference has 100 pages of brief references without examples. This program uses a function listdir from the os module. The function makes a list out of the file names in a given directory. In this case the dot means the current directory, but it could be other directories. (The Python program resided in the directory to be changed, though this would not usually be the case.) Three of the file names were already in the desired form, and in these cases there was no match. There was also no match with the name of the Python program.

  • The function move from the shutil module allowed the program to rename various files.

  • The for construct iterates through the strings in the given list. The program doesn't need to set up an integer index and increment it through the elements of the list.

  • I made a lot of mistakes while writing this code, and it was annoying to test, because the full program changes file names. (I should have proceeded more methodically and carefully. I also edited the data above, changing the order of the initial files so they would match up after the change, and deleting some data that was the result of several mistakes.)
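One way to test this kind of program more safely is a dry run that only computes the new names and never touches the file system. The sketch below assumes Python 3 syntax and uses the same regular expression as conv.py. The function new_name is a hypothetical helper, not part of the original program, and the sample names are taken from the listing above.

```python
import re

# Same pattern as in conv.py, compiled once.
PATTERN = re.compile(r'.*(\d\d)(\d\d)(\d\d).*(gif|jpg)')

def new_name(old):
    """Return the converted file name, or None if old does not match."""
    m = PATTERN.search(old)
    if m is None:
        return None
    # groups() gives (yy, mm, dd, extension)
    return "bc20%s-%s-%s.%s" % m.groups()

# Dry run: print what would happen, without renaming anything.
for name in ["admin.wpbcl101011.gif", "wpbcl_c111010-2.gif",
             "bc2013-06-13.gif", "conv.py"]:
    print(name, "->", new_name(name))
```

Once the printed pairs look right, the call to shutil.move can be added back in.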

(Revision date: 2014-05-24. Please use ISO 8601, the International Standard.)