CS 3723
 Programming Languages 
  Regular Expressions  


Introduction: Regular expressions provide a powerful way to describe certain classes of strings (that it, of languages). These are exactly equivalent to those described by finite automata (NFAs and DFAs) as we will see later in the course.

For any language recognized by a regular expression, there is an equivalent CF grammar that will recognize the same language, but not vice-versa. Thus CF grammars are strictly more powerful, but not necessary as easy to use.

Regular expressions are often used in "scripting" languages, such as Perl, Python, Ruby, etc., and Java even has a library implementation.

Presented here are the "bare" regular expressions without the amazing amount of "syntactic sugar" supplied by most implementations. The extra sugar gives a great deal of convenient shorthand notation. However, we are presenting the essentials of the topic.


Definition: Start with an alphabet Σ of symbols, such as {a,b,c,...} or {0,1} or the Ascii characters.

    A regular expression satisfies a recursive definition:

    Definition: Regular Expression
    1. For any symbol a in Σ, the regular expression a represents the symbol itself.
    2. Suppose R, R1, and R2 are arbitrary regular expressions. Then
      • R1 R2 represents all strings that can be written as a string described by the first followed by a string described by the second. (This uses the concatenation operator, which is implicit, that it, not explicitly written.)
      • R1 | R2 represents the union of strings described by the first and by the second (the or operator),
      • R* represents all strings that can be written as the concatenation of any number (including none) of strings from R (the star operator). This includes concatenating zero strings from R, giving the empty string ε. This unary operator appears to the right of its operand.
      • (R), which is the same as R (the parens are used for grouping).

    Precedence of Operators
    Star: highest, Concatenation: next, Or: lowest.

    Precedence: Parentheses are used to alter precedence of operators. Thus a|bc* means a|(b(c*)), whereas a*b|c means ((a*)b)|c. Different meanings require parentheses, as with, say a(b|c), which means an "a" followed by either a "b" or a "c". The expression ab|c would mean either "ab" or "c".

    How Written: Regular expressions are often written between two slash characters: /a(b|c)/
    This is the case for Perl and Ruby. Python puts the regular expression inside a "raw" string:
    r'a(b|c)', or  R"a(b|c)"

    Blanks: A blank in a regular expression always stands for an actual blank character. I may sometimes add blanks to examples to improve readability, as in the table above.

    Metasymbols: Notice that we're using the following symbols as metasymbols:

       |    *    (    )    /
    To refer to the actual character one needs a backslash:
      \|   \*   \(   \)   \/
    and since we want to use a backslash as a metasymbol, we need a way to represent an actual backslash character, which would usually be  \\ .

    When we implement regular expressions, we will not be using any metasymbols as actual characters, and we will not be using the backslash character. (When we use regular expressions in Python, we will need to be careful about these special characters.)

    For our implementation, we will use only one other special symbol:

      @     (for ε, the empty string. Not used in programming languages, so there is no standard notation)

    The Empty String: Of course programming languages make use of the empty string, often denoted by "", that is, nothing inside double quotes. (Notice that the empty string is not some sort of null object; it is a string with zero characters in it, a string of length 0.) However, no programming language has a direct way to refer to the empty string inside a regular expression.


Regular expressions in a programming language: Most programming languages that implement regular expressions have a fairly standard form for the regular expressions themselves. Here there are a number of additional metasymbols, such as
    [   \   ^   $   ?   +  
and very large number of ways to specify classes of characters.

Additional Notation: Just a few examples here. Below, R is any regular expression.

  •  R+ This means "one or more repetitions of" what comes before. We can easily get this by writing (RR*)

  •  R? This means "zero or one repetitions of" what comes before. We can easily get this by writing (R|ε) Notice that this is not available in REs in a programming language.

  •  [a-z] This means any lower-case letter. This could be represented by (a|b|c|...|z), where the dots are not part of the notation (needs all 26 letters). So this and many other similar ranges are handy.

The link: Regular Expressions shows many of the special possibilities in Python, while REs (tutorialspoint) gives a complete list at the end.

Unary Operators: The unary operator * occurs to the right of its operand. Of the various unary operators you know from C, such as

    +   -   !   ~   *   &   ++   − −   sizeof   (double)
only ++ and − − can appear on the right. Of course the C operator * appears on the left. So it is unusual (and perhaps confusing) for the regular expression operator * to appear on the right.

(Revision date: 2014-06-30. Please use ISO 8601, the International Standard.)