CS 3723
 Programming Languages 
  Regular Expressions  


Definition: Regular expressions give another equivalent way to describe languages accepted by NFAs and DFAs. Regular expressions are often used in "scripting" languages, such as Perl, Python, Ruby, etc. Java even has a library implementation.

Presented here are the "bare" regular expressions without the amazing amount of "syntactic sugar" supplied by most implementations. This extra stuff gives a great deal of shorthand notation. However, we are presenting the essentials of the topic.

Start with an alphabet Σ of symbols, such at {a,b,c,...} or {0,1} or the Ascii characters.

A regular expression satisfies a recursive definition:

Definition: Regular Expression
  1. For any symbol a in Σ, the regular expression a represents the symbol itself.
  2. Suppose R, R1, and R2 are arbitrary regular expressions. Then
    • R1 R1 represents all strings that can be written as a string described by the first followed by a string described by the second (the concatenation operator), which is implicit, that it, not explicitly written.
    • R1 | R1 represents the union of strings described by the first and by the second (the or operator),
    • R* represents all strings that can be written as the concatenation of any number of strings from R (the star operator). This includes concatenating zero strings from R, giving the empty string ε. This unary operator appears to the right of its operand.
    • (R), which is the same as R (the parens are used for grouping).

Precedence of Operators
Star: highest, Concatentation: next, Or: lowest

Parentheses are used to alter precedence of operators. Thus a|bc* means a|(b(c*)), whereas a*b|c means ((a*)b)|c. Different meanings require parentheses, as with, say a(b|c), which means an "a" followed by either a "b" or a "c". The expression ab|c would mean either "ab" or "c".

Regular expressions are often written between two slash characters: /a(b|c)/

A blank in a regular expression always stands for an actual blank character. I may sometimes add blanks to examples to improve readability, as in the table above.

Notice that we're using the following symbols as metasymbols:

    |   *   (   )   /
To refer to the actual character one needs a backslash:
    \|   \*   \(   \)   \/
but we won't implement or use this notation.

Regular expressions in programming languages use a number of additional metasymbols, such as

    [   \   ^   $   ?   +  
and very large number of ways to specify classes of characters. Here we will only use two more:
    .        (a dot, refers to any single symbol) 
    @     (for ε, the empty string. Not used in programming languages, so there is no standard notation)

[The unary operator * occurs to the right of its operand. Of the various unary operators you know from C, such as

    +   -   !   ~   *   &   ++   --   sizeof   (double)
only ++ and -- can appear on the right. Of course the C operator * appears on the left. So it is unusual (and perhaps confusing) for the regular expression operator * to appear on the right.]


Revision date: 2013-06-09. (Please use ISO 8601, the International Standard.)