|
 |
CS 3723
Programming Languages |
Regular Expressions |
Introduction:
Regular expressions provide a powerful way to describe certain
classes of strings (that it, of languages). These are exactly
equivalent to those described by finite automata
(NFAs and DFAs) as we will see later in the course.
For any language recognized by a regular expression, there is
an equivalent CF grammar that will recognize the same language, but
not vice-versa. Thus CF grammars are strictly more powerful, but
not necessary as easy to use.
Regular expressions are often used in "scripting" languages,
such as Perl, Python, Ruby, etc., and
Java even has a library implementation.
Presented here are the "bare" regular expressions without
the amazing amount of "syntactic sugar" supplied by most
implementations. The extra sugar gives a great
deal of convenient shorthand notation.
However, we are presenting the essentials of the topic.
Definition:
Start with an alphabet Σ
of symbols,
such as {a,b,c,...} or {0,1} or the Ascii characters.
A regular expression
satisfies a recursive definition:
Definition: Regular Expression
|
- For any symbol a in Σ,
the regular expression a represents
the symbol itself.
- Suppose R,
R1, and
R2
are arbitrary regular expressions. Then
- R1 R2
represents all strings that can be written as
a string described by the first followed by a string
described by the second. (This uses the concatenation operator,
which is implicit, that it, not explicitly written.)
- R1 | R2
represents the union of strings described by the first and by the second
(the or operator),
- R*
represents all strings that can be written as the concatenation
of any number (including none) of strings from R
(the star operator).
This includes concatenating zero strings from
R, giving the empty
string ε. This unary operator appears to
the right of its operand.
- (R), which is the
same as R (the parens
are used for grouping).
|
Precedence of Operators
|
Star: highest, Concatenation: next,
Or: lowest. |
Precedence:
Parentheses are used to alter precedence of operators.
Thus a|bc* means
a|(b(c*)), whereas
a*b|c means
((a*)b)|c. Different meanings
require parentheses, as with, say a(b|c),
which means an "a" followed by either a "b" or a "c".
The expression ab|c would mean
either "ab" or "c".
How Written:
Regular expressions are often written between two slash characters:
/a(b|c)/
This is the case for Perl and Ruby. Python puts the regular
expression inside a "raw" string:
r'a(b|c)', or
R"a(b|c)"
Blanks: A blank in a
regular expression always stands for an actual blank character.
I may sometimes add blanks to examples to improve readability,
as in the table above.
Metasymbols:
Notice that we're using the following symbols as metasymbols:
To refer to the actual character one needs a backslash:
and since we want to use a backslash as a metasymbol, we need a way
to represent an actual backslash character, which would usually
be \\ .
When we implement regular expressions, we will not be using
any metasymbols as actual characters, and we will not be using
the backslash character.
(When we use regular expressions in Python, we will need to be careful about
these special characters.)
For our implementation, we will use only one other special
symbol:
@ (for ε,
the empty string. Not used in programming languages, so there
is no standard notation)
The Empty String:
Of course programming languages make use of the empty string,
often denoted by "",
that is, nothing inside
double quotes. (Notice that the empty string is not some sort of
null object; it is a string with zero characters in it, a string
of length 0.) However, no programming language has a direct way to refer
to the empty string inside a regular expression.
Regular expressions in a programming
language:
Most programming languages that implement regular expressions have
a fairly standard form for the regular expressions themselves.
Here there are a number of additional metasymbols, such as
and very large number of ways to
specify classes of characters.
Additional Notation:
Just a few examples here. Below, R is any regular expression.
- R+
This means "one or more repetitions of" what comes before.
We can easily get this by writing (RR*)
- R?
This means "zero or one repetitions of" what comes before.
We can easily get this by writing
(R|ε)
Notice that this is not available in REs in a programming language.
- [a-z]
This means any lower-case letter. This could be represented
by (a|b|c|...|z), where the dots
are not part of the notation (needs all 26 letters). So this and
many other similar ranges are handy.
The link:
Regular Expressions shows many of the special possibilities
in Python, while REs
(tutorialspoint) gives a complete list at the end.
Unary Operators:
The unary operator * occurs to the
right of its operand. Of the various unary operators you
know from C, such as
+ - ! ~
* &
++ − − sizeof
(double)
only ++ and − −
can appear on the right. Of course the C operator
* appears on the left. So it is
unusual (and perhaps confusing) for the regular expression operator
* to appear on the right.
(Revision date: 2014-06-30.
Please use ISO
8601, the International Standard.)
|