|
 |
CS 3723
Programming Languages |
Regular Expressions |
Definition:
Regular expressions give another equivalent way to describe
languages accepted by NFAs and DFAs.
Regular expressions are often used in "scripting" languages,
such as Perl, Python, Ruby, etc.
Java even has a library implementation.
Presented here are the "bare" regular expressions without
the amazing amount of "syntactic sugar" supplied by most
implementations. This extra stuff gives a great
deal of shorthand notation.
However, we are presenting the essentials of the topic.
Start with an alphabet Σ
of symbols,
such at {a,b,c,...} or {0,1} or the Ascii characters.
A regular expression satisfies a recursive definition:
Definition: Regular Expression
|
- For any symbol a in Σ,
the regular expression a represents
the symbol itself.
- Suppose R,
R1, and
R2
are arbitrary regular expressions. Then
- R1 R1
represents all strings that can be written as
a string described by the first followed by a string
described by the second (the concatenation operator),
which is implicit, that it, not explicitly written.
- R1 | R1
represents the union of strings described by the first and by the second
(the or operator),
- R*
represents all strings that can be written as the concatenation
of any number of strings from R
(the star operator).
This includes concatenating zero strings from
R, giving the empty
string ε. This unary operator appears to
the right of its operand.
- (R), which is the
same as R (the parens
are used for grouping).
|
Precedence of Operators
|
Star: highest, Concatentation: next,
Or: lowest |
Parentheses are used to alter precedence of operators.
Thus a|bc* means
a|(b(c*)), whereas
a*b|c means
((a*)b)|c. Different meanings
require parentheses, as with, say a(b|c),
which means an "a" followed by either a "b" or a "c".
The expression ab|c would mean
either "ab" or "c".
Regular expressions are often written between two slash characters:
/a(b|c)/
A blank in a
regular expression always stands for an actual blank character.
I may sometimes add blanks to examples to improve readability,
as in the table above.
Notice that we're using the following symbols as metasymbols:
To refer to the actual character one needs a backslash:
but we won't implement or use this notation.
Regular expressions in programming languages use a number
of additional metasymbols, such as
and very large number of ways to
specify classes of characters. Here we will only use two more:
. (a dot, refers
to any single symbol)
@ (for ε,
the empty string. Not used in programming languages, so there
is no standard notation)
[The unary operator * occurs to the
right of its operand. Of the various unary operators you
know from C, such as
+ - ! ~
* &
++ -- sizeof (double)
only ++ and --
can appear on the right. Of course the C operator
* appears on the left. So it is
unusual (and perhaps confusing) for the regular expression operator
* to appear on the right.]
Revision date: 2013-06-09.
(Please use ISO
8601, the International Standard.)
|