CS 3723
Programming Languages
Ruby Regular Expressions
Overview:
This section studies regular expressions in Ruby, with emphasis
on how Ruby's REs are object-oriented. A knowledge
of REs that one might get from the systems programming course
will be helpful but not required.
To quote from Programming Ruby: The Pragmatic Programmer's Guide (the PPG),
"... when Matz designed Ruby, he produced a fully
object-oriented regular expression handling system. He then made it look
familiar to Perl programmers by wrapping all these $-variables on top of it."
References, all from the PPG:
Everything below is at a link on the left of the PPG page:
- Short introduction (Click on "Ruby.new", and look
at the section "Regular Expressions".)
- REs, Options, and Patterns (Click on "The Ruby Language", and look
at the subsection "Regular Expressions" and following subsections.)
- REs (Click on "Standard Types", and look at the section
"Regular Expressions", a long exposition that starts about halfway down
and runs all the way to the end.)
- Reference on Regexp class and its methods
(Click on "Built-in Classes and Methods", and then click
on "Regexp".)
Example, hours and minutes:
Consider matching a time of the form hh:mm.
The regular expression used is:
/(\d+):(\d+)/, which matches one or more digits,
a colon :, and then one or more digits. It does not check that the
digits form a valid time.
RE for hh:mm Time -- /(\d+):(\d+)/
Desc. | Ruby | Ruby/Perl | Time#12:34pm
everything matched | md[0] | $& | 12:34
before first match | md.pre_match | $` | Time#
first match | md[1] | $1 | 12
second match | md[2] | $2 | 34
after last match | md.post_match | $' | pm
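The correspondence in the table can be checked with a small sketch. The constant TIME_RE and the two helper methods are names made up for this illustration; they are not part of re3.rb or re4.rb.

```ruby
TIME_RE = /(\d+):(\d+)/ # same RE as in the table

# Return the five table entries through the MatchData object
# (hypothetical helper, for illustration only).
def time_parts(str)
  md = TIME_RE.match(str)
  return nil if md.nil?
  [md[0], md.pre_match, md[1], md[2], md.post_match]
end

# Return the same five entries through the Perl-style $-variables.
def time_parts_perl_style(str)
  return nil unless TIME_RE =~ str
  [$&, $`, $1, $2, $']
end

p time_parts("Time#12:34pm")
p time_parts_perl_style("Time#12:34pm")
```

Both calls should give the same five strings, since the $-variables are only another view of the most recent MatchData object.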
Regular Expression for hh:mm Time
1st Version, re3.rb, Perl-type loop | 2nd Version, re4.rb, Ruby iterator
#!/usr/bin/ruby
def out(m)
  print "a->",m[0],"--",m.pre_match,"--",
    m[1],"--",m[2],"--",m.post_match
end
re = /(\d+):(\d+)/ # match a time hh:mm
while line = gets # loop as in Perl
  md = re.match(line)
  next if md.nil? # skip lines that contain no match
  out(md) # to try out a function call
  # post_match keeps the newline read by gets, so each print ends its line
  print "b->",md[0],"--",md.pre_match,"--",
    md[1],"--",md[2],"--",md.post_match
  print "c->",$&, "--", $`, "--",
    $1, "--", $2, "--", $'
end
|
#!/usr/bin/ruby
def out(m)
  print "a->",m[0],"--",m.pre_match,"--",
    m[1],"--",m[2],"--",m.post_match
end
re = /(\d+):(\d+)/ # match a time hh:mm
ARGF.each { |line| # Ruby iterator
  md = re.match(line)
  next if md.nil? # skip lines that contain no match
  out(md) # to try out a function call
  # post_match keeps the newline of the input line, so each print ends its line
  print "b->",md[0],"--",md.pre_match,"--",
    md[1],"--",md[2],"--",md.post_match
  print "c->",$&, "--", $`, "--",
    $1, "--", $2, "--", $'
}
|
Source File, time.txt |
Common Output |
% cat time.txt
Time#12:34am
Time#10:30pm
Time#23:59
BadA#7:259xm
BadB#239:8ym
BadC#y9:876m
|
% ruby re3.rb < time.txt
a->12:34--Time#--12--34--am
b->12:34--Time#--12--34--am
c->12:34--Time#--12--34--am
a->10:30--Time#--10--30--pm
b->10:30--Time#--10--30--pm
c->10:30--Time#--10--30--pm
a->23:59--Time#--23--59--
b->23:59--Time#--23--59--
c->23:59--Time#--23--59--
a->7:259--BadA#--7--259--xm
b->7:259--BadA#--7--259--xm
c->7:259--BadA#--7--259--xm
a->239:8--BadB#--239--8--ym
b->239:8--BadB#--239--8--ym
c->239:8--BadB#--239--8--ym
a->9:876--BadC#y--9--876--m
b->9:876--BadC#y--9--876--m
c->9:876--BadC#y--9--876--m
|
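The output above shows that /(\d+):(\d+)/ also matches the three "Bad" lines, because it only asks for digits around a colon. As a rough sketch of how one might tighten it, the pattern below assumes 24-hour times (hours 0-23, minutes 00-59) and rejects extra digits on either side. STRICT_TIME_RE and strict_time are hypothetical names, and this pattern is not part of re3.rb or re4.rb.

```ruby
# Hypothetical stricter pattern: hours 0-23, minutes 00-59, and no
# additional digit directly before or after the matched time.
STRICT_TIME_RE = /(?<!\d)([01]?\d|2[0-3]):([0-5]\d)(?!\d)/

# Return [hours, minutes] as strings, or nil if no valid time is found.
def strict_time(line)
  md = STRICT_TIME_RE.match(line)
  md.nil? ? nil : [md[1], md[2]]
end

["Time#12:34am", "Time#10:30pm", "Time#23:59",
 "BadA#7:259xm", "BadB#239:8ym", "BadC#y9:876m"].each do |line|
  puts "#{line} -> #{strict_time(line).inspect}"
end
```

With this pattern the three "Time" lines still match and the three "Bad" lines give nil. The am/pm suffix is ignored, so a full validator would need more than this.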
Debug Example:
Here is an example that uses an arbitrary regular
expression and an arbitrary string as input.
It then prints the matched patterns.
Debugging Regular Expressions
Program, with Data | Output
#!/usr/bin/ruby
def md(re, dat)
  print "\nre->", re.inspect
  print "\ndat->", dat
  m = re.match(dat)
  if m != nil
    print "\nm.pre_match->" + m.pre_match
    for i in 0...m.length
      print "\nm[" + i.to_s + "]->"
      print m[i]
    end
    print "\nm.post_match->" + m.post_match
  end
  print "\n"
end
md(/(\d+):(\d+)/, "Time#12:34am")
md(/^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$/,
   "19 Kartaltepe, Erhan J. @00777777 etc.")
md(/(\(\d+\)):(\[\d+\])/, "Time#(12):[34]am")
|
% ruby md.rb
re->/(\d+):(\d+)/
dat->Time#12:34am
m.pre_match->Time#
m[0]->12:34
m[1]->12
m[2]->34
m.post_match->am
re->/^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$/
dat->19 Kartaltepe, Erhan J. @00777777 etc.
m.pre_match->
m[0]->19 Kartaltepe, Erhan J. @00777777 etc.
m[1]->19
m[2]->Kartaltepe
m[3]->Erhan J.
m[4]->@00777777
m.post_match->
re->/(\(\d+\)):(\[\d+\])/
dat->Time#(12):[34]am
m.pre_match->Time#
m[0]->(12):[34]
m[1]->(12)
m[2]->[34]
m.post_match->am
|
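The same information can also be collected into a data structure instead of being printed piece by piece. The following is only a sketch: match_report is a made-up name, and it relies on MatchData#to_a, which returns the whole match followed by each group.

```ruby
# Hypothetical variant of the debugging helper: return the parts of a
# match as a hash, or nil when the RE does not match at all.
def match_report(re, dat)
  m = re.match(dat)
  return nil if m.nil?
  { :pre => m.pre_match, :groups => m.to_a, :post => m.post_match }
end

p match_report(/(\d+):(\d+)/, "Time#12:34am")
p match_report(/(\(\d+\)):(\[\d+\])/, "Time#(12):[34]am")
p match_report(/(\d+):(\d+)/, "no digits here")
```

Returning nil for a failed match mirrors what Regexp#match itself does, so the caller can test the result in the usual way.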
Example, Transforming Class Lists:
This section gives a Ruby program that
translates a file of data I get from the UTSA system for each course.
The "before" and "after" forms of each line look as follows.
(Actually, I would probably leave the initial name the same, but I'm trying to
illustrate REs here.)
Old Line: 19 Kartaltepe, Erhan J. @00777777 (lots of extra stuff) ...
New Line: <li>Erhan J. Kartaltepe, Email: ekartalt@cs.utsa.edu
First RE: /^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$/ is
designed for the "old" lines:
Matches in the First RE
RE portion | Meaning
^ | start of line
(\d+) | one or more digits, Match 1 (unused)
\s+ | one or more whitespace chars
([-a-zA-Z]+) | one or more letters (or a hyphen), Match 2
, | a comma
\s+ | one or more whitespace chars
(.+) | one or more of any chars up to '@', Match 3
(@\d+) | '@', plus one or more digits, Match 4 (unused)
\s+ | one or more whitespace chars
.* | anything at all
$ | end of the line
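The breakdown in the table can be written directly into the RE by using Ruby's extended mode (the x option), which ignores whitespace in the pattern and allows comments. This is a sketch of the same First RE; CLASS_LINE_RE and parse_class_line are names invented for the illustration.

```ruby
# The First RE in extended (x) mode, one table row per line.
CLASS_LINE_RE = /
  ^              # start of line
  (\d+)          # one or more digits, Match 1, unused
  \s+            # one or more whitespace chars
  ([-a-zA-Z]+)   # one or more letters or hyphens, Match 2
  ,              # a comma
  \s+            # one or more whitespace chars
  (.+)           # one or more of any chars, Match 3
  (@\d+)         # an at-sign plus one or more digits, Match 4, unused
  \s+            # one or more whitespace chars
  .*             # anything at all
  $              # end of the line
/x

# Return the four groups as an array, or nil if the line does not match.
def parse_class_line(line)
  m = CLASS_LINE_RE.match(line)
  m.nil? ? nil : m.captures
end

p parse_class_line("19 Kartaltepe, Erhan J. @00777777 etc.")
```

Note that Match 3 comes back with a trailing space, because (.+) takes everything up to the '@'. That is why the translation program applies strip to it.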
Here is the Ruby program that does the translation. The program makes no use
of the Perl-style "$" variables. Since three matches are
active at the same time, the example relies on the fact that each match is a
separate MatchData object, so the matched characters of all three matches are
available at the same time. This is not possible with the "$" variables alone,
because each new match overwrites them.
(Of course this is not
a real "problem", and you can easily get around it in Perl.)
The third match is artificial, included just to try out another match,
since it only picks off the first character in the string.
Even though this example makes use of Ruby's capabilities, it would be
easy to restructure it as simple Perl.
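The point about simultaneous matches can be illustrated with a short sketch. The method name coexisting_matches is made up; the two REs are taken from the examples in these notes.

```ruby
# Two matches held at the same time as separate MatchData objects.
def coexisting_matches
  m1 = /(\d+):(\d+)/.match("Time#12:34am")
  m2 = /[a-zA-Z]{1,7}/.match("Kartaltepe")
  # m1 is still intact, but $& now reflects only the most recent match.
  [m1[0], m2[0], $&]
end

p coexisting_matches
```

The first two entries come from the two MatchData objects, while the third shows that the "$" variable tracks only the last match performed.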
Old String: Kartaltepe
New String: Kartalt
Second RE: /[a-zA-Z]{1,7}/ fetches 1-7 letters from the last name.
Old String: Erhan
New String: E
Third RE: /^([A-Z])/ fetches the first letter from the first name.
Finally, E and Kartalt are concatenated to give EKartalt,
and the string is then downcased to give ekartalt.
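The steps above can be sketched as a single method. The name email_id and the nil check are additions for this illustration; the two REs are the Second and Third REs described above.

```ruby
# Build the e-mail id: first initial plus up to 7 letters of the last
# name, all in lower case. Returns nil if either RE fails to match.
def email_id(last_name, first_name)
  m2 = /[a-zA-Z]{1,7}/.match(last_name)  # Second RE
  m3 = /^([A-Z])/.match(first_name)      # Third RE
  return nil if m2.nil? || m3.nil?
  (m3[1] + m2[0]).downcase
end

puts email_id("Kartaltepe", "Erhan J. ")
puts email_id("Smith-Jones", "John")
```

For a hyphenated last name the Second RE stops at the hyphen, so only the letters before it are used.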
File Translation Using Three Regular Expressions, File: rexp.rb
#!/usr/local/bin/ruby
print "<ol type=1>\n"
while line = gets # fetch next line of input file
  re = /^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$/ # main RE
  m = re.match(line) # match line with RE
  next if m.nil? # skip lines that do not have the expected form
  re2 = /[a-zA-Z]{1,7}/ # a 2nd RE: at least 1 and at most 7 letters
  m2 = re2.match(m[2]) # match portion of 1st match with 2nd RE
  re3 = /^([A-Z])/ # a 3rd RE: a single initial uppercase letter
  m3 = re3.match(m[3]) # match portion of 1st match with 3rd RE
  # output new altered line
  # strip removes initial and terminal whitespace from a string
  newline = "<li>" + m[3].strip + " " + m[2] + ", " +
    "Email: " + (m3[1] + m2[0]).downcase + "@cs.utsa.edu\n"
  print newline
end
print "</ol>\n"
|
Here are the input and output files:
input file (text),
output file (text),
output file (HTML).
Revision date: 2013-11-07.
(Please use ISO 8601,
the International Standard Date and Time Notation.)