CS 3723
  Programming Languages  
  Ruby Regular Expressions   


Overview: This section studies regular expressions in Ruby, with emphasis on how Ruby's REs are object-oriented. Knowledge of REs, such as one might get from the systems programming course, is helpful but not required. To quote from the PPG:

"... when Matz designed Ruby, he produced a fully object-oriented regular expression handling system. He then made it look familiar to Perl programmers by wrapping all these $-variables on top of it."


References, all from the PPG: Everything below is at a link on the left of the PPG page:

  • Short introduction (Click on "Ruby.new", and look at the section "Regular Expressions".)
  • REs, Options, and Patterns (Click on "The Ruby Language", and look at the subsection "Regular Expressions" and following subsections.)
  • REs (Click on "Standard Types", and look at the section "Regular Expressions", a long exposition that starts about halfway down and runs to the end.)
  • Reference on Regexp class and its methods (Click on "Built-in Classes and Methods", and then click on "Regexp".)


Example, hours and minutes: Consider regular expression matches aimed at times in the form hh:mm. The regular expression used is /(\d+):(\d+)/, which matches one or more digits, then a colon, then one or more digits.

RE for hh:mm Time -- /(\d+):(\d+)/
Desc.                Ruby            Ruby/Perl   Time#12:34pm
everything matched   md[0]           $&          12:34
before first match   md.pre_match    $`          Time#
first match          md[1]           $1          12
second match         md[2]           $2          34
after last match     md.post_match   $'          pm
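
The two access styles in the table can be checked against each other. The following is a minimal sketch using the table's sample string; the names TIME_RE and both_styles are chosen here for illustration and do not appear in re3.rb or re4.rb.

```ruby
TIME_RE = /(\d+):(\d+)/   # the hh:mm pattern from this section

# Returns the five table values twice: once through the MatchData
# object and once through the Perl-style $-variables.
def both_styles(str)
  md = TIME_RE.match(str)
  return nil unless md                  # no match: nothing to compare
  oo   = [md[0], md.pre_match, md[1], md[2], md.post_match]
  perl = [$&, $`, $1, $2, $']           # set by the same match call
  [oo, perl]
end

oo, perl = both_styles("Time#12:34pm")
p oo            # => ["12:34", "Time#", "12", "34", "pm"]
p oo == perl    # => true
```

Both arrays come from a single call to match, so they necessarily agree; the $-variables are just another view of the same MatchData object.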

Regular Expression for hh:mm Time

1st Version, re3.rb, Perl-type loop:

#!/usr/bin/ruby
def out(m)
  print "a->",m[0],"--",m.pre_match,"--",
    m[1],"--",m[2],"--",m.post_match
end

re = /(\d+):(\d+)/ # match a time hh:mm

while line = gets  # loop as in Perl
  md = re.match(line)
  out(md)  # to try out a function call
  print "b->",md[0],"--",md.pre_match,"--",
    md[1],"--",md[2],"--",md.post_match
  print "c->",$&,   "--", $`,         "--",
     $1,  "--", $2,  "--", $'
end

2nd Version, re4.rb, Ruby iterator:

#!/usr/bin/ruby
def out(m)
  print "a->",m[0],"--",m.pre_match,"--",
    m[1],"--",m[2],"--",m.post_match
end

re = /(\d+):(\d+)/  # match a time hh:mm

ARGF.each { |line|  # Ruby iterator
  md = re.match(line)
  out(md)  # to try out a function call
  print "b->",md[0],"--",md.pre_match,"--",
    md[1],"--",md[2],"--",md.post_match
  print "c->",$&,   "--", $`,         "--",
     $1,  "--", $2,  "--", $'
}

Source File, time.txt, followed by the Common Output of both versions:
% cat time.txt
Time#12:34am
Time#10:30pm
Time#23:59
BadA#7:259xm
BadB#239:8ym
BadC#y9:876m
% ruby re3.rb < time.txt
a->12:34--Time#--12--34--am
b->12:34--Time#--12--34--am
c->12:34--Time#--12--34--am
a->10:30--Time#--10--30--pm
b->10:30--Time#--10--30--pm
c->10:30--Time#--10--30--pm
a->23:59--Time#--23--59--
b->23:59--Time#--23--59--
c->23:59--Time#--23--59--
a->7:259--BadA#--7--259--xm
b->7:259--BadA#--7--259--xm
c->7:259--BadA#--7--259--xm
a->239:8--BadB#--239--8--ym
b->239:8--BadB#--239--8--ym
c->239:8--BadB#--239--8--ym
a->9:876--BadC#y--9--876--m
b->9:876--BadC#y--9--876--m
c->9:876--BadC#y--9--876--m
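
The BadA, BadB, and BadC lines still match because \d+ accepts any run of digits and the pattern is not anchored to anything. A stricter pattern is sketched below. STRICT_TIME and valid_time? are hypothetical names, and the rule it enforces (hours 0-23, exactly two minute digits in 00-59, no adjacent digits) is an assumption about what counts as a valid time, not something the programs above specify.

```ruby
# Hours 0-23, minutes 00-59, and no extra digit on either side.
STRICT_TIME = /(?<!\d)([01]?\d|2[0-3]):([0-5]\d)(?!\d)/

# True when the line contains a time that passes the stricter rule.
def valid_time?(line)
  !STRICT_TIME.match(line).nil?
end

samples = ['Time#12:34am', 'Time#10:30pm', 'Time#23:59',
           'BadA#7:259xm', 'BadB#239:8ym', 'BadC#y9:876m']
samples.each { |s| puts s + " -> " + valid_time?(s).to_s }
```

With this pattern the three Time lines are accepted and the three Bad lines are rejected.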


Debug Example: Here is an example that uses an arbitrary regular expression and an arbitrary string as input. It then prints the matched patterns.

Debugging Regular Expressions
Program with data, md.rb, followed by its output:
#!/usr/bin/ruby
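# Print the RE, the data, and (if the RE matches) every piece of the match.
def md(re, dat)
  print "\nre->", re.inspect
  print "\ndat->", dat
  m = re.match(dat)   # a MatchData object, or nil if there is no match
  if m != nil
    print "\nm.pre_match->" + m.pre_match
    for i in 0...m.length   # m[0] is the whole match, m[1].. are the groups
      print "\nm[" + i.to_s + "]->"
      print m[i]
    end
    print "\nm.post_match->" + m.post_match
  end
  print "\n"
end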

md(/(\d+):(\d+)/, "Time#12:34am")
md(/^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$/,
     "19 Kartaltepe, Erhan J. @00777777 etc.")
md(/(\(\d+\)):(\[\d+\])/, "Time#(12):[34]am")
% ruby md.rb
re->/(\d+):(\d+)/
dat->Time#12:34am
m.pre_match->Time#
m[0]->12:34
m[1]->12
m[2]->34
m.post_match->am

re->/^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$/
dat->19 Kartaltepe, Erhan J. @00777777 etc.
m.pre_match->
m[0]->19 Kartaltepe, Erhan J. @00777777 etc.
m[1]->19
m[2]->Kartaltepe
m[3]->Erhan J. 
m[4]->@00777777
m.post_match->

re->/(\(\d+\)):(\[\d+\])/
dat->Time#(12):[34]am
m.pre_match->Time#
m[0]->(12):[34]
m[1]->(12)
m[2]->[34]
m.post_match->am


Example, Transforming Class Lists: This section gives a Ruby program that translates a file of data I get from the UTSA system for each course. The "before" and "after" forms of each line look as follows. (Actually, I would probably leave the initial name the same, but I'm trying to illustrate REs here.)

    Old Line: 19 Kartaltepe, Erhan J.  @00777777 (lots of extra stuff) ...
    New Line: <li>Erhan J. Kartaltepe, Email: ekartalt@cs.utsa.edu

    First RE: /^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$/ is designed for the "old" lines:

Matches in the First RE
RE portion      Meaning
^               start of line
(\d+)           one or more digits, Match 1 (unused)
\s+             one or more whitespace chars
([-a-zA-Z]+)    one or more letters (or hyphens), Match 2
,               a comma
\s+             one or more whitespace chars
(.+)            one or more of any chars up to '@', Match 3
(@\d+)          '@', plus one or more digits, Match 4 (unused)
\s+             one or more whitespace chars
.*              anything at all
$               end of the line
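
One detail of the table is worth checking by hand: (.+) is greedy, so it first takes the rest of the line and then backs off until (@\d+) can match. Match 3 therefore keeps the whitespace in front of the '@'. The sketch below uses the sample line from md.rb; FIRST_RE and fields are names chosen here for illustration.

```ruby
FIRST_RE = /^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$/

# Returns the four captures of FIRST_RE as an array, or nil on no match.
def fields(line)
  m = FIRST_RE.match(line)
  m && m.captures
end

caps = fields("19 Kartaltepe, Erhan J. @00777777 etc.")
p caps            # => ["19", "Kartaltepe", "Erhan J. ", "@00777777"]
p caps[2].strip   # => "Erhan J."
```

The trailing space in the third capture is why a translation program would want to call strip on Match 3 before printing it.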

Here is the Ruby program that does the translation. The program makes no use of the Perl-style "$" variables. Instead, it keeps three match objects active at once and reads the matched characters from all three, something the "$" variables alone cannot do in Perl, because each new match overwrites them. (Of course this is not a real "problem", and you can easily get around it in Perl.) The third match is artificial, just to try out another match, since it only picks off the first character of the string. Even though this example makes use of Ruby's capabilities, it would be easy to restructure it as simple Perl.
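
The overwriting can be seen in a few lines. In this sketch (the method name and the sample strings are illustrative assumptions), two matches are made in a row: the two MatchData objects stay independent, while $& describes only the most recent match.

```ruby
# Runs two matches in a row and reports what each style still remembers.
def two_matches
  m1 = /(\d+):(\d+)/.match("Time#12:34am")
  m2 = /^([A-Z])/.match("Erhan")
  {
    first_via_object:  m1[0],   # still available after the second match
    second_via_object: m2[0],
    via_dollar:        $&       # reflects only the latest match
  }
end

p two_matches
```

Here the first entry is "12:34" while the $& entry is "E", the result of the second match.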

    Old String: Kartaltepe
    New String: Kartalt

    Second RE: /[a-zA-Z]{1,7}/ fetches 1-7 letters from last name.


    Old String: Erhan
    New String: E

    Third RE: /^([A-Z])/ fetches the first letter from the first name.


    Finally, E and Kartalt are concatenated to give EKartalt,
    and the string is then downcased to give ekartalt.
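
The three steps above can be sketched as one small method. The name email_id is hypothetical, and returning nil when either match fails is an assumption made for this sketch.

```ruby
# Builds the e-mail id from a last name and a first name, following
# the steps described above.
def email_id(last, first)
  m2 = /[a-zA-Z]{1,7}/.match(last)   # up to 7 letters of the last name
  m3 = /^([A-Z])/.match(first)       # initial uppercase letter
  return nil unless m2 && m3         # assumption: give up if either fails
  (m3[1] + m2[0]).downcase
end

puts email_id("Kartaltepe", "Erhan J. ")   # prints ekartalt
```

Because the second RE has no hyphen in its character class, a hyphenated last name contributes only the letters before the hyphen.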

File Translation Using Three Regular Expressions
File: rexp.rb
#!/usr/local/bin/ruby

print "<ol type=1>\n"
while line = gets  # fetch next line of input file
   re = /^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$/  # main RE
   m = re.match(line) # match line with RE
   re2 = /[a-zA-Z]{1,7}/ # a 2nd RE: at least 1 and at most 7 letters
   m2 = re2.match(m[2]) # match portion of 1st match with 2nd RE
   re3 = /^([A-Z])/ # a 3rd RE: a single initial uppercase letter
   m3 = re3.match(m[3]) # match portion of 1st match with 3rd RE
   # output new altered line
   # strip removes initial and terminal whitespace from a string
   newline = "<li>" + m[3].strip + " " + m[2] + ", " +
      "Email: " + (m3[1] + m2[0]).downcase + "@cs.utsa.edu\n"
   print newline
end
print "</ol>\n"
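
rexp.rb assumes that every input line matches the main RE; on a line that does not, m is nil and m[2] raises an error. A hedged variant that skips such lines is sketched below. MAIN_RE and translate are names introduced for this sketch, and returning nil for a line that does not fit is an assumption, not part of rexp.rb.

```ruby
MAIN_RE = /^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$/

# Translates one class-list line into an <li> line, or returns nil
# when the line does not have the expected shape.
def translate(line)
  m = MAIN_RE.match(line)
  return nil unless m
  m2 = /[a-zA-Z]{1,7}/.match(m[2])   # at most 7 letters of the last name
  m3 = /^([A-Z])/.match(m[3])        # initial of the first name
  return nil unless m2 && m3
  "<li>" + m[3].strip + " " + m[2] + ", " +
    "Email: " + (m3[1] + m2[0]).downcase + "@cs.utsa.edu\n"
end

print translate("19 Kartaltepe, Erhan J.  @00777777 (lots of extra stuff) ...\n")
p translate("not a class-list line")   # => nil
```

In a loop over the input, a caller could write each non-nil result and ignore the rest.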

Here are the input and output files: input file (text), output file (text), output file (HTML).


Revision date: 2013-11-07. (Please use ISO 8601, the International Standard Date and Time Notation.)