CS 3723 Ruby Regular Expressions

CS 3723
Programming Languages

Ruby Regular Expressions

Overview: This section studies regular expressions in Ruby, with emphasis on how Ruby's REs are object-oriented. Knowledge of REs, such as one might get from the systems programming course, is helpful but not required. To quote from the PPG,

"... when Matz designed Ruby, he produced a fully object-oriented regular expression handling system. He then made it look familiar to Perl programmers by wrapping all these $-variables on top of it."

References, all from the PPG: Everything below is at a link on the left of the PPG page:

Short introduction (Click on "Ruby.new", and look at the section "Regular Expressions".)
REs, Options, and Patterns (Click on "The Ruby Language", and look at the subsection "Regular Expressions" and following subsections.)
REs (Click on "Standard Types", and look at the section "Regular Expressions", a long exposition that starts about halfway down and runs to the end.)
Reference on Regexp class and its methods (Click on "Built-in Classes and Methods", and then click on "Regexp".)

Example, hours and minutes: Consider regular expression matches aimed at time in the form hh:mm. The regular expression used is /(\d+):(\d+)/, which matches one or more digits on each side of a colon and captures each run of digits as a group.

RE for hh:mm Time -- /(\d+):(\d+)/
Desc.	Ruby	Ruby/Perl	Time#12:34pm
everything matched	md[0]	$&	12:34
before first match	md.pre_match	$`	Time#
first match	md[1]	$1	12
second match	md[2]	$2	34
after last match	md.post_match	$'	pm
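
The correspondence in the table can be checked directly. The sketch below collects the MatchData values and the Perl-style globals for the sample string and prints both; the helper name `match_parts` is assumed here for illustration and is not part of Ruby.

```ruby
# Compare the object-oriented accessors with the Perl-style globals.
# `match_parts` is a hypothetical helper name, not part of Ruby.
def match_parts(re, str)
  md = re.match(str)
  return nil if md.nil?
  {
    object_style: [md[0], md.pre_match, md[1], md[2], md.post_match],
    perl_style:   [$&, $`, $1, $2, $']
  }
end

parts = match_parts(/(\d+):(\d+)/, "Time#12:34pm")
p parts[:object_style]  # ["12:34", "Time#", "12", "34", "pm"]
p parts[:perl_style]    # the same five strings
```

The globals are read inside the same method that performed the match, because Ruby keeps the last-match state local to the method that ran it.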

Regular Expression for hh:mm Time
1st Version, re3.rb, Perl-type loop

#!/usr/bin/ruby
def out(m)
  print "a->", m[0], "--", m.pre_match, "--",
        m[1], "--", m[2], "--", m.post_match
end
re = /(\d+):(\d+)/  # match a time hh:mm
while line = gets  # loop as in Perl
  md = re.match(line)
  out(md)  # to try out a function call
  print "b->", md[0], "--", md.pre_match, "--",
        md[1], "--", md[2], "--", md.post_match
  print "c->", $&, "--", $`, "--", $1, "--", $2, "--", $'
end

2nd Version, re4.rb, Ruby iterator

#!/usr/bin/ruby
def out(m)
  print "a->", m[0], "--", m.pre_match, "--",
        m[1], "--", m[2], "--", m.post_match
end
re = /(\d+):(\d+)/  # match a time hh:mm
ARGF.each { |line|  # Ruby iterator
  md = re.match(line)
  out(md)  # to try out a function call
  print "b->", md[0], "--", md.pre_match, "--",
        md[1], "--", md[2], "--", md.post_match
  print "c->", $&, "--", $`, "--", $1, "--", $2, "--", $'
}

Source File, time.txt

% cat time.txt
Time#12:34am
Time#10:30pm
Time#23:59
BadA#7:259xm
BadB#239:8ym
BadC#y9:876m

Common Output

% ruby re3.rb < time.txt
a->12:34--Time#--12--34--am
b->12:34--Time#--12--34--am
c->12:34--Time#--12--34--am
a->10:30--Time#--10--30--pm
b->10:30--Time#--10--30--pm
c->10:30--Time#--10--30--pm
a->23:59--Time#--23--59--
b->23:59--Time#--23--59--
c->23:59--Time#--23--59--
a->7:259--BadA#--7--259--xm
b->7:259--BadA#--7--259--xm
c->7:259--BadA#--7--259--xm
a->239:8--BadB#--239--8--ym
b->239:8--BadB#--239--8--ym
c->239:8--BadB#--239--8--ym
a->9:876--BadC#y--9--876--m
b->9:876--BadC#y--9--876--m
c->9:876--BadC#y--9--876--m
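
The three "Bad" lines still match because `\d+` accepts any run of digits. As a sketch only, and not part of the original programs, a tighter pattern could limit hours to 0-23 and minutes to 00-59, and a `nil` check avoids calling methods on a failed match. The names `STRICT_TIME` and `parse_time` are assumptions made for this illustration, and the lookbehind requires Ruby 1.9 or later.

```ruby
# Hypothetical stricter pattern: hours 0-23, minutes 00-59,
# and no extra digit touching either side of the time.
STRICT_TIME = /(?<!\d)([01]?\d|2[0-3]):([0-5]\d)(?!\d)/

# Returns [hours, minutes] as strings, or nil when nothing matches.
def parse_time(line)
  md = STRICT_TIME.match(line)
  md ? [md[1], md[2]] : nil
end

%w[Time#12:34am Time#10:30pm Time#23:59 BadA#7:259xm BadB#239:8ym BadC#y9:876m].each do |line|
  puts "#{line} -> #{parse_time(line).inspect}"
end
```

With this pattern the three "Time" lines yield hour and minute strings, while the three "Bad" lines yield nil.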

Debug Example: Here is an example that uses an arbitrary regular expression and an arbitrary string as input. It then prints the matched patterns.

Debugging Regular Expressions

Program with data (file md.rb), followed by its output:

#!/usr/bin/ruby
def md(re, dat)
  print "\nre->", re.inspect
  print "\ndat->", dat
  m = re.match(dat)
  if m != nil
    print "\nm.pre_match->" + m.pre_match
    for i in 0...m.length
      print "\nm[" + i.to_s + "]->"
      print m[i]
    end
    print "\nm.post_match->" + m.post_match
  end
  print "\n"
end

md(/(\d+):(\d+)/, "Time#12:34am")
md(/^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$/,
     "19 Kartaltepe, Erhan J. @00777777 etc.")
md(/(\(\d+\)):(\[\d+\])/, "Time#(12):[34]am")

% ruby md.rb
re->/(\d+):(\d+)/
dat->Time#12:34am
m.pre_match->Time#
m[0]->12:34
m[1]->12
m[2]->34
m.post_match->am

re->/^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$/
dat->19 Kartaltepe, Erhan J. @00777777 etc.
m.pre_match->
m[0]->19 Kartaltepe, Erhan J. @00777777 etc.
m[1]->19
m[2]->Kartaltepe
m[3]->Erhan J. 
m[4]->@00777777
m.post_match->

re->/(\(\d+\)):(\[\d+\])/
dat->Time#(12):[34]am
m.pre_match->Time#
m[0]->(12):[34]
m[1]->(12)
m[2]->[34]
m.post_match->am
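
As a variation on the debugging method above, the following sketch builds the report as an array of strings using an iterator instead of a `for` loop, so the result can be printed or compared in a test. The name `match_report` and the "no match" line are additions assumed here; the original method simply prints nothing extra when the match fails.

```ruby
# Hypothetical variant of md: returns the report lines instead of
# printing them, and notes explicitly when the RE does not match.
def match_report(re, dat)
  report = ["re->#{re.inspect}", "dat->#{dat}"]
  m = re.match(dat)
  return report << "no match" if m.nil?
  report << "m.pre_match->#{m.pre_match}"
  # m.to_a is [whole match, group 1, group 2, ...]
  m.to_a.each_with_index { |text, i| report << "m[#{i}]->#{text}" }
  report << "m.post_match->#{m.post_match}"
end

puts match_report(/(\(\d+\)):(\[\d+\])/, "Time#(12):[34]am")
puts match_report(/(\d+):(\d+)/, "no time here")
```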

Example, Transforming Class Lists: This section gives a Ruby program that translates a file of data I get from the UTSA system for each course. The "before" and "after" for each line look as follows. (Actually, I would probably leave the initial name the same, but I'm trying to illustrate REs here.)

Old Line: 19 Kartaltepe, Erhan J.  @00777777 (lots of extra stuff) ...
New Line: <li>Erhan J. Kartaltepe, Email: ekartalt@cs.utsa.edu

First RE: /^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$/ is designed for the "old" lines:

Matches in the First RE
RE portion	Meaning
`^`	start of line
`(\d+)`	one or more digits, Match 1 (unused)
`\s+`	one or more whitespace chars
`([-a-zA-Z]+)`	one or more letters (or a hyphen), Match 2
`,`	a comma
`\s+`	one or more whitespace chars
`(.+)`	one or more of any chars up to '@', Match 3
`(@\d+)`	'@', plus one or more digits, Match 4 (unused)
`\s+`	one or more whitespace chars
`.*`	anything at all
`$`	end of the line
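
The breakdown in the table can also be written into the pattern itself. The sketch below restates the first RE in Ruby's extended (`x`) mode, where whitespace and `#` comments inside the literal are ignored; the constant name `OLD_LINE` is assumed for illustration.

```ruby
# The first RE in extended (x) mode, one table row per line.
# OLD_LINE is a constant name assumed for this sketch.
OLD_LINE = /
  ^               # start of line
  (\d+)           # Match 1, unused: one or more digits
  \s+             # one or more whitespace chars
  ([-a-zA-Z]+)    # Match 2: one or more letters or hyphens
  ,               # a comma
  \s+             # one or more whitespace chars
  (.+)            # Match 3: any chars up to the at-sign
  (@\d+)          # Match 4, unused: at-sign plus digits
  \s+             # one or more whitespace chars
  .*              # anything at all
  $               # end of the line
/x

m = OLD_LINE.match("19 Kartaltepe, Erhan J.  @00777777 (lots of extra stuff) ...")
p m.captures  # the four parenthesized groups, in order
```

Comments inside such a literal should avoid the `/` character, which would end the literal early.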

Here is the Ruby program that does the translation. The program makes no use of the Perl-style "$" variables. Three matches are active at the same time, and because each match is a separate MatchData object (m, m2, and m3), the matched text of all three is available at once. In Perl, each new match would overwrite the "$" variables, although this is not a real problem and is easy to work around. The third match is artificial, included just to try out another match: it merely picks off the first character of the string. Even though this example makes use of Ruby's capabilities, it would be easy to restructure it as simple Perl.

Old String: Kartaltepe
New String: Kartalt

Second RE: /[a-zA-Z]{1,7}/ fetches 1-7 letters from last name.

Old String: Erhan
New String: E

Third RE: /^([A-Z])/ fetches first letter from first name

Combining the two results: E + Kartalt -> EKartalt, then downcase -> ekartalt
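
The second and third REs can be tried in isolation. This sketch wraps them in a helper, named `email_user` here purely for illustration, that returns the lowercased user name or nil when either piece fails to match. The name "Smith-Jones, Anna" is invented test data, not from the class list.

```ruby
# Hypothetical helper: first initial plus up to seven letters of the
# last name, lowercased. Returns nil if either RE fails to match.
def email_user(last_name, first_name)
  m2 = /[a-zA-Z]{1,7}/.match(last_name)  # second RE
  m3 = /^([A-Z])/.match(first_name)      # third RE
  return nil if m2.nil? || m3.nil?
  (m3[1] + m2[0]).downcase
end

puts email_user("Kartaltepe", "Erhan J.")  # ekartalt
puts email_user("Smith-Jones", "Anna")     # asmith (the match stops at the hyphen)
```

Note that the second RE has no hyphen in its character class, so a hyphenated last name contributes only the letters before the hyphen.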

File Translation Using Three Regular Expressions
File: rexp.rb

#!/usr/local/bin/ruby

print "<ol type=1>\n"
while line = gets  # fetch next line of input file
   re = /^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$/  # main RE
   m = re.match(line) # match line with RE
   next unless m      # skip lines that do not fit the expected format
   re2 = /[a-zA-Z]{1,7}/ # a 2nd RE: at least 1 and at most 7 letters
   m2 = re2.match(m[2]) # match portion of 1st match with 2nd RE
   re3 = /^([A-Z])/ # a 3rd RE: a single initial uppercase letter
   m3 = re3.match(m[3]) # match portion of 1st match with 3rd RE
   # output new altered line
   # strip removes initial and terminal whitespace from a string
   newline = "<li>" + m[3].strip + " " + m[2] + ", " +
      "Email: " + (m3[1] + m2[0]).downcase + "@cs.utsa.edu\n"
   print newline
end
print "</ol>\n"
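
For experimenting without an input file, the loop body can be sketched as a method that takes one line and returns the new line. The names `MAIN_RE` and `translate_line` are assumptions for this illustration, as is the choice to return nil for lines that do not fit the format; the second sample line is invented.

```ruby
# Sketch: the translation of a single line as a method.
# MAIN_RE and translate_line are hypothetical names.
MAIN_RE = /^(\d+)\s+([-a-zA-Z]+),\s+(.+)(@\d+)\s+.*$/

def translate_line(line)
  m = MAIN_RE.match(line)
  return nil if m.nil?                  # line is not in the "old" format
  first = m[3].strip                    # first name without padding
  m2 = /[a-zA-Z]{1,7}/.match(m[2])      # up to 7 letters of last name
  m3 = /^([A-Z])/.match(first)          # initial of first name
  return nil if m2.nil? || m3.nil?
  "<li>#{first} #{m[2]}, Email: #{(m3[1] + m2[0]).downcase}@cs.utsa.edu"
end

samples = [
  "19 Kartaltepe, Erhan J.  @00777777 (lots of extra stuff) ...",
  "a header line with no student data"  # invented non-matching line
]
puts "<ol type=1>"
samples.each do |line|
  new_line = translate_line(line)
  puts new_line if new_line
end
puts "</ol>"
```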

Here are the input and output files: input file (text), output file (text), output file (HTML).

Revision date: 2013-11-07. (Please use ISO 8601, the International Standard Date and Time Notation.)