|
CS 2213/2211
Advanced Programming
Spring 2005
Recitation 11:
Merging Files
Week 11: Apr 5-7
Due (on time):
2005-04-12 23:59:59
Due (late):
2005-04-17 23:59:59
|
Recitation 11 should be submitted
following directions at: submissions with deadlines
- 2005-04-12 23:59:59 (that's Tuesday, 12 April 2005, 11:59:59 pm)
for full credit.
- 2005-04-17 23:59:59 (that's Sunday, 17 April 2005, 11:59:59 pm)
for 75% credit.
|
Introduction:
This recitation works on areas:
- File input/output.
- Use of EOF during input.
- Use of a sentinel at the end of a file.
- A merge algorithm for files.
Overview:
This recitation asks you to merge two Unix/Linux dictionary files.
(Actually, the files are just alphabetical word lists.)
You are to use a merge algorithm to accomplish this.
Two technical problems are handling EOF on the two files
and handling the mixture of upper and lower case letters in
the files, which are alphatized as if all upper case letters
were lower case. Here is a writeup about file input/output
:
File Input/Output.
You should also refer to the text for this topic.
Unix/Linux word lists:
Traditionally, Unix and Linux have maintained a list of words
to use for spell checking and other purposes. Spell checking
is best done without too long a list, because otherwise misspelled
words will be found as "exotic" words on a long list.
- On the Sun OS, the word list is at:
/usr/share/lib/dict/words (25143 words in 206K). It is
not reachable from the Linux systems.
- On the Linux systems, the word list is at: /usr/share/dict/words
(45427 words in 409K). It is
not reachable from the Sun systems. (Actually below I am using the Linux file
from my own home system, which is slightly different from this one --
45378 words.)
Interesting enough, these lists
when merged result in over 51000 distinct words.
The word lists have a single word per line,
each pair of words separated by a newline, and the final record terminated
by a newline. (Unix text files are usually terminated with a
newline. This is expected but not required.) As mentioned above,
the entries have intermingled upper and lower case letters, and are
alphabetized as if the upper case were lower case.
The merge algorithm for sorting:
The use of a merge of two or more arrays or files gives
a simple high-performance sort when one is sorting in computer
memory (called internal sorts),
although there are many other good sorts. If one
is sorting sequential files, whether on disc or on tape, a
merge sort is the only option.
(These are called external sorts.) This recitation is concerned
with just one step: given two files in alphabetic order, combine
them into a single file in alphabetic order.
(A mergesort takes small files or arrays and merges them into larger
ones until just one sorted object remains.)
The basic algorithm is what you would do if you had two (physical) stacks
of papers, where each stack is in alphabetic order.
Read the first record (= word) from each file into separate
buffers. Then determine which one comes first in alphabetic order,
output it to the destination file, and refill its buffer from its file, leaving the
other buffer unchanged. Then repeat.
In the example here, if both buffers contain the same word,
then you should output one copy, and refill both buffers from their
respective files.
Eventually you will get to EOF on one or both of the input files.
If the same word occurs at the end of both files, then you
will need to stop at that point. Otherwise, if you get to EOF on
one file and not the other, then you need to read the remaining
words from the other file and write them to the destination file.
"Alphabetic order" causes another problem here, because our files
have mixed upper case and lower case. In one of our files, three
successive words are abbot, Abbott, and
abbreviate. But just using alphabetic order given
by the Ascii order, any word starting with "A" comes before all
words starting with "a". To make this work out, you need some sort
of little tricky addition. I'll give you a hint later if you don't figure
out something.
Finally, there is an alternative to using EOF to find the end of these
files. Instead one can insert a sentinel record (a word here)
at the end of the list. The sentinel must come after any other possible
record in the list. I used zzzzzzzzzz as my sentinel,
and I just edited it in at the end of each of the two files to be
merged. The program logic is simpler if one checks only for the
sentinel record in both files and then stops.
A. Initial test with small files:
The files used in this recitation are fairly large, so it is
important to carry out an initial test and to do initial debugging
using small artificial files.
Requirements for Part A |
- If you are using sentinels, then on either a Sun machine or a Linux machine,
you should merge the first
two files below to produce the third in a directory of your own somewhere:
- If you are not using a sentinel, the work with the following files:
- You are not to make copies of these files, but your program should
access these files directly using the full path names above.
- For reading each word, I used a single function with the file name
as one parameter, and with the already allocated string buffer as the other
parameter. This function reads chars up to a newline,
and returns either the number of chars read or EOF.
You don't have to do it this way though.
- For this initial debug run each output word should have an extra " U"
if it appears only in the first file, an extra " L"
if it appears only in the second file, and an extra " D"
if it appears in both files. (This is illustrated in the table below.)
- You should also count the number of words in each category.
Print these counts to stdout when you are done.
Do the arithmetic to check that the counts are consistent, that is,
the U plus the D counts should be the number
of words in the words_unix file, and
the L plus the D counts should be the number
of words in the words_linux file.
The sum of all three counts should be the number of words
in the words_merge file.
|
The table below shows the sample data with sentinels: the two files to merge and the
resulting merged file. I inserted extra blank lines to make everything
line up, but this might not work on your browser. This table shows the files
with sentinels. Without sentinels just leave off the zzzzzzzzzz
at the end of each file.
Two files to merge and merged file (with extra blank lines) |
---|
File 1: words_unix |
File 2: words_linux |
Merged file: words_merge |
10th
1st
2nd
a
AAAS
Aarhus
Aaron
Ababa
aback
abacus
abalone
abandon
abase
abash
abate
abater
abbas
abbey
abbot
Abbott
abbreviate
zzzzzzzzzz
|
Aaron
Ababa
aback
abaft
abandon
abandoned
abandoning
abase
abased
abasement
abasements
abases
abash
abashed
abate
abated
abatement
abatements
abater
abating
Abba
zzzzzzzzzz
| 10th U
1st U
2nd U
a U
AAAS U
Aarhus U
Aaron D
Ababa D
aback D
abacus U
abaft L
abalone U
abandon D
abandoned L
abandoning L
abase D
abased L
abasement L
abasements L
abases L
abash D
abashed L
abate D
abated L
abatement L
abatements L
abater D
abating L
Abba L
abbas U
abbey U
abbot U
Abbott U
abbreviate U
zzzzzzzzzz D
|
B. Full merging run:
With a little luck, this will be easy, and the program
in Part A will just run here.
Requirements for Part B |
- If you are using sentinels, then on either a Sun machine or a Linux machine,
you should merge the first
two files below to produce the third in a directory of your own somewhere:
- If you are not using a sentinel, the work with the following files:
- You are not to make copies of these files, but your program should
access these files directly using the full path names above. This will
keep from having each student use up storage for these files.
- The merged file should also have the "U", "L", or "D" tags
at the end of each line. (Of course for the real merged file,
we would leave these off.)
- Get the same counts in this case as in Part A. Run the
wc utility on each file, and do the arithmetic
to check if the counts are consistent with the file sizes.
|
Comments about debugging:
One possibility with this assignment, even Part A, and Part B more so,
is to start an infinite write loop, and write a file until something
gives out. You should realize that Part B itself should complete on
whatever machine you use in less than a second of wall-clock time.
If your program excutes for several minutes, you may burn up all your
available storage. Exactly this happened to me with my own solution
to the problem. By mistake I put xxxxxxxxxx at the
end of one of the files, rather than what should have been there.
I was doing something else at the same time, and before I knew it,
I had written a file of size 275 Mbyte. It was no problem on my system
at home, but it might screw things up at UTSA (or perhaps some storage
maximum will be tripped; I don't know). Anyway, be careful.
What you should submit:
Refer to the submissions directions and to
deadlines at the top of this page. The text file that you submit
should first have Your Name, the Course Number,
and the Recitation Number. The rest of the file
should have the following in it, in the order below, and clearly labeled,
including at the beginning the appropriate item letters:
a, b, c, etc.
Contents of email submission
for Recitation 11:
Last Name, First Name; Course Number; Recitation Number.
a. Indicate clearly whether you are using a sentinel or not.
Give the source for your merge program for Part A, including in particular
fopen function calls that give
access to the correct files.
b. Run the merge program in Part A, showing the counts
produced. Print the output file words_merge.
c. Give the source for your merge program for Part B. The only
difference should be the full paths to the input files.
d. Run the merge program in Part B, showing the counts
produced. Show wc invoked on each of the three
files and compare results with the counts.
Print the first 20 and the last 20 lines of the merged file.
|
For those crazy about word lists:
Here are two much longer lists of words and compounds from Apple's OS X.
They evidently are a part of BSD Unix.
Revision date: 2005-04-02.
(Please use ISO
8601, the International Standard.)
Copyright © 2011, Neal R. Wagner. Permission is granted
to access, download, share, and distribute, as long as this notice remains.