CS 2213/2211 -- Recitation 11

CS 2213/2211
Advanced Programming
Spring 2005
Recitation 11: Merging Files
    Week 11: Apr 5-7
Due (on time): 2005-04-12  23:59:59
Due (late):        2005-04-17  23:59:59

Recitation 11 should be submitted following directions at: submissions with deadlines

2005-04-12 23:59:59 (that's Tuesday, 12 April 2005, 11:59:59 pm) for full credit.
2005-04-17 23:59:59 (that's Sunday, 17 April 2005, 11:59:59 pm) for 75% credit.

Introduction: This recitation works on areas:

File input/output.
Use of EOF during input.
Use of a sentinel at the end of a file.
A merge algorithm for files.

Overview: This recitation asks you to merge two Unix/Linux dictionary files. (Actually, the files are just alphabetical word lists.) You are to use a merge algorithm to accomplish this. Two technical problems are handling EOF on the two files and handling the mixture of upper and lower case letters in the files, which are alphatized as if all upper case letters were lower case. Here is a writeup about file input/output : File Input/Output. You should also refer to the text for this topic.

Unix/Linux word lists: Traditionally, Unix and Linux have maintained a list of words to use for spell checking and other purposes. Spell checking is best done without too long a list, because otherwise misspelled words will be found as "exotic" words on a long list.

On the Sun OS, the word list is at: /usr/share/lib/dict/words (25143 words in 206K). It is not reachable from the Linux systems.
On the Linux systems, the word list is at: /usr/share/dict/words (45427 words in 409K). It is not reachable from the Sun systems. (Actually below I am using the Linux file from my own home system, which is slightly different from this one -- 45378 words.)

Interesting enough, these lists when merged result in over 51000 distinct words.

The word lists have a single word per line, each pair of words separated by a newline, and the final record terminated by a newline. (Unix text files are usually terminated with a newline. This is expected but not required.) As mentioned above, the entries have intermingled upper and lower case letters, and are alphabetized as if the upper case were lower case.

The merge algorithm for sorting: The use of a merge of two or more arrays or files gives a simple high-performance sort when one is sorting in computer memory (called internal sorts), although there are many other good sorts. If one is sorting sequential files, whether on disc or on tape, a merge sort is the only option. (These are called external sorts.) This recitation is concerned with just one step: given two files in alphabetic order, combine them into a single file in alphabetic order. (A mergesort takes small files or arrays and merges them into larger ones until just one sorted object remains.)

The basic algorithm is what you would do if you had two (physical) stacks of papers, where each stack is in alphabetic order. Read the first record (= word) from each file into separate buffers. Then determine which one comes first in alphabetic order, output it to the destination file, and refill its buffer from its file, leaving the other buffer unchanged. Then repeat.

In the example here, if both buffers contain the same word, then you should output one copy, and refill both buffers from their respective files.

Eventually you will get to EOF on one or both of the input files. If the same word occurs at the end of both files, then you will need to stop at that point. Otherwise, if you get to EOF on one file and not the other, then you need to read the remaining words from the other file and write them to the destination file.

"Alphabetic order" causes another problem here, because our files have mixed upper case and lower case. In one of our files, three successive words are abbot, Abbott, and abbreviate. But just using alphabetic order given by the Ascii order, any word starting with "A" comes before all words starting with "a". To make this work out, you need some sort of little tricky addition. I'll give you a hint later if you don't figure out something.

Finally, there is an alternative to using EOF to find the end of these files. Instead one can insert a sentinel record (a word here) at the end of the list. The sentinel must come after any other possible record in the list. I used zzzzzzzzzz as my sentinel, and I just edited it in at the end of each of the two files to be merged. The program logic is simpler if one checks only for the sentinel record in both files and then stops.

A. Initial test with small files: The files used in this recitation are fairly large, so it is important to carry out an initial test and to do initial debugging using small artificial files.

Requirements for Part A

Requirements for Part A
If you are using sentinels, then on either a Sun machine or a Linux machine, you should merge the first two files below to produce the third in a directory of your own somewhere: `/home/wagner/public_html/CS2213/wordst/words_unix` `/home/wagner/public_html/CS2213/wordst/words_linux` `words_merge` If you are not using a sentinel, the work with the following files: `/home/wagner/public_html/CS2213/wordst/words_unix_ns` `/home/wagner/public_html/CS2213/wordst/words_linux_ns` `words_merge_ns` You are not to make copies of these files, but your program should access these files directly using the full path names above. For reading each word, I used a single function with the file name as one parameter, and with the already allocated string buffer as the other parameter. This function reads chars up to a newline, and returns either the number of chars read or EOF. You don't have to do it this way though. For this initial debug run each output word should have an extra `" U"` if it appears only in the first file, an extra `" L"` if it appears only in the second file, and an extra `" D"` if it appears in both files. (This is illustrated in the table below.) You should also count the number of words in each category. Print these counts to `stdout` when you are done. Do the arithmetic to check that the counts are consistent, that is, the `U` plus the `D` counts should be the number of words in the `words_unix` file, and the `L` plus the `D` counts should be the number of words in the `words_linux` file. The sum of all three counts should be the number of words in the `words_merge` file.

If you are using sentinels, then on either a Sun machine or a Linux machine, you should merge the first two files below to produce the third in a directory of your own somewhere:
If you are not using a sentinel, the work with the following files:
You are not to make copies of these files, but your program should access these files directly using the full path names above.
For reading each word, I used a single function with the file name as one parameter, and with the already allocated string buffer as the other parameter. This function reads chars up to a newline, and returns either the number of chars read or EOF. You don't have to do it this way though.
For this initial debug run each output word should have an extra " U" if it appears only in the first file, an extra " L" if it appears only in the second file, and an extra " D" if it appears in both files. (This is illustrated in the table below.)
You should also count the number of words in each category. Print these counts to stdout when you are done. Do the arithmetic to check that the counts are consistent, that is, the U plus the D counts should be the number of words in the words_unix file, and the L plus the D counts should be the number of words in the words_linux file. The sum of all three counts should be the number of words in the words_merge file.

The table below shows the sample data with sentinels: the two files to merge and the resulting merged file. I inserted extra blank lines to make everything line up, but this might not work on your browser. This table shows the files with sentinels. Without sentinels just leave off the zzzzzzzzzz at the end of each file.

Two files to merge and merged file (with extra blank lines)
File 1: `words_unix`	File 2: `words_linux`	Merged file: `words_merge`
10th 1st 2nd a AAAS Aarhus Aaron Ababa aback abacus abalone abandon abase abash abate abater abbas abbey abbot Abbott abbreviate zzzzzzzzzz	Aaron Ababa aback abaft abandon abandoned abandoning abase abased abasement abasements abases abash abashed abate abated abatement abatements abater abating Abba zzzzzzzzzz	10th U 1st U 2nd U a U AAAS U Aarhus U Aaron D Ababa D aback D abacus U abaft L abalone U abandon D abandoned L abandoning L abase D abased L abasement L abasements L abases L abash D abashed L abate D abated L abatement L abatements L abater D abating L Abba L abbas U abbey U abbot U Abbott U abbreviate U zzzzzzzzzz D

Two files to merge and merged file (with extra blank lines)

File 1: words_unix File 2: words_linux Merged file: words_merge

10th
1st
2nd
a
AAAS
Aarhus
Aaron
Ababa
aback
abacus

abalone
abandon


abase




abash

abate



abater


abbas
abbey
abbot
Abbott
abbreviate
zzzzzzzzzz







Aaron
Ababa
aback

abaft

abandon
abandoned
abandoning
abase
abased
abasement
abasements
abases
abash
abashed
abate
abated
abatement
abatements
abater
abating
Abba





zzzzzzzzzz

10th U
1st U
2nd U
a U
AAAS U
Aarhus U
Aaron D
Ababa D
aback D
abacus U
abaft L
abalone U
abandon D
abandoned L
abandoning L
abase D
abased L
abasement L
abasements L
abases L
abash D
abashed L
abate D
abated L
abatement L
abatements L
abater D
abating L
Abba L
abbas U
abbey U
abbot U
Abbott U
abbreviate U
zzzzzzzzzz D

B. Full merging run: With a little luck, this will be easy, and the program in Part A will just run here.

Requirements for Part B

Requirements for Part B
If you are using sentinels, then on either a Sun machine or a Linux machine, you should merge the first two files below to produce the third in a directory of your own somewhere: `/home/wagner/public_html/CS2213/words/words_unix` `/home/wagner/public_html/CS2213/words/words_linux` `words_merge` If you are not using a sentinel, the work with the following files: `/home/wagner/public_html/CS2213/words/words_unix_ns` `/home/wagner/public_html/CS2213/words/words_linux_ns` `words_merge_ns` You are not to make copies of these files, but your program should access these files directly using the full path names above. This will keep from having each student use up storage for these files. The merged file should also have the "U", "L", or "D" tags at the end of each line. (Of course for the real merged file, we would leave these off.) Get the same counts in this case as in Part A. Run the `wc` utility on each file, and do the arithmetic to check if the counts are consistent with the file sizes.

If you are using sentinels, then on either a Sun machine or a Linux machine, you should merge the first two files below to produce the third in a directory of your own somewhere:
If you are not using a sentinel, the work with the following files:
You are not to make copies of these files, but your program should access these files directly using the full path names above. This will keep from having each student use up storage for these files.
The merged file should also have the "U", "L", or "D" tags at the end of each line. (Of course for the real merged file, we would leave these off.)
Get the same counts in this case as in Part A. Run the wc utility on each file, and do the arithmetic to check if the counts are consistent with the file sizes.

Comments about debugging: One possibility with this assignment, even Part A, and Part B more so, is to start an infinite write loop, and write a file until something gives out. You should realize that Part B itself should complete on whatever machine you use in less than a second of wall-clock time. If your program excutes for several minutes, you may burn up all your available storage. Exactly this happened to me with my own solution to the problem. By mistake I put xxxxxxxxxx at the end of one of the files, rather than what should have been there. I was doing something else at the same time, and before I knew it, I had written a file of size 275 Mbyte. It was no problem on my system at home, but it might screw things up at UTSA (or perhaps some storage maximum will be tripped; I don't know). Anyway, be careful.

What you should submit: Refer to the submissions directions and to deadlines at the top of this page. The text file that you submit should first have Your Name, the Course Number, and the Recitation Number. The rest of the file should have the following in it, in the order below, and clearly labeled, including at the beginning the appropriate item letters: a, b, c, etc.

Contents of email submission for Recitation 11:
Last Name, First Name; Course Number; Recitation Number.
a. Indicate clearly whether you are using a sentinel or not. Give the source for your merge program for Part A, including in particular fopen function calls that give access to the correct files.
b. Run the merge program in Part A, showing the counts produced. Print the output file words_merge.
c. Give the source for your merge program for Part B. The only difference should be the full paths to the input files.
d. Run the merge program in Part B, showing the counts produced. Show wc invoked on each of the three files and compare results with the counts. Print the first 20 and the last 20 lines of the merged file.

For those crazy about word lists: Here are two much longer lists of words and compounds from Apple's OS X. They evidently are a part of BSD Unix.

/home/neal/public_html/CS2213/wordst/words_bsd: a much longer list of 234937 words in 2.5 Meg.
/home/neal/public_html/CS2213/wordst/words_bsd_comp: a long list of 76205 compound words in 1 Meg.

Revision date: 2005-04-02. (Please use ISO 8601, the International Standard.)