Python: Concordance Programs

Python

6.3 Concordance
Programs

6.3.1. Performance of Different Concordance Programs: Section 6 studies binary trees in Python in two implementations: As a Python list (6.1 Trees as Lists:), and the ordinary pointer-based implementation (6.2 Trees Using Pointers:). I did two fairly large trials to quantify the performance. I wanted a real-world application, and used my favorite: a concordance based on a binary tree. Later I wondered if Python supported hash-based lookup, and then thought about dictionaries. Python's dictionary is supposed to be the most efficient of its features.

I right away came to realize that implementing a dictionary-based concordance is almost trivial. In the program below, the whole concordance construction is just the four lines 21 - 24.

The times given below are "wall-clock" times. In my next life I'll redo this with CPU times, but for now these shouldn't be so bad. My computer's CPU is dual processor, and a demanding process seems to get one entire processor. (Without programming threads, I can't get more than one processor.) I found the results below interesting.

Results of Tests

Source File
(With line
numbers) Source
Size
(bytes) Words
(Distinct
Words) Time in Seconds (wall-clock) Resulting
Concor-
dance

Tree
as List
(6.1) Tree with
Pointers
(6.2) Using
Dict-
ionary Using
C

Jerusalem
(Poem by Blake) 0.5 K 97
(68) ~0 ~0 ~0 ~0 Jerusalem

Declaration of
Independence 8 K 1 339
(538) ~0 ~0 ~0 ~0 DoI

Emma (Novel
by Jane Austen) 908 K 160 705
(7 510) 11.6 4 < 2 < 0.5 Emma

Little Women
(Novel by Alcott) 1011 K 186 225
(11 208) - - < 2 - Little
Women

Bleak House
(Novel by Dickens) 1945 K 357 704
(16 282) - - < 4 - Bleak
House

Bible (KJV)
Index of Books 4400 K 823 019
(12 970) 330 18 7.5 < 1.5 KJV

My son thought that the use of a List to represent a binary tree would be an insanely inefficient method that would die using larger examples. In fact, it worked satisfactorily on the large example novel Emma above. Given that, the 330 seconds for the Bible using a List is a big surprise. More data would be needed to draw conclusions. I was also surprised that a tree constructed from pointers was much more efficient than one made from a list. The List type in Python is notoriously slow. It also uses a complex algorithm, almost a kludge, so it would be hard to estimate or predict its performance. Nevertheless,

I expected the dictionary approach to be more efficient, and it was much more efficient. This may be part of the reason Python programmers like to use dictionaries.

Of course C was very fast, with a completely different program also using a binary tree,

6.3.2. Concordance Using a Dictionary:

n Concordance Using a Dictionary

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46

# con.py: Concordance using a dictionary import sys import string def concord(filepre): # construct concordance n = 0 c = {} fsrc = open(filepre + ".src.txt",'r') # open source file flist = open(filepre + ".list.txt",'w') # open listing file for line in fsrc: # process source file if len(line) == 1: # don't number empty lines flist.write((' '*5) + line) else: n += 1 flist.write(("%4i" % n) + ' ' + line) wp = line.split() for w in wp: w = w.strip(string.punctuation).lower() if len(w) > 0: # insert w and n into c, the concordance if not(w in c): # new entry c[w] = [n] elif c[w][-1] != n: # first-time line no c[w].append(n) return c def write_concord(filepre): # output concordance in good form c = concord(filepre) fcon = open(filepre + ".con.txt",'w') # open concordance file ckeys = c.keys() # get list of keys ckeys.sort() # sort list of keys in place for j in range(0,len(ckeys)): # print in sorted order y = 15 fcon.write(("%4i" % j) + ' ') fcon.write(str(ckeys[j]) + ' '*(15-len(ckeys[j]))) x = 0 for i in c[ckeys[j]]: # print list of line nos x += len(str(i)) + 1 if x < 100: # max line length fcon.write(str(i) + " ") else: x = len(str(i)) fcon.write('\n' + ' '*(y+5) + str(i) + " ") fcon.write('\n') write_concord(sys.argv[1]) # uses command line parameter

Items of Interest or for study:

As mentioned above, creating the dictionary to use as a concordance was almost trivial: lines 21-22 above insert a new word with an initial line number; lines 23-24 add a line number to an existing entry. Practially all there is to it.
It is necessary to sort the keys into order. Then given each key, the list of line numbers can be retrieved from the dictionary. (Several other annoying details.)

(Revision date: 2015-04-14. Please use ISO 8601, the International Standard.)