Python
  6.3 Concordance   
Programs


6.3.1. Performance of Different Concordance Programs: Section 6 studies binary trees in Python in two implementations: As a Python list (6.1 Trees as Lists:), and the ordinary pointer-based implementation (6.2 Trees Using Pointers:). I did two fairly large trials to quantify the performance. I wanted a real-world application, and used my favorite: a concordance based on a binary tree. Later I wondered if Python supported hash-based lookup, and then thought about dictionaries. Python's dictionary is supposed to be the most efficient of its features.

I right away came to realize that implementing a dictionary-based concordance is almost trivial. In the program below, the whole concordance construction is just the four lines 21 - 24.

The times given below are "wall-clock" times. In my next life I'll redo this with CPU times, but for now these shouldn't be so bad. My computer's CPU is dual processor, and a demanding process seems to get one entire processor. (Without programming threads, I can't get more than one processor.) I found the results below interesting.

Results of Tests
Source File
(With line
numbers)
Source
Size
(bytes)
Words
(Distinct
Words)
Time in Seconds (wall-clock) Resulting
Concor-
dance
Tree
as List
(6.1)
Tree with
Pointers
(6.2)
Using
Dict-
ionary
Using
C
Jerusalem
(Poem by Blake)
0.5 K 97 
(68)
~0~0~0~0 Jerusalem
Declaration of
Independence
8 K 1 339 
(538)
~0~0~0~0 DoI
Emma (Novel
by Jane Austen)
908 K 160 705 
(7 510)
11.64< 2< 0.5 Emma
Little Women
(Novel by Alcott)
1011 K 186 225 
(11 208)
--< 2 - Little
Women
Bleak House
(Novel by Dickens)
1945 K 357 704 
(16 282)
--< 4 - Bleak
House
Bible (KJV)
Index of Books
4400 K 823 019 
(12 970)
330187.5< 1.5 KJV

My son thought that the use of a List to represent a binary tree would be an insanely inefficient method that would die using larger examples. In fact, it worked satisfactorily on the large example novel Emma above. Given that, the 330 seconds for the Bible using a List is a big surprise. More data would be needed to draw conclusions. I was also surprised that a tree constructed from pointers was much more efficient than one made from a list. The List type in Python is notoriously slow. It also uses a complex algorithm, almost a kludge, so it would be hard to estimate or predict its performance. Nevertheless,

I expected the dictionary approach to be more efficient, and it was much more efficient. This may be part of the reason Python programmers like to use dictionaries.

Of course C was very fast, with a completely different program also using a binary tree,


6.3.2. Concordance Using a Dictionary:

n Concordance Using a Dictionary
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
# con.py: Concordance using a dictionary
import sys
import string

def concord(filepre): # construct concordance
    n = 0
    c = {}
    fsrc  = open(filepre + ".src.txt",'r') # open source file
    flist = open(filepre + ".list.txt",'w') # open listing file
    for line in fsrc: # process source file
        if len(line) == 1: # don't number empty lines
            flist.write((' '*5) + line)
        else:
            n += 1
            flist.write(("%4i" % n) + ' ' + line)
            wp = line.split()
            for w in wp:
                w = w.strip(string.punctuation).lower()
                if len(w) > 0:
                    # insert w and n into c, the concordance
                    if not(w in c): # new entry
                        c[w] = [n]
                    elif c[w][-1] != n: # first-time line no
                        c[w].append(n)
    return c

def write_concord(filepre): # output concordance in good form
    c = concord(filepre)
    fcon = open(filepre + ".con.txt",'w') # open concordance file
    ckeys = c.keys() # get list of keys
    ckeys.sort() # sort list of keys in place
    for j in range(0,len(ckeys)): # print in sorted order
        y = 15
        fcon.write(("%4i" % j) + ' ')
        fcon.write(str(ckeys[j]) + ' '*(15-len(ckeys[j])))
        x = 0
        for i in c[ckeys[j]]: # print list of line nos
            x += len(str(i)) + 1
            if x < 100: # max line length
                fcon.write(str(i) + " ")
            else:
                x = len(str(i))
                fcon.write('\n' + ' '*(y+5) + str(i) + " ")
        fcon.write('\n')

write_concord(sys.argv[1]) # uses command line parameter       

Items of Interest or for study:

  • As mentioned above, creating the dictionary to use as a concordance was almost trivial: lines 21-22 above insert a new word with an initial line number; lines 23-24 add a line number to an existing entry. Practially all there is to it.

  • It is necessary to sort the keys into order. Then given each key, the list of line numbers can be retrieved from the dictionary. (Several other annoying details.)

(Revision date: 2015-04-14. Please use ISO 8601, the International Standard.)