|
 |
Python |
6.3 Concordance
Programs |
|
6.3.1. Performance of Different Concordance
Programs: Section 6 studies binary trees
in Python in two implementations: As a Python list
(6.1 Trees as Lists:), and the
ordinary pointer-based implementation (6.2 Trees Using Pointers:).
I did two fairly large
trials to quantify the performance. I wanted a real-world application,
and used my favorite: a concordance based on a binary tree.
Later I wondered if Python supported hash-based lookup, and then
thought about dictionaries. Python's dictionary is supposed to be
the most efficient of its features.
I right away came to realize that
implementing a dictionary-based concordance is almost trivial.
In the program below, the whole concordance construction is just the four
lines 21 - 24.
The times given below are "wall-clock" times. In my next life I'll
redo this with CPU times, but for now these shouldn't be so bad.
My computer's CPU is dual processor, and a demanding process seems to
get one entire processor. (Without programming threads, I can't
get more than one processor.) I found the results below interesting.
My son thought that the use of a List to represent a binary tree
would be an insanely inefficient method that would die using larger examples.
In fact, it worked satisfactorily on the large example novel
Emma above. Given that, the 330 seconds for the Bible using a List is a
big surprise. More data would be needed to draw conclusions.
I was also surprised that a tree constructed from pointers was much more
efficient than one made from a list. The List type in Python is
notoriously slow. It also uses a complex algorithm, almost a kludge,
so it would be hard to estimate or predict its performance.
Nevertheless,
I expected the dictionary approach to be more efficient, and it was
much more efficient. This may be part of the reason Python programmers
like to use dictionaries.
Of course C was very fast, with a completely different program also
using a binary tree,
6.3.2. Concordance Using a
Dictionary:
n |
Concordance Using a Dictionary |
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46 |
# con.py: Concordance using a dictionary
import sys
import string
def concord(filepre): # construct concordance
n = 0
c = {}
fsrc = open(filepre + ".src.txt",'r') # open source file
flist = open(filepre + ".list.txt",'w') # open listing file
for line in fsrc: # process source file
if len(line) == 1: # don't number empty lines
flist.write((' '*5) + line)
else:
n += 1
flist.write(("%4i" % n) + ' ' + line)
wp = line.split()
for w in wp:
w = w.strip(string.punctuation).lower()
if len(w) > 0:
# insert w and n into c, the concordance
if not(w in c): # new entry
c[w] = [n]
elif c[w][-1] != n: # first-time line no
c[w].append(n)
return c
def write_concord(filepre): # output concordance in good form
c = concord(filepre)
fcon = open(filepre + ".con.txt",'w') # open concordance file
ckeys = c.keys() # get list of keys
ckeys.sort() # sort list of keys in place
for j in range(0,len(ckeys)): # print in sorted order
y = 15
fcon.write(("%4i" % j) + ' ')
fcon.write(str(ckeys[j]) + ' '*(15-len(ckeys[j])))
x = 0
for i in c[ckeys[j]]: # print list of line nos
x += len(str(i)) + 1
if x < 100: # max line length
fcon.write(str(i) + " ")
else:
x = len(str(i))
fcon.write('\n' + ' '*(y+5) + str(i) + " ")
fcon.write('\n')
write_concord(sys.argv[1]) # uses command line parameter
|
Items of Interest or for study:
- As mentioned above, creating the dictionary to use as a concordance
was almost trivial: lines 21-22 above insert a new word with an
initial line number; lines 23-24 add a line number to an existing
entry. Practially all there is to it.
- It is necessary to sort the keys into order. Then given each
key, the list of line numbers can be retrieved from the dictionary.
(Several other annoying details.)
(Revision date: 2015-04-14.
Please use ISO 8601,
the International Standard.)
|