CS 3723 Python: Trees as Lists

CS 3723
Programming Languages

6.1 Trees as Lists

6.1. Using Lists to Represent Trees: For this page's inspiration, see especially: Trees From Lists, which is a part of: Python Data Structures. They suggested that a rooted tree with data at each node could be represented recursively by a list, with 0th element the data, 1st element the leftmost subtree, 2nd element the next subtree, and so forth. The absence of a subtree would be an empty list [ ]. In the case of a binary tree, the recursive definition of each node would be:

[data, left subtree as list, right subtree as list]

For example, using ["Mon","Tue","Wed","Thu","Fri"] as data to build a binary search tree yields the list:

['Mon', ['Fri', [ ], [ ] ], ['Tue', ['Thu', [ ], [ ] ], ['Wed', [ ], [ ] ] ] ]

If a is a reference to a list = subtree, then a[0],a[1],a[2] give respectively, the data, the left subtree, and the right subtree. Either subtree can be empty.

6.2. Binary Search Tree with Inorder Traversal: Here is a program implementing two algorithms: creation of a binary search tree and an inorder traversal:

n Representing a Tree Using a List, Binary Search Tree and Inorder Traversal

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29

# tree.py: tree using lists import sys def treeInsert(s,a): if s == []: # first time through s.append(a) s.append([]) s.append([]) else: while True: if a == s[0]: # found return 1 elif a < s[0]: # left if s[1] == []: # not found, insert s[1] = [a,[],[]] return 0 else: s = s[1] # go left elif a > s[0]: # right if s[2] == []: # not found, insert s[2] = [a,[],[]] return 0 else: s = s[2] # go right
def binarySearchTree(m): r = [] # build tree from scratch for i in range(0,len(m)): stat = treeInsert(r,m[i]) return r def inorderTraversal(s): if (s[1] != []): # left subtree inorderTraversal(s[1]) sys.stdout.write(str(s[0]) + " ") if (s[2] != []): # right subtree inorderTraversal(s[2]) def process(m): sys.stdout.write("INPUT DATA: " + str(m) + '\n') r = binarySearchTree(m) sys.stdout.write("TREE AS STRING: " + str(r) + '\n') sys.stdout.write("INORDER TRAVERSAL: ") inorderTraversal(r) sys.stdout.write("\n\n") m = ["Mon","Tue","Wed","Thu","Fri","Sat","Sun", "Jan","Feb","Mar","Apr","May","Jun", "Jul","Aug","Sep","Oct","Nov","Dec",] process(m) n = [10.5, 3, 11.25, 17, 6.75, 7, 19, 23.5, 14, 19.75, 12, 4, 1, 5.5] process(n)

Output (4 newlines inserted into the lists)

% python tree.py INPUT DATA: ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'] TREE AS STRING: ['Mon', ['Fri', ['Feb', ['Apr', [], ['Aug', [], ['Dec', [], []]]], []], ['Jan', [], ['Mar', ['Jun', ['Jul', [], []], []], ['May', [], []]]]], ['Tue', ['Thu', ['Sat', ['Oct', ['Nov', [], []], []], ['Sun', ['Sep', [], []], []]], []], ['Wed', [], []]]] INORDER TRAVERSAL: Apr Aug Dec Feb Fri Jan Jul Jun Mar May Mon Nov Oct Sat Sep Sun Thu Tue Wed INPUT DATA: [10.5, 3, 11.25, 17, 6.75, 7, 19, 23.5, 14, 19.75, 12, 4, 1, 5.5] TREE AS STRING: [10.5, [3, [1, [], []], [6.75, [4, [], [5.5, [], []]], [7, [], []]]], [11.25, [], [17, [14, [12, [], []], []], [19, [], [23.5, [19.75, [], []], []]]]]] INORDER TRAVERSAL: 1 3 4 5.5 6.75 7 10.5 11.25 12 14 17 19 19.75 23.5

Output (4 newlines inserted into the lists)
% python tree.py INPUT DATA: ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'] TREE AS STRING: ['Mon', ['Fri', ['Feb', ['Apr', [], ['Aug', [], ['Dec', [], []]]], []], ['Jan', [], ['Mar', ['Jun', ['Jul', [], []], []], ['May', [], []]]]], ['Tue', ['Thu', ['Sat', ['Oct', ['Nov', [], []], []], ['Sun', ['Sep', [], []], []]], []], ['Wed', [], []]]] INORDER TRAVERSAL: Apr Aug Dec Feb Fri Jan Jul Jun Mar May Mon Nov Oct Sat Sep Sun Thu Tue Wed INPUT DATA: [10.5, 3, 11.25, 17, 6.75, 7, 19, 23.5, 14, 19.75, 12, 4, 1, 5.5] TREE AS STRING: [10.5, [3, [1, [], []], [6.75, [4, [], [5.5, [], []]], [7, [], []]]], [11.25, [], [17, [14, [12, [], []], []], [19, [], [23.5, [19.75, [], []], []]]]]] INORDER TRAVERSAL: 1 3 4 5.5 6.75 7 10.5 11.25 12 14 17 19 19.75 23.5

Items of Interest or for study:

Notice that the same program works with an array of strings or with a mixed array of floats and ints. (I had to add one str() call, needed for the floats and ints, but doing nothing to the strings.)

The reference above created the tree using code like the following, but I needed to write the treeInsert function so that it also handled the initial case. (Later, when I scan for words, it can be awkward to produce the first element by itself.)

def binaryTree(a): # insert root return [a, [], []] def treeInsert(s,a): while True: if a == s[0]: # found ... (assumes root present) r = binaryTree(m[0]) for i in range(1,len(m)): treeInsert(r,m[i])

I had trouble writing the part of treeInsert that handles the root first. It was essential to pass the (empty) list as a parameter and modify that list. In this way, changes to the list appear outside the function. Alternatively, one could return the list. (Probably better to do it that way.)

6.3. Concordance Program: A concordance is a listing of all words from a given text, along with the lines of the text where each word occurs. Below I use for the text well-known poem by William Blake. In order to get the listing of line numbers, I only needed to patch in another list (of line numbers) at each node.
Here is a fairly simple program showing the main ideas.

n Concordance Using Lists, Source Program

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

# poem.py: create concordance import sys import string num = 0 # global line count word_count = 0 # global def treeInsert(s, w, n): if s == []: # first time through s.append([]) s.append([]) s.append([]) s[0] = (w, [n]) else: while True: if w == s[0][0]: # found if not n in s[0][1]: s[0][1].insert(len(s[0][1]), n) return elif w < s[0][0]: # left if s[1] == []: # not found, insert s[1] = [(w, [n]), [], []] else: s = s[1] # go left elif w > s[0][0]: # right if s[2] == []: # not found, insert s[2] = [(w, [n]), [], []] else: s = s[2] # go right
def inorderTraversal(s): global word_count if (s[1] != []): inorderTraversal(s[1]) word_count += 1 sys.stdout.write(("%4i" % word_count) + " " + s[0][0]) for i in range(0, 17 - len(s[0][0])): sys.stdout.write(' ') for i in s[0][1]: sys.stdout.write(str(i) + " ") sys.stdout.write('\n') if (s[2] != []): inorderTraversal(s[2]) def binaryTree(filename): global num f = open(filename,'r') # open source file r = [] for line in f: num += 1 wp = line.split() for w in wp: w = w.strip(string.punctuation).lower() if len(w) > 0: treeInsert(r, w, num) return r r = binaryTree("poem.txt") inorderTraversal(r)
Here's a run with a small input text:

Concordance for a Short Poem

Source Text Concordance

01 And did those feet in ancient time 02 Walk upon England's mountains green? 03 And was the holy Lamb of God 04 On England's pleasant pastures seen? 05 06 And did the Countenance Divine 07 Shine forth upon our clouded hills? 08 And was Jerusalem builded here 09 Among these dark Satanic Mills? 10 11 Bring me my bow of burning gold! 12 Bring me my arrows of desire! 13 Bring me my spear! O clouds, unfold! 14 Bring me my chariot of fire! 15 16 I will not cease from mental fight, 17 Nor shall my sword sleep in my hand, 18 Till we have built Jerusalem 19 In England's green and pleasant land.
1 among 9 2 ancient 1 3 and 1 3 6 8 19 4 arrows 12 5 bow 11 6 bring 11 12 13 14 7 builded 8 8 built 18 9 burning 11 10 cease 16 11 chariot 14 12 clouded 7 13 clouds 13 14 countenance 6 15 dark 9 16 desire 12 17 did 1 6 18 divine 6 19 england's 2 4 19 20 feet 1 21 fight 16 22 fire 14 23 forth 7 24 from 16 25 god 3 26 gold 11 27 green 2 19 28 hand 17 29 have 18 30 here 8 31 hills 7 32 holy 3 33 i 16 34 in 1 17 19
35 jerusalem 8 18 36 lamb 3 37 land 19 38 me 11 12 13 14 39 mental 16 40 mills 9 41 mountains 2 42 my 11 12 13 14 17 43 nor 17 44 not 16 45 o 13 46 of 3 11 12 14 47 on 4 48 our 7 49 pastures 4 50 pleasant 4 19 51 satanic 9 52 seen 4 53 shall 17 54 shine 7 55 sleep 17 56 spear 13 57 sword 17 58 the 3 6 59 these 9 60 those 1 61 till 18 62 time 1 63 unfold 13 64 upon 2 7 65 walk 2 66 was 3 8 67 we 18 68 will 16
Here is the list chopped into 80-character lines:

Main String for Short Poem, Length=1630

[('and', [1, 3, 5, 7, 16]), [('ancient', [1]), [('among', [8]), [], []], []], [( 'did', [1, 5]), [('countenance', [5]), [('clouded', [6]), [('builded', [7]), [(' bring', [9, 10, 11, 12]), [('bow', [9]), [('arrows', [10]), [], []], []], []], [ ('burning', [9]), [('built', [15]), [], []], [('chariot', [12]), [('cease', [13] ), [], []], []]]], [('clouds', [11]), [], []]], [('dark', [8]), [], [('desire', [10]), [], []]]], [('those', [1]), [('feet', [1]), [('england', [2, 4, 16]), [(' divine', [5]), [], []], []], [('in', [1, 14, 16]), [('green', [2, 16]), [('god', [3]), [('forth', [6]), [('fire', [12]), [('fight', [13]), [], []], []], [('from ', [13]), [], []]], [('gold', [9]), [], []]], [('holy', [3]), [('hills', [6]), [ ('here', [7]), [('hand', [14]), [], [('have', [15]), [], []]], []], []], [('i', [13]), [], []]]], [('s', [2, 4, 16]), [('mountains', [2]), [('lamb', [3]), [('je rusalem', [7, 15]), [], []], [('mills', [8]), [('me', [9, 10, 11, 12]), [('land' , [16]), [], []], [('mental', [13]), [], []]], []]], [('of', [3, 9, 10, 12]), [( 'my', [9, 10, 11, 12, 14]), [], [('o', [11]), [('not', [13]), [('nor', [14]), [] , []], []], []]], [('on', [4]), [], [('pleasant', [4, 16]), [('pastures', [4]), [('our', [6]), [], []], []], []]]]], [('the', [3, 5]), [('seen', [4]), [('satani c', [8]), [], []], [('shine', [6]), [('shall', [14]), [], []], [('spear', [11]), [('sleep', [14]), [], []], [('sword', [14]), [], []]]]], [('these', [8]), [], [ ]]]]]], [('time', [1]), [('till', [15]), [], []], [('walk', [2]), [('upon', [2, 6]), [('unfold', [11]), [], []], []], [('was', [3, 7]), [], [('will', [13]), [(' we', [15]), [], []], []]]]]]]]
Items of Interest or for study:

I especially had trouble using split() to break the source into words. At first I was using regular expressions along with split(), and there were too many options, with none of them seeming exactly what I wanted. It was very confusing. split() is also a string function. The approach I use above was recommended online and is very simple: The split() with no arguments breaks up only on whitespace, and the strip(string.punctuation) gets rid of punctuation at either end, but leaves embedded chars like a hyphen or an apostrophe. The string.punctuation is !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
At the high end, there is a fancy downloadable library NLTK that processes words in many sophisticated ways. (I haven't looked at it.)

Here are results of a run with a longer text, the Declaration of Independence: Source text (1335 words) and the Condordance text (539 distinct words).

Here is a concordance program written in C: tree.c. It has extra features not present in the Python program above, but still: 147 lines instead of 60, and parts of the C program are pretty complicated.

Patching the list of line numbers (the main change) was easy to do.

6.4. Concordance Program as a Class: Here I rewrite the concordance program as a class, adding various features.

n Concordance Program as a Class

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44

concordance.py: create concordance import sys import string class Concordance(object): def __init__(self, fp, ind=20, numw=5, linel=50): self.file_prefix = fp self.indent = ind # indent of line numbers self.num_width = numw # space for line #s self.line_length = linel # width of line # self.fsrc = open(fp + ".src.txt", 'r') # src self.flist = open(fp + ".list.txt", 'w') # l#s self.fcon = open(fp + ".con.txt", 'w') # con self.r = [] # list representing concordance self.num = 0 # line # of source self.word_count = 0 # count of distinct words def treeInsert(self, w, n): # self.r == [] first time through if self.r == []: self.r.append([]) self.r.append([]) self.r.append([]) self.r[0] = (w, [n]) else: s = self.r while True: if w == s[0][0]: # found if not n in s[0][1]: s[0][1].insert(len(s[0][1]),n) return elif w < s[0][0]: # left if s[1] == []: # not found, insert s[1] = [(w, [n]), [], []] else: s = s[1] # go left elif w > s[0][0]: # right if s[2] == []: # not found, insert s[2] = [(w, [n]), [], []] else: s = s[2] # go right
def inorderTraversal(self, s): if (s[1] != []): self.inorderTraversal(s[1]) self.word_count += 1 strfor = "%"+str(self.num_width)+"i" self.fcon.write((strfor % self.word_count) + " " + s[0][0]) self.fcon.write(' '*(self.indent - len(s[0][0])-1)) self.fcon.write(' ') x = 0 for i in s[0][1]: x += len(str(i)) + 1 if x < self.line_length: self.fcon.write(str(i) + " ") else: x = len(str(i)) self.fcon.write('\n' + ' '*(self.indent+self.num_width+1)+ str(i) + " ") self.fcon.write('\n') if (s[2] != []): self.inorderTraversal(s[2]) def binaryTree(self): for line in self.fsrc: if len(line) == 2: self.flist.write('\n') else: self.num += 1 # Produce a listing, numbered lines strfor = "%"+str(self.num_width)+"i" self.flist.write((strfor % self.num)+ " " + line) wp = line.split() for w in wp: w = w.strip(string.punctuation).\ lower() if len(w) > 0: self.treeInsert(w, self.num) c = Concordance("emma3", 16, 5, 100) c.binaryTree() c.inorderTraversal(c.r)
Items of Interest or for study:

This wasn't very hard to convert to a class, but I made a large number of small mistakes. I was using "%4i" in formatting, but wanted the size to be a variable (self.num_width). In the end I had to create the string "%"+str(self.num_width)+"i" first before using it.

I ran this with a good-sized novel (160,000 words, 7155 distinct words): "Emma" by Jane Austen. The results are: listing and concordance. I edited the original source to replace all references such as "Mr John Knightly" with "Mr+John+Knightly", which is then lowercased to "mr+john+knightly". In this way the references to people were greatly improved. (The split() only separates on whitespace as mentioned above.) Afterwards, I replaced each plus with a blank, so any references with a blank in them initially had a plus instead of the blank.

The source text must have used a carrige-return/line-feed at the end of each line, so I had to change my check for a blank line.

I added 4 arguments to the class, the last 3 with default values. This also reads source from a file, and writes a listing (with line numbers) and the concordance itself to separate files.

(Revision date: 2015-02-23. Please use ISO 8601, the International Standard.)