CS 3723
  Programming Languages  
  6.1 Trees as Lists   


6.1. Using Lists to Represent Trees: For this page's inspiration, see especially: Trees From Lists, which is a part of: Python Data Structures. They suggested that a rooted tree with data at each node could be represented recursively by a list, with 0th element the data, 1st element the leftmost subtree, 2nd element the next subtree, and so forth. The absence of a subtree would be an empty list [ ]. In the case of a binary tree, the recursive definition of each node would be:

    [data, left subtree as list, right subtree as list]

For example, using ["Mon","Tue","Wed","Thu","Fri"] as data to build a binary search tree yields the list:

    ['Mon', ['Fri', [ ], [ ] ], ['Tue', ['Thu', [ ], [ ] ], ['Wed', [ ], [ ] ] ] ]

If a is a reference to a list = subtree, then a[0],a[1],a[2] give respectively, the data, the left subtree, and the right subtree. Either subtree can be empty.


6.2. Binary Search Tree with Inorder Traversal: Here is a program implementing two algorithms: creation of a binary search tree and an inorder traversal:

nRepresenting a Tree Using a List, Binary Search Tree and Inorder Traversal
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
# tree.py: tree using lists
import sys

def treeInsert(s,a):
    if s == []: # first time through
        s.append(a)
        s.append([])
        s.append([])
    else:
        while True:
            if a == s[0]: # found
                return 1
            elif a < s[0]: # left
                if s[1] == []:
                    # not found, insert
                    s[1] = [a,[],[]]
                    return 0
                else:
                    s = s[1] # go left
            elif a > s[0]: # right
                if s[2] == []:
                    # not found, insert
                    s[2] = [a,[],[]]
                    return 0
                else:
                    s = s[2] # go right
def binarySearchTree(m):
    r = [] # build tree from scratch
    for i in range(0,len(m)):
        stat = treeInsert(r,m[i])
    return r

def inorderTraversal(s):
    if (s[1] != []): # left subtree
        inorderTraversal(s[1])
    sys.stdout.write(str(s[0]) + " ")
    if (s[2] != []): # right subtree
        inorderTraversal(s[2])

def process(m):
    sys.stdout.write("INPUT DATA:        " + str(m) + '\n')
    r = binarySearchTree(m)
    sys.stdout.write("TREE AS STRING:    " + str(r) + '\n')
    sys.stdout.write("INORDER TRAVERSAL: ")
    inorderTraversal(r)
    sys.stdout.write("\n\n")

m = ["Mon","Tue","Wed","Thu","Fri","Sat","Sun",
     "Jan","Feb","Mar","Apr","May","Jun",
     "Jul","Aug","Sep","Oct","Nov","Dec",]
process(m)

n = [10.5,  3, 11.25, 17, 6.75, 7,  19,
     23.5, 14, 19.75, 12, 4,    1, 5.5]
process(n)

Output (4 newlines inserted into the lists)
% python tree.py
INPUT DATA:        ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Jan', 'Feb', 'Mar', 
                    'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
TREE AS STRING:    ['Mon', ['Fri', ['Feb', ['Apr', [], ['Aug', [], ['Dec', [], []]]], []], 
                           ['Jan', [], ['Mar', ['Jun', ['Jul', [], []], []], ['May', [], []]]]],
                   ['Tue', ['Thu', ['Sat', ['Oct', ['Nov', [], []], []], ['Sun', ['Sep', [], []], []]], []],
                           ['Wed', [], []]]]
INORDER TRAVERSAL: Apr Aug Dec Feb Fri Jan Jul Jun Mar May Mon Nov Oct Sat Sep Sun Thu Tue Wed 

INPUT DATA:        [10.5, 3, 11.25, 17, 6.75, 7, 19, 23.5, 14, 19.75, 12, 4, 1, 5.5]
TREE AS STRING:    [10.5, [3, [1, [], []], [6.75, [4, [], [5.5, [], []]], [7, [], []]]],
                          [11.25, [], [17, [14, [12, [], []], []], [19, [], [23.5, [19.75, [], []], []]]]]]
INORDER TRAVERSAL: 1 3 4 5.5 6.75 7 10.5 11.25 12 14 17 19 19.75 23.5 

  Items of Interest or for study:
  • Notice that the same program works with an array of strings or with a mixed array of floats and ints. (I had to add one str() call, needed for the floats and ints, but doing nothing to the strings.)

  • The reference above created the tree using code like the following, but I needed to write the treeInsert function so that it also handled the initial case. (Later, when I scan for words, it can be awkward to produce the first element by itself.)

      def binaryTree(a): # insert root
          return [a, [], []]
      def treeInsert(s,a):
          while True:
              if a == s[0]: # found
              ... (assumes root present)
      r = binaryTree(m[0])
      for i in range(1,len(m)):
          treeInsert(r,m[i])
      

  • I had trouble writing the part of treeInsert that handles the root first. It was essential to pass the (empty) list as a parameter and modify that list. In this way, changes to the list appear outside the function. Alternatively, one could return the list. (Probably better to do it that way.)


6.3. Concordance Program: A concordance is a listing of all words from a given text, along with the lines of the text where each word occurs. Below I use for the text well-known poem by William Blake. In order to get the listing of line numbers, I only needed to patch in another list (of line numbers) at each node.

Here is a fairly simple program showing the main ideas.

nConcordance Using Lists, Source Program
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
# poem.py: create concordance
import sys
import string

num = 0 # global line count
word_count = 0 # global

def treeInsert(s, w, n):
    if s == []: # first time through
        s.append([])
        s.append([])
        s.append([])
        s[0] = (w, [n])
    else:
        while True:
            if w == s[0][0]: # found
                if not n in s[0][1]:
                    s[0][1].insert(len(s[0][1]), n)
                return
            elif w < s[0][0]: # left
                if s[1] == []: # not found, insert
                    s[1] = [(w, [n]), [], []]
                else:
                    s = s[1] # go left
            elif w > s[0][0]: # right
                if s[2] == []: # not found, insert
                    s[2] = [(w, [n]), [], []]
                else:
                    s = s[2] # go right
def inorderTraversal(s):
    global word_count
    if (s[1] != []):
        inorderTraversal(s[1])
    word_count += 1
    sys.stdout.write(("%4i" % word_count) +
      " " + s[0][0])
    for i in range(0, 17 - len(s[0][0])):
        sys.stdout.write(' ')
    for i in s[0][1]:
        sys.stdout.write(str(i) + " ")
    sys.stdout.write('\n')
    if (s[2] != []):
        inorderTraversal(s[2])

def binaryTree(filename):
    global num
    f = open(filename,'r') # open source file
    r = []
    for line in f:
        num += 1
        wp = line.split()
        for w in wp:
            w = w.strip(string.punctuation).lower()
            if len(w) > 0:
                treeInsert(r, w, num)
    return r
        
r = binaryTree("poem.txt")
inorderTraversal(r)

Here's a run with a small input text:

Concordance for a Short Poem
Source Text Concordance
01 And did those feet in ancient time
02 Walk upon England's mountains green?
03 And was the holy Lamb of God
04 On England's pleasant pastures seen?
05
06 And did the Countenance Divine
07 Shine forth upon our clouded hills?
08 And was Jerusalem builded here
09 Among these dark Satanic Mills?
10
11 Bring me my bow of burning gold!
12 Bring me my arrows of desire!
13 Bring me my spear! O clouds, unfold!
14 Bring me my chariot of fire!
15
16 I will not cease from mental fight,
17 Nor shall my sword sleep in my hand,
18 Till we have built Jerusalem
19 In England's green and pleasant land.
 1 among       9 
 2 ancient     1 
 3 and         1 3 6 8 19 
 4 arrows      12 
 5 bow         11 
 6 bring       11 12 13 14 
 7 builded     8 
 8 built       18 
 9 burning     11 
10 cease       16 
11 chariot     14 
12 clouded     7 
13 clouds      13 
14 countenance 6 
15 dark        9 
16 desire      12 
17 did         1 6 
18 divine      6 
19 england's   2 4 19 
20 feet        1 
21 fight       16 
22 fire        14 
23 forth       7 
24 from        16 
25 god         3 
26 gold        11 
27 green       2 19 
28 hand        17 
29 have        18 
30 here        8 
31 hills       7 
32 holy        3 
33 i           16 
34 in          1 17 19 
35 jerusalem   8 18 
36 lamb        3 
37 land        19 
38 me          11 12 13 14 
39 mental      16 
40 mills       9 
41 mountains   2 
42 my          11 12 13 14 17 
43 nor         17 
44 not         16 
45 o           13 
46 of          3 11 12 14 
47 on          4 
48 our         7 
49 pastures    4 
50 pleasant    4 19 
51 satanic     9 
52 seen        4 
53 shall       17 
54 shine       7 
55 sleep       17 
56 spear       13 
57 sword       17 
58 the         3 6 
59 these       9 
60 those       1 
61 till        18 
62 time        1 
63 unfold      13 
64 upon        2 7 
65 walk        2 
66 was         3 8 
67 we          18 
68 will        16 

Here is the list chopped into 80-character lines:

Main String for Short Poem, Length=1630
[('and', [1, 3, 5, 7, 16]), [('ancient', [1]), [('among', [8]), [], []], []], [(
'did', [1, 5]), [('countenance', [5]), [('clouded', [6]), [('builded', [7]), [('
bring', [9, 10, 11, 12]), [('bow', [9]), [('arrows', [10]), [], []], []], []], [
('burning', [9]), [('built', [15]), [], []], [('chariot', [12]), [('cease', [13]
), [], []], []]]], [('clouds', [11]), [], []]], [('dark', [8]), [], [('desire', 
[10]), [], []]]], [('those', [1]), [('feet', [1]), [('england', [2, 4, 16]), [('
divine', [5]), [], []], []], [('in', [1, 14, 16]), [('green', [2, 16]), [('god',
 [3]), [('forth', [6]), [('fire', [12]), [('fight', [13]), [], []], []], [('from
', [13]), [], []]], [('gold', [9]), [], []]], [('holy', [3]), [('hills', [6]), [
('here', [7]), [('hand', [14]), [], [('have', [15]), [], []]], []], []], [('i', 
[13]), [], []]]], [('s', [2, 4, 16]), [('mountains', [2]), [('lamb', [3]), [('je
rusalem', [7, 15]), [], []], [('mills', [8]), [('me', [9, 10, 11, 12]), [('land'
, [16]), [], []], [('mental', [13]), [], []]], []]], [('of', [3, 9, 10, 12]), [(
'my', [9, 10, 11, 12, 14]), [], [('o', [11]), [('not', [13]), [('nor', [14]), []
, []], []], []]], [('on', [4]), [], [('pleasant', [4, 16]), [('pastures', [4]), 
[('our', [6]), [], []], []], []]]]], [('the', [3, 5]), [('seen', [4]), [('satani
c', [8]), [], []], [('shine', [6]), [('shall', [14]), [], []], [('spear', [11]),
 [('sleep', [14]), [], []], [('sword', [14]), [], []]]]], [('these', [8]), [], [
]]]]]], [('time', [1]), [('till', [15]), [], []], [('walk', [2]), [('upon', [2, 
6]), [('unfold', [11]), [], []], []], [('was', [3, 7]), [], [('will', [13]), [('
we', [15]), [], []], []]]]]]]]

Items of Interest or for study:

  • I especially had trouble using split() to break the source into words. At first I was using regular expressions along with split(), and there were too many options, with none of them seeming exactly what I wanted. It was very confusing. split() is also a string function. The approach I use above was recommended online and is very simple: The split() with no arguments breaks up only on whitespace, and the strip(string.punctuation) gets rid of punctuation at either end, but leaves embedded chars like a hyphen or an apostrophe. The string.punctuation is !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

    At the high end, there is a fancy downloadable library NLTK that processes words in many sophisticated ways. (I haven't looked at it.)

  • Here are results of a run with a longer text, the Declaration of Independence: Source text (1335 words) and the Condordance text (539 distinct words).

  • Here is a concordance program written in C: tree.c. It has extra features not present in the Python program above, but still: 147 lines instead of 60, and parts of the C program are pretty complicated.

  • Patching the list of line numbers (the main change) was easy to do.


6.4. Concordance Program as a Class: Here I rewrite the concordance program as a class, adding various features.

nConcordance Program as a Class
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
concordance.py: create concordance
import sys
import string

class Concordance(object):
    def __init__(self, fp, ind=20, numw=5, linel=50):
        self.file_prefix = fp
        self.indent = ind # indent of line numbers
        self.num_width = numw # space for line #s
        self.line_length = linel # width of line #
        self.fsrc  = open(fp + ".src.txt",  'r') # src
        self.flist = open(fp + ".list.txt", 'w') # l#s
        self.fcon  = open(fp + ".con.txt",  'w') # con
        self.r = [] # list representing concordance
        self.num = 0 # line # of source
        self.word_count = 0 # count of distinct words
    
    def treeInsert(self, w, n):
        # self.r == [] first time through
        if self.r == []:
            self.r.append([])
            self.r.append([])
            self.r.append([])
            self.r[0] = (w, [n])
        else:
            s = self.r
            while True:
                if w == s[0][0]: # found
                    if not n in s[0][1]:
                        s[0][1].insert(len(s[0][1]),n)
                    return
                elif w < s[0][0]: # left
                    if s[1] == []: # not found, insert
                        s[1] = [(w, [n]), [], []]
                    else:
                        s = s[1] # go left
                elif w > s[0][0]: # right
                    if s[2] == []: # not found, insert
                        s[2] = [(w, [n]), [], []]
                    else:
                        s = s[2] # go right
    def inorderTraversal(self, s):
        if (s[1] != []):
            self.inorderTraversal(s[1])
        self.word_count += 1
        strfor = "%"+str(self.num_width)+"i"
        self.fcon.write((strfor % self.word_count) +
          " " + s[0][0])
        self.fcon.write(' '*(self.indent -
          len(s[0][0])-1))
        self.fcon.write(' ')
        x = 0
        for i in s[0][1]:
            x += len(str(i)) + 1
            if x < self.line_length:
                self.fcon.write(str(i) + " ")
            else:
                x = len(str(i))
                self.fcon.write('\n' +
                  ' '*(self.indent+self.num_width+1)+
                  str(i) + " ")
        self.fcon.write('\n')
        if (s[2] != []):
            self.inorderTraversal(s[2])
    
    def binaryTree(self):
        for line in self.fsrc:
            if len(line) == 2:
                self.flist.write('\n')
            else:
                self.num += 1
                # Produce a listing, numbered lines
                strfor = "%"+str(self.num_width)+"i"
                self.flist.write((strfor % self.num)+
                  " " + line)
                wp = line.split()
                for w in wp:
                    w = w.strip(string.punctuation).\
                      lower()
                    if len(w) > 0:
                        self.treeInsert(w, self.num)

c = Concordance("emma3", 16, 5, 100)
c.binaryTree()
c.inorderTraversal(c.r)

Items of Interest or for study:

  • This wasn't very hard to convert to a class, but I made a large number of small mistakes. I was using "%4i" in formatting, but wanted the size to be a variable (self.num_width). In the end I had to create the string "%"+str(self.num_width)+"i" first before using it.

  • I ran this with a good-sized novel (160,000 words, 7155 distinct words): "Emma" by Jane Austen. The results are: listing and concordance. I edited the original source to replace all references such as "Mr John Knightly" with "Mr+John+Knightly", which is then lowercased to "mr+john+knightly". In this way the references to people were greatly improved. (The split() only separates on whitespace as mentioned above.) Afterwards, I replaced each plus with a blank, so any references with a blank in them initially had a plus instead of the blank.

  • The source text must have used a carrige-return/line-feed at the end of each line, so I had to change my check for a blank line.

  • I added 4 arguments to the class, the last 3 with default values. This also reads source from a file, and writes a listing (with line numbers) and the concordance itself to separate files.

(Revision date: 2015-02-23. Please use ISO 8601, the International Standard.)