tree.c

tree.tar.gz
- DOC.ftree
- Makefile
- dumpftree.c
- ftree.c
- ftree.h
- main.c (command line utility for ftree.c)
- main2.c (test harness for tree.c)
- tree.c
- tree.h

tree

This is a memory-efficient dictionary tree with support for wildcards. Source is 11K, header is 2.5K (mostly documentation), which compiles down to about 4.5K of object code. Included in the tarball is a testing program (check the source to see what it does). This version supports substrings and removes the implicit wildcard at the end of each string.

In terms of memory usage, it is very easy to have it actually use less memory than just a flat file:

1409379 ASCII flat file
1032633 s[1] without http:// with bytecount
1032652 s[1] with http:// with bytecount
866443 s[0] with http:// with bytecount
960120 s[0] with http:// with TNSIZE()
The TNSIZE() reading is much more accurate than the others, but still does not take into account malloc() overhead, so I don't know the true figure. Such a figure would also vary greatly (including possibly lower) if the program using the tree also used malloc().

Speed appears to be quite good. I tested it by having it match its original input data file, once per line, and it took 4.32 usec per match on average. There is a read-write lock on each tree (currently just emulated with an int) so that many readers can traverse the tree, but allow for a writer to change it while new readers wait. The #defines in tree.c should be replaced with the appropriate calls to your thread library's read-write lock functions.

There is now support for storing an arbitrary non-structure data type along with each string, which is returned by tree_match (0 is returned on no match, so you are not allowed to add anything with a value of 0).

ftree

In addition to all the stuff above, there is a version of tree which uses an mmap'd file instead of malloc(), which allows for persistent storage and multi-process update. Synchronization is handled via flock(). Differences of ftree from tree:

lock()/match()/unlock() takes about 8.57 usec on the same data as used in the tree speed tests (tree uses msync and can be made to lock with fcntl, so it can be used over non-braindead NFS implementations).
The source is slightly larger; 20K source, 7K object code.
Storage is 1564672, which is more than above, but as that includes the "malloc() overhead" it's a 100% accurate figure.

The dumpftree utility will dump a node-by-node listing of an ftree, so that you can get an idea as to how the file is layed out, and how segmented it is. I may make an optimizeftree utility at some point, if there's enough demand for it.

To Do

Some minor FIXME's in the code
Make optimizeftree utility (may increase locality on accesses, to improve caching)
Use static and const in all applicable places

Questions? Concerns? Bugfixes? Please submit feedback to mike@pedantic.org.

1409379	ASCII flat file
1032633	s[1] without http:// with bytecount
1032652	s[1] with http:// with bytecount
866443	s[0] with http:// with bytecount
960120	s[0] with http:// with TNSIZE()