Jul 15, 2012

Teaching data structures with real-world examples

I loved the data structures courses. They were just so cute. We would all write our own snowflake implementations of AVL trees and have fun inserting and deleting integers from them.

What was missing was an appreciation of how, and why, particular data structures are used in the real world, embedded in real systems.

So I’ve been wondering, while teaching a DS course, what would be a good way to show the students non-toy examples of how they’re used?

I think the Linux kernel is an excellent example.

Let’s look at how the kernel puts some data structures through the paces.

Red-black trees

From the kernel docs:

There are a number of red-black trees in use in the kernel. The deadline and CFQ IO schedulers employ rbtrees to track requests; the packet CD/DVD driver does the same. The high-resolution timer code uses an rbtree to organize outstanding timer requests. The ext3 filesystem tracks directory entries in a red-black tree. Virtual memory areas (VMAs) are tracked with red-black trees, as are epoll file descriptors, cryptographic keys, and network packets in the “hierarchical token bucket” scheduler.

The kernel implementation is different from most naive implementations, and that teaches you something about performance and cache locality:

The Linux rbtree implementation is optimized for speed, and thus has one less layer of indirection (and better cache locality) than more traditional tree implementations. Instead of using pointers to separate rb_node and data structures, each instance of struct rb_node is embedded in the data structure it organizes.

Radix trees

Again, the kernel imlementation is different from the standard one:

A Linux radix tree is a mechanism by which a (pointer) value can be associated with a (long) integer key. It is reasonably efficient in terms of storage, and is quite quick on lookups. Additionally, radix trees in the Linux kernel have some features driven by kernel-specific needs, including the ability to associate tags with specific entries.

The most widespread use of radix trees, however, is in the memory management code. The address_space structure used to keep track of backing store contains a radix tree which tracks in-core pages tied to that mapping. Among other things, this tree allows the memory management code to quickly find pages which are dirty or under writeback.

Possible Assignments and Questions

The kernel doesn’t have any “fancy” data structures. For example, there are no skip lists? Why do you think that is? (E.g. see these attempts)
Swap out a data structure with another one (AVL trees instead of red-black?) and see if can improve performance, or simplify the code.

Such exercises place the data structure in the larger context of a whole application and real workloads, which is much more interesting than just implementing a standalone data structure and exercising it.