When do you organize your data? Every write, or... later?

Hash tables and B-trees said: every write. LSM trees said: later. Turns out, procrastinating is sometimes the right call.

Hash tables: The baseline

Hash tables give you O(1) lookups. You hash a key, go directly to that slot, done. Hard to beat.

But they have limitations:

  • No ordering: you can't ask "give me all keys between A and M"
  • No efficient scans: iterating means touching every slot, in no useful order
  • Resizing is expensive: when the table fills up, you rehash everything

For pure key-value lookups (memcached, Redis), hash tables are great. But databases need more.
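
To make that concrete, here's a toy sketch in Python (a plain dict standing in for a hash table; the keys are made up). The point lookup is one hash away; the range question degenerates into a full scan plus a sort.

    # A plain dict standing in for a hash table: point lookups are O(1),
    # but answering a range question means touching every entry.
    store = {"banana": 3, "apple": 5, "mango": 2, "cherry": 7}

    print(store["apple"])  # hash the key, jump to its slot, done

    # "Give me all keys between A and M": scan everything, then sort.
    print(sorted(k for k in store if "a" <= k < "n"))
    # ['apple', 'banana', 'cherry', 'mango']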

B-trees: The workhorse

B-trees solved the ordering problem. They keep keys sorted in a tree structure, giving you:

  • O(log n) lookups
  • Efficient range scans (just walk the leaves)
  • Good cache/disk utilization (nodes sized to pages)

This is the backbone of most relational databases: PostgreSQL, MySQL InnoDB, SQLite. On every write, you:

  1. Traverse the tree to find the right page
  2. Read that page from disk
  3. Modify it in place
  4. Write it back
  5. Maybe split or merge pages, triggering more writes

Your data is always organized. Lookups are predictable. But you're paying the bookkeeping cost on every operation, and that cost is random I/O.[1]
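
For contrast, here's a minimal sketch of the sorted-index idea using Python's bisect module. A sorted list is not a B-tree, and the page-write path above isn't shown, but lookups and range scans work the same way: binary search to find a key, a contiguous slice to scan a range.

    import bisect

    # Not a real B-tree, just the same sorted-index idea: keys kept in order,
    # point lookups by binary search, range scans by slicing out neighbors.
    keys = ["apple", "banana", "cherry", "mango", "peach"]
    vals = [5, 3, 7, 2, 4]

    def get(key):
        i = bisect.bisect_left(keys, key)
        return vals[i] if i < len(keys) and keys[i] == key else None

    def range_scan(lo, hi):
        # A contiguous run of sorted keys: this is what "walk the leaves" buys you.
        i, j = bisect.bisect_left(keys, lo), bisect.bisect_right(keys, hi)
        return list(zip(keys[i:j], vals[i:j]))

    print(get("cherry"))         # 7
    print(range_scan("a", "m"))  # [('apple', 5), ('banana', 3), ('cherry', 7)]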

The log-structured insight

In 1991, Rosenblum and Ousterhout asked a different question: what if we never modified data in place?[2]

Their Log-structured File System (LFS) treated the entire disk as an append-only log. Writes always go to the end. Old data gets garbage collected later.

This sounds wasteful, but it converts random writes into sequential writes. Sequential I/O is 10-100x faster on spinning disks, still 5-20x faster on SSDs.[3]
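
Here's a minimal sketch of the append-only idea in Python (the record format and file name are invented). Every write is an append; an update is just a newer record for the same key. Real log-structured systems keep an index so reads don't have to scan the whole log.

    import json

    # Every write is an append; nothing is modified in place.
    def log_write(path, key, value):
        with open(path, "a") as f:
            f.write(json.dumps({"k": key, "v": value}) + "\n")

    def log_read(path, key):
        latest = None
        with open(path) as f:            # scan the log; the last record for a key wins
            for line in f:
                rec = json.loads(line)
                if rec["k"] == key:
                    latest = rec["v"]
        return latest

    log_write("data.log", "user:1", {"name": "Ada"})
    log_write("data.log", "user:1", {"name": "Ada Lovelace"})  # "update" = new append
    print(log_read("data.log", "user:1"))  # {'name': 'Ada Lovelace'}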

The same insight shows up in filesystems:

Filesystem   | Write strategy       | Philosophy
ext4, XFS    | In-place + journal   | Established: fix it now
btrfs, ZFS   | Copy-on-write        | Next-gen: never overwrite, clean up later
bcachefs     | Copy-on-write + LSM  | Next-gen: both techniques combined

Copy-on-write filesystems are basically applying log-structured patterns to the filesystem layer.

LSM trees: Log-structured meets sorted

LSM trees (Log-Structured Merge trees) took the append-only insight and applied it to sorted indexes.[4] This is what powers RocksDB, LevelDB, Cassandra, and ScyllaDB.

The design:

  1. Writes go to memory first, in a sorted buffer called a memtable
  2. When it fills up, flush to disk as a sorted file called an SSTable
  3. Periodically merge SSTables in the background (compaction)

Instead of organizing on every write, you batch it up and do it asynchronously. The bookkeeping still happens, just later, in bulk, when sequential I/O makes it cheap.
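
Here's a toy version of that write path in Python. The class, file names, and flush threshold are all invented, and compaction is left out entirely; the point is the shape: writes land in memory, a full memtable becomes one sorted file, and reads check the newest data first.

    import json

    class TinyLSM:
        def __init__(self, memtable_limit=4):
            self.memtable = {}     # recent writes, held in memory
            self.sstables = []     # flushed, immutable sorted files (newest last)
            self.limit = memtable_limit

        def put(self, key, value):
            self.memtable[key] = value        # no disk touched yet
            if len(self.memtable) >= self.limit:
                self.flush()

        def flush(self):
            path = f"sstable_{len(self.sstables)}.json"
            with open(path, "w") as f:        # one sequential write for the whole batch
                json.dump(dict(sorted(self.memtable.items())), f)
            self.sstables.append(path)
            self.memtable = {}

        def get(self, key):
            if key in self.memtable:          # newest data first
                return self.memtable[key]
            for path in reversed(self.sstables):   # then SSTables, newest to oldest
                with open(path) as f:
                    table = json.load(f)
                if key in table:
                    return table[key]
            return None

    db = TinyLSM()
    for i in range(10):
        db.put(f"key{i}", i)
    print(db.get("key3"), db.get("key9"))   # 3 9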

Don't you end up doing the same work?

No. Way less.

By batching hundreds of writes into a single sequential flush, you're amortizing the I/O cost across all of them. One disk seek instead of hundreds.

But reads must be slower?

In theory, yes. You might check multiple SSTables to find your data.

In practice:

  • Bloom filters on each SSTable tell you which files definitely don't have your key
  • Recent data is usually still in the memtable (in memory)
  • Hot data gets cached at multiple levels

For many workloads, read performance is comparable to B-trees. For write-heavy workloads, LSM trees dominate.
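
To show what those Bloom filters are doing, here's a minimal sketch (the size and hash choice are arbitrary; real implementations tune both). Added keys always answer "maybe here"; everything else almost always answers "definitely not", which lets a read skip most SSTables without opening them.

    import hashlib

    class BloomFilter:
        def __init__(self, size=1024, hashes=3):
            self.size, self.hashes = size, hashes
            self.bits = [False] * size

        def _positions(self, key):
            for i in range(self.hashes):
                digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos] = True

        def might_contain(self, key):
            # No false negatives: an added key always comes back True.
            return all(self.bits[pos] for pos in self._positions(key))

    bf = BloomFilter()
    bf.add("key42")
    print(bf.might_contain("key42"))    # True, always
    print(bf.might_contain("key999"))   # almost certainly False: skip that SSTable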

Temporal vs spatial locality

Here's where it gets interesting.

Spatial locality assumes data whose keys are close together should be stored together. B-trees optimize for this: record #1000 is stored near record #1001.

Temporal locality assumes data written at the same time should be stored together. LSM trees optimize for this: everything flushed in the same batch ends up in the same SSTable.

Why does temporal locality matter? Because it matches how systems actually behave:

Scenario           | Why temporal locality helps
npm install        | Thousands of files written in a burst; they belong together
App startup        | Reads config and state that was all persisted together
Checkpoint restore | Reading a consistent snapshot
Transactions       | Records committed together are accessed together

Data you need right now was usually written together at some point. Grouping by write-time often matches access patterns better than grouping by key.

The trade-offs (there are always trade-offs)

LSM trees optimize for writes at some cost:

  • Write amplification: data gets rewritten multiple times during compaction
  • Space amplification: you temporarily store redundant data
  • Read amplification: worst-case reads check multiple levels

For write-heavy workloads (logging, time-series, event streams), the trade-off is worth it. For read-heavy OLTP with lots of point queries, B-trees still make sense.
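
For a feel of the write-amplification number, here's a rough back-of-envelope under leveled compaction. The figures are illustrative assumptions, not measurements from any real system.

    # Rough worst-case arithmetic for write amplification under leveled
    # compaction (illustrative assumptions, not measurements).
    fanout = 10    # each level holds ~10x the data of the level above it
    levels = 5     # number of on-disk levels

    # Pushing data down one level merges it with up to ~fanout as much existing
    # data, so in the worst case each byte is rewritten ~fanout times per level.
    compaction_writes = fanout * levels
    wal_and_flush = 2               # write-ahead log entry + initial memtable flush
    print(wal_and_flush + compaction_writes)   # ~52x in this worst-case sketch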

So what?

Back to our opening question: when do you pay the bookkeeping cost? B-trees say now, on every write. LSM trees say later, in bulk, when sequential I/O makes it cheap.

That's the trick. Procrastination, done right, can be an advantage.

We're seeing this pattern show up in unexpected places. Cloud block storage systems are adopting log-structured designs to handle virtualized storage at scale.[5] And with AI training flushing massive checkpoints every few minutes, the demand for write-optimized storage is only growing. This might be the architecture for the next generation of storage infrastructure.

And that's my love letter to the LSM tree. It's not real magic, but it's damn close.

[1] Graefe. "Modern B-Tree Techniques". Foundations and Trends in Databases, 2011.

[2] Rosenblum & Ousterhout. "The Design and Implementation of a Log-Structured File System". SOSP, 1991.

[3] Jacobs. "The Pathologies of Big Data". ACM Queue, 2009.

[4] O'Neil et al. "The Log-Structured Merge-Tree". Acta Informatica, 1996.

[5] Zhang et al. "What's the Story in EBS Glory: Evolutions and Lessons in Building Cloud Block Store". USENIX FAST, 2024.