Gaurav Sarma

Your database issues a 4 KiB page write. The SSD physically writes 18.85 KiB. That is not a rounding error: on a Samsung PM9A3 filled to 90% capacity, running YCSB-A with a zipfian skew of 0.8, a single logical B-tree page update in LeanStore puts nearly five times as much data on flash as you intended to persist. MySQL and PostgreSQL behave similarly.

The culprit is not one layer. Writes get amplified inside the DBMS, then amplified again inside the SSD. Most systems optimize each layer independently, or not at all. A recent paper from TUM and TigerBeetle argues that the DBMS, not the filesystem or the SSD firmware, is best positioned to coordinate both. Their answer starts with a deceptively simple shift: stop doing in-place updates.

The Problem

Modern databases have spent years tuning for SSD reads: exploiting internal parallelism, hiding read latency, prefetching. Writes are a different story.

SSDs wear out. Unlike spinning disks, flash cells tolerate only a finite number of program/erase cycles. Enterprise drives are rated in DWPD (Drive Writes Per Day). A Samsung PM9A3 rated for five years at 1 DWPD on an 894 GB device can sustain about 11 MB/s of writes on average. LeanStore in the in-place configuration from the paper writes around 400 MB/s under YCSB-A, roughly 36 times that budget. At that rate the drive exhausts its five-year rated endurance in roughly seven weeks.
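The endurance arithmetic can be checked in a few lines. This is a back-of-the-envelope sketch using only the figures quoted above. It assumes a constant write rate, and it assumes the 894 GB figure is binary (894 GiB, which matches a nominal 960 GB drive); the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400

def sustainable_write_rate(capacity_bytes, dwpd):
    """Average bytes per second that the endurance rating allows."""
    return capacity_bytes * dwpd / SECONDS_PER_DAY

def days_until_worn_out(capacity_bytes, dwpd, warranty_years, actual_rate):
    """Days until the rated write budget is used up at a constant write rate."""
    budget_bytes = capacity_bytes * dwpd * warranty_years * 365
    return budget_bytes / actual_rate / SECONDS_PER_DAY

# Assumption: "894 GB" is read as 894 GiB (about 960e9 bytes).
capacity = 894 * 2**30
rate = sustainable_write_rate(capacity, dwpd=1.0)
days = days_until_worn_out(capacity, 1.0, warranty_years=5, actual_rate=400e6)

print(f"sustainable rate: {rate / 1e6:.1f} MB/s")
print(f"rated endurance exhausted after {days:.0f} days ({days / 7:.1f} weeks)")
```

Either way you read the capacity, the result is a drive that lasts weeks rather than years at 400 MB/s.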

Write amplification is multiplicative. When the authors benchmarked LeanStore, MySQL, and PostgreSQL on a 90%-full enterprise SSD until cumulative DB writes exceeded four times device capacity, they decomposed every flash write into three categories (excluding WAL, which is comparatively easy to reason about):

User writes: evictions and checkpoint flushes, the bytes you actually intend to persist.
Other DBMS writes: overhead the engine adds on top, most notably doublewrite buffering.
Internal flash writes: amplification inside the SSD itself, measured via OCP telemetry when available.

For in-place LeanStore, a 4 KiB page write becomes 18.85 KiB at the flash layer. DB WAF is about 2.0 (doublewrite). SSD WAF is about 2.36. Total WAF is the product: roughly 4.7×.

The paper's central claim: Total WAF = DB WAF × SSD WAF, and optimizing one layer while ignoring the other can make things worse. An LSM-tree that reduces DB amplification via aggressive size-tiering may consume more SSD space, raise SSD WAF, and end up writing more flash overall.
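To make the multiplication concrete, here is a small sketch that derives both factors from the three write categories listed above. The function name and the byte counts are illustrative; the byte counts are chosen so that they reproduce the in-place figures quoted in this post (DB WAF 2.0, SSD WAF 2.36).

```python
def waf_breakdown(user_bytes, other_dbms_bytes, internal_flash_bytes):
    """Return (DB WAF, SSD WAF, total WAF) from the three write categories."""
    db_bytes = user_bytes + other_dbms_bytes          # what the DBMS sends to the SSD
    flash_bytes = db_bytes + internal_flash_bytes     # what the SSD programs into flash
    db_waf = db_bytes / user_bytes
    ssd_waf = flash_bytes / db_bytes
    return db_waf, ssd_waf, db_waf * ssd_waf

page = 4096
# Doublewrite adds one extra page; the SSD adds 1.36x of the DBMS-issued bytes.
db_waf, ssd_waf, total = waf_breakdown(
    user_bytes=page,
    other_dbms_bytes=page,
    internal_flash_bytes=1.36 * 2 * page,
)
print(f"DB WAF {db_waf:.2f} x SSD WAF {ssd_waf:.2f} = total {total:.2f}")
```

Because the DBMS-issued bytes cancel out, the product is simply flash bytes divided by user bytes, which is why shrinking one factor while inflating the other buys nothing.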

Prerequisites

Familiarity with B-tree storage engines, buffer pools, and checkpoint/eviction
Basic SSD concepts: pages, erase blocks, overprovisioning, garbage collection
Optional but helpful: awareness of ZNS (Zoned Namespace) and FDP (Flexible Data Placement) NVMe extensions

Technical Decisions

The authors implement their ideas in ZLeanStore, a fork of LeanStore (a high-performance B-tree engine for NVMe) extended with vmcache for 1:N page-id-to-offset mappings. The design choices below are the paper's, not generic best practices.

Why out-of-place instead of in-place?

In-place engines (InnoDB, PostgreSQL heap pages, LeanStore's original design) fix each page at a stable offset. Every update overwrites the same location. That creates two problems on flash:

Doublewrite buffering: Before overwriting a page, the DBMS copies it to a safe area so a torn write cannot corrupt the database. MySQL's doublewrite buffer, PostgreSQL's full-page writes, and similar mechanisms roughly double DB-issued bytes.
No control over SSD placement: The DBMS cannot choose where on the device its writes land. SSD WAF becomes whatever the device's internal GC produces for your access pattern, which can be surprisingly bad under skewed workloads.

Out-of-place writes append new versions elsewhere and update a mapping table. The old page stays valid until the new one is durable, so doublewrite goes away. The DBMS also gains freedom to group, compress, and place writes deliberately.

The trade-off: you now need database-level garbage collection to reclaim space from stale page versions. That GC has its own write amplification. The rest of the paper is about making that GC cheap and shaping the resulting write stream so the SSD stops amplifying it further.
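A toy model shows both halves of that trade-off. This is a minimal sketch, not ZLeanStore's data structure: the class name, the in-memory list standing in for the device, and the dictionary standing in for the mapping table are all assumptions made for illustration.

```python
class OutOfPlaceStore:
    """Append-only page store with a PID -> offset mapping table (toy model)."""

    def __init__(self):
        self.log = []             # stand-in for the device; list index = offset
        self.pid_to_offset = {}   # mapping table

    def write(self, pid, data):
        offset = len(self.log)
        self.log.append((pid, data))       # 1) the new version lands elsewhere
        self.pid_to_offset[pid] = offset   # 2) only then does the mapping switch
        return offset

    def read(self, pid):
        return self.log[self.pid_to_offset[pid]][1]

    def stale_slots(self):
        """Slots holding superseded versions: the space that GC must reclaim."""
        return len(self.log) - len(set(self.pid_to_offset.values()))


store = OutOfPlaceStore()
store.write(7, b"v1")
store.write(9, b"other page")
store.write(7, b"v2")   # the old copy of page 7 is never overwritten
print(store.read(7), store.stale_slots())
```

The old version of page 7 stays intact until the mapping flips, which is why no doublewrite is needed, and it then lingers as a stale slot, which is why garbage collection is.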

Why the DBMS, not the filesystem or SSD?

SSDs and filesystems see a stream of writes. They do not know which pages are hot, which will be overwritten in milliseconds, or which belong to the same B-tree index. The DBMS does.

Prior work on SSD WAF mitigation lives mostly at lower layers: F2FS hot/cold separation, SSD firmware lifetime prediction, multi-stream SSDs. These approaches infer workload properties. The DBMS already knows them from transaction semantics, index structure, and page access history.

The authors' position: treat total WAF as the optimization target and let the DBMS reshape its write pattern with both DB-level and SSD-level amplification in mind.

Why optimize DB WAF and SSD WAF together?

This is the counterintuitive part. Naïve out-of-place LeanStore increases total WAF before the other optimizations kick in. Removing doublewrite helps, but DB GC copyback dominates, pushing total WAF to 7.89 on an 800 GB dataset, roughly 1.67× the in-place baseline of 4.72.

Compression alone does not fix SSD WAF. NoWA, the paper's "No Write Amplification" write-ordering scheme covered in Phase 4, slightly increases DB WAF on its own via compensation writes. The full stack is what drives total WAF from 4.72 down to 0.60 on the same benchmark.

Implementation

The paper's write path has four cooperating components: buffer manager, I/O interface, space manager, and garbage collector. Here is how the optimizations fit together.

Phase 1: Page-wise compression and page packing

Compression reduces the bytes the DBMS actually writes. The authors compress each 4 KiB page independently with LZ4 or ZSTD before persistence. On TPC-C, YCSB, and several real-world datasets, compressed pages range from 14% to 49% of their original size depending on algorithm and workload.

In-place engines struggle here. A compressed page is often smaller than 4 KiB, but SSDs write in 4 KiB units. Overwrite the same offset in place and you still write a full block. Variable-length compressed pages also drift as data changes, making in-place layouts expensive to maintain. PostgreSQL largely skips page-level compression for this reason; MySQL pushes it to the filesystem layer, which does not solve the 4 KiB granularity problem.

Out-of-place writes sidestep this: compress a batch, append sequentially, record new offsets in a PID→offset table.

Page packing addresses read amplification. Enterprise SSDs read most efficiently at 4 KiB aligned boundaries. A 3,000-byte compressed page that straddles a 4 KiB boundary costs two physical reads (~2.73× amplification). Page packing uses best-fit bin packing to place compressed pages into 4 KiB slots so each page is fetched with exactly one read. You pay a small amount of internal slack per slot, but reads stay predictable and fast.
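Both the straddling cost and the packing step are easy to sketch. The code below is an illustration under assumptions: it uses best-fit on pages sorted largest-first (a common variant; the post only says "best-fit"), and the function names and sample sizes are made up.

```python
SLOT = 4096  # bytes per aligned slot

def blocks_touched(offset, size):
    """Number of 4 KiB blocks that must be read to fetch [offset, offset + size)."""
    return (offset + size - 1) // SLOT - offset // SLOT + 1

def best_fit_pack(sizes):
    """Pack compressed page sizes into 4 KiB slots; no page straddles a slot."""
    slots, free = [], []
    for size in sorted(sizes, reverse=True):
        fitting = [i for i, room in enumerate(free) if room >= size]
        if fitting:
            i = min(fitting, key=lambda j: free[j])   # tightest slot that still fits
        else:
            slots.append([])
            free.append(SLOT)
            i = len(slots) - 1
        slots[i].append(size)
        free[i] -= size
    return slots

# A 3,000-byte page starting at byte 2,000 crosses the 4,096-byte boundary.
reads = blocks_touched(2000, 3000)
print(reads, round(reads * SLOT / 3000, 2))

packed = best_fit_pack([3000, 1000, 2500, 1500, 900])
print(packed)
```

The slack left in each slot is the price for guaranteeing one read per page.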

Phase 2: Grouping by deathtime (GDT)

Database GC reclaims zones containing a mix of valid and stale pages. If a victim zone is 75% valid, reclaiming 25% of its space requires rewriting the other 75%: WAF of 4× for that cycle.

Greedy victim selection (pick the zone with the fewest valid pages) works under uniform random access. Real OLTP workloads are skewed. Hot and cold pages end up in the same zone, victim zones stay mostly valid, and GC becomes expensive.
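The relationship between valid ratio and GC cost is a one-line formula, and it explains why mixing hot and cold pages hurts. In this sketch the function names and the example zone layouts are illustrative assumptions.

```python
def gc_waf(valid_ratio):
    """Bytes written per byte of reclaimed space when the victim has this valid ratio."""
    if not 0 <= valid_ratio < 1:
        raise ValueError("valid_ratio must be in [0, 1)")
    return 1 / (1 - valid_ratio)

def pick_greedy_victim(zones):
    """Greedy policy: choose the zone with the lowest valid ratio."""
    return min(zones, key=zones.get)

# Hot and cold pages mixed: every zone stays mostly valid.
mixed = {"z0": 0.90, "z1": 0.75, "z2": 0.80}
# Hot and cold pages separated: one zone is almost entirely dead.
separated = {"hot": 0.05, "cold": 0.99}

for layout in (mixed, separated):
    victim = pick_greedy_victim(layout)
    print(victim, round(gc_waf(layout[victim]), 2))
```

Greedy selection is the same in both layouts; what changes the cost is how the pages were placed before GC ever ran.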

Grouping by Deathtime (GDT) uses DB semantics to colocate pages that will become invalid around the same time:

Each page header stores write timestamps from recent persist operations.
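Expected Deathtime (EDT) extrapolates when the page will next be overwritten: current_lsn + (WH₀ − WH₁), where WH₀ and WH₁ are the most recent write timestamps, so the added term is the page's recent write interval (the inverse of its write frequency), estimated from the last few intervals.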
On flush, pages with similar EDT are packed into the same zone.
GC selects victims whose pages have mostly died, minimizing copyback.

Initial writes group by B-tree index ID as a cold-start heuristic. During GC, valid pages are sorted by descending EDT and rewritten into zones with matching temperature. Read-only pages get maximum EDT and are treated as coldest data.

GDT only works paired with GDT-aware GC. If GC interleaves its rewrites with normal writes without respecting deathtime grouping, the placement benefit erodes.
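Here is one way to read the GDT steps as code. It is a sketch under assumptions, not the paper's implementation: EDT is taken as the current LSN plus the average of the recent write intervals, pages with too little history are treated as coldest, and "similar EDT" is approximated with fixed-width buckets whose width is an arbitrary illustrative parameter.

```python
import math
from collections import defaultdict

def expected_deathtime(current_lsn, write_history):
    """write_history lists recent write LSNs, newest first."""
    if len(write_history) < 2:
        return math.inf   # no interval to extrapolate from: treat as coldest
    intervals = [newer - older for newer, older in zip(write_history, write_history[1:])]
    return current_lsn + sum(intervals) / len(intervals)

def group_by_deathtime(pages, current_lsn, bucket_width):
    """Map bucket -> page ids; each bucket would be flushed into its own zone."""
    groups = defaultdict(list)
    for pid, history in pages.items():
        edt = expected_deathtime(current_lsn, history)
        if math.isinf(edt):
            bucket = math.inf
        else:
            bucket = int((edt - current_lsn) // bucket_width)
        groups[bucket].append(pid)
    return dict(groups)

pages = {
    1: [990, 980, 970],   # rewritten every 10 LSNs: hot
    2: [995, 985],        # hot
    3: [900, 400],        # rewritten every 500 LSNs: cool
    4: [500],             # written once: coldest
}
groups = group_by_deathtime(pages, current_lsn=1000, bucket_width=100)
print(groups)
```

Pages 1 and 2 end up together, so the zone holding them should die almost all at once and cost little to reclaim.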

Phase 3: Aligning DB and SSD garbage collection units

Even with low DB WAF, SSD internal GC can undo your gains. The key insight: writes from the same DB zone share similar deathtimes, so if you can make a zone's worth of data land in a single SSD superblock, invalidating that zone can invalidate an entire superblock. SSD GC then reclaims without copying surviving pages.

SSDs append at superblock granularity (grouping erase blocks across planes/dies) but may GC at a coarser or finer unit depending on vendor firmware. Misalignment interleaves pages from different zones inside the same superblock. When one zone is garbage-collected, only part of that superblock dies, and the SSD must copy the rest.

How to pick zone size:

FDP-enabled SSDs: query the Reclaim Unit (RU) size via NVMe and set DB zone size to match.
Standard SSDs: use a ZNS-like single-active-zone write pattern and increase zone size until measured SSD WAF drops to 1.0. On six enterprise drives tested, the inferred GC unit upper bound typically falls between 4 GB and 8 GB. Without telemetry, 32 GB is a conservative default.
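For standard SSDs, the search can be sketched as a loop. Everything here is an assumption for illustration: `measure_ssd_waf` is a hypothetical callback that would run the single-active-zone workload at a given zone size and return the SSD WAF observed through telemetry, the doubling step and tolerance are arbitrary choices, and the fake device exists only to make the example runnable.

```python
def pick_zone_size(measure_ssd_waf, start_gib=1, max_gib=32, tolerance=0.05):
    """Grow the zone size until measured SSD WAF is about 1.0.

    Falls back to max_gib (the conservative default) if no size qualifies.
    """
    size = start_gib
    while size <= max_gib:
        if measure_ssd_waf(size) <= 1.0 + tolerance:
            return size
        size *= 2
    return max_gib

# Fake device whose internal GC unit is 8 GiB: zone sizes that are not a
# multiple of it leave partially valid GC units behind.
def fake_device(size_gib):
    return 1.0 if size_gib % 8 == 0 else 1.6

print(pick_zone_size(fake_device))
```

On a real drive each measurement means writing several multiples of the device capacity, so this is a one-time calibration, not something to run at startup.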

Phase 4: NoWA on commodity SSDs

With multiple concurrent zones, write streams multiplex across superblocks. Zone A and zone C pages interleave in the same physical blocks. When DB GC reclaims zones at different rates, the SSD inherits partially valid superblocks and must amplify internally.

NoWA (No Write Amplification) enforces two rules:

Do not open a new active group of zones until all currently open zones are completely full.
Detect frequency imbalance among concurrently written zones and issue compensation writes to recompact underrepresented zones before the SSD hits its minimum free-space threshold.

Compensation writes shift a small amount of amplification from the SSD back to the DB layer. The DB GC can avoid counterproductive compensation by checking whether a write would raise valid ratios in future rounds. The net effect on the Samsung PM9A3: SSD WAF drops from 2.36 (in-place) to 1.0 with the full stack, while DB WAF stays near 0.60.
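The two rules can be illustrated with a toy model. This is a loose sketch and not the paper's algorithm: the class and function names are invented, sizes are counted in pages, and the compensation policy shown (top up lagging zones to the level of the fullest one) is a simplifying assumption.

```python
class ActiveGroup:
    """Zones that are written concurrently (toy model, sizes in pages)."""

    def __init__(self, zone_ids, zone_capacity):
        self.capacity = zone_capacity
        self.fill = {zone_id: 0 for zone_id in zone_ids}

    def append(self, zone_id, pages=1):
        if self.fill[zone_id] + pages > self.capacity:
            raise ValueError(f"zone {zone_id} would overflow")
        self.fill[zone_id] += pages

    def all_full(self):
        return all(f == self.capacity for f in self.fill.values())

    def lagging_zones(self):
        """Rule 2 (simplified): pages each zone lacks relative to the fullest zone."""
        top = max(self.fill.values())
        return {z: top - f for z, f in self.fill.items() if f < top}


def open_next_group(current, new_zone_ids):
    """Rule 1: refuse to open new zones while any open zone has room left."""
    if not current.all_full():
        raise RuntimeError("fill every open zone before opening new ones")
    return ActiveGroup(new_zone_ids, current.capacity)


group = ActiveGroup(["A", "B"], zone_capacity=4)
group.append("A", 4)   # zone A receives most of the traffic
group.append("B", 1)
for zone_id, missing in group.lagging_zones().items():
    group.append(zone_id, missing)   # compensation writes
next_group = open_next_group(group, ["C", "D"])
print(group.fill, next_group.fill)
```

The three pages written into zone B are pure DB-level overhead, which is the small amplification that NoWA trades for an SSD WAF of 1.0.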

NoWA is not perfect. Wear leveling, open superblock limits, and internal scheduling can still cause minor reordering. But the authors report SSD WAF = 1.0 on six enterprise SSDs from five vendors under YCSB-A.

Phase 5: ZNS and FDP as first-class backends

ZNS SSDs push GC to the host and enforce sequential writes within zones, guaranteeing SSD WAF = 1 by design. ZLeanStore maps DB zones directly to ZNS zones, uses zone-append commands, and resets zones after GC copyback. ZNS also returns overprovisioning space to the host (commodity SSDs hide 7–28% for internal GC), which gives DB GC more headroom and further lowers DB WAF.

FDP SSDs expose Reclaim Unit Handles (RUHs). Assign each DB zone a placement ID modulo the RUH count, and writes to different zones land in independent reclaim streams with no multiplexing. When FDP is available, NoWA becomes unnecessary: placement hints achieve SSD WAF = 1 with slightly lower DB WAF (0.54 vs 0.57 on FDP SSD A) because compensation writes are avoided.
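The placement rule itself is tiny. In this sketch the function name and the handle count of four are illustrative; a real deployment would query the handle count from the device.

```python
def placement_id(zone_id, ruh_count):
    """Map a DB zone to a Reclaim Unit Handle by simple modulo."""
    if ruh_count <= 0:
        raise ValueError("device must expose at least one reclaim unit handle")
    return zone_id % ruh_count

# Eight DB zones on a device that is assumed to expose four handles.
assignments = [placement_id(zone, 4) for zone in range(8)]
print(assignments)
```

Zones that share a handle still share a reclaim stream, so the isolation is only complete while the number of concurrently written zones does not exceed the handle count.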

The I/O layer detects device type at startup and selects among standard io_uring, ZNS zone-append, or FDP placement backends.

Logging and recovery

Out-of-place writes require logging PID→offset mapping changes, not just page contents. ZLeanStore uses per-thread WAL with continuous checkpointing. Page data is persisted before its mapping update commits. Checkpoints snapshot the mapping table and active-group history. Recovery reloads the checkpoint and replays WAL to reconstruct storage layout.
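The recovery step for the mapping table can be sketched as follows. The record layout `(lsn, pid, new_offset)`, the function name, and the assumption that LSNs are comparable across the per-thread logs are all illustrative choices, not details confirmed by the paper.

```python
def recover(checkpoint_map, checkpoint_lsn, per_thread_wals):
    """Rebuild the PID -> offset table from a checkpoint plus WAL replay."""
    mapping = dict(checkpoint_map)
    # Merge the per-thread logs into one LSN-ordered stream.
    records = sorted(record for wal in per_thread_wals for record in wal)
    for lsn, pid, new_offset in records:
        if lsn > checkpoint_lsn:      # older records are already in the checkpoint
            mapping[pid] = new_offset
    return mapping

checkpoint = {1: 0, 2: 4096}
wals = [
    [(11, 1, 8192), (14, 3, 16384)],   # thread 0
    [(9, 2, 4096), (12, 1, 12288)],    # thread 1
]
recovered = recover(checkpoint, checkpoint_lsn=10, per_thread_wals=wals)
print(recovered)
```

Because page data is persisted before its mapping record commits, every offset that replay installs points at a complete page.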

GC shares the buffer pool with worker threads. A reserve of clean frames (integrated with fuzzy checkpointing) prevents eviction from fighting GC reads for buffer space.

How It All Fits Together

The full write path for a batch of dirty pages flushed on eviction:

Buffer pool                Space manager              SSD
───────────                ─────────────              ───
Compute EDT ──► Compress ──► Pick zone by EDT ──► Append packed pages
from WH        + pack         (or trigger GC)        (io_uring / ZNS / FDP)
     │              │                │
     └──────────────┴── Update PID→offset map ──► WAL

End-to-end amplification breaks down like this on an 800 GB YCSB-A workload (Samsung PM9A3, zipf = 0.8):

Configuration	Throughput (K ops/s)	DB WAF	SSD WAF	Total WAF	Flash bytes/op
In-place	229	2.00	2.36	4.72	4,378
Out-of-place (naïve)	230	4.06	1.94	7.89	7,274
+ compression + packing	380	0.62	1.95	1.21	566
+ GDT	458	0.59	1.96	1.16	1,110
+ NoWA + aligned GC unit	535	0.60	1.00	0.60	567

The naïve out-of-place row is the cautionary tale: removing doublewrite without controlling DB GC and SSD placement makes things worse. The final row, with a total WAF of 0.60, writes fewer physical bytes to flash than the uncompressed page bytes the engine set out to persist.

On TPC-C with 15,000 warehouses on an FDP SSD, the optimized out-of-place configuration completes 2.45× more new-order transactions in the same runtime, while writing 7.2× fewer flash bytes per transaction.

Across five additional enterprise SSD models (894 GB to 7.2 TB), total WAF improvements range from 6.2× to 9.8× over in-place writes.

Lessons Learned

In-place is the wrong default for flash-heavy OLTP. Doublewrite buffering alone costs ~2× DB WAF. That made sense when disks were cheap to rewrite and had no endurance limit. On modern NVMe it is an expensive legacy tax.

Out-of-place without a plan is worse than in-place. DB GC copyback can dominate. The paper's incremental evaluation (Figure 13 in the original) shows total WAF rising before falling as optimizations stack. Do not assume append-only writes are automatically SSD-friendly.

Compression and GC interact. Compression shrinks the dataset, which increases effective overprovisioning, which lowers GC valid ratios, which reduces DB WAF further. It also widens the timestamp window for deathtime estimation. On compressible OLTP data, page-wise LZ4/ZSTD is not optional in this design.

SSD WAF = 1 on commodity drives is achievable. That was surprising to me. Prior work (including the authors' own SSD-iq paper) showed enterprise SSDs with WAF of 4 under simple hot/cold patterns. NoWA plus aligned GC units gets to 1.0 without ZNS hardware, though FDP makes it cleaner.

The DBMS should own the write pattern. Filesystems and SSD firmware can shuffle bytes, but they cannot know that these eight pages will die together at the next checkpoint. That knowledge lives in the engine.

Costs the paper is honest about: out-of-place metadata can consume up to ~11 GB at full device utilization, CPU usage rises from 5% to 8.3% (mostly buffer frame management for GC), and buffer hit ratio dips slightly when switching from in-place (93.2% to 91.8%) before recovering with GDT.

What's Next

The authors identify open directions: extending the techniques to LSM-tree engines, multi-device deployments, and host-managed SMR (HM-SMR) disks. LSM-trees already write out-of-place but reclaim space on compaction schedules that are harder to control than zone-level GC. Bridging GDT and NoWA concepts to leveled compaction is non-trivial.

For practitioners today, the actionable takeaway is diagnostic: measure total WAF, not just logical write rate. If you have OCP telemetry on your drives, compare host writes to physical writes. If the ratio is above 1.5 under your actual workload, the problem may be write placement, not query plans.
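The diagnostic boils down to two counter deltas. In this sketch the function name, the parameter names, and the sample counter values are placeholders; how you read the host-written and physically-written byte counters depends on your drive and tooling.

```python
def ssd_waf(host_before, host_after, physical_before, physical_after):
    """SSD WAF over a measurement window, from two pairs of byte counters."""
    host_delta = host_after - host_before
    if host_delta <= 0:
        raise ValueError("no host writes in the measurement window")
    return (physical_after - physical_before) / host_delta

# Placeholder counter values sampled before and after a workload run.
waf = ssd_waf(host_before=100e9, host_after=200e9,
              physical_before=150e9, physical_after=386e9)
print(f"SSD WAF {waf:.2f}", "-> look at write placement" if waf > 1.5 else "-> fine")
```

Sample over a window that is long enough for the drive to reach steady state; a freshly trimmed device will report a flattering ratio until its internal GC starts working.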

ZLeanStore source is available at github.com/LeeBohyun/ZLeanStore.

References

How to Write to SSDs (Lee, Ziegler, Leis; PVLDB 2026), the paper this post covers
Extended version on arXiv
ZLeanStore artifact
LeanStore: A High-Performance Storage Engine for NVMe SSDs (Leis; PVLDB 2024)
SSD-iq: Uncovering the Hidden Side of SSD Performance (Haas, Lee, Bonnet, Leis; PVLDB 2025)
ZNS: Avoiding the Block Interface Tax for Flash-based SSDs (Bjørling et al.; USENIX ATC 2021)
Principles of Database and Solid-State Drive Co-Design (Lerner & Bonnet; Springer 2024)

How to Write to SSDs - Co-Designing DBMS and Flash Storage