Once upon a time, I was working on an in-memory datastore and creating the memory pool for the datastore. As usual, there were multiple design ideas that people put forth to solve it.
Somebody mentioned about using mmap
to keep data in memory backed by a file on the disk. Having worked on a
networking system using mmap
for its ring buffers, I was enthusiastic about using it.
However, folks mentioned that mmap
shouldn't be used for databases. I remembered that DBs like MongoDB
and
LevelDB
had moved away from using mmap
based storage engines to managing the memory themselves.
This blog will cover how most databases store their data on the operating system and their memory usage and
how mmap
would interact with the database operations.
Data layers for a write operation in the database
Every system needs a storage device like a SSD
, HDD
, NAS
, etc to actually store data.
Given the plethora of operating systems and different storage devices, database systems don't interact with
the storage device directly.
Instead, the database system leverages the Virtual File System (VFS)
provided by the operating system
to interact with the actual storage layer.
Every layer has their own optimisations to provide better performance to the user.
For example, when you write to a file in normal mode, the data is not written to the storage device synchronously.
Instead, the operating system has data buffers which is flushed to the storage device periodically.
This means that if you emit a write
system call and there is a system crash, there are chances of data failure.
That's one of the main reasons why developers call fsync
after every major write operation when they want to
ensure the data resiliency of the system. A great note I received on Twitter was that one should also call fsync
when they read the data as there may be a discrepancy in the data buffer and the actual storage file.
The operating system writes data to the storage device in chunks or blocks called pages
. The common block size for
most operating systems is 4kB.
The database layer also has its own blocks of storage called pages
and they are different from the operating system's pages.
The data belonging to different tables are usually stored in different pages
. Most pages follow the data layout of
slotted pages
to support variable length tuples with different compaction strategies.
Introducing mmap
to the mix
mmap
is used to map contents from the secondary storage directly to main memory, thus leading to faster access and better
performance since the contents are accessed from the main memory if there is sufficient space in the operating system
page cache.
The operating system removes or adds the data from the main memory based on the usage. This means that mmaped data can be removed transparently in case the memory is needed for other use cases.
Database systems try to optimise query performance by various methods like data prefetching of sequential data,
scan sharing, etc. To do the same thing in mmap
, the madvise
system call can be used for different kinds of data
access like sequential or random access using MADV_SEQUENTIAL
or MADV_RANDOM
flags. However, it is not sufficient for
common SQL queries operating on ranges and orders since it doesn't have context of where the data is stored in the pages.
Another disadvantage is that mmap
is not designed to have asynchronous reads which means that the data access blocks
the fetching which can be easily circumvented in a traditional buffer pool using io_uring
or libaio
.
In certain cases, the database may want to keep certain pages in memory for a certain duration. To do this with mmap
,
there is a system call mlock
which attempts to keep the page locked in memory. However, in cases of operation system
memory overload, the pages can still be removed which leads to page faults, thus leading to slower IO.
The database system usually uses the buffer pool to store data temporarily till it decides to commit the data to
disk. Using mmap
, the control over this process is minimal since the operating system can decide to flush the dirty
page to the secondary system anytime. In cases of concurrent transactions, the database system using mmap
would be
unable to detect data conflicts to either commit or rollback the data.
The paper does a performance comparison between random
and sequential reads for mmap
and buffer pools implemented by the database. In both cases, it was seen that the
performance is comparable in the beginning of the load test. However, when the operating system cache becomes full
and the pages are removed from the cache, the performance of mmap
becomes quite bad compared to a traditional
buffer pool, especially when multiple storage devices are involved.
Conclusion
Most popular databases like MongoDB and LevelDB have moved away from mmap
based memory management as it remains
unpredictable on how the operating system is going to behave in different scenarios.
There are certain smaller use cases within database systems to use mmap
for small data transfers which are not
performance and transaction dependent.
But if you are implementing a serious traditional database, it's better to stay away from mmap
.
References
Hope you liked reading the article.
Please reach out to me here for more ideas or improvements.