Gaurav Sarma

Once upon a time, I was working on an in-memory datastore and creating the memory pool for the datastore. As usual, there were multiple design ideas that people put forth to solve it.

Somebody suggested using mmap to keep data in memory, backed by a file on disk. Having worked on a networking system that used mmap for its ring buffers, I was enthusiastic about the idea. However, folks pointed out that mmap shouldn't be used for databases. I remembered that databases like MongoDB and LevelDB had moved away from mmap-based storage engines to managing memory themselves.

This blog will cover how most databases store their data through the operating system, how they use memory, and how mmap interacts with database operations.

Data layers for a write operation in the database

Every system needs a storage device like an SSD, HDD, NAS, etc. to actually store data. Given the plethora of operating systems and storage devices, database systems don't interact with the storage device directly. Instead, the database system uses the Virtual File System (VFS) provided by the operating system to talk to the actual storage layer. Every layer has its own optimisations to give the user better performance. For example, when you write to a file in normal mode, the data is not written to the storage device synchronously. Instead, the operating system keeps the data in buffers which are flushed to the storage device periodically. This means that if you issue a write system call and the system crashes before the flush, the data can be lost.

That's one of the main reasons why developers call fsync after every major write operation when they want the data to be durable. A great note I received on Twitter was that one should also call fsync when reading the data, as the contents of the operating system's buffer may differ from what is actually in the file on the storage device.
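The write-then-fsync pattern can be sketched in a few lines of Python. This is a minimal illustration, not how any particular database does it; the helper name `durable_append` and the file name `wal.log` are made up for the example.

```python
import os
import tempfile


def durable_append(path, payload):
    """Append payload to path and return only after fsync has completed.

    durable_append is a hypothetical helper written for this post.
    """
    # O_APPEND makes every write land at the current end of the file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        view = memoryview(payload)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Until fsync returns, the bytes may exist only in OS buffers.
        os.fsync(fd)
    finally:
        os.close(fd)
    return len(payload)


if __name__ == "__main__":
    log_path = os.path.join(tempfile.mkdtemp(), "wal.log")  # hypothetical file
    durable_append(log_path, b"record-1\n")
```

Calling fsync on every write is slow, which is why databases usually batch several records before one fsync. A newly created file may also need an fsync on its parent directory for the directory entry itself to survive a crash.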

The operating system reads and writes data in fixed-size chunks called pages (or blocks). The common size on most operating systems is 4 KB.
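If you want to check the sizes on your own machine, here is a small sketch for Unix-like systems. The function name `page_and_block_size` is made up; `mmap.PAGESIZE` and `os.statvfs` are standard library calls.

```python
import mmap
import os


def page_and_block_size(path="."):
    """Return (memory page size, filesystem block size) in bytes.

    page_and_block_size is a hypothetical helper for this post.
    """
    memory_page = mmap.PAGESIZE            # size of a virtual memory page
    fs_block = os.statvfs(path).f_bsize    # block size of the filesystem holding path
    return memory_page, fs_block


if __name__ == "__main__":
    print(page_and_block_size())
```

The two numbers are often both 4096, but they are separate settings and can differ.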

The database layer also has its own blocks of storage called pages, and they are different from the operating system's pages. Data belonging to different tables is usually stored in different pages. Most databases lay out each page as a slotted page, which supports variable-length tuples and different compaction strategies.
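To make the slotted page idea concrete, here is a toy version in Python. It is a sketch under my own assumptions (a 4 KB page, a 4-byte header, 4-byte slots, no deletes or compaction) and not the layout of any real database. The slot array grows from the front of the page while tuple data grows from the back.

```python
import struct

PAGE_SIZE = 4096                 # assumed page size for this sketch
HEADER = struct.Struct("<HH")    # slot count, offset where tuple data starts
SLOT = struct.Struct("<HH")      # tuple offset, tuple length


class SlottedPage:
    """Toy slotted page: slots grow forward, tuple bytes grow backward."""

    def __init__(self):
        self.buf = bytearray(PAGE_SIZE)
        HEADER.pack_into(self.buf, 0, 0, PAGE_SIZE)

    def insert(self, tuple_bytes):
        """Store a tuple and return its slot id, or None if the page is full."""
        count, data_start = HEADER.unpack_from(self.buf, 0)
        slot_end = HEADER.size + (count + 1) * SLOT.size
        new_start = data_start - len(tuple_bytes)
        if new_start < slot_end:
            return None  # slot array and tuple data would collide
        self.buf[new_start:data_start] = tuple_bytes
        SLOT.pack_into(self.buf, HEADER.size + count * SLOT.size,
                       new_start, len(tuple_bytes))
        HEADER.pack_into(self.buf, 0, count + 1, new_start)
        return count

    def read(self, slot_id):
        """Return the tuple bytes stored under slot_id."""
        count, _ = HEADER.unpack_from(self.buf, 0)
        if not 0 <= slot_id < count:
            raise IndexError(slot_id)
        offset, length = SLOT.unpack_from(
            self.buf, HEADER.size + slot_id * SLOT.size)
        return bytes(self.buf[offset:offset + length])
```

Because callers only hold a slot id, the page is free to move tuple bytes around during compaction and just update the slot entry.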

Introducing `mmap` to the mix

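mmap maps the contents of a file on secondary storage into the virtual address space of a process, so the program can read and write the file as if it were ordinary memory. The operating system loads pages into its page cache when they are first touched. As long as the pages fit in the page cache, later accesses are served from main memory, which leads to faster access and better performance.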
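Python's standard `mmap` module is enough to show what this looks like from the program's side. A minimal sketch, where the helper name `read_via_mmap` and the file name are made up:

```python
import mmap
import os
import tempfile


def read_via_mmap(path, offset, length):
    """Read length bytes at offset through a read-only memory mapping.

    read_via_mmap is a hypothetical helper for this post.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; an empty file cannot be mapped.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing touches the mapped pages; the OS faults them in on demand.
            return bytes(mm[offset:offset + length])


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "data.bin")  # hypothetical file
    with open(path, "wb") as out:
        out.write(b"\0" * mmap.PAGESIZE + b"hello")
    print(read_via_mmap(path, mmap.PAGESIZE, 5))
```

There is no explicit read call anywhere: the slice is a plain memory access, and any disk I/O happens inside the page fault.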

The operating system loads data into main memory and evicts it based on usage. This means that mmapped data can be evicted transparently whenever the memory is needed for something else.

Database systems try to optimise query performance with techniques like prefetching sequential data, scan sharing, etc. To get something similar with mmap, the madvise system call can hint at the expected access pattern, such as sequential or random access, using the MADV_SEQUENTIAL or MADV_RANDOM flags. However, these hints are not sufficient for common SQL queries that operate on ranges and orderings, since the operating system has no context about where the data is stored within the database's pages.
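Here is roughly what such a hint looks like, using `mmap.madvise` from the Python standard library (Python 3.8+, on platforms that expose the constant). The function name `sequential_checksum` is made up, and the "checksum" is just a byte sum to force a full scan.

```python
import mmap
import os
import tempfile


def sequential_checksum(path):
    """Scan a file front to back through mmap after hinting sequential access.

    sequential_checksum is a hypothetical helper for this post.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            advice = getattr(mmap, "MADV_SEQUENTIAL", None)
            if advice is not None and hasattr(mm, "madvise"):
                # Only a hint: the kernel may read ahead more aggressively.
                mm.madvise(advice)
            total = 0
            for start in range(0, len(mm), mmap.PAGESIZE):
                total += sum(mm[start:start + mmap.PAGESIZE])
            return total


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "scan.bin")  # hypothetical file
    with open(path, "wb") as out:
        out.write(bytes([1]) * 10000)
    print(sequential_checksum(path))
```

The hint covers the whole mapping as one access pattern. It has no way to say "the next pages this query needs are 7, 42 and 913", which is exactly the knowledge a database has and the kernel lacks.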

Another disadvantage is that mmap is not designed for asynchronous reads. When a thread touches a page that is not in memory, it blocks on the page fault until the data has been fetched. A traditional buffer pool can easily avoid this by issuing asynchronous reads with io_uring or libaio.
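The Python standard library has no io_uring or libaio bindings, so the sketch below uses a thread pool with `os.pread` as a stand-in. It is not the same mechanism, but it shows the property that matters: several page reads are in flight at once and the caller is free to do other work until it needs the results. The name `prefetch_pages` and the 4 KB page size are assumptions of the example.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 4096  # assumed database page size for this sketch


def prefetch_pages(path, page_ids, workers=4):
    """Read several pages concurrently and return {page_id: bytes}.

    prefetch_pages is a hypothetical helper; the thread pool stands in
    for a real asynchronous I/O interface such as io_uring or libaio.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Submit all reads up front; none of them blocks this thread.
            futures = {
                pid: pool.submit(os.pread, fd, PAGE_SIZE, pid * PAGE_SIZE)
                for pid in page_ids
            }
            # The caller could do other work here before collecting results.
            return {pid: fut.result() for pid, fut in futures.items()}
    finally:
        os.close(fd)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "pages.bin")  # hypothetical file
    with open(path, "wb") as out:
        out.write(b"a" * PAGE_SIZE + b"b" * PAGE_SIZE + b"c" * PAGE_SIZE)
    pages = prefetch_pages(path, [2, 0])
    print(sorted(pages))
```

With mmap the equivalent code would simply touch the memory and stall on each page fault in turn.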

In certain cases, the database may want to keep specific pages in memory for a certain duration. To do this with mmap, there is a system call, mlock, which locks pages in memory. However, the amount of memory a process is allowed to lock is limited, so when the operating system is under memory pressure the database cannot pin everything it needs. Pages that are not locked can still be evicted, which leads to page faults and slower IO.
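Python does not wrap mlock, so this sketch calls the libc function through `ctypes`. It assumes a Unix-like system where `ctypes.CDLL(None)` exposes `mlock` and `munlock`; the helper name `try_mlock_first_page` is made up. The call can fail, for example when the process's lock limit (RLIMIT_MEMLOCK) is exceeded, so the function reports the outcome instead of assuming success.

```python
import ctypes
import mmap
import os
import tempfile


def try_mlock_first_page(path):
    """Map a file, try to mlock its first page, and return True on success.

    try_mlock_first_page is a hypothetical helper for this post.
    """
    libc = ctypes.CDLL(None, use_errno=True)  # assumes a Unix-like libc
    libc.mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.mlock.restype = ctypes.c_int
    libc.munlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.munlock.restype = ctypes.c_int

    with open(path, "r+b") as f:
        mm = mmap.mmap(f.fileno(), 0)  # writable, shared mapping
        try:
            length = min(len(mm), mmap.PAGESIZE)
            # Borrow the mapping's buffer to learn its address.
            buf = (ctypes.c_char * length).from_buffer(mm)
            addr = ctypes.addressof(buf)
            locked = libc.mlock(addr, length) == 0
            if locked:
                libc.munlock(addr, length)
            # Release the borrowed buffer so the mapping can be closed.
            del buf
            return locked
        finally:
            mm.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "pin.bin")  # hypothetical file
    with open(path, "wb") as out:
        out.write(b"\0" * mmap.PAGESIZE)
    print("locked:", try_mlock_first_page(path))
```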

The database system usually uses the buffer pool to hold data temporarily until it decides to commit the data to disk. With mmap, the database has minimal control over this process, since the operating system can decide to flush a dirty page to secondary storage at any time, even if the transaction that modified it has not committed. With concurrent transactions, a database system using mmap therefore has no point at which it can detect data conflicts and decide whether to commit or roll back before the data reaches disk.
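For contrast, here is a toy buffer pool where the database, not the operating system, decides when dirty pages reach the file. It is a sketch under my own assumptions: 4 KB pages, LRU eviction of clean pages only (a "no-steal" policy), and a single `flush` standing in for commit. The class and method names are made up.

```python
import os
import tempfile
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed database page size for this sketch


class BufferPool:
    """Toy buffer pool: dirty pages are written only when flush() is called."""

    def __init__(self, path, capacity=2):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        self.capacity = capacity
        self.frames = OrderedDict()  # page_id -> bytearray, in LRU order
        self.dirty = set()

    def get(self, page_id):
        if page_id in self.frames:
            self.frames.move_to_end(page_id)
            return self.frames[page_id]
        if len(self.frames) >= self.capacity:
            self._evict()
        data = os.pread(self.fd, PAGE_SIZE, page_id * PAGE_SIZE)
        # Pages past the end of the file start out zero-filled.
        self.frames[page_id] = bytearray(data.ljust(PAGE_SIZE, b"\0"))
        return self.frames[page_id]

    def write(self, page_id, offset, payload):
        if offset < 0 or offset + len(payload) > PAGE_SIZE:
            raise ValueError("write does not fit in the page")
        frame = self.get(page_id)
        frame[offset:offset + len(payload)] = payload
        self.dirty.add(page_id)  # changed in memory only, not on disk

    def _evict(self):
        # Evict the least recently used clean page; dirty pages stay put.
        victim = next((p for p in self.frames if p not in self.dirty), None)
        if victim is None:
            raise RuntimeError("all frames are dirty; flush first")
        del self.frames[victim]

    def flush(self):
        """Write every dirty page and fsync; stands in for commit."""
        for page_id in sorted(self.dirty):
            os.pwrite(self.fd, self.frames[page_id], page_id * PAGE_SIZE)
        os.fsync(self.fd)
        self.dirty.clear()

    def close(self):
        os.close(self.fd)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "db.bin")  # hypothetical file
    pool = BufferPool(path)
    pool.write(0, 0, b"hello")
    pool.flush()
    pool.close()
```

A rollback here is just dropping the dirty frames without calling `flush`. With mmap there is no such moment, because the kernel may already have written the page.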

The paper compares the performance of random and sequential reads for mmap and for buffer pools implemented by the database. In both cases, the performance is comparable at the beginning of the load test. However, once the operating system's page cache fills up and pages start getting evicted, mmap performs considerably worse than a traditional buffer pool, especially when multiple storage devices are involved.

Conclusion

Popular databases like MongoDB and LevelDB have moved away from mmap-based memory management because it is hard to predict how the operating system will behave in different scenarios. There are still some smaller use cases within database systems where mmap works well, such as small data transfers that are neither performance-critical nor part of a transaction. But if you are implementing a serious traditional database, it's better to stay away from mmap.

References

Hope you liked reading the article.

Please reach out to me here for more ideas or improvements.
