Cross-Failure Bug Detection in Persistent Memory Programs

Session: Persistence and correctness--Or... persistent correctness?

Authors: Sihang Liu (University of Virginia); Korakit Seemakhupt (University of Virginia); Yizhou Wei (University of Virginia); Thomas Wenisch (University of Michigan); Aasheesh Kolli (Pennsylvania State University & VMware Research); Samira Khan (University of Virginia)

Persistent memory (PM) technologies, such as Intel's Optane memory, deliver high performance, byte-addressability, and persistence, allowing programs to directly manipulate persistent data in memory without any OS overhead. An important requirement of these programs is that persistent data must remain consistent across a failure, which we refer to as the crash consistency guarantee. However, maintaining crash consistency is not trivial. We observe that a consistent recovery critically depends not only on the execution before the failure, but also on the recovery and resumption after the failure, which we refer to as the pre- and post-failure execution stages. In order to holistically detect crash consistency bugs, we categorize the underlying causes of inconsistent recovery that arise from incorrect interactions between the pre- and post-failure execution. First, a program is not crash-consistent if the post-failure stage reads from locations that are not guaranteed to have persisted in all possible access interleavings during the pre-failure stage. This type of programming error leads to a race that we refer to as a cross-failure race. Second, a program is not crash-consistent if the post-failure stage reads persistent data that has been left semantically inconsistent during the pre-failure stage, such as a stale log or uncommitted data. We refer to this type of bug as a cross-failure semantic bug. Together, they form the cross-failure bugs in PM applications. In this work, we propose XFDetector, a tool that detects cross-failure bugs by automatically injecting failures in the pre-failure execution, and checking for cross-failure races and semantic bugs in the post-failure continuation. XFDetector has detected four new bugs in three pieces of PM software: one of the PMDK examples, a PM-optimized Redis database, and a PMDK library function.
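
To make the notion of a cross-failure race concrete, the following is a minimal Python sketch of the general idea of injecting a failure after every pre-failure operation and then checking what the post-failure stage reads. It is a toy model under stated assumptions, not XFDetector's implementation: the names `ToyPM`, `detect_races`, `recovery`, and the `benign` parameter are hypothetical, the model treats every write as unpersisted until an explicit `persist` step (standing in for a flush plus fence), and it covers only races, not semantic bugs. The sketch also assumes that the `valid` flag is a location recovery is designed to read in either state, so reads of it are not reported.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPM:
    """Toy persistent memory: a write is volatile until explicitly persisted."""
    persisted: dict = field(default_factory=dict)  # state that survives a failure
    pending: dict = field(default_factory=dict)    # written, not guaranteed persisted

    def write(self, addr, value):
        self.pending[addr] = value

    def persist(self, addr):
        # Stands in for a cache-line flush followed by a fence (assumption).
        if addr in self.pending:
            self.persisted[addr] = self.pending.pop(addr)


def run_until(ops, k):
    """Execute the first k pre-failure operations, then stop (the 'failure')."""
    pm = ToyPM()
    for op, addr, *val in ops[:k]:
        if op == "write":
            pm.write(addr, val[0])
        elif op == "persist":
            pm.persist(addr)
    return pm


def detect_races(ops, recovery_reads, benign=frozenset()):
    """Inject a failure after every prefix of ops and report each
    (failure point, address) where recovery reads unpersisted data."""
    races = set()
    for k in range(len(ops) + 1):
        pm = run_until(ops, k)
        for addr in recovery_reads(pm.persisted):
            if addr in pm.pending and addr not in benign:
                races.add((k, addr))
    return sorted(races)


def recovery(persisted):
    """Hypothetical post-failure stage: trust 'data' only if 'valid' is set."""
    reads = ["valid"]
    if persisted.get("valid") == 1:
        reads.append("data")
    return reads


# Buggy ordering: 'valid' is persisted before 'data'.
buggy = [("write", "data", 42), ("write", "valid", 1),
         ("persist", "valid"), ("persist", "data")]
# Fixed ordering: 'data' is persisted before 'valid' is even written.
fixed = [("write", "data", 42), ("persist", "data"),
         ("write", "valid", 1), ("persist", "valid")]

print(detect_races(buggy, recovery, benign={"valid"}))  # [(3, 'data')]
print(detect_races(fixed, recovery, benign={"valid"}))  # []
```

In the buggy ordering, a failure injected after the third operation leaves `valid` persisted while `data` is still pending, so the post-failure stage reads a location that was not guaranteed to have persisted. Reordering the persists so that `data` reaches persistence before `valid` is set removes the report in this toy model.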