All the times listed below are in Pacific Standard Time (PST).
The full Proceedings published by USENIX for the conference are available for download below. Individual papers can also be downloaded from their respective presentation pages. Copyright to the individual works is retained by the author(s).
Proceedings Front Matter
Proceedings Cover | Title Page and List of Organizers | Message from the Program Co-Chairs | Table of Contents
Full Proceedings PDFs
FAST '21 Full Proceedings (PDF)
FAST '21 Proceedings Interior (PDF, best for mobile devices)
7:45 am–8:55 am
Indexing and Key-Value Store
Session Chair: Song Jiang, The University of Texas at Arlington
ROART: Range-query Optimized Persistent ART
Shaonan Ma and Kang Chen, Tsinghua University; Shimin Chen, SKL of Computer Architecture, ICT, CAS, and University of Chinese Academy of Sciences; Mengxing Liu, Jianglang Zhu, Hongbo Kang, and Yongwei Wu, Tsinghua University
With the availability of commercial NVM devices such as Intel Optane DC PMM, it is time to start thinking about applying the existing persistent data structures in practice. This paper considers three practical aspects, which have significant influences on the design of persistent indexes, including functionality, performance and correctness.
We design a new persistent index, ROART, based on adaptive radix tree (ART), taking all these practical aspects into account. ROART (i) proposes a leaf compaction method to reduce pointer chasing for range queries, (ii) minimizes persistence overhead with three optimizations, i.e., entry compression, selective metadata persistence and minimally ordered split, and (iii) designs a fast memory management to prevent memory leaks, and eliminates the long recovery time by proposing an instant restart strategy. Evaluations show that ROART outperforms the state-of-the-art radix tree by up to 1.65x and B+-Trees by 1.17∼8.27x respectively.
SpanDB: A Fast, Cost-Effective LSM-tree Based KV Store on Hybrid Storage
Hao Chen, University of Science and Technology of China & Qatar Computing Research Institute, HBKU; Chaoyi Ruan and Cheng Li, University of Science and Technology of China; Xiaosong Ma, Qatar Computing Research Institute, HBKU; Yinlong Xu, University of Science and Technology of China & Anhui Province Key Laboratory of High Performance Computing
Key-Value (KV) stores support many crucial applications and services. They perform fast in-memory processing, but are still often limited by I/O performance. The recent emergence of high-speed commodity NVMe SSDs has propelled new KV system designs that take advantage of their ultra-low latency and high bandwidth. Meanwhile, to switch to entirely new data layouts and scale up entire databases to high-end SSDs requires considerable investment. As a compromise, we propose SpanDB, an LSM-tree-based KV store that adapts the popular RocksDB system to utilize selective deployment of high-speed SSDs. SpanDB allows users to host the bulk of their data on cheaper and larger SSDs, while relocating write-ahead logs (WAL) and the top levels of the LSM-tree to a much smaller and faster NVMe SSD. To better utilize this fast disk, SpanDB provides high-speed, parallel WAL writes via SPDK, and enables asynchronous request processing to mitigate inter-thread synchronization over-head and work efficiently with polling-based I/O. Our evaluation shows that SpanDB simultaneously improves RocksDB’s throughput by up to 8.8x and reduces its latency by 9.5- 58.3%. Compared with KVell, a system designed for high-end SSDs, SpanDB achieves 96-140% of its throughput, with a 2.3-21.6x lower latency, at a cheaper storage configuration.
Evolution of Development Priorities in Key-value Stores Serving Large-scale Applications: The RocksDB Experience
Siying Dong, Andrew Kryczka, and Yanqin Jin, Facebook Inc.; Michael Stumm, University of Toronto
RocksDB is a key-value store targeting large-scale distributed systems and optimized for Solid State Drives (SSDs). This paper describes how our priorities in developing RocksDB have evolved over the last eight years. The evolution is the result both of hardware trends and of extensive experience running RocksDB at scale in production at a number of organizations. We describe how and why RocksDB's resource optimization target migrated from write amplification, to space amplification, to CPU utilization. Lessons from running large-scale applications taught us that resource allocation needs to be managed across different RocksDB instances, that data format needs to remain backward and forward compatible to allow incremental software rollout, and that appropriate support for database replication and backups are needed. Lessons from failure handling taught us that data corruption errors needed to be detected earlier and at every layer of the system.
REMIX: Efficient Range Query for LSM-trees
Wenshao Zhong, Chen Chen, and Xingbo Wu, University of Illinois at Chicago; Song Jiang, University of Texas at Arlington
LSM-tree based key-value (KV) stores organize data in a multi-level structure for high-speed writes. Range queries on traditional LSM-trees must seek and sort-merge data from multiple table files on the fly, which is expensive and often leads to mediocre read performance. To improve range query efficiency on LSM-trees, we introduce a space-efficient KV index data structure, named REMIX, that records a globally sorted view of KV data spanning multiple table files. A range query on multiple REMIX-indexed data files can quickly locate the target key using a binary search, and retrieve subsequent keys in sorted order without key comparisons. We build RemixDB, an LSM-tree based KV-store that adopts a write-efficient compaction strategy and employs REMIXes for fast point and range queries. Experimental results show that REMIXes can substantially improve range query performance in a write-optimized LSM-tree based KV-store.
8:55 am–9:05 am
9:05 am–10:30 am
Advanced File Systems
Session Chair: Keith A. Smith, MongoDB
High Velocity Kernel File Systems with Bento
Samantha Miller, Kaiyuan Zhang, Mengqi Chen, and Ryan Jennings, University of Washington; Ang Chen, Rice University; Danyang Zhuo, Duke University; Thomas Anderson, University of Washington
Awarded Best Paper!
High development velocity is critical for modern systems. This is especially true for Linux file systems which are seeing increased pressure from new storage devices and new demands on storage systems. However, high velocity Linux kernel development is challenging due to the ease of introducing bugs, the difficulty of testing and debugging, and the lack of support for redeployment without service disruption. Existing approaches to high-velocity development of file systems for Linux have major downsides, such as the high performance penalty for FUSE file systems, slowing the deployment cycle for new file system functionality.
We propose Bento, a framework for high velocity development of Linux kernel file systems. It enables file systems written in safe Rust to be installed in the Linux kernel, with errors largely sandboxed to the file system. Bento file systems can be replaced with no disruption to running applications, allowing daily or weekly upgrades in a cloud server setting. Bento also supports userspace debugging. We implement a simple file system using Bento and show that it performs similarly to VFS-native ext4 on a variety of benchmarks and outperforms a FUSE version by 7x on 'git clone'. We also show that we can dynamically add file provenance tracking to a running kernel file system with only 15ms of service interruption.
Scalable Persistent Memory File System with Kernel-Userspace Collaboration
Youmin Chen, Youyou Lu, and Bohong Zhu, Tsinghua University; Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau, University of Wisconsin–Madison; Jiwu Shu, Tsinghua University
We introduce Kuco, a novel direct-access file system architecture whose main goal is scalability. Kuco utilizes three key techniques – collaborative indexing, two-level locking, and versioned reads – to offload time-consuming tasks, such as pathname resolution and concurrency control, from the kernel to userspace, thus avoiding kernel processing bottlenecks. Upon Kuco, we present the design and implementation of KucoFS, and then experimentally show that KucoFS has excellent performance in a wide range of experiments; importantly, KucoFS scales better than existing file systems by up to an order of magnitude for metadata operations, and fully exploits device bandwidth for data operations.
Rethinking File Mapping for Persistent Memory
Ian Neal, Gefei Zuo, Eric Shiple, and Tanvir Ahmed Khan, University of Michigan; Youngjin Kwon, School of Computing, KAIST; Simon Peter, University of Texas at Austin; Baris Kasikci, University of Michigan
Persistent main memory (PM) dramatically improves IO performance. We find that this results in file systems on PM spending as much as 70% of the IO path performing file mapping (mapping file offsets to physical locations on storage media) on real workloads. However, even PM-optimized file systems perform file mapping based on decades-old assumptions. It is now critical to revisit file mapping for PM.
We explore the design space for PM file mapping by building and evaluating several file-mapping designs, including different data structure, caching, as well as meta-data and block allocation approaches, within the context of a PM-optimized file system. Based on our findings, we design HashFS, a hash-based file mapping approach. HashFS uses a single hash operation for all mapping and allocation operations, bypassing the file system cache, instead prefetching mappings via SIMD parallelism and caching translations explicitly. HashFS’s resulting low latency provides superior performance compared to alternatives. HashFS increases the throughput of YCSB on LevelDB by up to 45% over page-cached extent trees in the state-of-the-art Strata PM-optimized file system
pFSCK: Accelerating File System Checking and Repair for Modern Storage
David Domingo and Sudarsun Kannan, Rutgers University
We propose and design pFSCK, a parallel file system checking and recovery (C/R) tool designed to exploit compute and storage parallelism in modern storage devices. pFSCK enables fine-grained parallelism at the granularity of inodes and directory blocks without impacting the C/R’s correctness. pFSCK first employs data parallelism by identifying functional operations in each stage of the checking logic and then isolating dependent operations and shared data structures. However, full isolation of shared structures is infeasible and requires serialized updates. To reduce serialization bottlenecks, pFSCK introduces pipeline parallelism, allowing multiple stages of C/R to run concurrently without impacting correctness. Further, pFSCK provides per-thread I/O cache management, dynamic thread placement across C/R stages, and a resource-aware scheduler to reduce the impact of C/R on other applications sharing CPUs and the file system. Evaluation of pFSCK shows more than 2.6x gains over e2fsck (Ext file system C/R) and more than 1.8x over XFS’s C/R that provides coarse-grained parallelism.
Pattern-Guided File Compression with User-Experience Enhancement for Log-Structured File System on Mobile Devices
Cheng Ji, Nanjing University of Science and Technology; Li-Pin Chang, National Chiao Tung University, National Yang Ming Chiao Tung University; Riwei Pan and Chao Wu, City University of Hong Kong; Congming Gao, Tsinghua University; Liang Shi, East China Normal University; Tei-Wei Kuo and Chun Jason Xue, City University of Hong Kong
Mobile applications exhibit unique file access patterns, often involving random accesses of write-mostly files and read-only files. The high write stress of mobile applications significantly impacts on the lifespan of flash-based mobile storage. To reduce write stress and save space without sacrificing user-perceived latency, this study introduces FPC, file access pattern guided compression. FPC is optimized for the random-writes and fragmented-reads of mobile applications. It features dual-mode compression: Foreground compression handles write-mostly files for write stress reduction, while background compression packs random-reading file blocks for boosted read performance. FPC exploits the out-of-place updating design in F2FS, a log-structured file system for mobile devices, for the best effect of the proposed dual-mode compression. Experimental results showed that FPC reduced the volume of total write traffic and executable file size by 26.1% and 23.7% on average, respectively, and improved the application launching time by up to 14.8%.
10:30 am–10:45 am
10:45 am–11:45 am
Netflix: Streaming Entertainment to 200 Million Members Around the World
Jonathan Looney, Netflix
Netflix has built a content delivery network (CDN), called Open Connect, to stream video to its customers around the world. Open Connect Appliances (or, simply, OCAs) are the content caches that serve this data to Netflix subscribers. In some ways, the work which the OCAs perform is fairly normal for a web server. However, Netflix engineers have spent years refining the OCAs to perform this work very efficiently. Through a combination of hardware and software engineering, Netflix is able to deliver very high volumes of data from a relatively modest fleet of servers, achieving industry-leading efficiency in content delivery of streaming video.
In this presentation, Jonathan Looney will describe some of the optimizations which Netflix has deployed to allow the OCAs to deliver data very efficiently.
Jonathan Looney, Netflix
Jonathan Looney manages a development team at Netflix responsible for developing and maintaining the operating system which runs on the Open Connect Appliances. Prior to joining Netflix, Jonathan worked for Juniper Networks. He is also a FreeBSD committer active in the transport protocols area.
11:45 am–12:00 pm
2021 FAST Test of Time Award Presentation
A Five-Year Study of File-System Metadata
Nitin Agrawal, William J. Bolosky, John R. Douceur, Jacob R. Lorch
Published in the Proceedings of the 5th USENIX Conference on File and Storage Technologies, February 2007
12:00 pm–12:40 pm
Wednesday, February 24, 2021
7:30 am–8:55 am
Transactions, Deduplication, and More
Session Chair: Dalit Naor, The Academic College of Tel Aviv–Yaffo
ArchTM: Architecture-Aware, High Performance Transaction for Persistent Memory
Kai Wu and Jie Ren, University of California, Merced; Ivy Peng, Lawrence Livermore National Laboratory; Dong Li, University of California, Merced
Failure-atomic transactions are a critical mechanism for accessing and manipulating data on persistent memory (PM) with crash consistency. We identify that small random writes in metadata modifications and locality-oblivious memory allocation in traditional PM transaction systems mismatch PM architecture. We present ArchTM, a PM transaction system based on two design principles: avoiding small writes and encouraging sequential writes. ArchTM is a variant of copy-on-write (CoW) system to reduce write traffic to PM. Unlike conventional CoW schemes, ArchTM reduces metadata modifications through a scalable lookup table on DRAM. ArchTM introduces an annotation mechanism to ensure crash consistency and a locality-aware data path in memory allocation to increases coalesable writes inside PM devices. We evaluate ArchTM against four state-of-the-art transaction systems (one in PMDK, Romulus, DudeTM, and one from Oracle. ArchTM outperforms the competitor systems by 58x, 5x, 3x and 7x on average, using micro-benchmarks and real-world workloads on real PM.
SPHT: Scalable Persistent Hardware Transactions
Daniel Castro, INESC-ID & Instituto Superior Técnico; Alexandro Baldassin, UNESP - Universidade Estadual Paulista; João Barreto and Paolo Romano, INESC-ID & Instituto Superior Técnico
With the emergence of byte-addressable Persistent Memory(PM), a number of works have recently addressed the problem of how to implement persistent transactional memory using off-the-shelf hardware transactional memory systems. Using Intel Optane DC PM we show, for the first time in the literature, experimental results highlighting several scalability bottlenecks of state of the art approaches, which so far have been evaluated only via PM emulation. We tackle these limitations by proposing SPHT (ScalablePersistent Hardware Transactions), an innovative PersistentTransactional Memory that exploits a set of novel mechanisms aimed at enhancing scalability both during transaction processing and recovery. We show that SPHT enhances through-put by up to 2.6x on STAMP and achieve speedups of 2.8x in the log replay phase vs state of the art solutions.
The Dilemma between Deduplication and Locality: Can Both be Achieved?
Xiangyu Zou and Jingsong Yuan, Harbin Institute of Technology, Shenzhen; Philip Shilane, Dell Technologies; Wen Xia, Harbin Institute of Technology, Shenzhen, and Wuhan National Laboratory for Optoelectronics; Haijun Zhang and Xuan Wang, Harbin Institute of Technology, Shenzhen
Data deduplication is widely used to reduce the size of backup workloads, but it has the known disadvantage of causing poor data locality, also referred to as the fragmentation problem, which leads to poor restore and garbage collection (GC) performance. Current research has considered writing duplicates to maintain locality (e.g. rewriting) or caching data in memory or SSD, but fragmentation continues to hurt restore and GC performance.
Investigating the locality issue, we observed that most duplicate chunks in a backup are directly from its previous backup. We therefore propose a novel management-friendly deduplication framework, called MFDedup, that maintains the locality of backup workloads by using a data classification approach to generate an optimal data layout. Specifically, we use two key techniques: Neighbor-Duplicate-Focus indexing (NDF) and Across-Version-Aware Reorganization scheme (AVAR), to perform duplicate detection against a previous backup and then rearrange chunks with an offline and iterative algorithm into a compact, sequential layout that nearly eliminates random I/O during restoration.
Evaluation results with four backup datasets demonstrates that, compared with state-of-the-art techniques, MFDedup achieves deduplication ratios that are 1.12x to 2.19x higher and restore throughputs that are 2.63x to 11.64x faster due to the optimal data layout we achieve. While the rearranging stage introduces overheads, it is more than offset by a nearly-zero overhead GC process. Moreover, the NDF index only requires indexes for two backup versions, while the traditional index grows with the number of versions retained.
Remap-SSD: Safely and Efficiently Exploiting SSD Address Remapping to Eliminate Duplicate Writes
You Zhou, Qiulin Wu, and Fei Wu, Huazhong University of Science and Technology; Hong Jiang, University of Texas at Arlington; Jian Zhou and Changsheng Xie, Huazhong University of Science and Technology
Duplicate writes are prevalent in diverse storage systems, originating from data duplication, journaling, and data relocations, etc. As flash-based SSDs have been widely deployed, these writes can significantly degrade their performance and lifetime. To eliminate duplicate writes, prior studies have proposed innovative approaches that exploit the address remapping utility inside SSDs. However, remap operations lead to a mapping inconsistency problem, which may cause data loss and has not been properly addressed in existing studies. In this paper, we propose a novel SSD design, called Remap-SSD, with two notable features. First, it provides a remap primitive, which allows the host software and SSD firmware to perform logical writes of duplicate data at almost zero cost. Second, a hybrid storage architecture is employed to maintain the mapping consistency. Small byte-addressable non-volatile RAM (NVRAM) is used to persist remapping metadata in a log-structured manner and is managed synergistically with flash memory. We verify Remap-SSD on a software SSD emulator with three case studies: intra-SSD deduplication, SQLite journaling, and F2FS cleaning. Experimental results show that Remap-SSD can realize the full potential of address remapping to improve SSD performance and lifetime.
CheckFreq: Frequent, Fine-Grained DNN Checkpointing
Jayashree Mohan, UT Austin; Amar Phanishayee, Microsoft Research; Vijay Chidambaram, UT Austin and VMware research
Training Deep Neural Networks (DNNs) is a resource-hungry and time-consuming task. During training, the model performs computation at the GPU to learn weights, repeatedly, over several epochs. The learned weights reside in GPU memory, and are occasionally checkpointed (written to persistent storage) for fault-tolerance. Traditionally, model parameters are checkpointed at epoch boundaries; for modern deep networks, an epoch runs for several hours. An interruption to the training job due to preemption, node failure, or process failure, therefore results in the loss of several hours worth of GPU work on recovery.
We present CheckFreq, an automatic, fine-grained checkpointing framework that (1) algorithmically determines the checkpointing frequency at the granularity of iterations using systematic online profiling, (2) dynamically tunes checkpointing frequency at runtime to bound the checkpointing overhead using adaptive rate tuning, (3) maintains the training data invariant of using each item in the dataset exactly once per epoch by checkpointing data loader state using a light-weight resumable iterator, and (4) carefully pipelines checkpointing with computation to reduce the checkpoint cost by introducing two-phase checkpointing. Our experiments on a variety of models, storage backends, and GPU generations show that CheckFreq can reduce the recovery time from hours to seconds while bounding the runtime overhead within 3.5%.
8:55 am–9:05 am
9:05 am–9:40 am
Work-in-Progress Reports (WiPs)
FAST '21 Work-in-Progress Reports (WiPs)
Work-in-Progress Reports (WiPs) are short presentations about work in progress, new results, or timely topics.
An Application-Local Library File System for Persistent Memory
Keiichi Matsuzawa, Hitachi, Ltd.; Takahiro Shinagawa, The University of Tokyo
Automated I/O Parameter Tuning of Scientific Applications with Parametrizable Workload Replays
Azat Nurgaliev and Marcus Paradies, German Aerospace Center
Consistent WAL Performances on Shared Drives
Lalitha Donga, Ben Reed, and Kayla Walton, San Jose State University
Semantics-Aware Shadow Paging: Handling Transaction Conflict in EXT4 Journaling
Joontaek Oh, Hojin Nam, Kyoungho Koo, and Youjip Won, KAIST
RAID2.0++: Fast Reconstruction for Large Disk Enclosures Based on RAID2.0
Qiliang Li, Yinlong Xu, and Min Lyu, University of Science and Technology of China
Update-friendly Encoding from Replication to Erasure Coding in Clustered File Systems
Wei Wang, Min Lyu, and Yinlong Xu, University of Science and Technology of China
Building a Reusable Privacy Substrate for Data Analytics Using Smart Storage Nodes
Zsolt István, IT University, Copenhagen
9:40 am–9:50 am
9:50 am–11:15 am
Cloud and Distributed Systems
Session Chair: Daniel Ellard, Raytheon BBN Technologies
Facebook's Tectonic Filesystem: Efficiency from Exascale
Satadru Pan, Facebook, Inc.; Theano Stavrinos, Facebook, Inc. and Princeton University; Yunqiao Zhang, Atul Sikaria, Pavel Zakharov, Abhinav Sharma, Shiva Shankar P, Mike Shuey, Richard Wareing, Monika Gangapuram, Guanglei Cao, Christian Preseau, Pratap Singh, Kestutis Patiejunas, and JR Tipton, Facebook, Inc.; Ethan Katz-Bassett, Columbia University; Wyatt Lloyd, Princeton University
Tectonic is Facebook’s exabyte-scale distributed filesystem. Tectonic consolidates large tenants that previously used service-specific systems into general multitenant filesystem instances that achieve performance comparable to the specialized systems. The exabyte-scale consolidated instances enable better resource utilization, simpler services, and less operational complexity than our previous approach. This paper describes Tectonic’s design, explaining how it achieves scalability, supports multitenancy, and allows tenants to specialize operations to optimize for diverse workloads. The paper also presents insights from designing, deploying, and operating Tectonic.
Exploiting Combined Locality for Wide-Stripe Erasure Coding in Distributed Storage
Yuchong Hu, Liangfeng Cheng, and Qiaori Yao, Huazhong University of Science & Technology; Patrick P. C. Lee, The Chinese University of Hong Kong; Weichun Wang and Wei Chen, HIKVISION
Erasure coding is a low-cost redundancy mechanism for distributed storage systems by storing stripes of data and parity chunks. Wide stripes are recently proposed to suppress the fraction of parity chunks in a stripe to achieve extreme storage savings. However, wide stripes aggravate the repair penalty, while existing repair-efficient approaches for erasure coding cannot effectively address wide stripes. In this paper, we propose combined locality, the first mechanism that systematically addresses the wide-stripe repair problem via the combination of both parity locality and topology locality. We further augment combined locality with efficient encoding and update schemes. Experiments on Amazon EC2 show that combined locality reduces the single-chunk repair time by up to 90.5% compared to locality-based state-of-the-arts, with only a redundancy of as low as 1:063x.
On the Feasibility of Parser-based Log Compression in Large-Scale Cloud Systems
Junyu Wei and Guangyan Zhang, Tsinghua University; Yang Wang, The Ohio State University; Zhiwei Liu, China University of Geosciences; Zhanyang Zhu and Junchao Chen, Tsinghua University; Tingtao Sun and Qi Zhou, Alibaba Cloud
Given the tremendous scale of today’s system logs, compression is widely used to save space. While parser-based log compressor reported promising results, we observe less intriguing performance when applying it to our production logs.
Our detailed analysis shows that, first, some problems are caused by a combination of sub-optimal implementation and assumptions that do not hold on our large-scale logs. We address these issues with a more efficient implementation. Furthermore, our analysis reveals new opportunities for further improvement. In particular, numerical values account for a significant percentage of space and classic compression algorithms, which try to identify duplicate bytes, do not work well on numerical values. We propose three techniques, namely delta timestamps, correlation identification, and elastic encoding, to further compress numerical values.
Based on these techniques, we have built LogReducer. Our evaluation on 18 types of production logs and 16 types of public logs shows that LogReducer achieves the highest compression ratio in almost all cases and on large logs, its speed is comparable to the general-purpose compression algorithm that targets a high compression ratio.
CNSBench: A Cloud Native Storage Benchmark
Alex Merenstein, Stony Brook University; Vasily Tarasov, Ali Anwar, and Deepavali Bhagwat, IBM Research–Almaden; Julie Lee, Stony Brook University; Lukas Rupprecht and Dimitris Skourtis, IBM Research–Almaden; Yang Yang and Erez Zadok, Stony Brook University
Modern hybrid cloud infrastructures require software to be easily portable between heterogeneous clusters. Application containerization is a proven technology to provide this portability for the functionalities of an application. However, to ensure performance portability, dependable verification of a cluster's performance under realistic workloads is required. Such verification is usually achieved through benchmarking the target environment and its storage in particular, as I/O is often the slowest component in an application. Alas, existing storage benchmarks are not suitable to generate cloud native workloads as they do not generate any storage control operations (e.g., volume or snapshot creation), cannot easily orchestrate a high number of simultaneously running distinct workloads, and are limited in their ability to dynamically change workload characteristics during a run.
In this paper, we present the design and prototype for the first-ever Cloud Native Storage Benchmark—CNSBench. CNSBench treats control operations as first-class citizens and allows to easily combine traditional storage benchmark workloads with user-defined control operation workloads. As CNSBench is a cloud native application itself, it natively supports orchestration of different control and I/O workload combinations at scale. We built a prototype of CNSBench for Kubernetes, leveraging several existing containerized storage benchmarks for data and metadata I/O generation. We demonstrate CNSBench's usefulness with case studies of Ceph and OpenEBS, two popular storage providers for Kubernetes, uncovering and analyzing previously unknown performance characteristics.
Concordia: Distributed Shared Memory with In-Network Cache Coherence
Qing Wang, Youyou Lu, Erci Xu, Junru Li, Youmin Chen, and Jiwu Shu, Tsinghua University
Distributed shared memory (DSM) is experiencing a resurgence with emerging fast network stacks. Caching, which is still needed for reducing frequent remote access and balancing load, can incur high coherence overhead. In this paper, we propose CONCORDIA, a DSM with fast in-network cache coherence backed by programmable switches. At the core of CONCORDIA is FLOWCC, a hybrid cache coherence protocol, enabled by a collaborative effort from switches and servers. Moreover, to overcome limitations of programmable switches, we also introduce two techniques: (i) an ownership migration mechanism to address the problem of limited memory capacity on switches and (ii) idempotent operations to handle packet loss in the case that switches are stateful. To demonstrate CONCORDIA’s practical benefits, we build a distributed key-value store and a distributed graph engine on it, and port a distributed transaction processing system to it. Evaluation shows that CONCORDIA obtains up to 4.2x, 2.3x and 2x speedup over state-of-the-art DSMs on key-value store, graph engine and transaction processing workloads, respectively.
11:15 am–12:00 pm
Thursday, February 25, 2021
7:30 am–8:40 am
Session Chair: Carl Waldspurger, Carl Waldspurger Consulting
eMRC: Efficient Miss Ratio Approximation for Multi-Tier Caching
Zhang Liu, University of Colorado Boulder; Hee Won Lee, Samsung Electronics; Yu Xiang, AT&T Labs Research; Dirk Grunwald and Sangtae Ha, University of Colorado Boulder
Many storage cache allocation methods use the miss ratio curve (MRC) to improve cache efficiency. However, they have focused only on single-tier cache architectures and require the whole MRC as input for cache management, while modern datacenters embrace hierarchical caching architectures to maximize resource utilization. Generating the MRC for multi-tier caches—we call it the miss ratio function—is far more challenging due to different eviction policies and capacities in each cache tier. We introduce eMRC, a multi-dimensional miss ratio approximation technique, to enable efficient MRC generation for multi-tier caching. Our approach uses a novel multi-dimensional performance cliff removal method and convex hull approximation technique to efficiently generate a multi-dimensional MRC without cliffs using a small number of sampling points. To demonstrate the benefits of eMRC, we designed ORCA, a multi-tier cache management framework that orchestrates caches residing in different hierarchies through eMRC and provides efficient multi-tier cache configurations to cloud tenants with diverse service level objectives. We evaluate the performance of our eMRC approximation technique and ORCA with real-world datacenter traces.
The Storage Hierarchy is Not a Hierarchy: Optimizing Caching on Modern Storage Devices with Orthus
Kan Wu, Zhihan Guo, Guanzhou Hu, and Kaiwei Tu, University of Wisconsin–Madison; Ramnatthan Alagappan, VMware Research; Rathijit Sen and Kwanghyun Park, Microsoft; Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau, University of Wisconsin–Madison
We introduce non-hierarchical caching (NHC), a novel approach to caching in modern storage hierarchies. NHC improves performance as compared to classic caching by redirecting excess load to devices lower in the hierarchy when it is advantageous to do so. NHC dynamically adjusts allocation and access decisions, thus maximizing performance (e.g., high throughput, low 99%-ile latency). We implement NHC in Orthus-CAS (a block-layer caching kernel module) and Orthus-KV (a user-level caching layer for a key-value store). We show the efficacy of NHC via a thorough empirical study: Orthus-KV and Orthus-CAS offer significantly better performance (by up to 2x) than classic caching on various modern hierarchies, under a range of realistic workloads.
A Community Cache with Complete Information
Mania Abdi, Northeastern University; Amin Mosayyebzadeh, Boston University; Mohammad Hossein Hajkazemi, Northeastern University; Emine Ugur Kaynar, Boston University; Ata Turk, State Street; Larry Rudolph, TwoSigma; Orran Krieger, Boston University; Peter Desnoyers, Northeastern University
Kariz is a new architecture for caching data from datalakes accessed, potentially concurrently, by multiple analytic platforms. It integrates rich information from analytics platforms with global knowledge about demand and resource availability to enable sophisticated cache management and prefetching strategies that, for example, combine historical run time information with job dependency graphs (DAGs), information about the cache state and sharing across compute clusters. Our prototype supports multiple analytic frameworks (Pig/Hadoop and Spark), and we show that the required changes are modest. We have implemented three algorithms in Kariz for optimizing the caching of individual queries (one from the literature, and two novel to our platform) and three policies for optimizing across queries from, potentially, multiple different clusters. With an algorithm that fully exploits the rich information available from Kariz, we demonstrate major speedups (as much as 3×) for TPC-H and TPC-DS.
Learning Cache Replacement with CACHEUS
Liana V. Rodriguez, Farzana Yusuf, Steven Lyons, Eysler Paz, Raju Rangaswami, and Jason Liu, Florida International University; Ming Zhao, Arizona State University; Giri Narasimhan, Florida International University
Recent advances in machine learning open up new and attractive approaches for solving classic problems in computing systems. For storage systems, cache replacement is one such problem because of its enormous impact on performance. We classify workloads as a composition of four workload primitive types—LFU-friendly, LRU-friendly, scan, and churn. We then design and evaluate CACHEUS, a new class of fully adaptive, machine-learned caching algorithms that utilize a combination of experts designed to address these workload primitive types. The experts used by CACHEUS include the state-of-the-art ARC, LIRS and LFU, and two new ones – SR-LRU, a scan-resistant version of LRU, and CR-LFU, a churn-resistant version of LFU. We evaluate CACHEUS using 17;766 simulation experiments on a collection of 329 workloads run against 6 different cache configurations. Paired t-test analysis demonstrates that CACHEUS using the newly proposed lightweight experts, SR-LRU and CR-LFU, is the most consistently performing caching algorithm across a range of workloads and cache sizes. Furthermore, CACHEUS enables augmenting state-of-the-art algorithms (e.g., LIRS, ARC) by combining it with a complementary cache replacement algorithm (e.g., LFU) to better handle a wider variety of workload primitive types.
8:40 am–8:50 am
8:50 am–10:15 am
The SSD Revolution Is Not Over
Session Chair: Janki Bhimani, Florida International University
FusionRAID: Achieving Consistent Low Latency for Commodity SSD Arrays
Tianyang Jiang, Guangyan Zhang, and Zican Huang, Tsinghua University; Xiaosong Ma, Qatar Computing Research Institute, HBKU; Junyu Wei, Zhiyue Li, and Weimin Zheng, Tsinghua University
The use of all-flash arrays has been increasing. Compared to their hard-disk counterparts, each drive offers higher performance but also undergoes more severe periodic performance degradation (due to internal operations such as garbage collection). With a detailed study of widely-used applications/traces and 6 SSD models, we confirm that individual SSD's performance jitters are further magnified in RAID arrays. Our results also reveal that with SSD latency low and decreasing, the software overhead of RAID write creates long, complex write paths involving more drives, raising both average-case latency and risk of exposing worst-case performance.
Based on these findings, we propose FusionRAID, a new RAID architecture that achieves consistent, low latency on commodity SSD arrays. By spreading requests to all SSDs in a shared, large storage pool, bursty application workloads can be served by plenty of ''normal-behaving'' drives. By performing temporary, replicated writes, it retains RAID fault-tolerance yet greatly accelerates small, random writes. Blocks of such transient data replicas are created in stripe-ready locations based on RAID declustering, enabling effortless conversion to long-term RAID storage. Finally, using lightweight SSD latency spike detection and request redirection, FusionRAID avoids drives under transient but severe performance degradation. Our evaluation with traces and applications shows that FusionRAID brings a 22%–98% reduction in median latency, and a 2.7x–62x reduction in tail latency, with a moderate and temporary space overhead.
Behemoth: A Flash-centric Training Accelerator for Extreme-scale DNNs
Shine Kim, Seoul National University and Samsung Electronics; Yunho Jin, Gina Sohn, Jonghyun Bae, Tae Jun Ham, and Jae W. Lee, Seoul National University
The explosive expansion of Deep Neural Networks (DNN) model size expedites the need for larger memory capacity. This movement is particularly true for models in natural language processing (NLP), a dominant application of AI along with computer vision. For example, a recent extreme-scale language model GPT-3 from OpenAI has over 175 billion parameters. Furthermore, such a model mostly consists of FC layers with huge dimensions, and thus has a relatively high arithmetic intensity. In that sense, an extreme-scale language model does not suit well to the conventional HBM DRAM-based memory system that lacks capacity and offers extremely high bandwidth. For this reason, we propose to pair the neural network training accelerator with the flash-based memory system instead of the HBM DRAM-based memory system. To design the effective flash-based memory system, we optimize the existing SSD design to improve the SSD bandwidth as well as endurance. Finally, we evaluate our proposed platform, and show that Behemoth achieves 3.65× cost saving over TPU v3 and 2.05× training throughput improvement over the accelerator attached to a commercial SSD.
FlashNeuron: SSD-Enabled Large-Batch Training of Very Deep Neural Networks
Jonghyun Bae, Seoul National University; Jongsung Lee, Seoul National University and Samsung Electronics; Yunho Jin and Sam Son, Seoul National University; Shine Kim, Seoul National University and Samsung Electronics; Hakbeom Jang, Samsung Electronics; Tae Jun Ham and Jae W. Lee, Seoul National University
Deep neural networks (DNNs) are widely used in various AI application domains such as computer vision, natural language processing, autonomous driving, and bioinformatics. As DNNs continue to get wider and deeper to improve accuracy, the limited DRAM capacity of a training platform like GPU often becomes the limiting factor on the size of DNNs and batch size—called memory capacity wall. Since increasing the batch size is a popular technique to improve hardware utilization, this can yield a suboptimal training throughput. Recent proposals address this problem by offloading some of the intermediate data (e.g., feature maps) to the host memory. However, they fail to provide robust performance as the training process on a GPU contends with applications running on a CPU for memory bandwidth and capacity. Thus, we propose FlashNeuron, the first DNN training system using an NVMe SSD as a backing store. To fully utilize the limited SSD write bandwidth, FlashNeuron introduces an offloading scheduler, which selectively offloads a set of intermediate data to the SSD in a compressed format without increasing DNN evaluation time. FlashNeuron causes minimal interference to CPU processes as the GPU and the SSD directly communicate for data transfers. Our evaluation of FlashNeuron with four state-of-the-art DNNs shows that FlashNeuron can increase the batch size by a factor of 12.4x to 14.0x over the maximum allowable batch size on NVIDIA Tesla V100 GPU with 16GB DRAM. By employing a larger batch size, FlashNeuron also improves the training throughput by up to 37.8% (with an average of 30.3%) over the baseline using GPU memory only, while minimally disturbing applications running on CPU.
D2FQ: Device-Direct Fair Queueing for NVMe SSDs
Jiwon Woo, Minwoo Ahn, Gyusun Lee, and Jinkyu Jeong, Sungkyunkwan University
With modern high-performance SSDs that can handle parallel I/O requests from multiple tenants, fair sharing of block I/O is an essential requirement for performance isolation. Typical block I/O schedulers take three steps (submit-arbitrate-dispatch) to transfer an I/O request to a device, and the three steps incur high overheads in terms of CPU utilization, scalability and block I/O performance. This motivates us to offload the I/O scheduling function to a device. If so, the three steps can be reduced to one step (submit=dispatch), thereby saving CPU cycles and improving the I/O performance.
To this end, we propose D2FQ, a fair-queueing I/O scheduler that exploits the NVMe weighted round-robin (WRR) arbitration, a device-side I/O scheduling feature. D2FQ abstracts the three classes of command queues in WRR as three queues with different I/O processing speeds. Then, for every I/O submission D2FQ selects and dispatches an I/O request to one of three queues immediately while satisfying fairness. This avoids time-consuming I/O scheduling operations, thereby saving CPU cycles and improving the block I/O performance. The prototype is implemented in the Linux kernel and evaluated with various workloads. With synthetic workloads, D2FQ provides fairness while saving CPU cycles by up to 45% as compared to MQFQ, a state-of-the-art fair queueing I/O scheduler.
An In-Depth Study of Correlated Failures in Production SSD-Based Data Centers
Shujie Han and Patrick P. C. Lee, The Chinese University of Hong Kong; Fan Xu, Yi Liu, Cheng He, and Jiongzhou Liu, Alibaba Group
Flash-based solid-state drives (SSDs) are increasingly adopted as the mainstream storage media in modern data centers. However, little is known about how SSD failures in the field are correlated, both spatially and temporally. We argue that characterizing correlated failures of SSDs is critical, especially for guiding the design of redundancy protection for high storage reliability. We present an in-depth data-driven analysis on the correlated failures in the SSD-based data centers at Alibaba. We study nearly one million SSDs of 11 drive models based on a dataset of SMART logs, trouble tickets, physical locations, and applications. We show that correlated failures in the same node or rack are common, and study the possible impacting factors on those correlated failures. We also evaluate via trace-driven simulation how various redundancy schemes affect the storage reliability under correlated failures. To this end, we report 15 findings. Our dataset and source code are now released for public use.
10:15 am–10:30 am
10:30 am–11:30 am
DNA Data Storage and Near-Molecule Processing for the Yottabyte Era
Luis Ceze, University of Washington, and Karin Strauss, Microsoft Research
DNA data storage is an attractive option for digital data storage because of its extreme density, durability, eternal relevance and environmental sustainability. This is especially attractive when contrasted with the exponential growth in world-wide digital data production. In this talk we will present our efforts in building an end-to-end system, from the computational component of encoding and decoding to the molecular biology component of random access, sequencing and fluidics automation. We will also discuss some early efforts in building a hybrid electronic/molecular computer system that can offer more than data storage, for example, image similarity search.
Luis Ceze, University of Washington
Luis Ceze is a Professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington, Co-founder and CEO at OctoML, and Venture Partner at Madrona Venture Group. His research focuses on the intersection between computer architecture, programming languages, machine learning and biology. His current research focus is on approximate computing for efficient machine learning and DNA-based data storage. He co-directs the Molecular Information Systems Lab. He has co-authored over 100 papers in these areas, and had several papers selected as IEEE Micro Top Picks and CACM Research Highlights. His research has been featured prominently in the media including New York Times, Popular Science, MIT Technology Review, and Wall Street Journal, among others. He is a recipient of an NSF CAREER Award, a Sloan Research Fellowship, a Microsoft Research Faculty Fellowship, the 2013 IEEE TCCA Young Computer Architect Award, the 2020 ACM SIGARCH Maurice Wilkes Award and UIUC Distinguished Alumni Award.
Karin Strauss, Microsoft Research
Karin Strauss is a Senior Principal Research Manager at Microsoft Research and an Affiliate Professor at the Paul G. Allen School of Computer Science and Engineering at University of Washington. Her research areas are computer architecture, systems, and most recently biology, and she co-directs the Molecular Information System Laboratory. Her research interests include environmental sustainability of IT infrastructure, emerging memory and storage technologies, scaling of computation and storage, and special-purpose accelerators. She has over 100 papers and patents in these areas, was selected as one of the "100 Most Creative People in Business in 2016" by Fast Company Magazine, and is a recipient of the 2020 ACM SIGARCH Maurice Wilkes award. Her research has been featured prominently in the media including New York Times, Wall Street Journal, MIT Technology Review, Scientific American, and Popular Science, among others. She got her PhD from the Department of Computer Science at the University of Illinois, Urbana-Champaign in 2007.
11:30 am–11:45 am
Program Co-Chairs: Marcos K. Aguilera, VMware Research, and Gala Yadgar, Technion—Israel Institute of Technology