Empowering Azure Storage with RDMA


Wei Bai, Shanim Sainul Abdeen, Ankit Agrawal, Krishan Kumar Attre, Paramvir Bahl, Ameya Bhagat, Gowri Bhaskara, Tanya Brokhman, Lei Cao, Ahmad Cheema, Rebecca Chow, Jeff Cohen, Mahmoud Elhaddad, Vivek Ette, Igal Figlin, Daniel Firestone, Mathew George, Ilya German, Lakhmeet Ghai, Eric Green, Albert Greenberg, Manish Gupta, Randy Haagens, Matthew Hendel, Ridwan Howlader, Neetha John, Julia Johnstone, Tom Jolly, Greg Kramer, David Kruse, Ankit Kumar, Erica Lan, Ivan Lee, Avi Levy, Marina Lipshteyn, Xin Liu, Chen Liu, Guohan Lu, Yuemin Lu, Xiakun Lu, Vadim Makhervaks, Ulad Malashanka, David A. Maltz, Ilias Marinos, Rohan Mehta, Sharda Murthi, Anup Namdhari, Aaron Ogus, Jitendra Padhye, Madhav Pandya, Douglas Phillips, Adrian Power, Suraj Puri, Shachar Raindel, Jordan Rhee, Anthony Russo, Maneesh Sah, Ali Sheriff, Chris Sparacino, Ashutosh Srivastava, Weixiang Sun, Nick Swanson, Fuhou Tian, Lukasz Tomczyk, Vamsi Vadlamuri, Alec Wolman, Ying Xie, Joyce Yom, Lihua Yuan, Yanzhao Zhang, and Brian Zill, Microsoft


Given the wide adoption of disaggregated storage in public clouds, networking is the key to enabling high performance and high reliability in a cloud storage service. In Azure, we choose Remote Direct Memory Access (RDMA) as our transport and aim to enable it for both storage frontend traffic (between compute virtual machines and storage clusters) and backend traffic (within a storage cluster) to fully realize its benefits. As compute and storage clusters may be located in different datacenters within an Azure region, we need to support RDMA at regional scale.

This work presents our experience in deploying intra-region RDMA to support storage workloads in Azure. The high complexity and heterogeneity of our infrastructure bring a series of new challenges, such as the problem of interoperability between different types of RDMA network interface cards. We have made several changes to our network infrastructure to address these challenges. Today, around 70% of traffic in Azure is RDMA and intra-region RDMA is supported in all Azure public regions. RDMA helps us achieve significant disk I/O performance improvements and CPU core savings.

@inproceedings {286500,
author = {Wei Bai and Shanim Sainul Abdeen and Ankit Agrawal and Krishan Kumar Attre and Paramvir Bahl and Ameya Bhagat and Gowri Bhaskara and Tanya Brokhman and Lei Cao and Ahmad Cheema and Rebecca Chow and Jeff Cohen and Mahmoud Elhaddad and Vivek Ette and Igal Figlin and Daniel Firestone and Mathew George and Ilya German and Lakhmeet Ghai and Eric Green and Albert Greenberg and Manish Gupta and Randy Haagens and Matthew Hendel and Ridwan Howlader and Neetha John and Julia Johnstone and Tom Jolly and Greg Kramer and David Kruse and Ankit Kumar and Erica Lan and Ivan Lee and Avi Levy and Marina Lipshteyn and Xin Liu and Chen Liu and Guohan Lu and Yuemin Lu and Xiakun Lu and Vadim Makhervaks and Ulad Malashanka and David A. Maltz and Ilias Marinos and Rohan Mehta and Sharda Murthi and Anup Namdhari and Aaron Ogus and Jitendra Padhye and Madhav Pandya and Douglas Phillips and Adrian Power and Suraj Puri and Shachar Raindel and Jordan Rhee and Anthony Russo and Maneesh Sah and Ali Sheriff and Chris Sparacino and Ashutosh Srivastava and Weixiang Sun and Nick Swanson and Fuhou Tian and Lukasz Tomczyk and Vamsi Vadlamuri and Alec Wolman and Ying Xie and Joyce Yom and Lihua Yuan and Yanzhao Zhang and Brian Zill},
title = {Empowering Azure Storage with {RDMA}},
booktitle = {20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)},
year = {2023},
isbn = {978-1-939133-33-5},
address = {Boston, MA},
pages = {49--67},
url = {https://www.usenix.org/conference/nsdi23/presentation/bai},
publisher = {USENIX Association},
month = apr,
}

