Operationalizing ML Training Infra at Meta Scale

Friday, December 09, 2022 - 2:20 pm–2:50 pm AEDT

Shivam Bharuka, Meta


Machine learning models are growing rapidly in scale to support recommendation and content understanding use cases. To keep up with this growth, we have re-architected the entire AI infrastructure stack, from building specialized hardware with powerful GPUs and network devices to designing optimized distributed training algorithms in PyTorch. Traditional reliability practices do not translate well to detecting problems in the ML training stack. In this talk, I will cover the challenges we encountered and the approach we took to redesign and scale reliability for the ML Training Platform.

Shivam Bharuka, Meta

Shivam is an engineering leader at Meta, where he has been part of the AI Infrastructure team for the last three years. During this time, he has helped scale the machine learning training infrastructure at Meta to support large-scale ranking and recommendation models, serving more than a billion users. He is responsible for driving performance, reliability, and efficiency-oriented designs across the components of the ML training stack at Meta. Shivam holds a B.S. and an M.S. in Computer Engineering from the University of Illinois at Urbana-Champaign.

@conference {284953,
author = {Shivam Bharuka},
title = {Operationalizing {ML} Training Infra at Meta Scale},
year = {2022},
address = {Sydney},
publisher = {USENIX Association},
month = dec,
}
