Real-Time Adaptive Controls for Resilient Distributed Systems

Thursday, December 08, 2022 - 1:20 pm-2:20 pm AEDT

Praveen Yedidi, CrowdStrike


Modern services are equipped with hundreds of tunables. Many of them, such as worker pool sizes, autoscaling policies, throttlers, and circuit breakers, directly affect service resilience. Finding ideal initial values for these tunables requires deep technical expertise. Workloads also change over time, requiring regular effort to re-tune stale parameters. As a consequence, configuration errors have become a source of operational toil and one of the major causes of overload and cascading service and system failures across the industry. Services should aim to expose a minimal configuration surface by dynamically adjusting parameters based on observations. Praveen will provide a deep dive into how CrowdStrike uses real-time adaptive controls (inspired by TCP congestion control) to dynamically tune these parameters for improved resiliency using real-time sampling of errors and latencies, removing the need for periodic adjustment. He will also discuss lessons learned from deploying the feature, without causing any incidents, to CrowdStrike's massive production systems, which handle multiple trillions of events per day.
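The abstract does not say which control algorithm the talk covers. As a rough illustration of the general idea only, the Go sketch below applies additive increase, multiplicative decrease (AIMD), the classic rule from TCP congestion avoidance, to a concurrency limit driven by sampled errors and latencies. The type AIMDLimiter, its methods, and all parameters are hypothetical names chosen for this example and are not CrowdStrike's implementation.

```go
package main

import (
	"sync"
	"time"
)

// AIMDLimiter is a hypothetical adaptive concurrency limit. It grows slowly
// while requests succeed quickly and shrinks sharply on errors or slow
// responses, in the spirit of TCP congestion avoidance.
type AIMDLimiter struct {
	mu               sync.Mutex
	limit            float64       // current limit (fractional so small increases accumulate)
	minLimit         float64       // lower bound; assumed to be >= 1
	maxLimit         float64       // upper bound
	backoff          float64       // multiplicative decrease factor, e.g. 0.5
	latencyThreshold time.Duration // samples slower than this count as overload
	inFlight         int           // requests currently admitted
}

// NewAIMDLimiter builds a limiter; every argument is an illustrative choice.
func NewAIMDLimiter(initial, minLimit, maxLimit, backoff float64, latencyThreshold time.Duration) *AIMDLimiter {
	return &AIMDLimiter{
		limit:            initial,
		minLimit:         minLimit,
		maxLimit:         maxLimit,
		backoff:          backoff,
		latencyThreshold: latencyThreshold,
	}
}

// Acquire admits a request if the number in flight is below the current limit.
func (l *AIMDLimiter) Acquire() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inFlight >= int(l.limit) {
		return false
	}
	l.inFlight++
	return true
}

// Release records one completed request and adjusts the limit from the sample.
func (l *AIMDLimiter) Release(latency time.Duration, failed bool) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inFlight > 0 {
		l.inFlight--
	}
	if failed || latency > l.latencyThreshold {
		// Multiplicative decrease: back off quickly under suspected overload.
		l.limit *= l.backoff
		if l.limit < l.minLimit {
			l.limit = l.minLimit
		}
		return
	}
	// Additive increase: roughly +1 per "window" of successful requests.
	l.limit += 1 / l.limit
	if l.limit > l.maxLimit {
		l.limit = l.maxLimit
	}
}

// Limit reports the current integer limit.
func (l *AIMDLimiter) Limit() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return int(l.limit)
}
```

In this sketch a caller would wrap each request with Acquire and Release, shedding or queueing work when Acquire returns false. The operator supplies only bounds and a latency threshold instead of a hand-tuned fixed pool size, which is one way to read the abstract's "minimal configuration surface".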

Distributed systems developer and engineering manager with a decade of experience in large-scale cloud-native application and tooling development, and in mentoring, facilitating, and leading teams. Excellent analytical skills combined with strong knowledge of Go, JavaScript, Kubernetes, AWS, Terraform, Vault, Consul, service meshes, and observability and monitoring tools. Active open source contributor to projects including Kubernetes, gVisor, Grafana, Terraform, and firecracker-containerd.

