Scaling to 1 Million RPS and Beyond with Kubernetes, Istio and GRPC

DevOps and Operations

Talk accepted into the conference program

Robin Percy

OpsGuru

Robin brings nearly two decades of software engineering and operational experience to our leadership team and client engagements. As a member, committer, and reviewer within the Kubernetes community, he specializes in designing, building, and operating scalable, cloud-native platforms.

Abstract

Hall "Nairobi + Casablanca"

November 7, 12:00

I will speak about my recent experience designing and building production-grade, high-throughput Istio service meshes on Kubernetes (GKE). In particular, I will discuss three important considerations in designing such a system, with a working demonstration of each:
* Ensuring visibility at all tiers;
* Efficient use of gRPC streams;
* Scaling the Istio control plane.

Visibility into the correct metrics is a critical step in building a high-performance system. Istio provides great tooling for visibility, but the defaults are not suitable for production, especially under heavy load. I will demonstrate the customizations that Prometheus, Jaeger, and Grafana require in order to maximize visibility into the platform, service mesh, protocol, and application tiers of the system.

gRPC streams are a powerful tool, but there are pitfalls to be aware of when adopting them, especially in combination with Istio. For example, Istio exposes fewer metrics for gRPC streams than for HTTP/S, which can cost a lot of time and cause frustration when debugging and optimizing. Similarly, load balancing gRPC streams requires special consideration. I will demonstrate how to design a production-grade system that works around these limitations.
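The abstract does not say which load-balancing pitfalls the talk covers. One commonly cited cause is that a proxy picks a backend once per stream, so every message on a long-lived stream lands on the same pod. The sketch below is a toy simulation of that effect, not the speaker's design. The pod names, message counts, and both function names are made up for illustration.

```python
from collections import Counter
from itertools import cycle


def balance_per_stream(backends, streams):
    """Round-robin once per stream; all messages on a stream share one backend.

    `streams` is a list of message counts, one entry per long-lived stream.
    """
    picker = cycle(backends)
    load = Counter({b: 0 for b in backends})
    for message_count in streams:
        backend = next(picker)  # decision is made once, when the stream opens
        load[backend] += message_count
    return dict(load)


def balance_per_message(backends, streams):
    """Round-robin on every message, as with short unary calls."""
    picker = cycle(backends)
    load = Counter({b: 0 for b in backends})
    for message_count in streams:
        for _ in range(message_count):
            load[next(picker)] += 1
    return dict(load)


# Hypothetical workload: one hot stream and two quiet ones.
backends = ["pod-a", "pod-b", "pod-c"]
streams = [9000, 300, 300]

print(balance_per_stream(backends, streams))
print(balance_per_message(backends, streams))
```

With these made-up numbers, per-stream balancing leaves one pod handling most of the traffic, while per-message balancing spreads it evenly. Typical mitigations, such as periodically recycling streams or splitting work across more streams, trade some connection overhead for a flatter load distribution.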

Finally, I will demonstrate how the Istio control plane can be configured to scale with your workload. As of Istio 1.2.x, a misconfigured control plane can result in backpressure and downtime within the application data plane. In fact, the default HPA and telemetry configurations are not suitable for high-throughput systems. I will describe how these configurations can be modified to suit your workloads.
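To make the HPA point concrete, here is a minimal sketch of the replica calculation described in the Kubernetes Horizontal Pod Autoscaler documentation: desired = ceil(current * currentMetric / targetMetric), skipped when the ratio is within a tolerance and clamped to the configured bounds. The function name and every number in the usage lines are illustrative assumptions, not Istio's actual defaults.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas, tolerance=0.1):
    """Approximate the HPA scaling rule for a single metric."""
    ratio = current_metric / target_metric
    # Within the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # The result is always clamped to the configured minimum and maximum.
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical control-plane component at 240% CPU against an 80% target.
print(desired_replicas(1, 240, 80, min_replicas=1, max_replicas=5))
print(desired_replicas(2, 240, 80, min_replicas=1, max_replicas=5))
print(desired_replicas(1, 240, 80, min_replicas=1, max_replicas=1))
```

The last call shows why a low ceiling matters: once the maximum is reached, the component stays saturated no matter how far the metric exceeds its target, which is one way a control plane can end up applying backpressure to the data plane.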