We are looking for a Cloud Observability and Performance Engineer to join our Chaos Cloud Engineering team. In this role, you will design and implement observability, monitoring, and performance strategies for cloud-hosted microservices that manage and orchestrate endpoint security agents at scale. This position is critical to the reliability, visibility, and performance of the backend systems that power cloud-based security operations for millions of endpoints worldwide.
Design, build, and maintain end-to-end observability for distributed cloud services (telemetry, logging, tracing, alerting).
Develop and optimize metrics pipelines and dashboards (e.g., Prometheus, Grafana, OpenTelemetry, Datadog).
Ensure high performance, availability, and scalability of agent management systems in production.
Collaborate with development, SRE, and security teams to troubleshoot production issues using observability tooling.