Job Description
Serve as a technical leader in modern SRE practices with a focus on zero-trust infrastructure, platform observability, and cloud-native scalability.
Guide the architectural evolution of reliability systems, including multi-cluster Kubernetes environments, GitOps workflows, and service mesh integration.
Champion SLO-driven engineering across teams and establish frameworks for defining, tracking, and enforcing reliability standards.
Partner with platform and security teams to enable service-to-service authentication, policy enforcement, and resilient control planes.
Develop AI-assisted tools and workflows (e.g., for incident triage, RCA generation, auto-remediation) to reduce operational burden and accelerate resolution.
Define and maintain end-to-end observability strategies including distributed tracing, metrics pipelines, and log enrichment.
About Upwork
Upwork is the world’s work marketplace that serves everyone from one-person startups to over 30% of the Fortune 100 companies.