Remote DevOps Jobs · Prometheus

Job listings

Monitor public and internal IT services round the clock. Process events in the incident management system promptly. Diagnose issues and fix them when possible. Develop and maintain the existing monitoring systems and tooling: Terraform for managing resources on AWS and VMware vSphere, Ansible for configuration management, and TeamCity for continuous delivery. Develop the Prometheus + Kubernetes stack. Perform DevOps tasks for other teams.

As a Site Reliability Engineer (SRE) at Alpaca, you will be responsible for ensuring the reliability, scalability, and performance of our systems and services. You will work closely with development, operations and DevOps teams to build and maintain robust applications, ensuring they run smoothly and efficiently. This role requires a blend of software engineering and operations skills, with a strong ability to troubleshoot technical issues and resolve problems before they impact our users.

The Site Reliability Engineer plays a key role in platform enablement by building and maintaining the core infrastructure tooling that lets teams deploy and operate services reliably on AWS and Kubernetes. This position focuses on managing and evolving internal Infrastructure as Code (IaC) constructs, primarily Python-based abstractions built with AWS CDK and CDK8s. The engineer works closely with backend teams to drive platform reliability and developer productivity.