Source Job

South America

  • Architect and maintain self-healing systems with 99.9%+ availability targets.
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns.
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data.

GCP Kubernetes Terraform Python Go

20 jobs similar to Principal Site Reliability Engineer (AI-first SRE)

Jobs ranked by similarity.

Latin America Unlimited PTO

  • Audit and optimize cloud usage, capacity, and spend.
  • Improve reliability through better automation, monitoring, and alerting.
  • Partner with engineers to upgrade infrastructure components and roll out changes safely.

Our client builds a high-scale data and analytics platform used by sophisticated teams to make critical business decisions. They are trusted by 800+ companies and value collaboration, high ownership, and long-term system reliability.

US

  • Architect and deploy secure, scalable infrastructure using Terraform, CloudFormation, or similar tools.
  • Ensure the platform meets strict SLA requirements for enterprise clients, minimizing downtime.
  • Implement comprehensive monitoring, logging, and alerting to provide deep visibility into system health.

Filevine provides cloud-based workflow tools for legal professionals, helping them manage organizations and serve clients. They are recognized as a fast-growing and innovative technology company with a team of passionate professionals.

US

  • Play a crucial part in designing and scaling secure cloud infrastructure.
  • Lead the charge in intelligent automation systems and ensure robust deployment processes.
  • Collaborate with product, engineering, and leadership to drive company success.

Jobgether is a company that connects job seekers with employers. They utilize an AI-powered matching process to ensure applications are reviewed quickly and objectively.

India

  • Oversee the reliability, scalability, performance, and security of key production services.
  • Collaborate with cross-functional teams to develop and maintain resilient infrastructure.
  • Provide expert mentorship and guidance on best practices to engineers throughout the organization.

Cision is a global leader in PR, marketing and social media management technology and intelligence, helping brands and organizations connect with customers and stakeholders to drive business results. The company has offices in 24 countries throughout the Americas, EMEA and APAC.

$219,000–$245,000/yr
US Unlimited PTO

  • Architect, operate, improve, and secure the platform the Garner Health app runs on.
  • Boost development velocity and productivity.
  • Build systems to a high engineering standard and hold others to the same high standard.

Garner has developed a revolutionary approach to evaluating doctor performance and a unique incentive model that's reshaping the healthcare economy to ensure everyone can afford high-quality care. They have more than doubled their revenue annually over the last 5 years. Garner's award-winning culture is designed to cultivate teamwork, trust, autonomy, exceptional results, and individual growth.

UK

  • Run the production environment by monitoring availability and taking a holistic view of system health.
  • Build software and systems to manage platform infrastructure and applications.
  • Improve reliability, quality, and time-to-market of our suite of software solutions.

NICE software products are used by 25,000+ global businesses to deliver extraordinary customer experiences, fight financial crime and ensure public safety.

  • Designing, building, and maintaining infrastructure that enables fast, reliable, and secure product delivery.
  • Improving and maintaining CI/CD pipelines to streamline deployments and increase reliability.
  • Contributing to infrastructure reliability and ensuring systems are designed for resilience and growth.

Incident.io is the leading AI incident response platform, built to help teams dramatically reduce incident response time and improve reliability. They have raised $100M from Index Ventures, Insight Partners, and Point Nine, alongside founders and executives from world-class technology companies.

$140,000–$190,000/yr
US Canada Unlimited PTO

  • Architect and maintain scalable, reliable infrastructure: Design and optimize infrastructure for high availability, fault tolerance, and performance across distributed systems.
  • Lead incident management and root cause analysis: Own incident response processes, ensure swift resolution of issues, and drive post-incident improvements to prevent recurrences.
  • Service monitoring and automation: Build and maintain automated monitoring, alerting, and healing systems that improve system health, reduce manual intervention, and minimize downtime.

VGS is the world's leader in payment tokenization, empowering clients and partners by tokenizing sensitive payment data and limiting compliance scope. They embed a universal token vault into their technology stack to manage the complexities of payment data tokenization across processors, networks, and more. They appear to have a culture that values transparency, collaboration, grit, and humility.

ANZ

  • Building world-class AI infrastructure to support a 100+ person research team.
  • Designing and scaling multi-cloud systems that support high-performance model training and inference.
  • Improving monitoring, alerting and system observability for AI workloads.

Canva is redefining how the world experiences design. They have campuses in Sydney and Melbourne, co-working spaces in Brisbane, Perth, Adelaide and Auckland, and trust their employees to choose the balance that empowers them and their team to achieve their goals.

Global 6w PTO 26w maternity

  • Build self-service systems that automate managing, deploying and operating services.
  • Automate environment observability and resilience. Enable all developers to troubleshoot and resolve problems.
  • Ensure we hit defined SLOs, including participation in an on-call rotation.

Cohere is focused on scaling intelligence to serve humanity by training and deploying frontier models for developers and enterprises. They are a team of researchers, engineers, and designers. They value diversity and strive to create an inclusive work environment.

Europe

  • Lead the Reliability & Operations function within the Developer & Production Enablement (DPE) division of RWS’s Product & Technology organization.
  • Take ownership of global production operations and lead the transition from manual, ticket-based workflows to platform-integrated automation.
  • Ensure stability today while designing for scalability and autonomy in the future.

RWS's purpose is to unlock global understanding, valuing every language and culture, and celebrating diversity and inclusion to make the company strong.

$89,155–$287,488/yr
Global

  • Configure and maintain cloud infrastructure automation using Terraform, focusing on CDN optimization and content delivery performance.
  • Develop capacity planning strategies and performance optimization initiatives for high-volume spatial content delivery.
  • Instrument services to understand system health.

Miris is a cutting-edge technology company building the future of 3D content delivery at global scale. Our mission is to empower creators and developers to deliver high-fidelity, photorealistic 3D experiences to billions of users instantly, seamlessly, and across all major platforms and devices.

US

  • Design, create, and maintain software and systems to improve the availability, scalability, and efficiency of Thumbtack's services.
  • Set the architectural direction of infrastructure and platform services while supporting the engineering organization.
  • Troubleshoot and debug critical systems throughout the SDLC.

Thumbtack helps millions of people confidently care for their homes by offering personalized guidance, AI tools, and a hiring experience. They have a growing community of 300,000 local service businesses and value a cross-functional, collaborative culture.

Americas EMEA Unlimited PTO

  • Design and implement highly scalable infrastructure for GitLab.com to support current and future growth.
  • Collaborate with cross-functional teams across the Infrastructure organization to plan and deliver projects that shape GitLab’s platform direction.
  • Operate and improve edge services and Kubernetes workloads, acting as a subject matter expert within the infrastructure department.

GitLab is an open-core software company that develops the most comprehensive AI-powered DevSecOps Platform, used by more than 100,000 organizations. They aim to enable everyone to contribute to and co-create the software that powers our world.

$155,000–$165,000/yr
US Unlimited PTO

  • Lead maintenance and operations for production and development environments.
  • Architect and implement complex solutions spanning OS, virtualization, network, and cloud layers.
  • Lead automation initiatives for infrastructure provisioning and operational tasks.

NMI enables partners with choice in payments, challenging the one-size-fits-all approach. They power innovative tech for SMBs, entrepreneurs, and fintech startups, fostering a diverse and welcoming workplace with a dedicated Diversity, Equity & Inclusion action group.

US

  • Designing & maintaining GCP infrastructure (GKE, Bigtable, BigQuery, GCS, networking).
  • Building monitoring, alerting, logging, and observability from the ground up.
  • Improving our security posture across auth, IAM, policies, and data access.

Software Mind develops solutions that make an impact for companies around the globe. They build cross-functional engineering teams that take ownership and crave more, embracing openness, respect, and grit. They combine employment with enjoyment in their culture.

US

  • Shape and scale critical infrastructure for one of the largest online platforms in the world.
  • Build, maintain, and optimize multi-cloud compute systems for high-performance, reliable, and secure operations.
  • Influence the technical direction of infrastructure platforms while mentoring and guiding other engineers.

This position is posted by Jobgether on behalf of a partner company.

India

  • Design and manage AWS infrastructure for AI services.
  • Implement Infrastructure as Code using Terraform.
  • Collaborate with cross-functional teams to enhance performance.

Jobgether uses an AI-powered matching process to ensure applications are reviewed quickly, objectively, and fairly against the role's core requirements. Their system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company.

Australia New Zealand

  • You’ll own challenging infrastructure problems end-to-end.
  • You’ll design scalable, maintainable services and contribute to technical proposals.
  • You’ll contribute to the roadmap for our Provisioning team.

Canva is a design platform that enables users to create a variety of visual content. They have campuses in Sydney and Melbourne, co-working spaces in other major cities, and offer a flexible work environment.

Canada

  • Design, create, and maintain software and systems to improve the availability, scalability, and efficiency of Thumbtack's services.
  • Set the architectural direction of infrastructure and platform services while supporting the engineering organization.
  • Design and implement tools and processes used for deployment, change, service, and infrastructure management.

Thumbtack helps millions of people confidently care for their homes through personalized guidance, AI tools, and a hiring experience. They have a growing community of 300,000 local service businesses.