Source Job

Europe

Lead the Reliability & Operations function within the Developer & Production Enablement (DPE) division of RWS’s Product & Technology organization. Take ownership of global production operations and lead the transition from manual, ticket-based workflows to platform-integrated automation. Ensure stability today, while designing for scalability and autonomy in the future.

AWS Azure GCP Automation Observability

20 jobs similar to Head of Reliability and Operations

Jobs ranked by similarity.

$95,696–$108,929/yr
AU 5w PTO 12w maternity

  • Share SRE expertise with teams across the company.
  • Keep our build systems running with high reliability and availability.
  • Improve and iterate on our existing reliability practices.

Octopus Deploy sets the standard for Continuous Delivery, empowering software teams to deliver value in an agile way.

Design, implement, monitor, and maintain Sysdig's infrastructure at scale across different clouds and on-prem. Collaborate with development teams to improve system reliability, performance, and scalability. Participate in on-call rotation, respond to incidents, conduct root cause analyses, and implement preventive measures.

Sysdig helps organizations secure innovation in the cloud with runtime insights, open innovation, and agentic AI, trusted by over 60% of the Fortune 500.

Europe US 3w PTO

  • Own and sequence the Cloud Infra & Services roadmap.
  • Define SLOs with SRE/Infra and enforce error budgets for shipping.
  • Partner with Finance/RevOps on entitlements, reporting, and auditability.

n8n is the open workflow orchestration platform built for the new era of AI, giving technical teams the freedom of code with the speed of no-code.

Europe

As an SRE you will be responsible for ensuring the availability, performance and cost effectiveness of these services. You will work with multiple feature development teams and the BAU/Support team to define and evolve our cloud & on-prem infrastructure & delivery pipelines, improving system observability and proactively identifying and mitigating reliability risks.

TwinStream was formed in 2019, when our founders were working as engineers solving complex cross-domain problems within government organisations.

$160,000–$182,000/yr
US

  • Lead and mentor multiple teams across SRE, cloud infrastructure, and platform engineering functions.
  • Drive multi-team initiatives to deliver scalable, secure, and cost-efficient infrastructure leveraging AWS-native and serverless technologies.
  • Drive adoption of FinOps practices and partner with finance and product teams on budgeting and forecasting.

Model N is the leader in revenue optimization and compliance for pharmaceutical, medtech, and high-tech innovators. Model N is trusted by over 150 of the world’s leading companies across more than 120 countries.

Canada 5w PTO

Design, implement, and evolve large-scale, cloud-native infrastructure supporting MariaDB's global SaaS platform. Lead reliability and scalability initiatives, driving automation and resilience through infrastructure-as-code and GitOps practices. Proactively identify and remediate systemic reliability issues, ensuring high service availability and performance across multi-cloud environments.

MariaDB is making a big impact on the world and is the backbone of applications used every day, including by 75% of Fortune 500 companies.

Brazil 26w maternity 4w paternity

Support the evolution of our platform by improving scalability, reliability, observability, and security. Proactively identify bottlenecks and unlock the autonomy of the entire engineering team. Maintain infrastructure & deployment pipelines and collaborate with engineering teams on architectural decisions and production-readiness practices.

Feegow joined the Docplanner Group, a health-tech company, in 2022 and is dedicated to developing innovative solutions for physicians and managers.

$155,000–$165,000/yr
US Unlimited PTO

  • Lead maintenance and operations for production and development environments.
  • Architect and implement complex solutions spanning OS, virtualization, network, and cloud layers.
  • Lead automation initiatives for infrastructure provisioning and operational tasks.

NMI gives partners choice in payments, challenging the one-size-fits-all approach. They power innovative tech for SMBs, entrepreneurs, and fintech startups, and foster a diverse and welcoming workplace with a dedicated Diversity, Equity & Inclusion action group.

India Unlimited PTO

Seeking an experienced Site Reliability Engineer to help build highly resilient and scalable systems by automating, measuring, and monitoring everything. Implement highly-available and scalable architectures for core and third-party components of Acquia Source. Implement metrics, monitoring, and incident response processes.

Acquia is an open source digital experience company providing technology to brands that allows them to embrace innovation and create customer moments that matter.

Europe

Contribute heavily to the architecture and migration of our CI/CD platform. Act as a pragmatic driver and senior contributor, responsible for designing and implementing solutions. Design and build the paved path as a product, ensuring it is reliable, secure, and well-documented.

Glia is the leading AI customer service solution for banks and credit unions offering AI and human agents across every voice and digital conversation.

4w PTO

As a Senior Software Engineer, Enterprise Platform at Vanta, you will build and operate systems that power Vanta’s FedRAMP environments, including automated release, vulnerability remediation, and evidence generation pipelines that meet strict compliance timelines. You will also define and evolve Vanta’s production reliability framework, including SLOs, incident response patterns, observability standards, service catalog, metrics dashboards, and the Vanta SLA definition. You will identify and solve complex scalability and performance challenges, particularly related to service reliability and data throughput.

Vanta helps businesses earn and prove trust by empowering companies to practice better security and prove it with ease.

US

  • Designs, implements, and continuously improves observability strategies across services.
  • Focuses on understanding system behavior in production, identifying failure modes, performance bottlenecks, and reliability risks.
  • Evolves and maintains shared AWS CDK and CDK8s constructs, with emphasis on observability, autoscaling, and operational safeguards.

Truelogic is a leading provider of nearshore staff augmentation services. They have a team of 600+ highly skilled tech professionals based in Latin America, partnering with U.S. companies on impactful projects and valuing expertise and aspirations.

$74,900–$99,000/yr
US

  • Lead and mentor a team of Specialists, fostering a culture of ownership and continuous learning.
  • Enable Change and Problem Management teams to leverage Datadog observability tools for evaluating release quality.
  • Oversee implementation and optimization of CI/CD Observability pipelines to ensure Operational Readiness standards are met.

BWH Hotels has been a global leader in hospitality for nearly 80 years, inspiring travel through unique experiences. Headquartered in Phoenix, Arizona, BWH Hotels has a portfolio of 18 brands and fosters a workplace culture where contributions truly matter.

UK

Run the production environment by monitoring availability and taking a holistic view of system health. Build software and systems to manage platform infrastructure and applications. Improve reliability, quality, and time-to-market of our suite of software solutions.

NICE software products are used by 25,000+ global businesses to deliver extraordinary customer experiences, fight financial crime and ensure public safety.

Germany

Shape the way Scalable runs microservices in a performant, secure, and cost-efficient way. Collaborate with cross-functional teams to understand scalability requirements. Develop and maintain internal tooling around Monitoring, Developer Portal, and Load Testing.

Scalable Capital is a leading digital investment and banking platform with a full banking licence, empowering people across Europe to shape their own finances.

$140,000–$190,000/yr
US Canada Unlimited PTO

  • Architect and maintain scalable, reliable infrastructure: Design and optimize infrastructure for high availability, fault tolerance, and performance across distributed systems.
  • Lead incident management and root cause analysis: Own incident response processes, ensure swift resolution of issues, and drive post-incident improvements to prevent recurrences.
  • Service monitoring and automation: Build and maintain automated monitoring, alerting, and healing systems that improve system health, reduce manual intervention, and minimize downtime.

VGS is the world's leader in payment tokenization, empowering clients and partners by tokenizing sensitive payment data and limiting compliance scope. They embed a universal token vault into their technology stack to manage the complexities of payment data tokenization across processors, networks, and more. Their culture appears to value transparency, collaboration, grit, and humility.

Europe 6w PTO

  • Lead and support the platform team through coaching and clear expectations.
  • Own the platform strategy and roadmap, prioritizing initiatives and managing team capacity.
  • Provide technical direction for the AWS- and Kubernetes-based platform.

bunch is building the backbone of private markets, combining exceptional expertise, operational excellence, and frictionless technology.

US

  • Ramp on AWS architecture, Terraform patterns, Kubernetes setup, CI/CD pipelines, and observability stack.
  • Take ownership of an infrastructure area: CI/CD pipelines, observability stack, Kubernetes platform, or AWS security/networking.
  • Shape infrastructure direction with design docs, RFC proposals, and mentoring engineering teams.

Bastion enables financial institutions and enterprises to issue regulated stablecoins, generate revenue on reserves, and expand their ecosystems.

ANZ

  • Own challenging infrastructure problems end-to-end by understanding how engineers use the platform.
  • Design scalable, maintainable services and contribute to technical proposals.
  • Contribute to the roadmap, highlighting opportunities, validating approaches and helping keep our platform solutions current with cloud best practices.

Canva's intuitive suite of design products is powered by our large distributed infrastructure group, which sets large and ambitious goals.

Europe

Contribute to operational excellence by enhancing cloud support maturity and driving standardization. Lead and coordinate incident response activities, ensuring timely resolution. Provision, configure, and maintain Azure resources across multiple environments using IaC tools such as Bicep.

Software Mind develops solutions that make an impact for companies around the globe.