Source Job

Global

  • Build and own the foundational infrastructure that our products run upon.
  • Work directly on our products' Golang codebase to implement SRE-related objectives.
  • Take a data-driven approach to quantifying system performance and reliability.

Golang Kubernetes Automation

20 jobs similar to Software Engineer / Site Reliability Engineer

Jobs ranked by similarity.

ANZ

  • Designing, building, and operating Kubernetes infrastructure across multiple cloud providers.
  • Building and maintaining automation for cluster lifecycle management, node provisioning, and provider onboarding.
  • Developing platform tooling and abstractions that enable other Canva engineers to deploy and scale workloads.

Canva is a design platform redefining how the world experiences design. They have campuses in Sydney and Melbourne, along with co-working spaces in Brisbane, Perth and Adelaide, offering a flexible and inclusive work environment.

$141,000–$230,000/yr
US

  • Collaborate with engineering teams to design and implement scalable, secure systems.
  • Establish and manage service level objectives (SLOs) and service level agreements (SLAs).
  • Enhance incident response processes and post-mortem analysis for outages.

ClickHouse, recognized on the 2025 Forbes Cloud 100 list, is one of the most innovative and fast-growing private cloud companies. With more than 3,000 customers and ARR that has grown over 250 percent year over year, ClickHouse leads the market in real-time analytics, data warehousing, observability, and AI workloads.

$205,000–$270,000/yr
US Unlimited PTO

  • Partner with engineers to build dev tools that empower developer workflows and deployment infrastructure.
  • Ensure reliability of multi-cloud Kubernetes clusters and pipelines.
  • Focus on automation so we can spend energy where it matters.

Cresta is on a mission to turn every customer conversation into a competitive advantage by unlocking the true potential of the contact center. Their platform combines the best of AI and human intelligence to help contact centers discover customer insights and behavioral best practices.

US

  • Help deploy and configure Dynatrace OneAgent and ActiveGates with automated tooling.
  • Define and instrument user‑centric metrics and objectives in Dynatrace.
  • Combine Davis® AI with Copilot/Claude to identify root causes and reduce MTTR.

AWP Safety's IT Internship Program is a hands‑on learning experience for early‑career professionals who want to build a future in IT Site Reliability Engineering. They operate at the intersection of Software Engineering and Systems Operations, using Dynatrace to diagnose performance bottlenecks and automate "toil" out of existence.

US Canada

  • Own SLI/SLO/SLA definitions for the Akuity SaaS platform and drive continuous improvement.
  • Participate in an on-call rotation and act as incident commander for high-severity production events.
  • Partner with engineering teams to build reliability into new features before they ship to production.

Akuity helps enterprises ship software faster and more reliably with modern GitOps best practices. The Akuity Platform enables teams to manage the development and deployment across hundreds – if not thousands – of Kubernetes clusters from a single control plane.

$172,614/yr
US

  • Design infrastructure, networking, and software platform architecture.
  • Build and maintain automation of Continuous Integration and Continuous Deployment pipelines.
  • Troubleshoot infrastructure, internal applications, networking, and security issues.

Loadsmart is a technology company focused on the logistics and supply chain industry. They leverage data and technology to automate and optimize freight transportation, connecting shippers and carriers to streamline the shipping process. They are a mid-sized company passionate about transforming the future of freight.

North America Europe

  • Build distributed systems that support reliability, resiliency, and safe operation at scale.
  • Design and operate traffic control mechanisms: circuit breakers, rate limiting, admission control, backpressure, and graceful degradation.
  • Develop tooling that improves incident detection, response, and automated mitigation.

Whatnot is the largest live shopping platform in North America and Europe to buy, sell, and discover the things you love. They are a remote co-located team, inspired by innovation and anchored in their values.

Unlimited PTO

  • Build and operate cutting-edge cloud infrastructure to support Diagrid's core products
  • Define standards, deliver tools, processes, and frameworks to make our products secure, reliable, efficient, and highly available
  • Build and maintain CI/CD pipelines that enable delivering software quickly and securely across clouds

Diagrid believes that open-source software, open standards, and APIs are the greatest transformational tools for organizations. They provide developers with APIs and tools that help them focus on their code rather than on infrastructure. The company was founded by the creators of the Dapr and KEDA open-source projects.

Europe

  • Implement SLI/SLO frameworks with error budgets to drive reliability decisions
  • Design release strategies including blue/green deployments and version tracking
  • Lead incident response and develop automated runbooks to reduce MTTR

Jobgether is a company that helps connect individuals with jobs through an AI-powered matching process. They ensure applications are reviewed quickly, objectively, and fairly against roles' core requirements.

Americas

  • Work in Python and Golang to design and deliver open source software operations code
  • Shape high quality open source monitoring and alerting infrastructure
  • Grow a healthy, collaborative engineering culture in line with the company values

Canonical is a leading provider of open source software and operating systems to the global enterprise and technology markets. As the company that publishes Ubuntu, one of the most important open-source projects and the platform for AI, IoT, and the cloud, it is changing the world of software. The company has 1200+ colleagues in 75+ countries and a globally distributed collaboration culture.

$160,000–$200,000/yr
US

  • Help drive reliability, automation and performance within our cloud-based infrastructure.
  • Become embedded within an Engineering team helping them navigate production excellence and advocate for best practices.
  • Debug production issues across services and levels of the stack as well as practice incident response and blameless postmortems.

Flywire is a global payments enablement and software company that was founded over a decade ago. They have over 1,200 global FlyMates, representing more than 40 nationalities, in 12 offices worldwide, and are looking for people to join the next stage of their journey as they continue to grow.

$230,000–$250,000/yr
US Unlimited PTO 12w paternity

  • Define and evolve reliability standards for the SmarterDx platform.
  • Enhance observability systems (metrics, logs, traces, alerting) to provide actionable insights and reduce mean time to detect (MTTD) and resolve (MTTR).
  • Reduce operational toil through automation, self-healing systems, and improved deployment and rollback mechanisms.

SmarterDx, a Smarter Technologies company, builds clinical AI that is transforming how hospitals translate care into payment. Founded by physicians in 2020, their platform connects clinical context with revenue intelligence, helping health systems recover millions in missed revenue, improve quality scores, and appeal every denial.

Global

  • Contribute to our core product, working across our stack primarily in Go, on services that power our applications.
  • Design and refine technical systems, including microservices, customer interfaces, and automated tests.
  • Collaborate closely across disciplines to explore problems, prototype ideas, and iterate quickly.

Humanitec is at the forefront of the Platform Engineering revolution, as enterprise companies across the globe re-shape how they manage their cloud infrastructure. Their mission is to help platform engineering teams build Internal Developer Platforms that unlock true developer self-service.

$170,000–$240,000/yr
US 4w PTO

  • Own our fundamental cloud services and tooling.
  • Own our application platform.
  • Own our developer experience.

Propel builds technology that strengthens the social safety net. They are a passionate team of ~100 Propellers who envision a future where every American has the tools and resources they need to thrive, offering a remote-first working environment with headquarters in Brooklyn.

LATAM Unlimited PTO

  • Tech lead two teams (DevEx and Cloud Infrastructure) totaling 6–8 engineers: set technical direction, review key designs/changes, and raise engineering standards across both domains.
  • Own the delivery toolchain end-to-end (Git, CI, deployments/releases): reduce flakiness, improve build/test times, make releases repeatable with clear rollback, and drive adoption of org-wide standards through tooling, docs, and supported migrations.
  • Improve the software development lifecycle (setup → build/test → PR → deploy → observe) and standardize environments so teams spend less time on tooling and more time shipping.

Traackr is a global SaaS technology company providing a data-driven influencer marketing platform that marketers use to optimize investments, streamline campaigns, and scale programs. They are a remote-first company with offices in San Francisco, New York, Boston, Paris, and London and operate on a culture of mutual respect.

  • Maximize the velocity of our product engineering team.
  • Ensure platform scalability, reliability, and security.
  • Champion best practices and shape the engineering culture.

They are building a robust, scalable trading platform to serve high-traffic, latency-sensitive applications. They leverage state-of-the-art technologies to support real-time trading while providing unparalleled reliability and performance.

US Europe

  • Build and lead the team responsible for the reliability, security, and scalability of Gensyn’s production infrastructure and developer platform.
  • Own the availability, scalability, and security posture of production systems: SLOs/SLIs, incident response, postmortems, reliability improvements, and hardening.
  • Drive delivery across ambiguous, high-stakes initiatives: roadmap planning, prioritization, and execution against tight timelines.

Gensyn is building a protocol that networks together the core resources required for machine intelligence to flourish alongside human intelligence. They value autonomy, independence, direct feedback and an extreme learning rate, and strive to reject mediocrity and waste.

Europe

  • Analyze, evaluate, and resolve network incidents and service requests (L1–L2).

REWE Group Austria develops innovative IT products and services for all corporate divisions in Austria and abroad, setting the tone for modern trade. They have more than 700 employees. Their culture is family-friendly, with flexible working hours and remote working options.

US 6w PTO

  • Build and scale a strong culture of operational excellence by defining standards and coaching teams to own reliability and availability.
  • Drive mature DevOps/SRE practices, including incident response and PIRs, on-call readiness, runbooks, alerting, observability, and release/change management.
  • Guide teams in the design, development, evolution, and operation of large-scale, distributed cloud systems.

Grafana Labs is a remote-first, open-source powerhouse with more than 20M users of Grafana, the open source visualization tool, around the globe. They help more than 3,000 companies manage their observability strategies with the Grafana LGTM Stack.

Global

  • Comfortable working in a fully remote environment.
  • Value designing solutions to customer problems.
  • Comfortable rolling up your sleeves to understand incidents.

Humanitec is at the forefront of the Platform Engineering revolution, as enterprise companies across the globe re-shape how they manage their cloud infrastructure. They aim to help platform engineering teams build Internal Developer Platforms that unlock true developer self-service.