Source Job

US Unlimited PTO

  • Lead infrastructure resiliency efforts including recovery mechanisms, tenant isolation, and load spike handling
  • Improve observability and operability of systems
  • Build performance-critical, user-facing infrastructure like real-time event processing

Debugging Observability Risk Mitigation

20 jobs similar to Software Engineer, Infrastructure

Jobs ranked by similarity.

US

Shape and scale critical infrastructure for one of the largest online platforms in the world. Build, maintain, and optimize multi-cloud compute systems for high-performance, reliable, and secure operations. Influence the technical direction of infrastructure platforms while mentoring and guiding other engineers.

This position is posted by Jobgether on behalf of a partner company.

$140,000–$190,000/yr
US Canada Unlimited PTO

  • Architect and maintain scalable, reliable infrastructure: Design and optimize infrastructure for high availability, fault tolerance, and performance across distributed systems.
  • Lead incident management and root cause analysis: Own incident response processes, ensure swift resolution of issues, and drive post-incident improvements to prevent recurrences.
  • Service monitoring and automation: Build and maintain automated monitoring, alerting, and healing systems that improve system health, reduce manual intervention, and minimize downtime.

VGS is the world's leader in payment tokenization, empowering clients and partners by tokenizing sensitive payment data and limiting compliance scope. They embed a universal token vault into their technology stack to manage the complexities of payment data tokenization across processors, networks, and more. While the job posting doesn't specify size, they appear to have a culture that values transparency, collaboration, grit, and humility.

ANZ

  • Own challenging infrastructure problems end-to-end by understanding how engineers use the platform.
  • Design scalable, maintainable services and contribute to technical proposals.
  • Contribute to the roadmap, highlighting opportunities, validating approaches and helping keep our platform solutions current with cloud best practices.

Canva's intuitive suite of design products is powered by our large distributed infrastructure group, setting large and ambitious goals.

US Unlimited PTO

  • Serve as the highest-level technical support resource, handling complex, high-priority issues.
  • Collaborate with Engineering and Product teams to triage and resolve bugs or architectural issues.
  • Conduct deep diagnostics, including logs, APIs, and infrastructure troubleshooting.

Endor Labs is building the Application Security platform for the software development revolution, helping teams identify, prioritize, and fix critical risks faster.

Canada

  • Lead engineering teams responsible for Edge Traffic Infrastructure, ensuring networking architecture remains resilient.
  • Define and deliver a vision, strategy, and roadmap for system ingress and egress points, tying it to business impact.
  • Collaborate with infrastructure, security, and engineering teams to build robust networking solutions.

Jobgether uses an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Their system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company.

Design, implement, monitor and maintain Sysdig's Infrastructure at scale on different clouds and on-prem. Collaborate with development teams to improve system reliability, performance, and scalability. Participate in on-call rotation, respond to incidents, conduct root cause analyses, and implement preventive measures.

Sysdig helps organizations secure innovation in the cloud with runtime insights, open innovation, and agentic AI, trusted by over 60% of the Fortune 500.

Europe

As an SRE, you will be responsible for ensuring the availability, performance, and cost-effectiveness of these services. You will work with multiple feature development teams and the BAU/Support team to define and evolve our cloud and on-prem infrastructure and delivery pipelines, improving system observability and proactively identifying and mitigating reliability risks.

TwinStream was formed in 2019, when our founders were working as engineers solving complex cross-domain problems within government organisations.

  • Influence and align cross-functional teams on platform evolution.
  • Architect and evolve hypervisor integrations across thousands of hosts.
  • Drive advanced performance tuning across CPU, memory, I/O, networking, and storage layers.

Vultr is on a mission to make high-performance cloud infrastructure easy to use, affordable, and locally accessible for enterprises and AI innovators around the world.

$155,000–$165,000/yr
US Unlimited PTO

  • Lead maintenance and operations for production and development environments.
  • Architect and implement complex solutions spanning OS, virtualization, network, and cloud layers.
  • Lead automation initiatives for infrastructure provisioning and operational tasks.

NMI enables partners with choice in payments, challenging the one-size-fits-all approach. They power innovative tech for SMBs, entrepreneurs, and fintech startups, fostering a diverse and welcoming workplace with a dedicated Diversity, Equity & Inclusion action group.

US

Manage and resolve high-impact customer escalations for enterprise products and services. Act as a technical liaison between engineering and support teams to drive rapid issue resolution. Debug and troubleshoot complex problems in cloud environments and operating systems (Linux/Unix).

Zscaler accelerates digital transformation so our customers can be more agile, efficient, resilient, and secure.

India Unlimited PTO

Seeking an experienced Site Reliability Engineer to help build highly resilient and scalable systems by automating, measuring, and monitoring everything. Implement highly available and scalable architectures for core and third-party components of Acquia Source. Implement metrics, monitoring, and incident response processes.

Acquia is an open source digital experience company providing technology to brands that allows them to embrace innovation and create customer moments that matter.

US

  • Ramp on AWS architecture, Terraform patterns, Kubernetes setup, CI/CD pipelines, and observability stack.
  • Take ownership of an infrastructure area: CI/CD pipelines, observability stack, Kubernetes platform, or AWS security/networking.
  • Shape infrastructure direction with design docs, RFC proposals, and mentoring engineering teams.

Bastion enables financial institutions and enterprises to issue regulated stablecoins, generate revenue on reserves, and expand their ecosystems.

$260,800–$365,100/yr
US

  • Work collaboratively on a team to build out Reddit’s multi-cloud compute infrastructure.
  • Contribute to the design, implementation, and operations for one of the largest sites in the world.
  • Write software to improve the compute infrastructure and analyze problems as Reddit scales.

Reddit is a community of communities built on shared interests, passion, and trust, and is home to the most open and authentic conversations on the internet.

Europe 5w PTO

Design, build, and maintain scalable, reliable services that power high-volume software solutions. Take end-to-end ownership of features across the software development lifecycle, including infrastructure, observability, and production operations. Write clean, production-grade code, focusing on maintainability, test coverage, and system resilience.

Rithum is the world’s most trusted commerce network, accelerating how brands, suppliers, and retailers work together to deliver seamless e-commerce experiences.

World Wide

  • Challenge advanced language models on realistic infrastructure and platform scenarios.
  • Verify architectural soundness and logical correctness, assess code quality and testing strategies.
  • Analyze performance bottlenecks and deployment risks, capture reproducible failure cases, and suggest improvements.

The company is hiring for a SWE Infrastructure Specialist. As a contractor, the employee will need to supply a secure computer and high-speed internet; company-sponsored benefits such as health insurance and PTO do not apply.

Canada 5w PTO

Design, implement, and evolve large-scale, cloud-native infrastructure supporting MariaDB's global SaaS platform. Lead reliability and scalability initiatives, driving automation and resilience through infrastructure-as-code and GitOps practices. Proactively identify and remediate systemic reliability issues, ensuring high service availability and performance across multi-cloud environments.

MariaDB is making a big impact on the world and is the backbone of applications used every day, including by 75% of Fortune 500 companies.

$160,000–$182,000/yr
US

  • Lead and mentor multiple teams across SRE, cloud infrastructure, and platform engineering functions.
  • Drive multi-team initiatives to deliver scalable, secure, and cost-efficient infrastructure leveraging AWS-native and serverless technologies.
  • Drive adoption of FinOps practices and partner with finance and product teams on budgeting and forecasting.

Model N is the leader in revenue optimization and compliance for pharmaceutical, medtech, and high-tech innovators. Model N is trusted by over 150 of the world’s leading companies across more than 120 countries.

US

Design, develop, and maintain resilient backend services handling critical user-facing functionality. Build and maintain reusable libraries, frameworks, and tooling. Partner with product and platform teams to design APIs and distributed system patterns that are reliable, scalable, and maintainable.

Jobgether is a Talent Matching Platform that partners with companies worldwide to efficiently connect top talent with the right opportunities through AI-driven job matching.

4w PTO

As a Senior Software Engineer, Enterprise Platform at Vanta, you will build and operate systems that power Vanta’s FedRAMP environments, including automated release, vulnerability remediation, and evidence generation pipelines that meet strict compliance timelines. You will also define and evolve Vanta’s production reliability framework, including SLOs, incident response patterns, observability standards, service catalog, metrics dashboards, and the Vanta SLA definition. You will identify and solve complex scalability and performance challenges, particularly related to service reliability and data throughput.

Vanta helps businesses earn and prove trust by empowering companies to practice better security and prove it with ease.

US

Lead and manage the Platform Engineering team, providing technical guidance and mentorship. Design, build, and evangelize Golden Paths and Service Scaffolding to reduce friction across the development lifecycle. Oversee the design, implementation, and maintenance of Shared DB Platforms, ensuring optimal performance, integrity, and security across the organization.

Founded in 2012, EasyPost is a YC unicorn whose mission is to make shipping simple for businesses from garage startups to the Fortune 500.