- Lead infrastructure resiliency efforts including recovery mechanisms, tenant isolation, and load spike handling
- Improve observability and operability of systems
- Build performance-critical, user-facing infrastructure like real-time event processing
Shape and scale critical infrastructure for one of the largest online platforms in the world. Build, maintain, and optimize multi-cloud compute systems for high-performance, reliable, and secure operations. Influence the technical direction of infrastructure platforms while mentoring and guiding other engineers.
This position is posted by Jobgether on behalf of a partner company.
VGS is the world's leader in payment tokenization, empowering clients and partners by tokenizing sensitive payment data and limiting compliance scope. They embed a universal token vault into the technology stack to manage the complexities of payment data tokenization across processors, networks, and more. The job posting doesn't specify company size, but the company appears to have a culture that values transparency, collaboration, grit, and humility.
Canva's intuitive suite of design products is powered by our large distributed infrastructure group, which sets large and ambitious goals.
Endor Labs is building the Application Security platform for the software development revolution, helping teams identify, prioritize, and fix critical risks faster.
Jobgether uses an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Their system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company.
Design, implement, monitor, and maintain Sysdig's infrastructure at scale across multiple clouds and on-prem. Collaborate with development teams to improve system reliability, performance, and scalability. Participate in on-call rotation, respond to incidents, conduct root cause analyses, and implement preventive measures.
Sysdig helps organizations secure innovation in the cloud with runtime insights, open innovation, and agentic AI, trusted by over 60% of the Fortune 500.
As an SRE, you will be responsible for ensuring the availability, performance, and cost-effectiveness of these services. You will work with multiple feature development teams and the BAU/Support team to define and evolve our cloud and on-prem infrastructure and delivery pipelines, improve system observability, and proactively identify and mitigate reliability risks.
TwinStream was formed in 2019, when our founders were working as engineers solving complex cross-domain problems within government organisations.
Vultr is on a mission to make high-performance cloud infrastructure easy to use, affordable, and locally accessible for enterprises and AI innovators around the world.
NMI gives partners choice in payments, challenging the one-size-fits-all approach. They power innovative tech for SMBs, entrepreneurs, and fintech startups, fostering a diverse and welcoming workplace with a dedicated Diversity, Equity & Inclusion action group.
Manage and resolve high-impact customer escalations for enterprise products and services. Act as a technical liaison between engineering and support teams to drive rapid issue resolution. Debug and troubleshoot complex problems in cloud environments and operating systems (Linux/Unix).
Zscaler accelerates digital transformation so our customers can be more agile, efficient, resilient, and secure.
Seeking an experienced Site Reliability Engineer to help build highly resilient and scalable systems by automating, measuring, and monitoring everything. Implement highly available and scalable architectures for core and third-party components of Acquia Source. Implement metrics, monitoring, and incident response processes.
Acquia is an open source digital experience company providing technology to brands that allows them to embrace innovation and create customer moments that matter.
Bastion enables financial institutions and enterprises to issue regulated stablecoins, generate revenue on reserves, and expand their ecosystems.
Reddit is a community of communities built on shared interests, passion, and trust, and is home to the most open and authentic conversations on the internet.
Design, build, and maintain scalable, reliable services that power high-volume software solutions. Take ownership of features from end-to-end across the software development lifecycle, including infrastructure, observability, and production operations. Write clean, production-grade code, focusing on maintainability, test coverage, and system resilience.
Rithum is the world’s most trusted commerce network, accelerating how brands, suppliers, and retailers work together to deliver seamless e-commerce experiences.
The company is hiring for a SWE Infrastructure Specialist. As a contractor, you will need to supply a secure computer and high-speed internet; company-sponsored benefits such as health insurance and PTO do not apply.
Design, implement, and evolve large-scale, cloud-native infrastructure supporting MariaDB's global SaaS platform. Lead reliability and scalability initiatives, driving automation and resilience through infrastructure-as-code and GitOps practices. Proactively identify and remediate systemic reliability issues, ensuring high service availability and performance across multi-cloud environments.
MariaDB is making a big impact on the world and is the backbone of applications used every day, including at 75% of Fortune 500 companies.
Model N is the leader in revenue optimization and compliance for pharmaceutical, medtech, and high-tech innovators. Model N is trusted by over 150 of the world’s leading companies across more than 120 countries.
Design, develop, and maintain resilient backend services handling critical user-facing functionality. Build and maintain reusable libraries, frameworks, and tooling. Partner with product and platform teams to design APIs and distributed system patterns that are reliable, scalable, and maintainable.
Jobgether is a Talent Matching Platform that partners with companies worldwide to efficiently connect top talent with the right opportunities through AI-driven job matching.
As a Senior Software Engineer, Enterprise Platform at Vanta, you will build and operate systems that power Vanta’s FedRAMP environments, including automated release, vulnerability remediation, and evidence generation pipelines that meet strict compliance timelines. You will also define and evolve Vanta’s production reliability framework, including SLOs, incident response patterns, observability standards, service catalog, metrics dashboards, and the Vanta SLA definition. You will identify and solve complex scalability and performance challenges, particularly related to service reliability and data throughput.
Vanta helps businesses earn and prove trust by empowering companies to practice better security and prove it with ease.
Lead and manage the Platform Engineering team, providing technical guidance and mentorship. Design, build, and evangelize Golden Paths and Service Scaffolding to reduce friction across the development lifecycle. Oversee the design, implementation, and maintenance of Shared DB Platforms, ensuring optimal performance, integrity, and security across the organization.
Founded in 2012, EasyPost is a YC unicorn whose mission is to make shipping simple for businesses from garage startups to the Fortune 500.