Serve as Incident Commander, leading real-time response efforts, managing communication across teams, triaging issues, and driving resolution of high-priority incidents.
Execute documented runbooks for troubleshooting and resolving production incidents involving AWS services and Kubernetes clusters.
Collaborate post-incident with engineering teams, performing root cause analysis, documenting lessons learned, and driving the implementation of durable solutions.
Own the strategy and execution for Runtime Platform.
Set the technical direction, build and develop the team, and be accountable for outcomes.
Translate product needs into platform capabilities and build trust through consistent delivery.
Wealthsimple aims to help everyone achieve financial freedom by reimagining how people manage their money. As the largest fintech company in Canada, it has more than 3 million users and manages more than $100 billion in assets, fostering inclusive and high-performing teams.
Act as the single point of contact for critical customer escalations.
Lead internal war rooms, coordinating resources across teams.
Drive Root Cause Analysis processes, translating findings into actionable steps.
Netomi is the leading agentic AI platform for enterprise customer experience. They work with the largest global brands and are backed by WndrCo, Y Combinator, and Index Ventures, helping enterprises drive efficiency, lower costs, and deliver higher quality customer experiences.
Architect and maintain scalable, reliable infrastructure: Design and optimize infrastructure for high availability, fault tolerance, and performance across distributed systems.
Lead incident management and root cause analysis: Own incident response processes, ensure swift resolution of issues, and drive post-incident improvements to prevent recurrences.
Service monitoring and automation: Build and maintain automated monitoring, alerting, and healing systems that improve system health, reduce manual intervention, and minimize downtime.
VGS is the world's leader in payment tokenization, empowering clients and partners by tokenizing sensitive payment data and limiting compliance scope. They embed a universal token vault into their technology stack to manage the complexities of payment data tokenization across processors, networks, and more. Their culture appears to value transparency, collaboration, grit, and humility.
Own and maintain the incident response process, including defining procedures, tools, and best practices
Guide teams in establishing and monitoring Service Level Objectives (SLOs), including setting up alerts and reporting systems
Lead capacity planning initiatives, focusing on both short and long-term scalability while optimizing costs
Underdog makes sports more fun by building the best products for sports fans. They are a fast-growing sports company valued at $1.3B with a focus on a seamless, simple, easy to use, intuitive and fun app.
Serve as the primary escalation point for L1/L2 support on high-impact production issues.
Analyze and resolve problems across endpoints (Windows, macOS, mobile devices), networking, cloud services, and critical enterprise applications.
Perform detailed root cause analysis and develop preventative actions.
ShipBob is a leading global supply chain and fulfillment technology platform designed for SMB and Mid-Market ecommerce merchants, offering best-in-class capabilities and shopper experiences. Backed by leading investors and headquartered in Chicago, it is one of the fastest-growing tech companies.
Ensure the smooth operation and high availability of Clarifai's core services
Monitor system performance, identify bottlenecks, and implement optimizations to enhance reliability and efficiency
Design and implement scalable, secure, and cost-effective infrastructure solutions
Clarifai is a leading AI platform specializing in computer vision and generative AI, empowering organizations to transform unstructured data into actionable insights. Founded in 2013, they have a diverse, globally distributed team with $100M in funding and are committed to building a diverse and inclusive team.
Lead in addressing crisis situations in an operational environment.
Support communication with top management and clients regarding incident resolution.
Contribute to development of strategies to prevent incident occurrences.
Deutsche Telekom IT Solutions is a subsidiary of the Deutsche Telekom Group, recognized as Hungary’s most attractive employer in 2025. They provide IT and telecommunications services with over 5300 employees, serving hundreds of large customers in Germany and Europe, and are known for ethical practices and educational cooperation.
Lead and mentor multiple teams across SRE, cloud infrastructure, and platform engineering functions.
Drive multi-team initiatives to deliver scalable, secure, and cost-efficient infrastructure leveraging AWS-native and serverless technologies.
Drive adoption of FinOps practices and partner with finance and product teams on budgeting and forecasting.
Model N is the leader in revenue optimization and compliance for pharmaceutical, medtech, and high-tech innovators. Model N is trusted by over 150 of the world’s leading companies across more than 120 countries.
Design and implement highly scalable infrastructure for GitLab.com to support current and future growth.
Collaborate with cross-functional teams across the Infrastructure organization to plan and deliver projects that shape GitLab’s platform direction.
Operate and improve edge services and Kubernetes workloads, acting as a subject matter expert within the infrastructure department.
GitLab is an open-core software company that develops the most comprehensive AI-powered DevSecOps Platform, used by more than 100,000 organizations. They aim to enable everyone to contribute to and co-create the software that powers our world.
Lead infrastructure resiliency efforts including recovery mechanisms, tenant isolation, and load spike handling
Improve observability and operability of systems
Build performance-critical, user-facing infrastructure like real-time event processing
Jobgether is a platform that uses an AI-powered matching process to ensure applications are reviewed quickly, objectively, and fairly against a role's core requirements. They identify top-fitting candidates and share this shortlist directly with the hiring company.
Oversee the reliability, scalability, performance, and security of key production services.
Collaborate with cross-functional teams to develop and maintain resilient infrastructure.
Provide expert mentorship and guidance on best practices to engineers throughout the organization.
Cision is a global leader in PR, marketing and social media management technology and intelligence, helping brands and organizations connect with customers and stakeholders to drive business results. The company has offices in 24 countries throughout the Americas, EMEA and APAC.
Hire, lead, and support a high-performing Infrastructure Platforms team.
Connect business goals and customer needs with sound engineering.
Guide the security, reliability, performance, and scalability of core platform components.
GitLab is an open-core software company that develops the most comprehensive AI-powered DevSecOps Platform, used by more than 100,000 organizations. Their mission is to enable everyone to contribute to and co-create the software that powers our world.
Lead the Reliability & Operations function within the Developer & Production Enablement (DPE) division of RWS’s Product & Technology organization. Take ownership of global production operations and lead the transition from manual, ticket-based workflows to platform-integrated automation. Ensure stability today, while designing for scalability and autonomy in the future.
RWS's purpose is to unlock global understanding, valuing every language and culture, and celebrating diversity and inclusion to make the company strong.