Jobs Similar to Senior AI Infrastructure & Platform Operations Engineer | TangerineFeed

Senior AI Infrastructure & Platform Operations Engineer

Mirantis 9 hours ago

Europe

Lead the investigation and resolution of complex infrastructure, networking, and platform-related incidents.
Provide technical leadership for Kubernetes platform operations and supporting infrastructure services.
Mentor and support AI Infrastructure & Platform Operations Engineers, sharing technical knowledge through documentation and training.

Linux Kubernetes Networking

20 jobs similar to Senior AI Infrastructure & Platform Operations Engineer

Jobs ranked by similarity.

AI Infrastructure & Platform Operations Engineer

Mirantis 9 hours ago

Europe

Monitor, operate, and support production AI infrastructure platforms.
Investigate and resolve infrastructure, networking, hardware, and platform-related incidents.
Collaborate with engineering teams, hardware vendors, and datacenter personnel to resolve technical issues.

Mirantis is the Kubernetes-native AI infrastructure company, enabling organizations to build and operate scalable, secure infrastructure for AI and data-intensive applications. The company is growing and invests heavily in AI infrastructure and platform services.

SRE

Fal 15 days ago

$180,000–$250,000/yr

US

Own and operate our Kubernetes infrastructure.
Build and maintain CI/CD pipelines and deployment infrastructure.
Leverage AI to automate analysis and resolution of production issues.

Fal is the generative media ecosystem powering the next generation of AI products. They build the infrastructure, tools, and model access that teams need to move from idea to production.

Platform Support Engineer (APAC)

Lightning AI 24 days ago

APAC

Partner directly with customer engineering teams running training and inference workloads in production.
Investigate failures involving distributed training, Kubernetes orchestration, GPU allocation, networking, and storage systems.
Identify recurring patterns across customer issues and drive long-term reliability improvements.

Lightning AI is the company behind PyTorch Lightning, building an end-to-end platform for developing, training, and deploying AI systems. They serve solo researchers, startups, and large enterprises, operating globally with offices in New York City, San Francisco, Seattle, and London.

Technical Support Engineer

Mirantis 29 days ago

Maintain the reliability and performance of customer environments remotely, supporting Mirantis OpenStack/k0s layers.
Diagnose and resolve system-level issues, requiring hands-on Linux administration experience.
Troubleshoot customer environments based on Linux, OpenStack, Kubernetes, networking, and other cloud technologies; detect, report, and resolve issues.

Mirantis helps enterprises move to the cloud on their terms, delivering a true cloud experience on any infrastructure, powered by Kubernetes. They serve many of the world’s leading enterprises and value openness, collaboration, risk-taking, and continuous growth.

Staff Site Reliability Engineer I EMEA

Remote 23 days ago

$188,550–$212,150/yr

Global Unlimited PTO

Own the technical direction of Remote's SRE/Platform domain.
Define and drive the reliability strategy across the platform.
Identify and lead AI enablement initiatives across the engineering organisation.

Remote is solving modern organizations’ biggest challenge – navigating global employment compliantly with ease. With our core values at heart and a future-focused work culture, our team works tirelessly on ambitious problems, asynchronously, around the world.

Principal GenAI Platform Engineer (US)

PointClickCare 15 days ago

$179,000–$199,000/yr

US

Design, build, and maintain the core infrastructure layer supporting GenAI products.
Implement secure access controls and authentication mechanisms integrated by default into the AI platform components.
Develop and manage observability, monitoring, and logging solutions for GenAI workloads and infrastructure.

PointClickCare is a healthcare technology company. This team will serve as the product owner for GenAI capabilities, closely integrated with key horizontal partners to ensure delivery of safe, scalable and high-impact AI Products.

Network Infrastructure Engineer

Render 11 days ago

US 4w PTO 14w maternity 14w paternity

Own Render's core network infrastructure across multiple data centers and cloud providers, shaping how networking evolves as Render rapidly scales.
Design and build customer-facing networking capabilities that give users greater flexibility in how their services connect and communicate, and how traffic is routed.
Investigate complex networking issues across the stack, from the kernel and data plane to distributed systems and edge networking.

Render is building a modern cloud platform for developers creating AI-native, full-stack, multi-service applications, eliminating the tradeoff between hyperscaler power and developer-friendliness. They are a diverse and talented team that values craft, velocity, and user experience.

Sr DevOps Engineer

Aylo 18 hours ago

Europe

Support and improve hybrid production infrastructure for 15+ development teams handling 100+ products, 10K+ domains, and billions of hits per day.
Architect and plan improvements of a multi-datacenter development environment, advocating for migration to automated, elastic infrastructures using cloud, Kubernetes, and serverless technologies.
Document processes, monitor performance metrics, promote CI/CD practices, and mentor junior DevOps engineers.

Aylo is a tech pioneer that offers world-class adult entertainment and games on safe, popular platforms. With an international team of dynamic innovators, the company focuses on trust-and-safety protocols and has offices in Montreal, Austin, and Nicosia.

Infrastructure Engineer (Observability)

Lightning AI 23 days ago

$180,000–$200,000/yr

US

Own and evolve a scalable observability platform spanning metrics, logs, traces, and events.
Design telemetry pipelines ingesting data from GPUs, CPUs, networking, containers, APIs, and BMC/Redfish.
Design and implement noise-resistant alerting systems to improve signal quality and reduce operational load.

Lightning AI builds an end-to-end platform for developing, training, and deploying AI systems, designed to take ideas from research to production with less friction. They combine developer-first software with cost-efficient, large-scale compute, serving solo researchers, startups, and large enterprises.

Manager, Platform Engineering

Natera 29 days ago

$138,700–$173,350/yr

US

Lead the architecture of a high-scale AWS environment optimized for AI workloads.
Manage and mentor a high-performing team of 8 engineers, providing technical leadership and career coaching.
Conduct user research with internal Natera developers to identify friction points.

Natera is a global leader in cell-free DNA (cfDNA) testing, dedicated to oncology, women’s health, and organ health. The Natera team consists of statisticians, geneticists, doctors, laboratory scientists, business professionals, software engineers, and many other professionals from world-class institutions.

Principal Support Engineer (L3, Edge Cloud)

Gcore 9 hours ago

Global

Act as the final escalation point for complex Cloud infrastructure issues, analyzing logs and metrics to identify root causes.
Own high-severity incidents, coordinate resolution with Engineering, DevOps, and SRE teams, and contribute to preventive actions.
Mentor L1 and L2 support engineers, create runbooks and SOPs, and collaborate with Product teams to reproduce issues.

Gcore provides infrastructure and software solutions for AI, cloud, network, and security, powering real-time communication, streaming, enterprise AI, and secure web applications. With 550+ professionals globally, they collaborate with partners like Intel, NVIDIA, Dell, and Equinix to support the digital ecosystem.

Staff Platform Engineer

Docker 1 day ago

Global 16w maternity 16w paternity

Lead the design and implementation of self-service platform infrastructure for provisioning, deployment, and observability across engineering teams.
Evolve multi-tenant EKS foundations toward better reliability, security, scale, and multi-region connectivity.
Set delivery standards using Terraform, GitOps, and progressive rollout, while improving SLOs and alerting on Grafana Cloud.

Docker is a developer tooling company trusted by over 20 million monthly users and 20 billion container image pulls. They are a globally distributed, remote-first team building tools that define how software gets built and delivered.

Senior Platform/MLOps Engineer

Bright Machines 27 days ago

$150,000–$170,000/yr

US

Design, implement, and maintain reliable, scalable, and secure infrastructure, applications, and tooling, with a focus on our ML/AI pipelines and workloads.
Write clean, maintainable code, and perform peer code reviews.
Write clear and concise documentation and engage in cross-team communication and knowledge sharing.

Bright Machines is a next-generation, AI-enabled manufacturer focused on data center infrastructure assembly operations. The company uses AI-based robotics and software to assemble AI infrastructure hardware products for hyperscalers and leading OEMs. It has fewer than 500 employees and a culture rooted in innovation and expertise.

Director of Engineering, Product Infrastructure and Release Engineering

Mechanical Orchard 22 days ago

Canada

Own the end-to-end infrastructure product vision, including installers, deployment tooling, reference architectures, and operational patterns.
Define and evolve a cohesive infrastructure roadmap aligned with Platform architecture, customer needs, and GTM strategy.
Partner closely with Product Leadership to balance near-term customer needs with long-term platform scalability and repeatability.

Mechanical Orchard is reinventing how the world’s most critical software gets modernized, focusing on system behavior to turn modernization into a repeatable process. They are an applied AI company challenging industry assumptions and prioritizing quality, rigor, and progress.

Platform Operations Lead

NexGen Cloud 25 days ago

Europe 5w PTO

Build and improve scalable infrastructure operations processes that support a growing cloud platform.
Enable customer-facing and operational teams with secure automation, diagnostics, tooling and clear workflows.
Reduce repeatable manual work by identifying operational pain points and turning them into automated or self-service solutions.

NexGen Cloud delivers on-demand and private GPU infrastructure to a wide array of customers. They're a tight-knit, fast-moving team working at the cutting edge of AI cloud infrastructure, equipping their people with AI at every level.

Senior Site Reliability Engineer

MZLA Technologies Corporation 1 day ago

US 5w PTO

Design and develop CI/CD systems for websites, services, and release workflows, and operate an EKS-based Kubernetes platform.
Diagnose and debug production incidents, drive root-cause analysis, and implement improvements to enhance system reliability.
Write and maintain infrastructure as code using Pulumi or Terraform/OpenTofu across multiple AWS accounts with security-conscious practices.

Thunderbird is one of the world’s most trusted open-source email applications, empowering more than 20 million people globally. Our small but growing distributed team includes 65+ people across seven countries, and we build privacy-respecting communication tools with a collaborative, inclusive, and user-first spirit.

Staff Software Engineer

Zeta Global 12 days ago

$160,000–$180,000/yr

US Unlimited PTO

Identify systemic engineering challenges across our platforms and drive their resolution.
Write code, review PRs, debug production issues, and optimize system performance.
Partner with engineering teams as a technical point of contact on complex projects.

Zeta Global is an AI-Powered Marketing Cloud that leverages advanced artificial intelligence (AI) and trillions of consumer signals to help marketers acquire, grow, and retain customers more efficiently. They were founded in 2007 and are headquartered in New York City with offices around the world.

Senior Site Reliability Engineer

Finom 13 days ago

Europe

Design and operate our Kubernetes ecosystem with a focus on high availability and zero-downtime operations.
Own and evolve our PaaS strategy, using GitOps and CI/CD to empower domain teams to deploy independently.
Define and implement our observability strategy across metrics, logs, and tracing.

Finom is a European tech startup headquartered in Amsterdam, revolutionizing financial services for entrepreneurs. They offer an all-in-one financial B2B solution integrating banking, accounting, financial management, and invoicing into a mobile-first platform, with about 346 million in funding.

Senior AIOps Engineer, Incident Response

Quanata 12 days ago

$215,000–$280,000/yr

US 4w PTO 12w maternity 12w paternity

Own production health, reliability, and operational support processes across critical systems and services.
Lead incident response efforts, stakeholder communication, root cause analysis, and post-incident reviews.
Design and implement AI-driven agents and workflows that automate support and operational tasks.

Quanata is on a mission to help ensure a better world through context-based insurance solutions. They are an exceptional, customer-centered team with a passion for creating innovative technologies, digital products, and brands. Quanata, LLC is wholly owned and funded by State Farm.

Engineering Tech Lead

VCluster Labs 25 days ago

Global

Drive the architecture for how vNode wraps containerd, integrates with the kubelet, and exposes safe isolation primitives.
Lead the work where vNode meets containerd, Kata Containers, gVisor, runc, and the kernel.
Own how vNode plugs into the node lifecycle: CRI, kubelet device plugins, cgroups v2, eviction.

VCluster Labs is pioneering Kubernetes virtualization for the AI era. They are a venture-backed tech startup in a hyper-growth phase with a remote-first work culture.
