Source Job

$150,000–$210,000/yr
US

  • Own the reliability, performance, and operational health of production AI services
  • Refactor and harden existing systems to improve resilience, clarity, and maintainability
  • Diagnose and resolve issues across distributed services, data pipelines, and storage layers

Python Java Scala Kubernetes

20 jobs similar to Sr. Software Engineer-AI Reliability

Jobs ranked by similarity.

$111,888–$128,633/yr
Canada US

  • Design and build production-grade AI systems, including RAG pipelines, multi-step agents, and LLM-powered features.
  • Build comprehensive evaluation and observability frameworks to measure model accuracy, grounding, and quality drift.
  • Create production-quality Python services to wrap AI logic into secure microservices.

League, founded in 2014, is the leading healthcare consumer experience (CX) platform powered by AI, reaching over 63 million people globally. Payers, providers, and consumer health partners use League’s platform to deliver high-engagement healthcare solutions and improve health outcomes.

US

  • Build and scale the AI Agent platform.
  • Design and implement APIs, services, and infrastructure.
  • Prototype rapidly and continuously improve system performance.

Podium brings AI Employees to local businesses that turn every conversation into revenue. Trusted by 60,000+ businesses, they have crossed $100M in AI Agent ARR, scaling 300% year-over-year and empowering real business outcomes for their customers.

US

  • You will define, build, and evolve foundational systems that enable autonomous agents to operate reliably in production.
  • You’ll explore new approaches, prototype quickly, and turn what works into durable platform foundations.
  • You’ll identify high-leverage architectural improvements, abstractions, and guardrails that expand what the platform can do while keeping it reliable, secure, observable, and maintainable under real-world conditions.

Kindo is an agent automation platform for DevOps and SecOps teams, helping organizations automate high-friction operational work using autonomous agents. They are a small, highly technical team with strong customer traction and real enterprise revenue, where engineers have direct ownership over critical systems.

US

  • Responsible for building clean, scalable, and reliable solutions that support data-driven and AI‑enabled environments.
  • Developing well‑tested Python applications and working within cloud-native ecosystems.
  • Collaborating effectively across engineering, data, and scientific teams, turning complex requirements into production-ready solutions.

Onebridge, a Marlabs Company, is a global AI and Data Analytics Consulting Firm that empowers organizations worldwide to drive better outcomes through data and technology. Since 2005, we have partnered with some of the largest healthcare, life sciences, financial services, and government entities across the globe.

Europe

  • Design, develop, and maintain high-quality software solutions using Python.
  • Contribute to the design and evolution of scalable and maintainable software architectures.
  • Deploy, operate, and monitor applications in cloud environments (AWS, Azure, or GCP).

Lynx Analytics works on real-world AI and advanced analytics solutions with measurable business impact. They have a collaborative culture that values real outcomes, offering high ownership and rapid learning opportunities.

$225,000–$315,000/yr
US 20w maternity 12w paternity

  • Architect and optimize distributed training and inference systems for large-scale AI models
  • Design and deliver customer-focused solutions that maximize performance and business value
  • Lead the transition of ML pipelines from POC to scalable production systems

The company offers an AI-centric cloud platform reshaping the landscape of artificial intelligence. They provide infrastructure, tools, and services for developers to service the explosive growth of the global AI industry, catering to Fortune 1000 companies, startups, and AI researchers.

EMEA

  • Design and implement tooling that enables researchers to quickly deploy and evaluate new models in production
  • Design, build, and maintain high-performance, cost-efficient inference pipelines, making architectural decisions about scaling, reliability, and cost trade-offs
  • Proactively identify and resolve infrastructure bottlenecks, proposing and scoping improvements to iteration speed and production reliability

AssemblyAI builds best-in-class Speech AI models that power the next generation of voice applications. They are a remote team building one of the next great AI companies where teammates define and build their company culture.

Global

  • Contribute to the development of the Everywhere Inference platform, a Kubernetes-based solution.
  • Design and implement APIs and developer tools to simplify deployment, management, and monitoring of AI applications.
  • Optimize serverless container workflows for AI workloads, ensuring performance, scalability, and seamless autoscaling.

Gcore provides infrastructure and software solutions for AI, cloud, network, and security. They have 550+ professionals globally and collaborate with technology partners such as Intel, NVIDIA, Dell, and Equinix.

North America 4w PTO

  • Partner with stakeholders to tackle technical problems at scale, building framework-agnostic services.
  • Establish roadmap and architecture for Wealthsimple’s Machine Learning platform.
  • Build highly performant, scalable systems, contributing to our ML platform on Kubernetes, Bedrock, and SageMaker.

Wealthsimple aims to provide financial freedom by making financial services transparent and low-cost. As the largest fintech company in Canada, with over 1,500 employees, they manage over $100 billion in assets and foster a collaborative and quality-focused culture.

Global

  • Design and build the infrastructure layer powering AI agent systems in production
  • Develop high-performance Rust services that handle model inference, orchestration, and execution
  • Architect scalable systems capable of supporting millions of users and high request throughput

Kraken is a mission-focused company rooted in crypto values, aiming to accelerate the global adoption of crypto so that everyone can achieve financial freedom and inclusion. As a fully remote company, Kraken has employees in 70+ countries and is committed to industry-leading security, crypto education, and client support.

Nigeria

  • Detect and triage service and reliability issues.
  • Develop automation to eliminate manual and repetitive operational tasks.
  • Investigate and resolve customer complaints escalated beyond L1 and L2 support.

Moniepoint is an all-in-one financial services platform for emerging markets. Since 2019, Moniepoint’s technology has served over 3 million people, offering personal and business banking, payment, credit, and business management tools to help them succeed.

US

  • Design and implement APIs, data pipelines, and simulation runtime logic for mission applications.
  • Develop software using modern programming languages such as Java, Python, C++, or TypeScript/Angular.
  • Build and integrate modular microservices for improved scalability and maintainability.

They deliver advanced technology solutions, integrating people and processes to tackle complex challenges effectively. The company has a collaborative and supportive team culture.

$161,925–$227,325/yr
US

  • Lead projects end to end and contribute to impactful platform initiatives.
  • Partner with engineers, scientists, product managers and business teams to identify opportunities.
  • Design and ship components of a new platform architecture to enable multi-tenancy and scaling.

Freenome is working to detect cancer in its earliest, most treatable stages using a routine blood draw. Freenome is an equal-opportunity employer that values diversity and does not discriminate.

Canada 4w PTO

  • Improve the scalability and reliability of our core data systems.
  • Define and evolve how we model, store, and query resource data across Vanta.
  • Collaborate with product, design, and other engineering teams to understand user needs.

Vanta helps businesses earn and prove trust by providing continuous security monitoring and verification. They empower companies to practice better security and prove it with ease. Vanta has a kind and talented team, many of whom have succeeded without extensive prior security experience.

US

  • Define and evolve the technical vision for AI and agentic systems across products.
  • Design orchestration, data, and serving patterns that handle global scale with reliability.
  • Collaborate with AI Research to turn prototypes into extensible, governed production frameworks.

KnowBe4 is a cybersecurity company that puts security first, empowering over 70,000 organizations worldwide to strengthen their security culture. They value radical transparency, extreme ownership, and continuous professional development in a welcoming workplace that encourages all employees to be themselves.

Global

  • Co-create evaluation frameworks, proctoring solutions, and critical security mechanisms.
  • Evaluate and implement state-of-the-art AI/ML techniques.
  • Design, build, and deploy scalable AI services and pipelines.

The company develops and scales a global Certification-as-a-Service platform that automates the entire lifecycle of professional exams. The solution enables companies and organizations to quickly and inexpensively create and administer official online exams/certifications.

US

  • You will design, build, and operate core systems that enable autonomous agents to function reliably in production.
  • You’ll build production-grade agentic workflows, retrieval and memory systems, multi-model execution, and tool-calling integrations that interact safely with enterprise systems.
  • You’ll explore new approaches, prototype quickly, and turn what works into durable production systems.

Kindo is an agent automation platform for DevOps and SecOps teams. They help organizations automate high-friction operational work using autonomous agents. Kindo is a small, highly technical team with strong customer traction and real enterprise revenue.

$170,000–$240,000/yr
US Unlimited PTO

  • Own SentiLink’s real-time ML model monitoring domain.
  • Own our ML experimentation, model tracking, and versioning infrastructure.
  • Drive improvements to the model development process.

SentiLink provides identity and risk solutions for secure transactions. They are backed by investors like Craft Ventures and Andreessen Horowitz, recognized by Forbes Fintech 50, and have offices across the U.S. and India.

$130,000–$160,000/yr
US

  • Learn and build expertise across several software engineering disciplines.
  • Solve challenging Airflow problems for our customers.
  • Spend up to 25% of your time on side projects that contribute to Astronomer’s overall success.

Astronomer empowers data teams to bring mission-critical software, analytics, and AI to life and is behind Astro, the industry-leading unified DataOps platform powered by Apache Airflow®. They are trusted by more than 800 of the world's leading enterprises, letting businesses do more with their data.

Global

  • Architect and ship new backend capabilities that integrate AI-adjacent functionality into Kraken’s core systems.
  • Design distributed services that meet high standards for reliability, performance, and correctness.
  • Own end-to-end technical design, from protocol and service boundaries through production deployment.

Kraken is a mission-focused company rooted in crypto values. It aims to accelerate the global adoption of crypto, so that everyone can achieve financial freedom and inclusion. As a fully remote company, Kraken has Krakenites in 70+ countries who speak over 50 languages.