Source Job

US

  • Own agent quality end-to-end: diagnosis, improvement, and validation across SmartAssist's orchestrator and subagents
  • Drive quality improvements through prompt engineering, context engineering, and RAG retrieval tuning
  • Extend and mature our evaluation framework: scorers, golden datasets, regression gates, and online evaluation for production traffic

Python LLM RAG MLflow CI

20 jobs similar to Senior Software Engineer II - Applied AI and Evaluations

Jobs ranked by similarity.

$88,911–$117,952/yr
Canada

  • Build AI-Powered Features: Design, develop, and deploy production-grade AI applications from concept to deployment.
  • Architect Scalable Systems: Create robust backend architectures that support AI workloads, ensuring low latency and high reliability.
  • Drive AI Innovation: Implement and optimize agentic AI systems, RAG pipelines, and multi-agent workflows using modern LLM frameworks.

Procurify is the AI-enhanced procurement and AP automation platform for mid-market organizations, making it easy for organizations to take control of spending and save money. It is a remote-first company with a big heart and a strong ambition to modernize how organizations manage business spend.

  • Design, implement, and evaluate machine learning models and AI algorithms.
  • Develop and optimize prompts for LLMs to improve model outputs.
  • Collaborate with software engineers, data scientists, and product teams.

Cadre AI is focused on building and optimizing AI-powered platforms, bringing together cutting-edge technologies and expertise in machine learning and large language models. The team is dedicated to advancing AI capabilities and applying them to real-world challenges through scalable, high-impact solutions.

Global

  • Design and develop an AI-powered productivity analytics platform.
  • Build scalable LLM pipelines and create a meta-workflow system.
  • Develop system-level prompt engineering and build an evaluation framework for AI output quality control.

Appflame is a Ukrainian product-driven tech company committed to building world-class products. They have 500+ team members and offices in Kyiv, London, Limassol, and a co-working hub in Warsaw; they value bold, driven people who are passionate about building real products.

Europe

  • Own the architecture and delivery of production-grade LLM systems and classical ML solutions.
  • Design, evaluate, and optimize RAG pipelines (retrieval strategy, chunking, indexing, monitoring).
  • Build scalable, production-grade LLM services and agentic workflows, alongside traditional ML systems where appropriate.

Hiflylabs is a team of 250+ data and tech enthusiasts based in Budapest. They focus on data engineering, data science, artificial intelligence and application development, working on a wide range of projects around the world. Hiflylabs values its people and is committed to nurturing their personal and professional development through a mentoring system.

$130,000–$170,000/yr
US

  • Design AI integration patterns and architecture standards across the SaaS platform
  • Integrate LLM APIs (OpenAI, Anthropic, AWS Bedrock) into production features
  • Establish model evaluation, benchmarking, and observability processes

PerfectServe offers Best in KLAS clinical communication and physician scheduling solutions and is a Leader in the Gartner Magic Quadrant for Clinical Communication and Collaboration. They have seen an 88% growth rate over the past three years.

$179,000–$199,000/yr
US

  • Set the technical vision and reference architecture for agentic AI across applications.
  • Build and govern reusable platform components to accelerate adoption across teams.
  • Drive cross-functional roadmaps and integration standards across OCIO and business teams.

PointClickCare helps providers deliver exceptional care. They are a leading health tech company that’s founder-led and privately held, empowering their employees to push boundaries, innovate, and shape the future of healthcare.

Europe

  • Build and ship AI-powered product and internal solutions using LLMs, RAG, tool calling, workflows, and agentic patterns
  • Design quality and evaluation frameworks for AI systems, including offline evals, online signals, failure analysis, and continuous improvement loops
  • Contribute to AI platform and tooling decisions that improve reuse, speed, and consistency across teams

Finom is a European tech startup headquartered in Amsterdam, revolutionizing the financial landscape for entrepreneurs. They develop an all-in-one financial B2B solution integrating banking, accounting, financial management, and invoicing into a mobile-first platform and nurture innovation in an inspiring work environment.

$35–$50/hr
Global

  • Design and implement LLM-powered application workflows
  • Architect retrieval-augmented generation pipelines
  • Collaborate with backend architects to integrate AI services into APIs

They are seeking a hands-on AI Engineer with deep expertise in Large Language Model integration and production AI systems. The company's culture sounds innovative and collaborative, focusing on building scalable and secure AI applications.

$73,000–$104,390/yr
North America 3w PTO 1w paternity

  • Design and build the evaluation infrastructure that ensures the platform's AI systems produce accurate, well-sourced, high-quality responses
  • Build automated test suites that validate answer quality across agent pipeline changes
  • Develop regression detection systems that catch quality degradation before it reaches users

IDC is building the next generation of AI-powered intelligence platforms that transform how technology decisions get made. As the premier global provider of trusted technology intelligence, IDC equips business and technology leaders with the evidence they need to make confident decisions.

US

  • Lead the reference architecture for AI and agentic systems within key product domains.
  • Design and deliver production-ready AI systems, ensuring reliability, scalability, observability, performance, and cost efficiency across the full lifecycle.
  • Establish platform-level AI and data components, including model broker and multi-model routing strategies.

PointClickCare is a leading health tech company that’s founder-led and privately held; they empower their employees to push boundaries, innovate, and shape the future of healthcare. With the largest long-term and post-acute care dataset and a Marketplace of 400+ integrated partners, their platform serves over 30,000 provider organizations, and they reinvest a significant percentage of their revenue back into research and development.

US

  • Develop, test, and deploy LLM-powered extraction pipelines for clinical text at scale.
  • Automate prompt execution, result validation, and error handling to enhance reliability.
  • Monitor and maintain production AI models, ensuring uptime, accuracy, and compliance.

iCIMS is a software company. The job posting mentions thriving in a start-up environment.

$160,000–$190,000/yr
US Unlimited PTO

  • Design, build, and deploy production AI agents and multi-agent orchestration systems.
  • Architect RAG pipelines with vector search and knowledge base management for AI-driven support.
  • Build production microservices and APIs serving as orchestration layers for AI agent systems.

Greenlight is a family fintech company helping parents raise financially smart kids. They serve over 6 million parents and kids with their banking app, aiming to ensure every child has the opportunity to become financially healthy and happy.

US

  • Lead Rula’s applied AI investments as they scale.
  • Own technical direction for high-impact AI products and work across teams to turn big ideas into shipped systems.
  • Help raise the bar for how they build, evaluate, and operate AI in production.

Rula strives to create a world where mental health is no longer stigmatized and provides quality, evidence-based care. They are a remote-first company that is dedicated to treating the whole person, not just the symptoms, and making a positive impact in the field of mental healthcare.

5w PTO

  • Build and ship agentic features across the RevenueCat universe
  • Design and implement tool integrations that expand what agents can see and do
  • Own the reliability and quality of agent responses

RevenueCat is a monetization platform for mobile apps, helping developers understand and grow their revenue by removing the headaches of building and scaling in-app subscriptions. They are a remote-first company of 120+ employees across 25 countries, valuing customer obsession and continuous improvement.

$80,000–$400,000/yr
US

  • Shape best practices and mentor team members.
  • Work on varied projects and influence technologies and solutions.
  • Identify and experiment with new approaches, technologies, or tools.

Qvest US is a global leader in technology and business consulting for the Media & Entertainment and Consumer Packaged Goods & Retail industries. They strategize, advise, design, develop, and implement future-forward business & technology solutions. Qvest US is currently 300+ people strong and they’ve been recognized as a “Best Place to Work,” a “Great Place to Work,” “Fastest Growing,” and “A Jewel.”

Asia

  • Design and build AI-powered developer tools that improve engineering efficiency across the company
  • Develop and deploy LLM-based applications to streamline internal workflows
  • Build and optimize AI-assisted CI/CD pipelines, including intelligent test selection, build failure prediction, and deployment risk assessment

Binance is focused on leveraging AI/LLM technologies to fundamentally improve how its engineers build, test, deploy, and operate software. They are looking for people to join their team to design and implement AI-powered tools and systems to accelerate development velocity across the organization.

$140,000–$160,000/yr
US 4w PTO

  • Design and build agentic AI systems and RAG pipelines for production features across the marketplace.
  • Integrate LLMs into product experiences across search, categorization, communication, and trust & safety.
  • Partner with Data Scientists and Engineers to turn research into shipped products.

OfferUp is dedicated to creating the simplest and most trusted way for people to buy, sell, and connect in their local communities. OfferUp was used by more than 1 in 6 adults in the U.S. in 2024.

LATAM

  • Design and implement scalable ML infrastructure to support model development and deployment
  • Develop and maintain evaluation frameworks for Large Language Models (LLMs), including RAG-based systems
  • Evaluate model performance using tools such as RAGAS, DeepEval, or similar frameworks

EX Squared LATAM collaborates with global clients to build innovative digital solutions that drive real business impact. They foster a collaborative, inclusive, and innovation-driven culture where continuous learning and professional growth are at the core of everything they do.

Global

  • Architect and build agentic workflows that combine large language models, reasoning components, and data pipelines to create adaptive, goal-driven conversational systems
  • Lead the design and development of advanced ML/NLP products, from ideation to production, including model training, evaluation, optimization, and deployment
  • Drive experimentation with new approaches for agentic reasoning, coordination, and autonomous system design

SmartRecruiters is the Recruiting AI Company that transforms hiring for the world’s leading enterprises. Built for global scale, SmartRecruiters, an SAP company, delivers an AI-powered hiring platform that automates and optimizes the entire talent acquisition process, ensuring faster and smarter hiring decisions. They are a values-driven, globally focused tech company with strong financial backing and a bold vision for the future of work.

Europe

  • Lead Agent Development: Drive the development of Owkin’s Data Transformation Agent (DTA).
  • Orchestrate Data Workflows: Design, implement, and maintain complex data transformation workflows.
  • Ensure Code Excellence: Define and enforce robust engineering practices.

Owkin is an AI company on a mission to solve the complexity of biology. They are building the first Biology Super Intelligence (BASI) by combining powerful biological large language models, multimodal patient data, and agentic software.