Create and curate an evaluation suite of real-world tasks for frontier AI models.
Rigorously evaluate AI systems, analyze results, and communicate findings.
Improve evaluation processes and potentially build out standalone benchmarks.
Epoch AI is a research institute that investigates trends in machine learning and the economic consequences of AI. Our mission is to develop a comprehensive, publicly accessible knowledge base on AI that informs policymakers, industry leaders, and society at large.
Identify and prototype high-leverage AI opportunities across engineering, GTM, and operations to improve revenue, efficiency, and quality.
Partner with functional owners to build internal workflow automations, prompt systems, and AI-enabled tools using approved platforms like Claude Code and the Anthropic API.
Document and share prototypes through demos and showcases, setting the standard for responsible, visible AI building.
Chainguard delivers hardened, secure builds of open source software for enterprises. Backed by leading investors, it serves Fortune 500 clients including OpenAI and Snap, and fosters a values-driven remote culture focused on security and innovation.
Evaluate and select cutting-edge AI models to enhance product capabilities and user experience.
Design evaluation frameworks and configure observability for AI performance in production.
Collaborate with data science, CTO, and engineering teams to fine-tune and integrate AI models.
Vetcove modernizes veterinary software and pet healthcare with a procurement marketplace, home delivery ecommerce, and practice management system. Over 25,000 hospitals across all 50 states use the platform daily, and the company is backed by Y Combinator and top venture investors.
Design and develop coding benchmarks used to evaluate frontier AI models.
Analyze AI-generated code for correctness, reliability, efficiency, and edge cases.
Build and maintain scalable data pipelines that support AI evaluation workflows.
The enterprise client is a leading AI platform that enables organizations to build intelligent applications through high-quality human feedback, AI evaluation, and model alignment. Selected consultants will work on improving frontier AI models; company size and culture details are not specified.
Build AI-powered tools into engineers' day-to-day workflows (e.g., Claude, VS Code, GitLab, Datadog, internal documentation, Slack chatops).
Implement and evolve inference and tool-calling pathways using Claude models on Amazon Bedrock, LiteLLM, and MCP/tool gateways within Omada's secure networks.
Partner with teams across Engineering to define AI-augmented SDLC patterns across planning, coding, testing, and operations.
Omada Health is a virtual-first healthcare and technology company that combines human-led care teams, connected devices, and AI-enabled technology to deliver personalized care at scale, focusing on chronic conditions like obesity, diabetes, and hypertension. They have served more than two million members since launch across 2,000+ employers, health plans, pharmacy benefit managers, and health systems, and are certified as a Great Place to Work.
Build automated vendor intelligence pipelines that continuously collect and parse AI system cards, model benchmarks, security disclosures, and public vendor documentation.
Design synthesis systems that map disparate vendor information to our risk taxonomy, translating technical capabilities into governance-relevant risk signals.
Implement quality evaluation for generated risk profiles and create adaptive interpretation systems that adjust risk assessments based on organizational context.
Credo AI is a venture-backed company on a mission to empower organizations to responsibly build, adopt, procure and use AI at scale. Founded in 2020, Credo AI has been recognized as a Most Innovative Company of 2024 by Fast Company and a Technology Pioneer by the World Economic Forum.
Partner with account executives to articulate the Arize value proposition and lead product demonstrations.
Work closely with GenAI teams as a trusted advisor, sharing best practices and leading proofs of concept.
Handle technical objections and develop strategies across sales, engineering, and product.
Arize AI is the leading AI & Agent Engineering observability and evaluation platform, empowering AI engineers to ship high-performing, reliable agents and applications. They are a Series C company with over $135M in funding and a rapidly growing customer base of 150+ leading enterprises.
Design and execute AI-native software development experiments to measurably improve productivity, quality, and speed.
Evaluate emerging AI engineering tools and institutionalize development, testing, and delivery standards.
Coach engineers and leaders across the organization to adopt AI-assisted workflows and drive transformation.
Sparkrock helps social benefit organizations—such as nonprofits, school boards, and government agencies—operate more effectively. They are a global, fully remote organization dedicated to mission-driven enterprise software.
Design, build, and maintain AI-powered features like LLM integrations and RAG pipelines.
Contribute to tooling for AI reliability, including observability and monitoring.
Prototype solutions quickly and work with peers to harden them for production.
Homeward creates financial products that remove uncertainty from homebuying. As a remote-first real estate startup with about 200 employees, we value the Golden Rule, One Team One Dream, and Calm Focus.
Design and build a next-generation reliability platform for Affirm's production systems, blending distributed systems engineering with AI-assisted development.
Create AI agents and a centralized command center to assist with incident triage, root-cause analysis, and unified system health visualization.
Own projects end-to-end, from requirements to rollout, collaborating with partner teams to build powerful, simple solutions for developers.
Affirm is reinventing credit to make it more honest and friendly, offering consumers the flexibility to buy now and pay later without hidden fees. The company is a remote-first organization with a strong focus on people-first values and inclusive benefits.
Design and build LLM-powered AI components for internal tools and user-facing applications.
Develop systems for data retrieval, embeddings, and intelligent automation workflows.
Collaborate with cross-functional teams on AI use cases and contribute to technical vision.
Jobgether powers job matching using AI to connect candidates with opportunities. They emphasize a remote-first, collaborative culture with high autonomy and flat structure.
Own end-to-end delivery of core data platform components including schema mapping, normalization, and validation pipelines.
Drive technical architecture for an AI-native data warehouse serving institutional financial clients.
Build AI evaluation infrastructure to ensure trustworthy outputs in high-stakes financial data contexts.
Juniper Square is a private market operations platform that unifies technology, data, and fund administration services for over 2,300 GPs. With 1,000+ employees and $350M+ in funding, the company has a founder-led culture focused on ambitious, meaningful work and diverse perspectives.
Identify workflows where AI agents can achieve 10-100x efficiency gains, build the business case, and align with domain leadership.
Design and implement agentic workflows integrating CRM, ERP, ticketing, and other enterprise systems with human-in-the-loop checkpoints.
Own production agent performance, track KPIs, tune prompts and retrieval, and provide observability for continuous improvement.
Natera is a global leader in cell-free DNA testing, dedicated to oncology, women's health, and organ health. The team consists of highly dedicated professionals from world-class institutions, working in a challenging and collaborative culture.
Build high-quality, scalable systems while using AI as a co-creator across the software development lifecycle.
Own end-to-end engineering initiatives, ensuring delivery readiness, production stability, and smooth execution.
Provide technical leadership and drive improvements using modern practices like DORA metrics and flow optimization.
Our partner is a company seeking a Software Craftsperson (Python/AI) for a remote role based in Ireland. They emphasize an autonomous, consultative, and impact-driven engineering culture with strong expectations around ownership and technical excellence.
Design, build, and deploy AI-powered product features leveraging LLMs to enhance user workflows.
Develop and maintain evaluation frameworks, guardrails, and monitoring systems for AI reliability.
Collaborate with cross-functional teams to integrate generative AI into backend services and APIs.
This company provides a modern SaaS platform that helps organizations strengthen security, compliance, and trust. They have a remote-first environment with a strong engineering culture focused on ownership and collaboration.
Write behavioral specs, architectural constraints, and feature requirements that agents implement against.
Build and maintain harness infrastructure including structural tests, linting rules, and CI gates.
Design validation systems where agents write the tests and you verify features work from the user's perspective.
Bolo.ai builds generative AI systems for the energy industry, making daily work faster, safer, and better for heavy industry workers. We have Fortune 500 contracts, production deployments, and growing enterprise demand, and we're scaling with a small, senior-leaning engineering team.
Work as a software engineer maintaining and expanding agentic coding systems and AI SDK features.
Take end-to-end ownership of new features, collaborating with teams to deliver reliability and great developer experience.
Serve as a domain expert on AI design patterns, collaborating with field staff and writing public technical documentation.
Temporal provides a reliable foundation powering AI leaders such as OpenAI and NVIDIA, serving users across a broad range of AI applications. The company is fully remote, action-oriented, and focused on shipping fast and solving customer problems with a thorough technical grounding.
Design and build production-grade LLM-powered agents and workflows for enterprise-scale AI solutions.
Develop and optimize RAG pipelines, agent reasoning patterns, and evaluation frameworks to measure model quality.
Collaborate with Engineering, Product, and cross-functional teams to translate business requirements into impactful AI systems.
Smartsheet builds AI-powered strategic planning and work execution agents through SmartAssist, an intelligent agent platform. The company is a publicly traded, large enterprise with a collaborative, inclusive culture that values diverse perspectives and engineering rigor.
Lead, manage, and grow a distributed engineering team dedicated to AI platform development.
Own the technical strategy and roadmap for Cohere's AI platforms, applications, and agents.
Design data and semantic architectures to provide AI agents with accurate healthcare data.
Cohere Health's clinical intelligence platform uses agentic AI to optimize care for health plans and providers, covering over 15 million people. The company has been recognized as a top LinkedIn startup and is backed by leading investors.
Design and optimize scalable, secure, and maintainable AI-powered software solutions, integrating machine learning models and generative AI services.
Champion engineering excellence by writing high-quality, well-tested code and guiding peers in best practices for AI integration.
Collaborate cross-functionally to evaluate new AI capabilities and contribute to the roadmap for AI-enabled features and platforms.
BECU is a financial cooperative with 1.5 million members and over $30 billion in managed assets, focused on people over profits. With 90 years of history and a purpose-driven culture, they are one of the nation's leading credit unions, emphasizing employee support and community well-being.