20 jobs similar to Senior Python Developer (AI Evaluation & Benchmarking)

Jobs ranked by similarity.

Romania

  • Train AI models on Python coding tasks including algorithms, data structures, and distributed systems.
  • Verify logical accuracy, code quality, and capture error traces to improve model reasoning.
  • Collaborate remotely by conversing with the model and suggesting improvements to prompt engineering and evaluation metrics.

The company is training large-scale language models to power scientific discovery, education, and software development. It is a tech-focused organization seeking freelance contractors; no information on its size or culture is provided.

Canada

  • Evaluate the model's ability to respond to coding requests and code base-related questions using available tools.
  • Assess agent trajectories and model capabilities for code generation and debugging requests.
  • Prompt models to complete complex coding tasks and review the accuracy of generated responses.

Cohere is a security-first enterprise AI company building cutting-edge foundation models and end-to-end products. We are a global team of researchers, engineers, and designers with offices in Toronto, San Francisco, and other key cities.

Global

  • Work as a software engineer maintaining and expanding agentic coding systems and AI SDK features.
  • Take end-to-end ownership of new features, collaborating with teams to deliver reliability and great developer experience.
  • Serve as a domain expert on AI design patterns, collaborating with field staff and writing public technical documentation.

Temporal provides a reliable foundation powering AI leaders such as OpenAI, NVIDIA, and others, serving users across a broad range of AI applications. The company is fully remote, action-oriented, and focused on shipping fast and solving customer problems with a thorough technical grounding.

Global, Unlimited PTO

  • Implement and maintain AI benchmarks using evaluation infrastructure like the Inspect library.
  • Contribute to the design and development of new benchmarks for frontier AI models.
  • Collaborate with researchers and engineers to ensure accurate and insightful evaluation data.

Epoch AI is a research institute investigating trends in machine learning and the economic consequences of AI. With a small, mission-driven team, we aim to provide rigorous, independent insights into AI development.

Netherlands

  • Clone, run, and debug AI-generated full-stack and frontend applications to validate functionality and identify issues.
  • Evaluate code quality, structure, readability, and adherence to engineering standards, documenting UI and logic inconsistencies.
  • Collaborate with engineers to refine code evaluation frameworks and contribute to reusable frontend components and internal tooling.

A partner company is seeking a Junior Full-Stack Engineer to work at the intersection of full-stack development and AI, focusing on improving AI systems that generate and evaluate code. They offer a fast-moving, experimental environment with strong learning and career growth potential, collaborating with experienced engineers and AI researchers.

US, 16w maternity, 12w paternity

  • Orchestrate High-Velocity Workflows: Leverage advanced agentic coding tools (e.g., Cursor, multi-agent environments) to dramatically accelerate feature prototyping, code generation, and test coverage.
  • Own the Guardrails & Quality: Act as the ultimate reviewer and architect; define the specifications, establish repo-context guardrails, and review AI-accelerated output for hidden security risks, scale bottlenecks, and architectural alignment.
  • Build Scalable Application and Data Layers: Design, build, and maintain our data pipelines and application to serve our hundreds of users.

EvolutionIQ provides technology to improve insurance claims handling. The company is experiencing massive growth and has been named a top workplace, prioritizing its team.

US

  • Shape the product hands-on, partnering with Product to design and deliver core application components and critical services.
  • Build product capabilities end-to-end using Python, Next.js, and modern backend patterns, leveraging AI-assisted development tools.
  • Own technical readiness for production, including reliability, observability, performance tuning, and incident response preparedness.

CentralReach is a leading provider of autism and IDD care software for Applied Behavior Analysis (ABA), multidisciplinary therapy, and special education. Trusted by more than 200,000 users, we enable therapy providers, educators, and employers to scale the way they deliver ABA and related therapies.

US

  • Design and build a next-generation reliability platform for Affirm's production systems, blending distributed systems engineering with AI-assisted development.
  • Create AI agents and a centralized command center to assist with incident triage, root-cause analysis, and unified system health visualization.
  • Own projects end-to-end, from requirements to rollout, collaborating with partner teams to build powerful, simple solutions for developers.

Affirm is reinventing credit to make it more honest and friendly, offering consumers the flexibility to buy now and pay later without hidden fees. The company is a remote-first organization with a strong focus on people-first values and inclusive benefits.

Global

  • Design and execute AI-native software development experiments to measurably improve productivity, quality, and speed.
  • Evaluate emerging AI engineering tools and institutionalize development, testing, and delivery standards.
  • Coach engineers and leaders across the organization to adopt AI-assisted workflows and drive transformation.

Sparkrock helps social benefit organizations—such as nonprofits, school boards, and government agencies—operate more effectively. They are a global, fully remote organization dedicated to mission-driven enterprise software.

US, Unlimited PTO

  • Evaluate and select cutting-edge AI models to enhance product capabilities and user experience.
  • Design evaluation frameworks and configure observability for AI performance in production.
  • Collaborate with data science, CTO, and engineering teams to fine-tune and integrate AI models.

Vetcove modernizes veterinary software and pet healthcare with a procurement marketplace, home delivery ecommerce, and practice management system. Over 25,000 hospitals across all 50 states use the platform daily, and the company is backed by Y Combinator and top venture investors.

US, Unlimited PTO

  • Design and prototype AI-assisted development techniques like code generation and test automation.
  • Partner with product and platform teams to integrate AI workflows into existing systems.
  • Mentor engineers on effective use of AI tools and contribute to technical strategy.

Hims & Hers is a leading health and wellness platform that provides personalized care from diagnosis to treatment delivery. As a public company traded on the NYSE, they foster a remote-first culture with a focus on innovation and employee well-being.

Ireland

  • Build high-quality, scalable systems while using AI as a co-creator across the software development lifecycle.
  • Own end-to-end engineering initiatives, ensuring delivery readiness, production stability, and smooth execution.
  • Provide technical leadership and drive improvements using modern practices like DORA metrics and flow optimization.

Our partner is a company seeking a Software Craftsperson/Python/AI for a remote role based in Ireland. They emphasize an autonomous, consultative, and impact-driven engineering culture with strong expectations around ownership and technical excellence.

US, 5w PTO

  • Build and enhance production-grade software applications using AI-assisted development workflows.
  • Design and implement clean, well-structured APIs, services, and application components.
  • Collaborate closely with the AI-Native Tech Lead to implement engineering best practices and deliver features end-to-end.

Cotiviti is a leading solutions and analytics company that leverages clinical and financial datasets to deliver insights into healthcare system performance. The company focuses on helping healthcare organizations improve financial performance and quality.

US

  • Design and build full-stack applications from concept to production.
  • Own the full SDLC: design, development, testing, deployment, and iteration.
  • Use AI tools to accelerate development, testing, and code quality.

Crosslake Technologies helps private equity investors and portfolio company leaders drive value creation through technology. The company operates with small, highly capable engineering pods and emphasizes core values of service, curiosity, credibility, commitment, and creativity.

Global

  • Design and architect AI capabilities on a cutting-edge iPaaS platform, working with technologies like LLMs, RAG, Azure AI, and AWS Bedrock.
  • Build robust, scalable AI systems that run 24/7/365, collaborating with engineers, product management, and operations.
  • Mentor team members, use data-driven decision-making, and stay current with emerging AI and cloud computing trends.

Jitterbit is a leading data, application, and process workflow automation solution, rooted in iPaaS and fueled by an ambitious vision to integrate critical business processes. The company empowers enterprises of all sizes to accelerate their digital journey and is recognized in Gartner MQ for seven straight years, with a distributed, fun, fast-paced, and performance-oriented culture.

US

  • Write behavioral specs, architectural constraints, and feature requirements that agents implement against.
  • Build and maintain harness infrastructure including structural tests, linting rules, and CI gates.
  • Design validation systems where agents write the tests and you verify features work from the user's perspective.

Bolo.ai builds generative AI systems for the energy industry, making daily work faster, safer, and better for heavy industry workers. We have Fortune 500 contracts, production deployments, and growing enterprise demand, and we're scaling with a small, senior-leaning engineering team.

Austria

  • Design, develop, and deploy AI-powered applications and full-stack solutions.
  • Build and iterate rapidly on prototypes, MVPs, and proofs of concept.
  • Collaborate directly with stakeholders to translate business needs into scalable implementations.

They are an AI-focused company building intelligent applications and scalable products. The team is small and fast-moving, with a culture of experimentation and rapid execution.

US, Unlimited PTO

  • Own problems end-to-end: drive requirements with stakeholders, leverage AI to build, and verify outcomes.
  • Write production-quality code across Python, Django, AWS, React, and PostgreSQL.
  • Iterate until the metric moves, owning adoption, monitoring, and support.

Counterpart is an insurtech platform reimagining management and professional liability for the modern workplace. We are a fully remote company backed by A-rated carriers, with a culture focused on autonomy, AI-driven development, and inclusive collaboration.

US, Unlimited PTO, 16w maternity, 16w paternity

  • Design, build, and deploy AI-powered solutions connecting Remote’s platform to customer systems and workflows.
  • Work at the frontier of practical AI, shipping reliable, observable, and secure systems in production.
  • Own customer outcomes from discovery through production rollout with high autonomy and direct influence on product roadmap.

Remote is a global HR platform that helps businesses recruit, pay, and manage international teams compliantly. The company fosters a future-focused, async culture with team members across 6 continents and emphasizes innovation, automation, and AI.

US

  • Design and build scalable backend services and data pipelines for clinical and analytics applications.
  • Collaborate with cross-functional teams to translate requirements into robust cloud-native solutions using GCP and Docker.
  • Mentor junior engineers and contribute to architecture decisions while ensuring high-quality deliverables.

Trissential is a consulting company that partners with clients to build innovative data-driven systems. They offer a flexible, remote-first culture with a focus on collaboration, autonomy, and continuous learning.