Source Job

$230,000–$322,000/yr
US

  • Define technical strategy & architecture for data curriculum pipelines powering next-gen foundation models.
  • Design & execute dynamic curriculum learning strategies, improving model stability & reasoning.
  • Engineer logic for serializing Reddit’s complex conversational trees into optimal training contexts.

Python Spark Rust C++

20 jobs similar to Staff Research Engineer, Pre-training Data

Jobs ranked by similarity.

Europe North America 7w PTO

  • Improve the quality of pretraining datasets by leveraging your previous experience, intuition and training experiments.
  • Focus on generating synthetic data at scale and determining the best strategies to leverage such data into training large models.
  • Closely collaborate with other teams like Pretraining, Post-training, Evals, and Product to define high-quality data needs.

Poolside aims to be the company that builds a world where AI will be the engine behind economically valuable work and scientific progress. They are a remote-first team across Europe and North America that values the quality of their systems.

Global

  • Conduct cutting-edge machine learning research, building and training large language models.
  • Focus on research projects aimed at expanding the frontier of knowledge in language modelling and associated areas such as evaluation, multimodal models, and optimisation.
  • Disseminate your research results through the production of publications, datasets, and code.

Cohere is dedicated to scaling intelligence to serve humanity by training and deploying frontier models for developers and enterprises, building AI systems for content generation, semantic search, and more! They foster a culture of hard work, valuing diverse perspectives and contributions to model capabilities and customer value.

US

  • Design, train, and evaluate machine learning models from first principles.
  • Develop and maintain production-quality Python code for data processing.
  • Build natural language processing systems for document understanding.

Alpha7X is a technology company that appears to be growing, with a focus on innovation in AI and machine learning.

Europe 5w PTO

  • Improve model performance through data quality, curation, labeling, and evaluation.
  • Work on the data layer of Generative AI products involving images, video, or audio.
  • Design, build, and operate workflow orchestration systems and large-scale data processing pipelines.

Synthesia is on a mission to make video easy for everyone with their AI video communications platform. They simplify the entire video production process, making it easy for everyone to create, collaborate, and share high-quality videos, and are trusted by leading brands such as Heineken, Zoom, Xerox, and McDonald’s.

US

  • Build state-of-the-art multimodal data mining and semantic search solutions to power AV product development.
  • Develop data understanding platform infrastructure for real-time querying/vector databases and batch/stream processing using technologies like Ray, Spark, Lance, or similar.
  • Deliver end-to-end data mining solutions that span onboard (C++) and offboard (ML & Data Infra) infrastructure to accelerate AV product development.

Stack is developing revolutionary AI and advanced autonomous systems designed to enhance safety, reliability, and efficiency of modern operations. With decades of experience creating and deploying real world systems for demanding environments, the Stack team is dedicated to developing an autonomous solution ecosystem tailored to the trucking industry's unique demands.

US

  • Work with complex datasets from various sources to build Extract, Transform, and Load (ETL) data pipelines for downstream tasks.
  • Finetune and integrate the latest Large Language Models (LLMs) such as OpenAI/Gemini/MistralAI models into production systems.
  • Train, finetune and deploy large-scale NLP and CV models to power complex document understanding experiences.

Wealth.com is the industry’s leading estate planning platform, empowering more than 1,000 wealth management firms to modernize how they talk about estate planning with their clients. They cultivate a collaborative and supportive environment, fostering innovation and making Wealth.com a truly enjoyable workplace.

$135,000–$162,000/yr
US

  • Design, implement, and evolve RAG pipelines combining structured data, embeddings, and LLMs.
  • Develop and maintain prompt strategies used across multi-step agent workflows.
  • Integrate LLMs into production systems with attention to reliability, cost, and latency.

FirmPilot builds AI-powered systems that automate and scale real-world business outcomes. They focus on applied AI, using best-in-class large language models and tooling to deliver reliable, production-grade automation.

$110,720–$138,400/yr
US

  • Design, develop, and deploy LLM- and RAG-powered applications that enhance analyst and hacker productivity across offensive security use cases.
  • Architect and maintain large-scale, high-performance data pipelines to process vulnerability, asset, and activity datasets from multiple sources.
  • Collaborate with security researchers and engineers to translate offensive security workflows into data-driven automation.

Bugcrowd empowers organizations to take back control and stay ahead of threat actors. With a network of hackers, Bugcrowd brings diverse expertise to uncover hidden weaknesses and adapts swiftly to evolving threats.

AI Engineer

Quora
$107,360–$152,900/yr
Global

  • Work with other engineers on a wide variety of AI engineering tasks to improve our existing applied AI systems
  • Identify new opportunities to apply emerging AI capabilities to different parts of the Poe product
  • Take end-to-end ownership of applied AI systems - from prototyping, data pipelines, model optimization/evaluation to reliable deployment at scale

Quora's mission is to grow the world's collective intelligence. They have two platforms: Quora, a global knowledge sharing platform, and Poe, a platform to chat, explore and build with AI language models. They have a culture rooted in transparency, idea-sharing, and experimentation.

  • Build, optimize, and evolve RAG pipelines.
  • Develop prompts and guardrails for domain-specific LLM applications.
  • Implement hallucination detection, mitigation, and fact-checking mechanisms.

Robots & Pencils builds meaningful, scalable digital products by blending strategy, design, and engineering. They are a small, senior team with direct access to enterprise clients.

US

  • Design, build, and deploy AI Agents including custom tools, prompt engineering, orchestration workflows, and agent design patterns.
  • Contribute to the backend infrastructure powering Candidly's AI capabilities, including API development, data integrations, and data pipelines.
  • Work closely with stakeholders across product, design, engineering, and leadership to translate complex AI concepts into actionable strategies and features.

Candidly, founded in 2016, is the category leader with the market’s most comprehensive AI-driven student debt and savings optimization platform. They partner with hundreds of top employers, financial institutions, and retirement record keepers, positioning Candidly to serve more than 35 million Americans. Candidly is a high-growth, Series B startup, funded by leading investors with an international team of 70 (and counting).

$133,109–$239,596/yr
US

  • Create advanced machine learning analytical solutions to extract insights from diverse structured and unstructured data sources.
  • Unearth data value by selecting and applying the right machine learning, deep learning and processing techniques.
  • Refine data manipulation and retrieval through the design of efficient data structures and storage solutions.

Experian is a global data and technology company, powering opportunities for people and businesses around the world. As a FTSE 100 Index company listed on the London Stock Exchange (EXPN), they have a team of 22,500 people across 32 countries, and corporate headquarters in Dublin, Ireland.

$147,300–$245,000/yr
US

  • Research and develop Machine Learning models and optimize them for scaled production usage.
  • Work with colleagues to explore ongoing product issues and recommend innovative ML/AI based solutions.
  • Work with subject matter experts to curate and generate optimal datasets following responsible data collection and model maintenance practices.

Turnitin is a recognized innovator in the global education space, partnering with educational institutions to promote honesty, consistency, and fairness across all subject areas and assessment types. They are a global organization with team members in over 35 countries, offering a remote-first culture which empowers team members to work with purpose and accountability.

Europe Asia

  • Collaborate with engineers, data scientists, and business analysts to understand requirements, refine models, and integrate LLMs into AI solutions
  • Develop and implement deep learning algorithms for AI solutions
  • Preprocess raw data, including text normalization, tokenization, and other techniques, to make it suitable for use with NLP models

Exadel is an AI-first global tech company with 25+ years of engineering leadership. They have 2,000+ team members and 500+ active projects powering Fortune 500 clients, and they value open dialogue, creative freedom, and mentorship.

  • Own end-to-end quality design for Prolific managed service studies.
  • Define, implement, and maintain quality measurement systems.
  • Build and deploy automated quality checks and launch gates using Python and SQL.

Prolific provides the high-quality, diverse data required to train the next generation of AI models. Through their platform, they empower researchers and companies to access a global, ethically curated participant base, ensuring that cutting-edge AI research and training are grounded in inclusivity and precision.

$107,360–$152,900/yr
Global

  • Improve our existing Ads Recommendation systems using your expertise
  • Identify new opportunities to apply Machine Learning to different parts of the Quora Ads Product
  • Work with other engineers to implement algorithms and systems in an efficient way

Quora's mission is to grow the world's collective intelligence through its knowledge-sharing platform and AI language model platform. Behind these products are passionate, collaborative, and high-performing global teams with a culture rooted in transparency, idea-sharing, and experimentation.

$255,816–$375,183/yr
Global

  • Improve our existing machine learning systems using your core coding skills and ML knowledge
  • Identify new opportunities to apply machine learning to different parts of the Quora product
  • Work with other machine learning engineers to implement algorithms and systems efficiently

Quora's mission is to grow the world's collective intelligence through platforms like Quora and Poe. They are a remote-first company with passionate, collaborative, and high-performing global teams, emphasizing transparency, idea-sharing, and experimentation.

US

  • Explore data and real-world cases to develop signals and machine learning models that identify and reduce platform risk.
  • Partner with cross-functional teams to design practical ML solutions that fit real workflows.
  • Prototype, train, and iterate on machine learning models, using a mix of established and novel techniques.

Patreon is a media and community platform where over 300,000 creators give their biggest fans access to exclusive work and experiences. They aim to fund the creative class and have generated over $10 billion for creators with millions of memberships.

Global Unlimited PTO

  • Build scalable backend services and internal APIs for the AI platform.
  • Integrate LLMs and retrieval into reliable, production-ready workflows.
  • Build knowledge ingestion pipelines for LLMs (documents, APIs, semi-structured data).

MaintainX is the world's leading Asset and Work Intelligence platform for industrial and frontline environments. It powers operational excellence for 13,000+ businesses. They recently completed a $150 million Series D round, at a valuation of $2.5 billion.

$200,000–$250,000/yr
US Canada Unlimited PTO

  • Work with our team to understand our Archie capability roadmap and decompose capabilities into technical development.
  • Turn capability prototypes and PoCs from our AI research team into robust, scalable implementations.
  • Diagnose and solve technical problems identified by our team or users.

P-1 AI is building an engineering AGI, focusing on the built world. They are a small team tackling an ambitious problem, aiming to put an Archie on every engineering team at every industrial company on earth.