Job Description

Play a critical role in the foundation model development process, focusing on consolidating, gathering, and generating high-quality text data for pretraining, midtraining, supervised fine-tuning (SFT), and preference optimization. Create and maintain data cleaning, filtering, and selection pipelines that can handle more than 100 TB of data. Watch for releases of public datasets on Hugging Face and other platforms. Create crawlers to gather datasets from the web where public data is lacking. Write and maintain synthetic data generation pipelines. Run ablations to assess new datasets and judging pipelines.

About Liquid AI

Liquid AI, spun out of MIT, has a mission to build efficient AI systems at every scale.
