Member of Technical Staff (MLE, Pre-Training Data)
PARIS, 75
1 day ago
Requirements
- If you are passionate about transforming data into the foundation of AI systems, this role offers a unique opportunity to make a meaningful impact
- Strong software engineering skills, with proficiency in Python and experience building data pipelines
- Familiarity with data processing frameworks such as Apache Spark, Apache Beam, Pandas, or similar tools
- Experience working with large-scale datasets, including web data, code data, and multilingual corpora
- Knowledge of data quality assessment techniques and experimentation with data mixtures
- A passion for bridging research and engineering to solve complex data-related challenges in AI model training
- Bonus: papers published at top-tier venues (such as NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP)
What the job involves
As a Machine Learning Engineer specializing in pretraining data, you will play a pivotal role in developing the data pipeline that underpins Cohere’s advanced language models. Your responsibilities will encompass the end-to-end management of training data, including ingestion, cleaning, filtering, and optimization, as well as data modeling to ensure datasets are structured and formatted for optimal model performance. You will work with diverse data sources (web data, code data, multilingual corpora, and synthetic data) to ensure their quality, diversity, and reliability. By combining research and engineering, you will bridge the gap between raw data and cutting-edge AI models, directly contributing to improvements in critical training metrics like throughput and accelerator utilization. Specifically, you will:
- Design and build scalable data pipelines to ingest, clean, filter, and optimize diverse datasets, including web data, code data, multilingual corpora, and synthetic data
- Conduct data ablations to assess data quality and experiment with data mixtures to enhance model performance
- Develop robust data modeling techniques to ensure datasets are structured and formatted for optimal training efficiency
- Research and implement innovative data curation methods, leveraging Cohere’s infrastructure to drive advancements in natural language processing
- Collaborate with cross-functional teams, including researchers and engineers, to ensure data pipelines meet the demands of cutting-edge language models
Company
Deepstreamtech
Posting platform
WHATJOBS