
Member of Technical Staff (MLE, Pre-Training Data)

PARIS, 75
Posted 1 day ago

Requirements

  • If you are passionate about transforming data into the foundation of AI systems, this role offers a unique opportunity to make a meaningful impact
  • Strong software engineering skills, with proficiency in Python and experience building data pipelines
  • Familiarity with data processing frameworks such as Apache Spark, Apache Beam, Pandas, or similar tools
  • Experience working with large-scale datasets, including web data, code data, and multilingual corpora
  • Knowledge of data quality assessment techniques and experimentation with data mixtures
  • A passion for bridging research and engineering to solve complex data-related challenges in AI model training
  • Bonus: papers published at top-tier venues (such as NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP)

What the job involves

  • As a Machine Learning Engineer specializing in pretraining data, you will play a pivotal role in developing the data pipeline that underpins Cohere’s advanced language models
  • Your responsibilities will encompass the end-to-end management of training data, including ingestion, cleaning, filtering, and optimization, as well as data modeling to ensure datasets are structured and formatted for optimal model performance
  • You will work with diverse data sources—such as web data, code data, multilingual corpora, and synthetic data—to ensure their quality, diversity, and reliability
  • In this role, you will design and implement scalable, robust pipelines for data processing, conduct data ablations to evaluate quality, and experiment with data mixtures to enhance model performance
  • By combining research and engineering, you will bridge the gap between raw data and cutting-edge AI models, directly contributing to improvements in critical training metrics like throughput and accelerator utilization
  • Design and build scalable data pipelines to ingest, clean, filter, and optimize diverse datasets, including web data, code data, multilingual corpora, and synthetic data
  • Conduct data ablations to assess data quality and experiment with data mixtures to enhance model performance
  • Develop robust data modeling techniques to ensure datasets are structured and formatted for optimal training efficiency
  • Research and implement innovative data curation methods, leveraging Cohere’s infrastructure to drive advancements in natural language processing
  • Collaborate with cross-functional teams, including researchers and engineers, to ensure data pipelines meet the demands of cutting-edge language models
Company
Deepstreamtech
Publishing platform
WHATJOBS