Director, Reinforcement Learning & Agentic Post-Training
About the AI Studio
The AI Studio's mission is to find the fastest possible path to an autonomous supply chain. We build AI agents, learning systems, model training pipelines, evaluations, simulations, and decision‑making systems for some of the hardest problems in global supply chains. Our work spans LLMs, reinforcement learning, agentic workflows, software automation, optimization, and production engineering.
Your Mission
We are looking for a deeply technical Director of Reinforcement Learning & Agentic Post‑Training to lead how Blue Yonder trains LLM‑based agents to operate supply chain software. This role sits at the center of our Model Training Factory, built with NVIDIA, where we develop specialized AI agents for the autonomous supply chain. Our agents must reason over supply chain state, use tools, interact with Blue Yonder workflows, execute multi‑step operational tasks, and improve through feedback, evaluation, and reinforcement learning.
Tool use is not a side feature here. Our agents must learn to work inside real enterprise software: querying state, proposing actions, invoking APIs, respecting constraints, handling exceptions, escalating uncertainty, and collaborating with human operators.
The challenge is not simply making a model sound knowledgeable about supply chain. The challenge is training models that can reliably act.
What You'll Do
- Lead the technical strategy for reinforcement learning, post‑training, and tool‑using LLM agents within the AI Studio.
- Build and manage a team of machine learning engineers working on agent training, RL environments, reward modeling, evaluation, data generation, and training infrastructure.
- Design environments where LLM agents learn to operate Blue Yonder software through APIs, tools, workflows, simulations, and human feedback.
- Develop training and evaluation systems for multi‑step supply chain workflows across planning, warehouse management, transportation, commerce, and network operations.
- Define what "good" looks like for operational agents: correct tool use, constraint adherence, business outcome quality, latency, cost, robustness, escalation behavior, and human trust.
- Build reward models, verifiers, preference pipelines, automated graders, and evaluation harnesses for agent behavior.
- Create evaluation frameworks that measure real agent performance, including tool‑call correctness, workflow completion, recovery from bad state, long‑horizon reliability, and failure modes.
- Partner with product, engineering, architecture, and domain experts to turn real supply chain workflows into trainable agent environments.
- Guide model improvement across supervised fine‑tuning, preference optimization, reinforcement learning from human or AI feedback, rejection sampling, synthetic data generation, and policy optimization.
- Make practical technical tradeoffs between model capability, inference cost, latency, reliability, product timelines, and operational safety.
- Establish engineering standards for experiment tracking, reproducibility, observability, rollout safety, and production monitoring.
- Document what works and what fails so the team compounds learning over time.
What We're Looking For
You:
- Have led a team that shipped LLMs to production after post‑training them with methods such as SFT, DPO, RLHF/RLAIF, or other reinforcement learning techniques.
- Have led a team that trained models to use tools, call APIs, interact with software environments, or complete multi‑step tasks.
- Have a strong machine learning engineering background and can credibly lead engineers because you have built systems like this yourself.
- Have managed or technically led high‑performing ML engineering teams focused on reinforcement learning.
- Are highly proficient in Python and PyTorch.
- Understand modern LLM post‑training workflows, including supervised fine‑tuning, preference data, reward modeling, policy optimization, evaluation, and deployment.
- Have hands‑on experience with reinforcement learning methods such as reward shaping, PPO‑style optimization, GRPO, offline RL, policy evaluation, rejection sampling, or environment design.
- Know how to evaluate open‑ended agent behavior beyond static benchmark scores.
- Can reason about production constraints: latency, inference cost, safety, observability, rollback, and reliability.
- Can balance frontier‑oriented exploration with shipping production systems.
- Are comfortable with ambiguity but intolerant of unsound technical thinking.
- Care about engineering craft, reproducibility, and learning velocity.
- Are curious about why systems work, not just whether a metric moved.
Bonus Points
- Experience building simulated or sandboxed enterprise software environments for agent training.
- Experience with NVIDIA Nemotron, NVIDIA NeMo, Megatron, vLLM, Ray, distributed training, or large‑scale inference systems.
- Experience with warehouse management, supply chain planning, transportation, merchandising, logistics, operations research, or enterprise workflow automation.
- Experience designing agent safety systems, including permissioning, action validation, approval flows, uncertainty escalation, and audit trails.
- Evidence of technical taste through papers, open‑source contributions, internal platforms, side projects, or shipped systems that show deep curiosity about model behavior.
Equal Opportunity
All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or protected veteran status. Blue Yonder is an equal opportunity employer.