Lead GPU Engineer

PARIS, 75
2 days ago

A shift is happening in AI that most people have not fully priced in. As models become more capable and agents take over more software work, inference becomes the critical bottleneck. The question stops being whether a model can do the work and becomes whether it can run fast enough to feel like thinking.

Kog was built for that shift.

We co-design the execution engine and the model architecture together, specifically for AMD MI300X hardware. Our monokernel runs from first token to last without returning control to the CPU. Our Laneformer architecture is designed to overlap computation and communication by deferring all-reduce by one layer.
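The deferred all-reduce idea can be sketched in plain Python. This is an illustrative toy, not Kog's engine: `all_reduce`, the layer functions, and the additive combine step are all stand-in assumptions. The pattern is the point: each layer's all-reduce is launched asynchronously, and its result is consumed one layer later, so the communication for layer i hides behind the computation of layer i+1.

```python
from concurrent.futures import ThreadPoolExecutor

def all_reduce(x):
    # Stand-in for a cross-GPU reduction that would take network time.
    return x

def run_deferred(layers, x):
    with ThreadPoolExecutor(max_workers=1) as comm:
        pending = None                     # in-flight all-reduce, if any
        for layer in layers:
            y = layer(x)                   # local compute overlaps with `pending`
            if pending is not None:
                y = y + pending.result()   # fold in the *previous* layer's reduction
            pending = comm.submit(all_reduce, y)  # launch this layer's reduction
            x = y
        return pending.result()            # drain the final reduction

# Three toy "layers" that each add 1; with input 0 this schedule yields 7.
print(run_deferred([lambda v: v + 1] * 3, 0))  # → 7
```

In a sequential schedule, each `layer(x)` call would block on the reduction that precedes it; here the only forced wait is on a reduction launched one full layer earlier, which is the window the architecture is designed to exploit.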

Today, Kog serves 2,500 tokens per second. Our next target is 5,000.

Our MoE v3 already outperforms Llama 3.2-3B on CORE benchmarks and shows emergent reasoning capabilities where dense models of similar size score zero.

We are a team of 11 people, including 10 engineers and 4 PhDs, building a different kind of inference company from first principles.

Why this role matters now

Inference speed is becoming a product constraint, a model constraint, and a company constraint at the same time. At Kog, this role sits directly on that bottleneck. The work you do here will shape token-by-token generation speed, influence which model designs become viable, and determine how quickly engineering judgment turns into measurable performance.

The problem

Most inference systems still carry architectural decisions that made sense for an earlier generation of workloads. Sequential generation still absorbs synchronization costs, CPU handoffs, and memory-access patterns that become limiting when every token matters.

Kog took a different route. We built a monokernel execution path and co-designed the model architecture with the hardware. That created a different set of opportunities and a higher level of technical difficulty. Progress comes from understanding the machine at a very fine-grained level, making strong tradeoffs, and turning them into real gains in generation speed.

The role

You will own the technical execution of the Kog inference engine at the hardware boundary. You will work close to the machine, close to the model, and close to the people making the most consequential architectural decisions in the company.

This is a hands‑on leadership role. You will write code, review kernels, define performance priorities, make architecture calls, and drive a team toward improvements that matter in production. You will be expected to move between microscopic detail and system‑level judgment with the same rigor.

What you will work on

  • Kernel architecture for the monokernel pipeline, including memory hierarchy choices, scheduling behavior, and strategies that hide HBM latency behind useful computation

  • Low‑level optimization work on modern GPU hardware, with profiling that turns machine behavior into concrete engineering decisions

  • Execution strategies that improve end‑to‑end sequential generation speed rather than isolated wins on local kernels

  • Close collaboration with model architecture to turn model constraints into execution opportunities, and execution constraints into model design feedback

  • Technical direction for a small team working on the critical path of generation speed

  • Engineering milestones that connect ambitious performance targets to work that can be shipped, measured, and iterated on quickly

Must‑have

  • You have written GPU kernels for production workloads where performance was central to the system outcome

  • You understand memory hierarchy, scheduling, occupancy, and execution behavior at the level where you can anticipate likely bottlenecks before profiling confirms them

  • You have shipped optimizations with measurable impact and can explain the exact decisions that created the result

  • You have operated with real ownership over difficult technical work and raised the standard of the people around you through code, reviews, and decision‑making

  • You are comfortable carrying both individual technical depth and team‑level responsibility in the same role

Strong signal

  • You have deep low‑level GPU performance experience on AMD, NVIDIA, or both

  • You have worked on inference engine components such as attention kernels, KV cache management, quantization‑aware execution, or communication‑sensitive execution paths

  • You have built or shaped systems where model behavior and execution behavior had to be designed together

  • You have a public trace of serious low‑level work, such as benchmarks, repositories, technical writing, conference talks, or profiling methods adopted by others

Top 0.1% for this role

The strongest candidates for this role have already developed original judgment at the hardware boundary. They have found performance wins that were not obvious from documentation alone. They can explain why those wins worked, what tradeoffs they introduced, and how those decisions improved real token‑by‑token generation speed.

They have a track record of shortening the loop between observation, hypothesis, implementation, and measured result. They bring both authorship and taste. They know when to push deeper into the machine, when to change the execution plan, and when to influence model structure so the whole system moves faster.

What we offer

  • Direct access to AMD MI300X clusters from day one, with enough compute to validate serious work at real scale

  • A team where technical judgment carries weight and where the people closest to the problem shape the key decisions

  • Problems that sit on the critical path of model execution speed and that directly influence what the system can become

  • A remote‑first working model, with regular time overlap close to France time and monthly Paris weeks for engineering depth, alignment, and time together

  • Compensation aligned with top technical profiles in the Paris AI market, including meaningful equity
