Lead GPU Engineer
A shift is happening in AI that most people have not fully priced in. As models become more capable and agents take over more software work, inference becomes the critical bottleneck. The question stops being whether a model can do the work and becomes whether it can run fast enough to feel like thinking.
Kog was built for that shift.
We co-design the execution engine and the model architecture together, specifically for AMD MI300X hardware. Our monokernel runs from first token to last without returning control to the CPU. Our Laneformer architecture is designed to overlap computation and communication by deferring all-reduce by one layer.
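To make the overlap concrete, here is a minimal sketch of the one-layer deferral pattern in PyTorch, assuming a hypothetical layer interface (`compute_local` producing a rank-local partial output, `fold_reduced` consuming the reduced activations one layer late). It illustrates the scheduling idea only; it is not Kog's engine code.

```python
# Illustrative sketch only: one-layer-deferred all-reduce, not Kog's engine code.
# Assumes torch.distributed is initialized (e.g. under torchrun with NCCL/RCCL)
# and a hypothetical layer interface: compute_local() produces the rank-local
# partial output; fold_reduced() consumes reduced activations one layer late.
import torch
import torch.distributed as dist

def deferred_allreduce_forward(layers, x):
    inflight = None  # (work handle, buffer) for the previous layer's reduce
    for layer in layers:
        partial = layer.compute_local(x)   # rank-local compute
        buf = partial.clone()              # all_reduce is in-place; reduce a copy
        work = dist.all_reduce(buf, op=dist.ReduceOp.SUM, async_op=True)
        if inflight is not None:
            prev_work, prev_buf = inflight
            prev_work.wait()               # this reduce overlapped compute_local
            x = layer.fold_reduced(partial, prev_buf)
        else:
            x = partial                    # first layer: nothing reduced yet
        inflight = (work, buf)
    work, buf = inflight
    work.wait()                            # drain the final layer's reduce
    return buf
```

The property that matters is that each layer's communication is paid for during the next layer's compute, which is the window an architecture like Laneformer is designed to create.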
Today, Kog serves 2,500 tokens per second. Our next target is 5,000.
Our MoE v3 already outperforms Llama 3.2 3B on CORE benchmarks and shows emergent reasoning capabilities where dense models of similar size score zero.
We are a team of 11 people, including 10 engineers and 4 PhDs, building a different kind of inference company from first principles.
Why this role matters now
Inference speed is becoming a product constraint, a model constraint, and a company constraint at the same time. At Kog, this role sits directly on that bottleneck. The work you do here will shape token-by-token generation speed, influence which model designs become viable, and determine how quickly engineering judgment turns into measurable performance.
The problem
Most inference systems still carry architectural decisions that made sense for an earlier generation of workloads. Sequential generation still pays for synchronization overhead, CPU handoffs, and memory-access patterns that become limiting when every token matters.
Kog took a different route. We built a monokernel execution path and co-designed the model architecture with the hardware. That created a different set of opportunities and a higher level of technical difficulty. Progress comes from understanding the machine at a very fine-grained level, making hard tradeoffs, and turning them into real gains in generation speed.
The role
You will own the technical execution of the Kog inference engine at the hardware boundary. You will work close to the machine, close to the model, and close to the people making the most consequential architectural decisions in the company.
This is a hands‑on leadership role. You will write code, review kernels, define performance priorities, make architecture calls, and drive a team toward improvements that matter in production. You will be expected to move between microscopic detail and system‑level judgment with the same rigor.
What you will work on
Kernel architecture for the monokernel pipeline, including memory hierarchy choices, scheduling behavior, and strategies that hide HBM latency behind useful computation (a sketch after this list illustrates the overlap principle)
Low‑level optimization work on modern GPU hardware, with profiling that turns machine behavior into concrete engineering decisions
Execution strategies that improve end‑to‑end sequential generation speed rather than isolated wins on local kernels
Close collaboration with model architecture to turn model constraints into execution opportunities, and execution constraints into model design feedback
Technical direction for a small team working on the critical path of generation speed
Engineering milestones that connect ambitious performance targets to work that can be shipped, measured, and iterated on quickly
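As a rough illustration of the latency-hiding item above, here is the same principle at stream granularity in PyTorch: prefetch the next layer's weights on a side stream while the current layer computes. Inside a monokernel the equivalent work happens at a much finer grain (double-buffered LDS/shared-memory loads, software pipelining); the names and shapes here are hypothetical.

```python
# Coarse, illustrative analogue of hiding memory latency behind compute:
# copy layer i+1's weights on a side stream while layer i's matmul runs.
# In-kernel HBM latency hiding follows the same overlap principle at a
# much finer grain, inside a single kernel rather than across streams.
import torch

def prefetch_pipelined_forward(cpu_weights, x):
    # cpu_weights: list of pinned CPU tensors (pin_memory() makes the
    # non_blocking copies truly asynchronous).
    copy_stream = torch.cuda.Stream()
    with torch.cuda.stream(copy_stream):
        nxt = cpu_weights[0].to("cuda", non_blocking=True)
    for i in range(len(cpu_weights)):
        torch.cuda.current_stream().wait_stream(copy_stream)  # weights landed
        w = nxt
        if i + 1 < len(cpu_weights):
            with torch.cuda.stream(copy_stream):              # start next copy
                nxt = cpu_weights[i + 1].to("cuda", non_blocking=True)
        x = x @ w   # compute overlaps the in-flight copy above
    return x
```

The design choice is the same one the kernel-level work makes: keep the memory system busy fetching the next working set while the compute units stay busy on the current one.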
Must‑have
You have written GPU kernels for production workloads where performance was central to the system outcome
You understand memory hierarchy, scheduling, occupancy, and execution behavior at the level where you can anticipate likely bottlenecks before profiling confirms them
You have shipped optimizations with measurable impact and can explain the exact decisions that created the result
You have operated with real ownership over difficult technical work and raised the standard of the people around you through code, reviews, and decision‑making
You are comfortable carrying both individual technical depth and team‑level responsibility in the same role
Strong signal
You have deep low‑level GPU performance experience on AMD, NVIDIA, or both
You have worked on inference engine components such as attention kernels, KV cache management, quantization‑aware execution, or communication‑sensitive execution paths
You have built or shaped systems where model behavior and execution behavior had to be designed together
You have a public trace of serious low‑level work, such as benchmarks, repositories, technical writing, conference talks, or profiling methods adopted by others
Top 0.1% for this role
The strongest candidates for this role have already developed original judgment at the hardware boundary. They have found performance wins that were not obvious from documentation alone. They can explain why those wins worked, what tradeoffs they introduced, and how those decisions improved real token‑by‑token generation speed.
They have a track record of shortening the loop between observation, hypothesis, implementation, and measured result. They bring both authorship and taste. They know when to push deeper into the machine, when to change the execution plan, and when to influence model structure so the whole system moves faster.
What we offer
Direct access to AMD MI300X clusters from day one, with enough compute to validate serious work at real scale
A team where technical judgment carries weight and where the people closest to the problem shape the key decisions
Problems that sit on the critical path of model execution speed and that directly influence what the system can become
A remote‑first working model, with regular overlap with French working hours and monthly Paris weeks for engineering depth, alignment, and time together
Compensation aligned with top technical profiles in the Paris AI market, including meaningful equity