Observability Tech Lead
PARIS, 75
13 hours ago
Responsibilities
- In 2023 we made a decisive move: we replaced our observability-as-a-service provider with a fully self-hosted observability stack, giving us complete control over cost, data residency, and the developer experience around telemetry
- Today our stack spans the full LGTM suite (Grafana, Mimir, Loki, and Tempo) alongside VictoriaMetrics, self-hosted Sentry, Grafana Alloy as our telemetry collector, and OpenTelemetry as the instrumentation standard
- We use Pyrra for SLO tracking and are building toward a unified service health dashboard powered by error budgets and burn-rate alerting
- Telemetry is the backbone of how we operate a bank at scale — ingesting over 100 million samples, serving 400+ services, and capturing end-to-end traces from clients through services to system dependencies
- Every trade, every card payment depends on our ability to see, measure, and respond to what’s happening in production. We’ve proven the architecture works
- Now we’re building a dedicated in-house observability team to take it to the next level: stabilise and harden the platform, drive down cost-per-signal, and build the golden path for observability — where 100% of components ship with production-grade telemetry because the best thing to do is the easiest thing to do
- Build and evolve the observability platform: Design and operate large-scale telemetry pipelines while continuously improving core components with a strong focus on automation, reliability, and developer experience
- Build for scale, design for cost: Architect high-throughput telemetry systems with sampling strategies, data tiering, and retention policies that balance signal fidelity with infrastructure cost at scale
- Make production observable by default: Define and implement observability and reliability standards — SLOs, error budgets, and low-noise alerting — and actively support engineering teams in adopting them, making doing the right thing effortless
- Own the platform end to end: Participate in the on-call rotation for the observability platform, ensuring full end-to-end ownership of the systems you build and operate
- Own the direction and drive it forward: Define long-term observability direction, drive cross-team initiatives from kickoff to delivery, and align observability strategy with broader engineering reliability and business goals
Qualifications
- Proven ability to design and operate high-throughput telemetry pipelines across distributed, multi-cloud environments
- Deep hands‑on expertise with the observability stack — Prometheus, OpenTelemetry, Grafana, or equivalent at scale
- Hands‑on experience with Mimir, Loki, and Tempo architectures is a strong plus
- Strong command of SLO-based reliability practices — error budgets, burn‑rate alerting, and incident response tooling
- A track record of turning observability best practices into opinionated standards that engineering teams actually adopt
- Ability to contribute to architectural decisions and clearly communicate trade-offs to both engineers and leadership
- 5+ years of experience in observability, platform engineering, or a related SRE/infrastructure discipline
- We are hiring from senior to staff level, so whether you have a strong foundation and are ready for more ownership or you have been leading observability strategy for large‑scale systems for many years, we would love to hear from you
- Cloud‑native in your DNA: hands‑on with Kubernetes, Terraform, and running production workloads on AWS, GCP, or Azure
- The ability to work in a flexible hybrid setup, with 2‑3 days a week in the office
- Experience driving cross‑team technical initiatives end‑to‑end, from ambiguous problem to shipped solution
Company
Trade Republic
Publishing platform
WHATJOBS