Site Reliability Engineer (Data Platform)

PARIS, 75

20 hours ago

Requirements

Strong experience with Kubernetes in production environments
Experience with distributed data systems (or strong willingness to learn)
Solid understanding of SRE principles (monitoring, alerting, SLAs/SLOs)
Experience with Infrastructure as Code (Terraform or similar)
Familiarity with GitOps workflows
Experience with observability tools (Prometheus, Grafana, logging systems)
Comfortable working in cloud environments
Strong collaboration mindset and ability to work across teams
Fluent in English
(Desirable) Experience with Trino, Iceberg, or data lakehouse architectures
(Desirable) Experience with Ceph S3 or object storage systems
(Desirable) Knowledge of Kafka / Flink / Airflow
(Desirable) Experience with FinOps practices and cost optimization
(Desirable) Experience with Crossplane or platform self-service models
(Desirable) Programming skills (Python, Java, or Go)
(Desirable) Experience with multi-region / multi-DC architectures

What the job involves

Being an SRE at VeepeeTech means belonging to a cross-functional SRE community while being embedded in a product-oriented Data Platform team.
You will contribute to the reliability, scalability, and operability of critical data services by applying SRE and DevOps practices, while sharing knowledge across teams.
The Data Platform is evolving toward a modern lakehouse architecture deployed on VeepeeCloud (our on-prem platform), based on technologies such as Trino, Iceberg, and object storage, with strong ambitions around performance, cost efficiency, and platform ownership.
You will work in a distributed environment (France & Spain), within a team of 40–50 data professionals across engineering, analytics, data science, and governance.
You will play a key role in ensuring the reliability and scalability of this next-generation data platform, while supporting the transition from public cloud to hybrid/on-prem architectures.
Ensure reliability and performance of our data platform services (Trino, Iceberg, S3, Kafka, Flink)
Define and implement SRE best practices: SLIs/SLOs, error budgets, observability
Build and maintain monitoring, alerting, and incident response frameworks (Prometheus, Grafana, etc.)
Contribute to the migration from the public cloud data warehouse to the VeepeeCloud lakehouse stack
Support coexistence between cloud and on-prem systems and ensure consistency and reliability
Help design resilient architectures for ingestion, transformation, and serving layers
Operate and improve services running on Kubernetes (GKE/EKS & on-prem clusters)
Automate infrastructure provisioning using Terraform, Atlantis, and/or Crossplane
Improve GitOps workflows for platform deployment and configuration
Collaborate with teams to optimize compute/storage usage (Trino queries, BigQuery slots, etc.)
Build tools and dashboards to track cost, usage, and efficiency
Support the transition toward cost-efficient on-prem workloads
Improve self-service capabilities for data teams (e.g., provisioning Trino/Iceberg resources)
Help teams adopt best practices in reliability, observability, and deployment
Write clear technical documentation and runbooks
Contribute to Disaster Recovery Plan (DRP) definition and implementation
Ensure multi-DC resilience (FR1 / NL1) and data replication strategies
Participate in incident management and postmortems

Company

Deepstreamtech

Publishing platform

WHATJOBS
