Site Reliability Engineer (Data Platform)
PARIS, 75
20 hours ago
Requirements
- Strong experience with Kubernetes in production environments
- Experience with distributed data systems (or strong willingness to learn)
- Solid understanding of SRE principles (monitoring, alerting, SLAs/SLOs)
- Experience with Infrastructure as Code (Terraform or similar)
- Familiarity with GitOps workflows
- Experience with observability tools (Prometheus, Grafana, logging systems)
- Comfortable working in cloud environments
- Strong collaboration mindset and ability to work across teams
- Fluent in English
- (Desirable) Experience with Trino, Iceberg, or data lakehouse architectures
- (Desirable) Experience with Ceph S3 or object storage systems
- (Desirable) Knowledge of Kafka / Flink / Airflow
- (Desirable) Experience with FinOps practices and cost optimization
- (Desirable) Experience with Crossplane or platform self-service models
- (Desirable) Programming skills (Python, Java, or Go)
- (Desirable) Experience with multi-region / multi-DC architectures
What the job involves
- Being an SRE at VeepeeTech means being part of a cross-team SRE community while working within a product-oriented Data Platform team
- You will contribute to the reliability, scalability, and operability of critical data services by applying SRE and DevOps practices, while sharing knowledge across teams
- The Data Platform is currently evolving toward a modern lakehouse architecture deployed on VeepeeCloud (our on-prem platform), based on technologies such as Trino, Iceberg, and object storage, with strong ambitions around performance, cost efficiency, and platform ownership
- You will work in a distributed environment (France & Spain), within a team of 40–50 data professionals across engineering, analytics, data science, and governance
- You will play a key role in ensuring the reliability and scalability of this next-generation data platform, while supporting the transition from public cloud to hybrid/on-prem architectures
- Ensure reliability and performance of our data platform services (Trino, Iceberg, S3, Kafka, Flink)
- Define and implement SRE best practices: SLIs/SLOs, error budgets, observability
- Build and maintain monitoring, alerting, and incident response frameworks (Prometheus, Grafana, etc.)
- Contribute to the migration from a public cloud data warehouse to the VeepeeCloud lakehouse stack
- Support the coexistence of cloud and on-prem systems, ensuring consistency and reliability across both
- Help design resilient architectures for ingestion, transformation, and serving layers
- Operate and improve services running on Kubernetes (GKE/EKS & on-prem clusters)
- Automate infrastructure provisioning using Terraform, Atlantis, and/or Crossplane
- Improve GitOps workflows for platform deployment and configuration
- Collaborate with teams to optimize compute/storage usage (Trino queries, BigQuery slots, etc.)
- Build tools and dashboards to track cost, usage, and efficiency
- Support the transition toward cost-efficient on-prem workloads
- Improve self-service capabilities for data teams (e.g., provisioning Trino/Iceberg resources)
- Help teams adopt best practices in reliability, observability, and deployment
- Write clear technical documentation and runbooks
- Contribute to Disaster Recovery Plan (DRP) definition and implementation
- Ensure multi-DC resilience (FR1 / NL1) and define data replication strategies
- Participate in incident management and postmortems
Company
Deepstreamtech
Publishing platform
WHATJOBS