Site Reliability Engineer (m/f/d)

SAINT OUEN SUR SEINE
Posted 13 hours ago

Key Responsibilities

As a Site Reliability Engineer within Advanced Analytics (DA3) in the Chief Data & AI Office at Allianz Partners, you will join the platform engineering team to own the reliability and operational health of the central engineering platform. You will define and maintain service level objectives, drive incident response at the infrastructure layer, and systematically eliminate operational toil through automation. You will work closely with Platform Engineers, Security Engineers, and incident‑response leads to ensure the platform meets its reliability commitments across production workloads spanning AI services, Java APIs, and frontend applications.

  • Define, instrument, and maintain SLOs and SLIs for platform components; own error budget tracking and produce regular reliability reports for senior leadership.
  • Serve on the on‑call rotation as the infrastructure escalation tier; lead incident response for cluster‑level, network‑level, and storage failures; chair blameless post‑incident reviews.
  • Implement and operate Kubernetes infrastructure (AKS): cluster lifecycle management, networking, resource quotas, autoscaling configuration, and multi‑tenancy patterns across product team namespaces.
  • Develop Infrastructure as Code (Terraform) to provision and manage Azure resources with consistency, auditability, and repeatable rollback capability.
  • Build and maintain observability infrastructure: Prometheus, Grafana, Azure Monitor, and Application Insights; own alerting rules, dashboards, and distributed tracing coverage across platform components.
  • Perform capacity planning and cost‑aware resource management: right‑size node pools, tune vertical and horizontal pod autoscalers, and identify resource waste across namespaces.
  • Identify and eliminate toil: automate repetitive operational tasks through scripting and tooling; measure and track toil reduction over time.
  • Maintain platform reliability procedures: rolling upgrades, backup and recovery testing, disaster recovery runbooks, and change freeze coordination.
  • Contribute to CI/CD pipelines and GitOps tooling (GitHub Actions, ArgoCD) from a reliability and deployment safety perspective; work with platform engineering on release gates and rollback mechanisms.
  • Collaborate with incident‑response leads on incident SLA targets and operational procedures; work with Security Engineers on infrastructure hardening and vulnerability remediation.

What You Bring

  • 5+ years of professional experience in site reliability engineering, DevOps, or platform engineering roles.
  • Strong Kubernetes experience: cluster operations, networking (Ingress, network policies), storage, autoscaling, and hands‑on troubleshooting across production environments.
  • Solid Infrastructure as Code experience with Terraform; familiarity with Bicep or ARM templates is a plus.
  • Production experience with Azure cloud services: AKS, ACR, Key Vault, Azure Monitor, Application Insights, Virtual Networks, and Private Endpoints.
  • Strong observability experience: Prometheus, Grafana, centralized logging, alerting configuration, and distributed tracing instrumentation.
  • Working knowledge of SLO/SLI methodology: error budget principles, reliability target setting, and capacity planning.
  • Structured incident management experience: on‑call ownership, blameless post‑incident review, and runbook authorship.
  • Scripting and automation proficiency in Python or Bash for toil elimination and operational tooling.
  • Strong CI/CD experience: GitHub Actions and ArgoCD or equivalent GitOps tooling.

Ways of Working

  • Comfortable in agile, iterative delivery environments with personal ownership and accountability for platform reliability.
  • Clear communicator across global, cross‑functional stakeholders; able to translate technical reliability metrics into business impact for non‑technical audiences.
  • Proactive learner with pragmatic adoption of AI‑assisted developer tools (e.g., GitHub Copilot, Claude Code) to improve automation coverage and delivery velocity.

Nice to Have

  • Kubernetes certifications: CKA or CKAD.
  • Experience supporting AI or ML infrastructure workloads: GPU scheduling, model serving platforms, or inference pipeline operations.
  • Exposure to chaos engineering practices and fault injection testing.
  • FinOps experience: reserved capacity planning, resource right‑sizing programs, and cost attribution per team or workload.
  • Service mesh experience (Istio, Linkerd) for traffic management and reliability patterns.
  • Experience in regulated industries (insurance, finance, healthcare) where auditability, change traceability, and secure‑by‑default operations are standard practice.

We welcome applications regardless of ethnicity or cultural background, age, gender, nationality, religion, social class, disability, or sexual orientation, or any other characteristics protected under applicable local laws and regulations.

Company
Allianz Partners
Publishing platform
WHATJOBS