Site Reliability Engineer (SRE) - AI GPU Clusters

BORDEAUX, 33

1 day ago

WHY DO WE NEED YOU?

Our growth is driving us to strengthen our SRE team to support and scale our production environments. Your mission will be to build and maintain reliable, observable, and secure infrastructure to ensure optimal service availability for customers worldwide.

YOUR FUTURE TEAM

We work in a collaborative and international environment where the diversity of Scalers, combined with a spirit of sharing, helps bring new projects to life every day, advancing our ambitions together. You will join a newly formed team dedicated to building and operating Scaleway’s future AI infrastructure. As part of this group, you will design, maintain, and scale core systems and observability tools, partner with product teams, and ensure the reliability and performance of AI services across Scaleway.

YOUR DAILY ROUTINE

Build a large-scale AI infrastructure, including the monitoring, diagnosis, and remediation of production incidents
Troubleshoot high-impact production issues in collaboration with other engineering teams
Participate in an on‑call rotation to handle incidents and ensure service continuity
Implement and maintain observability solutions to monitor AI infrastructure and application health
Contribute to AI infrastructure lifecycle management across different environments and countries
Promote and apply best practices for stability, resilience, scalability, and security
Maintain clear technical documentation for tools and procedures
Contribute to system and tool evolution based on production feedback
Collaborate closely with development teams to ensure infrastructure readiness
Participate in team rituals and knowledge‑sharing initiatives

ABOUT YOU

SOFT SKILLS

Proactive and solution‑oriented mindset
Passion for automation and continuous improvement
Strong collaboration and communication skills
Ability to work independently and in a team
Willingness to mentor and share knowledge

HARD SKILLS

Experience with Go, Python or Rust
Strong scripting skills (Bash, Python)
Hands‑on experience with Linux systems (Ubuntu/Debian)
Hands‑on experience with GPU and HPC infrastructure (preferred)
Knowledge of networking (TCP/IP, DNS, BGP, load‑balancing, IPv6, etc.)
Familiarity with monitoring and logging tools (Prometheus, Grafana, Elastic, etc.)
Comfortable with Infrastructure‑as‑Code (Ansible, Salt, AWX, etc.)
Experience managing relational databases (PostgreSQL)
Understanding of CI/CD pipelines (GitLab)
Comfortable with English (written and spoken)

BENEFITS

Hybrid work: up to 3 days of remote work per week
Offices: spacious, dynamic workspaces with outdoor spaces and bike parking facilities, located near public transport
Dining: healthy meal service at headquarters and breakfast year‑round; Swile card for lunches at regional sites
Well‑being commitments: access to a gym, daycare places, or discounted care services
International environment: dozens of nationalities, English widely spoken
Career & mobility: internal mobility opportunities within the Iliad Group

At Scaleway, we are committed to building an inclusive and respectful workplace where everyone has a fair opportunity to thrive. All applications are considered with care, regardless of age, gender, sexual orientation, ethnic or social background, religion, disability, or any other characteristic. We believe great ideas come from everywhere, and every candidate is encouraged to apply.

Company

Scaleway

Publishing platform

WHATJOBS
