Site Reliability Engineer (SRE) - AI GPU Clusters

PARIS, 75
Posted 1 day ago

OUR STORY

Since 1999, Scaleway has designed secure and sustainable infrastructures for ambitious companies. In 2015, we shifted to cloud computing, becoming a leading European cloud provider. With AI investments, we are building sovereign AI alternatives. Our products serve a diverse range of customers across France and globally.

WHY WE NEED YOU

Our growth drives us to strengthen our SRE team to support and scale production environments.

YOUR DAILY ROUTINE

  • Build large-scale AI infrastructure with built-in monitoring, diagnosis, and remediation.
  • Troubleshoot high-impact production issues with engineering teams.
  • Participate in the on-call rotation to handle incidents and ensure service continuity.
  • Implement and maintain observability solutions for AI infrastructure and application health.
  • Contribute to AI infrastructure lifecycle management across environments and countries.
  • Promote and apply best practices in stability, resiliency, scalability, and security.
  • Maintain clear technical documentation for tools and procedures.
  • Collaborate closely with development teams to ensure infrastructure readiness.
  • Participate in team rituals and knowledge‑sharing initiatives.

ABOUT YOU

SOFT SKILLS

  • Proactive and solution-oriented mindset.
  • Passion for automation and continuous improvement.
  • Strong collaboration and communication skills.
  • Ability to work independently and in a team.
  • Willingness to mentor and share knowledge.

HARD SKILLS

  • Experience with Python, Go, or C++.
  • Strong scripting skills (Bash, Python).
  • Hands-on experience with Linux systems (Ubuntu/Debian).
  • Hands-on experience with GPU and HPC infrastructure (preferred).
  • Knowledge of networking (TCP/IP, DNS, BGP, load-balancing, IPv6, etc.).
  • Familiarity with monitoring and logging tools (Prometheus, Grafana, Elastic, etc.).
  • Comfortable with Infrastructure-as-Code (Ansible, Salt, AWX, etc.).
  • Experience managing relational databases (MariaDB).
  • Understanding of CI/CD pipelines (GitLab).
  • Comfortable with English (written and spoken).

BENEFITS AT SCALEWAY

Hybrid work: up to 3 days remote per week.

Offices: spacious and dynamic, with outdoor spaces and bike parking.

Dining: chef-served healthy meals at headquarters; breakfast available at all sites; Swile card for lunches at regional sites.

Well-being commitments: access to gym, daycare places, discounted services.

International environment: English widely spoken, diverse nationalities.

Career & Mobility: managers value internal mobility; opportunities to transition within Iliad Group.

INCLUSION STATEMENT

At Scaleway, we are committed to building an inclusive and respectful workplace where everyone has a fair opportunity to thrive. All applications are considered with care, regardless of age, gender, sexual orientation, ethnicity, religion, disability, or any other characteristic.

Company: Scaleway
Publishing platform: WHATJOBS