Site Reliability Engineer (SRE) - AI GPU Clusters
OUR STORY
Since 1999, Scaleway has designed secure and sustainable infrastructures for ambitious companies. In 2015, we shifted to cloud computing, becoming a leading European cloud provider. With AI investments, we are building sovereign AI alternatives. Our products serve a diverse range of customers across France and globally.
WHY WE NEED YOU?
Our growth drives us to strengthen our SRE team to support and scale production environments.
YOUR DAILY ROUTINE
- Build a large AI infrastructure with monitoring, diagnosis, and remediation.
- Troubleshoot high-impact production issues with engineering teams.
- Participate in on‑call rotation to handle incidents and ensure service continuity.
- Implement and maintain observability solutions for AI infrastructure and application health.
- Contribute to AI infrastructure lifecycle management across environments and countries.
- Promote and apply best practices in stability, resiliency, scalability, and security.
- Maintain clear technical documentation for tools and procedures.
- Collaborate closely with development teams to ensure infrastructure readiness.
- Participate in team rituals and knowledge‑sharing initiatives.
ABOUT YOU
SOFT SKILLS
- Proactive and solution-oriented mindset.
- Passion for automation and continuous improvement.
- Strong collaboration and communication skills.
- Ability to work independently and in a team.
- Willingness to mentor and share knowledge.
HARD SKILLS
- Experience with Python, Go, or C++.
- Strong scripting skills (Bash, Python).
- Hands-on experience with Linux systems (Ubuntu/Debian).
- Preferred hands-on experience with GPU & HPC infrastructure.
- Knowledge of networking (TCP/IP, DNS, BGP, load-balancing, IPv6, etc.).
- Familiarity with monitoring and logging tools (Prometheus, Grafana, Elastic, etc.).
- Comfortable with Infrastructure-as-Code (Ansible, Salt, AWX, etc.).
- Experience managing relational databases (MariaDB).
- Understanding of CI/CD pipelines (GitLab).
- Comfortable with English (written and spoken).
BENEFITS AT SCALEWAY
Hybrid work: up to 3 days remote per week.
Offices: spacious, dynamic with outdoor spaces and bike parking.
Dining: chef-served healthy meals at headquarters; breakfast available all sites; Swile card for lunches at regional sites.
Well-being commitments: access to gym, daycare places, discounted services.
International environment: English widely spoken, diverse nationalities.
Career & Mobility: managers value internal mobility; opportunities to transition within Iliad Group.
INCLUSION STATEMENT
At Scaleway, we are committed to building an inclusive and respectful workplace where everyone has a fair opportunity to thrive. All applications are considered with care, regardless of age, gender, sexual orientation, ethnicity, religion, disability, or any other characteristic.
#J-18808-Ljbffr