Site Reliability Engineer (SRE) - AI GPU Clusters
WHY WE NEED YOU?
Our growth is driving us to strengthen our SRE team to support and scale our production environments. Your mission will be to build and maintain reliable, observable, and secure infrastructure to ensure optimal service availability for customers worldwide.
YOUR FUTURE TEAM
We work in a collaborative and international environment where the diversity of Scalers, combined with a spirit of sharing, helps bring new projects to life every day, advancing our ambitions together. You will join a newly formed team dedicated to building and operating Scaleway’s future AI infrastructure. As part of this group, you will design, maintain, and scale core systems and observability tools, partner with product teams, and ensure the reliability and performance of AI services across Scaleway.
YOUR DAILY ROUTINE
- Build a large AI infrastructure with monitoring, diagnosis, and remediation of production incidents
- Troubleshoot high-impact production issues in collaboration with other engineering teams
- Participate in an on‑call rotation to handle incidents and ensure service continuity
- Implement and maintain observability solutions to monitor AI infrastructure and application health
- Contribute to AI infrastructure lifecycle management across different environments and countries
- Promote and apply best practices in terms of stability, resiliency, scalability, and security
- Maintain clear technical documentation for tools and procedures
- Contribute to system and tool evolution based on production feedback
- Collaborate closely with development teams to ensure infrastructure readiness
- Participate in team rituals and knowledge‑sharing initiatives
ABOUT YOU
SOFTSKILLS
- Proactive and solution‑oriented mindset
- Passion for automation and continuous improvement
- Strong collaboration and communication skills
- Ability to work independently and in a team
- Willingness to mentor and share knowledge
HARDSKILLS
- Experience with Go, Python or Rust
- Strong scripting skills (Bash, Python)
- Hands‑on experience with Linux systems (Ubuntu/Debian)
- Preferred hands‑on experience with GPU & HPC infrastructure
- Knowledge of networking (TCP/IP, DNS, BGP, load‑balancing, IPv6, etc.)
- Familiarity with monitoring and logging tools (Prometheus, Grafana, Elastic, etc.)
- Comfortable with Infrastructure‑as‑Code (Ansible, Salt, AWX, etc.)
- Experience managing relational databases (PostgreSQL)
- Understanding of CI/CD pipelines (GitLab)
- Comfortable with English (written and spoken)
BENEFITS
- Hybrid work: up to 3 days of remote work per week
- Offices: spacious, dynamic workspaces with outdoor spaces and bike parking facilities, located near public transport
- Dining: healthy meal service at headquarters and breakfast year‑round; Swile card for lunches at regional sites
- Well‑being commitments: access to gym, daycare places, or discounted services for caring services
- International environment: dozens of nationalities, English widely spoken
- Career & mobility: internal mobility opportunities within the Iliad Group
At Scaleway, we are committed to building an inclusive and respectful workplace where everyone has a fair opportunity to thrive. All applications are considered with care, regardless of age, gender, sexual orientation, ethnic or social background, religion, disability, or any other characteristic. We believe great ideas come from everywhere, and every candidate is encouraged to apply.
#J-18808-Ljbffr