Site Reliability Engineer (AI GPU Clusters)
PARIS, 75
1 day ago
Requirements
- Proactive and solution-oriented mindset
- Passion for automation and continuous improvement
- Strong collaboration and communication skills
- Ability to work independently and in a team
- Willingness to mentor and share knowledge
- Experience with Go, Python or Rust
- Strong scripting skills (Bash, Python)
- Hands-on experience with Linux systems (Ubuntu/Debian)
- Hands-on experience with GPU & HPC infrastructure
- Knowledge of networking (TCP/IP, DNS, BGP, load-balancing, IPv6, etc.)
- Familiarity with monitoring and logging tools (Prometheus, Grafana, Elastic, etc.)
- Comfortable with Infrastructure-as-Code (Ansible, Salt, AWX, etc.)
- Experience managing relational databases (PostgreSQL)
- Understanding of CI/CD pipelines (GitLab)
- Comfortable with English (written and spoken)
What the job involves
- You will join a newly formed team dedicated to building and operating Scaleway’s future AI infrastructure
- As part of this group, you will design, maintain, and scale core systems and observability tools, partner with product teams, and ensure the reliability and performance of AI services across Scaleway
- Build a large AI infrastructure with monitoring, diagnosis, and remediation of production incidents
- Troubleshoot high-impact production issues in collaboration with other engineering teams
- Participate in an on-call rotation to handle incidents and ensure service continuity
- Implement and maintain observability solutions to monitor AI infrastructure and application health
- Contribute to AI infrastructure lifecycle management across different environments and countries
- Promote and apply best practices for stability, resiliency, scalability, and security
- Maintain clear technical documentation for tools and procedures
- Contribute to system and tool evolution based on production feedback
- Collaborate closely with development teams to ensure infrastructure readiness
- Participate in team rituals and knowledge-sharing initiatives
Company
Scaleway
Publishing platform
WHATJOBS