Site Reliability Engineer (SRE) - AI GPU Clusters
ROUEN, 76
1 day ago
Why do we need you?
Our growth is driving us to strengthen our SRE team to support and scale our production environments.
Your mission will be to build and maintain reliable, observable, and secure infrastructure in order to ensure optimal service availability for our customers around the world.
Responsibilities
- Build a large-scale AI infrastructure, including the monitoring, diagnosis, and remediation of production incidents
- Troubleshoot high-impact production issues in collaboration with other engineering teams
- Participate in an on-call rotation to handle incidents and ensure service continuity
- Implement and maintain observability solutions to monitor AI infrastructure and application health
- Contribute to AI infrastructure lifecycle management across different environments and countries
- Promote and apply best practices for stability, resiliency, scalability, and security
- Maintain clear technical documentation for tools and procedures
- Contribute to system and tool evolution based on production feedback
- Collaborate closely with development teams to ensure infrastructure readiness and participate in team rituals and knowledge-sharing initiatives
Soft Skills
- Proactive and solution-oriented mindset
- Passion for automation and continuous improvement
- Strong collaboration and communication skills
- Ability to work independently and in a team
- Willingness to mentor and share knowledge
Hard Skills
- Experience with Go or Python
- Strong scripting skills (Bash, Python)
- Hands-on experience with Linux systems (Ubuntu/Debian)
- Hands-on experience with GPU and HPC infrastructure (preferred)
- Knowledge of networking (TCP/IP, DNS, BGP, load-balancing, IPv6, etc.)
- Familiarity with monitoring and logging tools (Prometheus, Grafana, Elastic, etc.)
- Comfortable with Infrastructure-as-Code (Ansible, Salt, AWX, etc.)
- Experience managing relational databases (MariaDB)
- Understanding of CI/CD pipelines (GitLab)
- Comfortable with English (written and spoken)
Benefits
- Hybrid work: We offer up to 3 days of remote work per week.
- Offices: Our offices are spacious, dynamic workspaces with bold design, conveniently located near public transport. Most of our offices feature outdoor spaces (terraces) and bike parking facilities.
- Dining: Our chef provides a healthy meal service at the headquarters, and breakfast is available across all our sites year-round. Scalers working from regional sites enjoy a Swile card for lunches.
- Well-being commitments: Whether it’s access to a gym, daycare places, or discounted care services, Scaleway is committed to helping Scalers maintain a balanced life.
- International environment: With dozens of nationalities, Scaleway offers a stimulating environment where English is as widely spoken as French.
- Career & mobility: Our managers value internal mobility, and opportunities to transition to other entities within the Iliad Group are accessible to all Scalers.
EEO Statement
At Scaleway, we are committed to building an inclusive and respectful workplace where everyone has a fair opportunity to thrive.
All applications are considered with care, regardless of age, gender, sexual orientation, ethnic or social background, religion, disability, or any other characteristic.
We believe great ideas come from everywhere.
Company
Scaleway
Publishing platform
WHATJOBS