SRE: Cloud Reliability & Automation Leader
Job Description
We are looking for a Site Reliability Engineer to strengthen our Infrastructure & Security department and help us scale our internal and customer-facing platforms.

In this role, you will contribute to both run and build activities: operating production environments, improving reliability, leading technical transformation initiatives, and designing modern, scalable, secure, and observable infrastructure.

You will work across cloud and on-premise environments, collaborate closely with Engineering, QA, Data, Security, and Product teams, and help improve our developer experience, operational excellence, production mindset, and infrastructure maturity.

Responsibilities
- Operate, maintain, and improve production and internal infrastructure environments across cloud and on-premise platforms.
- Contribute to both run activities, such as incident response, monitoring, support, troubleshooting, maintenance, and reliability improvements, and build activities, such as architecture evolution, automation, migration, tooling, and platform transformation.
- Help design, build, and maintain resilient, scalable, secure, observable, and cost-efficient infrastructure.
- Lead or contribute to technical migrations, modernization projects, and architecture transformation initiatives.
- Strengthen operational processes: incident management, change management, backup and restore, disaster recovery, on-call practices, documentation, and post-incident reviews.
- Improve observability across systems, services, and infrastructure through metrics, logs, traces, dashboards, alerting, and SLOs.
- Promote a strong production mindset across teams, with a focus on reliability, performance, security, customer impact, and operational simplicity.
- Collaborate closely with Engineering, QA, Data, Security, and Product teams to improve delivery quality and platform reliability.
- Contribute to Developer Experience by improving tooling, CI/CD workflows, infrastructure automation, environments, deployment processes, and self-service capabilities.
- Support FinOps practices by monitoring costs, optimizing infrastructure usage, and helping teams make cost-aware decisions.
- Build automation and tooling to reduce manual work, improve repeatability, and make infrastructure easier to operate.
- Participate in technical architecture discussions and provide guidance to infrastructure and engineering teams.
- Contribute to infrastructure roadmaps, technical standards, best practices, and long-term platform strategy.
- Maintain strong documentation and knowledge sharing practices.

What Success Looks Like
- Built strong trust with Infrastructure, Engineering, Security, and Product teams.
- Demonstrated strong ownership of production systems and contributed to improving reliability, stability, and operational maturity.
- Helped improve observability through better dashboards, alerts, metrics, logs, traces, or SLOs.
- Contributed to reducing operational toil through automation, documentation, tooling, or process improvements.
- Helped improve incident response, post-incident reviews, change management, or on-call practices.
- Contributed to one or more meaningful build initiatives: migration, architecture improvement, platform modernization, CI/CD improvement, internal tooling, or developer experience enhancement.
- Shown strong ability to work across both cloud and on-premise environments.
- Contributed to making infrastructure more secure, scalable, performant, cost-efficient, and easier to operate.
- Helped Engineering teams improve delivery quality and production readiness.
- Recognized as a collaborative, structured, pragmatic, and reliable technical partner.

Requirements
- Strong experience in a Site Reliability Engineering or similar role.
- Solid experience operating production environments with high reliability, availability, and performance expectations.
- Good understanding of cloud infrastructure, ideally AWS.
- Strong knowledge of systems, networking, DNS, load balancing, security fundamentals, and infrastructure troubleshooting.
- Experience with infrastructure as code, automation, configuration management, and CI/CD pipelines.
- Experience with observability tools and practices: metrics, logs, traces, alerting, dashboards, SLOs, and SLIs.
- Good understanding of containers, orchestration, service discovery, secrets management, and modern platform architecture.
- Experience with incident management, post-incident reviews, backup and restore, disaster recovery, capacity planning, and operational processes.
- Ability to write scripts, automation, or internal tooling to reduce manual work and improve reliability.
- Understanding of security best practices for infrastructure, cloud, identity, secrets, network segmentation, and production access.
- Interest or experience in FinOps, cost optimization, performance optimization, and infrastructure efficiency.
- Experience with developer experience, internal platforms, self-service tooling, or platform engineering is a strong plus.
- Strong production mindset: reliability, customer impact, resilience, security, and operational excellence.
- Excellent communication skills with both technical and non-technical stakeholders.
- Structured, rigorous, autonomous, and pragmatic approach.
- Ability to lead technical initiatives, migrations, or architecture discussions.
- Collaborative, curious, and committed to continuous improvement.

Hiring Process
- Discovery call with the hiring team (30 minutes)
- Interview with the VP Infrastructure & Security (1 hour)
- AssessFirst personality assessment
- Interview with the Infrastructure team (1 hour)
- Interview with an Engineering Manager (1 hour)
- Interview with the Head of HR (30 minutes)