Site Reliability Engineer
Our Aviation organization is looking for an experienced SRE who will be responsible for designing and building systems, tooling, and processes to provide an extensible, scalable, and observable platform.
They strive to empower our development teams to own and manage their full application stack, thus minimizing bottlenecks and optimizing development velocity without compromising on reliability.
The First 30 Days:
Onboarding. Make a first-day push to infrastructure using Terraform
Meet the software engineering teams you’ll be working closely with, and learn the Aerodome platform codebase and how the software works together
Learn Flock Safety’s AWS infrastructure, security, SRE, and engineering architecture, tooling, and policies
KR1: Meet with Security, SRE, and compliance teams
KR2: Review available diagrams, policies, and any other relevant documents
Create and deploy production releases of the Aerodome platform
The First 60 Days:
Review and improve observability coverage within the Aerodome platform across metrics, logs, traces, and profiling, creating or updating dashboards and documents as needed
Contribute to project development in a supportive role, advising software engineering teams on best practices that drive software engineering decisions
Improve our CI pipeline, removing inefficiencies and speeding up developer feedback loops
Create or improve existing Helm charts to automate and structure
90 Days & Beyond:
Improve automation of the Aviation infrastructure
KR1: Contribute to the Platform Engineering efforts, building additional self-service ability to increase developer efficiency
KR2: Assist in building/integrating IoT management tooling (for drones, docks, radars)
KR3: Ensure self-healing and auto scaling strategies are implemented with high availab