Cloud Site Reliability Staff Engineer
Ottawa, ON, Canada
Full Time, Senior-level / Expert, CAD 122K - 162K
Barracuda Networks Inc.
Barracuda Networks is the worldwide leader in Email Protection, Application Protection, Network Security, and Data Protection Solutions.
- Application Infrastructure Design: Engage with internal customers to understand application design and cloud infrastructure needs, focusing on scalability, security, and reliability
- Infrastructure Automation: Create and design templates, tools, and accelerators for deployment infrastructure to support development teams
- Architectural Leadership: Lead architectural decisions and approve major system design changes, implementing contemporary architectural patterns
- Platform Development: Design and develop self-service platforms for Product Engineering teams
- Service Level Management: Define, implement, and track SLIs, SLOs, and SLAs across services
- Incident Management: Lead incident response processes and conduct post-incident learning reviews
- Disaster Recovery: Develop and maintain disaster recovery and business continuity plans
- Technical Design: Plan and implement non-functional requirements including security, performance, deployment frequency, and monitoring
- Solution Architecture: Oversee architecture snapshots, solution design, prototyping, and code reviews
- Technology Stack Implementation: Drive modern solutions using AWS, Kubernetes, GitHub Actions, Jenkins, Terraform, Pulumi, and other current technologies
- Data Infrastructure: Build support infrastructure for global data pipeline and storage using Databricks, Spark, and ELK stack
- Deployment Automation: Lead initiatives to convert manual deployments to automated processes
- Observability Systems: Build and enhance monitoring and reliability systems
- On-Call Duties: Participate in on-call rotation to ensure 24/7 system reliability
- Team Development: Mentor junior team members and foster a positive team culture
- Technical Expertise: 10+ years of hands-on infrastructure design experience, including 5+ years of cloud development and 3+ years in SRE/DevOps roles
- Cloud Infrastructure: Deep expertise in AWS cloud infrastructure development, security, and operations with proven success in large-scale production environments
- Infrastructure as Code: Extensive experience with Terraform, CloudFormation, Pulumi, and Crossplane for cloud infrastructure automation
- CI/CD & Automation: Strong background with GitHub, GitHub Actions, Jenkins, Packer, Ansible, and Puppet
- Deployment Patterns: Expertise in blue/green, canary, rolling deployments, and draining strategies
- Container Orchestration: Comprehensive experience with Docker, Kubernetes, and EKS in AWS environments
- Programming: Strong coding abilities in Python, Go, Ruby, or similar languages
- Operating Systems: Advanced Linux knowledge including system internals
- Observability: Extensive experience with New Relic, Elastic APM, CloudWatch, Prometheus, and Grafana...
- Data Engineering: Experience with Databricks, Apache Spark, Kafka, and DataStage
- Problem Solving: Strong systematic debugging and troubleshooting capabilities
- Communication: Excellent verbal and written communication skills
- Certifications: AWS certifications (Solutions Architect, DevOps) and Kubernetes certifications (CKA, CKAD, CKS) are a plus
Tags: Ansible Automation AWS CI/CD Cloud Databricks DevOps Docker ELK GitHub Grafana Incident response Jenkins Kafka Kubernetes Linux Monitoring Prometheus Prototyping Puppet Python Ruby SaaS SLAs SLOs Terraform XDR
Perks/benefits: Career development Equity / stock options