Manager, Site Reliability Engineering

Heredia, Costa Rica

ServiceNow

ServiceNow allows employees to work the way they want to, not how software dictates they have to. And customers can get what they need, when they need it.

Company Description

It all started in sunny San Diego, California in 2004 when a visionary engineer, Fred Luddy, saw the potential to transform how we work. Fast forward to today — ServiceNow stands as a global market leader, bringing innovative AI-enhanced technology to over 8,100 customers, including 85% of the Fortune 500®. Our intelligent cloud-based platform seamlessly connects people, systems, and processes to empower organizations to find smarter, faster, and better ways to work. But this is just the beginning of our journey. Join us as we pursue our purpose to make the world work better for everyone.

Job Description

Position Overview:

We are seeking a highly skilled Technical SRE Manager to lead our Site Reliability Engineering (SRE) team. This role is pivotal in ensuring the scalability, availability, and reliability of our critical systems while driving automation, observability, and operational excellence. You will build and lead a team of NextGenOps engineers and SREs, collaborate with engineering and operations teams, and implement AI/ML-driven strategies to enhance predictive analytics, proactive issue resolution, and self-healing systems.

This is a flexible work schedule position based in Costa Rica.

The NextGenOps team is a forward-thinking, AI-powered Site Reliability Engineering (SRE) group at the forefront of revolutionizing how we approach operations and infrastructure. Our team is dedicated to building resilient, scalable, and self-healing systems using cutting-edge AI/ML-driven technologies. We are not just an operations team; we are engineers on a mission to push the boundaries of automation, observability, and operational excellence. By combining AI with our deep expertise in cloud-native platforms, DevOps, and SRE best practices, we are shaping the future of how technology scales, evolves, and self-heals. If you're passionate about innovation and making a real impact through intelligent, data-driven solutions, you'll thrive in our dynamic, collaborative, and engineering-centric culture.

 

Qualifications

Key Responsibilities:

  • Lead and mentor a team of AI/ML-powered SREs, fostering a culture of automation, observability, and proactive issue resolution.
  • Define and execute AI/ML-driven SRE strategies for incident prediction, anomaly detection, and root cause analysis.
  • Champion AI-powered observability practices and advocate for self-healing architectures with machine learning automation.
  • Develop and enforce SLOs, SLIs, and SLAs using AI-driven insights.
  • Oversee AI-powered incident management, real-time anomaly detection, and auto-remediation.
  • Drive AI-driven automation for issue resolution, anomaly detection, and system fine-tuning.
  • Implement predictive maintenance and auto-remediation through machine learning models.
  • Ensure reliable deployments with AI-assisted rollouts, blue-green deployments, and canary releases.
  • Optimize costs through AI-powered resource allocation and workload balancing.
  • Ensure security and compliance with AI-driven event detection and threat mitigation.
  • Implement chaos engineering with AI-driven failure analysis to strengthen system resilience.
  • Collaborate with security teams to enforce AI-assisted threat detection and automated compliance monitoring.
  • Lead capacity planning and performance optimization using AI/ML for dynamic scaling and resource forecasting.
  • Implement intelligent monitoring, logging, and alerting using tools like Prometheus and Grafana, augmented with AI-driven analysis.
  • Optimize CI/CD pipelines with AI-driven risk assessments and automated rollbacks.

To be successful in this role you have:

  • Experience leveraging AI, or thinking critically about how to integrate it, in work processes, decision-making, or problem-solving. This may include using AI-powered tools, automating workflows, analyzing AI-driven insights, or exploring AI’s potential impact on the function or industry.
  • 10+ years in SRE, DevOps, or infrastructure engineering, including 3+ years in a leadership role.
  • Proven experience integrating AI/ML for observability, automation, and incident response.
  • In-depth understanding of monitoring tools (LogicMonitor, Catchpoint, Redgate, ScienceLogic).
  • Demonstrated expertise in implementing and optimizing OpenTelemetry (OTel) for comprehensive observability across endpoints, cloud environments, infrastructure, and SaaS applications, enabling proactive monitoring, tracing, and performance insights.
  • Proficiency in scripting languages (Python, Go, Bash) and infrastructure tools (Terraform, Ansible) with AI/ML integration.
  • In-depth knowledge of observability and data pipeline tools (Datadog, Prometheus, Splunk, AI-driven platforms like Cisco FSO).
  • Extensive experience in incident management and on-call rotations, with AI-enhanced predictive approaches.
  • Experience with CI/CD pipelines, GitOps, and infrastructure-as-code (IaC).

Preferred Qualifications:

  • Experience with data platforms or enterprise automation tools (e.g., ServiceNow, Salesforce, SAP).
  • Knowledge of AI/ML-based data automation technologies.
  • Familiarity with regulatory requirements for data privacy, such as GDPR and CCPA.
  • A passion for leveraging emerging technologies to drive business transformation.
  • A customer-first mentality with an ability to translate user feedback into actionable product features.
  • Experience in leading cross-functional teams in a matrixed organization.
  • Strong communication and leadership skills, with the ability to engage and influence stakeholders across technical and non-technical teams.
  • Ability to thrive in a rapidly evolving industry and adapt to new challenges and opportunities.

FD21

Not sure if you meet every qualification? We still encourage you to apply! We value inclusivity, welcoming candidates from diverse backgrounds, including non-traditional paths. Unique experiences enrich our team, and the willingness to dream big makes you an exceptional candidate!

Additional Information

Work Personas

We approach our distributed world of work with flexibility and trust. Work personas (flexible, remote, or required in office) are categories that are assigned to ServiceNow employees depending on the nature of their work.

Equal Opportunity Employer

ServiceNow is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, creed, religion, sex, sexual orientation, national origin or nationality, ancestry, age, disability, gender identity or expression, marital status, veteran status, or any other category protected by law. In addition, all qualified applicants with arrest or conviction records will be considered for employment in accordance with legal requirements. 

Accommodations

We strive to create an accessible and inclusive experience for all candidates. If you require a reasonable accommodation to complete any part of the application process, or are unable to use this online application and need an alternative method to apply, please contact globaltalentss@servicenow.com for assistance. 

Export Control Regulations

For positions requiring access to controlled technology subject to export control regulations, including the U.S. Export Administration Regulations (EAR), ServiceNow may be required to obtain export control approval from government authorities for certain individuals. All employment is contingent upon ServiceNow obtaining any export license or other approval that may be required by relevant export control authorities. 

From Fortune. ©2024 Fortune Media IP Limited. All rights reserved. Used under license. 

Tags: Analytics Ansible Automation Bash CCPA CI/CD Cloud Compliance DevOps GDPR Grafana Incident response Machine Learning Monitoring Privacy Prometheus Python Risk assessment SaaS SAP Scripting SLAs SLOs Splunk Terraform Threat detection

Perks/benefits: Flex hours

Region: North America
Country: Costa Rica
