Director Incident Management

Bengaluru, India

Company Description

When you’re one of us, you get to run with the best. For decades, we’ve been helping marketers from the world’s top brands personalize experiences for millions of people with our cutting-edge technology, solutions and services. Epsilon’s best-in-class identity gives brands a clear, privacy-safe view of their customers, which they can use across our suite of digital media, messaging and loyalty solutions. We process 400+ billion consumer actions each day and hold many patents of proprietary technology, including real-time modeling languages and consumer privacy advancements. Thanks to the work of every employee, Epsilon India is now Great Place to Work-Certified™. Epsilon has also been consistently recognized as industry-leading by Forrester, Adweek and the MRC. Positioned at the core of Publicis Groupe, Epsilon is a global company with more than 8,000 employees around the world. For more information, visit epsilon.com/apac or our LinkedIn page.

Job Description

About the Role

We seek a seasoned, strategic leader with exceptional operational and technical acumen in product engineering to spearhead our incident management and SRE function across the Product Engineering organization. This high-pressure, 24/7 role demands a relentless focus on driving operational excellence, minimizing system downtime, and ensuring rapid incident resolution. As a key member of the leadership team, you will be instrumental in defining and executing our incident management strategy, fostering a culture of reliability, and mitigating risks across our product portfolio.

What you’ll do:

Primary Responsibilities

Incident Management Strategy and Leadership:

Develop, implement, and continuously refine incident management strategies, policies, and procedures aligned with business objectives and regulatory requirements. Serve as the primary escalation point for all critical incidents, providing strategic direction and coordinating cross-functional response efforts.

  • Lead and manage the incident management team, providing direction, guidance, and mentorship.
  • Develop and implement incident management strategies, policies, and procedures to ensure rapid and effective incident resolution.
  • Serve as the primary point of contact for all major incidents, coordinating response efforts and ensuring timely resolution.

Incident Response and Resolution:

Drive rapid incident resolution by orchestrating cross-functional teams, leveraging data-driven decision-making, and ensuring effective communication. Conduct post-incident reviews to identify root causes, implement corrective actions, and prevent recurrence.

  • Oversee the incident response process, ensuring incidents are accurately identified, categorized, prioritized, and resolved.
  • Coordinate cross-functional teams to resolve incidents, including IT, security, operations, and business stakeholders.
  • Ensure detailed incident reports are created and communicated to relevant stakeholders.

Risk Management and Prevention:

Proactively identify, assess, and mitigate risks to system availability and performance. Collaborate with engineering teams to implement preventive measures and improve system resilience.

  • Develop and implement risk management strategies to proactively identify, assess, and mitigate potential risks that could impact business operations and system reliability.
  • Conduct regular risk assessments and vulnerability analyses to identify potential threats and weaknesses in IT infrastructure and processes.
  • Establish and enforce preventative measures and controls to minimize the likelihood of incidents occurring, including implementing robust security protocols, conducting regular system audits, and ensuring compliance with industry standards and best practices.
  • Collaborate with IT, security, and operations teams to design and implement effective preventative maintenance plans and system updates.

Continuous Improvement and Operational Excellence:

Champion a culture of operational excellence and reliability. Drive initiatives to improve incident response time, mean time to repair (MTTR), and overall system uptime.

  • Analyze incident data to identify trends, root causes, and areas for improvement.
  • Drive continuous improvement initiatives to enhance incident management processes, reduce incident recurrence, and improve overall system reliability.
  • Implement best practices and industry standards in incident management.

Stakeholder Communication and Management:

Build and maintain strong relationships with senior leadership, product management, engineering, operations, and other key stakeholders. Communicate incident status, impact, and recovery plans effectively and transparently.

  • Maintain clear and effective communication with senior management, providing regular updates on incident status and impact.
  • Ensure all stakeholders are informed and engaged throughout the incident lifecycle.
  • Develop and deliver incident management training and awareness programs for staff.

Performance Monitoring and Reporting:

  • Establish and monitor key performance indicators (KPIs) for incident management.
  • Prepare and present incident management performance reports to senior leadership.
  • Ensure compliance with regulatory requirements and internal policies.

Crisis Management:

  • Develop and maintain a comprehensive crisis management plan.
  • Lead crisis response efforts during major incidents, ensuring business continuity and minimal disruption.

SRE Implementation and Governance:

Lead a team of site reliability engineers responsible for tracking SLOs, resolving recurring production incidents, and implementing the SRE principles of observability, alerting, error budgeting and continual improvement for cloud-native, distributed SaaS systems.

Team Development and Performance:

Build, lead, and mentor high-performing, globally distributed incident management and SRE teams. Foster a culture of ownership, accountability, and continuous improvement. Develop and implement performance metrics and reporting to measure team effectiveness and identify areas for enhancement.

Qualifications

What you’ll need

Experience:

  • 15+ years of experience managing incidents across product engineering organizations and IT operations, with at least 5 years in a leadership role.
  • Proven track record of successfully managing major incidents and leading cross-functional teams.
  • Experience implementing SRE concepts, tools and metrics to improve overall system reliability.
  • At least 5 years of hands-on experience with Ops automation tools, availability- and resiliency-focused deployment strategies, monitoring and alerting configuration, and establishing availability, scalability and performance benchmarks.
  • Experience conceptualizing and designing AIOps solutions that integrate with enterprise incident and change management processes.

Skills and Competencies:

  • Technical acumen and inclination to automate toil.
  • Strong leadership and team management skills.
  • Excellent problem-solving and analytical skills.
  • Previous experience working in a product engineering or software development environment.
  • Familiarity with cloud platforms such as AWS, Azure, and Google Cloud.
  • Exceptional communication and interpersonal skills.
  • Ability to work under pressure and handle multiple incidents simultaneously.
  • Knowledge of ITIL or other incident management frameworks.
  • Experience with incident management tools and technologies.
  • Strong understanding of Agile methodology, IT systems, infrastructure, and cybersecurity principles.

Certifications:

  • ITIL Certification: ITIL Foundation certification or higher to demonstrate knowledge of IT service management best practices.
  • Relevant Certifications: Certifications such as Certified Information Systems Security Professional (CISSP), Certified Information Security Manager (CISM), or Project Management Professional (PMP) are a plus.
  • Proven application of AI/ML skills with relevant certifications.

Personal Attributes:

  • High level of integrity and professionalism.
  • Strong attention to detail.
  • Ability to build strong relationships and work collaboratively.
  • Proactive and results-oriented mindset.
  • Resilient and adaptable to changing priorities and demands.

Additional Information

Epsilon is committed to promoting diversity, inclusion, and equal employment opportunities by using reasonable efforts to attract, recruit, engage and retain qualified individuals of all ethnicities and backgrounds, including, but not limited to, women, people of color, LGBTQ individuals, people with disabilities and any other underrepresented groups, traits or characteristics.
