Director Application Support Engineering

Dallas, TX, United States

Are you ready to make an impact at DTCC? 

Do you want to work on innovative projects, collaborate with a dynamic and supportive team, and receive investment in your professional development? At DTCC, we are at the forefront of innovation in the financial markets. We're committed to helping our employees grow and succeed. We believe that you have the skills and drive to make a real impact. We foster a thriving internal community and are committed to creating a workplace that looks like the world that we serve.

Pay and Benefits:
  • Competitive compensation, including base pay and annual incentive
  • Comprehensive health and life insurance and well-being benefits, based on location
  • Pension / Retirement benefits 
  • Paid Time Off and Personal/Family Care, and other leaves of absence when needed to support your physical, financial, and emotional well-being.
  • DTCC offers a flexible/hybrid model of 3 days onsite and 2 days remote (onsite Tuesdays, Wednesdays and a third day unique to each team or employee).

The Impact you will have in this role:

The Enterprise Application Support (EAS) team provides technical application support for the ITP and ECS lines of business. Within EAS, the Director Application Support Engineering / SRE Lead (Site Reliability Engineer Lead) is a senior technical role responsible for supervising a team of SREs and driving the overall reliability, scalability, and performance of critical systems. The role does this by implementing standard methodologies, leading incident response, automating processes, and collaborating with development teams to ensure system stability and uptime across the organization, often acting as a technical leader who promotes a strong SRE culture within the company. Key responsibilities include designing monitoring systems, capacity planning, and actively identifying and mitigating potential issues before they impact users. The SRE team works closely with development teams, infrastructure and network partners, security partners, Scrum Masters, and internal / external clients to improve observability, operational supportability, resiliency, and mean time to restore service by driving improvements to support capabilities.

Your Primary Responsibilities:
  • Scrum Participation: Join all project collaborators' planning and design sessions, sprint zero, and stand-ups for all new delivery work to champion NFRs (non-functional requirements) that reflect strong observability and resiliency traits.
  • Team Leadership: Lead and mentor a team of SREs, assigning tasks, providing technical guidance, and fostering a culture of continuous improvement. Actively contribute to Communities of Practice and Scrum meetings, constantly driving an engineering culture within the broader EAS community.
  • System Reliability Architecture: Drive the design and help implement reliable, resilient, and scalable systems, considering redundancy, fault tolerance, and disaster recovery strategies. Make design recommendations that allow the application to recover without cleanup activities, or create a recovery runbook for the application support team to follow for improved application recovery times.
  • Monitoring and Alerting: Develop comprehensive monitoring systems to identify potential issues proactively, define actionable alerts, and establish SLIs (Service Level Indicators) and SLOs (Service Level Objectives).
  • Incident Management: Lead incident response during critical system outages, facilitating timely problem diagnosis and resolution, and conduct post-mortem analysis to identify root causes and prevent future occurrences.
  • Automation and Tooling: Develop and maintain automation scripts to streamline operational tasks, including self-healing, application deployments, scaling, and infrastructure management.
  • Collaboration with Development Teams: Work closely with development teams to integrate SRE practices into the software development lifecycle, promoting code quality, reliability, and observability.
  • Security Integration: Collaborate with security teams to ensure system resilience against cyber threats, implementing security best practices and monitoring for vulnerabilities.
  • Technical Expertise: Stay updated on emerging technologies and industry trends related to cloud computing, distributed systems, and reliability engineering.
  • Operational Readiness: Attend and present operational readiness with application support (EAS L2) at each project management meeting, raising any operational risks and concerns. Test NFRs in UAT environments to validate the effectiveness and completeness of operational capabilities.
  • Risk Management: Partner with IT Embedded Risk Managers to identify strategic solutions for risk incidents.
  • Metrics and Reporting: Demonstrate operational improvements through defined KPIs.
  • Capacity Planning: Proactively assess system capacity needs, plan for future growth, and implement scaling strategies to ensure optimal performance under high load.
  • Performance Optimization: Analyze system performance metrics to identify bottlenecks and implement optimization strategies to improve system responsiveness and efficiency.

**NOTE: The Primary Responsibilities of this role are not limited to the details above.**

Qualifications:
  • Bachelor's degree preferred, with Master's or equivalent experience
  • Minimum of 12 years of related experience

Talents Needed for Success:
  • Strong Programming Skills: Proficiency in one or more programming languages, such as Python, Java, or Go, for automation and development of monitoring tools.
  • System Administration: Expertise in Linux/Unix operating systems, network administration, and cloud platforms (AWS, Azure, GCP). Mainframe experience is a plus.
  • Monitoring and Observability: Deep understanding of monitoring tools (Splunk, Dynatrace, ITSI, etc.) and experience in designing robust monitoring systems.
  • Incident Management: Proven track record of leading incident response teams under pressure and effectively resolving complex issues.

The salary range is indicative for roles at the same level within DTCC across all US locations. Actual salary is determined based on the role, location, individual experience, skills, and other considerations. We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, sex, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status. We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodation.
