Lead Reliability Engineer

  • location: Glen Mills, PA
  • type: Permanent
  • salary: $130,000 - $160,000 per year

job description

job summary:
This role requires deep technical expertise to envision and drive improvements across our database, networking, and infrastructure teams, and to partner with our application teams in building more robust and performant applications for our internal and business solutions.

 
location: Glen Mills, Pennsylvania
job type: Permanent
salary: $130,000 - $160,000 per year
work hours: 9am to 5pm
education: Bachelor's
 
responsibilities:
  • Lead multiple teams of Reliability / DevOps engineers who automate and build release pipelines, infrastructure, cloud platforms, and operational tasks.
  • Manage end to end availability, security, and performance of mission-critical services
  • Provide leadership, architecture, development, and project management expertise so that our systems fail rarely and are fast to fix when they do fail
  • Drive reliable systems engineering design and recovery by minimizing manual involvement and leading continuous improvements that create an operating environment with dynamic monitoring, alerting, and automated self-healing and recovery
  • Identify and analyze problems relating to mission-critical services and manage the building of automation to prevent their recurrence, with the goal of automating the response to all non-exceptional service conditions.
  • Engage in service capacity planning and demand forecasting, software performance analysis and system tuning.
  • Oversee incident management and drive root cause analysis initiatives to identify continuous improvements
  • Drive Operational Testing and Performance Engineering to certify solutions and provide critical thinking & recommendations to meet availability and performance targets
  • Improve our monitoring, troubleshooting, and resolution capabilities
  • Create clear presentations and communications for stakeholders that highlight the impact of service disruptions and their solutions. Communicate the state of reliability to prioritize technical debt and improvements on technology team roadmaps. Be equally capable of presenting analyses and recommendations to leadership and of discussing the technical merits of solutions with engineers and architects.
  • Lead, build, and grow a diverse team across geographies toward a common goal; partner with our application development and project management teams on coordinating investigations into customer facing service issues.
  • Own the day-to-day health, uptime, monitoring, and reliability of services and server infrastructure
  • Lead, own, model and drive DevOps culture and behaviors
  • Practice and enforce Agile and Scrum methodologies
  • Ensure user-visible uptime and quality, providing operational and development expertise so that our systems fail rarely and are fast to fix when they do fail
  • Participate in architecture and design reviews to provide recommended improvements to the development teams to improve the reliability and performance of applications
  • Minimize manual involvement by envisioning and implementing continuous improvements to the operating environment, including new tools, dynamic monitoring, alerting, and automated self-healing and recovery
  • Identify and analyze problems relating to mission-critical services and implement automation to prevent their recurrence, with the goal of automating the response to all non-exceptional service conditions.
  • Engage in application performance analysis and system tuning, and capacity planning
  • Perform root cause analysis to identify & implement continuous improvements
  • Present analyses and recommendations to leadership and discuss the technical merits of solutions with engineers and architects.
  • Practice Agile and Scrum methodologies
 
qualifications:
  • Strong software engineering experience
  • Knowledge of Azure Services, especially ARM templates
  • Strong experience with Azure DevOps, TFS 2010+, VSTS, or similar ALM tool
  • BS or higher degree in Computer Science/Engineering or a related field, with 7+ years of experience
  • Strong experience with PowerShell
  • Experience developing in a software development language (preferably C#/C++)
  • Experience and knowledge of database technologies, particularly MS SQL
  • Knowledge of virtualization and its benefits for improving reliability
  • Strong experience with instrumentation, monitoring, alerting, and responding relative to performance and availability of applications
  • Capable of technical deep dives into infrastructure, databases, and applications, specifically in designing, coding, operating, and supporting high-performance, highly available services and infrastructure
  • Experience in designing for failure, including disaster recovery and business continuity planning
  • Experience operating and supporting mission-critical applications (e.g. incident and outage management)
  • Passion for making things better and driving action with a sense of urgency
  • Experience solving problems on globally distributed systems and in critical product service environments
  • Knows what is possible using the latest networking, infrastructure, database, and application technologies to drive automation and reliability improvements
  • Brings new thinking to challenge existing technology and processes
  • Excellent at building relationships across teams
  • Firm sense of accountability and ownership
  • Desire to understand our businesses and users
  • Understanding of the concepts and principles behind DevOps, Continuous Delivery, Agile, Lean, etc.
  • Use of DevOps tools to deliver and operate end-user services a plus (e.g., Chef, New Relic, Puppet, etc.)
 

Equal Opportunity Employer: Race, Color, Religion, Sex, Sexual Orientation, Gender Identity, National Origin, Age, Genetic Information, Disability, Protected Veteran Status, or any other legally protected group status.
