job summary:

What it takes:
- Adept user of telemetry tools, including CloudWatch, Splunk, and Honeycomb
- Ability to read and understand application code written in NodeJS, Java, and Python
- Ability to write and update application code confidently in at least one of the following languages: NodeJS, Java, or Python
- Deep familiarity with Siteminder, MFA, and OIDC (Kong, Envoy, OPA, etc.) protocols and implementation
- Strong conceptual thinking to quickly understand new and complex architectures and ongoing incidents
- Experience debugging production incidents using a combination of logs, metrics, and traces
- Familiarity with executing performance and chaos tests and analyzing results
- Experience working within the constraints of regulated workloads, including security restrictions
- Experience building cloud-native applications/platforms
- Ability to create, interpret, and update technical architecture diagrams

Specializations that will make an impact:
- Experience with distributed tracing implementation, e.g., OpenTelemetry
- Experience with circuit breakers or developing circuit breaker logic and bulkhead patterns
- Experience with structured and unstructured logging frameworks

Duties and responsibilities:

Cloud Platform SRE Engineer Lead (or a really good hands-on senior cloud engineer)

Help the client reach 99.999% availability for their mission-critical applications in support of the Agility pillar of the Enterprise Technology Strategy!

You will be joining a high-performing enterprise cloud container service (ECS) platform that hosts the majority of the web and batch applications developed by application teams, many of which provide business/mission-critical functionality for our external sites and services and internal applications.
We can train on AWS container technology and ECS, and partner with the CTO SRE team on upskilling core SRE functions.

As a Cloud Compute SRE Lead within the ECS platform team, which is expanding to a global multi-region presence, you will proactively seek out points of pain and opportunities for wide-reaching improvement by analyzing enterprise-wide telemetry data. You will also support reliability-centric tasks for cross-cutting concerns and applications spanning more than one sub-division. You are expected to be a hands-on developer who implements the opportunities found to enhance resiliency and harden the platform. You will partner with other shared services teams (performance, chaos, security and fraud, and various ops teams) to bring a holistic approach to hardening the platform's security and resiliency posture.

In this role you will:
- Proactively seek out operational anomalies using Honeycomb, Splunk, CloudWatch, and other telemetry tools
- Execute chaos experiments and other resilience tests for spinal services and applications with cross-cutting impacts or high criticality
- Define SLIs and aligned SLOs for platform services. Implement automation via synthetic monitors and formulas to capture platform availability
- Build/Deploy: Determine efficiencies to reduce build/deploy times and failures of application workloads. Assess build/deploy metrics to capture for further refinement and reporting
- Update application code based on findings to improve resilience and assist in automating workloads to be stood up in a multi-region/out-of-region environment
- Improve the platform's security posture by easing integration with modernized authorization/authentication protocols (OIDC, Auth0, Kong, Envoy) and identifying any potential vulnerabilities
- Help product and platform teams and their SRE representatives diagnose complex technical problems, including performance issues and intermittent errors
- Listen in and participate on high-severity major incident calls (SEV1s, some cross-cutting SEV2s) to assist with triage and recovery; also participate in post-incident reviews for these incidents
- Review critical and complex architectures, including facilitation of FMEA exercises
- Maintain product-level runbooks for incident response to document the step-by-step process to recover from failures of specific components within a system

location: Malvern, Pennsylvania
job type: Contract
work hours: 8am to 4pm
education: No Degree Required
qualifications:
- Experience level: Experienced
- Minimum 8 years of experience
- Education: No Degree Required

skills:
- Java Developer
- Python
- NodeJS
- Siteminder

Equal Opportunity Employer: Race, Color, Religion, Sex, Sexual Orientation, Gender Identity, National Origin, Age, Genetic Information, Disability, Protected Veteran Status, or any other legally protected group status.

At Randstad, we welcome people of all abilities and want to ensure that our hiring and interview process meets the needs of all applicants. If you require a reasonable accommodation to make your application or interview experience a great one, please contact HRsupport@randstadusa.com.

For certain assignments, Covid-19 vaccination and/or testing may be required by Randstad's client or applicable federal mandate, subject to approved medical or religious accommodations. Carefully review the job posting for details on vaccine/testing requirements or ask your Randstad representative for more information.