Randstad is seeking a highly experienced and technically proficient Senior DevOps and Site Reliability Engineer (SRE) to join our client in the DC Metro area. This critical, senior-level role is responsible for driving the reliability, performance, security, and scalability of high-availability production environments on AWS. The ideal candidate is a hands-on technical leader who blends deep expertise in software development, infrastructure-as-code, and observability to automate operational toil, lead capacity planning, and serve as a primary on-call responder for critical incidents. This role demands a strong focus on applying SRE principles (SLIs/SLOs/Error Budgets), mentoring team members, and proactively influencing cross-functional teams to achieve world-class operational excellence.
location: Washington, D.C.
job type: Contract
salary: $75 - $85 per hour
work hours: 9am to 5pm
education: Bachelor's
responsibilities:
Deployment & Automation Engineering
- Implement, maintain, and optimize robust CI/CD pipelines utilizing tools such as GitHub Actions, AWS CodePipeline, and Jenkins.
- Automate infrastructure provisioning and configuration management using Infrastructure-as-Code (IaC) tools like Terraform, CloudFormation, or AWS CDK.
- Design and develop automation scripts and self-service tools to significantly enhance development and operational efficiency.
- Apply proficiency in multiple programming languages (Python, Go, Java) to develop automation and troubleshoot applications.
- Serve as a production on-call responder, leading incident management and orchestrating the response to critical service outages, including disaster recovery failover activities.
- Facilitate detailed post-mortem meetings and drive systemic improvements across teams.
- Define, monitor, and enforce Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets.
- Expertly leverage observability tools (Dynatrace, AppDynamics, ELK Stack; Dynatrace strongly preferred) for proactive monitoring and troubleshooting.
- Utilize distributed tracing and context propagation to identify performance bottlenecks and root causes of failures.
- Design and implement custom dashboards and anomaly detectors to generate actionable insights.
- Develop sophisticated capacity models and forecasting systems to ensure service scalability.
- Lead cost optimization initiatives, identifying and implementing efficiency gains across cloud services.
- Design and execute comprehensive Resiliency and Performance testing frameworks.
- Configure and maintain dynamic auto-scaling policies and thresholds for optimal resource utilization.
- Lead security incident investigations and execute swift remediation plans.
- Design and implement automated compliance validation and security automation frameworks.
- Drive the implementation of zero-trust architecture patterns within the cloud environment.
- Proficiently apply ITIL framework principles, preferably leveraging ITSM tools such as ServiceNow.
qualifications:
Education & Experience
Bachelor's degree in Computer Science, Engineering, or a related technical field.
5 to 8 years of progressive experience in DevOps, Site Reliability Engineering (SRE), or Platform Engineering.
3+ years of experience maintaining and optimizing high-availability production environments.
Proven track record of leading complex technical initiatives from conception to completion.
Technical Expertise
Expert-level knowledge of at least one major cloud platform, with AWS strongly preferred.
Deep expertise in cloud architecture, networking, and core services.
High proficiency in IaC tools such as Terraform, CloudFormation, or AWS CDK.
Expert-level experience with observability and APM tools, with a strong preference for Dynatrace.
Proficiency in modern programming languages like Python, Go, or Java.
Knowledge of relational, cloud-native, and NoSQL database technologies.
Professional & Leadership Skills
Strong leadership and mentoring capabilities, with the ability to elevate the technical skills of the team.
Exceptional ability to influence without direct authority across engineering and product teams.
Excellent technical writing and documentation skills (e.g., RCA development, Knowledge articles).
Ability to maintain flexible availability for on-call duties and to work outside of standard business hours as required for incident response.
Equal Opportunity Employer: Race, Color, Religion, Sex, Sexual Orientation, Gender Identity, National Origin, Age, Genetic Information, Disability, Protected Veteran Status, or any other legally protected group status.
At Randstad Digital, we welcome people of all abilities and want to ensure that our hiring and interview process meets the needs of all applicants. If you require a reasonable accommodation to make your application or interview experience a great one, please contact HRsupport@randstadusa.com.
Pay offered to a successful candidate will be based on several factors including the candidate's education, work experience, work location, specific job duties, certifications, etc. In addition, Randstad Digital offers a comprehensive benefits package, including: medical, prescription, dental, vision, AD&D, and life insurance offerings, short-term disability, and a 401K plan (all benefits are based on eligibility).
This posting is open for thirty (30) days.