job summary:
Job Description

As a Staff Engineer, SRE, you will play a crucial role in ensuring the reliability, scalability, and performance of software systems. Collaborating closely with teams, you will set and enforce best practices and ensure the scalability, reliability, and security of our cloud and on-premises environments.

This is intended to be a technically broad role, requiring a strong understanding of the entire technology stack (network, storage, OS, virtualization, database, development, applications) in order to observe, monitor, troubleshoot, and automate activity.

location: Urbandale, Iowa
job type: Contract
salary: $80 - 90 per hour
work hours: 9am to 5pm
education: Bachelor's

responsibilities:
Key Functions/Duties of Position

Define and track reliability and observability OKRs, including Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
Implement robust monitoring and alerting systems to proactively monitor health, identify potential issues, analyze system performance, and facilitate quick response to incidents.
Implement AIOps functionality to enable auto-response, self-healing, and anomaly trend analysis.
Drive the development and implementation of automation solutions to remove "toil", streamline processes, reduce manual interventions, and enhance the overall efficiency of the product engineering and SRE teams.
Identify and address performance bottlenecks in applications and infrastructure to improve efficiency and user experience.
Work closely with incident management to quickly address and resolve system outages or performance issues to minimize downtime and impact on users.
Collaborate actively with development and operations teams to implement observability and resiliency requirements in order to ensure smooth deployment and operation of software systems.
Lead the coordination with product, development, infrastructure, and architecture teams to conduct capacity planning, ensuring that systems can handle current and future demand; anticipate growth and scalability requirements.
Improve reliability by identifying and addressing gaps in our architecture, services, and tooling.
Modernize the disaster recovery program for both on-premises and cloud-based Berkley solutions.
Provide technical leadership and mentorship to other engineers, fostering a culture of learning and continuous improvement.

Education Requirement

Bachelor's degree in Computer Science, Information Technology, or a related field (or an equivalent combination of education and experience).

qualifications:
Qualifications

15+ years of IT experience in infrastructure support and development.
7+ years of experience in Site Reliability Engineering and DevOps.
Proficient in scripting languages such as Python, Go, Bash, and/or JavaScript, with experience in shell scripting.
Strong expertise in observability, monitoring, alerting, and logging tools (Dynatrace, Datadog, ELK Stack).
Hands-on experience creating and implementing logging and monitoring architectures.
Expertise in designing and implementing on-premises, cloud, and hybrid resiliency solutions (HA, AA, AP), disaster recovery, and business continuity planning.
Deep understanding of cloud computing principles, including IaaS, PaaS, and SaaS models.
Experience with Kubernetes and other auto-scaling tools and technologies, including proficiency with tools such as Helm and Prometheus for deployment and monitoring.
Proficient in leveraging GitOps with containerization technologies and CI/CD pipelines.
Experience developing and implementing automated system reliability and performance solutions, including infrastructure automation and configuration management tools (GitHub Actions, Terraform, Ansible, Chef, Puppet).
Solid understanding of security best practices in on-premises, cloud, and hybrid environments, along with network technologies.
Understanding of industry-standard security frameworks and ability to interpret them for Berkley environments.
Ability to drive critical issues and system design discussions to resolution and to moderate between multiple technology teams.
Demonstrated leadership experience, including mentoring junior engineers and leading technical projects.
Excellent problem-solving skills and the ability to troubleshoot complex issues in a distributed hybrid environment.
Strong communication skills to collaborate effectively with cross-functional teams and convey technical concepts to non-technical stakeholders.

Equal Opportunity Employer: Race, Color, Religion, Sex, Sexual Orientation, Gender Identity, National Origin, Age, Genetic Information, Disability, Protected Veteran Status, or any other legally protected group status.

At Randstad Digital, we welcome people of all abilities and want to ensure that our hiring and interview process meets the needs of all applicants. If you require a reasonable accommodation to make your application or interview experience a great one, please contact HRsupport@randstadusa.com.

Pay offered to a successful candidate will be based on several factors including the candidate's education, work experience, work location, specific job duties, certifications, etc. In addition, Randstad Digital offers a comprehensive benefits package, including: medical, prescription, dental, vision, AD&D, and life insurance offerings, short-term disability, and a 401K plan (all benefits are based on eligibility).

This posting is open for thirty (30) days.
