Site Reliability Engineer Program Manager Jobs IT Recruitment and UK Job Vacancies from IT Jobs Post
Posted on: 31-07-2018

Position or Job Title: Site Reliability Engineer Program Manager
Company: CV-Library Ltd
Recruiter Reference: itjobspost/208358988
Position Location (City): Maidenhead
County/Area: Berkshire
Country: United Kingdom

Description & Requirements

Job Description
Site Reliability Engineer (SRE) – Program Manager

Location: Maidenhead, Berkshire

Responsibilities:

• Own end-to-end availability for a product service

• Work with product service teams to establish SLIs and error budgets, and nurture an environment that appreciates the value they add

• Identify opportunities for increased monitoring capabilities (white-box and black-box)

• Identify long-term trends for product services (How is traffic growing over time? How big is the database getting? What do our resource usage patterns look like over time?)

• Ensure that short-term hacks are replaced with long-term solutions

• Coordinate incident response as part of an on-call rotation, ensure the SREs aren't overloaded by on-call, and continually refine the process and tools that enable successful incident response

• Ensure that root cause analyses (RCAs) are carried out effectively and in a blame-free manner

• Attend the portfolio management team meetings to flag reliability considerations for upcoming work, and to reason about any reliability concerns from other stakeholders

• Populate the SRE backlog

• Identify requirements surrounding load testing, security testing, availability and disaster recovery

• Help mature the delivery process for teams: defining Jenkins pipelines, designing canary release deployments, building in automated fallbacks, optimizing the build chain, etc.

• Optimize product service code to ensure that it's secure, scalable and performant

• Optimize release engineering code to ensure that it's stable, repeatable and fast

• Improve the fault detection for our services

• Create dashboards which help communicate the metrics for a given product service

• Work with product owners and product engineering teams to perform capacity planning

• Work with product engineering teams to understand performance and behavior patterns

• Help carry out root cause analysis for incidents, and design solutions (both software and human processes) that will help to ensure the same problem doesn't happen in the same way again

Critical Skills / Competencies:

• Comfortable writing code in one or more of the following languages: Python, Go, Java, C#, C, C++

• Experience working with product owners and product development to prioritize work, flag risk and identify potential production engineering issues (e.g. scalability, resiliency, performance)

• A positive attitude and willingness to learn

• Experience managing services in AWS

• Experience with IaaS and Serverless services from a cloud provider

• An understanding of TCP/IP and DNS, and experience designing networks

• Linux system administration experience

• Strong conflict resolution skills

• Excellent written and verbal communication skills

• Experience implementing fault detection, and automating fixes

• An understanding of SQL databases

• Experience designing scalable services

• Experience designing distributed, fault-tolerant systems

• Detail-oriented: the ideal candidate naturally digs as deep as needed to understand the why

Required Skills: See listing
Nice To Have Skills: See listing
Required Qualifications: None Listed

Additional Details

Employment Authorisation: See listing
Type of Position: Permanent
Salary and Package: £50,000 - £55,000 per annum
Start Date Required Experience
See listing
Required Education: Other