Primary Responsibilities
Site Reliability Engineering (SRE) is an engineering discipline that combines
software and systems engineering to build and run large-scale, massively
distributed, fault-tolerant systems. SREs ensure that managed service offerings
and customer deployments have reliability and uptime appropriate to users'
needs and a fast rate of improvement, while monitoring and validating capacity
and performance. The role focuses on reliability, scalability, and the
development of automation to manage repetitive tasks at scale.
Knowledge & Skills
- In-depth knowledge of SRE practices and concepts such as SLA, SLO, SLI,
error budgets, toil elimination, and postmortems.
- Experience with monitoring and observability tools such as Grafana, Splunk,
Dynatrace, or the GCP Operations Suite.
- Understanding of APM tools such as AppDynamics or Datadog, preferably
AppDynamics.
- Experience with IaC tools such as Terraform or Ansible.
- Experience working with and effectively managing cloud-native applications
on GCP or Azure.
- Experience creating CI/CD pipelines with tools such as GitHub Actions,
Azure DevOps, or Jenkins.
- Knowledge of version control tools such as Git and Bitbucket.
- Knowledge of at least one scripting language such as PowerShell, Python,
or Bash.
- Coding infrastructure automation across the CI/CD pipeline.
- Responsible for
ensuring the availability, performance, and scalability of a website or
application.
- Knowledge of containerization and orchestration tools such as Docker,
Kubernetes, and Cloud Run (GCP).
- Involved in capacity
planning and performance tuning to ensure that the site can handle
increased traffic without issue.
- Work closely with developers to identify
and fix potential issues before they cause problems for users.
- Deep understanding of how distributed systems work, in order to
troubleshoot and optimize them.
- Deep understanding of how different types of databases work, in order to
troubleshoot issues effectively as they arise.
- Ability to communicate
clearly and concisely about system alerts or outages to other members
of your team.
- Note: Beyond this job description, the customer is looking for a candidate
who can mature their SRE practice across the division, someone who is
comfortable being a champion and leader in the SRE space.