SRE GCP - TECH LEAD

28 LPA
  • SRE GCP - TECH LEAD @ IBM
  • Hyderabad, Telangana
Job Description

Primary Responsibilities

     Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SREs ensure that managed service offerings and customer deployments have reliability and uptime appropriate to users' needs and a fast rate of improvement, while monitoring and validating capacity and performance. The role focuses on reliability, scalability, and the development of automation to manage repetitive tasks at scale.

Knowledge & Skills

    • In-depth knowledge of SRE practices and concepts such as SLAs, SLOs, SLIs, error budgets, toil elimination, and post-mortems.
    • Should have experience with monitoring and observability tools such as Grafana, Splunk, Dynatrace, or the Google Cloud operations suite.
    • Should have an understanding of APM tools such as AppDynamics or Datadog (AppDynamics preferred).
    • Should have experience with IaC tools such as Terraform or Ansible.
    • Should have experience working with cloud-native applications and managing them effectively in GCP or Azure.
    • Should have experience creating CI/CD pipelines with tools such as GitHub Actions, Azure DevOps, or Jenkins.
    • Should have knowledge of version control tools such as Git or Bitbucket.
    • Knowledge of at least one scripting language such as PowerShell, Python, or Bash.
    • Experience coding infrastructure automation across the CI/CD pipeline.
    • Responsible for ensuring the availability, performance, and scalability of a website or application.
    • Knowledge of containerization and orchestration: Docker, Kubernetes, Cloud Run (GCP), etc.
    • Involved in capacity planning and performance tuning to ensure that the site can handle increased traffic without issue.
    • Work closely with developers to identify and fix potential issues before they cause problems for users.
    • Deep understanding of how distributed systems work, sufficient to troubleshoot and optimize them.
    • Deep understanding of how different types of databases work, sufficient to troubleshoot issues as they arise.
    • Ability to communicate clearly and concisely about system alerts or outages to other members of the team.
    • Note: Beyond this job description, the customer is looking for a candidate who can mature their SRE practice across the division and who is comfortable being a champion and leader in the SRE space.