Lead I - Cloud Infrastructure Services

UST Global

ID: 59801 | Experience: 5 - 7 Years | Openings: 1 | Location: Trivandrum

Role description

Primary Skills: Python, AppDynamics, AWS CloudWatch, Grafana, Kubernetes

We are looking for a contractor with strong expertise in Python, observability, and monitoring platforms such as AppDynamics, CloudWatch, and Grafana. Ideally, candidates should have a skill set and level of proficiency comparable to Dhana from the DevOps team.

Job Title: Senior Observability Engineer / Site Reliability Engineer (SRE)

Job Summary

We are seeking a highly motivated and technically proficient Senior Observability Engineer to join our DSX Observability Engineering organization. The ideal candidate will have strong expertise in Observability platforms, Python development, and Site Reliability Engineering (SRE) practices to design, implement, and continuously improve monitoring, alerting, logging, tracing, and reliability solutions across critical business applications and cloud platforms.

This role will be responsible for building scalable observability frameworks, driving operational excellence, improving service reliability, and enabling engineering teams with actionable insights into application and infrastructure performance.

Key Responsibilities

Observability Engineering

Design, implement, and maintain enterprise-wide observability solutions covering metrics, logs, traces, and user experience monitoring.
Develop and manage dashboards, alerts, and service health indicators using Grafana, CloudWatch, AppDynamics, ELK and related observability tools.
Define and implement SLIs, SLOs, and Error Budgets for critical services.
Drive adoption of observability best practices across engineering teams.
Establish standards for instrumentation, monitoring, alerting, and incident response.
Build end-to-end observability solutions leveraging OpenTelemetry and cloud-native monitoring capabilities.
Analyze application and infrastructure performance data to identify trends, bottlenecks, and optimization opportunities.

Site Reliability Engineering (SRE)

Improve platform reliability, scalability, availability, and operational efficiency.
Participate in incident management, root cause analysis (RCA), and post-incident reviews.
Drive proactive reliability initiatives through automation and engineering improvements.
Support chaos engineering, resiliency testing, and disaster recovery exercises.
Define and monitor operational KPIs related to system health and reliability.
Collaborate with development teams to improve production readiness and operational excellence.

Python Development & Automation

Develop automation tools and frameworks using Python to streamline observability and operational workflows.
Build integrations between monitoring platforms, cloud services, ticketing systems, and collaboration tools.
Automate alert enrichment, reporting, dashboard provisioning, and operational tasks.
Develop scripts and APIs for data collection, analysis, and observability enhancements.
Create self-service solutions that improve developer productivity and operational visibility.

Cloud & Platform Monitoring

Implement monitoring solutions for cloud-native architectures hosted on AWS/AKS/Azure.
Configure and optimize CloudWatch metrics, logs, alarms, and dashboards.
Monitor distributed systems, microservices, APIs, containers, and serverless workloads.
Support observability for Kubernetes, containerized applications, and modern cloud platforms.
Ensure monitoring coverage for business-critical services and customer journeys.

Required Qualifications

Bachelor's degree in Computer Science, Engineering, or related field.
5+ years of experience in Observability, Monitoring, SRE, or Platform Engineering roles.
Strong hands-on expertise with:

Python
Grafana
AWS CloudWatch
AppDynamics

Experience implementing monitoring, logging, tracing, and alerting solutions.
Strong understanding of SRE principles, reliability engineering, and production operations.
Experience with distributed systems and microservices architectures.
Knowledge of REST APIs and automation frameworks.
Experience with CI/CD pipelines and DevOps practices.
Strong troubleshooting and performance analysis skills.

Preferred Qualifications

Experience with OpenTelemetry instrumentation and distributed tracing.
Experience with ELK/OpenSearch, Prometheus, or similar observability platforms.
Experience with Kubernetes and container platforms.
Familiarity with Chaos Engineering and resiliency testing.
Knowledge of AWS cloud architecture and cloud-native services.
Experience defining and implementing SLOs, SLIs, and Error Budgets.
Exposure to large-scale production environments supporting critical customer-facing applications.

Key Skills

Python Development
Observability Engineering
Grafana
AWS CloudWatch
AppDynamics
OpenTelemetry
Site Reliability Engineering (SRE)
Monitoring & Alerting
Incident Management
Distributed Tracing
Logging & Metrics
Automation Engineering
Reliability Engineering
AWS Cloud Services
Kubernetes (Preferred)

Success Metrics

Improved service reliability and availability.
Reduction in Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR).
Increased observability coverage across applications and platforms.
Improved alert quality and reduced alert fatigue.
Enhanced automation and operational efficiency.
Successful implementation and adoption of SLOs and reliability best practices across engineering teams.

Skills

Site Reliability Engineering, AWS CloudWatch, Python, AppDynamics, Grafana, Kubernetes

About UST

UST is a global digital transformation solutions provider. For more than 20 years, UST has worked side by side with the world’s best companies to make a real impact through transformation. Powered by technology, inspired by people and led by purpose, UST partners with their clients from design to operation. With deep domain expertise and a future-proof philosophy, UST embeds innovation and agility into their clients’ organizations. With over 30,000 employees in 30 countries, UST builds for boundless impact—touching billions of lives in the process.