Senior Production Support Engineer
SurveySparrow

Job Title: Senior Production Support Engineer
Location: Chennai (Work from Office)
Job Summary:
We are looking for an experienced and proactive Senior Production Support Engineer to lead the stability, scalability, and performance of our production systems and cloud infrastructure. In this role, you will take ownership of complex technical challenges, drive system reliability initiatives, and guide the cloud operations strategy in collaboration with DevOps, SRE, and product engineering teams. The ideal candidate brings a strong technical background and leadership capabilities.
Key Responsibilities:
Production Reliability & Monitoring:
Own the performance, availability, and health of our production environments.
Lead complex incident investigations and drive resolution of application, infrastructure, and network-level issues.
Perform deep root cause analyses (RCA) for critical and recurring incidents and drive implementation of long-term fixes.
Design, maintain, and improve observability across systems using tools like Prometheus, Grafana, New Relic, and ELK Stack.
Set best practices for monitoring, alerting, and logging to enable proactive issue detection and rapid response.
Incident Management & Team Collaboration:
Lead incident response efforts, coordinate across teams, and ensure swift resolution of high-impact issues.
Provide Tier 3 support and technical escalation for unresolved complex issues.
Mentor junior engineers and foster a culture of reliability, accountability, and continuous improvement.
Collaborate with SRE, DevOps, and product engineering teams to design and build resilient and scalable systems.
Maintain and enforce high-quality documentation standards for incidents, playbooks, and system knowledge.
Automation & Process Optimization:
Drive the automation of manual and repetitive operational tasks to enhance system efficiency and consistency.
Architect and implement solutions for secure and seamless deployment of hotfixes, patches, and upgrades.
Improve and streamline incident response workflows, reducing Mean Time to Resolution (MTTR) and improving team productivity.
Partner with development teams early in the lifecycle to influence design for operability and supportability.
Requirements:
2+ years of hands-on experience in cloud infrastructure, production support, or system operations.
Proven experience with AWS or GCP, and with managing distributed cloud-native applications at scale.
Advanced troubleshooting and performance tuning skills in high-availability environments.
Expertise in monitoring, observability, and alerting tools (Prometheus, Grafana, New Relic, etc.).
Experience with incident management platforms (e.g., JIRA, Zendesk).
Proficient in scripting or infrastructure-as-code tools (e.g., Bash, Python, Terraform).
Solid understanding of databases, networking, and security best practices in cloud environments.
Strong communication and leadership skills with a collaborative mindset. Experience mentoring and guiding junior engineers is a plus.