Senior GCP SRE - ELK
EPAM Systems, Inc.
EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
We are seeking a Senior GCP SRE - ELK to join a B2B parts business undergoing commerce and digital transformation. In this client-facing role, you will support the platform COE within a DevOps/SRE function, working with teams across both the Americas and Belgium to build and maintain scalable, secure and reliable cloud infrastructure.
Responsibilities
- Design, implement and manage scalable and secure cloud infrastructure on Google Cloud Platform (GCP)
- Optimize GCP resources for performance, cost and reliability
- Monitor and troubleshoot GCP services to ensure high availability and performance
- Deploy, manage and scale containerized applications using Google Kubernetes Engine (GKE)
- Implement and maintain Kubernetes clusters, ensuring proper configuration and security
- Utilize Helm for package management and deployment of applications on Kubernetes
- Develop and maintain CI/CD pipelines using tools such as Jenkins, ArgoCD and GitLab to automate application deployment and infrastructure provisioning
- Collaborate with development teams to integrate CI/CD practices into the software development lifecycle
- Implement monitoring solutions using Prometheus, Grafana and Elasticsearch to track application performance and system health
- Set up logging and tracing mechanisms to facilitate troubleshooting and performance optimization
- Administer and optimize Confluent Kafka clusters deployed across both SaaS and on-premises environments to ensure high availability, fault tolerance and continuous data streaming
- Develop and maintain automation scripts and tools for deployment, scaling and monitoring of Kafka services
- Investigate and resolve operational incidents, minimizing downtime and service disruption
- Collaborate with cross-functional teams to gather requirements and provide technical guidance on cloud and containerization best practices
- Document architecture, processes and procedures to ensure knowledge sharing and compliance with best practices
Requirements
- 4-10 years of experience in a DevOps or SRE role
- Production expertise in Google Cloud Platform (GCP)
- Proficiency in Kubernetes (GKE, Helm)
- Background in managing Confluent Kafka Platform across SaaS and on-premises environments
- Competency in CI/CD tools (Jenkins, ArgoCD)
- Skills in logging, metrics and tracing (Prometheus, Elasticsearch, Grafana)
- Familiarity with Kibana
- Very good English level for a client-facing role with stakeholders in both the Americas and Belgium
We offer
- Opportunity to work on technical challenges that may have an impact across geographies
- Vast opportunities for self-development: online university, knowledge sharing opportunities globally, learning opportunities through external certifications
- Opportunity to share your ideas on international platforms
- Sponsored Tech Talks & Hackathons
- Unlimited access to LinkedIn Learning solutions
- Possibility to relocate to any EPAM office for short and long-term projects
- Focused individual development
- Benefit package:
  - Health benefits
  - Retirement benefits
  - Paid time off
  - Flexible benefits
- Forums to explore passions beyond work (CSR, photography, painting, sports, etc.)