Senior GCP SRE - ELK

EPAM Systems, Inc.

EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.

We are seeking a Senior GCP SRE - ELK to join a B2B parts business undergoing commerce and digital transformation. In this client-facing role, you will support the platform COE within a DevOps/SRE function, working with teams across both the Americas and Belgium to build and maintain scalable, secure and reliable cloud infrastructure.

Responsibilities

  • Design, implement and manage scalable and secure cloud infrastructure on Google Cloud Platform (GCP)
  • Optimize GCP resources for performance, cost and reliability
  • Monitor and troubleshoot GCP services to ensure high availability and performance
  • Deploy, manage and scale containerized applications using Google Kubernetes Engine (GKE)
  • Implement and maintain Kubernetes clusters, ensuring proper configuration and security
  • Utilize Helm for package management and deployment of applications on Kubernetes
  • Develop and maintain CI/CD pipelines using tools such as Jenkins, ArgoCD and GitLab to automate application deployment and infrastructure provisioning
  • Collaborate with development teams to integrate CI/CD practices into the software development lifecycle
  • Implement monitoring solutions using Prometheus, Grafana and Elasticsearch to track application performance and system health
  • Set up logging and tracing mechanisms to facilitate troubleshooting and performance optimization
  • Administer and optimize Confluent Kafka clusters deployed across both SaaS and on-premises environments to ensure high availability, fault tolerance and continuous data streaming
  • Develop and maintain automation scripts and tools for deployment, scaling and monitoring of Kafka services
  • Investigate and resolve operational incidents, minimizing downtime and service disruption
  • Collaborate with cross-functional teams to gather requirements and provide technical guidance on cloud and containerization best practices
  • Document architecture, processes and procedures to ensure knowledge sharing and compliance with best practices

Requirements

  • 4-10 years of experience in a DevOps or SRE role
  • Production expertise in Google Cloud Platform (GCP)
  • Proficiency in Kubernetes (GKE, Helm)
  • Background in managing Confluent Kafka Platform across SaaS and in-house instances
  • Competency in CI/CD tools (Jenkins, ArgoCD)
  • Skills in logging, metrics and tracing (Prometheus, Elasticsearch, Grafana)
  • Familiarity with Kibana
  • Strong English skills for a client-facing role with stakeholders in both the Americas and Belgium

We offer

  • Opportunity to work on technical challenges with impact across geographies
  • Vast opportunities for self-development: online university, knowledge sharing opportunities globally, learning opportunities through external certifications
  • Opportunity to share your ideas on international platforms
  • Sponsored Tech Talks & Hackathons
  • Unlimited access to LinkedIn Learning solutions
  • Possibility to relocate to any EPAM office for short and long-term projects
  • Focused individual development
  • Benefit package:
    • Health benefits
    • Retirement benefits
    • Paid time off
    • Flexible benefits
  • Forums to explore passions beyond work (CSR, photography, painting, sports, etc.)

How to apply

To apply for this job, you need to log in on our website. If you don't have an account yet, please register.