Lead Site Reliability Engineer (Azure) - Remote in Canada
Description
Job #: 92518
We're looking for an expert with a strategic view who can, in close collaboration with the client, build a baseline for the SRE team: identifying SLOs/SLAs, forming error budgets, defining system tolerance, establishing metric baselines, and so on. The role goes beyond "motivate and collaborate": you will do hands-on work such as writing strategy documents and driving workshops with clients.
Req.#554998732
Responsibilities
- Be responsible for the technical solution by providing leadership for the customer, project manager, domain architects, domain specialists and application engineers to advance and deliver solutions
- Analyze, execute, and streamline DevOps practices
- Automate processes with the right tools
- Facilitate development processes and operations
- Set up a continuous build environment to speed up software development and deployment
- Architect comprehensive and efficient practices
- Guide development and operations teams when issues arise
- Monitor, review, and manage technical operations
- Consult and inform architects to design and deliver solutions
- Assess the merits of alternative technical approaches and gain consensus on the best approach
- Learn, follow, promote, and improve recognized methodologies to design and deliver solutions
- Ensure that the non-functional requirements are satisfied including, but not limited to, security, disaster recovery, availability, and performance
- Mentor IT professionals
- Work with Jira, Confluence, and Bitbucket
Requirements
- Solid Linux/Unix systems administration background
- E-commerce domain
- Continuous Integration orchestration
- Continuous Delivery and Continuous Deployment orchestration
- Infrastructure as Code
- Public Cloud: Azure Cloud
- Container orchestration: Kubernetes (GKE), Docker Swarm
- Docker, Docker Compose
- Helm Charts
- Configuration Management - Ansible
- SCM - source control management
- GitHub, GitHub Actions, gitflow
- Build tools: Ant, Maven, Gradle, Node
- Java support and troubleshooting, Apache Solr, ZooKeeper, SAP Hybris (e-commerce), Tomcat
- Artifacts management, Artifactory
- SonarQube, quality gates, Veracode
- Experience with load balancers / reverse proxies (nginx)
- Networking and network troubleshooting
Benefits
- Extended Healthcare with Prescription Drugs, Dental and Vision Insurance, and Healthcare Spending Account (Company Paid)
- Maternity/Parental/Adoption Leave Top-up
- Life and AD&D Insurance (Company Paid)
- Employee Assistance Program (Company Paid)
- Unlimited access to LinkedIn Learning solutions
- Long-Term Disability
- Registered Retirement Savings Plan (RRSP) with company match
- Paid Time Off
- Critical Illness Insurance
- Employee Discounts
- Employee Stock Purchase Program
About EPAM
- EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multinational teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.