- Collaborate daily with cross-functional teams to identify and resolve issues related to performance, scalability, and security.
- Establish and track key performance indicators (KPIs) to measure the effectiveness of our SRE team.
- Establish service level objectives (SLOs) and monitor them to ensure they are met.
- Operates, monitors, and maintains high-availability applications running in an Azure cloud environment.
- Drive continuous improvement through automation, monitoring, and testing.
- Executes automation for cloud-operations tasks and creates new automation for new situations or issues encountered; automates everything.
- Identifies and remediates potential points of failure in the infrastructure and applications.
- Lead and focus teams on root cause analysis, pattern identification and continuous improvement to optimize application performance, resiliency, and reliability.
- Facilitates blame-free root cause analysis meetings in the event of a production-systems incident so that the team can learn from mistakes and improve systems.
- Helps secure data and access policies to reduce risk.
- Looks for opportunities to drive operational efficiencies while reducing costs.
- Prepares and presents reports to all levels of leadership and staff.
- Stays abreast of industry leading best practices and brings them to the attention of the leadership team for innovative application.
- Serves as a guide and mentor to members of the Cloud Platform SRE teams to aid in their growth and development.
- Allocates available resources to meet operating objectives.
- Ensures the ongoing training and development of direct reports.
- Manage a 24/7 on-call rotation schedule.
- Experience managing teams (specifically SRE, Release and DevOps).
- Strong experience with Azure cloud platform.
- Experience working in an SRE environment and applying SRE principles.
- Experience with CI/CD tools (Azure DevOps/Jenkins/GitLab).
- Familiarity with Agile methodologies and DevOps best practices.
- Strong grasp of infrastructure as code (e.g., Terraform, CloudFormation) and automation tools (e.g., Ansible, Chef, Puppet).
- Experience with Kubernetes, AKS/GKE, Docker, containerization, microservices, and serverless architectures.
- Proven track record of designing, implementing, and supporting highly available and scalable infrastructure in a cloud environment.
- Experience administering and troubleshooting both Windows Server and Linux operating systems. Familiarity with Internet Information Services (IIS), Apache, and Nginx.
- Proficiency in managing and monitoring relational databases (e.g., PostgreSQL, MySQL) and experience with NoSQL databases such as MongoDB.
- Proficiency in at least one programming or scripting language (e.g., Python, Go, .NET, Bash).
- Skilled in developing monitoring strategies and frameworks that provide real-time insights into system health, performance bottlenecks, and security vulnerabilities.
- Expertise in automating alerting and troubleshooting processes to ensure rapid response to incidents and minimize downtime.
- Bachelor's degree in computer science or related field.
- At least 3 years of leadership experience, specifically managing SRE and DevOps teams.
- 8+ years of experience in SRE or DevOps roles.
- Excellent communication and collaboration skills.
- Ability to work in a fast-paced, dynamic environment.
- Passion for technology and continuous learning.
Conexiom embraces diversity and equal opportunity. We are committed to building a team that represents a variety of backgrounds, perspectives, and skills. We are working to ensure that the profile of our staff reflects the profile of the communities we work in and serve. For that reason, we seek resumes and expressions of interest from a broad and diverse talent pool. Strength comes from the inclusion of diverse perspectives and experiences.

Reasons to work for Conexiom:

Our MISSION is to transform broken processes into business value.

We are DATA-driven and RESULTS-focused.

We show our COMMITMENT to the people that make up Conexiom by:
- Training and development opportunities
- Competitive compensation
- Work/Life balance – Open PTO Policy in North America & Flex days in the UK
- Comprehensive health, dental, & vision insurance
- We care for each other
- We hold ourselves accountable
- We make our customers heroes
- We over-communicate
- We are inclusive & want to make change for the better