Senior Manager, Site Reliability Engineering (SRE)

Conexiom

Vancouver, BC

Posted 8 days ago

Job Details:

Full-time

Management

Benefits:

Paid Time Off

Our fully automated, purpose-built solution eliminates the manual processing of business-critical commercial documents with 100% data-accurate, touchless transactions. These documents include purchase orders, invoices, advance shipping notices, and other essential commercial documents exchanged between businesses and their customers.

About the Opportunity:

Conexiom is seeking a dedicated and experienced Site Reliability Engineering (SRE) Senior Manager to lead our Cloud SRE team in day-to-day operations, which include monitoring, support activities, ensuring customer satisfaction through reliable service, and designing and building cloud infrastructure. You will collaborate with engineering and product teams to develop strategies that achieve high service reliability, performance, scalability, and availability. This role is critical in guiding Conexiom's cloud services, emphasizing operational excellence and the integration of SRE principles into our processes. Your efforts will be key in maintaining our commitment to providing dependable and scalable cloud solutions. If you have a strong background in site reliability engineering and a track record of leading teams to success, we welcome your application to join Conexiom.

Responsibilities:

Collaborate daily with cross-functional teams to identify and resolve issues related to performance, scalability, and security.
Establish and track key performance indicators (KPIs) to measure the effectiveness of our SRE team.
Establish service level objectives (SLOs) and monitor them to ensure they are met.
Operate, monitor, and maintain high-availability applications running in the Azure cloud environment.
Drive continuous improvement through automation, monitoring, and testing.
Execute automation for cloud-operations tasks and create new automation for new situations or issues encountered; automate everything.
Identify and address potential points of failure in the infrastructure and applications.
Lead and focus teams on root cause analysis, pattern identification, and continuous improvement to optimize application performance, resiliency, and reliability.
Facilitate blame-free root cause analysis meetings after production-system incidents so that the team can learn from mistakes and improve systems.
Help secure data and access policies to reduce risk.
Look for opportunities to drive operational efficiencies while reducing costs.
Prepare and present reports to all levels of leadership and staff.
Stay abreast of industry-leading best practices and bring them to the attention of the leadership team for innovative application.
Serve as a guide and mentor to members of the Cloud Platform SRE teams to aid in their growth and development.
Allocate available resources to meet operating objectives.
Ensure the ongoing training and development of direct reports.
Manage a 24/7 on-call rotation schedule.

Qualifications:

Experience managing teams (specifically SRE, Release and DevOps).
Strong experience with Azure cloud platform.
Experience working in an SRE environment and applying SRE principles.
Experience with CI/CD tools (Azure DevOps/Jenkins/GitLab).
Familiarity with Agile methodologies and DevOps best practices.
Strong grasp of infrastructure as code (e.g., Terraform, CloudFormation) and automation tools (e.g., Ansible, Chef, Puppet).
Experience with Kubernetes, AKS/GKE, Docker, containerization, microservices, and serverless architectures.
Proven track record of designing, implementing, and supporting highly available and scalable infrastructure in a cloud environment.
Experience administering and troubleshooting both Windows Server and Linux operating systems. Familiarity with Internet Information Services (IIS), Apache, and Nginx.
Proficiency in managing and monitoring relational databases (e.g., PostgreSQL, MySQL) and experience with NoSQL databases such as MongoDB.
Proficiency in at least one programming or scripting language (e.g., Python, Go, .NET, Bash).
Skilled in developing monitoring strategies and frameworks that provide real-time insights into system health, performance bottlenecks, and security vulnerabilities.
Expertise in automating alerting and troubleshooting processes to ensure rapid response to incidents and minimize downtime.
Bachelor's degree in computer science or related field.
At least 3 years of leadership experience, specifically managing SRE and DevOps teams.
8+ years of experience in SRE or DevOps roles.
Excellent communication and collaboration skills.
Ability to work in a fast-paced, dynamic environment.
Passion for technology and continuous learning.

About Conexiom:

Conexiom is a cloud-based, purpose-built automation platform that automates the most critical and complex B2B document transactions between buyers and sellers. Manufacturers and distributors across the globe, such as Grainger, Genpak, Honeywell, and Lonza, trust Conexiom to create resilient operations that scale, drive growth, reduce costs, and build frictionless relationships with their customers. Conexiom is based in Vancouver, British Columbia, and has offices in Kitchener, Ontario; London, England; and Chicago, Illinois. Visit Conexiom.com.

Conexiom embraces diversity and equal opportunity. We are committed to building a team that represents a variety of backgrounds, perspectives, and skills. We are working to ensure that the profile of our staff reflects the profile of the communities we work in and serve. For that reason, we seek resumes and expressions of interest from a broad and diverse talent pool. Strength comes from the inclusion of diverse perspectives and experiences.

Reasons to work for Conexiom:

Our MISSION is to transform broken processes into business value
We are DATA-driven and RESULTS-focused
We show our COMMITMENT to the people that make up Conexiom by:

Training and development opportunities
Competitive compensation
Work/Life balance – Open PTO Policy in North America & Flex days in the UK
Comprehensive health, dental, & vision insurance

We build products & internal processes that increase efficiencies and drive INNOVATION for our customers

Our VALUES

We care for each other
We hold ourselves accountable
We make our customers heroes
We over-communicate
We are inclusive & want to make change for the better

Conexiom is proud to offer equal employment opportunities. If you have a need that requires accommodation at any time during the recruitment process, please reach out to our people team at [email protected]

#LI-Hybrid


This position is no longer available.
