Site Reliability Engineer

99x Europe

Full-time

Remote

Worldwide

Job Category: DevOps / Sysadmin

About the Role:
We are looking for an Engineering Manager – SRE & Observability to lead technical teams and ensure system reliability, scalability, and automation. You will play a key role in developing strategies to transform our observability stack, optimizing cloud infrastructure, and fostering a culture of continuous improvement. Daily communication with the team and stakeholders will be in English, and the work can be done remotely.

Job Responsibilities:

Lead SRE / Observability teams to ensure system reliability, resiliency, and automation.
Develop and implement strategies to transform and optimize the observability stack based on SaaS solutions.
Promote continuous improvement in system architecture, deployment, and operational processes through automation.
Collaborate with stakeholders to define, prioritize, and maintain high-performance systems aligned with business goals.
Oversee the development and implementation of monitoring, alerting, metrics, logs, and traces to enable rapid issue detection and resolution.
Foster a culture of continuous learning and knowledge sharing within the teams.
Ensure compliance with security standards and best practices in all infrastructure and operations.
Manage and optimize cloud infrastructure costs while maintaining high availability and performance.
Provide technical leadership in adopting new technologies and methodologies to improve system reliability and efficiency.

Skills & Qualifications:

Experience as an Engineering Manager or in a leadership role in SRE/DevOps.
Strong analytical and problem-solving skills with a focus on innovation and continuous improvement.
Proactive, results-driven, and committed to fostering excellence and continuous learning.

Technical Skills:

Experience with cloud-native technologies on AWS, Azure, or GCP.
Experience with monitoring tools (Grafana, Prometheus, Dynatrace), logging (ELK, Loki), and tracing (Jaeger, OpenTelemetry).
Expertise in containers (Docker, Kubernetes), cloud infrastructure, and Infrastructure as Code (Terraform, Ansible).
Strong programming skills in Python, Go, and SQL, along with knowledge of network protocols (HTTP, DNS, TCP/IP).
Experience with CI/CD pipelines and process automation.
Ability to troubleshoot distributed systems and identify performance bottlenecks.

If this sounds like you, share your CV with us, and let’s talk!
