Site Reliability Engineer

99x Europe
Full-time
Remote
Worldwide

Job Category: DevOps / Sysadmin

About the Role:
We are looking for an Engineering Manager – SRE & Observability to lead technical teams and ensure system reliability, scalability, and automation. You will play a key role in developing strategies for transforming our observability stack, optimizing cloud infrastructure, and fostering a culture of continuous improvement. Daily communication with the team and stakeholders will be in English, and work can be done remotely.

Job Responsibilities:

  • Lead SRE / Observability teams to ensure system reliability, resiliency, and automation.
  • Develop and implement strategies to transform and optimize the observability stack based on SaaS solutions.
  • Promote continuous improvement in system architecture, deployment, and operational processes through automation.
  • Collaborate with stakeholders to define and prioritize requirements, and maintain high-performance systems aligned with business goals.
  • Oversee the development and implementation of monitoring, alerting, metrics, logs, and traces to enable rapid issue detection and resolution.
  • Foster a culture of continuous learning and knowledge sharing within the teams.
  • Ensure compliance with security standards and best practices in all infrastructure and operations.
  • Manage and optimize cloud infrastructure costs while maintaining high availability and performance.
  • Provide technical leadership in adopting new technologies and methodologies to improve system reliability and efficiency.

Skills & Qualifications:

  • Experience as an Engineering Manager or in a leadership role in SRE/DevOps.
  • Strong analytical and problem-solving skills with a focus on innovation and continuous improvement.
  • Proactive, results-driven, and committed to fostering excellence and continuous learning.

Technical Skills:

  • Experience with cloud-native technologies and with AWS, Azure, or GCP.
  • Experience with monitoring tools (Grafana, Prometheus, Dynatrace), logging (ELK, Loki), and tracing (Jaeger, OpenTelemetry).
  • Expertise in containers (Docker, Kubernetes), cloud infrastructure, and Infrastructure as Code (Terraform, Ansible).
  • Strong programming skills in Python, Go, and SQL, along with knowledge of network protocols (HTTP, DNS, TCP/IP).
  • Experience with CI/CD pipelines and process automation.
  • Ability to troubleshoot distributed systems and identify performance bottlenecks.

If this sounds like you, share your CV with us, and let’s talk!
