Site Reliability Engineer
About this job
Job Description:
We are hiring a Senior Site Reliability/DevOps Engineer to drive the reliability, scalability, and security of our financial platforms. This role is ideal for a seasoned engineer with deep experience in automating infrastructure, optimizing deployments, and building fault-tolerant systems in a regulated, high-stakes environment.
Key Responsibilities:
- Lead the design and implementation of resilient, scalable infrastructure using Infrastructure as Code (Terraform, CloudFormation, etc.)
- Own and optimize CI/CD pipelines and deployment strategies
- Proactively monitor, troubleshoot, and resolve system issues to minimize downtime
- Develop and maintain comprehensive observability solutions (logging, metrics, tracing, and alerting) to ensure full visibility into system performance and reliability
- Support and optimize AWS EMR clusters for data processing workloads, ensuring stability, cost-efficiency, and integration with data pipelines
- Champion automation and DevOps best practices across teams
- Collaborate with security and compliance teams to meet regulatory requirements
- Mentor junior engineers and contribute to architectural decisions
Requirements:
- 8+ years of experience in SRE, DevOps, or infrastructure engineering roles
- Expert-level knowledge of AWS (including EMR), Kubernetes, and Linux systems
- Strong experience with Docker, Terraform, CI/CD tools (e.g., Jenkins, GitLab CI), and scripting (Python, Bash)
- Proven track record managing mission-critical systems in financial or similarly regulated industries
- Deep understanding of observability tools and practices (e.g., Prometheus, Grafana, ELK, OpenTelemetry)
- Hands-on experience deploying, tuning, and managing AWS EMR clusters in production environments
Preferred:
- Experience with SOC 2, PCI DSS, or other compliance frameworks
- Relevant certifications (AWS, Kubernetes, etc.)