Job Description:
- Design, build, and maintain scalable and reliable infrastructure and services.
- Implement and maintain monitoring, alerting, and logging systems to ensure the health and performance of our systems.
- Collaborate with software engineering teams to design and implement reliable and efficient deployment pipelines.
- Identify and troubleshoot issues across the stack, from infrastructure to application level, and implement solutions to prevent recurrence.
- Automate manual tasks and processes to improve efficiency and reliability.
- Participate in the on-call rotation and respond to incidents to maintain availability.
- Collaborate with engineering teams during incidents and write post-mortem reports.
- Continuously evaluate and implement best practices and tools to optimize system performance and reliability.
- Support daily operational inquiries and take part in pairing sessions to resolve technical blockers.
Requirements:
- Proven experience as a Site Reliability Engineer (SRE) or similar role.
- Proficiency in at least one programming or scripting language (e.g., Bash, Python, Go, Java).
- Solid understanding of Linux/Unix systems and networking fundamentals.
- Experience with cloud platforms such as GCP, AWS or Alicloud.
- Hands-on experience with containerization technologies (e.g., Docker, Kubernetes).
- Proficiency in infrastructure-as-code and configuration management tools such as Terraform or Ansible.
- Experience with monitoring and logging tools such as Prometheus, Grafana, or OpenTelemetry.
- Excellent problem-solving and troubleshooting skills.
- Strong communication and collaboration skills, with the ability to work effectively in a team environment.
- Experience with CI/CD pipeline tools such as GitLab CI or GitHub Actions.
- Experience with an RDBMS such as MySQL or PostgreSQL, or a NoSQL platform such as MongoDB.
- Experience with caching platforms such as Redis or Memcached.
- Experience with distributed log or message queue platforms such as Kafka, Redpanda, or RabbitMQ.
- Humility, fairness, and good common sense when solving issues.