Site Reliability Engineer (SRE)
Petaling Jaya, Selangor
Hybrid | Direct hire | Job ID 7614 | Posted Jan 2, 2025
JOB DESCRIPTION
Job Title: Site Reliability Engineer (SRE)
Position Type: FT Permanent
Working Location: Petaling Jaya, Malaysia
About Horizontal: Established in the US in 2003, Horizontal solves complex challenges across two distinct businesses: Horizontal Digital and Horizontal Talent. We are consistently recognized for being a top workplace and one of the fastest-growing private companies. Horizontal Talent specializes in staffing for IT, Digital & Creative, and Business & Strategy markets. We have global offices in the US, UAE, India, Malaysia, and Australia.
Our client is a world-leading taste and nutrition company for the food, beverage and pharmaceutical industries. Every day we partner with customers to create healthier, tastier and more sustainable products that are consumed by billions of people across the world. Our vision is to be our customers' most valued partner, creating a world of sustainable nutrition. A career here will offer you an opportunity to shape the future of food while providing you opportunities to explore and grow in a truly global environment.
About the role
Job Purpose: The Site Reliability Engineer (SRE) will be responsible for ensuring the reliability, availability, and performance of Kerry's infrastructure systems and services. The SRE will work closely with the IPE and Automation teams to implement best practices, automate processes, and respond to incidents and service requests to maintain a high level of service reliability. Your work will ensure that Kerry continues to meet its operational and security KPIs across cloud, endpoint and network technologies.
Key Responsibilities:
• Incident Response: Monitor system performance, detect issues, and respond to incidents promptly to minimize downtime and impact on users.
• Blameless Postmortems: Participate in post-incident reviews (MIR&PAB) to identify root causes and implement improvements to prevent future occurrences.
• Automation: Develop and maintain automation scripts and tools to streamline operations and reduce manual intervention. Work with the Automation and IPE teams to deploy your scripts to production.
• Monitoring and Observability: Maintain monitoring and observability tools to gain insights into system performance and detect anomalies.
• Continuous Improvement: Collaborate with development and application teams to identify areas for improvement and implement changes to enhance system reliability and performance.
• Documentation: Create and maintain comprehensive documentation for systems, processes, and incident response procedures, to be shared with our Service Desk and other support teams.
• On-Call Support: Participate in on-call rotations to provide 24/7 support for critical systems and services.
Key Performance Indicators (KPIs):
• Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) incidents.
• System uptime and availability metrics.
• Number of incidents and postmortem reports completed.
• Automation coverage and reduction in manual tasks.
Qualifications:
• Relevant certifications in cloud platforms, DevOps, or SRE practices, with a focus on Azure (e.g., AZ-305, Azure Administrator Associate, Azure Network Engineer Associate, Azure IAM or similar).
Experience:
• Proven experience in a Site Reliability Engineering, DevOps, or Operational Support role.
• Experience with cloud platforms (primarily Azure; others are an advantage) and infrastructure as code (IaC) tools (e.g., Terraform, Azure DevOps, Ansible/Chef/Puppet).
• Strong background in system administration, networking, and security.
• Experience with monitoring and observability tools (e.g., SolarWinds, SiteScope, ThousandEyes).
• Experience with automation and orchestration tools (e.g., Ansible, Kubernetes).
Skills and Competencies:
• Technical Skills: Proficiency in programming and scripting languages.
• Problem-Solving: Excellent analytical and problem-solving skills to diagnose and resolve complex issues.
• Communication: Strong communication and collaboration skills to work effectively with cross-functional teams.
• Attention to Detail: Meticulous attention to detail to ensure accuracy and reliability in all tasks.
• Adaptability: Ability to adapt to changing priorities and work in a fast-paced environment.
• Continuous Learning: Commitment to continuous learning and staying up to date with industry trends and best practices.
Personal Attributes:
• Proactive: Takes initiative to identify and address potential issues before they become problems.
• Team Player: Works well in a team environment and contributes to a positive team culture.
• Customer-Focused: Committed to delivering high-quality services and ensuring customer satisfaction.
Horizontal is proud to be an Equal Opportunity and Affirmative Action Employer.
We seek to provide employment opportunities to talented, qualified candidates regardless of race, color, sex/gender including gender identity and/or expression, national origin, religion, sexual orientation, disability, marital status, citizen status, veteran status, or any other protected classification under federal, state or local law.
In addition, Horizontal will provide reasonable accommodations for qualified individuals with disabilities. If you need to request a reasonable accommodation in order to complete the application or interview process, please contact us.
All applicants applying must be legally authorized to work in the country of employment.