We are looking for a Site Reliability Engineer (SRE) to support reliable, high-performing production systems for automotive operations clients. This position focuses on strengthening service stability across edge and cloud environments through automation, observability, and disciplined operational practices. The role works closely with engineering and technical stakeholders to improve uptime, manage incidents, and deploy changes safely in real-time manufacturing settings.
Responsibilities:
• Maintain dependable and secure production environments across plant-edge and cloud-based systems, with a focus on uptime, responsiveness, and operational stability.
• Design, refine, and support monitoring dashboards, alerting frameworks, and operational runbooks using tools such as Prometheus, Grafana, and modern telemetry solutions.
• Build and manage infrastructure through code using Terraform, applying version control standards, peer reviews, and controlled deployment processes.
• Create automation scripts and lightweight tools in Bash and Python to streamline routine operations, recovery procedures, backup workflows, and environment setup.
• Take part in incident response and on-call coverage, troubleshoot service disruptions, coordinate initial communication, and document follow-up actions through blameless reviews.
• Define and track service level indicators (SLIs) and objectives (SLOs), helping stakeholders balance system dependability with release speed and operational risk.
• Support secure connectivity between factory networks and cloud resources by configuring and maintaining VPNs, routing, private networking, and access controls.
• Administer and optimize relational or time-series databases, including backup planning, replication, performance tuning, and long-term storage health.
• Contribute to CI/CD delivery practices by improving deployment pipelines, supporting controlled release strategies, and preparing rollback procedures when needed.
• Partner with controls, software, and data teams to enable reliable data flow from industrial systems and ensure safe deployment to edge infrastructure.
Qualifications:
• Bachelor’s degree in Information Technology, Computer Science, Computer Engineering, or comparable practical experience.
• At least 5 years of experience supporting production environments in a corporate, startup, or similarly fast-paced technical setting.
• Hands-on expertise with infrastructure as code, including Terraform, along with experience in cloud platforms and related services.
• Working knowledge of container technologies such as Docker and orchestration platforms like Kubernetes.
• Experience supporting live systems, participating in on-call rotations, and contributing to incident reviews and corrective actions.
• Proficiency with automation and scripting using Bash and Python to reduce manual operational effort.
• Strong communication skills with the ability to explain technical decisions and tradeoffs to cross-functional or non-technical stakeholders.
• Willingness and ability to travel to customer or plant locations as business needs require.