Linux Site Reliability Engineer Jobs in Tennessee

Reliability Engineer

Morristown, TN · On-site

$89K - $112K/yr

Continuous Improvement - Support site-wide initiatives to improve cost, safety, and operational ... reliability, maintenance, engineering, or manufacturing * Demonstrated ability to plan, organize ...

Reliability Engineer

Morristown, TN · On-site

$91K - $114K/yr

Continuous Improvement - Support site-wide initiatives to improve cost, safety, and operational ... reliability, maintenance, engineering, or manufacturing * Demonstrated ability to plan, organize ...

Reliability Engineer

Morristown, TN

$89K - $112K/yr

Continuous Improvement - Support site-wide initiatives to improve cost, safety, and operational ... reliability, maintenance, engineering, or manufacturing * Demonstrated ability to plan, organize ...

Reliability Engineer

Morristown, TN · On-site

$85K - $95K/yr

Continuous Improvement - Support site-wide initiatives to improve cost, safety, and operational ... reliability, maintenance, engineering, or manufacturing * Demonstrated ability to plan, organize ...

Reliability Engineer

Newport, TN

$87K - $110K/yr

Reporting to the Plant Manager, the Plant Reliability Engineer is responsible for leading ... Act as a catalyst of strategic thinking by challenging site paradigms. * Responsible for directing ...

Reliability Engineer

Newport, TN · On-site

$87K - $110K/yr

Reporting to the Plant Manager, the Plant Reliability Engineer is responsible for leading ... Act as a catalyst of strategic thinking by challenging site paradigms. * Responsible for directing ...

Reliability Engineer

Oak Ridge, TN · On-site

$98K - $123K/yr

ABOUT THE ROLE We are seeking a Reliability Engineer with an active Department of Energy (DOE) Q ... About the Site The NNSA's Y-12 National Security Complex, in Oak Ridge, Tennessee, is the nation ...

Preferred : • 7+ years of experience in SRE or infrastructure roles, ideally in hyperscale, cloud ... optimizing Linux-based systems for AI workloads, GPU clusters, or high-throughput compute ...

Linux Site Reliability Engineer information

What are some common challenges faced by Linux Site Reliability Engineers when scaling infrastructure, and how can they be addressed?

Linux Site Reliability Engineers often encounter challenges related to maintaining system stability and performance as infrastructure scales. Issues such as configuration drift, automation bottlenecks, and monitoring gaps can arise when managing numerous servers or services. Addressing these challenges typically involves implementing robust configuration management tools, investing in automated deployment pipelines, and enhancing observability through comprehensive monitoring and alerting solutions. Collaboration with development and operations teams is essential to ensure that scalability solutions align with business needs and technical requirements.
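
As a rough illustration of the configuration drift problem described above, the following minimal Python sketch compares a desired baseline against the settings observed on a host. The function name and the example values are hypothetical, and in practice this kind of check is usually delegated to a configuration management tool rather than written by hand.

```python
def find_drift(desired: dict[str, str], actual: dict[str, str]) -> dict[str, tuple]:
    """Return every setting whose observed value differs from the baseline.

    Each entry maps the setting name to a (desired, actual) pair; a value of
    None means the setting is missing on that side.
    """
    drift = {}
    for key in desired.keys() | actual.keys():
        want, have = desired.get(key), actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical baseline and observed host state, for illustration only.
baseline = {"vm.swappiness": "10", "net.core.somaxconn": "1024"}
host_state = {"vm.swappiness": "60", "net.core.somaxconn": "1024"}

print(find_drift(baseline, host_state))  # {'vm.swappiness': ('10', '60')}
```

Running a comparison like this across a fleet on a schedule, and alerting on non-empty results, is one simple way to surface drift before it causes an incident.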

What are the key skills and qualifications needed to thrive as a Linux Site Reliability Engineer, and why are they important?

To thrive as a Linux Site Reliability Engineer, you need deep expertise in Linux system administration, scripting (such as Bash or Python), and a solid understanding of networking concepts, usually backed by a computer science degree or equivalent experience. Familiarity with configuration management tools (like Ansible, Puppet, or Chef), containerization (Docker, Kubernetes), and cloud platforms (AWS, GCP, or Azure) is typically required, along with relevant certifications like RHCE or AWS Certified SysOps Administrator. Strong problem-solving skills, effective communication, and the ability to work under pressure are crucial soft skills for this role. These competencies ensure the reliability, scalability, and security of complex infrastructure, minimizing downtime and supporting seamless operations.

Who gets paid more, SRE or DevOps?

Generally, Site Reliability Engineers (SREs) tend to have higher salaries than DevOps engineers due to their specialized focus on system reliability, automation, and incident management. Both roles require strong skills in cloud platforms, scripting, and monitoring tools, but SREs often have more advanced expertise in reliability engineering practices, which can lead to higher compensation.

Will AI replace SRE jobs?

AI is expected to augment the work of Linux Site Reliability Engineers by automating routine tasks such as monitoring, incident response, and log analysis. However, SRE roles require complex problem-solving, system design, and decision-making that currently cannot be fully replaced by AI, making human expertise essential. SREs will likely focus more on overseeing automation tools and managing system reliability rather than being replaced entirely.

What engineer makes $500,000 a year?

Senior Linux Site Reliability Engineers and engineers in similar high-level roles in cloud infrastructure and large-scale systems can earn $500,000 or more annually, especially once bonuses and stock options are included. These positions typically require extensive experience, advanced skills in automation, scripting, and cloud platforms, and often involve leadership responsibilities.

What engineers make $300,000 a year?

Senior Linux Site Reliability Engineers with extensive experience and advanced skills in automation, cloud platforms, and monitoring tools can earn $300,000 or more annually, especially in high-cost-of-living areas or at large tech companies. Achieving this salary often requires specialized certifications, leadership roles, and a strong track record of managing complex infrastructure at scale.

What is the difference between a Linux Site Reliability Engineer and a Linux DevOps Engineer?

Aspect | Linux Site Reliability Engineer | Linux DevOps Engineer
Credentials | Linux certifications, SRE-specific training | Linux certifications, DevOps tools certifications
Work Environment | Focus on system reliability, monitoring, incident response | Focus on automation, CI/CD pipelines, deployment
Employer & Industry | Tech companies, cloud providers, large enterprises | Startups, tech firms, software development teams
Search & Comparison Intent | Understanding reliability roles, incident management | Automation, deployment, continuous integration

While both roles involve Linux expertise, a Linux Site Reliability Engineer primarily focuses on maintaining system reliability, monitoring, and incident response. In contrast, a Linux DevOps Engineer emphasizes automation, continuous integration, and deployment processes. Both roles require Linux skills and often overlap, but their core responsibilities differ based on organizational needs.

What is a Linux Site Reliability Engineer?

A Linux Site Reliability Engineer (SRE) is an IT professional responsible for ensuring the reliability, scalability, and performance of systems running on the Linux operating system. They bridge the gap between software development and operations by automating processes, monitoring infrastructure, and managing incidents. Linux SREs focus on system availability, building tools for deployment and monitoring, and improving system robustness through best practices and automation. Their work helps organizations deliver reliable online services and quickly recover from outages or system failures.

Sr. Software Engineer (Data Center Automation)

xAI

Memphis, TN • On-site

$119K - $157K/yr

Full-time

Posted 8 days ago


Job description

Job Summary:
xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. They are seeking a highly skilled Sr. Software Engineer to manage and enhance reliability across a multi-data center environment, focusing on automating processes and building robust observability solutions for mission-critical AI infrastructure.
Responsibilities:
• Design, develop, and deploy scalable code and services (primarily in Python and Rust, with flexibility for emerging languages) to automate reliability workflows, including monitoring, alerting, incident response, and infrastructure provisioning. We value adaptability to new tools and paradigms in the fast-evolving AI space.
• Implement and maintain observability tools and practices, such as metrics collection, logging, tracing, and dashboards, to provide real-time insights into system health across multiple data centers—open to innovative stacks beyond traditional ones like ELK.
• Collaborate with cross-functional teams, including software development, network engineering, site operations, and facility operations (critical facilities, mechanical/electrical teams, and data center infrastructure management), to identify reliability bottlenecks and automate solutions for fault tolerance, disaster recovery, capacity planning, and physical/environmental risk mitigation (e.g., power redundancy, cooling efficiency, and environmental monitoring integration). This role encourages broad skill sets from diverse technical backgrounds to foster innovation.
• Troubleshoot and resolve complex issues in data center environments, including hardware failures, environmental anomalies, software bugs, and network-related problems, while adhering to reliability principles like error budgets and SLAs. Key Insight: By applying SWE rigor to troubleshooting, team members can create reusable diagnostic tools that accelerate resolution, turning unscheduled events (e.g., hardware faults) into opportunities for system hardening and reducing overall end-user impact through targeted SLAs that prioritize critical AI services. We seek versatile problem-solvers who adapt to bleeding-edge challenges.
• Optimize Linux-based systems for performance, security, and reliability, including kernel tuning, container orchestration (e.g., Kubernetes or emerging alternatives), and scripting for automation.
• Understand network topologies and concepts in large-scale, multi-data center environments to effectively troubleshoot connectivity, routing, redundancy, and performance issues; integrate observability into data center interconnects and facility-level controls for rapid diagnosis and automation. Key Insight: In multi-site setups, network insights allow for automated failover mechanisms that handle both digital and physical disruptions, ensuring seamless continuity for end-users during events like fiber cuts or power outages. This attracts candidates from varied networking and systems backgrounds to drive forward-thinking solutions.
• Participate in on-call rotations, post-incident reviews (blameless postmortems), and continuous improvement initiatives to enhance overall site reliability, including joint exercises with facility teams for physical failover and recovery scenarios. We prioritize growth-minded individuals who embrace evolving practices.
• Mentor junior team members and document processes to foster a culture of automation, knowledge sharing, and adaptability to new technologies.
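
For readers unfamiliar with the error budgets mentioned in the responsibilities above, here is a minimal sketch of the arithmetic in Python. The 99.9% availability target, the 30-day window, and the function names are illustrative assumptions, not figures from this posting.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for a given SLO over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


# Hypothetical example: a 99.9% target over 30 days with 10.8 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))    # 43.2
print(round(budget_remaining(0.999, 10.8), 2))  # 0.75
```

A team following this practice would typically slow down risky changes as the remaining fraction approaches zero and prioritize reliability work once it goes negative.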
Qualifications:
Required:
• Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or a closely related technical field (or equivalent professional experience).
• 3+ years of hands-on experience in site reliability engineering (SRE), infrastructure engineering, DevOps, or systems engineering, preferably supporting large-scale, distributed, or production environments.
• Strong programming skills with proven production experience in Python (required for automation and tooling); experience with Rust or willingness to work in Rust is a plus, but strong coding fundamentals in at least one systems-level language (e.g., Python, Go, C++) are essential.
• Solid experience with Linux systems administration, performance tuning, kernel-level understanding, and scripting/automation in production environments.
• Practical knowledge of containerization and orchestration technologies, such as Docker and Kubernetes (or similar systems).
• Experience implementing observability solutions, including metrics, logging, tracing, monitoring tools (e.g., Prometheus, Grafana, or alternatives), alerting, and dashboards.
• Familiarity with troubleshooting complex issues in distributed systems, including software bugs, hardware failures, network problems, and environmental factors.
• Understanding of networking fundamentals (TCP/IP, routing, redundancy, DNS) in large-scale or multi-site environments.
• Experience participating in on-call rotations, incident response, post-incident reviews (blameless postmortems), and reliability practices such as error budgets or SLAs.
• Ability to collaborate effectively with cross-functional teams (software engineers, network teams, site/facility operations, mechanical/electrical teams).
Preferred:
• 5+ years of experience in SRE or infrastructure roles, ideally in hyperscale, cloud, or AI/ML training infrastructure environments with multi-data center setups.
• Hands-on experience operating or scaling Kubernetes clusters (or equivalent orchestration) at large scale, including automation for provisioning, lifecycle management, and high-availability.
• Proficiency in Rust for systems programming and performance-critical components.
• Direct experience integrating software reliability tools with physical data center infrastructure (e.g., power, cooling, environmental monitoring, facility controls) and automating responses to physical events.
• Exposure to advanced or innovative observability stacks beyond traditional tools (e.g., exploring cutting-edge alternatives for metrics, logs, and tracing).
• Experience building automated remediation, fault tolerance, disaster recovery, capacity planning, or predictive failure detection systems.
• Background in optimizing Linux-based systems for AI workloads, GPU clusters, or high-throughput compute environments.
• Demonstrated success reducing downtime, MTTR, or improving resource efficiency (e.g., through automation or observability) in high-stakes production settings.
• Prior work with bare-metal provisioning, data center interconnects, or hybrid/multi-site failover mechanisms.
• Mentoring experience, strong documentation skills, and a track record of fostering knowledge sharing and automation culture.
• Comfort with rapid technology adaptation in fast-evolving domains like AI infrastructure.
Company:
xAI is an artificial intelligence startup that develops AI solutions and tools to enhance reasoning and search capabilities. It is a sub-organization of SpaceX. Founded in 2023, the company is headquartered in Palo Alto, USA, with a team of 1001-5000 employees. The company is currently Late Stage.