Ceph Storage Jobs in Michigan (NOW HIRING)

Manager, Reliability Operations

Lansing, MI · On-site +1

$110K - $150K/yr

Work across environments including virtualization platforms (VMware), distributed storage (Ceph), Linux-based systems, and hybrid cloud infrastructure * Support platforms that span dedicated hosting ...

Ceph Storage information

What are some common challenges faced by professionals working in Ceph Storage administration, and how can they be addressed?

Professionals managing Ceph Storage environments often encounter challenges such as maintaining cluster health, balancing performance and scalability, and troubleshooting hardware or network failures. These issues can be addressed by regularly monitoring cluster metrics, using Ceph's built-in tools for diagnosis, and proactively planning for capacity expansion. Collaborating closely with system administrators and network engineers is also crucial to ensure optimal integration and quick resolution of issues. Staying updated with Ceph community best practices and documentation helps administrators adapt to evolving technologies and maintain robust, high-performing storage solutions.
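As a rough illustration of the "regularly monitoring cluster metrics" advice, the sketch below summarizes a cluster health report in Python. It assumes the report was obtained from something like `ceph status --format json` and that it has a `health` object with a `status` field and a `checks` map. That layout, the sample data, and the `summarize_health` helper are assumptions made for illustration and should be verified against the Ceph release in use.

```python
import json


def summarize_health(status: dict) -> list[str]:
    """Return one readable line for the overall state plus one per active check.

    Assumes (not guaranteed for every Ceph release) the shape:
    {"health": {"status": ..., "checks": {NAME: {"severity": ..., "summary": {"message": ...}}}}}
    """
    health = status.get("health", {})
    lines = [f"overall: {health.get('status', 'UNKNOWN')}"]
    # Sort by check name so the output order is stable between runs.
    for name, check in sorted(health.get("checks", {}).items()):
        severity = check.get("severity", "UNKNOWN")
        message = check.get("summary", {}).get("message", "")
        lines.append(f"{name} [{severity}]: {message}")
    return lines


# Hypothetical sample report, hand-written for this example (not real cluster output).
SAMPLE = json.loads("""
{
  "health": {
    "status": "HEALTH_WARN",
    "checks": {
      "OSD_DOWN": {
        "severity": "HEALTH_WARN",
        "summary": {"message": "1 osds down"}
      }
    }
  }
}
""")

if __name__ == "__main__":
    for line in summarize_health(SAMPLE):
        print(line)
```

In practice the dictionary would come from the cluster rather than a hard-coded sample, and the resulting lines could feed an alerting or reporting tool.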

What is Ceph Storage?

Ceph Storage is an open-source, distributed storage system designed to provide excellent performance, reliability, and scalability. It supports object, block, and file storage in a unified platform, making it suitable for cloud infrastructure and enterprise environments. Ceph automatically manages data replication and recovery, helping to ensure high availability and fault tolerance. The system can run on commodity hardware, reducing costs and simplifying expansion as storage needs grow.

What is the difference between Ceph Storage and a Storage Engineer?

| Aspect | Ceph Storage | Storage Engineer |
| --- | --- | --- |
| Credentials | Knowledge of distributed storage, Linux, and storage protocols | Certifications like Cisco, CompTIA Storage+, or vendor-specific certifications |
| Work Environment | Data centers, cloud environments, large-scale storage deployments | Data centers, enterprise IT teams, cloud providers |
| Industry Usage | Open-source storage solutions, cloud infrastructure | Design, implementation, and management of storage systems |
| Search/Comparison Intent | Technical understanding of storage solutions | Career options, job roles, and skills |

Ceph Storage is an open-source, distributed storage platform used to build scalable storage clusters, often in cloud and data center environments. Storage Engineers design, deploy, and maintain storage systems, including Ceph, focusing on performance and reliability. While Ceph Storage refers to the technology itself, Storage Engineers are professionals who work with Ceph and other storage solutions to meet organizational needs.

What are the key skills and qualifications needed to thrive as a Ceph Storage Engineer, and why are they important?

To thrive as a Ceph Storage Engineer, you need a strong background in Linux systems administration, networking, and distributed storage concepts, often supported by a relevant degree or certifications like Red Hat Certified Engineer (RHCE). Familiarity with tools such as Ceph, Ansible, monitoring systems like Prometheus, and scripting languages is typically required. Critical soft skills include problem-solving, attention to detail, and effective communication for troubleshooting and collaborating with IT teams. These abilities are vital to ensuring high availability, scalability, and reliability of storage solutions in enterprise environments.
Manager, Reliability Operations

Nexcess

Lansing, MI • On-site, Remote

$110K - $150K/yr

Full-time

Retirement

Posted 29 days ago


Job description

About Nexcess
Nexcess provides specialty cloud solutions for organizations where performance and compliance have to coexist. We serve businesses worldwide, from agencies scaling client sites to enterprises running mission-critical operations. We've built our reputation on deep technical expertise and genuine partnership with every client we work with. Behind every environment we manage is a team of people who take the craft seriously and keep showing up when it matters.
About the Role
We're looking for a Manager of Reliability Operations to lead how we detect, respond to, and learn from failures across our platform ecosystem.
This role sits at the intersection of Operations and Engineering, bringing structure to incident response, accountability to follow-through, and clarity to reliability insights. You'll ensure that what we learn from production directly improves how our platforms are built, operated, and scaled.
This is a permanent, full-time, remote position.
US Pay Band: $110K - $150K. Actual compensation will vary based on experience, skills, and location.
What You'll Do
Own Reliability Operations & Incident Command
  • Continuously evolve and improve incident management, change management, and post-incident practices
  • Establish clear standards for incident declaration, severity, escalation, and communication
  • Ensure consistent execution across teams and continuous process improvement
  • Own the incident command function, including roles, structure, and operating procedures
  • Lead or oversee major incident response in a 24/7 production environment
  • Build and manage on-call incident commander rotations with global coverage
Drive Learning, Accountability & Reliability Strategy
  • Own post-incident reviews, ensuring strong root cause analysis and clear documentation
  • Translate incident trends into actionable reliability improvements
  • Drive completion of corrective actions across teams; escalate when needed
  • Define and maintain service performance and reliability targets (availability, latency, error rates)
  • Own observability strategy, including monitoring, alerting, and signal quality
  • Improve detection, reduce time to resolution, and increase platform resilience
  • Partner with Engineering and Operations on capacity planning, patching, and lifecycle decisions
  • Ensure reliability insights directly inform platform and infrastructure roadmaps
  • Collaborate with Security on vulnerability response, patch prioritization, and compliance alignment
Operate Across a Complex Platform Environment
  • Work across environments including virtualization platforms (VMware), distributed storage (Ceph), Linux-based systems, and hybrid cloud infrastructure
  • Support platforms that span dedicated hosting, managed applications, and high-availability cloud services
  • Ensure reliability practices scale across multiple products, brands, and customer environments
  • Provide regular, data-driven reporting to leadership on availability, incident trends, and operational performance
  • Act as the central authority on reliability insights across teams
What You Bring
  • Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
  • 7+ years of experience in systems operations, site reliability, or platform engineering
  • 2+ years of experience leading teams or major operational functions
  • Proven experience managing incidents in a 24/7 production environment
  • Strong background in troubleshooting, root cause analysis, and operational improvement
  • Experience with change management practices
Platform & Tooling Experience
  • Monitoring and observability platforms (e.g., Datadog, Prometheus, Grafana, New Relic)
  • Incident management and alerting tools (e.g., PagerDuty, Opsgenie)
  • Infrastructure and platform technologies (Linux systems, VMware, Ceph, cloud platforms)
  • Logging and telemetry systems (centralized logging, metrics, tracing)
  • Ability to translate complex technical data into clear insights
  • Strong communication skills, especially in high-pressure situations

Nice to Have
  • Background in Computer Science, Engineering, or a related field
  • Experience in managed hosting, cloud infrastructure, or SaaS environments
  • Experience defining and tracking system reliability and performance targets
  • Familiarity with ITIL or similar operational frameworks
  • Exposure to VMware, Ceph, Linux, and Windows platforms
  • Relevant certifications (AWS, RHCE, etc.)
What We Offer
  • Comprehensive benefits package
  • Traditional and Roth 401(k) with company matching
  • A collaborative, team-oriented culture
  • Consistent and predictable work hours
  • Engaging, varied work that keeps each day different
  • Opportunities to contribute ideas and influence how work gets done
Disclaimer:
This job description is only a summary of the typical functions of the position. It is not intended to be an exhaustive or comprehensive list of all job responsibilities, tasks, or duties. Additional duties and tasks may be assigned as part of the job function. Nexcess reserves the right to modify, interpret, or apply this job description in a way that best supports the organizational needs. The job description in no way creates or implies an employment contract. The employment contract remains "at will".
Equal Employment Opportunity Policy:
Nexcess is committed to offering equal employment opportunity without regard to age, color, disability, gender, gender identity, genetic information, marital status, military status, national origin, race, religion, sexual orientation, veteran status, or any other legally protected characteristic.