Live In Dcim Jobs (NOW HIRING)

Site Operations Lead

Fremont, CA · On-site

$160K - $200K/yr

... live sites -- keeping power, cooling, network, and compute infrastructure available and healthy in ... Monitor facility health continuously (DCIM, building management, environmental) and manage capacity ...

Site Operations Lead

Sunnyvale, CA · On-site

$160K - $200K/yr

... live sites -- keeping power, cooling, network, and compute infrastructure available and healthy in ... Monitor facility health continuously (DCIM, building management, environmental) and manage capacity ...

Site Operations Lead

Sonoma, CA · On-site

$160K - $200K/yr

... live sites -- keeping power, cooling, network, and compute infrastructure available and healthy in ... Monitor facility health continuously (DCIM, building management, environmental) and manage capacity ...

Site Operations Lead

San Mateo, CA · On-site

$160K - $200K/yr

... live sites -- keeping power, cooling, network, and compute infrastructure available and healthy in ... Monitor facility health continuously (DCIM, building management, environmental) and manage capacity ...

Site Operations Lead

Alameda, CA · On-site

$160K - $200K/yr

... live sites -- keeping power, cooling, network, and compute infrastructure available and healthy in ... Monitor facility health continuously (DCIM, building management, environmental) and manage capacity ...

Site Operations Lead

Santa Clara, CA · On-site

$160K - $200K/yr

... live sites -- keeping power, cooling, network, and compute infrastructure available and healthy in ... Monitor facility health continuously (DCIM, building management, environmental) and manage capacity ...

Lead BMS, EPMS, and DCIM integration, validation, and operational readiness testing activities ... Strong experience leading root cause analysis and technical problem resolution in live operational ...

Site Operations Lead

Mundelein, IL · On-site

$160K - $200K/yr

... live sites -- keeping power, cooling, network, and compute infrastructure available and healthy in ... Monitor facility health continuously (DCIM, building management, environmental) and manage capacity ...

This is a remote position but candidate must reside in Northern California territory. Candidate ... We are all about delighting our clients and live/breathe the end client/user experience * We have ...

Controls Automation PM - Data Center

Ashburn, VA · On-site

$85K - $113K/yr

... DCIM platforms * Coordinate site logistics, factory witness testing, procurement, installation ... Experience managing projects in live mission-critical environments, including retrofits and system ...

Controls Automation PM - Data Center

Des Moines, IA · On-site

$81K - $107K/yr

... DCIM platforms * Coordinate site logistics, factory witness testing, procurement, installation ... Experience managing projects in live mission-critical environments, including retrofits and system ...

Site Operations Lead

AI Fabrik

Fremont, CA • On-site

$160K - $200K/yr

Other

Posted 15 days ago


Key responsibilities

  • Own day-to-day operation and uptime of live sites, ensuring availability and health of power, cooling, network, and compute infrastructure in a 24x7 environment.

  • Manage the vendor ecosystem by defining, tracking, and enforcing SLAs, and coordinating facility maintenance, security, and related services.

  • Lead incident and outage response efforts, drive rapid resolution, conduct root-cause analysis, and implement preventive actions.


Job description

About AI Fabrik


AI Fabrik builds an edge inference delivery network for high-performance tokens, with faster time-to-market from grid to tokens. Our mission is to build the inference infrastructure we wished every enterprise already had — close to users, close to the cloud, and extremely resilient for real-time workloads. We are builders, architects, engineers, and researchers with hands-on experience in real-world AI deployment in production, and decades of data center experience that taught us exactly what needs to change.


AI Fabrik was incubated inside Gruve and is backed by Mayfield, Xora (Temasek), Acclimate Ventures, and Cisco Investments, all existing Gruve investors who followed us into this new chapter. We are deploying five initial production sites, with the first one coming online in July 2026.


About the Role


We are seeking an experienced operations leader to oversee the day-to-day management of our mission-critical infrastructure. In this role, you will be responsible for ensuring the reliability, availability, and scalability of live 24x7 production environments, while maintaining exceptional service levels for customers and stakeholders. The ideal candidate has hands-on experience operating critical facilities, establishing and managing service level agreements (SLAs), building strong vendor and partner relationships, and proactively identifying and mitigating risks before they impact operations. You will lead incident response efforts, drive capacity planning initiatives, manage operating budgets, and continuously improve operational processes to support business growth. Experience with high-density GPU deployments, AI infrastructure, and liquid cooling technologies is highly desirable. This is a unique opportunity to help shape and scale the operational foundation of next-generation AI infrastructure.


Key Responsibilities


  • Own day-to-day operation and uptime of our live sites — keeping power, cooling, network, and compute infrastructure available and healthy in a 24x7 environment
  • Manage the ongoing vendor ecosystem (facility maintenance, smart/remote hands, cooling, UPS and generator service, fire systems, physical security) — defining, tracking, and enforcing SLAs and holding each vendor to performance, response times, and budget
  • Build and run the preventive and corrective maintenance program, scheduling maintenance windows and coordinating vendors with minimal disruption to live workloads
  • Lead incident and outage response — own on-call and escalation, drive rapid resolution, and close the loop with root-cause analysis and preventive actions
  • Monitor facility health continuously (DCIM, building management, environmental) and manage capacity — power, cooling, space, and rack utilization — ahead of the engineering team's growth
  • Run change management for the live environment, and coordinate ongoing hardware operations (installs, moves, decommissions, cabling, cross-connects, spares) in support of engineering
  • Own operating budget and efficiency (opex, utility costs, PUE), physical security operations, and compliance, inspections, and audits (fire/safety, environmental, frameworks such as SOC 2)
  • Maintain operational documentation (runbooks, MOPs, SOPs/EOPs), report to leadership on uptime, capacity, incidents, and spend, and support new site bring-up and handover into operations as locations come online


Basic Qualifications


  • Proven experience operating live data center or critical facilities — owning uptime, maintenance, and vendor performance in a 24x7 environment. This is a hard requirement
  • Strong vendor and service-provider management: setting and enforcing SLAs and maintenance contracts, and holding multiple vendors accountable on availability, quality, and cost
  • Working knowledge of critical facility systems in operation — power (utility, switchgear, UPS, generators, PDUs), mechanical and liquid cooling, fire suppression, cabling, and physical security
  • Hands-on with monitoring and management tooling (DCIM, building/facility management, environmental), plus solid capacity planning for power, cooling, and space
  • A track record in incident and outage management — on-call ownership, fast resolution, root-cause analysis, and preventive follow-through
  • Experience managing operating budgets with demonstrated cost and efficiency control (including PUE/energy), and familiarity with relevant codes, standards, and audits (fire/safety, Uptime Institute, TIA-942)
  • Strong documentation discipline and stakeholder communication — crisp reporting to leadership and coordination across a distributed US/India team; willing to be on-site, carry on-call, and travel as operations demand
  • Exposure to high-density and GPU/AI infrastructure and liquid/immersion cooling is a strong plus, as is new-site bring-up experience and relevant certifications (e.g., CDCP/CDCDP/DCOM, PMP)


Salary Range

$160,000 - $200,000 USD + Benefits


Why AI Fabrik


At AI Fabrik, we hire for impact. We want those who challenge how inference infrastructure is built and who excel at delivering it in production. We are builders, architects, engineers, and researchers. We move fast, work with rigor, and care deeply about what runs in the real world.


We are committed to building a diverse and inclusive team. AI Fabrik is an equal opportunity employer. We welcome applicants from all backgrounds and thank all who apply; however, only those selected for an interview will be contacted.


Please note that this is an onsite position based out of AI Fabrik’s Redwood City, California office.