Google Cloud Platform Supercomputer Solutions
Google is seeking a supplier to provide engineering, maintenance, and enhancement services for its Google Cloud Platform ("GCP") Supercomputer Solutions.
Head of Supercomputing
San Jose, CA · On-site
$2.0K/mo
We are seeking a Head of Supercomputing to define and lead the architecture, software stack, and operational model for Etched's cluster-scale AI compute systems. This leader will own the end-to-end ...
Senior Software Engineer, TPU Supercomputer, Infrastructure, Cloud
Sunnyvale, CA · On-site
$127K - $173K/yr
Design and maintain TPU supercomputer software across various layers of the stack, including host-side daemons and hardware-level interfaces. * Implement network routing rules directly within TPU ...
Supercomputing Data Processing Engineer
Annapolis Junction, MD · Hybrid
$117K - $140K/yr
Supercomputing Data Processing Engineer. Location: Annapolis Junction, MD 20701. Clearance: TS/SCI Full Poly (please note this position requires full U.S. citizenship). Key summary: We are seeking a highly ...
Internship Description: Building on the ALCF's robust training program in the areas of AI and supercomputing, we are hosting a series of hands-on courses that will teach attendees to use leading-edge ...
As a member of the TPU Machine Learning Supercomputer (MLSC) team, you will design and develop features to significantly improve the scalability and reliability of large-scale software across TPUs ...
Supercomputing Intern
San Jose, CA · On-site
Job Summary: Our supercomputing role focuses on the design, development, and deployment of ML system software required for operating rack-scale systems. Your work will span network performance ...
Supercomputing Engineer
San Jose, CA · On-site
$200K - $275K/yr
We are seeking a highly skilled and motivated Engineer to join our Supercomputing team to help build the foundational software that powers our cluster-scale AI compute deployments. This role on the ...
As a Staff Software Engineer in the TPU Machine Learning Supercomputer team, you will design and develop features to enhance the scalability and reliability of large-scale software across TPUs and ...
About the Role: RadixArk is hiring a Member of Technical Staff - Supercomputing to help build, deploy, and operate production-grade AI infrastructure for frontier-scale inference and training ...
The Mission: StaffRight Associates is recruiting for a Systems Engineer (Agents/Clusters/Supercomputers). As a Systems Engineer, you will serve as a vital catalyst within a premier organization. The ...
Supercomputing Engineer
San Jose, CA · On-site
$2.0K/mo
We are seeking a highly skilled and motivated Engineer to join our Supercomputing team to help build the foundational software that powers our cluster-scale AI compute deployments. This role on the ...
Member of Technical Staff, Supercomputing Platform & Infrastructure
San Francisco, CA · On-site +1
$200K - $550K/yr
About the role: As an engineer on the Supercomputing Platform & Infrastructure team, you will design, build, and operate the large-scale GPU infrastructure that powers Magic's model training and ...
The role leads the team responsible for maintaining and upgrading the infrastructure supporting NERSC's supercomputing data center, including next-generation systems, liquid-cooled HPC and AI ...
Software Engineer, Frontier Systems - Power Management
San Francisco, CA · On-site
$295K - $445K/yr
With large-scale supercomputers consuming substantial amounts of power, managing this efficiently is key to maximizing computational capacity. This role is critical to ensuring that our cutting-edge ...
Supercomputer information
What are the key skills and qualifications needed to thrive in a Supercomputer Engineer position, and why are they important?
To thrive as a Supercomputer Engineer, you need expertise in high-performance computing (HPC), computer architecture, parallel programming, and advanced mathematics, often supported by a degree in computer science, engineering, or a related field. Familiarity with MPI, OpenMP, and Linux systems is often critical, and certifications like Certified HPC Professional can also help. Strong problem-solving, collaboration, and communication skills set exceptional candidates apart in multidisciplinary environments. These competencies are essential for building, optimizing, and managing supercomputing resources that drive scientific discovery and innovation.
What are the typical responsibilities of a Supercomputer Engineer on a daily basis?
Supercomputer Engineers are responsible for designing, configuring, and maintaining high-performance computing systems to support complex computations in fields such as scientific research, weather modeling, and data analytics. On a daily basis, they might monitor system performance, troubleshoot hardware or software issues, optimize code for scalability, and collaborate closely with researchers and IT professionals to ensure workloads run efficiently. Additionally, they often assist in upgrading systems and implementing the latest technologies to maximize computational power. Working in this role offers opportunities for ongoing professional development and cross-functional teamwork, making each day both challenging and rewarding.
What is a Supercomputer job?
A Supercomputer job typically involves working with high-performance computing (HPC) systems to process complex calculations at extremely high speeds. Professionals in this field may develop software, optimize system performance, manage hardware infrastructure, or support scientific and engineering research. These roles are common in fields such as climate modeling, artificial intelligence, biomedical research, and financial simulations.

Other
Posted 5 days ago
Job description
Google is seeking a supplier to provide engineering, maintenance, and enhancement services for its Google Cloud Platform ("GCP") Supercomputer Solutions. The supplier will be responsible for supporting and enhancing two key product areas: Cluster Toolkit and HyperCompute Cluster Service (HCS). This work involves a combination of ongoing operational tasks, testing, documentation, and specific development deliverables.
Scope of Work & Deliverables
The supplier will be responsible for the services and deliverables detailed below.
Ongoing Maintenance: The contractor must provide ongoing maintenance and enhancements for all 6 projects covered under the original Statement of Work.
Cluster Toolkit is an open-source software solution that simplifies the deployment of high-performance computing (HPC), artificial intelligence (AI), and machine learning (ML) workloads on Google Cloud.
Ongoing Responsibilities:
- Stability Testing: Test the stability of new products, beginning with A3U. This includes:
  - Building NVIDIA Collective Communications Library (NCCL) tests on a Slurm cluster.
  - Setting up and running pairwise tests to identify and report bad nodes.
- Integration Test Triage: Perform rotational duties to manage and triage integration test failures. This includes:
  - Monitoring daily failure chats and flake tools.
  - Reporting on failures and performing advanced handling, such as creating new bug reports and categorizations.
- Documentation: Improve, organize, and maintain the Cluster Toolkit documentation. This process involves:
  - Gathering existing documents and identifying information gaps.
  - Creating new documentation and updating existing materials.
  - Organizing the information in g3docs, consolidating it in a team Google Drive, and establishing a review process.
- Project Cleanup: Once a week, clean up the 'hpc-toolkit-dev' project by identifying and deleting unused resources.
- Security: Triage and address security alerts by checking for them, creating pull requests (PRs) to resolve them, and applying the necessary updates.
- HPC VM Image Releases: Deliver 4-6 High-Performance Computing Virtual Machine (HPC VM) image releases during 2025.
- Software Widget Releases: Release new software widgets every two weeks during 2025, including managing any necessary hotfixes.
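The pairwise testing mentioned under Stability Testing can be sketched roughly as follows. This is only an illustration of the bad-node search logic, not the team's actual tooling. The function name `find_bad_nodes` and the `run_pair` callback are hypothetical; in practice `run_pair` would launch a two-node NCCL test through Slurm and report whether it passed. The sketch also assumes that a bad node fails every pair it takes part in.

```python
from typing import Callable


def find_bad_nodes(nodes: list[str], run_pair: Callable[[str, str], bool]) -> set[str]:
    """Return the nodes that still fail when paired with a known-good node.

    run_pair(a, b) is a hypothetical callback that returns True when the
    two-node test on nodes a and b passes.
    """
    passed: set[str] = set()
    suspects: set[str] = set()

    # Round 1: test disjoint neighbouring pairs.
    pairs = [(nodes[i], nodes[i + 1]) for i in range(0, len(nodes) - 1, 2)]
    # With an odd node count, pair the leftover node with the first node.
    if len(nodes) > 1 and len(nodes) % 2 == 1:
        pairs.append((nodes[-1], nodes[0]))

    for a, b in pairs:
        if run_pair(a, b):
            passed.update((a, b))
        else:
            suspects.update((a, b))

    # Assumption: a node that passed in any pair is healthy.
    suspects -= passed
    if not passed:
        # No healthy reference node is available, so report every suspect.
        return suspects

    # Round 2: retest each suspect against one known-good reference node.
    reference = sorted(passed)[0]
    return {node for node in suspects if not run_pair(node, reference)}


# Simulated usage: "n3" is the only bad node in a six-node partition.
simulated_bad = {"n3"}
all_nodes = [f"n{i}" for i in range(6)]
result = find_bad_nodes(
    all_nodes, lambda a, b: a not in simulated_bad and b not in simulated_bad
)
print(sorted(result))  # ['n3']
```

The two-round design keeps the number of test launches close to half the node count when the cluster is mostly healthy, since only nodes from failing pairs are retested.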
HyperCompute Cluster Service (HCS) is a service that enables the deployment and management of resilient, high-performance AI and HPC systems at scale.
Key Deliverables:
- API Integration Testing: Add comprehensive integration tests for all HCS Application Programming Interface (API) surfaces. Coverage must include:
  - HypercomputeClusters: Create, Delete, Update, Get, and List requests and responses.
  - Network: NetworkInitialize params.
  - Storage: StorageInitialize, FileStoreInitialize, Filestore tier, ParallelstoreInitialize, and GcsInitialize params.
  - Compute: Resource request, Guest accelerator, Disk, Provisioning model, Reservation affinity and type, Orchestrator, Slurm, Node test, Storage configuration, and Slurm partition.
- Critical User Journey (CUJ) Validation: Add integration tests to validate the following critical user journeys:
  - Creating a cluster that consumes a reservation.
  - Creating a cluster with a new network and new storage.
  - Creating a cluster using a pre-existing network and storage created both outside of HCS and by a previous HCS deployment.
  - Destroying all components of an HCS-created cluster.
  - Destroying a cluster while leaving the network and storage intact.
  - Updating a Slurm cluster to add a new reservation to both new and existing partitions.
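
As a rough illustration of how a few of the critical user journeys above might be expressed as tests, here is a minimal sketch. It runs against an in-memory stand-in, not against HCS itself. The `FakeCloud` class, its methods, and all resource names are hypothetical and do not reflect the real HCS API; real integration tests would call the service and verify actual cloud resources.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCloud:
    """Hypothetical in-memory stand-in for a cluster service (NOT the HCS API)."""

    networks: set[str] = field(default_factory=set)
    storage: set[str] = field(default_factory=set)
    clusters: dict[str, dict[str, str]] = field(default_factory=dict)

    def create_cluster(self, name: str, network: str, storage: str,
                       create_missing: bool = True) -> None:
        # Reuse existing resources; create missing ones only when allowed.
        for pool, resource in ((self.networks, network), (self.storage, storage)):
            if resource not in pool:
                if not create_missing:
                    raise LookupError(f"{resource} does not exist")
                pool.add(resource)
        self.clusters[name] = {"network": network, "storage": storage}

    def delete_cluster(self, name: str, keep_network_and_storage: bool = False) -> None:
        cluster = self.clusters.pop(name)
        if not keep_network_and_storage:
            self.networks.discard(cluster["network"])
            self.storage.discard(cluster["storage"])


def test_create_with_new_network_and_storage() -> None:
    cloud = FakeCloud()
    cloud.create_cluster("c1", "net-a", "fs-a")
    assert "c1" in cloud.clusters
    assert cloud.networks == {"net-a"} and cloud.storage == {"fs-a"}


def test_create_with_preexisting_network_and_storage() -> None:
    cloud = FakeCloud(networks={"net-a"}, storage={"fs-a"})
    cloud.create_cluster("c1", "net-a", "fs-a", create_missing=False)
    assert cloud.clusters["c1"] == {"network": "net-a", "storage": "fs-a"}


def test_destroy_all_components() -> None:
    cloud = FakeCloud()
    cloud.create_cluster("c1", "net-a", "fs-a")
    cloud.delete_cluster("c1")
    assert not cloud.clusters and not cloud.networks and not cloud.storage


def test_destroy_keeps_network_and_storage() -> None:
    cloud = FakeCloud()
    cloud.create_cluster("c1", "net-a", "fs-a")
    cloud.delete_cluster("c1", keep_network_and_storage=True)
    assert not cloud.clusters
    assert cloud.networks == {"net-a"} and cloud.storage == {"fs-a"}


ALL_TESTS = [
    test_create_with_new_network_and_storage,
    test_create_with_preexisting_network_and_storage,
    test_destroy_all_components,
    test_destroy_keeps_network_and_storage,
]

for test in ALL_TESTS:
    test()
print(f"{len(ALL_TESTS)} journeys validated")
```

Keeping one test function per journey makes a failure point directly at the journey that broke, which also suits the triage rotation described for Cluster Toolkit.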