Job Summary:
Texas A&M University is making a bold leap into the future of artificial intelligence with significant investments in supercomputing. The Senior HPC Engineer will provide technical expertise for the design and deployment of HPC systems, manage cluster operations, and lead enterprise-wide HPC projects.
Responsibilities:
• Manage large-scale HPC cluster operations, including OS upgrades, firmware patching, and performance tuning.
• Oversee networking, security, and infrastructure for HPC systems.
• Lead the development of specialized HPC computing clouds and scalable storage systems.
• Collaborate with stakeholders to develop service-based solutions.
• Serve as a strategic technical resource across departments.
• Lead enterprise-wide HPC projects using established project management protocols.
• Mentor junior system administrators and enforce performance standards.
Qualifications:
Required:
• Bachelor’s degree in applicable field or equivalent combination of education and experience
• 12 years of related experience
• Must be a United States citizen, permanent resident, or a person granted asylum or refugee status in accordance with 15 CFR, Part 762; 22 CFR §§122.5, 123.22 and 123.26; and 31 CFR § 501.601
Preferred:
• Experience with High Performance Computing (HPC) environments
• Advanced Linux system administration skills
• Familiarity with computer networking concepts and protocols
• Experience with container orchestration tools such as Kubernetes
• Knowledge of Run:ai for AI workload management
• Proficiency with Slurm workload manager
• Experience working with NVIDIA DGX systems
• Understanding of virtualization technologies
• Familiarity with Infrastructure as a Service (IaaS) platforms
• Experience with DDN storage solutions
• Knowledge of network-attached storage systems
• Expertise in scalable supercomputing architectures, interconnects, and storage systems
• Proficiency in scripting (Python, Bash, Perl) and scientific computing (MPI, OpenMP, CUDA)
• Experience with configuration management tools (Ansible, Puppet)
• Familiarity with container technologies (Docker, Singularity, Kubernetes)
• Strong troubleshooting, communication, and strategic planning skills
Company:
Texas A&M University has a proud history that stretches back to 1876, when The Agricultural and Mechanical College of Texas became the first public institution of higher learning in the state of Texas. The university is headquartered in College Station, TX, US, and has more than 10,000 employees.