Mackin is hiring a talented Software Engineer III. Please apply prepared to show that you have a thorough understanding of machine learning and, more specifically, deep learning (gradient descent, stochastic gradient descent, online learning). We are seeking a Software Engineer III who understands which hardware, software, and software frameworks are used to speed up model training and inference, and why (TensorFlow, PyTorch, CUDA, GPUs, NVIDIA DGX). Please keep in mind that our ideal candidate understands the software used to cluster the related hardware (Docker, Docker Swarm, Slurm, Kubernetes) and that Linux experience is an absolute must.
- Troubleshoot performance- and resource-related problems with PyTorch/TensorFlow model training and cluster utilization
- Build a cluster utilization dashboard and a monitoring and alerting system
- Manage and evolve the high-performance compute infrastructure used by research scientists for deep learning model training
- Help research scientists troubleshoot issues related to model training (infrastructure, performance, etc.)
- Build software to facilitate routine operations of a research group
- Maintain and improve scripts that run distributed jobs on an HPC cluster of Linux-based machines connected via a high-speed network
- Computational techniques: gradient descent, stochastic gradient descent
- Hardware and software frameworks: TensorFlow and/or PyTorch, CUDA
- Experienced C/C++, Python, Ruby software developer
- Expert level knowledge of Linux-based systems and cluster management
- High-speed network performance profiling and optimization
- Advanced understanding of Linux containers
- Advanced knowledge of cluster resource managers such as Slurm, Kubernetes, and Docker Swarm
- Previous experience with MPI and InfiniBand is very welcome
Bachelor’s degree in Computer Science, Mathematics, or a related field or