Job Summary:
Oak Ridge National Laboratory (ORNL) is a U.S. Department of Energy national laboratory with a legacy of addressing the nation’s most pressing challenges. ORNL is seeking a Research Software Engineer to join the Incident Modeling and Computational Sciences Group. The role involves designing and developing AI and data infrastructure to support various modeling and simulation tools.
Responsibilities:
• Design, develop, and operate enterprise AI and data infrastructure.
• Build, maintain, and scale Docker-based microservices, large language model (LLM) inference servers on GPU clusters, vector database and retrieval-augmented generation (RAG) pipelines, and observability stacks.
• Work independently and collaboratively with a multidisciplinary team of scientists, data engineers, and system administrators to deliver reliable, secure, and high-performance AI services to ORNL researchers.
Qualifications:
Required:
• A BS degree in computer science, software engineering, or a related technical field and a minimum of five years of relevant experience. A combination of education and experience may also be considered.
• Experience with the software development life cycle, including version control with Git, code review practices, and collaborative development workflows.
• Experience writing and maintaining production-quality code in Python, with exposure to one or more additional languages (e.g., JavaScript, Bash, C++).
• Experience deploying and debugging containerized applications using Docker and Docker Compose, including multi-service environments.
• Experience with Linux shell scripting in a command-line environment.
• Experience working in multi-disciplinary teams across all phases of the software development life cycle.
• This position requires the ability to obtain and maintain a Sensitive Compartmented Information (SCI) clearance from the Department of Energy. As such, this position is a Workplace Substance Abuse Program (WSAP) testing designated position. WSAP positions require passing a pre-placement drug test and participation in an ongoing random drug testing program. In addition, due to the SCI requirement, you may also be subject to random polygraph testing.
Preferred:
• Experience deploying or operating AI/ML serving infrastructure, including LLM serving frameworks such as vLLM, Ollama, or similar.
• Familiarity with model routing or proxy tools such as LiteLLM or comparable API gateway solutions.
• Experience with vector databases or retrieval-augmented generation (RAG) pipelines (e.g., Milvus, ChromaDB, Weaviate, or similar).
• Knowledge of reverse proxy and web infrastructure concepts, including Nginx configuration, TLS/mTLS certificate management, WebSocket proxying, and authentication subrequests.
• Experience with relational databases, including PostgreSQL administration and schema management.
• Familiarity with observability tooling such as OpenTelemetry, Prometheus, Grafana, Loki, or Tempo.
• Experience with HPC environments and job schedulers such as SLURM, or general experience deploying services on remote GPU clusters.
• Experience maintaining forks of open-source projects, including upstream merge management, patch backporting, and dependency CVE remediation.
• Familiarity with JavaScript or TypeScript and component-based frontend frameworks such as Svelte or React.
• Excellent written and oral communication skills.
• Motivated self-starter with the ability to work independently and to participate creatively in collaborative teams across the laboratory.
• Ability to function well in a fast-paced research environment, set priorities to accomplish multiple tasks within deadlines, and adapt to ever-changing needs.
Company:
Oak Ridge National Laboratory holds a range of R&D assignments, from fundamental nuclear physics to applied R&D on advanced energy systems. Founded in 1943, the laboratory is headquartered in Oak Ridge, Tennessee, USA, and has between 5,001 and 10,000 employees.