Job Summary:
Databricks, the data and AI company, is seeking a Staff Software Engineer for GenAI Performance and Kernel. In this role, you will own the design and optimization of high-performance GPU kernels for GenAI inference, lead kernel development, and mentor other engineers in performance engineering.
Responsibilities:
• Lead the design, implementation, benchmarking, and maintenance of core compute kernels (e.g. attention, MLP, softmax, layernorm, memory management) optimized for various hardware backends (GPU, accelerators)
• Drive the performance roadmap for kernel-level improvements: vectorization, tensorization, tiling, fusion, mixed precision, sparsity, quantization, memory reuse, scheduling, auto-tuning, etc.
• Integrate kernel optimizations with higher-level ML systems
• Build and maintain profiling, instrumentation, and verification tooling to detect correctness and performance regressions, numerical issues, and hardware utilization gaps
• Lead performance investigations and root-cause analysis on inference bottlenecks, e.g. memory bandwidth, cache contention, kernel launch overhead, tensor fragmentation
• Establish coding patterns, abstractions, and frameworks to modularize kernels for reuse, cross-backend portability, and maintainability
• Influence system architecture decisions to make kernel improvements more effective (e.g. memory layout, dataflow scheduling, kernel fusion boundaries)
• Mentor and guide other engineers working on low-level performance, provide code reviews, and help set best practices
• Collaborate with infrastructure, tooling, and ML teams to roll out kernel-level optimizations into production, and monitor their impact
Qualifications:
Required:
• BS/MS/PhD in Computer Science or a related field
• Deep hands-on experience writing and tuning compute kernels (CUDA, Triton, OpenCL, LLVM IR, assembly, or similar) for ML workloads
• Strong knowledge of GPU/accelerator architecture: warp structure, memory hierarchy (global, shared, register, L1/L2 caches), tensor cores, scheduling, SM occupancy, etc.
• Experience with advanced optimization techniques: tiling, blocking, software pipelining, vectorization, fusion, loop transformations, auto-tuning
• Familiarity with ML-specific kernel libraries (cuBLAS, cuDNN, CUTLASS, oneDNN, etc.) or open kernels
• Strong debugging and profiling skills (Nsight, nvprof, perf, VTune, custom instrumentation)
• Experience reasoning about numerical stability, mixed precision, quantization, and error propagation
• Experience integrating optimized kernels into real-world ML inference systems; exposure to distributed inference pipelines, memory management, and runtime systems
• Experience building high-performance products leveraging GPU acceleration
• Excellent communication and leadership skills — able to drive design discussions, mentor colleagues, and make trade-offs visible
• A track record of shipping performance-critical, high-quality production software
Preferred:
• Publications in systems/ML performance venues (e.g. MLSys, ASPLOS, ISCA, PPoPP), experience with custom accelerators or FPGAs, or experience with sparsity or model compression techniques
Company:
Databricks is a data and AI platform that unifies data engineering, analytics, and machine learning on a lakehouse architecture. Founded in 2013, the company is headquartered in San Francisco, USA, and has 5,001-10,000 employees. It is currently a late-stage company.