- Establish a detailed DGX A100 specification that reflects a representative customer's planning, deployment, and ongoing operations-optimization requirements for TCO, throughput, scalability, and flexibility across their varied workloads
- Set up the DGX SuperPOD reference environment, including DGX A100 compute nodes, compute and storage fabrics, management networks and software (DeepOps), key system software for optimizing GPU communication, I/O, and application performance, and user runtime tooling for SLURM- and Kubernetes-managed containers
- Design and document the most efficient setup to meet success metrics (TCO, performance, scale). Specific areas of focus:
- Network switch and fabric considerations for non-blocking, scalable bandwidth to sustain performance across varying dataset sizes and locations
- Storage and caching hierarchy implementations based on training vs. inference workloads. Establish storage management guidelines for allocating RAM/NVMe (internal storage) and external high-speed storage (DDN, NetApp, etc.) to optimize the performance and cost of running varied datasets and workloads. Establish rules for when to trigger the GPUDirect Storage (GDS) feature for lower-latency, higher-throughput I/O (see the placement sketch after this list)
- Management servers: infrastructure design and setup for user logins, provisioning (OS images and other internal infrastructure services for the pod), workload management (resource management, scheduling, and orchestration), container management, and system monitoring/logging
- Operations and runtime optimization of A100 compute resources (Multi-Instance GPU, or MIG, partitions) for varying workloads to maximize the utilization and throughput of jobs scheduled in a given node cluster (see the partitioning sketch after this list)
- Validate the commercial model with the MVP operational run/playbook
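To make the GDS trigger rules referenced above concrete, the following is a minimal sketch of a placement heuristic. The `JobIOProfile` fields and threshold values are hypothetical placeholders for this example; real cutoffs would come from benchmarking the pod's NVMe and external filesystem tiers.

```python
# Sketch only: deciding when a job's I/O path should request GPUDirect Storage.
# Thresholds and fields are illustrative assumptions, not tuned recommendations.
from dataclasses import dataclass

@dataclass
class JobIOProfile:
    dataset_gb: float          # total bytes read per epoch/pass
    avg_read_mb: float         # average size of a single read request
    random_access: bool        # random vs. mostly sequential access
    fs_supports_gds: bool      # filesystem/driver stack verified for GDS

def should_enable_gds(job: JobIOProfile,
                      min_read_mb: float = 1.0,
                      min_dataset_gb: float = 100.0) -> bool:
    """Return True when bypassing the CPU bounce buffer is likely to pay off."""
    if not job.fs_supports_gds:
        return False
    # Large, mostly sequential reads of datasets that do not fit in page cache
    # are where GDS typically helps; small random reads usually are not.
    return (job.dataset_gb >= min_dataset_gb
            and job.avg_read_mb >= min_read_mb
            and not job.random_access)

# Example: a 2 TB sequential training dataset on a GDS-capable filesystem.
print(should_enable_gds(JobIOProfile(2048, 8.0, False, True)))  # True
```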
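For the MIG-related bullet, here is a minimal sketch of partitioning a node's A100 and surfacing the resulting devices to the scheduler (SLURM gres.conf or the Kubernetes device plugin). The chosen layout of two 3g.20gb instances per GPU is an illustrative assumption, not a recommended configuration.

```python
# Sketch only: enable MIG on one A100 and create GPU instances via nvidia-smi.
# Profile IDs (e.g. 9 = 3g.20gb on a 40 GB A100) and the layout are assumptions.
import subprocess

def run(cmd):
    """Run a command and return its stdout, raising on a non-zero exit."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

def partition_gpu(gpu_index=0, profile_ids=("9", "9")):
    # Enable MIG mode on the selected GPU (may require a GPU reset).
    run(["nvidia-smi", "-i", str(gpu_index), "-mig", "1"])
    # Create GPU instances for the requested profiles and the default
    # compute instances in one step (-C).
    run(["nvidia-smi", "mig", "-i", str(gpu_index),
         "-cgi", ",".join(profile_ids), "-C"])
    # List devices so the MIG UUIDs can be registered with the scheduler.
    return run(["nvidia-smi", "-L"])

if __name__ == "__main__":
    print(partition_gpu())
```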
- Bachelor's degree or equivalent experience in Computer Architecture, Computer Science, Electrical Engineering, or a related field; advanced degree preferred
- 6+ years of proven experience in the design, deployment, and operation of production-grade HPC environments leveraging both SLURM and Kubernetes clusters
- Deep understanding of scale-out compute, networking, and external storage architectures for optimizing the performance and acceleration of AI/HPC workloads
- Proven experience deploying, upgrading, migrating, and driving user adoption of sophisticated enterprise-scale systems
- Prior software and solutions development background, with a proven ability to demonstrate complex new technologies
- Programming skills to build distributed storage and compute systems, backend services, microservices, and web technologies
- Well-versed in agile methodologies
- Comfortable in a customer-focused, fast-paced environment
- Ability to travel up to 50% on average, based on the work you do and the clients and industries/sectors you serve
- Limited immigration sponsorship may be available