- Develop an ML pipeline and model management environment for building, training, and running inference on models in AV development, simulation, and last-mile testing
- Ensure support for multiple open-source frameworks (e.g., TensorFlow, PyTorch) and programming languages in multi-GPU workload scenarios involving model and/or data parallelism
- Orchestrate and schedule multiple parallel experiments (for example, AI model training runs) on pooled GPU resources in a Kubernetes cluster to maximize utilization and throughput while honoring job priorities
- Ensure role-based self-provisioning of infrastructure resources for data scientists with an automated workflow (model access, build, train, simulate, last-mile testing)
- Integrate with the data pipeline process to supply target datasets to models during training and simulation
- Evaluate MLOps ISVs to inform build vs. buy decisions for additional features
- Work directly with key AV customers to understand their technology and deliver the best solutions
- Experience in HPC/AI distributed computing environments using Kubernetes (K8s) orchestration and Slurm scheduling, including their optimization
- Understanding of hybrid-cloud considerations for burst capacity and run-time allocation of model training or development in the cloud vs. on-prem
- Well-versed in orchestrating and scheduling multiple parallel experiments (for example, AI model training runs) on pooled GPU resources in a Kubernetes cluster to maximize utilization and throughput while honoring job priorities
- Experience with scalability and operations/run-time considerations, including dynamic provisioning, suspend-resume, monitoring, and troubleshooting of model corruption and change-control issues
- Bachelor's degree in CS or IE/Data Science with 6+ years of experience in this field. Advanced degree preferred
- Ability to travel up to 50% on average, based on the work you do and the clients and industries/sectors you serve
- Limited immigration sponsorship may be available