- Build CPU and GPU clustered compute systems
- Design, implement, and support our internal and cloud systems
- Track key metrics and logs
- Oversee capacity and planning of our clusters
- Work directly with application developers to help investigate upgrades, system tweaks, and next-generation hardware
- Build cool stuff
- Participate in a 24x7 on-call rotation
- Familiar with GPU usage in compute clusters
- Familiar with CUDA and TensorFlow workloads
- Expert-level knowledge of virtual platforms (vSphere, Xen, Docker, or KVM)
- Experience with large HPC clusters (>10,000 cores)
- Familiar with container clustering (Kubernetes, Swarm, etc.)
- Familiar with job and resource scheduling managers (Slurm preferred; LSF, etc.)
- Ability to script in any of the following: Perl, Python, Ruby, or Bash
- 10+ years of experience and ability to work with little or no supervision
About us

Zoox is a Menlo Park, CA-based robotics company founded by Tim Kentley-Klay and Dr. Jesse Levinson to create autonomous mobility. Operating at the intersection of design, computer science, and electro-mechanical engineering, Zoox is a multidisciplinary team working to imagine and build an advanced mobility experience that will support the future needs of urban mobility for both people and the environment.