Description
WHAT YOU DO AT AMD CHANGES EVERYTHING
At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.
THE ROLE:
We are looking for a Senior Compute Cluster Administrator who will be responsible for compute clusters in upcoming datacenter buildouts using AMD Instinct products. The incumbent is responsible for Day Two+ operations, including both proactive maintenance and reactive support issues.
This is an operational role where the userbase is primarily made up of AI Server hardware, software, and firmware developers. Incumbents will be responsible for a combination of R&D lab and production lab equipment, each with its own release cycles and requirements.
Collaboration is expected with IT, Infosec, highly technical end users, and infrastructure automation teams to ensure we meet delivery and governance targets.
KEY RESPONSIBILITIES:
- Work directly with the tenants and stakeholders to maximize the service quality, utilization, and availability of the clusters you manage
- Collaborate with highly technical users working deep within AMD's Instinct platform (e.g., ROCm) to troubleshoot misconfigurations that lead to poor performance of high-performance computing (HPC) resources.
- Lead the resolution of complex issues that arise during new deployments or ongoing operations.
- Work with various hardware vendors on technical escalations involving third-party OEMs and platforms. Be aware of upcoming releases from upstream vendors and coordinate update/maintenance cycles to accommodate them.
- Be comfortable supporting multiple distributions of Linux in both the Red Hat and Ubuntu/Debian families.
- Act as a subject matter expert in one or more cluster scheduling technologies (such as Slurm, LSF, Sun Grid Engine, OpenLava, or Kubernetes).
- Compare notes and configurations between clusters across AMD's estate, matching or differing behaviors between heterogeneous systems as required.
- Be willing to engage with vendors and technologies for which very limited formal technical documentation exists. The incumbent will be responsible for white-box hardware platforms running pre-beta devices.
- Maintain and revise compute images on various HPC clusters, leveraging or building automated CI/CD pipelines to integrate software components, and where such automation does not exist, be able to deploy such software manually.
- Monitor the compute clusters' status for overall health, performance, and availability. We typically standardize around Grafana/Prometheus/Zabbix for our monitoring stack.
- Work patiently with other team members to reproduce difficult-to-duplicate issues.
- Train and enable on-site L1 support teams
- Occasionally perform on-call incident response as L2
PREFERRED EXPERIENCE:
- Preferred applicants will have prior experience administering HPC clusters in a production/research/academic environment.
- Barring that, applicants must have both 1-3 years' experience as a user in an HPC environment (i.e., a scientific computing user) and 3-5 years' experience as a Linux system administrator in a corporate environment.
- Barring that, applicants must have both 1-2 years' experience as a software developer (any language) and 5-8 years' experience working with Linux in an enterprise/server environment at L2 or higher.
- And if nothing else, evidence of intermediate to advanced Linux skills. LPIC, RHCSA/RHCT, RHCE, and SCA certifications all count.
- A strong understanding of network fundamentals (OSI model, multi-homed machines, firewall troubleshooting with nmap/traceroute) and comfort supporting various interconnects (InfiniBand/CX5/CX6/CX7, 200Gb+ Ethernet, and others).
- A willingness to experiment with technologies that are open source and may not conform well to existing standards. As a semiconductor company, we tend to be on the leading edge of new hardware releases!
- Knowledge and experience with supporting infrastructure service technologies including DNS, DHCP, BOOTP, PXE, TFTP, NTP, PAM
- An understanding of what IPC (interprocess communication) is and what function it serves. Ideal applicants should be familiar with Open MPI, MPICH, or similar technologies.
- A strong understanding of standard Linux troubleshooting tools (e.g., nmap, gdb, lsof, sar, and others) and IPMI/iDRAC/iLO.
- Experience with RDMA is a plus. Experience with PCIe fundamentals, I2C, gcc optimization, and other low-level technologies is also a great plus. None of these are must-haves, but a willingness to go deep down the stack is wonderful.
- Enough familiarity with virtualization technology, switch VLANs, and DNS/Active Directory to communicate business needs clearly to teams supporting services for which you do not have logins.
- A strong command of the English language and the ability to draft technical documentation that your peers and end users can read easily.
- Experience with developing Python/Ansible scripts to automate infrastructure tasks (as needed).
- Familiarity with versioning tools (e.g., Git, SVN, CVS).
- A self-starter able to work independently but comfortable working in a team environment. You must possess a bias toward action in an industry which receives a lot of public scrutiny.
- Good analytical and problem-solving skills
- Dependable and flexible when necessary
ACADEMIC CREDENTIALS
- Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related technical field
#Hybrid
Benefits offered are described in AMD benefits at a glance.
AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.
AMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD's “Responsible AI Policy” is available here.
This posting is for an existing vacancy.