Company: AMD
Location: Santa Clara, CA
Career Level: Director
Industries: Technology, Software, IT, Electronics

Description



WHAT YOU DO AT AMD CHANGES EVERYTHING 

At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond.  Together, we advance your career.  



THE ROLE:

We are seeking a hands‑on Principal Networking Engineer to own end‑to‑end QoS strategy and implementation across data center SmartNICs/DPUs. You will define traffic classification, shaping, scheduling, and congestion control policies spanning Top‑of‑Rack (ToR)/leaf/spine switches and host offload (SmartNIC/DPU), ensuring predictable performance for AI/ML, storage, and latency‑sensitive services. The ideal candidate combines deep knowledge of L2/L3/L4 QoS, RDMA/RoCE, PFC/ETS/ECN, and switch silicon schedulers/queues, with practical experience deploying policies at fleet scale.

THE PERSON:

We are seeking an experienced Principal Networking Engineer to drive the evolution of existing and future software systems and products. The successful candidate will be responsible for ensuring the functionality, reliability, and performance of our software products while keeping an eye on enabling and related future technologies. The ideal candidate will have a strong background in software engineering, excellent technical skills, and strong communication skills.

 

KEY RESPONSIBILITIES:

 

  • Own QoS architecture across network tiers (host → NIC/DPU → ToR/leaf/spine), including classification, policing, shaping, queue mapping, and scheduling strategies for mixed workloads (AI collectives, storage, RPC, control plane). An illustrative class‑to‑queue mapping sketch follows this list.
  • Design and implement SmartNIC QoS: map DSCP/PCP to NIC traffic classes, configure hardware TX/RX queues, rate limiters, WFQ/DRR schedulers, and offload paths for RDMA/TCP/UDP.
  • Switch QoS policy design: configure PFC, ETS, ECN/RED/WRED, buffer pools, queue thresholds, shared vs. dedicated buffers, and congestion control across multiple ASICs (e.g., Broadcom, NVIDIA/Mellanox, Marvell).
  • RDMA/RoCE tuning end‑to‑end: lossless/loss‑tolerant modes, CNP/ECN parameters, RNR/retry behavior, MTU/Jumbo frames, and scalable multi‑tenant profiles.
  • Performance engineering: build test plans and run micro/macro benchmarks (e.g., ib_send_lat/ib_write_bw, RCCL/NCCL, iperf, switch counters/telemetry) to validate latency, throughput, tail performance, and fairness.
  • Instrumentation & observability: define SLI/SLOs for QoS (tail latency, drops, PFC events, ECN marks, queue depth, buffer occupancy); integrate with streaming telemetry (gNMI/INT/sFlow) and develop dashboards and alerts.
  • Troubleshoot complex incidents: incast, PFC deadlocks, microbursts, head‑of‑line blocking, unfair scheduling, and noisy neighbors; lead root‑cause analysis and corrective actions.
  • Scale & automation: deliver declarative QoS via intent‑based configs and CI/CD (e.g., Ansible/Salt, NAPALM, gNMI/gNOI, Netconf/YANG), including pre‑deployment simulation and automated canary/rollback.
  • Documentation & standards: author design docs, runbooks, and guidance for tenant teams; contribute to internal standards and vendor requirements.
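
For illustration only, the Python sketch below shows one way the classification and queue‑mapping responsibilities above could be captured as a declarative intent and rendered into a vendor‑neutral per‑queue config. The class names, DSCP values, queue indices, and DRR weights are assumptions for the example, not an AMD policy or any vendor's schema.

```python
"""Minimal sketch: declarative traffic-class intent -> vendor-neutral queue config.

All class names, DSCP values, queue IDs, and DRR weights below are illustrative
assumptions, not a real fleet policy or device schema.
"""

from dataclasses import dataclass


@dataclass
class TrafficClass:
    name: str          # human-readable class (e.g., AI collectives, storage)
    dscp: list[int]    # DSCP code points mapped to this class
    queue: int         # hardware egress queue index
    drr_weight: int    # relative DRR/WFQ scheduling weight
    ecn: bool          # enable ECN marking on this queue


# Hypothetical intent for a mixed AI/storage/RPC workload.
INTENT = [
    TrafficClass("ai-collectives", dscp=[26], queue=3, drr_weight=50, ecn=True),
    TrafficClass("storage",        dscp=[18], queue=2, drr_weight=30, ecn=True),
    TrafficClass("rpc",            dscp=[10], queue=1, drr_weight=15, ecn=True),
    TrafficClass("control-plane",  dscp=[48], queue=7, drr_weight=5,  ecn=False),
]


def render_queue_config(intent: list[TrafficClass]) -> dict:
    """Flatten the intent into a dict keyed by queue, ready for a NOS/NIC driver layer."""
    queues: dict[int, dict] = {}
    dscp_map: dict[int, int] = {}
    for tc in intent:
        queues[tc.queue] = {
            "class": tc.name,
            "scheduler": "drr",
            "weight": tc.drr_weight,
            "ecn": tc.ecn,
        }
        for code_point in tc.dscp:
            dscp_map[code_point] = tc.queue
    return {"queues": queues, "dscp_to_queue": dscp_map}


if __name__ == "__main__":
    import json
    print(json.dumps(render_queue_config(INTENT), indent=2))
```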

MINIMUM QUALIFICATIONS:

  • Strong experience in datacenter networking or systems engineering, with direct ownership of QoS on switches and/or SmartNICs/DPUs.
  • Deep knowledge of QoS mechanisms: classification/marking (DSCP/PCP), policing, shaping, queueing (PRIO, WRR/WFQ/DRR), scheduling hierarchies, and buffer management.
  • Hands‑on with PFC, ETS, ECN/WRED, explicit buffer tuning, and RDMA/RoCE performance/correctness in production.
  • Experience configuring merchant switch silicon (e.g., Broadcom Trident/Tomahawk, NVIDIA Spectrum, Marvell Teralynx) via NOS CLIs/SDKs (e.g., SONiC, Cumulus, NX‑OS, EOS, Onyx).
  • SmartNIC/DPU experience (e.g., NVIDIA BlueField, Intel IPU, AMD Pensando, Netronome/Agilio): queue configuration, rate limiting, hardware offloads, and host‑NIC QoS mapping.
  • Proficiency with Linux networking (TC, qdisc, mqprio, XDP/eBPF), ethtool, RDMA tools (perftest, rdma-core utilities), and packet/flow analysis (tcpdump, Wireshark, INT/sFlow).
  • Strong automation skills: Python and/or Go for network automation, telemetry pipelines, and CI/CD integration; Git‑based workflows. A minimal telemetry sketch follows this list.
  • Demonstrated ability to debug low‑level performance issues (NIC queues, IRQ affinity, NUMA, PCIe/xGMI topology, driver/firmware interactions).
  • Excellent written/verbal communication; strong design documentation and cross‑team leadership.
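
As a minimal illustration of the telemetry and automation proficiency described above, the Python sketch below derives a few QoS SLIs (drop rate, ECN mark ratio, PFC pause rate, tail latency) from counter snapshots and latency samples. The counter field names and sample values are assumptions, not a specific NOS, NIC driver, or telemetry schema.

```python
"""Minimal sketch: compute basic QoS SLIs from polled queue counters and latency samples.

Counter field names and the example values are illustrative assumptions only.
"""

from statistics import quantiles


def tail_latency_us(samples_us: list[float]) -> float:
    """p99.9 tail latency from raw latency samples (microseconds)."""
    # quantiles(n=1000) returns 999 cut points; the last one is the 99.9th percentile.
    return quantiles(samples_us, n=1000)[-1]


def qos_slis(prev: dict, curr: dict, interval_s: float) -> dict:
    """Rates derived from two successive counter snapshots of one egress queue."""
    tx = curr["tx_packets"] - prev["tx_packets"]
    return {
        "drop_rate": (curr["tx_dropped"] - prev["tx_dropped"]) / max(tx, 1),
        "ecn_mark_ratio": (curr["ecn_marked"] - prev["ecn_marked"]) / max(tx, 1),
        "pfc_pause_per_s": (curr["pfc_rx_pause"] - prev["pfc_rx_pause"]) / interval_s,
    }


if __name__ == "__main__":
    # Hypothetical 10-second polling window on one queue.
    before = {"tx_packets": 1_000_000, "tx_dropped": 10, "ecn_marked": 2_000, "pfc_rx_pause": 0}
    after = {"tx_packets": 1_450_000, "tx_dropped": 25, "ecn_marked": 6_500, "pfc_rx_pause": 3}
    print(qos_slis(before, after, interval_s=10.0))
    print(f"p99.9 latency: {tail_latency_us([5.0, 6.1, 5.4, 7.9, 120.0] * 300):.1f} us")
```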

PREFERRED QUALIFICATIONS:

  • Large‑scale operations experience (10K+ servers or multi‑region fabrics) with QoS at fleet scale and multi‑tenant isolation.
  • Practical experience with AI/ML workloads (RCCL/NCCL AllReduce, parameter servers, distributed training) and storage (NVMe‑oF, NFS, SMB, object) QoS trade‑offs.
  • Experience with traffic engineering and congestion control in Clos fabrics; familiarity with INT, gNMI, in‑band telemetry, and P4 concepts.
  • Contributions to SONiC, DPDK, eBPF/XDP, or OpenConfig; experience with YANG/Netconf, gNOI.
  • Vendor engagement/bring‑up: working with ASIC/NIC vendors on buffer models, scheduling algorithms, and firmware roadmaps.
  • Security awareness for multi‑tenant environments (DSCP abuse, QoS starvation, control‑plane protection, CoPP/ACL integration). A small DSCP allow‑list sketch follows this list.
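
As a small example of the multi‑tenant security awareness noted above, the Python sketch below flags tenant flows whose DSCP markings fall outside a per‑tenant allow‑list. The tenant names, permitted code points, and flow format are illustrative assumptions; a real pipeline would source flows from sFlow/INT or flow logs.

```python
"""Minimal sketch: flag tenant flows whose DSCP markings fall outside an allow-list.

Tenant names, allowed code points, and the observation format are illustrative
assumptions only.
"""

# Hypothetical per-tenant DSCP allow-list (e.g., tenants may not self-mark as control plane).
ALLOWED_DSCP = {
    "tenant-a": {0, 10, 18},
    "tenant-b": {0, 26},
}


def dscp_violations(observed_flows: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Return (tenant, dscp) pairs that are not permitted for that tenant."""
    return [
        (tenant, dscp)
        for tenant, dscp in observed_flows
        if dscp not in ALLOWED_DSCP.get(tenant, {0})  # unknown tenants: best-effort only
    ]


if __name__ == "__main__":
    flows = [("tenant-a", 10), ("tenant-a", 48), ("tenant-b", 26), ("tenant-b", 46)]
    for tenant, dscp in dscp_violations(flows):
        print(f"QoS policy violation: {tenant} marked traffic with DSCP {dscp}")
```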

 

ACADEMIC CREDENTIALS:

Bachelor's degree in Computer Science, Computer Engineering, or a related field; Master's degree preferred.

 

#LI-BW1

 

 

This role is not eligible for visa sponsorship.



Benefits offered are described here: AMD benefits at a glance.

 

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law.   We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.

 

AMD may use Artificial Intelligence to help screen, assess or select applicants for this position.  AMD's “Responsible AI Policy” is available here.

 

This posting is for an existing vacancy.

