Company: AMD

Location: Markham, ON, Canada

Career Level: Director

Industries: Technology, Software, IT, Electronics

Description

WHAT YOU DO AT AMD CHANGES EVERYTHING

At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

THE ROLE

We're looking for a hands-on Director of Test Engineering to lead and transform the quality function for ROCm. This is not a program management role — it's a deeply technical leadership position for someone who understands the hardware/software interface of GPUs, has built test engineering organizations from the ground up, and is ready to lead the next wave of AI-native, agentic quality engineering.

You will own the vision, strategy, and execution of test engineering for ROCm — from kernel-level driver validation to user-space ML framework testing. Critically, you will be the driving force behind scaling your team's impact through AI and agentic tooling, building a modern, autonomous quality organization that moves faster than any traditional QA team could.

The ROCm software organization at AMD builds and maintains the open-source GPU software stack powering AI training, inference, and HPC workloads across AMD's data center and consumer GPU portfolio. ROCm is the foundation on which developers, researchers, and enterprises run their most demanding AI and HPC workloads. Quality and reliability are existential to our success. We operate at the intersection of cutting-edge hardware and software — and we move fast. Our team is deeply invested in open-source, community-driven development, and engineering excellence at every layer of the stack.

THE PERSON

The ideal candidate is a technical leader who has built and scaled test engineering teams in complex, hardware-adjacent software environments. You are hands-on when it matters — able to prototype a test framework, debug a GPU driver failure, or design a validation architecture. You also understand how customers actually use the product: the AI inference and training workloads they run, the parallelism strategies they deploy, the performance they expect, and the failure modes they hit. That customer-workload knowledge is what separates a QA team that writes blackbox sanity checks from one that designs tests targeting the exact code paths real users exercise. You see AI agents not as a novelty but as the primary lever for scaling your team's output. You are impatient with manual, reactive QA and energized by building systems that catch bugs before humans even see them.

KEY RESPONSIBILITIES

Own the overall test engineering strategy and architecture for ROCm, spanning driver validation, runtime testing, compiler/toolchain quality, and ML framework integration — with test coverage designed around real customer workload patterns, not synthetic benchmarks.
Lead, grow, and mentor a team of SDETs and test engineers, instilling SDET-level engineering discipline and a culture of automation-first quality.
Architect and operate continuous testing/validation infrastructure: staging environments for soak testing, stress testing, failure injection, recovery validation, and long-duration reliability runs.
Champion AI-first and agentic test engineering: drive adoption of LLM-assisted test generation, autonomous failure triage, intelligent test prioritization, and agentic CI/CD workflows.
Prototype new test frameworks, validation tooling, and agentic testing pipelines hands-on, especially in early-stage or high-ambiguity situations.
Define, track, and improve quality KPIs: test coverage, defect escape rate, time-to-detection, device utilization, and validation cycle time.
Collaborate closely with hardware, firmware, and software engineering teams to ensure quality is integrated from design through release.
Partner with DevOps and infrastructure teams to evolve the CI/CD pipeline with robust, scalable, GPU-aware test automation.
Engage with the open-source ROCm community and external customers on quality feedback loops and reliability expectations, translating their workload patterns and failure reports into structured test coverage.
Partner with compiler, runtime, and framework integration teams on numerical correctness validation — understanding shared scope boundaries and ensuring the test organization contributes meaningfully to catching precision regressions across floating-point formats and parallelism configurations.
Establish and maintain HW/SW test automation for both Linux and Windows platforms across AMD's GPU product lines.

THE IMPACT YOU WILL HAVE

Define and own the test engineering strategy for ROCm across the full HW/SW stack, from driver interfaces to ML framework validation.
Transform the quality organization into an AI-first, agentic team — scaling coverage, speed, and reliability without proportional headcount growth.
Build and operate continuous testing and validation infrastructure including long-running soak, stress, failure/recovery, and staging environments for product reliability.
Raise the bar on test engineering discipline: shift-left practices, SDET-caliber test development, and deep ownership of quality metrics.
Partner directly with hardware, firmware, and software engineers to ensure quality is embedded at every stage of development.
Drive adoption of AI-assisted testing workflows, intelligent test selection, automated root cause analysis, and agentic CI/CD pipelines across the organization.

PREFERRED QUALIFICATIONS

Experience in software engineering or test engineering, with significant experience in hardware-adjacent or systems-level software.
Engineering management experience, including building and scaling test engineering or SDET organizations.
Deep hands-on expertise in test automation at scale — framework design, CI/CD pipeline development, and continuous validation systems.
Demonstrated experience with hardware + software test automation, including HW bring-up, driver validation, or firmware/software co-testing.
Strong understanding of GPU architecture or hardware/software interfaces (PCIe, memory subsystems, compute kernels, or equivalent).
Experience designing and operating always-on test infrastructure: soak/stress environments, failure injection, and reliability/recovery validation pipelines.
Proven track record of adopting and scaling AI or automation tooling to multiply team throughput.
Python proficiency: able to write test automation, tooling, and scripted validation workflows independently.
Practical understanding of how AI inference and training workloads are deployed on GPU hardware — including common parallelism strategies (tensor parallel, pipeline parallel, data parallel), serving configurations, and performance expectations — sufficient to translate customer use cases into targeted test coverage.
Hands-on software development skills sufficient to prototype test frameworks, write automation tooling, and review SDET-level code.

PREFERRED EXPERIENCE

Direct experience with ROCm, CUDA, or GPU compute software stacks (runtime, compiler, ML frameworks).
Experience integrating LLMs, AI agents, or agentic workflows into software development or test engineering processes.
Expertise in open-source development practices and community-facing quality processes (GitHub Actions, open CI, etc.).
Background in SDET or test engineering in a semiconductor, HPC, or AI infrastructure company.
Experience with GPU-specific test challenges: non-determinism, thermal behavior, multi-device coordination, driver stability.
Track record of shipping test frameworks or validation tools used across large engineering organizations.
Familiarity with ML training/inference workload validation: throughput, latency, numerical stability across precision formats (FP32/BF16/FP8), and multi-GPU collective communication correctness.
Experience with GPU profiling and trace analysis tooling (e.g., rocprof, omniperf, PyTorch profiler) to identify kernel-level performance and correctness anomalies.
Familiarity with HIP, CUDA, or low-level GPU programming — sufficient to understand what is being tested at the runtime and kernel level, even if not writing kernels directly.

PREFERRED ACADEMIC CREDENTIALS:

Master's degree or PhD in a related discipline preferred.

#LI-HYBRID

Note: This role is intentionally scoped as a hands-on technical leadership position. Candidates whose primary background is program management or traditional QA management without deep engineering execution experience may not be the right fit.

Benefits offered are described in AMD benefits at a glance.

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.

AMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD's “Responsible AI Policy” is available here.

This posting is for an existing vacancy.

Director of Software Validation Engineering – ROCm Job Listing at AMD in Markham, ON (Job ID 86264-en-us)
