Company: AMD
Location: Shanghai, China
Career Level: Mid-Senior Level
Industries: Technology, Software, IT, Electronics

Description



WHAT YOU DO AT AMD CHANGES EVERYTHING 

At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond.  Together, we advance your career.  



Position Overview
We are seeking a highly experienced engineer specializing in large language model (LLM) inference performance optimization. You will be a core member of our team, responsible for building and optimizing high-throughput, low-latency LLM inference on AMD Instinct GPUs. If you are passionate about pushing performance boundaries and have deep, hands-on expertise with cutting-edge technologies such as vLLM or SGLang, we invite you to join us.
Key Responsibilities
1. Core System Optimization: Lead the development, tuning, and customization of LLM inference optimizations on AMD GPUs, leveraging and extending frameworks such as vLLM or SGLang to address performance bottlenecks in production environments (a minimal vLLM usage sketch follows this list).
2. Performance Analysis & Tuning: Conduct end-to-end performance profiling using specialized tools. Perform deep optimization of compute-bound operators (e.g., Attention), memory I/O, and communication to significantly increase throughput and reduce latency.
3. Model Architecture Adaptation: Demonstrate expertise in mainstream LLM architectures (e.g., DeepSeek, Qwen, Llama, ChatGLM) and optimize inference for their specific characteristics (e.g., RoPE, SWA, MoE, GQA).
4. Algorithm & Principle Application: Leverage your deep understanding of core algorithms (Transformer, Attention, MoE) to implement advanced optimization techniques such as PagedAttention, FlashAttention, continuous batching, quantization, and model compression.
5. Technology Foresight & Implementation: Research and prototype state-of-the-art optimization techniques (e.g., Speculative Decoding, Weight-Only Quantization) and drive their adoption into production systems.
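
For illustration only: a minimal offline-generation sketch using vLLM's public Python API, the kind of entry point this work builds on. The model name and sampling values below are placeholders, not a recommendation; vLLM provides ROCm builds for AMD Instinct GPUs.

# Minimal vLLM offline-inference sketch (placeholder model and settings).
from vllm import LLM, SamplingParams

prompts = [
    "Explain PagedAttention in one sentence.",
    "Why does continuous batching raise GPU utilization?",
]
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model name
outputs = llm.generate(prompts, sampling)  # the engine batches requests continuously

for out in outputs:
    print(out.outputs[0].text)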


Qualifications:
Mandatory Requirements:
1. Expertise in Inference Frameworks: Proven, hands-on experience with vLLM or SGLang, including deep understanding of their source code, deployment, configuration, and performance tuning. (Please describe relevant projects in your resume).
2. Mastery of Model Architectures: In-depth understanding and practical experience with inference workflows of mainstream LLMs (e.g., DeepSeek, Qwen), including their tokenizers, model configurations, and architecture definitions.
3. Strong Theoretical Foundation: Solid grasp of the principles behind Transformer, Self-Attention, MoE, and KV Cache, and their impact on inference performance (a back-of-envelope KV-cache sizing sketch follows this list).
4. Proven Optimization Experience: Familiarity with end-to-end LLM inference optimization techniques such as PagedAttention, FlashAttention, continuous/dynamic batching, and quantization (INT8/INT4/GPTQ/AWQ), demonstrated with successful case studies.
5. Programming Skills: Proficiency in Python and strong software engineering best practices.
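
To ground the KV Cache point above: its memory footprint is what techniques like PagedAttention and GQA attack, and it can be estimated from the model configuration alone. A minimal sketch, using illustrative 8B-class numbers (32 layers, 8 KV heads, head dim 128, fp16) rather than any specific deployment:

# Per token, each layer stores one K and one V tensor of
# num_kv_heads * head_dim elements each:
#   bytes/token = 2 * layers * num_kv_heads * head_dim * dtype_bytes
def kv_cache_bytes_per_token(layers, num_kv_heads, head_dim, dtype_bytes=2):
    return 2 * layers * num_kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token(32, 8, 128)   # 131072 B = 128 KiB/token
batch, seq_len = 64, 4096                          # illustrative serving load
total_gib = per_token * batch * seq_len / 2**30    # -> 32.0 GiB
print(f"{per_token} B/token, {total_gib:.1f} GiB at batch={batch}, seq={seq_len}")
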
Preferred Qualifications (Plus):
1. Low-Level Development Skills: Experience with CUDA C++ programming for writing and debugging high-performance GPU kernels, or practical experience using Triton to develop and optimize deep learning operators (see the Triton sketch after this list).
2. Compiler Knowledge: Understanding or practical experience with compiler technologies like TVM or MLIR is a significant advantage.
3. Distributed Systems Experience: Hands-on experience with distributed inference for large-scale models (e.g., Tensor Parallel, Pipeline Parallel).
4. Education: Master's or Ph.D. in Computer Science, Artificial Intelligence, Electrical Engineering, or a related field.
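
For a sense of the Triton operator work mentioned above, here is the canonical elementwise-add kernel as a minimal sketch; shapes and block size are arbitrary illustration values. Triton targets AMD GPUs as well as NVIDIA ones, and ROCm builds of PyTorch expose devices under the "cuda" label.

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                 # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)              # one program per 1024-element tile
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.rand(10_000, device="cuda")
y = torch.rand(10_000, device="cuda")
assert torch.allclose(add(x, y), x + y)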

 




Benefits offered are described in AMD benefits at a glance.

 

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law.   We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.

