AI-Created CUDA Kernels Outspeed PyTorch in GPU Benchmarks

In a notable development for machine learning, AI-generated CUDA (Compute Unified Device Architecture) kernels have outperformed PyTorch's built-in GPU operations across a series of GPU-intensive benchmarks. PyTorch, co-developed by Meta, has long been a staple of the machine learning community thanks to its extensive library of prebuilt GPU operations, which makes the result striking. In an experiment conducted by researchers at Stanford University, large language models generated highly efficient GPU kernels for essential operations, such as matrix multiplication and image processing, that run directly on NVIDIA GPUs.

The Emergence of AI-Driven CUDA Kernels

Language Models Take on PyTorch

The central result of the research is that language models can write CUDA kernels that beat PyTorch's built-in routines on speed. Several of the generated kernels outperformed the framework in benchmarks; a kernel for layer normalization, a core operation in neural networks, ran 4.8 times faster than PyTorch's implementation. Results of this kind suggest that AI-generated kernels could deliver meaningful time savings for routine AI workloads.
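
The write-up does not reproduce the generated code, but a minimal hand-written layer normalization kernel helps show what such a kernel computes. The sketch below is a generic illustration rather than the AI-generated kernel: it assumes a row-major rows x cols float input normalized over the last dimension, omits the optional affine scale and shift, and assigns one thread block per row with a shared-memory reduction.

```cuda
// layer_norm.cu -- generic illustrative kernel, not the AI-generated one
// from the study. Normalizes each row of a (rows x cols) float matrix over
// its last dimension, without the optional affine scale/shift.
#include <cuda_runtime.h>

__global__ void layer_norm_kernel(const float* __restrict__ in,
                                  float* __restrict__ out,
                                  int cols, float eps) {
    extern __shared__ float shm[];               // 2 * blockDim.x floats
    const float* row_in  = in  + (size_t)blockIdx.x * cols;
    float*       row_out = out + (size_t)blockIdx.x * cols;

    // Each thread accumulates partial sums for the mean and the variance.
    float sum = 0.f, sumsq = 0.f;
    for (int i = threadIdx.x; i < cols; i += blockDim.x) {
        float v = row_in[i];
        sum   += v;
        sumsq += v * v;
    }
    shm[threadIdx.x]              = sum;
    shm[threadIdx.x + blockDim.x] = sumsq;
    __syncthreads();

    // Tree reduction across the block (blockDim.x must be a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            shm[threadIdx.x]              += shm[threadIdx.x + stride];
            shm[threadIdx.x + blockDim.x] += shm[threadIdx.x + blockDim.x + stride];
        }
        __syncthreads();
    }

    float mean    = shm[0] / cols;
    float var     = shm[blockDim.x] / cols - mean * mean;
    float inv_std = rsqrtf(var + eps);

    // Second pass: normalize the row.
    for (int i = threadIdx.x; i < cols; i += blockDim.x)
        row_out[i] = (row_in[i] - mean) * inv_std;
}

// Launch sketch: one block per row, 256 threads, shared memory for two sums.
// layer_norm_kernel<<<rows, 256, 2 * 256 * sizeof(float)>>>(d_in, d_out, cols, 1e-5f);
```

Most of the headroom in a kernel like this comes from how the two reductions (mean and variance) and the final normalization pass are arranged to minimize trips to global memory, which is exactly the kind of structure the models were asked to discover.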

The work is built around a benchmark called KernelBench, in which language models attempt to replace specific PyTorch operators with custom CUDA kernels that execute faster on the GPU. The team used two large language models, OpenAI o3 and Gemini 2.5 Pro, and ran parallel optimization attempts over multiple iterations. Each generated kernel was checked for both correctness and performance before being accepted, a discipline that kept the search focused on kernels that are actually usable.
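
KernelBench drives these checks from Python against PyTorch itself, but the shape of the validation loop is simple. The stand-alone CUDA C++ sketch below uses a hypothetical candidate kernel (a plain ReLU) and a CPU reference standing in for the PyTorch operator; it applies the two gates the article describes, a correctness comparison within a tolerance and a timing measurement with CUDA events.

```cuda
// check_kernel.cu -- illustrative harness, not the KernelBench code itself.
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>
#include <vector>

// Hypothetical candidate kernel under test: elementwise ReLU.
__global__ void relu_candidate(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] > 0.f ? in[i] : 0.f;
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h_in(n), h_out(n), h_ref(n);
    for (int i = 0; i < n; ++i) h_in[i] = (i % 7) - 3.0f;

    // Trusted CPU reference standing in for the PyTorch operator.
    for (int i = 0; i < n; ++i) h_ref[i] = h_in[i] > 0.f ? h_in[i] : 0.f;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Performance gate: time the candidate with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    relu_candidate<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);

    // Correctness gate: compare against the reference within a tolerance.
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    double max_err = 0.0;
    for (int i = 0; i < n; ++i)
        max_err = fmax(max_err, fabs((double)h_out[i] - (double)h_ref[i]));

    printf("candidate kernel: %.3f ms, max abs error %.2e\n", ms, max_err);
    cudaFree(d_in);
    cudaFree(d_out);
    return max_err < 1e-6 ? 0 : 1;
}
```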

A New Approach to Kernel Optimization

The researchers made two notable changes to the usual generation loop. First, the models articulated optimization ideas in plain natural language before writing any code; second, each idea was expanded into multiple code variants that were generated and benchmarked in parallel, with only the fastest advancing to the next round. This branching search explored many candidate solutions, and the most effective kernels leaned on well-known techniques: optimizing memory access patterns, overlapping arithmetic with memory operations, reducing data precision, keeping the GPU's compute units busy, and simplifying loop structures (one of these, widened and coalesced memory access, is sketched below).
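
As a concrete example of the first technique on that list, the generic sketch below (not code from the study) widens each thread's loads and stores to float4, so a warp moves data in fewer, fully coalesced 16-byte transactions; the same idea extends to the tiled and double-buffered access patterns that fast kernels depend on.

```cuda
// scale_vec4.cu -- generic illustration of widened, coalesced memory access:
// each thread moves 16 bytes per load/store (float4) instead of 4 bytes,
// reducing instruction count and memory transactions.
#include <cuda_runtime.h>

__global__ void scale_float4(const float4* __restrict__ in,
                             float4* __restrict__ out,
                             float alpha, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = in[i];               // one 16-byte, fully coalesced load
        v.x *= alpha; v.y *= alpha;
        v.z *= alpha; v.w *= alpha;
        out[i] = v;                     // one 16-byte, fully coalesced store
    }
}

// Launch sketch (n must be a multiple of 4, or the tail handled separately;
// pointers from cudaMalloc are already 16-byte aligned):
// int n4 = n / 4;
// scale_float4<<<(n4 + 255) / 256, 256>>>(
//     reinterpret_cast<const float4*>(d_in),
//     reinterpret_cast<float4*>(d_out), 2.0f, n4);
```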

Generating, compiling, and benchmarking custom CUDA kernels that are automatically validated for accuracy and performance against PyTorch's standard implementations yields two things at once: faster runtime code and synthetic data that can be used to train future models. The same loop that accelerates today's operations therefore also produces material for refining the language models themselves, which is a large part of the method's long-term appeal.

Concrete Advancements in AI Processing

Demonstrating Enhanced Image Convolution

One illustrative example is the AI-generated kernel for image convolution (Conv2D), which improved from about 20 percent of PyTorch's speed to nearly 180 percent over 13 iterations. Image convolution, a fundamental operation in image processing, slides small filter matrices across an input image. The winning kernel got its speed by rewriting the convolution as a matrix multiplication, running that multiplication on the GPU's specialized tensor cores, double buffering so that computation overlaps with memory transfers, and precomputing memory indices so data can be fetched with less address arithmetic.
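
The first of those steps, rewriting convolution as matrix multiplication, is typically done with an im2col transform: each filter-sized patch of the input is unrolled into a column, after which the whole convolution collapses into one GEMM that a tensor-core library can execute. The sketch below is a generic im2col kernel for a single image with stride 1 and no padding, not the kernel the models produced.

```cuda
// im2col.cu -- generic sketch of the lowering step: unroll input patches
// into columns so that Conv2D becomes a single matrix multiplication.
// Single image, stride 1, no padding.
#include <cuda_runtime.h>

__global__ void im2col_kernel(const float* __restrict__ input, // (C, H, W)
                              float* __restrict__ cols,        // (C*K*K, outH*outW)
                              int C, int H, int W, int K) {
    int outH  = H - K + 1;
    int outW  = W - K + 1;
    int total = C * K * K * outH * outW;
    int idx   = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= total) return;

    // Decompose the flat index into (channel, kernel row/col, output row/col).
    int ow = idx % outW;
    int oh = (idx / outW) % outH;
    int kw = (idx / (outW * outH)) % K;
    int kh = (idx / (outW * outH * K)) % K;
    int c  =  idx / (outW * outH * K * K);

    // Copy one input pixel into its unrolled position in the column matrix.
    int row = (c * K + kh) * K + kw;        // row in the (C*K*K) dimension
    int col = oh * outW + ow;               // column in the (outH*outW) dimension
    cols[row * (outH * outW) + col] = input[(c * H + oh + kh) * W + (ow + kw)];
}

// After im2col, the convolution is a GEMM:
//   output (numFilters x outH*outW) = filters (numFilters x C*K*K) * cols.
```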

The resulting kernel used the kind of advanced CUDA techniques that experienced developers reach for when hand-optimizing GPU code. That level of performance, achieved with minimal human intervention, suggests that much of this optimization work could eventually be automated, speeding up traditional processing tasks across a wide range of applications.

Challenges and Opportunities Ahead

Despite these results, challenges remain. The AI-generated kernels struggled on newer workloads that use lower-precision data types such as FP16: an FP16 matrix multiplication kernel reached only 52 percent of PyTorch's speed, and a kernel for Flash Attention, an attention algorithm widely used in large language models, managed just 9 percent. These gaps show how much harder it is to match mature, hand-tuned libraries on the newest low-precision workloads, and where further refinement is needed.

Nevertheless, the researchers remain optimistic, emphasizing that kernels like these can now be generated automatically and reliably. That optimism is reinforced by other recent work showing that parallel search strategies, paired with strong language models, can produce impressive system components; similar results have been reported for DeepMind's AlphaEvolve and the Deep Think mode of Gemini 2.5 Pro. The convergence of these findings suggests AI will keep pushing this corner of computing forward.

A Promising Future for AI and Machine Learning

The Stanford experiment shows that large language models can already write CUDA kernels that outperform PyTorch's prebuilt GPU operations on a range of benchmarks, from matrix multiplication to image convolution, running directly on NVIDIA GPUs. If the remaining gaps on low-precision workloads can be closed, AI-generated kernels could become a practical way to streamline machine learning workloads, and the same generate-and-verify loop could keep improving the models that write them.
