Abstract: The Transformer has been widely applied across various domains due to its outstanding performance, driving the development of numerous application services. However, in many real-world scenarios ...
This code accompanies the blog post Matrix Multiplication Faster Than Nvidia, Sometimes. It provides a CUDA kernel for single-precision matrix-matrix multiplication, with two notable features: use of ...
Abstract: Analog computing-in-memory accelerators promise ultra-low-power, on-device AI by reducing data transfer and energy usage. Yet inherent device variations and high energy consumption for ...
QiMeng-GEMM is an innovative approach to automatically generating high-performance general matrix multiplication (GEMM) code using LLMs. This codebase provides a comprehensive solution for efficiently computing ...