TurboQuant
This article may be too technical for most readers to understand. (May 2026) |
TurboQuant is an online vector quantization algorithm for compressing high-dimensional Euclidean vectors while preserving their geometric structure. It was proposed in 2025 by Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni in the paper TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate.[1] The paper lists Zandieh and Mirrokni as affiliated with Google Research, Daliri with New York University, and Hadian with Google DeepMind. The method was developed for applications including large language model (LLM) inference, key–value (KV) cache compression, vector databases, and nearest neighbor search.[2]
TurboQuant consists of two related algorithms: TurboQuantmse, which is optimized for mean squared error (MSE), and TurboQuantprod, which is optimized for unbiased inner product estimation. The algorithm uses a random rotation of input vectors, applies scalar quantizers to the rotated coordinates, and, for inner-product estimation, applies a one-bit Quantized Johnson–Lindenstrauss (QJL) transform to the residual error.[1]
Background
[edit]Vector quantization is a compression method that maps high-dimensional vectors to a finite set of codewords. The problem has roots in Shannon's source coding theory and rate–distortion theory.[1] In machine learning and information retrieval, vector quantization is used to reduce the memory required to store embeddings, activation vectors, and other numerical representations.
In Transformer-based large language models, the KV cache stores key and value vectors from previous tokens during autoregressive decoding. The size of this cache grows with context length, the number of attention heads, and the number of concurrent requests, making it a major memory bottleneck in LLM serving.[1] Similar compression problems appear in vector search, where large collections of embedding vectors must be stored and searched efficiently.
Earlier approaches to vector quantization include product quantization, scalar quantization, and data-dependent k-means codebook construction. The TurboQuant paper argues that many existing methods either require offline preprocessing and calibration or suffer from suboptimal distortion guarantees in online settings.[1]
Algorithm
[edit]TurboQuantmse
[edit]TurboQuantmse is the version of the algorithm optimized for mean-squared error. For a unit vector , the algorithm first applies a random rotation matrix and sets . Each coordinate of the rotated vector follows a shifted and scaled beta distribution, which converges to a normal distribution in high dimensions. In high dimensions, distinct coordinates also become nearly independent, allowing the algorithm to apply scalar quantizers independently to each coordinate.[1]
The scalar quantizer is constructed by solving a one-dimensional continuous k-means or Lloyd–Max quantization problem. If the centroids are , the quantization step stores, for each coordinate,
During dequantization, the stored index for each coordinate is replaced by the corresponding centroid, giving a reconstructed rotated vector . The algorithm then rotates back:[1]
The paper gives the following bound for TurboQuantmse:
It also reports finer-grained MSE values of approximately 0.36, 0.117, 0.03, and 0.009 for bit-widths , respectively.[1]
TurboQuantprod
[edit]TurboQuantprod is optimized for unbiased inner-product estimation. The authors note that an MSE-optimized quantizer may introduce bias when used to estimate inner products. To address this, TurboQuantprod first applies TurboQuantmse with bit-width , then applies a one-bit Quantized Johnson–Lindenstrauss transform to the remaining residual vector.[1]
Let be the residual after MSE quantization, and let . The QJL step stores a sign vector for the residual. For , this can be written using the normalized residual : where is a random projection matrix. Since the sign function is invariant under positive rescaling, this is equivalent to when . If , the residual correction is zero.
TurboQuantprod stores the MSE quantization, the QJL sign vector, and the residual norm:
The dequantized vector is reconstructed as
The paper proves that TurboQuantprod is unbiased for inner-product estimation: It also gives the distortion bound
Performance and applications
[edit]The TurboQuant paper reports that the algorithm achieves near-optimal distortion rates within a small constant factor of information-theoretic lower bounds. The authors report that, for KV cache quantization, TurboQuant achieved quality neutrality at 3.5 bits per channel and marginal degradation at 2.5 bits per channel.[1]
In long-context LLM experiments using Llama 3.1 8B Instruct, the paper evaluated the method on a "needle-in-a-haystack" retrieval task with document lengths from 4,000 to 104,000 tokens. It reported that TurboQuant matched the uncompressed full-precision baseline while using more than 4× compression, and compared the method against PolarQuant, SnapKV, PyramidKV, and KIVI.[1]
Google Research stated that TurboQuant was evaluated on long-context benchmarks including LongBench, Needle in a Haystack, ZeroSCROLLS, RULER, and L-Eval using open-source models including Gemma and Mistral.[2] According to a report in Tom's Hardware, Google described the method as reducing KV-cache memory by at least six times and achieving up to an eightfold improvement in attention-logit computation on Nvidia H100 GPUs compared with unquantized 32-bit keys.[3]
TurboQuant has also been applied to nearest-neighbor vector search. The original paper reports experiments on DBpedia entity embeddings and GloVe embeddings, comparing TurboQuant with product quantization and other vector-search quantization baselines.[1]
Relationship to other methods
[edit]TurboQuant is related to several methods for efficient large language model inference and high-dimensional search:
- Product quantization – a vector quantization technique widely used for approximate nearest-neighbor search
- Quantization (machine learning) – reducing the numerical precision of weights, activations, or cached tensors in machine learning models
- PagedAttention – a memory-management algorithm for LLM serving that reduces fragmentation in the KV cache
- Johnson–Lindenstrauss lemma – a result in high-dimensional geometry used in random projection methods
- Lloyd's algorithm – an algorithm for scalar and vector quantization, including k-means-style codebook construction
Unlike PagedAttention, which focuses on memory allocation and cache layout, TurboQuant reduces the numerical storage cost of the vectors themselves. Unlike many product-quantization methods, TurboQuant is designed to be data-oblivious and online, avoiding dataset-specific codebook training.[1]
Limitations
[edit]The strongest performance claims for TurboQuant come from the original paper and Google Research's own publication. Coverage in technology media has noted that the broader impact of the method will depend on real-world implementation details, workloads, and hardware architectures.[4]
See also
[edit]- Approximate nearest neighbor search
- Attention mechanism
- List of artificial intelligence algorithms
- Outline of algorithms
- vLLM – open-source LLM inference engine that uses PagedAttention algorithm to improve key–value cache memory efficiency
- 2024–present global memory supply shortage
- AlphaDev, AlphaEvolve, AlphaTensor — AI systems by Google DeepMind for discovering and optimizing algorithms
References
[edit]- ^ a b c d e f g h i j k l m Zandieh, Amir; Daliri, Majid; Hadian, Majid; Mirrokni, Vahab (April 28, 2025). "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate". arXiv:2504.19874 [cs.LG].
- ^ a b Zandieh, Amir; Mirrokni, Vahab (March 24, 2026). "TurboQuant: Redefining AI efficiency with extreme compression". Google Research Blog. Retrieved April 23, 2026.
- ^ James, Luke (March 25, 2026). "Google's TurboQuant reduces AI LLM cache memory capacity requirements by at least six times — up to 8x performance boost on Nvidia H100 GPUs, compresses KV caches to 3 bits with no accuracy loss". Tom's Hardware. Retrieved April 23, 2026.
- ^ Udinmwen, Efosa (March 29, 2026). "'A high-speed digital cheat sheet': Google unveils TurboQuant AI-compression algorithm, which it claims can hugely reduce LLM memory usage". TechRadar. Retrieved April 23, 2026.