Can We Use LLMs Themselves to Speed Up LLM Inference?

May 1, 2023

Zangwei Zheng, zangwei@u.nus.edu
National University of Singapore

Other versions: [arXiv] [GitHub] [Chinese]
Discuss on X with the author.

TL;DR

We discover that large language models (LLMs) have a remarkable ability to anticipate the length of their own responses. Leveraging this capability, we propose a technique called Sequence Scheduling to improve the efficiency of LLM batch inference: by grouping queries with similar anticipated response lengths, we remove much of the redundant computation and achieve an 86% improvement in inference throughput without compromising performance.

Figure: Sequence Scheduling pipeline

LLMs are aware of their response length

We begin our investigation by examining whether LLMs possess the ability to anticipate the length of their generated responses. To explore this capability, we designed a prompting technique called "Perception in Advance" (PiA), which asks the model to predict the length of its response before generating it.

Figures: PiA examples with ChatGPT (left: short answer; right: long answer)

We find that popular LLMs such as GPT-4, ChatGPT, and Vicuna can follow the instruction and provide response length estimations. The above two figures demonstrate PiA examples with ChatGPT. For the short response, ChatGPT predicted a length of 10 words with an actual length of 6 words. For the long response, ChatGPT predicted a length of 112 words with an actual length of 119 words. Although ChatGPT may not be explicitly trained for response length prediction, it accurately estimates the length of generated responses.
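
To make the setup concrete, here is a minimal sketch of a PiA-style prompt. The instruction wording is an illustrative paraphrase, not the exact prompt used in the experiments, and query_model is a hypothetical helper wrapping whatever chat API is being called.

```python
# Minimal sketch of a "Perception in Advance" (PiA) prompt.
# The wording is an illustrative paraphrase, not the exact experimental prompt,
# and query_model is a hypothetical callable wrapping a chat API.

PIA_TEMPLATE = (
    "Before answering, first estimate how many words your answer will contain. "
    "Write the estimate on its own line as 'Estimated words: <number>', "
    "then answer the instruction.\n\n"
    "Instruction: {instruction}"
)

def perceive_then_answer(instruction: str, query_model) -> tuple[int, str]:
    """Return (predicted word count, answer text) from a single PiA-style call."""
    reply = query_model(PIA_TEMPLATE.format(instruction=instruction))
    first_line, _, answer = reply.partition("\n")
    digits = "".join(ch for ch in first_line if ch.isdigit())
    return int(digits or 0), answer.strip()
```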

Instruction tuning improves response length perception

For open-source instruction-tuned models such as Vicuna-7B, accurately anticipating the response length remains challenging. Counting an estimate as accurate when it falls within 100 words of the true length, Vicuna-7B reaches only 65% accuracy on the Alpaca dataset. Furthermore, LLMs have a weaker grasp of tokens than of words, which limits how much inference efficiency can be gained.

To improve the model's ability to anticipate response length, we built a dataset of instructions paired with the token lengths of their responses and fine-tuned the model on it using parameter-efficient instruction tuning with LoRA. After tuning, the model reaches 81% accuracy on the Alpaca dataset.
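
As a rough illustration of that setup, the sketch below builds (instruction, token-length) pairs and attaches a LoRA adapter with the Hugging Face peft library. The checkpoint name, prompt wording, and LoRA hyperparameters are assumptions for illustration, not the exact training recipe.

```python
# Sketch: build (instruction, token-length) pairs and attach a LoRA adapter.
# The checkpoint name, prompt wording, and hyperparameters are illustrative
# assumptions, not the exact recipe used in the experiments.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

checkpoint = "lmsys/vicuna-7b-v1.3"  # assumed Vicuna-7B checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

def make_length_example(instruction: str, response: str) -> dict:
    """Turn an (instruction, response) pair into a length-prediction example."""
    n_tokens = len(tokenizer(response)["input_ids"])
    prompt = f"{instruction}\nPredict the length of your answer in tokens: "
    return {"prompt": prompt, "target": str(n_tokens)}

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt only the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the small adapter is trainable
# ...train with the standard causal-LM loss on the concatenated prompt + target...
```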

LLM batch inference

Figure: batch inference statistics (left: throughput vs. batch size; right: inference time vs. token position)

Let's now examine the LLM inference process. Batch inference is a commonly used technique to enhance inference efficiency. In the left figure, we observe that as the batch size increases, the inference throughput increases almost linearly (as indicated by the blue line). However, when performing LLM inference in batches, incorporating sequences with varying response lengths introduces inefficiencies. Shorter sequences must wait for longer ones to complete, resulting in reduced efficiency. We found that approximately 66% of the computations performed are redundant. As the batch size continues to grow, the throughput performance begins to decline (as shown by the red line). This decline occurs because larger batch sizes are more likely to include longer response lengths, leading to a significant increase in redundant computations.
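
The redundancy figure can be understood with a back-of-the-envelope calculation: in a vanilla batched decode, every sequence runs for as many steps as the longest one in the batch, so the wasted fraction is one minus the mean response length divided by the maximum. The lengths below are made up for illustration.

```python
# Back-of-the-envelope estimate of redundant decoding in a padded batch.
# The example lengths are made up, not measurements from the experiments.
def redundant_fraction(response_lengths: list[int]) -> float:
    """Fraction of decode steps spent on sequences that have already finished."""
    longest = max(response_lengths)
    useful = sum(response_lengths)
    total = longest * len(response_lengths)  # every sequence runs `longest` steps
    return 1.0 - useful / total

print(redundant_fraction([20, 35, 60, 450]))  # ~0.69: one long response dominates
```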

Furthermore, the right figure demonstrates that inference time increases with the token position index. This increase occurs because self-attention operations must be performed on an increasing number of keys and values.
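
This is a direct consequence of decoding with a KV cache: at position t, the new token's query attends over all t cached keys and values, so the per-step attention work grows roughly linearly with t. A rough sketch, with constants and non-attention costs omitted and placeholder values for the layer count and hidden size:

```python
# Rough per-step attention cost during decoding: the new token's query attends
# over all t cached keys/values, so the work grows linearly with position t.
# Constants and non-attention costs (MLP, projections) are omitted.
def attention_flops_at_step(t: int, num_layers: int = 32, d_model: int = 4096) -> int:
    # ~2*t*d_model for Q.K^T plus ~2*t*d_model for the weighted value sum, per layer
    return 4 * num_layers * t * d_model

print(attention_flops_at_step(100))   # early in the sequence
print(attention_flops_at_step(1000))  # 10x the position -> 10x the attention work
```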

Figure: distribution of response lengths

In real-world scenarios, query response lengths vary significantly, as shown in the length distribution of queries in the Alpaca dataset for ChatGPT and Vicuna models (left side of the figure). This highlights the need to address the challenge of varying response lengths in LLM inference. Additionally, on the right side, we observe that different samplings of the same data point can result in different lengths, adding to the complexity of anticipating and handling response lengths.

Sequence Scheduling via Response Length Perception

Figure: Sequence Scheduling pipeline

We propose Sequence Scheduling to improve the efficiency of LLM batch inference. By grouping queries with similar anticipated response lengths, we significantly reduce redundant computation and achieve a 42% improvement in throughput.
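
In its simplest form, the scheduler just sorts the pending queries by predicted length and slices them into batches, so each batch is padded only up to its own longest member. A minimal sketch, assuming a predictor callable and a fixed batch size:

```python
# Minimal sketch of length-aware batching: sort queries by predicted length so
# each batch only pads up to its own longest member. The predictor interface
# and the batch size are assumptions for illustration.
from typing import Callable

def schedule_batches(queries: list[str],
                     predict_length: Callable[[str], int],
                     batch_size: int = 32) -> list[list[str]]:
    ordered = sorted(queries, key=predict_length)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```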

Figure: Sequence Scheduling experimental results

To further optimize throughput, we introduce several additional techniques; applied together, they yield an 86% improvement in inference throughput without compromising performance. A simplified scheduler sketch combining these ideas follows the list.

  • Failure Collection and Recomputation (FCR): We cap the number of newly generated tokens in a batch at the largest predicted length within that batch. Sequences that exceed the cap are deemed failures, set aside, and recomputed together at the end of inference once enough of them accumulate to form a batch. Because the failure ratio is low, this lets shorter responses finish quickly while spending little extra time regenerating failures.
  • Variable Batch Size (VBS): We assign larger batch sizes for shorter responses. This approach allows us to process more queries simultaneously, thereby optimizing overall throughput.
  • Max Length Prediction: Underestimating the response length has more severe consequences than overestimating it, because underestimation leads to truncation. The response length perception module therefore predicts the maximum length over multiple sampled responses, which avoids truncation at the cost of some overestimation.
  • Binning: We group queries with similar predicted lengths into bins and schedule by bin rather than by exact length. This reduces the number of distinct length groups the scheduler must handle and makes scheduling more efficient.
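
Below is a simplified sketch of how these pieces could fit together: queries are binned by predicted maximum length, shorter bins get larger batch sizes (VBS), generation in each batch is capped at the bin's length limit, and sequences that hit the cap are set aside and recomputed at the end (FCR). The bin edges, batch sizes, and the generate interface are illustrative assumptions, not the exact configuration from the experiments.

```python
# Simplified sketch combining binning, Variable Batch Size (VBS), and Failure
# Collection and Recomputation (FCR). Bin edges, batch sizes, and the
# generate() interface are illustrative assumptions.
from typing import Callable

BINS = [(64, 64), (192, 32), (2048, 16)]  # (length cap, batch size): shorter bins get bigger batches

def assign_bin(predicted_len: int) -> int:
    for idx, (cap, _) in enumerate(BINS):
        if predicted_len <= cap:
            return idx
    return len(BINS) - 1  # clamp very long predictions into the last bin

def run_with_scheduling(queries: list[str],
                        predict_max_length: Callable[[str], int],
                        generate: Callable[[list[str], int], list[tuple[str, bool]]]) -> dict:
    """generate(batch, max_new_tokens) returns one (text, finished) pair per query."""
    outputs: dict[str, str] = {}
    failures: list[str] = []
    buckets: list[list[str]] = [[] for _ in BINS]
    for q in queries:
        buckets[assign_bin(predict_max_length(q))].append(q)
    for (cap, batch_size), bucket in zip(BINS, buckets):
        for i in range(0, len(bucket), batch_size):
            batch = bucket[i:i + batch_size]
            for query, (text, finished) in zip(batch, generate(batch, cap)):
                if finished:
                    outputs[query] = text
                else:
                    failures.append(query)  # hit the cap: set aside for recomputation
    # FCR: recompute the (hopefully few) failures at the end with a generous cap.
    recompute_bs = 16
    for i in range(0, len(failures), recompute_bs):
        batch = failures[i:i + recompute_bs]
        for query, (text, _) in zip(batch, generate(batch, 2048)):
            outputs[query] = text
    return outputs
```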

Discussion

In this study, we leverage the capabilities of LLMs to enhance their own inference process, leading to the development of what we refer to as an "LLM-Empowered LLM Inference Pipeline." This approach can be viewed as a software-hardware co-design within the realm of AI, and we believe it holds great promise for future research endeavors.

Our research findings demonstrate that LLMs possess a profound understanding of the responses they generate. This insight presents exciting opportunities for developing faster inference techniques, such as non-autoregressive methods, that can overcome the limitations associated with sequential token generation and significantly improve performance.

As LLMs have the potential to become pervasive infrastructure akin to search engines, the volume of queries they handle is expected to rise significantly. Moreover, the advent of models like GPT-4, which supports sequence lengths of up to 32k, and Claude, with 100K sequence length support, further exacerbates the challenge of accommodating varying response lengths. In this context, our approach stands out for its relevance and effectiveness in addressing this challenge.