PowerInfer-2 Achieves 29x Speedup, Running 47-Billion Parameter LLMs on Smartphones

PowerInfer-2 runs massive LLMs (up to 47B parameters) on smartphones at record speeds by optimizing for heterogeneous hardware and minimizing I/O overhead.


This content originally appeared on HackerNoon and was authored by Writings, Papers and Blogs on Text Models

:::info Authors:

(1) Zhenliang Xue, Co-first author from Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(2) Yixin Song, Co-first author from Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(3) Zeyu Mi, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University (yzmizeyu@sjtu.edu.cn);

(4) Le Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(5) Yubin Xia, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University;

(6) Haibo Chen, Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University.

:::

Abstract and 1. Introduction

  2. Background and Motivation
  3. PowerInfer-2 Overview
  4. Neuron-Aware Runtime Inference
  5. Execution Plan Generation
  6. Implementation
  7. Evaluation
  8. Related Work
  9. Conclusion and References

Abstract

This paper introduces PowerInfer-2, a framework designed for high-speed inference of Large Language Models (LLMs) on smartphones, particularly effective for models whose sizes exceed the device’s memory capacity. The key insight of PowerInfer-2 is to utilize the heterogeneous computation, memory, and I/O resources in smartphones by decomposing traditional matrix computations into fine-grained neuron cluster computations. Specifically, PowerInfer-2 features a polymorphic neuron engine that adapts computational strategies to the various stages of LLM inference. Additionally, it introduces segmented neuron caching and fine-grained neuron-cluster-level pipelining, which effectively minimize and conceal the overhead caused by I/O operations. The implementation and evaluation of PowerInfer-2 demonstrate its capability to support a wide array of LLM models on two smartphones, achieving up to a 29.2× speed increase compared with state-of-the-art frameworks. Notably, PowerInfer-2 is the first system to serve the TurboSparse-Mixtral-47B model on a smartphone, with a generation rate of 11.68 tokens per second. For models that fit entirely within memory, PowerInfer-2 achieves approximately a 40% reduction in memory usage while maintaining inference speeds comparable to llama.cpp and MLC-LLM. For more details, including a demonstration video, please visit the project site at www.powerinfer.ai/v2.


1 Introduction

Large Language Models (LLMs), with their exceptional ability to comprehend and produce human-like text, have fundamentally enhanced our daily lives and transformed our work environments. The most advanced LLMs today, such as GPT-4 [26] and Claude-3 [6], are hosted in data centers equipped with state-of-the-art GPUs (e.g., NVIDIA H100 [24]). These GPUs provide extensive high-bandwidth memory and deliver computational capabilities reaching thousands of teraflops. Concurrently, there is an emerging trend towards deploying LLMs on ubiquitous smartphones [33, 38], transforming them into intelligent personal assistants. This shift aims to fully leverage rich personal data while preserving privacy, since private data never needs to be transmitted to cloud services. However, despite their widespread use, smartphones struggle to meet the demands of LLM inference because of their constrained processing power and limited memory capacity.

To address these issues, researchers have explored two promising approaches to serving LLM inference under resource-constrained conditions. Given the limited memory capacity of smartphones, one strategy deploys scaled-down LLMs. For example, Google’s Gemini Nano 3.25B [32] uses less than 2GB of memory, but this comes at the cost of reduced capability: larger models are generally more intelligent, a phenomenon known as the “scaling law” [17].

Alternatively, some techniques aim to lower the computational and storage demands of LLM weights during inference. PowerInfer [30] achieves an 11-fold increase in inference speed on personal computers (PCs) by allocating hot-activated neurons to the GPU and cold neurons to the CPU. Another method, LLM in a Flash [4], mitigates memory limits by keeping large model weights on flash-based NVMe storage. However, these solutions falter on smartphones, whose hardware is less powerful and more heterogeneous, and whose storage devices offer lower bandwidth and no support for concurrent accesses due to a single command queue. As a result, I/O activity becomes a frequent bottleneck for LLM inference on mobile devices.

This paper introduces PowerInfer-2, the first framework that performs high-speed inference of LLMs on smartphones, accommodating models with up to 47 billion parameters that exceed the device’s memory capacity. PowerInfer-2 is the follow-up to the PowerInfer project, designed specifically for smartphones. Like its predecessor, PowerInfer-2 harnesses the dynamic sparse activation inherent in LLM inference: each inference iteration requires only a subset of neurons, rather than the entirety of the model weights. This substantially lowers computational demands, as PowerInfer-2 needs to process only a select group of neurons per iteration. The inherent sparsity also enhances locality, enabling PowerInfer-2 to build an efficient in-memory cache that keeps the most frequently used neurons in memory, thus mitigating the I/O overhead associated with reading weights.
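To make the sparse-activation idea concrete, the sketch below shows how a predictor-gated feed-forward pass might touch only the neurons predicted to be active for a given input. The function and variable names (e.g., `sparse_ffn_forward`, the `predictor` callable) are illustrative assumptions, not PowerInfer-2’s actual interfaces.

```python
import numpy as np

def sparse_ffn_forward(x, W_up, W_down, predictor, threshold=0.5):
    """Predictor-gated sparse FFN forward pass (illustrative sketch).

    Only the neurons (rows of W_up / columns of W_down) that the predictor
    marks as likely active are loaded and computed; the rest are skipped.
    """
    scores = predictor(x)                      # per-neuron activation scores
    active = np.where(scores > threshold)[0]   # indices of predicted-active neurons

    h = np.maximum(W_up[active] @ x, 0.0)      # ReLU output of the active neurons only
    y = W_down[:, active] @ h                  # combine only the active columns
    return y, active

# Toy usage: 512 hidden dims, 2048 FFN neurons, a random stand-in "predictor".
d, n = 512, 2048
rng = np.random.default_rng(0)
x = rng.standard_normal(d).astype(np.float32)
W_up = rng.standard_normal((n, d)).astype(np.float32)
W_down = rng.standard_normal((d, n)).astype(np.float32)
y, active = sparse_ffn_forward(x, W_up, W_down, predictor=lambda _: rng.random(n))
print(f"computed {len(active)} of {n} neurons this iteration")
```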

Unlike PowerInfer, PowerInfer-2 faces the key challenge of leveraging the highly heterogeneous XPUs present in contemporary smartphones, such as asymmetric big.LITTLE CPU cores, the GPU, and the NPU. Inference procedures that do not fully exploit these hardware features lead to suboptimal generation speed. Another challenge is the inevitable I/O overhead caused by cache misses: although PowerInfer-2 uses sparse activation to reduce the amount of weights required during inference, it still incurs a substantial number of I/O reads to retrieve weights from storage, which can adversely affect inference performance.

To address these challenges, the core insight of PowerInfer-2 is to break down the coarse-grained matrix computations typical of LLM inference into fine-grained neuron cluster computations. A neuron cluster consists of multiple neurons, whose number is determined by the characteristics of the XPUs, memory, and I/O so as to fully harness the capabilities of the specific hardware components. Specifically, to leverage the heterogeneous XPUs within smartphones, PowerInfer-2 designs a polymorphic neuron engine that provides distinct computation patterns for the prefill and decoding stages of LLM inference. During the prefill stage, which processes all tokens in the user input sequence concurrently, PowerInfer-2 merges all neurons into one big neuron cluster to maximize the NPU’s advantage in handling large matrix computations. Conversely, the decoding stage has a batch size of one and exhibits significant sparsity, so PowerInfer-2 uses small neuron clusters to exploit the flexibility of CPU cores for this comparatively lighter computational task.
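The following sketch illustrates the stage-dependent clustering strategy described above: one large cluster routed to the NPU for prefill, and many small clusters routed to CPU cores for decoding. The class names, the `cpu_cluster_size` default, and the string backends are assumptions for illustration only.

```python
from dataclasses import dataclass
from enum import Enum

class Stage(Enum):
    PREFILL = "prefill"   # all prompt tokens at once -> dense, large matrix multiplications
    DECODE = "decode"     # one token at a time -> sparse, lightweight computation

@dataclass
class NeuronCluster:
    neuron_ids: list      # neurons grouped into a single unit of work
    backend: str          # "npu" or "cpu" in this sketch

def plan_clusters(stage, neuron_ids, cpu_cluster_size=32):
    """Stage-dependent clustering: one big NPU cluster for prefill,
    many small CPU clusters for decoding (cluster size here is made up)."""
    if stage is Stage.PREFILL:
        # Merge all neurons into one cluster so the NPU sees a large matmul.
        return [NeuronCluster(list(neuron_ids), backend="npu")]
    # Decoding: split the predicted-active neurons into small, flexible CPU clusters.
    neuron_ids = list(neuron_ids)
    return [
        NeuronCluster(neuron_ids[i:i + cpu_cluster_size], backend="cpu")
        for i in range(0, len(neuron_ids), cpu_cluster_size)
    ]

print(len(plan_clusters(Stage.PREFILL, range(2048))))   # 1 big cluster for the NPU
print(len(plan_clusters(Stage.DECODE, range(100))))     # 4 small CPU clusters
```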

The neuron cluster granularity further allows PowerInfer-2 to mitigate the impact of I/O overhead on the inference process. PowerInfer-2 introduces a segmented cache that operates at neuron granularity and applies specific caching strategies to different LLM weight types, effectively improving the cache hit rate. Furthermore, to reduce computational delays caused by I/O operations, PowerInfer-2 proposes a fine-grained neuron-cluster-level pipelining technique that overlaps I/O operations with neuron cluster computations, significantly shrinking the waiting bubbles caused by I/O latency.
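A minimal way to picture neuron-cluster-level pipelining is a producer/consumer loop in which an I/O thread resolves cache misses from storage while the main thread computes clusters whose weights are already resident. The sketch below assumes a plain dict as the cache and caller-supplied `load_from_flash` and `compute` callables; it is not PowerInfer-2’s actual implementation.

```python
import queue
import threading
import time

def pipelined_layer(cluster_ids, cache, load_from_flash, compute):
    """Neuron-cluster-level pipelining sketch: an I/O thread resolves cache
    misses from storage while the main thread computes clusters whose weights
    are already available, so reads overlap with computation."""
    ready = queue.Queue()

    def io_worker():
        for cid in cluster_ids:
            weights = cache[cid] if cid in cache else load_from_flash(cid)
            cache[cid] = weights          # keep frequently used clusters resident
            ready.put((cid, weights))
        ready.put(None)                   # sentinel: every cluster has been issued

    threading.Thread(target=io_worker, daemon=True).start()

    outputs = []
    while (item := ready.get()) is not None:
        cid, weights = item
        outputs.append(compute(cid, weights))   # runs while I/O continues in the background
    return outputs

# Toy usage with fake flash latency; clusters 0 and 1 start out cached.
outputs = pipelined_layer(
    cluster_ids=range(8),
    cache={0: "w0", 1: "w1"},
    load_from_flash=lambda cid: (time.sleep(0.01), f"w{cid}")[1],
    compute=lambda cid, w: f"out({cid}:{w})",
)
print(outputs)
```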

To support a broad range of LLMs and smartphones with different configurations, PowerInfer-2 executes an offline planner before the first inference of a new model on a smartphone. The planner takes user requirements, analyzes the model and the hardware, and generates an execution plan. The plan describes the configuration of the components that guide the online inference process, including the usage ratios of the different XPUs at various stages and the sizes of the different cache regions.
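The sketch below suggests what such an execution plan could look like as a data structure, with a toy planner that sizes the FFN cache region from the remaining memory budget. All field names, the 20% headroom, and the two-dictionary profiles are hypothetical; the paper does not specify the plan’s concrete format.

```python
from dataclasses import dataclass, field

@dataclass
class ExecutionPlan:
    """Illustrative shape of an offline-generated execution plan;
    field names are hypothetical, not PowerInfer-2's actual format."""
    prefill_backend: str = "npu"            # which XPU handles the prefill stage
    decode_cpu_cores: list = field(default_factory=lambda: ["big0", "big1", "mid0", "mid1"])
    decode_cluster_size: int = 32           # neurons per CPU cluster during decoding
    cache_bytes_attention: int = 2 << 30    # cache region reserved for attention weights
    cache_bytes_ffn: int = 6 << 30          # cache region reserved for hot FFN neurons
    target_tokens_per_s: float = 10.0       # user requirement fed to the planner

def generate_plan(model_profile, hw_profile, target_tps):
    """Toy planner: keep attention weights fully cached and give most of the
    remaining memory budget to the FFN neuron cache (the 20% headroom is assumed)."""
    budget = hw_profile["memory_bytes"] - model_profile["attention_bytes"]
    return ExecutionPlan(
        cache_bytes_attention=model_profile["attention_bytes"],
        cache_bytes_ffn=max(0, int(budget * 0.8)),
        target_tokens_per_s=target_tps,
    )

plan = generate_plan(
    model_profile={"attention_bytes": 2 << 30},
    hw_profile={"memory_bytes": 16 << 30},
    target_tps=11.0,
)
print(plan)
```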

We have implemented PowerInfer-2 by extending PowerInfer [30] with 12K additional lines of code (LoC) and deployed it on two smartphones (OnePlus 12 and Ace 2), both equipped with heterogeneous Qualcomm XPUs and with 24GB and 16GB of DRAM, respectively. PowerInfer-2 supports a diverse array of LLMs across different model sizes, including Llama-2 [29, 34] (7B, 13B), TurboSparse-Mistral [31] (7B), and TurboSparse-Mixtral [31] (47B). Our evaluation demonstrates that PowerInfer-2 achieves an average speedup of 3.94× (up to 4.38×) and 25.4× (up to 29.2×) compared to the current state-of-the-art frameworks LLM in a Flash [4] and llama.cpp [13], respectively. Notably, PowerInfer-2 is the first system to support the TurboSparse-Mixtral-47B model on mobile platforms, achieving a generation speed of 11.68 tokens/s, which is 21.2× faster than llama.cpp. Another significant advantage of PowerInfer-2 is its ability to reduce memory usage during model inference: for smaller models such as the 7B size, PowerInfer-2’s techniques can save nearly 40% of memory usage while achieving the same inference speed as llama.cpp and MLC-LLM [33].


:::info This paper is available on arXiv under a CC BY 4.0 license.

:::
