Rui Pan 潘瑞

Publications

(*Equal contributions)

2025

SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning
Rui Pan, Yinwei Dai, Zhihao Zhang, Gabriele Oliaro, Zhihao Jia, Ravi Netravali
NeurIPS 2025, San Diego, CA
🏆Spotlight Paper at the NeurIPS 2025 Workshop on Efficient Reasoning

Abstract

Recent advances in inference-time compute have significantly improved performance on complex tasks by generating long chains of thought (CoTs) using Large Reasoning Models (LRMs). However, this improved accuracy comes at the cost of high inference latency due to the length of generated reasoning sequences and the autoregressive nature of decoding. Our key insight in tackling these overheads is that LRM inference, and the reasoning that it embeds, is highly tolerant of approximations: complex tasks are typically broken down into simpler steps, each of which brings utility based on the semantic insight it provides for downstream steps rather than the exact tokens it generates. Accordingly, we introduce SpecReason, a system that automatically accelerates LRM inference by using a lightweight model to (speculatively) carry out simpler intermediate reasoning steps and reserving the costly base model only to efficiently assess (and potentially correct) the speculated outputs. Importantly, SpecReason's focus on exploiting the semantic flexibility of thinking tokens in preserving final-answer accuracy is complementary to prior speculation techniques, most notably speculative decoding, which demands token-level equivalence at each step. Across a variety of cross-domain reasoning benchmarks, SpecReason achieves 1.4-3.0$\times$ speedup over vanilla LRM inference while improving accuracy by 0.4-9.0%. Compared to speculative decoding without SpecReason, their combination yields an additional 8.8-58.0% latency reduction. We open-source SpecReason at https://github.com/ruipeterpan/specreason.
METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation
Siddhant Ray, Rui Pan, Zhuohan Gu, Kuntai Du, Shaoting Feng, Ganesh Ananthanarayanan, Ravi Netravali, Junchen Jiang
ACM SOSP 2025, Seoul, Korea

Abstract

RAG (Retrieval Augmented Generation) allows LLMs (large language models) to generate better responses with external knowledge, but using more external knowledge often improves generation quality at the expense of response delay. Prior work either reduces the response delay (through better scheduling of RAG queries) or strives to maximize quality (which involves tuning the RAG workflow), but they fall short in optimizing the \emph {tradeoff} between the delay and quality of RAG responses. This paper presents METIS, the first RAG system that jointly schedules queries and adapts the key RAG configurations of each job, such as the number of retrieved text chunks and synthesis methods, in order to balance quality optimization and response delay reduction. Using 4 popular RAG-QA datasets, we show that compared with the state-of-the-art RAG scheduling system, METIS reduces the generation latency by 1.64--2.54× without sacrificing generation quality.
Marconi: Prefix Caching for the Era of Hybrid LLMs
Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Yida Wang, Ravi Netravali
MLSys 2025, Santa Clara, CA
🏆Outstanding Paper Award, Honorable Mention
(Being) Integrated into SGLang!

Abstract

Hybrid models that combine the language modeling capabilities of Attention layers with the efficiency of Recurrent layers (e.g., State Space Models) have gained traction in practically supporting long contexts in Large Language Model serving. Yet, the unique properties of these models complicate the usage of complementary efficiency optimizations such as prefix caching that skip redundant computations across requests. Most notably, their use of in-place state updates for recurrent layers precludes rolling back cache entries for partial sequence overlaps, and instead mandates only exact-match cache hits; the effect is a deluge of (large) cache entries per sequence, most of which yield minimal reuse opportunities. We present Marconi, the first system that supports efficient prefix caching with Hybrid LLMs. Key to Marconi are its novel admission and eviction policies that more judiciously assess potential cache entries based not only on recency, but also on (1) forecasts of their reuse likelihood across a taxonomy of different hit scenarios, and (2) the compute savings that hits deliver relative to memory footprints. Across diverse workloads and Hybrid models, Marconi achieves up to 34.4× higher token hit rates (71.1% or 617 ms lower TTFT) compared to state-of-the-art prefix caching systems.
Mowgli: Passively Learned Rate Control for Real-Time Video
Neil Agarwal, Rui Pan, Francis Y. Yan, Ravi Netravali
USENIX NSDI 2025, Philadelphia, PA

Abstract

Rate control algorithms are at the heart of video conferencing platforms, determining target bitrates that match dynamic network characteristics for high quality. Recent data-driven strategies have shown promise for this challenging task, but the performance degradation they introduce during training has been a nonstarter for many production services, precluding adoption. This paper aims to bolster the practicality of data-driven rate control by presenting an alternative avenue for experiential learning: leveraging purely existing telemetry logs produced by the incumbent algorithm in production. We observe that these logs often contain effective decisions, although often at the wrong times or in the wrong order. To realize this approach despite the inherent uncertainty that log-based learning brings (i.e., lack of feedback for new decisions), our system, Mowgli, combines a variety of robust learning techniques (i.e., conservatively reasoning about alternate behavior to minimize risk and using a richer model formulation to account for environmental noise). Across diverse networks (emulated and real-world), Mowgli outperforms the widely deployed GCC algorithm, increasing average video bitrates by 15-39% while reducing freeze rates by 60-100%.
Humanity's Last Exam
Center for AI Safety, Scale AI
arXiv 2025

Abstract

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,700 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai/.

2024

Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving
Yinwei Dai*, Rui Pan*, Anand Iyer, Kai Li, Ravi Netravali
ACM SOSP 2024, Austin, TX

Abstract

Machine learning (ML) inference platforms are tasked with balancing two competing goals: ensuring high throughput given many requests, and delivering low-latency responses to support interactive applications. Unfortunately, existing platform knobs (e.g., batch sizes) fail to ease this fundamental tension, and instead only enable users to harshly trade off one property for the other. This paper explores an alternate strategy to taming throughput-latency tradeoffs by changing the granularity at which inference is performed. We present Apparate, a system that automatically applies and manages early exits (EEs) in ML models, whereby certain inputs can exit with results at intermediate layers. To cope with the time-varying overhead and accuracy challenges that EEs bring, Apparate repurposes exits to provide continual feedback that powers several novel runtime monitoring and adaptation strategies. Apparate lowers median response latencies by 40.5-91.5% and 10.0-24.2% for diverse CV and NLP classification workloads, and median time-per-token latencies by 70.4-77.9% for generative scenarios, without affecting throughputs or violating tight accuracy constraints.
Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation
Anand Iyer, Mingyu Guan, Yinwei Dai, Rui Pan, Swapnil Gandhi, Ravi Netravali
ACM SOSP 2024, Austin, TX

Abstract

Machine learning inference platforms continue to face high request rates and strict latency constraints. Existing solutions largely focus on compressing models to substantially lower compute costs (and time) with mild accuracy degradations. This paper explores an alternate (but complementary) technique that trades off accuracy and resource costs on a per-input granularity: early exit models, which selectively allow certain inputs to exit a model from an intermediate layer. Though intuitive, early exits face fundamental deployment challenges, largely owing to the effects that exiting inputs have on batch size (and resource utilization) throughout model execution. We present E3, the first system that makes early exit models practical for realistic inference deployments. Our key insight is to split and replicate blocks of layers in models in a manner that maintains a constant batch size throughout execution, all the while accounting for resource requirements and communication overheads. Evaluations with NLP and vision models show that E3 can deliver up to 1.74× improvement in goodput (for a fixed cost) or 1.78× reduction in cost (for a fixed goodput). Additionally, E3's goodput wins generalize to autoregressive LLMs (2.8-3.8×) and compressed models (1.67×).
Optimizing Mixture-of-Experts Inference Latency Combining Model Deployment and Communication Scheduling
Jialong Li, Shreyansh Tripathi, Lakshay Rastogi, Yiming Lei, Rui Pan, Yiting Xia
arXiv 2024

Abstract

As machine learning models scale in size and complexity, their computational requirements become a significant barrier. Mixture-of-Experts (MoE) models alleviate this issue by selectively activating relevant experts. Despite this, MoE models are hindered by high communication overhead from all-to-all operations, low GPU utilization due to the synchronous communication constraint, and complications from heterogeneous GPU environments. This paper presents Aurora, which optimizes both model deployment and all-to-all communication scheduling to address these challenges in MoE inference. Aurora achieves minimal communication times by strategically ordering token transmissions in all-to-all communications. It improves GPU utilization by colocating experts from different models on the same device, avoiding the limitations of synchronous all-to-all communication. We analyze Aurora's optimization strategies theoretically across four common GPU cluster settings: exclusive vs. colocated models on GPUs, and homogeneous vs. heterogeneous GPUs. Aurora provides optimal solutions for three cases, and for the remaining NP-hard scenario, it offers a polynomial-time sub-optimal solution with only a 1.07x degradation from the optimal. Aurora is the first approach to minimize MoE inference time via optimal model deployment and communication scheduling across various scenarios. Evaluations demonstrate that Aurora significantly accelerates inference, achieving speedups of up to 2.38x in homogeneous clusters and 3.54x in heterogeneous environments. Moreover, Aurora enhances GPU utilization by up to 1.5x compared to existing methods.

2023

Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning
Pengfei Zheng, Rui Pan, Tarannum Khan, Shivaram Venkataraman, Aditya Akella
USENIX NSDI 2023, Boston, MA

Abstract

Dynamic adaptation has become an essential technique in accelerating distributed machine learning (ML) training: Recent studies have shown that dynamically adjusting model structure (e.g., lottery ticket hypothesis) or hyperparameters (e.g., batch size) can significantly accelerate training without sacrificing accuracy. However, existing ML cluster schedulers are not designed to handle dynamic adaptation. We show that existing schemes fail to provide fairness and degrade system efficiency when the training throughput changes over time under dynamic adaptation. We design Shockwave, a scheduler with future planning that builds on two key ideas. First, Shockwave extends classic market theory from static settings to dynamic settings to co-optimize efficiency and fairness. Second, Shockwave utilizes stochastic dynamic programming to handle uncertain, dynamic throughput. We build a system for Shockwave and validate its performance with both trace-driven simulation and cluster experiments. Results show that for traces of ML jobs with dynamic adaptation, Shockwave improves makespan by 1.3× and fairness by 2× when compared with existing fair scheduling schemes.

2022

Efficient Flow Scheduling in Distributed Deep Learning Training with Echelon Formation
Rui Pan*, Yiming Lei*, Jialong Li, Zhiqiang Xie, Binhang Yuan, Yiting Xia
ACM HotNets 2022, Austin, TX

Abstract

This paper discusses why flow scheduling does not apply to distributed deep learning training and presents EchelonFlow, the first network abstraction to bridge the gap. EchelonFlow deviates from the common belief that semantically related flows should finish at the same time. We reached the key observation, after extensive workflow analysis of diverse training paradigms, that distributed training jobs observe strict computation patterns, which may consume data at different times. We devise a generic method to model the drastically different computation patterns across training paradigms, and formulate EchelonFlow to regulate flow finish times accordingly. Case studies of mainstream training paradigms under EchelonFlow demonstrate the expressiveness of the abstraction, and our system sketch suggests the feasibility of an EchelonFlow scheduling system.

Some other non peer-reviewed write-ups include:

CS 759 Project Report: Cautiously Aggressive GPU Space Sharing for Improving Resource Utilization and Job Efficiency (pdf)
CS 744 Project Report: Comparing Black-Box Optimization Methods for Online DBMS Tuning (pdf)
AgDH: A System for Gathering and Disseminating Dairy Data (pdf)

Rui Pan 潘瑞

About

Publications

2025

SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning

METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation

Marconi: Prefix Caching for the Era of Hybrid LLMs

Mowgli: Passively Learned Rate Control for Real-Time Video

Humanity's Last Exam

2024

Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving

Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation

Optimizing Mixture-of-Experts Inference Latency Combining Model Deployment and Communication Scheduling

2023

Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning

2022

Efficient Flow Scheduling in Distributed Deep Learning Training with Echelon Formation

Education

🐅Princeton University

🦡University of Wisconsin-Madison

Experience

Student Researcher @ Google, SystemsResearch@Google

Applied Scientist Intern @ AWS AI, Machine Learning Systems Team

🇩🇪Research Intern @ Max Planck Institute for Informatics (MPI-INF)

Professional Activities

Interests

Contacts