Speaker: Weijie Su (University of Pennsylvania)
Time: 2025-06-25, 11:00-12:00
Location: Shanlan Hall, Zhihua Building
Abstract:
In this talk, we advocate for the development of rigorous statistical foundations for large language models (LLMs). We begin by elaborating on two key features that motivate statistical perspectives for LLMs: (1) the probabilistic, autoregressive nature of next-token prediction, and (2) the complexity and black-box nature of Transformer architectures. To illustrate how statistical insights can directly benefit LLM development and applications, we present two concrete examples. First, we demonstrate statistical inconsistencies and biases arising from the current approach to aligning LLMs with human preferences. We propose a regularization term for aligning LLMs that is both necessary and sufficient to ensure consistent alignment. Second, we introduce a novel statistical framework to analyze the efficiency of watermarking schemes, with a focus on a watermarking scheme developed by OpenAI, for which we derive optimal detection rules that outperform existing ones. Time permitting, we will explore how statistical principles can inform rigorous evaluation protocols for LLMs. Collectively, these findings showcase how statistical insights can address pressing challenges in LLMs while simultaneously illuminating new research avenues for the broader statistical community to advance responsible generative AI research. This talk is based on arXiv:2404.01245, 2405.16455, 2503.10990, 2505.19145, and 2506.02058.
About the Speaker:
Weijie Su is an Associate Professor in the Wharton Statistics and Data Science Department and, by courtesy, in the Departments of Biostatistics, Epidemiology and Informatics, Computer Science, and Mathematics at the University of Pennsylvania. He is a co-director of the Penn Research in Machine Learning (PRiML) Center. Prior to joining Penn, he received his Ph.D. in statistics from Stanford University in 2016 and his bachelor's degree in pure mathematics from Peking University in 2011. His research interests span the statistical foundations of generative AI, privacy-preserving data analysis, high-dimensional statistics, and optimization. He serves as an associate editor of the Journal of Machine Learning Research, the Journal of the American Statistical Association, Foundations and Trends in Statistics, Operations Research, and the Journal of the Operations Research Society of China, and he is currently guest editing a special issue of Stat on Statistics for Large Language Models and Large Language Models for Statistics. His work has been recognized with several awards and honors, including the Stanford Anderson Dissertation Award, an NSF CAREER Award, a Sloan Research Fellowship, the IMS Peter Hall Prize, the SIAM Early Career Prize in Data Science, the ASA Noether Early Career Award, and the ICBS Frontiers of Science Award in Mathematics, as well as two discussion papers in JRSSB and JASA. He is a Fellow of the IMS.

Your participation is warmly welcomed!

Scan the QR code to follow the official WeChat account of the Peking University Center for Statistical Science for more seminar information!