Center for Statistical Science, Peking University


Statistics and Data Science Seminar Series

A tailored robust multivariate clustering approach via mean-shift penalization

Speaker: Kun Chen, University of Connecticut

Time: 2016-05-27 14:00 ~ 15:00

Venue: Room 1418, Science Building No. 1

Abstract: Finite mixture models have been widely used for modeling clustered and heterogeneous populations. In univariate analysis, a three-component mixture model is commonly used for modeling heavy-tailed distributions arising from multiple comparisons or multiple testing problems, in which one component represents the null and the other two capture departures from the null at the two tails. However, in the multivariate case, a three-component mixture model may be inadequate to characterize the entire heavy-tailed behavior of a multivariate distribution, since the two “tail” components can now cover only two specific corners of the multivariate space. In general, failing to accommodate such rare, nuisance but extreme observations that deviate from an expected cluster pattern may jeopardize both model estimation and inference. Motivated by a proteomics application for identifying proteins of concordant change, we propose a robust multivariate normal mixture model with a case-specific mean-shift parameterization. A penalized likelihood approach is adopted to induce sparsity among the mean-shift parameters in order to distinguish the anomalies from the data cloud of an expected cluster pattern, and a generalized Expectation-Maximization algorithm is developed for stable and efficient optimization.
We explore the connections and differences between the proposed approach and other existing robust clustering methods. Under a mild eigenvalue-ratio condition, we show that the problem of unbounded likelihood is resolved and the solution of the proposed method is well defined. We tailor the proposed method to incorporate prior biological expectations and to handle replicated, incomplete data in a real-world proteomics application, for robustly detecting proteins with concordantly changed intensity levels across multiple disease-producing strains compared to a non-disease-producing strain. If time permits, extensions to heterogeneous responses and low-rank estimation will be discussed.

About the Speaker: