Abstract:
We have collected a data set for the networks of statisticians, consisting of titles, authors, abstracts, MSC numbers, keywords, and citation counts of papers published in representative journals in statistics and related fields. In Phase I of our study, the data set covers all papers published from 2003 to 2012 in Annals of Statistics, Biometrika, JASA, and JRSS-B. In Phase II of our study, the data set covers all papers published in 36 journals in statistics and related fields, spanning 40 years. We report some Exploratory Data Analysis (EDA) results, including productivity, journal-journal citations, and citation patterns. These results are based on Phase II of the data set, which became ready for use only recently. We also discuss two closely related problems: network community detection and network membership estimation. We attack these problems with the recent approach of Spectral Clustering On Ratioed Eigenvectors (SCORE), reveal a surprising simplex structure underlying the networks, and explain why SCORE is the right approach. We apply SCORE to the Coauthorship and Citation networks of statisticians (based on Phase I of our data set), and present several communities including “Large-Scale Multiple Testing”, “Variable Selection”, “Spatial Statistics”, “Carroll-Hall”, and “North Carolina”.
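For readers unfamiliar with SCORE, the following is a minimal Python sketch of the general idea as we understand it: take the leading eigenvectors of a symmetric adjacency matrix, divide the later ones entrywise by the first, and cluster the resulting ratios. The function name `score_communities`, its parameters, and the small hand-written k-means step are illustrative assumptions, not the speaker's implementation, and the sketch assumes a connected, undirected graph.

```python
import numpy as np

def score_communities(adjacency, k, n_iter=50):
    """Illustrative sketch: cluster nodes on entrywise eigenvector ratios."""
    a = np.asarray(adjacency, dtype=float)
    # Symmetric matrix, so eigh applies; keep the k eigenpairs largest in magnitude.
    eigvals, eigvecs = np.linalg.eigh(a)
    top = np.argsort(-np.abs(eigvals))[:k]
    lead, rest = eigvecs[:, top[0]], eigvecs[:, top[1:]]
    # Dividing by the leading eigenvector is meant to cancel node-specific
    # degree effects (assumes no zero entries, e.g. a connected graph).
    ratios = rest / lead[:, None]
    # Simple deterministic k-means (farthest-point initialization); an assumption
    # made here only to keep the sketch free of extra dependencies.
    centers = [ratios[0]]
    for _ in range(1, k):
        d = np.min([np.sum((ratios - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(ratios[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((ratios[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = ratios[labels == j].mean(axis=0)
    return labels

# Toy example: two 4-node cliques joined by a single edge.
toy = np.zeros((8, 8))
toy[:4, :4] = 1.0
toy[4:, 4:] = 1.0
np.fill_diagonal(toy, 0.0)
toy[3, 4] = toy[4, 3] = 1.0
toy_labels = score_communities(toy, k=2)
```

On the toy graph, the two cliques should end up in different groups; a real coauthorship or citation network would need additional preprocessing (for example, restricting to a connected component) that this sketch does not attempt.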
About the Speaker:
Jiashun Jin received his Ph.D. in Statistics from Stanford University in 2003. He was trained in statistical inference for Big Data, specializing in the most challenging regime, where the signals are both Rare and Weak. In such Rare/Weak settings, many conventional approaches fail, and it is desirable to find new methods and theory that are appropriate for such situations.
His earlier work was on large-scale multiple testing, focusing on (Tukey’s) Higher Criticism and practical False Discovery Rate (FDR) controlling methods. He has developed the idea of Higher Criticism into a class of methods that are useful for solving problems in genetics, genomics, cosmology, and astronomy, including cancer classification, cancer clustering, and non-Gaussian signature detection in the Cosmic Microwave Background (CMB). He has proposed using the so-called “phase diagram” as a new optimality measure that is particularly appropriate for Big Data settings where the signals of interest are Rare/Weak, and has worked out the phase diagrams for many seemingly unrelated settings.
His more recent interests are in complex graphs, social networks, sparse PCA, and Random Matrix Theory. He has developed a number of new methods, among which are Graphlet Screening (GS) for high-dimensional variable selection, IF-PCA for dimension reduction and high-dimensional clustering, and SCORE for network community detection.
Jin and coauthors have collected and cleaned a data set for the coauthorship and citation networks of statisticians. The data set consists of titles, authors, keywords, abstracts, and citation counts of approximately 70,000 papers published in 36 journals in statistics and related fields, spanning about 40 years. The data set provides fertile ground for research on the social networks of statisticians. It also opens the door to quantitative evaluation of the impact of statistical research.