## Research

I work on theoretical statistics with a focus on clustering, bootstrap, random matrices and networks. Here are some of my publications. For all publications look at the arxiv link.

**Students and Postdocs:**

Robert Lunde , SDS, U. Michigan, Research Fellow, 2021-2022

Asst. Prof., Department of Mathematics and Statistics, Washington University in St. Louis, 2022.

Qiaohui Lin , SDS (Jointly advised with Peter Mueller)

Prateek Srivastava , ORIE (Jointly advised with Grani Hanasusanto)

Xueyu Mao, CS (now at Amazon)

Bowei Yan , SDS (now at Nuro)

**PREPRINTS:** Find me on Arxiv.

Bootstrapping the error of Oja’s algorithm.

(Robert Lunde, Purnamrita Sarkar, Rachel Ward)

- An limiting distribution for the error of Oja's algorithm and an online bootstrap method for estimating the distribution.

- NeurIPS 2021. Spotlight.

Two provably consistent divide and conquer clustering algorithms for large networks

(Soumendu S. Mukherjee, Purnamrita Sarkar, Peter J. Bickel)

- Two methods to scale up network clustering to very large graphs

- PNAS 2021. Forthcoming.

Consistent Nonparametric Methods for Network Assisted Covariate Estimation

(Xueyu Mao, Deepayan Chakrabarti, Purnamrita Sarkar)

- Designing methods to estimate missing covariates using a network and available covariates using simple local network statistics

- ICML 2021

When random initializations help: a study of variational inference for community detection

(Purnamrita Sarkar, Y.X. Rachel Wang, Soumendu Mukherjee )

- Understanding the role of random initializations in variational inference.

- JMLR 2021.

On the Theoretical Properties of the Network Jackknife

(Qiaohui Lin, Robert Lunde , Purnamrita Sarkar )

Talk slides

- Jackknife for network moments

- ICML 2020.

On hyperparameter tuning in general clustering problems

(Xinjie Fan, Yuguang Yue, Purnamrita Sarkar, Y. X. Rachel Wang)

- A general and provable framework for doing hyperparameter tuning and model selection in clustering

- ICML 2020.

A Theoretical Case Study of Structured Variational Inference for Community Detection

(Mingzhang Yin, Y. X. Rachel Wang, Purnamrita Sarkar )

- Understanding the role of dependence structure in variational inference.

- AISTATS 2020.

Hierarchical Community Detection by Recursive Partitioning

(Tianxi Li, Lihua Lei, Sharmodeep Bhattacharya, Koen Van den Berge, Purnamrita Sarkar, Peter J. Bickel, Elizaveta Levina)

- A model-free, computationally efficient, hierarchical algorithm for clustering networks with hierarchical community structure

- JASA Theory and Methods, To appear.

Estimating Mixed Memberships with Sharp Eigenvector Deviations

(Xueyu Mao, Purnamrita Sarkar, Deepayan Chakrabarti)

- A result on row-wise deviation of empirical eigenvectors of a broad class of low rank matices, and its application to mixed membership estimation in networks

- JASA Theory and Methods, To appear.

Covariate Regularized Community Detection in Sparse Graphs

(Bowei Yan, Purnamrita Sarkar)

- A semidefinite relaxation to do community detection in networks in presence of covariates

- JASA Theory and Methods, To appear.

Network modelling of topological domains using Hi-C data

(Y. X. Rachel Wang, Purnamrita Sarkar, Oana Ursu, Anshul Kundaje, Peter J. Bickel)

- A non-exchangeable random graph model and a linear program for detecting TADs in HiC data

- Annals of Applied Statistics, In press.

Mean Field for the Stochastic Blockmodel:
Optimization Landscape and Convergence Issues.

(Soumendu Sundar Mukherkee, Purnamrita Sarkar, Y. X. Rachel Wang, and Bowei Yan. )

- A better understanding of the landscape of the meanfield loss function for blockmodels and its implication to inference.

- NIPS 2018.

- Supplementary here .

Overlapping Clustering Models, and One (class) SVM to Bind Them All.

(Xueyu Mao, Purnamrita Sarkar, and Deepayan Chakrabarti. )

- An overarching inference framework for clustering with overlap.

- NIPS 2018, Spotlight.

- Supplementary here .

Provable Estimation of the Number of Blocks in Block Models

(Bowei Yan, Purnamrita Sarkar, and Xiuyuan Cheng. )

- Recovering K in the stochastic blockmodels via SDPs.

- AISTATS 2018.

On clustering network-valued data.
[Supplementary]

(Soumendu Mukherjee, Purnamrita Sarkar, Lizhen Lin)

- Two clustering algorithms for clustering where each datapoint corresponds to a network.

- NIPS 2017.

Statistical Convergence Analysis of Gradient EM on multi-component Gaussian Mixture Models.
[Supplementary]

(Bowei Yan, Mingzhang Yin, Purnamrita Sarkar)

- Local convergence guarantees for EM with general Gaussian Mixture Models

- NIPS 2017.

On Mixed Memberships and Symmetric Nonnegative Matrix Factorizations.
[Supplementary] [code]

(Xueyu Mao, Purnamrita Sarkar, Deepayan Chakrabarti)

- Asymptotic guarantees for Symmetric Non-negative matrix factorization for the Mixed Membership Blockmodel.

- ICML 2017.

On Robustness of Kernel Clustering.
[arxiv] [code]

(Bowei Yan, Purnamrita Sarkar)

- Asymptotic guarantees for a semidefinite relaxation of Kernel Clustering under outlier nodes.

- NIPS 2016.

Answering Enumeration Queries with the Crowd.

(Beth Trushkowsky and Tim Kraska and Michael J. Franklin)

- We use statistical tools that enable one to reason about query completeness and help with query execution and crowdsourcing strategies.

- Communications of the ACM 2016.

The Consistency of Common Neighbors for Link Prediction in Stochastic Blockmodels.
[Supplementary]

(Purnamrita Sarkar, Deepayan Chakrabarti, Peter Bickel)

- Asymptotic guarantees for link prediction with common neighbors .

- NIPS 2015.

Scaling Up Crowd-Sourcing to Very Large Datasets: A Case for Active Learning.

(Barzan Mozafari, Purnamrita Sarkar, Michael Franklin, Michael Jordan, Samuel Madden)

- Using the theory of nonparametric bootstrap, we propose an active learning scheme for a crowdsourced database.

- To appear in VLDB 2015.

Role of normalization in spectral clustering.

(P. Sarkar and P. J. Bickel)

- We theoretically investigate the effect of normalization for Spectral Clustering in Stochastic Blockmodels. A common practice in Spectral Clustering is to use the normalized adjacency matrix.
Our results suggest that normalization improves the second order consistency properties of the principal eigenvectors of the adjacency matrix. Under the spectral embedding, normalization
reduces the variance of points in a class while keeping the distance between cluster centers intact.

- To appear in the Annals of Statistics.

Hypothesis Testing for Automated Community Detection in Networks.

(P. J. Bickel and P. Sarkar)

- Most network clustering algorithms assume that the number of clusters k is known. We propose a hypothesis testing framework for automatically determining the number of clusters in networks generated from Stochastic Blockmodels. We theoretically derive
the asymptotic distribution of the largest eigenvalue of a shifted and scaled adjacency matrix and use this for the test. Further we develop a recursive bipartitioning algorithm
for generating a hierarchical clustering of the graph.

- To appear in the Journal of Royal Statistical Society, Series B.

Crowdsourced Enumeration Queries.
(International Conference on Data Engineering, ICDE 2013)

(B. Trushkowsky, T. Kraska, M. J. Franklin, and P. Sarkar)

-We propose to use statistical species estimation tools for improving the quality of
enumeration type queries in crowdsourced databases.

- Received the Best Paper Award

Nonparametric Link Prediction in Dynamic Networks.
(International Conference on Machine Learning, ICML 2012). [Appendix].

(P. Sarkar, D. Chakrabarti, and M. I. Jordan)

-We propose a nonparametric time-series model for dynamic networks with non-linear linkage patterns. We use simple link prediction heuristics from static networks as features, and propose a scalable estimation procedure that
uses locality sensitive hashing. We also prove asymptotic consistency of the estimator under the specified model.

-*Extended version:*
Nonparametric Link Prediction in Large Scale Dynamic Networks.

-To appear in the Electronic Journal of Statistics.

Big Data Bootstrap.
(International Conference on Machine Learning, ICML 2012)

(A. Kleiner, A. Talwalkar, P. Sarkar, and M. I. Jordan)

-With the huge amounts of noisy data generated from public and corporate sources everyday, estimation of confidence on a quantity of inferential interest
is becoming more and more important. In this work we present "Bag of Little Bootstraps", a sampling scheme which obtains compact representations of bootstrap samples. This algorithm
significantly reduces the computational burden of bootstrap without losing out on its asymptotic consistency properties.

-*Extended version:*
A Scalable Bootstrap for Massive Data.

-Journal of Royal Statistical Society, Series B. In press.

Theoretical Justification of Popular Link Prediction Heuristics.
(International Conference on Learning Theory, COLT 2010)

(P. Sarkar, D. Chakrabarti, and A. W. Moore)

- Link prediction is an interesting and important practical problem in social networks. There are many different graph-based heuristics for doing link prediction, namely number of common neighbors between two nodes, shortest path etc.
By using a generative model for link formation, and geometric intuitions, we justify some popular link prediction heuristics. Here is the presentation.

- Received the Mark Fulk Best Student Paper Award

- Invited to the Best Paper Track, International Joint Conference on Artificial Intelligence (IJCAI), 2011.

Fast Nearest-neighbor Search in Disk-resident Graphs. (ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2010)

(P. Sarkar and A. W. Moore)

- We examine the problem of computing random walk based measures on large disk-resident graphs. We also examine the relationship between different proximity measures in order to analyze the effect
of high degree nodes on random walks. The technical report can be found
here.

Fast Dynamic Reranking in Large Graphs.
(International Conference on World Wide Web, WWW 2009)

(P. Sarkar and A. W. Moore)

- In this paper we consider the problem of
re-ranking search results by incorporating user feedback. We propose to
use a graph theoretic measure for discriminating irrelevant results
from relevant results using a few labeled examples provided by the
user. We present efficient sampling and neighborhood expansion algorithms to compute the top k nodes under this measure.
Here is the presentation.

Fast Incremental Proximity Search in Large Graphs.
(International Conference on Machine Learning, ICML 2008)

(P. Sarkar, A. W. Moore, and A. Prakash)

- In this paper we combine random sampling with
deterministic neighborhood expansion schemes to compute approximate
nearest neighbors of a query under hitting and commute times.
Resulting algorithms can process queries on the fly without any
preprocessing. This is of crucial importance for graphs which change
quickly over time. Here is the presentation.

A Tractable Approach to Finding Closest Truncated-commute-time Neighbors in Large Graphs.
(Uncertainty in Artificial Intelligence, UAI 2007)

(P. Sarkar and A. W. Moore)

- We present GRANCH, an algorithm to compute all
interesting pairs of approximate nearest neighbors in truncated
commute times in a graph, without
computing it between all pairs. Our algorithm provably prunes away
uninteresting pairs of nodes in the graph, and as a result
quickly finds the "potential nearest neighbors".

A Latent Space Approach to Dynamic Embedding of Co-occurrence Data.
(International Conference on Artificial Intelligence and Statistics, AISTATS 2007)

(P. Sarkar, S. Siddiqi, and G. Gordon)

- We propose a graphical model to compute dynamic
embeddings of co-occurrence data. We show how to make inference
tractable by using Kalman Filters, which also provides distributional
information of the embedding inferred.

Dynamic Social Network Analysis using Latent Space Models.
(Advances in Neural Information Processing Systems, NIPS 2005)

(P. Sarkar and A. W. Moore)

- We present a dynamic model for social networks changing
over time. We combine a new dynamic multidimentional scaling algorithm
with a tractable local optimization technique for inference. Both of
these algorithms are highly efficient (sub-quadratic in size of the
social network).

- [*Extended version*:]
Appeared in SIGKDD Explorations: Special Edition on Link Mining 2005