
Bertrand Clarke
Professor of Statistics, University of Nebraska-Lincoln
Contact
- Address: HARH 354B
Areas of Expertise:
Data mining and machine learning, prediction, statistical techniques for complex or high-dimensional data, model selection and uncertainty.
Research Areas of Interest:
My main interest these days is in prediction. This is broader than it sounds because prediction brings in questions about model uncertainty (Which model, if any, is true?), model mis-specification (If no model is true, what is the least bad one?), model complexity (When is more complex modeling better than a simple approach?), and the other sources of variability and bias that must be small enough for a prediction to be useful. Obviously, different model classes can be used to generate predictors, but there are also predictors that are not based on any model class. This is the case, for instance, with many machine learning methods such as bagging, boosting, kernel methods, and ensemble methods more generally. In these cases, it is reasonable to ask what the predictor means, i.e., what does a good predictor say about the properties of the phenomenon being predicted? Complex and high-dimensional data are the natural places to use predictive techniques since model identification is so hard – even if one believes a model exists (often a dubious assumption). So, I tend to be interested in genomic or other types of complex data where useful formal theory is rare but statistical principles (variance-bias tradeoffs, robustness, complexity minimization, etc.) still provide helpful guidance. Analyzing complex data, or better, developing and understanding good predictors for complex data, often involves clustering, dimension reduction, complexity concepts, ensemble methods – and much else. Indeed, the predictive approach can be regarded as providing an overall conceptualization of the statistical problem in much the same way as the Bayesian, frequentist, survey sampling, or decision-theoretic approaches do.
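As a concrete illustration of the model-selection versus model-averaging theme, here is a minimal sketch (a toy example written for this page, not code from any of the papers below; Python with numpy and scikit-learn is assumed purely for convenience). It simulates data for which none of the candidate models is true, then compares the out-of-sample error of the single model chosen by cross-validation with that of an equal-weight average of all the candidates' predictions.

```python
# Toy sketch (assumed setup, not from the publications below): model selection
# versus equal-weight model averaging for out-of-sample prediction.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
# Nonlinear signal plus noise: none of the candidate models below is "true".
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
candidates = [
    LinearRegression(),
    DecisionTreeRegressor(max_depth=4, random_state=0),
    KNeighborsRegressor(n_neighbors=10),
]

# Model selection: keep only the candidate with the best cross-validated error.
cv_scores = [
    cross_val_score(m, X_tr, y_tr, cv=5, scoring="neg_mean_squared_error").mean()
    for m in candidates
]
best = candidates[int(np.argmax(cv_scores))].fit(X_tr, y_tr)

# Model averaging: fit every candidate and average their predictions equally.
preds = np.column_stack([m.fit(X_tr, y_tr).predict(X_te) for m in candidates])
avg_pred = preds.mean(axis=1)

print("selected model test MSE:     ", mean_squared_error(y_te, best.predict(X_te)))
print("equal-weight average test MSE:", mean_squared_error(y_te, avg_pred))
```

On data like these, the equal-weight average is typically competitive with, and often better than, the single selected model; this is the kind of behavior studied formally in the model averaging versus model selection work listed below.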
Publications:
In Preparation:
- Clarke, B. On the problem of selecting a list of models for predictive purposes.
Submitted:
In Press (2025+):
All Publications
(2021-2025)
- Dustin, D., Ghosh, S., and Clarke, B. (2025) Testing for the Important Components of Predictive Variance. Stat. Analysis and Data Mining, 18, No. 4, 15 pages.
- Clarke, B. and Yao, Y. (2025) A cheat sheet for Bayesian prediction. Stat. Sci., 40, 3-24.
- Clarke, B. and Rigo, P. (2025) Introduction to the Special Issue on Contemporary Bayesian Prediction. Stat. Sci., 40, 1-2.
- Dustin, D. and Clarke, B. (2024) Post-model selection prediction for generalized linear models. In the Special Issue in Honor of D. Basu’s Centennial. Sankhya A, 86, Suppl. 1, 301-326.
- Dustin, D., Clarke, B., and Clarke, J. (2024) Predictive stability criteria for penalty selection in linear models. Comput. Stat., 39, 1241-1280.
- Clarke, B. (2024) Invited comment on ‘Defining a Credible Interval Is Not Always Possible with “Point-Null” Priors: A Lesser-Known Correlate of the Jeffreys-Lindley Paradox.’ by Campbell and Gustafson. Bayesian Anal., 19, 956-962.
- Clarke, B., Haran, M., Jones, G., and MacEachern, S. (2024) Rethinking Faculty Research Evaluation. Amstat News, 566, 40-41.
- Jarquin, D., Roy, A., Clarke, B., and Ghosal, S. (2023). Combining phenotypic and genomic data to improve prediction of binary traits. J. Appl. Stat., 51, 1497-1523.
- Clarke, B. (2023) Comment on ‘Martingale posterior distributions’ by Fong, Holmes, and Walker. J. Roy. Stat. Soc., 85, 1400-1401.
- Le, T. and Clarke, B. (2022) Interpreting the uninterpretable: Kernel methods, Shtarkov solutions, and random forests. Statistical Theory and Related Fields, 6, 10-28.
- Le, T. and Clarke, B. (2022) Model averaging is better than model selection for prediction. J. Machine Learning Research, 23, paper #33, 1-53.
- Sun, Y., Clarke, B., Clarke, J., and Li, X. (2021) Predicting antibiotic resistance gene abundance in activated sludge using shotgun metagenomics and machine learning. Water Research, 202, 117384 (11 p.).
- Clarke, B. (2021) Invited comment on ‘Bayesian restricted likelihood methods: Conditioning on insufficient statistics in Bayesian regression’ by Lewis et al. Bayesian Analysis Vol. 16, p. 41-47.
- Clarke, B. and Datta, G. (2021) Preface for the Jayanta K. Ghosh Memorial Volume of Sankhya, Series B. Vol. 83 B, Part 1, p. 1-2.
(2011-2020)
- Le, T. and Clarke, B. (2020) In praise of partially interpretable predictors. Stat. Anal. and Data Mining: The ASA Data Science Journal, 13, 113-133.
- Amiri, S., Clarke, B., Clarke, J., and Koepke, H. (2019) A General Hybrid Clustering Technique, Journal of Computational and Graphical Statistics, 28:3, 540-551
- Dobra, A., Valdes, C., Ajdic, D., Clarke, B., & Clarke, J. (2019). Modeling association in microbial communities with clique loglinear models. Annals of Applied Statistics, 13(2), 931-957.
- Clarke, B. and Mpoudeu, M. (2019) "Model Selection via the VC Dimension," Journal of Machine Learning Research, 20, 1-26
- Clarke, B. (2019) Invited comment on ‘Prior-based Bayesian Information Criterion’ by Bayarri et al. Statistical Theory and Related Fields Vol. 3, 26-29.
- Amiri, S., Clarke, B., and Clarke, J. (2018) Clustering Categorical Data via Ensembling Dissimilarity Matrices, Journal of Computational and Graphical Statistics, 27:1, 195-208
- Le, T., & Clarke, B. (2018). On the interpretation of ensemble classifiers in terms of Bayes classifiers. Journal of Classification, 35(2), 198-229.
- Clarke, B. and Clarke, J. (2018) Predictive Statistics: Analysis and Inference Beyond Models, Cambridge University Press, U.K. (600 p.)
- Amiri, S., Clarke, B., and Clarke, J. (2018) https://github.com/saeidamiri1/GHC/wiki. Software implementing hybrid K-means and single-linkage clustering.
- Le, T. and Clarke, B. (2017). "A Bayes Interpretation of Stacking for M-Complete and M-Open Settings." Bayesian Analysis, 12, No. 3, 807-829.
- Le, T., & Clarke, B. (2016). Using the Bayesian Shtarkov solution for predictions. Computational Statistics and Data Analysis, 104, 183-196.
- Amiri, S., Clarke, B., and Clarke, J. (2016) ENSCAT: an R package for ensemble clustering of categorical data.
- Yu, C. W., & Clarke, B. (2015). Regular, median and Huber cross‐validation: A computational comparison. Statistical Analysis and Data Mining: The ASA Data Science Journal, 8(1), 14-33.
- Clarke, B., Valdes, C., Dobra, A., & Clarke, J. (2015). A Bayes testing approach to metagenomic profiling in bacteria. Statistics and Its Interface, 8(2), 173-185.
- Valdes, C., Brennan, M., Clarke, B., & Clarke, J. (2015). Detecting bacterial genomes in a metagenomic sample using NGS reads. Statistics and Its Interface, 8(4), 477-494.
- Clarke, B. and Clarke, J. (2014) “Estimating proportions in a mixed sample using transcriptomics.” STAT, Vol. 3, 313-325.
- Clarke, B. and Chu, J. (2014) “Generic feature selection with short, fat data.” Invited paper for Special Issue of J. Ind. Soc. Ag. Stat., Vol. 68, 145-162.
- Clarke, B., Clarke, J. and Yu, C.-W. (2014) Statistical problem classes and their links to information theory. Econ. Reviews, Zellner Memorial Issue, Vol. 33, 337-371
- Yu, C.-W., Clarke, B. and Clarke, J. (2013) “Bayes Prediction in the M-complete problem class with moderate sample size.” Bayesian Analysis, Vol. 8, 647-690.
- Koepke, H. and Clarke, B. (2013) “A Bayesian Criterion for Clustering Stability.” Statistical Analysis and Data Mining, Vol. 4, 346-374.
- Koepke, H., Hu, Z. and Clarke, B. (2013) EASYSTAB: an R package for assessing clustering stability. See https://cran.r-project.org/src/contrib/Archive/easystab/.
- Clarke, B. (2013) Guest Editorial for Special Issue of Statistical Analysis and Data Mining Vol. 6, No. 4., 271-272.
- Clarke, B. and Holt, G. (2013) Comment on ‘Nonparametric Bayes Inference – Why and How’ by Mueller and Mitra. Bayesian Analysis Vol. 8, 329-331.
- Clarke, B. and Clarke, J. (2012) ‘How to Predict in Several Conventional Settings.’ Statistics Surveys, Vol. 6, 1-73.
- Clarke, B. (2012) Invited comment on ‘Catching up faster by switching sooner’ by van Erven et al. Journal of the Royal Statistical Society Series B Vol. 74, 47-50.
- Clarke, B. (2012) Invited comment on ‘Universality of Bayes predictions’ by Sancetta. Bayesian Analysis Vol. 7, 37-43.
- Koepke, H. and Clarke, B. (2011) “On The Limits of Clustering in High Dimensions via Cost Functions.” Stat. Anal. and Data Mining, Vol. 4, 30-53.
- Fokoué, E. and Clarke, B. (2011) “Variance Bias Tradeoff for Prequential Model List Selection.” Stat. Papers, Vol. 52, 813-833.
- Clarke, B. and Severinski, C. (2011) Invited comment on ‘Shrink globally, act locally’ by Polson and Scott. Proceedings of the IX Valencia Conference on Bayesian Statistics, Bernardo, J. M. et al. Eds. 523-528. Oxford Univ. Press. Title: Subordinators, Adaptive Shrinkage and a Prequential Comparison of Three Sparsity Methods.
- Clarke, B. (2011) Comment on a paper by J. M. Bernardo. Proceedings of the IX Valencia Conference on Bayesian Statistics, Bernardo, J. M. et al. Eds. 30-32, Oxford Univ. Press, Oxford. Title: Integrated Analysis = Benchmark Analysis.
(2001-2010)
- Clarke, B. (2010) “Desiderata for a Predictive Theory for Statistics”. Bayesian Analysis, Vol. 5, No. 2, 283-318.
- Yu, C-W and Clarke, B. (2010) “Asymptotics of Bayesian Median Loss Estimation.” J. Mult. Analysis, Vol. 101, No. 9, 1950-1958.
- Clarke, J., Seo, P. and Clarke, B. (2010). “Statistical expression deconvolution from mixed tissue samples.” Bioinformatics, Vol. 26, No. 8, 1043-1049.
- Clarke, B. and Yuan, A. (2010) “Reference Priors for Empirical Likelihoods.” In: Frontiers of Statistical Decision Making and Bayesian Analysis. Co-Editors: Chen, M., Dey, D., Mueller, P., Sun, D. and Ye, K. Springer, New York, p. 56-68.
- Yu, C-W and Clarke, B. (2010) “Median Loss Decision Theory.” J. Stat. Planning and Inference, Vol. 141, 611-623.
- Severinski, C., Clarke, B., Fokoué, E., and Zhang, H. (2010) Solutions manual for ‘Principles and Theory for Data Mining and Machine Learning’ by Clarke, Fokoué, and Zhang. (e-copy for instructors only) Springer, New York. (450 p.)
- Clarke, J. and Clarke, B. (2009) “Prequential Analysis of Complex Data with Adaptive Model Reselection”. Stat. Analysis and Data Mining, Vol. 2, No. 4, 274-290.
- Clarke, B., Fokoué, E., and Zhang, H. (2009) Principles and Theory for Data Mining and Machine Learning, Springer, New York. (800 p.)
- Clarke, B. and Clarke J. (2009) Unsubmitted: Thoughts on Refereeing. Bulletin of the IMS Vol. 38, No. 6, 8-9.
- Datta, G., Bhattacharya, A. and Clarke, B. (2008) “Bayesian Tests for the Zero Inflated Poisson Model.” In: Beyond Parametrics in Interdisciplinary Research: Festschrift in Honor of P. K. Sen, Balakrishnan, A., Pena, E., and Silvapulle, M., Eds. p. 89-104.
- Clarke, B. and Ghosal, S. (2008). Pushing the Limits of Contemporary Statistics: Contributions in Honor of Professor Jayanta K. Ghosh. IMS Collections Volume 3, Institute of Mathematical Statistics, Beachwood, OH. (330 p.)
- Lin, X., Pittman, J. and Clarke, B. (2007). “Information Conversion, Effective Samples & Parametric Size.” IEEE Transactions on Information Theory, Vol. 53, No. 12, 4438-4456.
- Clarke, B. (2007). “Information Optimality and Bayes Models.” Journal of Econometrics, Vol. 138, No. 2, 405-429.
- Clarke, B. (2007) Comment on ‘Objective Bayesian Analysis for the Multivariate Normal Model’ by Sun and Berger. In: Bayesian Statistics 8: Proceedings of the 8-th Valencia International Meeting on Bayesian Statistics, Bayarri M. et al. Eds. 551-553. Oxford Univ. Press, Oxford.
- Clarke, B. (2007). Statistics: We should be Leading not Serving. Bulletin of the IMS, Vol. 36, No. 6, 8-9.
- Clarke, B. and Yuan, A. (2006). “Closed Form Expressions for Bayesian Sample Sizes”. Annals of Statistics, Vol. 34, No. 3, 1293-1330.
- Clarke, B. and Song, X. (2004). “Approximating the Dependence Structure of Discrete and Continuous Stochastic Processes”. Sankhya A Vol. 66, No. 3, 536-547.
- Wong, H. and Clarke, B. (2004). “Characterizing Model Weights Given Partial Information in Normal Models”. Statistics and Probability Letters, Vol. 68, No. 1, 27-37.
- Wong, H. and Clarke, B. (2004). “Improvement over Bayes Prediction in Small Samples in the Presence of Model Uncertainty”. Canadian Journal of Statistics, Vol. 32, No. 3, 269-283.
- Yuan, A. and Clarke, B. (2004). “Asymptotic Normality of the Posterior Given a Statistic”. Canadian Journal of Statistics, Vol. 32, No. 2, 119-137.
- Clarke, B. and Yuan, A. (2004). “Partial Information Reference Priors: Derivation and Interpretations”. Journal of Statistical Planning and Inference, Vol. 123, No. 2, 313-345.
- Gustafson, P. and Clarke, B. (2004). “A Decomposition for the Posterior Variance”. Journal of Statistical Planning and Inference, Vol. 119, No. 2, 311-327.
- Clarke, B., Mittenthal, J. and Fawcett, G. (2004). “Netscan: a procedure for generating reaction networks by size”. Journal of Theoretical Biology, Vol. 230, No. 4, 591-602.
- Clarke, B. (2003). “Comparing Bayes and Non-Bayes Model Averaging When Model Approximation Error Cannot Be Ignored”. Journal of Machine Learning Research, 4, 683-712.
- Clarke, B., Mittenthal, J. and Fawcett, G. (2002) NETSCAN: Software to list sets of reactions that satisfy a biochemical constraint in order of size of reaction network.
- Clarke, B., Mittenthal, J. and Fawcett, G. (2002). NETSCAN Reaction Network Finder. A manual for the NETSCAN software.
- Mittenthal, J.E., Clarke, B., Waddell, T., and Fawcett, G. (2001). “A New Method for Assembling Metabolic Networks, with Application to the Krebs Citric Acid Cycle.” Journal of Theoretical Biology, Vol. 208, No. 3, 361-382.
- Clarke, B. (2001). “Combining Model Selection Procedures for Online Prediction”. Sankhya, Ser. A, Vol. 63, Part 2, 229-249.
(1988-2000)
- Yuan, A. and Clarke, B. (1999). “An Information Criterion for Likelihood Selection”. IEEE Transactions on Information Theory, Vol. 45, No. 2, 562-571.
- Clarke, B. (1999). “Asymptotic Normality of the Posterior in Relative Entropy”. IEEE Transactions on Information Theory, 45, No. 1, 165-176.
- Yuan, A. and Clarke, B. (1999). “A Minimally Informative Likelihood for Decision Analysis: Illustration and Robustness”. Canadian Journal of Statistics, Vol. 27, No. 3, 649-665.
- Clarke, B. and Sun, D. (1999). "Asymptotics of the Expected Posterior". Annals of the Institute of Statistical Mathematics, Vol. 51, No. 1, 163-185.
- Clarke, B. (1999). Discussion of the papers by Rissanen, and by Wallace and Dowe. The Computer Journal Vol. 42, 338-339.
- Mittenthal, J.E., Yuan, A., Clarke, B., and Scheeline, A. (1998). “Designing Metabolism: Alternative Connectivities for the Pentose Phosphate Pathway”. Bulletin of Mathematical Biology, Vol. 60, 815-856.
- Clarke, B. and Gustafson, P. (1998). “On the overall sensitivity of the posterior distribution to its inputs”. Journal of Statistical Planning and Inference, 71: 137-150.
- Clarke, B. and Sun, D. (1997). “Reference Priors Under the Chi-Square Distance”. Sankhya Series A, Vol. 59, Part II, 215-231.
- Clarke, B., McKay, I., Grigliatti, T., Lloyd, V., Yuan, A. (1996). "A Markov Model for the Assembly of Heterochromatic Regions in Position-Effect Variegation." Journal of Theoretical Biology, 181, 137-155.
- Clarke, B. (1996). "Implications of Reference Priors for Prior Information and Sample Size." Journal of the American Statistical Association, 91, 173-184.
- Clarke, B. and Ghosh, J. K. (1995). “Posterior Convergence Given the Mean.” The Annals of Statistics, 23, 2116-2144.
- Clarke, B. and Barron, A. (1994). “Jeffreys' Prior is Asymptotically Least Favourable Under Entropy Risk.” The Journal of Statistical Planning and Inference, 41, 37-60.
- Clarke, B. and Wasserman, L. (1993). “Non Informative Priors and Nuisance Parameters.” Journal of the American Statistical Association, 88, 1427-1432.
- Mittenthal, J., Clarke, B. and Levinthal, M. (1993). Designing Bacteria. In: Thinking about Biology. W. Stein and F. Varela, Eds. Addison-Wesley, Redwood City, CA, 65-104.
- Barron, A., Clarke, B., and Haussler, D. (1993). Information Bounds for the Risk of Bayesian Predictions and the Redundancy of Universal Codes. Proceedings of the IEEE International Symposium on Information Theory, 54-54, doi: 10.1109/ISIT.1993.748369.
- Clarke, B. (1992) Comment on ‘On the Development of the Reference Prior Method’ by Berger, J. and Bernardo, J. In: Bayesian Statistics 4: Proceedings of the Fourth Valencia International Meeting on Bayesian Statistics, Dawid et al. Eds. 51-52. Clarendon Press, Oxford.
- Clarke, B. (1992) Comment on ‘Non-Informative Priors’ by Ghosh, J.K. and Mukerjee, R. In: Bayesian Statistics 4: Proceedings of the Fourth Valencia International Meeting on Bayesian Statistics, 207-208. Clarendon Press, Oxford.
- Clarke, B. and Mittenthal, J. (1992) Reliability of Networks of Genes. In: Principles of Organization of Organisms, Proceedings Volume 13, Santa Fe Institute Studies in the Sciences of Complexity, 333-336. Addison-Wesley, Reading, Massachusetts.
- Junker, B. and Clarke, B. (1991). Inference from the Product of Marginals of a Dependent Likelihood. Technical Report #508, Department of Statistics, Carnegie-Mellon University.
- Clarke, B. and Barron, A. (1990). "Information Theoretic Asymptotics of Bayes Methods." IEEE Transactions on Information Theory, 36, 453-471.
- Clarke, B. and Barron, A.R. (1988). Information Theoretic Asymptotics of Bayes Methods. Technical Report #26, Department of Statistics, University of Illinois.
- Clarke, B. (1989). Asymptotic cumulative risk and Bayes risk under entropy loss, with applications. PhD Thesis under Andrew Barron, Department of Statistics, University of Illinois.
- Clarke, B., Mittenthal, J. and Arcuri, P. (1988). An Optimality Criterion for Epimorphic Regeneration. Bulletin of Mathematical Biology Vol. 50, 395-434. (My unofficial Master's thesis.)
Biosketch:
Bertrand Clarke earned his PhD in Statistics at the University of Illinois at Urbana-Champaign in 1989. His thesis work received the Browder J. Thompson award for authors under age 30 of papers in IEEE journals. He spent three years as an Assistant Professor at Purdue University before moving to the University of British Columbia, where he worked from 1992 to 2008. His early research focused on asymptotics, prior selection in Bayesian statistics, and mathematical modeling of biological systems. His first sabbatical was at University College London, and his second was at Duke University, where he was a visiting scholar in the ‘Large P Small N’ program at SAMSI. In addition, in 2008 he spent three months at the Newton Institute at Cambridge University. He moved to the University of Miami in 2008 and worked for five years at the medical school, where he started its MS and PhD programs in biostatistics, before coming to chair the Department of Statistics at the University of Nebraska-Lincoln. His current research foci are predictive statistics and statistical methodology for genomic data. He has been an associate editor for four different journals, served three years on the Savage Award Committee (the best-thesis prize in Bayesian statistics), has published numerous papers across several fields, and was made a Fellow of the ASA in 2014. He has also authored a PhD-level textbook on data mining and machine learning for Springer, with a complete solutions manual (available to instructors on request).
Education
PhD in Statistics, University of Illinois at Urbana-Champaign, 1989.