Statistical Inference Bayes Impact 2014
arranged by Daniel Korenblum
The Inference Problem: estimation of an unknown quantity
Inference/Estimation Subject Areas (Problem -> Solution / Method -> Algorithm / Statistic):

Point Estimation
- Maximum Likelihood (ML): Gradient Descent
- Minimum-Variance Unbiased (MVU) Estimator: Least Squares
- Maximum a Posteriori (MAP/GMLE): Gradient Descent
- Posterior Mean (PM): Markov-chain Monte Carlo (MCMC)

Error Bars / Confidence (Estimator Error)
- Confidence Interval / Region: Covariance/Information, Resampling
- Credible Interval / Region: Evidentiary Credible Region (2014)

Classification and Clustering (Pattern Recognition)
- Unsupervised Learning: Cluster Analysis
- (Semi-)Supervised Learning: Discriminant, Generative, SVM, kNN, trees
- Feature Selection: Ranking, Filtering, Greedy, Sparse

Model Selection
- Hypothesis Testing: Significance Tests (Holy Trinity)
- Model Evidence: Marginal Likelihood
- Frequentist inference as an optimization problem: maximize the likelihood over all observations
- Bayesian inference as distribution estimation: the posterior distribution estimate is "the inference"
- Decision theory can be used to derive estimates from posteriors by minimizing decision risk/loss
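These three viewpoints can be illustrated on a single coin-flip dataset. This is an assumed Bernoulli/Beta example, not from the slides: the MLE is the point maximizer of the likelihood, the Bayesian answer is a full posterior, and decision theory recovers a point estimate (here the posterior mean, which minimizes expected squared-error loss).

```python
# Assumed example: 7 heads in 10 flips of a coin with unknown bias theta.
import numpy as np
from scipy.stats import beta

k, n = 7, 10

# Frequentist: maximize the likelihood L(theta) = theta^k (1-theta)^(n-k)
theta_mle = k / n                         # closed-form maximizer

# Bayesian: with a Beta(1, 1) prior the posterior is Beta(1 + k, 1 + n - k);
# the posterior distribution itself is "the inference"
posterior = beta(1 + k, 1 + n - k)

# Decision theory: under squared-error loss, the estimate minimizing the
# posterior expected loss is the posterior mean (checked here on a grid)
grid = np.linspace(0.001, 0.999, 999)
pdf = posterior.pdf(grid)
expected_loss = [np.sum((grid - t) ** 2 * pdf) * 0.001 for t in grid]
theta_bayes = grid[int(np.argmin(expected_loss))]

print(theta_mle, posterior.mean(), theta_bayes)
```

The grid search confirms numerically that the risk-minimizing point estimate coincides with the posterior mean.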
Scope and Outline

Topics covered
1. Likelihood models and model comparison
2. Frequentist and Bayesian approaches
   2.1. Frequentist Inference
      2.1.1. Analytic: set the derivative of the sample log-likelihood equal to zero and solve
      2.1.2. Numerical: use local or global optimization algorithms (e.g. steepest descent)
   2.2. Bayesian Inference
      2.2.1. Choose a prior distribution
      2.2.2. The product of the likelihood and the prior yields the unnormalized posterior distribution
      2.2.3. Select an objective / risk / loss and minimize its expected value over the posterior
3. Statistics and algorithms
   3.1. Regression: using the noise distribution to choose an appropriate objective / risk / loss
   3.2. Estimator error: the bias-variance trade-off; a small bias can reduce variance and MSE
   3.3. Classification: choosing between generative, discriminative, and discriminant approaches
Topics not covered
1. Stochastic process models / methods (e.g. Markov models)
2. Time series analysis / 1-D signal processing; multidimensional signal processing
3. Black/gray box models (e.g. artificial neural networks, decision trees, ensembles)
4. Information-theoretic approaches (maximum entropy, mutual information, K-L divergence)
5. Control theory, duality theory, convex analysis, global optimization
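Step 2.1.1 (analytic maximum likelihood) can be made concrete with the standard Gaussian-mean derivation, a worked example added here for concreteness:

```latex
% Analytic MLE for the mean of a Gaussian sample x_1,\dots,x_n (known \sigma^2):
\log L(\mu) = -\frac{n}{2}\log(2\pi\sigma^2)
              - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2
\qquad
\frac{d \log L}{d\mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu) = 0
\;\Rightarrow\;
\hat{\mu}_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}
```

Setting the derivative of the sample log-likelihood to zero recovers the sample mean.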
Introduction to Statistical Inference
Frequentist Inference: likelihood theory (Fisher ~1920)
Likelihood Theory

Likelihood functions are not probability density functions: the integral of a likelihood function is not, in general, equal to 1. The distinction is one of which argument varies: in the density p(x; θ) the parameter θ is fixed and the data x are variable, while in the likelihood L(θ; x) the data x are fixed and the parameter θ is variable.
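A quick numerical check, using an assumed Bernoulli example: the likelihood of k heads in n flips, viewed as a function of θ, integrates to B(k+1, n-k+1), not to 1.

```python
# Assumed example: likelihood L(theta) = theta^k (1 - theta)^(n - k)
# does not integrate to 1 over theta.
import numpy as np
from scipy.special import beta as beta_fn

k, n = 3, 10
theta = np.linspace(0.0, 1.0, 100001)
likelihood = theta ** k * (1.0 - theta) ** (n - k)

integral = np.sum(likelihood) * (theta[1] - theta[0])   # Riemann sum
exact = beta_fn(k + 1, n - k + 1)                       # = k!(n-k)!/(n+1)!

print(integral, exact)   # both ~7.58e-4, far from 1
```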
Frequentist Inference & Decision Theory
Frequentist Risk/Loss Function:
Frequentist Risk Example: Squared Error
Frequentist Decision Theoretic Objective
Bayesian Inference: posterior distribution & minimum risk/loss
Bayesian Conditional Distributions
Bayesian Update, Inverse Problems
Prior Function and Regularization Term
Bayesian Posterior Loss
When the prior is improper, an estimator that minimizes the posterior expected loss is called a generalized Bayes estimator.
Risk/Loss and Regularization Functions
Risk/Loss Functions and Derivatives
http://dl.acm.org.oca.ucsc.edu/citation.cfm?id=1281270
Point Estimation: maximum likelihood, least squares
Maximum Likelihood Estimation (MLE)
Linear Regression / Least Squares
Orthogonal Projections & Least Squares
http://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)#Properties_of_the_least-
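The orthogonal-projection property can be verified directly: at the least-squares solution, the residual is orthogonal to the column space of the design matrix. A minimal numpy sketch with assumed synthetic data:

```python
# Least squares as orthogonal projection (assumed synthetic data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                        # design matrix
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

# Solve the normal equations X^T X w = X^T y (via lstsq for stability)
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# The residual is orthogonal to the columns of X: X^T (y - X w) = 0
residual = y - X @ w
print(np.abs(X.T @ residual).max())                 # ~0 up to floating point
```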
Nonlinear Regression / Least Squares
Generalized Linear Models
Maximum Likelihood Noise Dependence
MLE Estimator: Gamma Distribution
MVU Estimator: Mean of Uniform Noise
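One classic setup for this slide (assumed here, since the exact example isn't shown): estimating the mean θ/2 of Uniform(0, θ) noise. Both the sample mean and the rescaled maximum (n+1)/(2n)·max(x) are unbiased, but the maximum-based estimator, the MVU estimator for this problem, has much smaller variance.

```python
# Assumed example: mean of Uniform(0, theta) noise, true mean = theta/2.
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 1.0, 10, 20000
x = rng.uniform(0.0, theta, size=(reps, n))

sample_mean = x.mean(axis=1)                 # unbiased, var = theta^2 / (12 n)
mvu = (n + 1) / (2 * n) * x.max(axis=1)      # unbiased, var = theta^2 / (4 n (n+2))

print(sample_mean.mean(), mvu.mean())        # both ~0.5
print(sample_mean.var(), mvu.var())          # the MVU variance is smaller
```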
Posterior Mean and Maximum a Posteriori
Median Posterior Density
Example: Changepoint Detection
Example: Changepoint Detection
PM Example: Bayesian Prediction
Error Bars / Uncertainty: Fisher information, confidence regions
Negative Log-Likelihood & Uncertainty
Likelihood Geometry and Contours
Score Function & Fisher Information
Fisher Information / Precision
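A numeric sanity check on an assumed Bernoulli example: the observed information is the negative second derivative of the log-likelihood, and its expectation is the Fisher information I(θ) = n / (θ(1-θ)) for n trials. With k = nθ the two coincide exactly.

```python
# Assumed Bernoulli example: observed vs. Fisher information.
import numpy as np

def loglik(theta, k, n):
    return k * np.log(theta) + (n - k) * np.log(1.0 - theta)

k, n, theta, h = 6, 20, 0.3, 1e-5          # k = n * theta here

# central finite-difference second derivative of the log-likelihood
d2 = (loglik(theta + h, k, n) - 2 * loglik(theta, k, n)
      + loglik(theta - h, k, n)) / h**2
observed_info = -d2
analytic = k / theta**2 + (n - k) / (1.0 - theta) ** 2   # = n/(theta(1-theta))

print(observed_info, analytic)
```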
Estimator Error
Proof of the Cramér-Rao lower bound: http://ens.ewi.tudelft.nl/Education/courses/et4386/Slides/01.estimation.pdf
Bayesian Mean Squared Error
Bayesian Minimum Mean Squared Error
Classification: cluster analysis, supervised learning
Bayesian Classification
Bayes Classifier Risk/Loss
Bayesian Classifier Decision Error
Bayesian Classifier Posterior Density
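The Bayes classifier picks the class with the largest posterior, p(c | x) ∝ π_c · p(x | c). A minimal sketch on an assumed two-class Gaussian example (equal priors and variances, so the decision boundary sits at x = 0):

```python
# Assumed two-class Gaussian example for the Bayes classifier.
import numpy as np
from scipy.stats import norm

priors = {0: 0.5, 1: 0.5}
likelihoods = {0: norm(loc=-1.0, scale=1.0), 1: norm(loc=+1.0, scale=1.0)}

def posterior(x):
    # p(c | x) = prior * likelihood, normalized over classes
    joint = {c: priors[c] * likelihoods[c].pdf(x) for c in priors}
    z = sum(joint.values())
    return {c: joint[c] / z for c in joint}

def classify(x):
    post = posterior(x)
    return max(post, key=post.get)

print(classify(-0.5), classify(0.5))   # -> 0 1
```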
Example: Support Vector Machine
Classifier Comparison Example
Feature Selection: ranking, filtering, greedy, sparse, hybrid
Introduction to Feature Selection
Feature Selection Approaches
Filtering / Subset Selection Algorithms
Exhaustive Search & Zero-norm Penalty
Basis Pursuit / LASSO / Elastic Net
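The LASSO objective can be minimized with ISTA (proximal gradient descent with soft-thresholding), one standard solver among several; a sketch on assumed synthetic data with a sparse ground truth:

```python
# Sketch: solve  min_w 0.5 * ||y - X w||^2 + lam * ||w||_1  by ISTA
# (assumed synthetic data; lam chosen by hand, not tuned).
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 8
X = rng.normal(size=(n, p))
w_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0])    # sparse ground truth
y = X @ w_true + 0.1 * rng.normal(size=n)

lam = 5.0
step = 1.0 / np.linalg.eigvalsh(X.T @ X).max()      # 1/L, L = Lipschitz constant

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

w = np.zeros(p)
for _ in range(2000):
    grad = X.T @ (X @ w - y)                        # gradient of the smooth part
    w = soft_threshold(w - step * grad, step * lam) # proximal step for the L1 part

print(np.round(w, 2))   # nonzero only in the first two coordinates
```

The soft-threshold step is what produces exact zeros, which is the "sparse" behavior the slide groups under basis pursuit / LASSO.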
Cluster Analysis: also known as unsupervised learning
Introduction to Cluster Analysis
Cluster Analysis Algorithm Categories

Hierarchical
- Crisp: agglomerative clustering
- Fuzzy: hierarchical unsupervised fuzzy clustering (Geva 1999)

Non-hierarchical
- Crisp: k-means
- Fuzzy: spectral clustering, fuzzy k-means
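The non-hierarchical crisp case can be sketched with Lloyd's k-means algorithm on assumed toy data (two well-separated blobs; initial centers taken deterministically, one from each half, to keep the sketch reproducible):

```python
# Lloyd's k-means on assumed toy data (numpy only).
import numpy as np

rng = np.random.default_rng(3)
data = np.vstack([rng.normal(-3.0, 0.5, size=(50, 2)),
                  rng.normal(+3.0, 0.5, size=(50, 2))])

k = 2
centers = data[[0, 50]].astype(float)   # one seed point from each blob
for _ in range(20):
    # assignment step: each point joins its nearest center
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # update step: each center moves to its cluster mean
    centers = np.array([data[labels == j].mean(axis=0) for j in range(k)])

print(np.sort(centers[:, 0]))   # x-coordinates near -3 and +3
```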
Hierarchical Agglomerative Clustering
Clustering Algorithm Comparisons
Model Selection: cross-validation, LR, ICs, model evidence
Parsimony and Occam’s Razor
Cross-Validation
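A minimal k-fold cross-validation sketch on assumed data: the true signal is linear, so the held-out error should penalize an overly flexible polynomial.

```python
# 5-fold cross-validation for polynomial degree (assumed synthetic data).
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, size=60)
y = 2.0 * x + 1.0 + 0.3 * rng.normal(size=60)   # truly linear signal

def cv_mse(degree, folds=5):
    idx = np.arange(len(x))
    errors = []
    for f in range(folds):
        test = idx % folds == f                 # held-out fold
        coeffs = np.polyfit(x[~test], y[~test], degree)
        errors.append(np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2))
    return float(np.mean(errors))

scores = {d: cv_mse(d) for d in (1, 3, 9)}
print(scores)   # the linear model should generalize best here
```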
Likelihood Ratio Test for Nested Models
Akaike & Bayesian Information Criteria
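Both criteria penalize the maximized log-likelihood by the parameter count: AIC = 2k - 2 ln L̂ and BIC = k ln n - 2 ln L̂. A sketch on an assumed Gaussian example comparing a fixed N(0, 1) model against one with a fitted mean and standard deviation:

```python
# Assumed Gaussian example for AIC/BIC (lower scores are better).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, size=200)       # data really are N(0, 1)
n = len(x)

def gauss_loglik(x, mu, sigma):
    return float(np.sum(norm.logpdf(x, mu, sigma)))

# model A: fixed N(0, 1), zero fitted parameters
llA, kA = gauss_loglik(x, 0.0, 1.0), 0
# model B: MLE-fitted mean and sd, two parameters
llB, kB = gauss_loglik(x, x.mean(), x.std()), 2

aic = {"A": 2 * kA - 2 * llA, "B": 2 * kB - 2 * llB}
bic = {"A": kA * np.log(n) - 2 * llA, "B": kB * np.log(n) - 2 * llB}
print(aic, bic)
```

Since the data truly come from the simpler model, the penalty terms usually make both criteria prefer model A, and BIC's ln n penalty is the harsher of the two.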
Deviance Information Criterion
Model Selection Discussion
Bayesian Model Selection
Bayes Factors & Bias-Variance Tradeoffs
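The model evidence integrates the likelihood over the prior, and a Bayes factor is the ratio of two evidences. A worked sketch on an assumed beta-binomial example, where the marginal likelihood has the closed form m(k) = C(n, k) · B(a+k, b+n-k) / B(a, b):

```python
# Assumed beta-binomial example: evidence and Bayes factor.
import numpy as np
from scipy.special import betaln, comb

def log_evidence(k, n, a, b):
    # log of C(n, k) * B(a + k, b + n - k) / B(a, b)
    return np.log(comb(n, k)) + betaln(a + k, b + n - k) - betaln(a, b)

k, n = 8, 10
m_fair = n * np.log(0.5) + np.log(comb(n, k))   # point model: theta = 0.5
m_beta = log_evidence(k, n, 1.0, 1.0)           # uniform Beta(1, 1) prior

bayes_factor = np.exp(m_beta - m_fair)
print(bayes_factor)   # ~2.07: 8/10 heads mildly favor the flexible model
```

Under the uniform prior the evidence works out to 1/(n+1) for any k, a small sanity check on the formula.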
Bayesian Model Selection Example
Philosophy: interpretations, debates, and paradoxes
Bertrand Paradox
Bertrand Paradox: Jaynes’ Solution
Bertrand Paradox: Disambiguation
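The paradox is easy to reproduce by simulation: the probability that a random chord of the unit circle is longer than √3 (the side of the inscribed equilateral triangle) depends on how "random chord" is sampled, which is exactly the ambiguity Jaynes' argument resolves.

```python
# Monte Carlo sketch of the Bertrand paradox (three chord-sampling rules).
import numpy as np

rng = np.random.default_rng(6)
N = 200000
L = np.sqrt(3.0)   # side length of the inscribed equilateral triangle

# 1) random endpoints: chord length 2 sin(delta/2)  ->  P = 1/3
delta = np.abs(rng.uniform(0, 2 * np.pi, N) - rng.uniform(0, 2 * np.pi, N))
p1 = np.mean(2 * np.sin(delta / 2) > L)

# 2) random radius: chord midpoint uniform along a radius  ->  P = 1/2
d = rng.uniform(0, 1, N)
p2 = np.mean(2 * np.sqrt(1 - d ** 2) > L)

# 3) chord midpoint uniform over the disk  ->  P = 1/4
r = np.sqrt(rng.uniform(0, 1, N))   # radial CDF for uniform area
p3 = np.mean(2 * np.sqrt(1 - r ** 2) > L)

print(p1, p2, p3)   # ~0.333, ~0.500, ~0.250
```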
References

Lecture Notes
- Figueiredo (2004), Lecture Notes on Bayesian Estimation and Classification
- Martinez et al., Estimation and Detection

Books
- Bishop (2006), Pattern Recognition and Machine Learning
- Hastie et al. (2009), The Elements of Statistical Learning
- MacKay (2012), Information Theory, Inference, and Learning Algorithms

Wiki
- http://wikipedia.org