Subject: STA 250
Title: Topics in Applied and Computational Statistics
Units: 4.0
School: College of Letters and Science LS
Department: Statistics STA
Effective Term: 2006 Spring
- General Description
Learning Activities
Lecture - 3.0 hours
Discussion/Laboratory - 1.0 hours
Description
Resampling, nonparametric and semiparametric methods, incomplete data analysis, diagnostics, multivariate and time series analysis, applied Bayesian methods, sequential analysis and quality control, categorical data analysis, spatial and image analysis, computational biology, functional data analysis, models for correlated data, learning theory. May be repeated for credit with consent of graduate advisor.
Prerequisites
STA 131A; STA 232A recommended, not required.
Repeat Credit
May be repeated for credit with consent of graduate advisor.
Expanded Course Description
Summary of Course Content:
Typically, source material will be selected from the following list of topics:
1. Resampling methods: Jackknife, bootstrap, cross-validation, sample re-use, and other data-adaptive devices.
2. Nonparametric and semiparametric methods: Smoothing procedures, kernel and nearest-neighbor approaches, splines and partial splines, local polynomials, wavelets, local likelihood, generalized additive models, the ACE algorithm.
3. Incomplete data analysis: Missing data, censoring, truncation, biased sampling, non-response in sample surveys, applications of the EM algorithm, and imputation.
4. Diagnostics and other topics: Nonlinear techniques, assessment of influential observations, robust regression, transformations, and diagnostics based on statistical curvature assessment.
5. Multivariate and time series data analysis: Multivariate sampling distributions, projection pursuit, multivariate location and scale, large-dimensional classification with many variables, analysis of correlated data, and correspondence analysis.
6. Applied Bayesian methods: Markov chain Monte Carlo methods, approximations for posterior moments and marginals, numerical integration, Bayesian bootstrap models, and applications.
7. Sequential analysis and quality control: Sequential probability ratio test and extensions, sampling inspection, sequential allocation, cumulative sum (CUSUM) quality control, Dodge and Romig's schemes, control charts, stochastic approximation, group sequential designs.
8. Categorical data analysis: Log-linear models and diagnostics, incomplete contingency tables, sparse tables, high-dimensional tables, collapsibility, and complex sampling designs.
9. Spatial and image analysis: Data compression, image segmentation, independent component analysis, Markov random fields.
10. Computational biology: Bioinformatics, statistical genetics.
11. Functional data analysis: Dimensionality reduction, classification, regression.
12. Models for correlated data: Linear and nonlinear mixed models.
13. Learning theory: Support vector machines, neural networks, boosting, data mining.
Illustrative Reading:
The reading will mainly consist of a synthesis of material taken from journal articles and relevant monographs and books selected by the instructor.
Potential Course Overlap:
There may be overlap with one or more of the courses 138, 145, 232C, and 237AB, depending upon the choice of topics. Potentially there could be overlap with Statistics 251. However, the overlap would be slight, since the main focus of course 250 is on topics not covered in those courses.
- Example: Fall 2016, Statistics on Manifolds
- Prerequisites: No prior knowledge of statistics on manifolds is assumed. The course does expect a background in probability theory and mathematical statistics equivalent to STA 231AB. Specific tools needed to understand the theory include: laws of large numbers for triangular arrays, central limit theorems for triangular arrays, and basic weak convergence theory for random elements. The ability to program in R, or equivalent computing environment, is critical for class assignments.
Course Sketch: The course will begin with statistical methods for the analysis of directional, axial, and shape data, then link these to the rapidly emerging area of methodology for manifold-valued data. Bootstrap methods will be introduced as a rigorous algorithmic technique for overcoming infeasible distribution theory. There is no textbook. A primary source for older material will be
• Mardia, K.V. and Jupp, P. (2000). Directional Statistics. Second edition, Wiley.
Newer ideas will be drawn from the literature and from
• Patrangenaru, V. and Ellingson, L. (2016). Nonparametric Statistics on Manifolds and Their Applications to Object Data Analysis. CRC Press.
Relevant papers include
• Beran, R. and Fisher, N.I. (1998). Nonparametric comparison of mean directions or mean axes. Annals of Statistics 26, 472–493.
• Bhattacharya, R. and Patrangenaru, V. (2003). Large sample theory of intrinsic and extrinsic sample means on manifolds I. Annals of Statistics 31, 1–29.
• Bhattacharya, R. and Patrangenaru, V. (2005). Large sample theory of intrinsic and extrinsic sample means on manifolds II. Annals of Statistics 33, 1225–1259.
Labs: Lab assignments will be due in the Discussion Section on Thursdays. These computational labs will code course methodologies from first principles.
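To illustrate the flavor of a from-first-principles lab exercise, here is a minimal percentile-bootstrap interval for the mean direction of circular data (written in Python for the automated check; the course itself uses R). The von Mises sample and all parameter values are hypothetical, not course material:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_direction(theta):
    """Circular mean: the angle of the resultant of the unit vectors (cos t, sin t)."""
    return np.arctan2(np.sin(theta).mean(), np.cos(theta).mean())

# Hypothetical directional data: 100 angles concentrated around pi/4.
theta = rng.vonmises(mu=np.pi / 4, kappa=4.0, size=100)

# Nonparametric bootstrap: resample the angles with replacement and
# recompute the mean direction for each resample.
B = 2000
boot = np.empty(B)
for b in range(B):
    boot[b] = mean_direction(rng.choice(theta, size=theta.size, replace=True))

# Percentile interval (valid here because the replicates stay well away
# from the +/- pi wrap-around; data near the cut would need extra care).
lo, hi = np.quantile(boot, [0.025, 0.975])
print(f"mean direction: {mean_direction(theta):.3f}, 95% CI: ({lo:.3f}, {hi:.3f})")
```

This is exactly the "rigorous algorithmic technique for overcoming infeasible distribution theory" in its simplest form: no sampling distribution for the mean direction is derived analytically; the bootstrap replicates stand in for it.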
Grading: The course grade will be based on a midterm and on a final project involving theory and computational skills learned during the course.
Software: The site http://cran.r-project.org offers binaries and documentation for R. The Art of R Programming by Norman Matloff (2011) describes the use of R as a programming language.
- Example: Winter 2017, Data, Computing, and Sciences
- GOAL: The goal of this proposed STA 250 course is to re-examine the foundations of statistics from the perspectives of data structures and the subject sciences. In particular, the direction of system sciences will be the focus.
Ideas: As data become abundant, the structural information content embedded within the data becomes the primary concern. The geometry of this information content is tied to the subject sciences from which the data are derived. Therefore there is a critical need to realistically link data to the subject-matter sciences. If statistics is to play an important role in this linkage, many aspects of its foundations need to be revised. For instance, the independence, linearity, and homogeneity assumptions have to be handled properly under realistic settings. Many concepts, such as bootstrapping, modeling, and asymptotic theories, need to be revisited and even revised, or given up completely.
Computations: The key computing is centered on matrix data. The computational algorithms are ones I have developed myself, including Hierarchical Factor Segmentation (HFS), Data Cloud Geometry (DCG), and Data Mechanics, together with some by others available in the literature. The data types considered in this course range from a single data matrix or a single network, to time series of matrices, and then to longitudinal ensembles of matrices prescribing a whole system over time.
Topics:
1) Why are Principal Component Analysis, multidimensional scaling, hierarchical clustering, spectral clustering, and diffusion maps not effective clustering methodologies?
2) What is wrong with classic statistical methodologies such as logistic regression, linear regression, ANOVA, etc.? Why are bootstrapping and Dynamic Factor Analysis (DFA) not the right concepts for system sciences?
3) How can all the aforementioned problems be fixed with data-driven computations?
4) What are the principles for exploring structural patterns in data from a system? Why is making discoveries the aim of Data Science?
- Example: Winter 2017, Change-Point Analysis
- This course will cover classic works and recent advances on change-point analysis, both offline change-point analysis (also referred to as segmentation), which divides a completely observed sequence into homogeneous temporal or spatial segments, and online/sequential change-point analysis, which on-the-fly detects changes in sequentially observed data. Change-point detection is a classic problem and was extensively studied for the univariate case. However, with advances in technologies, the collection of massive data becomes feasible and new challenges arise as the dimension of the observations in the sequence becomes higher or the observations are even non-Euclidean. This triggers a recent wave of change-point method developments for modern data.
This course will first prepare students with standard techniques for change-point problems, which are also useful for studying modern change-point problems. This course will then explore recent developments on change-point detection for multivariate data and for non-Euclidean data. The detailed contents are listed below.
Part I: Traditional change-point analysis for univariate data. In this part, the main concepts of offline and online change-point analysis will be covered. It will first start with the easiest scenario, in which the observations follow the Gaussian distribution with known variance and the change is a mean shift. Classical procedures will be covered as well as major theoretical results. Then, the scenario will be relaxed to the Gaussian distribution with unknown variance and then to other distributions. Some recent applications on genomic data will be covered.
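The known-variance Gaussian mean-shift setting admits a compact illustration. The sketch below (in Python; an illustration of the standard idea, not course material) computes the standardized two-sample statistic at every candidate split of a single sequence and estimates the change point as the maximizer; the simulated sequence and all parameter values are hypothetical:

```python
import numpy as np

def single_changepoint(x, sigma=1.0):
    """Offline detection of a single mean shift, Gaussian with known variance.

    For each candidate split t, the standardized two-sample statistic
        Z_t = sqrt(t*(n-t)/n) * (xbar_{1:t} - xbar_{t+1:n}) / sigma
    is N(0,1) under the no-change null; the estimate is the argmax of |Z_t|,
    and max_t |Z_t| can be compared with a threshold to decide whether a
    change occurred at all.
    """
    n = len(x)
    s = np.cumsum(x)
    t = np.arange(1, n)                  # candidate split points 1..n-1
    left = s[:-1] / t                    # running means of x[1..t]
    right = (s[-1] - s[:-1]) / (n - t)   # means of x[t+1..n]
    z = np.sqrt(t * (n - t) / n) * (left - right) / sigma
    t_hat = t[np.argmax(np.abs(z))]
    return t_hat, np.max(np.abs(z))

rng = np.random.default_rng(1)
# Mean shift of size 1 at time 100 in a sequence of length 200.
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(1, 1, 100)])
t_hat, z_max = single_changepoint(x)
print(t_hat, z_max)   # t_hat should land near the true change point 100
```

The same scan statistic, maximized over t and calibrated by its null distribution, is the backbone of many of the classical procedures covered in this part.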
Part II: Change-point analysis for multivariate data.
This part will consist of two sub-topics: (i) the multiple sequences are independent, and (ii) the multiple sequences are not independent. We focus on parametric methods in this part. For scenario (i), current methods can deal with quite high dimensions, and we will cover ways of effectively integrating information from the multiple sequences. For scenario (ii), current methods can deal with low-dimensional data. Some successful examples of both scenarios from recent studies in genomic data and multiple-sensor problems will be covered.
Part III: Recent advances on change-point analysis for high-dimensional data/non-Euclidean data. In this part, non-parametric methods will be covered, namely, methods based on similarity of the observations, which can be applied to data in arbitrary dimension and to non-Euclidean data. The main techniques, which are quite different from parametric approaches, will be covered. Applications on modern data, such as network data, will be discussed.
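A minimal sketch of the similarity-based idea, using an energy-distance scan over pairwise Euclidean distances (one possible distance-based statistic, chosen for brevity; the specific methods covered in the course, such as graph-based edge-count statistics, differ in detail). The 5-dimensional example data are hypothetical:

```python
import numpy as np

def energy_changepoint(x, min_seg=10):
    """Scan for a single change point with a distance-based two-sample statistic.

    For each split t, compare the two segments via the (V-statistic version
    of the) energy distance computed from pairwise Euclidean distances:
        E_t = 2*E|X-Y| - E|X-X'| - E|Y-Y'|,
    weighted by t*(n-t)/n. No parametric model is assumed, and the scan
    works for observations of arbitrary dimension.
    """
    n = len(x)
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)  # pairwise distances
    best_t, best_stat = None, -np.inf
    for t in range(min_seg, n - min_seg):
        between = d[:t, t:].mean()
        within1 = d[:t, :t].mean()
        within2 = d[t:, t:].mean()
        stat = (t * (n - t) / n) * (2 * between - within1 - within2)
        if stat > best_stat:
            best_t, best_stat = t, stat
    return best_t, best_stat

rng = np.random.default_rng(2)
# 5-dimensional observations with a mean shift at time 60.
x = np.vstack([rng.normal(0, 1, (60, 5)), rng.normal(0.8, 1, (60, 5))])
t_hat, stat = energy_changepoint(x)
print(t_hat, stat)
```

Because only pairwise distances enter the statistic, the same scan applies unchanged to non-Euclidean observations (networks, shapes, etc.) once a suitable distance between observations is supplied in place of `np.linalg.norm`.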