STA 226: Statistical Methods for Bioinformatics

Subject: STA 226
Title: Statistical Methods for Bioinformatics
Units: 4.0
School: College of Letters and Science (LS)
Department: Statistics (STA)
Effective Term: 2007 Fall

Expanded Course Description

Summary of Course Content: 
1. Introduction to molecular biology. Includes the central dogma (DNA/mRNA/proteins), post-translational modification and structure of proteins, signal transduction, and gene regulation. Methods of probing the genome, transcriptome, proteome, and metabolome.

2. DNA and protein sequence data. Methods of defining and finding sequence similarity, including BLAST. Genes, regulatory regions, and other gene sequence issues. Protein sequence and structure. Evolution and homology. Sequence variation and linkage disequilibrium. Determination of haplotype block structure.

3. Methods of biological analysis. PCR, gene expression arrays, mass spectrometry, and NMR spectroscopy. Separation methods including 2D gel electrophoresis, gas and liquid chromatography. Properties of analytical data. Calibration, error models, correlations in "true values" and measurement error. Transformation and normalization, baseline correction, and other pre-processing issues.

4. Determination of significance for many variables in high-throughput biological data, such as those from gene expression arrays and proteomics by mass spectrometry. Issues in variable-by-variable application of standard statistical methods. Methods of controlling error rates, including control of family-wise error rate and false discovery rate. Accounting for variance heterogeneity and dependence between genes.

5. Experimental design and analysis in biological research: group comparison, factorial, time course, dose-response, survival.

6. Discrimination and classification (supervised learning). Methods of multivariate classification. Methods of variable selection, dimension reduction, and model selection. Parameter, model, and method choice by cross-validation. Estimation of classification performance. Applications will be to high-throughput biological assay data, such as those from gene expression arrays, proteomics by mass spectrometry, and metabolomics by NMR spectroscopy.

7. Clustering (unsupervised learning). Methods for clustering variables and subjects. Evaluating performance of clustering methods.

8. Other and more advanced topics, including linkage, phylogeny, haplotype analysis, graphs as models of biological networks, and hidden Markov models for sequence analysis.

Illustrative Reading: 
Statistical Methods in Bioinformatics by Warren J. Ewens, Gregory R. Grant, Springer-Verlag; 1st edition (April 20, 2001) ISBN: 0387952292

Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology by Dan Gusfield, Cambridge University Press

Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids by Richard Durbin, Sean R. Eddy, Anders Krogh, Graeme Mitchison, Cambridge University Press; Reprint edition (July 1, 1999) ISBN: 0521629713

Genomes by T. A. Brown, Wiley-Liss; 1st edition (May 26, 1999) ISBN: 0471316180

Prerequisite: STA 131C; or consent of instructor; data analysis experience recommended.

Potential Course Overlap: 
None