STA 15B Introduction to Statistical Data Science II


 

Learning outcomes:
1. Learn the basics of the statistical programming language R: data handling, data manipulation, and visualization tools
2. Understand the principles of graphical integrity and identify distortion in visualizations
3. Understand the concept of "data dimension"
4. Understand the concepts of correlation and regression
5. Learn the basics of classification and clustering

 

Course content:
1. Basics of the R programming environment.
2. Tools for handling and manipulating different types of data -- tabular, spreadsheet and text data.
3. Basic data visualization tools for

  • continuous univariate data
  • continuous multivariate data
  • categorical data

4. Graphical integrity – common sources of distortion in plots.
5. Concepts of statistical association, correlation, and linear regression.
6. Concepts of classification of labeled data.
7. Concepts of clustering of unlabeled data.

Illustrative Reading:
1. Freedman, D., Pisani, R. and Purves, R. (2007). Statistics, 4th Edition. W. W. Norton and Company.
2. Ramsey, F. and Schafer, D. (2012). The Statistical Sleuth: A Course in Methods of Data Analysis, 3rd Edition. Cengage Learning.
3. Bruce, P. and Bruce, A. (2017). Practical Statistics for Data Scientists: 50 Essential Concepts. O'Reilly Media.
4. Wickham, H. and Grolemund, G. (2017). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly Media.

Potential Overlap:
There is some overlap with STA 032 and STA 100, but data visualization is covered in significantly more detail in this course. There is also some overlap with PSC/SOC/POL 012Y; however, the scope of this course is broader, as it encompasses not just data visualization but also core techniques of data analysis associated with different types of data in a wide variety of fields in the natural and social sciences.

History:
None