A Brief Introduction to Statistical Data Analysis with Python

May 7, 2021

Online course

In this two-hour course, we will provide an introduction to data analysis and statistics in Python. In particular, we will cover data processing using pandas, statistical analysis using statsmodels, and data visualization using matplotlib and seaborn.

The pandas library provides the means to represent and manipulate data frames. We will introduce how to read data into Python using pandas and how to perform some general data wrangling, including selecting rows and columns by name and by other criteria, applying functions to the selected data, and aggregating the data.

We will also look at general data visualization. The matplotlib library is a low-level plotting library that allows considerable control over the plot, albeit at the price of a considerable amount of low-level code. The seaborn library is built on matplotlib and provides a much higher-level interface to the plot. This allows us to produce complex data visualizations with a minimal amount of code.

Finally, we will introduce how to perform widely used statistical analyses in Python. Here we will focus on statsmodels, which provides many of the most widely used statistical methods.
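As a rough illustration of the pandas wrangling steps mentioned above (selecting rows and columns, applying a function, and aggregating), here is a minimal sketch. The data frame, its column names (`group`, `score`, `age`), and its values are invented for this example and are not taken from the course notebook.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "score": [1.0, 3.0, 2.0, 4.0, 6.0],
    "age": [20, 35, 41, 29, 52],
})

# Select columns by name
scores = df[["group", "score"]]

# Select rows by a criterion (here: age over 30)
older = df[df["age"] > 30]

# Apply a function to the selected data
doubled = older["score"].apply(lambda x: 2 * x)

# Aggregate: mean score within each group
group_means = df.groupby("group")["score"].mean()

print(older)
print(group_means)
```

In practice the data would more likely be read from a file, for example with `pd.read_csv`, rather than typed in by hand as it is here.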

Jupyter notebook

The Jupyter notebook for this course is here. Jupyter notebooks usually render as a web page on GitHub, although this sometimes takes a few reloads. You can also download the notebook and run it in your own Jupyter installation, or upload it and use it on Colab.

Use the notebook online with mybinder

The mybinder service allows you to share Jupyter notebooks so that others can use them (i.e. run them) online directly. Click the button below to access this repo in mybinder:

GitHub resources

Further resources for this training course can be found on GitHub at mark-andrews/intro2pystats.