Introduction to Python for data analysis
Information
The estimated time to complete this training module is 4h.
The prerequisites to take this module are:
- You should already have everything installed for this module!
- We will be using Jupyter Notebook, a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.
If you have any questions regarding the module content please ask them in the relevant module channel on the school Discord server. If you do not have access to the server and would like to join, please send us an email at school [dot] brainhack [at] gmail [dot] com.
Before starting:
- Open the terminal
- Type jupyter notebook
- If you’re not automatically directed to a webpage, copy the URL printed in the terminal and paste it into your browser
- Once on the webpage, click “New” in the top-right corner and select “Python 3”
- You now have a Jupyter notebook open
The little box you see once in the notebook is called a “cell.” You can enter (multi-line) code by typing in the cell, and then run the code by pressing “Shift+Enter.” To create a new cell, press the “+” button on the left-hand side of the toolbar at the top of the screen!
You are ready for this tutorial, and you are strongly encouraged to type along with the presentation!
Resources
This module was presented by Ross Markello during the QLSC 612 course in 2020.
All the tutorial notes related to the video below are available here.
Exercises
For this part, we will use the famous scikit-learn dataset iris, which consists of 3 different types of irises (Setosa, Versicolour, and Virginica) with information about petal and sepal length and width stored in a 150x4 numpy.ndarray.
Before starting to write some code, you want to set up the environment so that it loads the required modules. In a Jupyter Notebook, import the following libraries:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
%matplotlib inline
Load the iris dataset
iris = load_iris()
Explore the dataset using .keys()
Print the shape and type of ‘data’
Store ‘data’ and ‘feature_names’ in distinct variables
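One possible way to work through the loading and exploration steps above is sketched below. It assumes scikit-learn is installed and relies on load_iris() returning a dictionary-like object; the variable names data and feature_names are only a suggestion.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# load_iris() returns a dictionary-like object; list what it contains
print(iris.keys())

# 'data' holds the measurements: one row per flower, one column per feature
data = iris["data"]
print(data.shape)
print(type(data))

# 'feature_names' holds the column labels that go with 'data'
feature_names = iris["feature_names"]
print(feature_names)
```

Printing the shape and type first is a quick sanity check that the array matches the 150x4 description before you build anything on top of it.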
Create a pandas dataframe with ‘data’ and use ‘feature_names’ for column names
Get the summary statistics for this dataframe using .describe()
Subset the dataframe to keep only the first 50 rows
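The dataframe steps above could look roughly like this. It is a minimal sketch rather than the reference solution: the names df, summary and first_50 are illustrative, and .iloc[:50] is just one of several ways to take the first 50 rows.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Build the dataframe: measurements as rows, feature names as column labels
df = pd.DataFrame(iris["data"], columns=iris["feature_names"])

# Count, mean, std, min, quartiles and max for every column
summary = df.describe()
print(summary)

# Keep only the first 50 rows (df.head(50) gives the same result)
first_50 = df.iloc[:50]
print(first_50.shape)
```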
Try to answer this question using the entire dataframe: Are there any extreme sepal length values?
- Reminder: a value is extreme if it lies more than 3.9 standard deviations from the mean, where the standardized value is (value - mean) / std. For this one, you might need to use a for loop.
What about other features of the flowers? Try automating the previous operation by writing a function named find_extreme_values()
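A sketch of what find_extreme_values() might look like follows. Several choices here are assumptions, not part of the exercise statement: the signature (df, column, threshold), the use of the absolute standardized value so that unusually small values are flagged as well as unusually large ones, and pandas' default sample standard deviation (ddof=1). The comparison inside one column is vectorized, and the for loop runs over the columns.

```python
import pandas as pd
from sklearn.datasets import load_iris


def find_extreme_values(df, column, threshold=3.9):
    """Return the rows of df whose standardized value in `column`
    is larger than `threshold` in absolute value."""
    values = df[column]
    z_scores = (values - values.mean()) / values.std()
    return df[z_scores.abs() > threshold]


iris = load_iris()
df = pd.DataFrame(iris["data"], columns=iris["feature_names"])

# Apply the same check to every feature of the flowers
for column in df.columns:
    extremes = find_extreme_values(df, column)
    print(f"{column}: {len(extremes)} extreme value(s)")
```

Keeping the threshold as a parameter makes it easy to see how the answer changes with a looser cut-off such as 3.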
Read about the boxplot function in matplotlib to get familiar with python documentation. What does it tell us? https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.boxplot.html
Use this function to draw a boxplot of the distribution of each feature. Try adding a title and axis labels.
Save the dataframe in CSV format and the plot as a PNG.
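The plotting and saving steps can be sketched as follows. The file names iris.csv and iris_boxplot.png, the title, and the axis labels are placeholders chosen for illustration. The example uses the Axes.boxplot method, which takes the same data argument as matplotlib.pyplot.boxplot, and sets the tick labels by hand.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris["data"], columns=iris["feature_names"])

fig, ax = plt.subplots(figsize=(8, 5))

# Given a 2D array, boxplot draws one box per column, at positions 1..N
ax.boxplot(df.to_numpy())
ax.set_xticks(range(1, len(df.columns) + 1))
ax.set_xticklabels(df.columns)

ax.set_title("Distribution of iris features")
ax.set_xlabel("Feature")
ax.set_ylabel("Measurement (cm)")

# Hypothetical output file names; both are written to the working directory
csv_path = "iris.csv"
png_path = "iris_boxplot.png"
df.to_csv(csv_path, index=False)  # index=False leaves out the row numbers
fig.savefig(png_path, dpi=150, bbox_inches="tight")
```

Saving through the figure object (fig.savefig) rather than plt.savefig makes it explicit which figure ends up in the file when several are open in the same notebook.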
Note: The Internet is your best friend. Remember that whenever you are stuck, online resources and blogs (e.g. Stack Overflow) can help you figure it out.
If you are done, you can play around with different functions (e.g. other plotting functions). Try to answer interesting questions you might have using the data.
- Follow up with your local TA(s) to validate you completed the exercises correctly.
- You completed this training module!
More resources
There are hundreds of excellent resources online for learning Python and/or data science. A few good ones:
Codecademy offers interactive programming courses for many languages and tools, including Python and git
A Whirlwind Tour of Python is an excellent intro to Python by Jake VanderPlas; Jupyter notebooks are available here
Another excellent and free online book is Allen Downey’s “Think Python”
Object Oriented Programming in Python 3 (https://realpython.com/python3-object-oriented-programming/)
Jake VanderPlas’s Python Data Science Handbook is also available online as a set of notebooks
Kaggle maintains a nice list of data science and Python tutorials
Neuromatch Academy also has great tutorials available for Python in a computational neuroscience context.
Introduction to Python in French (https://www.youtube.com/watch?v=cjFxd-0idHo)
If you are curious and eager to learn more, you can also try out this tutorial, which inspired much of the content you saw today: Introduction to Python