Reproducible Analysis in Jupyter Notebooks

By Travis Leleu

GitHub Notebook from the video series

Also check out Jake’s awesome, free book: Python Data Science Handbook


Jake Vanderplas has a great video series about reproducible analysis in Jupyter Notebooks. His overall recommendation for using Jupyter:

Before closing out of an analysis session, make sure you can run your notebook from a clean state.

[Figure: Jupyter's "Restart & Run All" command. It tests notebook linearity, an important part of reproducibility.]

Video Timelines and Notes

video 1 [5min] - acquiring, loading, plotting data.

  • Retrieve data from code; reproducible analysis starts with acquiring the data.
  • Use pandas to load data.
  • Import familiar libraries under the short aliases the community already uses
    • import pandas as pd
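
A minimal sketch of the "retrieve data from code" idea: download the file only when it is not already on disk, then load it with pandas. The function name `get_data` and its parameters are assumptions made for illustration, not code taken from the videos.

```python
import os
from urllib.request import urlretrieve

import pandas as pd


def get_data(filename, url, force_download=False):
    """Download the CSV at `url` to `filename` if needed, then load it.

    `filename` and `url` are placeholders supplied by the caller.
    """
    if force_download or not os.path.exists(filename):
        # Fetch once; later runs reuse the local copy.
        urlretrieve(url, filename)
    return pd.read_csv(filename)
```

Calling a helper like this in the first cell means the notebook can rebuild its inputs on a fresh machine, which is what a clean "Restart & Run All" needs.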

video 2 [6min] - exploring data

  • explore your data by graphing it from different angles
  • matplotlib has built-in styles to prettify plots, including seaborn-style themes
  • aggregate and group data with pandas (groupby)
  • pivot tables are very easy to build and display
[Figure: bike path usage by direction (time series)]
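
To make the groupby and pivot-table bullets concrete, here is a small sketch on made-up counts. The column names `West`, `East`, and `Total` and all the values are invented for illustration; they are not the data from the videos.

```python
import pandas as pd

# Made-up counts for two directions at two times on two days.
data = pd.DataFrame(
    {"West": [4, 2, 9, 7], "East": [9, 5, 3, 8]},
    index=pd.to_datetime(
        [
            "2021-01-01 08:00",
            "2021-01-01 17:00",
            "2021-01-02 08:00",
            "2021-01-02 17:00",
        ]
    ),
)
data["Total"] = data["West"] + data["East"]

# groupby: average count at each time of day, across all days.
by_time = data.groupby(data.index.time).mean()

# pivot table: one row per time of day, one column per date.
pivoted = data.pivot_table("Total", index=data.index.time, columns=data.index.date)

print(by_time)
print(pivoted)
```

With real data, `pivoted.plot(legend=False)` would draw one line per day, which is a quick way to see whether days fall into distinct patterns.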

video 3 [5min] - what should be saved

  • Jupyter is great because we can explore data by jumping around in different code blocks (nonlinearity)
  • before saving, linearize your notebook. “Restart & Run All” is your friend

video 4 [6min] - git and github

  • don’t check your data into version control (it should be acquired in code, if possible)

video 5 [7min] - turn your code into a python package

  • package useful bits of code so you don’t have to copy and paste them into other notebooks
  • requires a few bits of code, but nothing complicated
    • create an __init__.py file in the package directory that imports objects from modules within that same directory
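
The role of `__init__.py` can be sketched by building a throwaway package in a temporary directory and importing it. The package name `demo_pkg`, the module `data.py`, and the function `double` are all hypothetical; in practice you would create these files by hand next to your notebook.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical layout:
#   demo_pkg/
#       __init__.py   <- re-exports names from sibling modules
#       data.py       <- holds the reusable function
root = Path(tempfile.mkdtemp())
pkg_dir = root / "demo_pkg"
pkg_dir.mkdir()

(pkg_dir / "data.py").write_text(
    "def double(x):\n"
    "    return 2 * x\n"
)

# The relative import makes `double` available as demo_pkg.double.
(pkg_dir / "__init__.py").write_text("from .data import double\n")

# Make the parent directory importable, then import the package.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
demo_pkg = importlib.import_module("demo_pkg")

print(demo_pkg.double(21))
```

Once the function lives in a package, every notebook can import it instead of carrying its own pasted copy.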

video 6 [6min] - test your code

  • unit tests check that your functions do what they are supposed to
  • tests are a positive signal to others that your code can be relied upon
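
A minimal sketch of what such a test could look like, written so that a runner such as pytest can pick it up. The loader `load_counts`, the column names, and the sample rows are assumptions for illustration only.

```python
import io

import pandas as pd


def load_counts(source):
    """Hypothetical loader: read a CSV with a 'Date' column into a date-indexed frame."""
    return pd.read_csv(source, index_col="Date", parse_dates=True)


def test_load_counts():
    # A tiny in-memory CSV stands in for the real file.
    sample = io.StringIO(
        "Date,West,East\n"
        "2021-01-01 08:00:00,4,9\n"
        "2021-01-01 09:00:00,2,5\n"
    )
    data = load_counts(sample)
    assert list(data.columns) == ["West", "East"]
    assert isinstance(data.index, pd.DatetimeIndex)
    assert len(data) == 2
```

The test checks properties of the result (columns, index type, row count) rather than exact numbers, so it keeps passing as new rows are added to the source data.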

video 7 [6min] - refactoring for speed

  • if a common operation is slow, pandas usually offers a faster way to do it
[Figure: speedup after an improved read_csv invocation, from 22s to <0.5s, because we understood how pandas works]
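
One common cause of slow loads is date parsing without a known format, and one common remedy is to pass the format explicitly. The sketch below illustrates that general technique; the timestamp format, the column names, and the sample rows are assumptions, and it is not presented as the exact change made in the video.

```python
import io

import pandas as pd

# Made-up rows using a 12-hour clock timestamp format.
csv_text = (
    "Date,West\n"
    "10/03/2012 12:00:00 AM,4\n"
    "10/03/2012 01:00:00 AM,7\n"
)

# Read the index as plain strings first (no date inference)...
data = pd.read_csv(io.StringIO(csv_text), index_col="Date")

# ...then convert it in one pass with an explicit format.
data.index = pd.to_datetime(data.index, format="%m/%d/%Y %I:%M:%S %p")

print(data.index)
```

An explicit format fails loudly if the source ever changes its timestamp layout, so it is worth covering the loader with a unit test as well.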

video 8 [6min] - debugging

  • debugging is a learned art. watch the videos to get better at it in your own code
  • when you find a bug in your code, that’s a good candidate for unit testing

video 8.5 [8min] - finding, fixing, PR for scikit-learn bug

  • pretty neat - watch him find, fix, and submit a PR for a bug in a major library

video 9 [8min] - more sophisticated analysis

video 10 [8min] - cleaning up the notebook

  • to go from a Jupyter notebook exploration to a Reproducible Result, and to share with other people, try to linearize your notebook. Jake’s tweet is pretty straightforward here:

In short, think about how other people, including yourself at your next session, will likely approach your code. It’s great to explore in a non-linear fashion – that’s part of the power of this notebook IDE – but try to tidy up after yourself, even if it’s for your own sake.