A Recipe for Organising Data Science Projects

Learn how to create structured and reproducible data science projects

Data science projects are by their very nature experimental and exploratory. It can be very easy when working on a project of this kind to end up with a big mess of spaghetti code that is difficult to decipher or reproduce.

Data science projects are different from traditional software engineering projects in this way. However, it is possible to create a solid code structure that will ensure your project and its results are both reproducible and extensible by yourself and others.

In this article, I am going to give you a recipe, including the tools, processes and techniques, for setting up data science projects that give you the following:

  • A consistent project structure so that your code is easy to follow.

  • Version control so that you can track and make changes without breaking the core project.

  • An isolated virtual environment so that the project is easily reproducible.

  • Ethical and secure projects.

Project structure

Many web and software development frameworks come with a predefined, standard code structure. For example, I have recently been learning Bootstrap, and I was impressed that when I download the project I automatically get some skeleton code organised in a similar way to that shown in the image below.

What this means is that regardless of the exact nature of the project you are building, an outsider looking at your code organised in this standard way will instantly know where to look for certain files and can easily follow your code. It assists greatly with collaboration and reproducibility.

A project template called cookiecutter-data-science has been developed for data science projects, and it automatically creates a standard project structure. The template is used through the cookiecutter tool, which can be installed via pip.

pip install cookiecutter

To start a new project, simply type the following. There is no need to create a new directory first, as cookiecutter will do this for you.

cookiecutter https://github.com/drivendata/cookiecutter-data-science 

The tool will take you through a series of questions to set up your project. The first asks for the project name, which then becomes the name of the directory created.

Continue to answer the questions when prompted. Many of the questions are optional and you can simply hit return on those not relevant to your project.

You will now have a new directory with the name you gave your project. If you navigate to it, it will contain a file structure that looks similar to this image. The project structure contains a little Python boilerplate, but it is not limited to Python projects, as the boilerplate can always be removed if you are using a different programming language.
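To make the idea of a standard layout more concrete, here is a minimal sketch in Python that builds a few folders by hand. The folder names are an abridged assumption based on the cookiecutter-data-science template and may differ between template versions. The function name create_skeleton is hypothetical, and the real tool also generates files such as a README that this sketch leaves out.

```python
from pathlib import Path

# Abridged, assumed subset of the folders the template creates;
# the real layout may differ between template versions.
SKELETON_DIRS = [
    "data/raw",
    "data/interim",
    "data/processed",
    "notebooks",
    "models",
    "reports/figures",
    "src",
]


def create_skeleton(root):
    """Create the folders in SKELETON_DIRS under root and return their paths."""
    root = Path(root)
    created = []
    for relative in SKELETON_DIRS:
        folder = root / relative
        # parents=True builds intermediate folders such as data/;
        # exist_ok=True makes the call safe to repeat.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling create_skeleton("my_project") would create the folders under my_project in the current directory. In practice the cookiecutter command above is the better choice, since it keeps every project on the same template.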

GitHub repository

GitHub is a hosting service for the Git version control system that stores a remote version of your project, called a repository. Anyone with access can clone this repository to their local machine. Changes can be made and tested locally before committing them to the master version.
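The clone, change and commit cycle described above can be sketched in Python by calling the git command line through subprocess. This is only an illustration under a few assumptions: git is installed and on your PATH, the helper names git and clone_and_commit are hypothetical, and the author identity passed to the commit is a placeholder. Pushing the commit back to the remote repository is left out.

```python
import subprocess
from pathlib import Path


def git(*args, cwd):
    """Run a git command in cwd and return its standard output."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def clone_and_commit(remote, workdir, filename, text):
    """Clone remote into workdir, add one file, and commit it locally."""
    workdir = Path(workdir)
    # The clone is a full local copy, so the remote is untouched by what follows.
    git("clone", str(remote), str(workdir), cwd=workdir.parent)
    (workdir / filename).write_text(text)
    git("add", filename, cwd=workdir)
    # Placeholder identity set inline so the sketch does not rely on global config.
    git(
        "-c", "user.name=Example",
        "-c", "user.email=example@example.com",
        "commit", "-m", f"Add {filename}",
        cwd=workdir,
    )
    # Return the subject line of the most recent commit.
    return git("log", "-1", "--pretty=%s", cwd=workdir)
```

Here remote can be a repository URL or a path to a local repository, and workdir is the folder the clone will be placed in.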

Version control ensures that changes can safely be made to projects without breaking the original codebase. This