Data Science Project Structure Template: Cookiecutter

袁晗 | Luo, Yuan Han
5 min read · Oct 19, 2022

Part 1: Overview

img src: https://github.com/drivendata/cookiecutter-data-science

How to read this

Everyone knows what LICENSE, requirements, and README are, so I won’t go into those.

If you are well versed in statistics and data science but new to computers, read Section 1 and all of Part 2.

If you are well versed in computers but unfamiliar with data and statistics, read Section 2 only.

Section 1

You are here probably because you are looking for a better working process or file structure. Everything looks nice until you see Makefile, docs, setup & src, and tox. Let’s go through them.

Makefile

Compiling is like writing a research paper: importing code from other files is like citing sources. To properly compile a program, we need to send a command that compiles the program’s main source code along with every one of the libraries it uses.

What happens if your program needs 1000 files to compile?

Yes, you need to type the names of all 1,000 files in one command: basically the main source file plus n libraries (the example given follows C++ conventions). Worse yet, if you update one file, you have to go through this all over again. This is where the Makefile comes in.

A Makefile automates this process: instead of typing out every one of the files you used, you give a one-word command. Part 2 will talk about the details of the Makefile.
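As a rough sketch, assuming a project with the usual data/ and src/ folders, a data-science Makefile might define targets like these. The target names and script paths are illustrative, not necessarily the exact ones Cookiecutter generates, and note that recipe lines must be indented with tabs:

```make
# Illustrative targets only; the generated Makefile will differ in detail.

requirements:            ## install Python dependencies
	pip install -r requirements.txt

data: requirements       ## build data/processed from data/raw
	python src/data/make_dataset.py data/raw data/processed

clean:                   ## remove Python build artifacts
	find . -type f -name "*.pyc" -delete
```

With this in place, `make data` is the one-word command that installs the dependencies and rebuilds the dataset.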

Docs

The docs folder comes from the long-standing practice of documenting your work. Sphinx automates that process and generates something like the depiction below.

img src: https://andreas-suekto.medium.com/automate-documentation-in-python-925c38eae69f

There are many reasons why this is preferable to single- or multi-line comments, which I think deserve a blog post of their own. For now, think of it as a more mature, generally accepted, and readable form of documentation.
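As a sketch of where that documentation comes from: you write structured docstrings in your code, and Sphinx turns them into pages like the one pictured above. The function below is made up, and the NumPy docstring style shown here is just one of several styles Sphinx can render (via its napoleon extension):

```python
def train_test_split_by_date(df, cutoff):
    """Split a dataframe into train and test sets by date.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataset, which must contain a ``date`` column.
    cutoff : str
        Rows on or before this date go to train, the rest to test.

    Returns
    -------
    tuple of pandas.DataFrame
        The ``(train, test)`` partitions of ``df``.
    """
    train = df[df["date"] <= cutoff]
    test = df[df["date"] > cutoff]
    return train, test
```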

Setup&src

After you upload your program onto pypi.org, you need a way for users to download it. In Python, that is setup.py.

This is a lot quicker and easier than building a user interface for your end users. They are now reduced to the extreme hardship of typing one line of code to download your package, unless they want to pay you for an alternative. Part 2 will go into detail on how to set it up.
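A minimal sketch of what setup.py can look like; the package name, version, and dependencies below are placeholders, not what the template actually fills in:

```python
from setuptools import find_packages, setup

setup(
    name="my_ds_project",      # placeholder project name
    version="0.1.0",
    packages=find_packages(),  # picks up src/ and its subpackages via __init__.py
    install_requires=[         # illustrative dependencies
        "pandas",
        "scikit-learn",
    ],
    description="Example data science project",
)
```

With this file in place, `pip install .` installs the project locally, and once it is published, users can type `pip install my_ds_project`: that is the one line of code.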

src is a standard programming practice where you organize your code into modules and functions so it is reusable, easy to read, and easy to debug. __init__.py is the key to that: it acts like a thread that ties all your src modules together. Read Part 2 for the detailed implementation.
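As a sketch, assuming hypothetical modules src/data/make_dataset.py and src/features/build_features.py, the top-level __init__.py can be empty (just marking src as a package) or can stitch the pieces together like this:

```python
# src/__init__.py
# Re-export the most-used functions so callers can write
# `from src import make_dataset` instead of reaching into deep paths.
# (make_dataset and build_features are hypothetical names.)
from src.data.make_dataset import make_dataset
from src.features.build_features import build_features

__all__ = ["make_dataset", "build_features"]
```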

References

References include the data dictionary: metadata that explains your features, such as their size, type, and format. This is also where you store the rest of your supplementary information.
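For instance, a data dictionary entry might look like this (the features and values are made up):

```
feature      type   range/size   description
age          int    18-90        customer age in years
signup_date  date   2015-2022    date the account was created
churned      bool   {0, 1}       target: whether the customer left
```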

tox.ini

tox is a testing automation tool. If you are new to coding, get into the habit of testing; it will save you time in the long run. I will go through the details in Part 2.
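As a minimal sketch, assuming the tests live in a tests/ folder and use pytest, a tox.ini could look like this:

```ini
# Illustrative configuration: run the test suite under Python 3.10.
[tox]
envlist = py310

[testenv]
deps = pytest
commands = pytest tests/
```

Running `tox` then builds an isolated environment, installs the dependencies, and runs the tests.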

Section 2

Data

While raw is self-explanatory, interim, processed, and external might not seem so obvious: in particular, what is the difference between interim and processed, and what is the difference between raw and external?

Interim

Interim data has no missing values, irrelevant features, duplicate rows, and so on. Although you can run analysis on it, it is not necessarily model-ready, though some newer models can do wonders with it. During the training phase, feature engineering such as one-hot encoding (converting a feature into binary columns) or binning (grouping a column’s values into categories) is required or will enhance model performance. That means model accuracy is the main focus, and readability/analysis takes a back seat.

left: interim data, right: processed data

That’s the difference between interim and processed data.
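As a rough sketch (the column names and values are made up), this is the kind of step that turns interim data into processed data:

```python
import pandas as pd

# Interim data: cleaned (no missing values or duplicates) but still human-readable.
interim = pd.DataFrame({
    "age": [23, 35, 51],
    "city": ["Austin", "Boston", "Austin"],
})

# Processed data: feature-engineered for the model.
processed = pd.get_dummies(interim, columns=["city"])  # one-hot encoding
processed["age_bin"] = pd.cut(                         # binning
    processed["age"], bins=[0, 30, 50, 100], labels=["young", "middle", "senior"]
)
print(processed)
```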

External

Third-party data can be raw, but it is not the original source. In the real world, completing a project from one source is like writing a research paper with only one citation: very unlikely.

A more realistic approach is pulling data from many different sources, inside or outside the organization you are working with, to form a Frankenstein database.

Model

The models folder stores the trained model, serialized with the pickle library or something similar.

A model is nothing more than a function with fitted parameters (or slopes); machine learning is just a statistical calculator.

You can train and deploy your model entirely in the cloud, or make the whole program installable.
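As a sketch of what “a function with fitted parameters” means in practice (the data, model choice, and file name below are placeholders):

```python
import pickle
from sklearn.linear_model import LinearRegression

# Fit a tiny model: its "knowledge" is nothing but the fitted slope and intercept.
X, y = [[1], [2], [3]], [2, 4, 6]
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # the fitted parameters

# Persist the trained model into the models folder so it can be reused
# later without retraining.
with open("models/linear_model.pkl", "wb") as f:
    pickle.dump(model, f)
```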

Notebook

A Jupyter notebook is a combination of comments, executable code, and terminal-like output cells. It puts everything together in one place, giving us a more granular view of the whole workflow for the sake of our sanity.

img src: https://www.kaggle.com/code/yassineghouzam/titanic-top-4-with-ensemble-modeling

It is not a PDF or a static display of code; you can edit and execute the code inside the gray containers.

Reports/figures

Reports is like a PowerPoint presentation of your findings, in the form of HTML, PDF, or whatever is trendy. The subfolder figures contains the graphs and other visuals you use in your report. The purpose of reports is for non-technical colleagues to understand the statistical findings, as opposed to a Jupyter notebook. It differs from docs in that it has a more business-analytics focus and cares very little about the technology.

Conclusion

I tried to waste as little of your time as possible in writing this overview. This structure isn’t written in stone, but it is a generally accepted practice, shaped by the mistakes of the past. If you are well versed in computers, I think this is enough; if computers aren’t your strong suit, read Part 2.
