This repository serves as a personal template for data science projects.
- Analysis scripts and notebooks are located in `analysis/`.
- Reusable functions and modules are stored in the local package `src/`. The package can then be installed in development mode with `pip install -e .` for easy prototyping. `src/config.py` is used to store variables, constants, and configurations.
- The package version is extracted from git tags using `setuptools_scm`, following semantic versioning.
- Tests for functions in `src/` should go in `tests/` and follow the convention `test_*.py`.
Moreover, I use the following directories, which are (usually) ignored by Git:
- `data/` to store data files.
- `results/` to store results/output files such as figures, output data, etc.
I can set up the environment differently depending on the project. The irrelevant sections can be deleted when using the template.
The following does not apply when managing requirements with conda; see the conda section below.
The requirements are specified in the following files:
- `requirements.in` to specify direct dependencies.
- `requirements.txt` to pin the dependencies (direct and indirect). This is the file used to recreate the environment from scratch using `pip install -r requirements.txt`.
- `pyproject.toml` to store the direct dependencies of the `src` package.
The `requirements.txt` file should not be updated manually.
Instead, I use `pip-compile` from pip-tools to generate `requirements.txt`.
- Start with an empty `requirements.txt`.
- Install pip-tools with `pip install pip-tools`.
- Compile requirements with `pip-compile` to generate a `requirements.txt` file.
- Install requirements with `pip-sync` (or `pip install -r requirements.txt`).
NB: the advantage of `pip-sync` over `pip install -r requirements.txt` is that `pip-sync` makes the environment match `requirements.txt` exactly, removing packages that are installed but not listed there, if required.
- To upgrade packages, run `pip-compile --upgrade`.
- To add new packages, add them to `requirements.in` and then compile requirements with `pip-compile`.

Then, the environment can be updated with `pip-sync`.
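Putting the steps above together, a typical pip-tools session might look like the following sketch (the package names written to `requirements.in` are just examples):

```shell
# Example pip-tools session; the listed packages are illustrative.
printf 'pandas\nmatplotlib\n' > requirements.in  # direct dependencies only
pip install pip-tools
pip-compile requirements.in   # resolves and pins everything into requirements.txt
pip-sync requirements.txt     # make the environment match the pins exactly
pip-compile --upgrade         # later: refresh the pins to newer allowed versions
```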
To set up a Python virtual environment called `.venv` with venv, using the currently installed Python version, navigate to the repository directory and run the following in the command line:
$ python -m venv .venv
$ source .venv/Scripts/activate

(On Linux/macOS, the activation script is `.venv/bin/activate` rather than `.venv/Scripts/activate`.)

To set up the environment with conda (assuming it is already installed), navigate to the repository directory and run the following in the command line (specify the Python version and environment name as appropriate):
$ conda create -n myenv python=3.11
$ conda activate myenv
$ pip install -r requirements.in
$ pip install -e .

Then pin the requirements with:
$ conda env export > environment.yml

Finally, the environment can be recreated with:
$ conda env create -n myenv -f environment.yml

A Docker container can be used as a development environment.
In VS Code, this can be achieved using Dev Containers, which are configured in the .devcontainer directory.
The environment is automatically built as follows:
- A Docker image of Python is created with packages installed from `requirements.txt` (except local packages). The Python version can be edited in the Dockerfile.
- The image is run in a container and the current directory is mounted.
- The local packages are installed in the container, along with some VS Code extensions.
To set up the dev container:
- Install and launch Docker.
- Open the container by using the command palette in VS Code (`Ctrl + Shift + P`) to search for "Dev Containers: Open Folder in Container...".
If needed, the container can be rebuilt by searching for "Dev Containers: Rebuild Container...".
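For illustration, a minimal `devcontainer.json` implementing the behaviour described above could look like the sketch below. The actual configuration lives in the repository's `.devcontainer` directory; the name and extension IDs here are assumptions.

```json
{
    "name": "data-science-template",
    "build": { "dockerfile": "Dockerfile" },
    "postCreateCommand": "pip install -e .",
    "customizations": {
        "vscode": {
            "extensions": ["ms-python.python", "ms-python.black-formatter"]
        }
    }
}
```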
Pre-commit hooks are configured using the pre-commit tool.
Currently, the only hook is formatting with Black.
When this repository is first initialised, the hooks need to be installed with `pre-commit install`.
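As a sketch, a `.pre-commit-config.yaml` containing only the Black hook could look like this (the `rev` shown is an example; pin it to the version you actually use):

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 24.3.0  # example version; pin as appropriate
    hooks:
      - id: black
```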
This section can be deleted when using the template.
- Initialise your GitHub repository with this template. Alternatively, fork (or copy the content of) this repository.
- Update
  - the repository name
  - project information in `pyproject.toml`
  - the README
  - the license
- Set up your preferred development environment.
- Add a git tag for the initial version with `git tag -a "v0.1.0" -m "Initial setup"`, and push it with `git push origin --tags`.
I usually work with Visual Studio Code, for which various settings are already predefined. In particular, I use the following extensions for Python development.
- Black for formatting.
- Flake8 and SonarLint for linting.
- autoDocstring to generate docstring skeletons following the Google docstring format.
The src/ package could contain the following modules or sub-packages depending on the project:
- `utils` for utility functions.
- `data_processing` for data processing functions (this could be imported as `dp`).
- `features` for extracting features.
- `models` for defining models.
- `evaluation` for evaluating performance.
- `plots` for plotting functions.
The repository structure could be extended with, for example:
- `docs/` to store documentation.
- subfolders in `data/`, such as `data/raw/` for storing raw data.
- `models/` to store model files.
This template is inspired by the concept of a research compendium and by similar templates I created for R projects (e.g. reproducible-workflow).
This template is relatively simple and tailored to my needs. More sophisticated templates are available elsewhere, such as:
- Cookiecutter Data Science.
- https://joserzapata.github.io/data-science-project-template/
- Data Science for Social Good's hitchhiker's guide template
- https://github.com/khuyentran1401/data-science-template
Unlike other templates, this one is focused on experimentation rather than on sharing a single, final product.