Proper Python Project Structure 2024
are you interested in proper python project perfection for 2024? you’ve come to the right place. let’s get into it.
Require This
If you see a modern active python project with only `requirements.txt` or `setup.py` / `setup.cfg` in the wild, you must assume the maintainers haven’t learned anything new in half a dozen years. `requirements.txt` (or “requirements in `setup()` call”) is not a valid way to manage dependencies — and it hasn’t been for the past 5+ years. If you are still using `requirements.txt` it shows you need professional help. luckily, i’m a professional.
Let’s go over some bad / good / example practices for living your best python life in 2024.
Python Project Worst Practices
Indicators your python architecture is outdated:
- BAD: using `requirements.txt`. `requirements.txt` has no mechanism to ensure all dependencies are compatible based on their other transitive dependencies. also `requirements.txt` usage defaults to poor developer hygiene because many people just list names with no versions, so builds end up completely non-reproducible as time drifts forward (see the sketch after this list).
- BAD: manually creating and entering venv virtual environments — your dependency/package manager should be controlling the creation and destruction of your venvs as well as automatically managing install/uninstall cycles for packages in the venv.
- BAD: any code not in a proper python directory/package structure using python namespace formats. If you are manually setting `PYTHONPATH` because your code isn’t following the 20 year old standard python directory layout, your code is also not reusable and probably difficult to advance and extend too.
- BAD: if you are using `setup.py` to build extensions but you don’t understand `setup.py` runs outside of any installed packages, you are creating software uninstallable in the modern python ecosystem. If you are doing “system tests” for installed packages before installing your own packages, your logic is corrupt and you need to fix it (looking at you, poorly maintained detectron2 and flash-attention install logic).
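To make the reproducibility point concrete, here’s a sketch of the failure mode (hypothetical file contents; the package names are just examples):

```
# requirements.txt as commonly written: names only, no versions, no hashes
fire
httpx
orjson
```

`pip install -r requirements.txt` resolves each of these against whatever is newest on PyPI at install time, so two installs a month apart can produce different transitive dependency sets with no record of what was actually installed.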
Better Practices
Basic template for a modern python project:
- All python projects must use a `pyproject.toml` file for declaring all aspects of the local package.
- Each single-purpose repository should have only one `pyproject.toml` file at the top level.
- All python package code goes in a directory for your package name (sometimes called stuttering since your package directory will often just be your repo name again like `packagename/packagename`)
- So your directory structure looks like:

```
project_name/
    pyproject.toml
    project_name/__init__.py
```
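In practice a repo usually grows a few more standard files; a typical fuller layout looks something like this (the `tests/` and `entrypoint.py` names here are just common conventions, not requirements):

```
project_name/
    pyproject.toml
    poetry.lock
    README.md
    project_name/
        __init__.py
        entrypoint.py
    tests/
        test_entrypoint.py
```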
Using `pyproject.toml` lets us define dependencies in a more reliable fashion than the outdated legacy ways of `requirements.txt` / `setup.py` / `setup.cfg`. Writing all your metadata parameters in `pyproject.toml` is much cleaner and much better defined than metadata just being parameters to setup functions.
For creating and managing `pyproject.toml`, the best tool is currently Poetry. Poetry includes typical features you’d expect such as versioned dependency resolution (without the broken dependency performance of previous python dependency managers), automatic virtual environment management, automatic package building/uploading, and automatic command running too.
Here’s a quick buhbample:
Use the `poetry init` wizard for creating your defaults:

```
-- first just verify you have your global environment configured correctly:
> pip install pip poetry wheel setuptools -U

-- now we can create a demo project
> mkdir hello
> cd hello
> poetry init

This command will guide you through creating your pyproject.toml config.

Package name [hello]:
Version [0.1.0]: 0.3.0
Description []: A Project Which Wishes You Great Hellos
Author [Matt Stancliff <matt@matt.ai>, n to skip]: Matt <matt@matt.ai>
License []: MattLicense-3.2.1
Compatible Python versions [^3.10]: ^3.12

Would you like to define your main dependencies interactively? (yes/no) [yes] no
Would you like to define your development dependencies interactively? (yes/no) [yes] no
```
Generated file
```
> cat pyproject.toml

[tool.poetry]
name = "hello"
version = "0.3.0"
description = "A Project Which Wishes You Great Hellos"
authors = ["Matt <matt@matt.ai>"]
license = "MattLicense-3.2.1"
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.12"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```
Also notice how your package now depends on a python version too (which will need to be adjusted forward in time as versions grow unless you relax the restriction here, but many dependencies will also require you to have a narrow range of acceptable python versions).
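If you need to support a wider python range than the wizard default, you relax the constraint right here (a quick sketch; note `^3.12` is equivalent to `>=3.12,<4.0`, so relaxing mostly means lowering the floor):

```toml
[tool.poetry.dependencies]
# accept anything from 3.10 up to (but not including) 4.0
python = ">=3.10,<4.0"
```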
One missing feature of the python/poetry setup environment though: poetry doesn’t let you define which poetry versions are allowed to operate with your `pyproject.toml` file, so sometimes poetry will add or remove features in newer `poetry` versions and you have outdated syntax in your project files, but there’s no way to tell your project files which poetry versions are acceptable.
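The practical workaround is pinning poetry itself wherever you bootstrap it, e.g. in CI or docker images (the exact version below is just an example):

```
> pip install "poetry==1.7.1"
```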
Add your initial dependencies
```
> poetry add fire orjson httpx loguru

Creating virtualenv hello-shtJlOYo-py3.12 in /Users/matt/Library/Caches/pypoetry/virtualenvs
Using version ^0.5.0 for fire
Using version ^3.9.10 for orjson
Using version ^0.26.0 for httpx
Using version ^0.7.2 for loguru

Updating dependencies
Resolving dependencies... (0.2s)

Package operations: 12 installs, 0 updates, 0 removals

  • Installing certifi (2023.11.17)
  • Installing h11 (0.14.0)
  • Installing idna (3.6)
  • Installing sniffio (1.3.0)
  • Installing anyio (4.2.0)
  • Installing httpcore (1.0.2)
  • Installing six (1.16.0)
  • Installing termcolor (2.4.0)
  • Installing fire (0.5.0)
  • Installing httpx (0.26.0)
  • Installing loguru (0.7.2)
  • Installing orjson (3.9.10)

Writing lock file
```
As one would expect from modern tools, the dependency manager automatically pulls deps in a reasonably performant manner and generates a hash-defined lockfile (though, there’s been a poetry bug (skill issue) for a while where sometimes, if installs get stuck resolving dependencies, you need to force-delete your local package cache via `poetry cache clear pypi --all`).
Look at your dependency graph (look at it!)
```
> poetry show --tree

fire 0.5.0 A library for automatically generating command line interfaces.
├── six *
└── termcolor *
httpx 0.26.0 The next generation HTTP client.
├── anyio *
│   ├── idna >=2.8
│   └── sniffio >=1.1
├── certifi *
├── httpcore ==1.*
│   ├── certifi *
│   └── h11 >=0.13,<0.15
├── idna *
└── sniffio *
loguru 0.7.2 Python logging made (stupidly) simple
├── colorama >=0.3.4
└── win32-setctime >=1.0.0
orjson 3.9.10 Fast, correct Python JSON library supporting dataclasses, datetimes, and numpy
```
and it’s useful to view your dependency tree, especially in larger projects, when you get a security alert about how Version X.Y.Z.Q.M has A SINGLE INVALID QATARI CERTIFICATE AND NOW MUST BE FIXED GLOBALLY TO Version X.Y.Z.Q.M+1 but you’re not sure how many packages it may impact.
A proper dependency tree is the core of all modern software development. Legacy methods of `requirements.txt` or `setup.py` / `setup.cfg` package specifications do not ensure cross-package version compatibility, which is why we need `poetry` to handle all our dependency logic in a single consistent, well-behaved source of truth.
Living with pyproject.toml
But what about most common people who are copy-paste-by-example developers and just use the same `pip install -r requirements.txt` pattern everywhere because they read it on a forum somewhere in 2012 and never looked up anything again?
You never need to invoke `pip install` anywhere except for your initial global `pip install pip poetry wheel setuptools -U` because all other packages and projects are managed in your poetry-controlled virtual environments.
Technically you can look inside your poetry-managed virtual environment with `poetry shell` or even `poetry run python` or `poetry run pip` — but it’s a big anti-pattern to poke the venv python environment directly unless you are performing extreme debugging operations. Your `pyproject.toml` file should be configured to run all entry points to your package using auto-generated scripts logic.
The largest problem with `pyproject.toml` is all the outdated online forum posts telling people to just `pip install -r requirements.txt` when you should never be generating a `requirements.txt` anymore in the first place.
So, anyway, to contribute to the discourse arena, here’s a simple modern non-legacy example of a (ugh) docker (ugh) setup using no `requirements.txt` at all:
```dockerfile
FROM python:3.12-slim

# globally install poetry and upgrade pip things
# (NOTE: the poetry project often releases new versions over weekends, so
# if you have auto-building services and poetry releases a new incompatible
# version, your stuff will just break randomly on Saturday nights; so you _could_
# pin your specific poetry version here, but also fix-as-it-breaks is valid too)
RUN pip install pip poetry setuptools wheel -U --no-cache-dir

# Copy your project definitions into the image
COPY pyproject.toml poetry.lock README.md ./

# Run the virtual env creator and dependency installer
# (NOTE: some python packages like mysqlclient require more system binary packages
# to be installed, so you'd need to apt-get other packages as required before
# your poetry install if needed)
RUN poetry install --without=dev --no-cache

# Copy your project package
COPY hello hello

# Now install the project package itself
# NOTE: Yes, we run `poetry install` TWICE due to the docker
# caching logic because we don't want to reinstall dependencies
# on every code update. This means the dependencies are cached
# in the _first_ `poetry install` layer, while _this_ `poetry install`
# layer just handles a final "script cleanup" install due to path issues.
RUN poetry install --without=dev --no-cache

# now run your command (as defined in `pyproject.toml` poetry scripts section)
CMD poetry run hello-command extra-args
```
Using this pattern you are, yes, creating a virtualenv in your docker image and using it, but LOOK HOW CLEAN IT IS. You ARE NOT generating a redundant and potentially out-of-sync `requirements.txt` file; you are using your built-in tooling as expected without any extra weird workarounds and all your expectations play out without any additional mental indirection across incompatible tool usage. nobody likes incompatible tools.
FREE TIP: did you know `Dockerfile` is actually a file extension? So you should name things like `hello.Dockerfile` and not `helloDockerfile` or `Dockerfile.hello`. the more you know
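then you build with docker’s `-f` flag:

```
> docker build -f hello.Dockerfile -t hello .
```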
Poetry Script Entry Point Management Excellence
Perhaps the most practical day-to-day feature of poetry is its nice ability to automatically generate global python commands from a simple one line `package.module:function` declaration.
le example
Let’s make a simple web fetcher nicely encapsulated in a dataclass so we can run it all via a CLI.
creating hello/entrypoint.py
```python
from dataclasses import dataclass, field

import httpx
from loguru import logger


@dataclass(slots=True)
class Fetcher:
    url: str
    timeout: float | None = None

    # These two are marked 'init=False' so they do not show up in the constructor
    # logic because the user doesn't need the ability to initialize these values since
    # they a.) have defaults and b.) are internal implementation details.
    client: httpx.Client = field(default_factory=httpx.Client, init=False)
    results: list[httpx.Response] = field(default_factory=list, init=False)

    def __post_init__(self) -> None:
        # Attach our timeout to our instance httpx client
        # (note how we need to do this in __post_init__ since we can't access
        # peer instance variables in the `field()` defaults above because there's
        # no `self` existing there yet)
        self.client.timeout = self.timeout

    def fetch(self) -> None:
        logger.info("[{}] Fetching with timeout {}", self.url, self.timeout)
        self.results.append(self.client.get(self.url))
        logger.info("[{} :: {}] Found results: {}", self.url, len(self.results), self.results)
```
Pretty simple, right? We have a dataclass with one required parameter `url` and the other instance variables all have defaults so the user doesn’t need to provide values for them.
But how do we run this? If you were using legacy outdated python logic, you’d bring out more dunders with something like `if __name__ == "__main__": Fetcher(*sys.argv[1:]).fetch()`.
But we can also just use `fire.Fire` (or even `jsonargparse`) and it automatically generates a full CLI from the class parameters itself. Instead of adding the name-main dunder, we create a proper command entrypoint where it will auto-detect required parameters just from the class definition and allow us to run any methods declared on the class:
appending to hello/entrypoint.py
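(a minimal sketch of the appended code; we just hand the whole `Fetcher` class to fire)

```python
import fire


def cmd() -> None:
    # fire inspects Fetcher: constructor fields become CLI parameters and
    # methods like fetch() become runnable commands
    fire.Fire(Fetcher)
```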
Now we tell poetry where to find the `cmd()` top-level function by adding this to our `pyproject.toml`:
updating pyproject.toml
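(a sketch of the scripts entry; the command name matches the `hello-entrypoint` invocations below, and the `package.module:function` path points at our `cmd()`)

```toml
[tool.poetry.scripts]
hello-entrypoint = "hello.entrypoint:cmd"
```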
Now we just run it via:
running examples
```
> poetry run hello-entrypoint

Usage: hello-entrypoint <group> | --url=URL <flags>
  available groups: client | results | timeout | url
  optional flags: --timeout

For detailed information on this command, run:
  hello-entrypoint --help
```
and it worked! Except we didn’t actually run it yet. We can run it too:
```
> poetry run hello-entrypoint --url http://google.com - fetch
2024-01-09 18:31:51.487 | INFO | hello.entrypoint:fetch:21 - [http://google.com] Fetching with timeout None
2024-01-09 18:31:51.513 | INFO | hello.entrypoint:fetch:25 - [http://google.com :: 1] Found results: [<Response [301 Moved Permanently]>]

> poetry run hello-entrypoint --url http://google.com --timeout 0.1 - fetch
2024-01-09 18:38:38.614 | INFO | hello.entrypoint:fetch:21 - [http://google.com] Fetching with timeout 0.1
2024-01-09 18:38:38.643 | INFO | hello.entrypoint:fetch:25 - [http://google.com :: 1] Found results: [<Response [301 Moved Permanently]>]
```
Here’s an example of the output using `jsonargparse` instead of `fire.Fire` too:
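(the only change is swapping the wrapper in `hello/entrypoint.py`; a sketch)

```python
from jsonargparse import CLI


def cmd() -> None:
    # CLI() builds an argparse-based interface from Fetcher's type
    # annotations: url is a positional, timeout a --flag, fetch a subcommand
    CLI(Fetcher)
```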
```
> poetry run hello-entrypoint http://google.com
usage: hello-entrypoint [-h] [--config CONFIG] [--print_config[=flags]] [--timeout TIMEOUT] url {fetch} ...
error: expected "subcommand" to be one of {fetch}, but it was not provided.

> poetry run hello-entrypoint http://google.com fetch
2024-01-09 18:47:43.671 | INFO | hello.entrypoint:fetch:21 - [http://google.com] Fetching with timeout None
2024-01-09 18:47:43.728 | INFO | hello.entrypoint:fetch:25 - [http://google.com :: 1] Found results: [<Response [301 Moved Permanently]>]

> poetry run hello-entrypoint --timeout 0.1 http://google.com fetch
2024-01-09 18:47:51.628 | INFO | hello.entrypoint:fetch:21 - [http://google.com] Fetching with timeout 0.1
2024-01-09 18:47:51.657 | INFO | hello.entrypoint:fetch:25 - [http://google.com :: 1] Found results: [<Response [301 Moved Permanently]>]
```
you can also have jsonargparse generate a config file from your parameters for overriding shared values on future runs:
```
> poetry run hello-entrypoint --print_config --timeout 0.1 http://google.com fetch
url: http://google.com
timeout: 0.1
fetch: {}
```
How to decide between `fire.Fire` or `jsonargparse.CLI`? If you only have a simple interface with a couple parameters and a couple commands, `fire.Fire` is easy to run and is flexible if you want to be lazy (`fire.Fire` parses inputs to types, but not the declared code types, so it’s more likely to work for any input even if your code types aren’t perfect), while `jsonargparse` allows inputs to be fully defined nested objects if you want to have more complex inputs (like inputs being other nested dataclasses of their own fields to create) and `jsonargparse` will validate types of all fields from your class definitions from either CLI or config file input. basically, `fire.Fire` is good for being quick and lazy (it also supports nice pipeline workflow systems); `jsonargparse.CLI` is good for more complex and more long-lived systems. Also note in our example above, `fire.Fire` generated config params for all instance variables while `jsonargparse.CLI` properly ignored our `init=False` fields since users shouldn’t be editing those values anyway.
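To make “fully defined nested objects” concrete, here’s a small hypothetical sketch (the `Retry` config class is invented for illustration, assuming jsonargparse’s handling of dataclass-typed fields, which it exposes as dotted flags):

```python
from dataclasses import dataclass, field

from jsonargparse import CLI


@dataclass
class Retry:
    attempts: int = 3
    backoff: float = 0.5


@dataclass
class Crawler:
    url: str
    retry: Retry = field(default_factory=Retry)

    def crawl(self) -> None:
        print(f"crawling {self.url} with {self.retry}")


if __name__ == "__main__":
    # accepts nested values like:
    #   python crawler.py --retry.attempts=5 http://example.com crawl
    # and validates every field against its declared type
    CLI(Crawler)
```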
so there’s one example of running a fully modern `pyproject.toml` + `poetry` + good class layout + auto-generated-CLI system. Everything you create should have a structure very similar to the format above for basic interactive tools or any systems requiring the ability for users to run some part of your project (i.e. it’s not “just a library” somebody else is importing).
Bonus: Legacy Outdated Python Usage Warning Signs
Other signs your python knowledge is outdated or you only learned from online spam tutorials without reading platform updates over the past 5+ years:
- BAD: you use anything in the `os.path` namespace. You should only use the much nicer `from pathlib import Path` namespace for file introspection (see the first sketch after this list).
- BAD: you manually create CLI structures with `argparse` or `click`. All your CLI interaction should be generated from `fire.Fire` for simple interfaces or `jsonargparse` for more complex interfaces (`jsonargparse` also includes a built-in config file parser just from your project structure, which you’ll find familiar if you’ve used the pytorch lightning config syntax before)
- BAD: you aren’t using a nice time wrapper. `Arrow` or `pendulum` often provide nicer interfaces than python’s built-in poorly named `from datetime import datetime` system.
- BAD: mega red flag: you manually configure `PYTHONPATH` anywhere. Manually setting `PYTHONPATH` shows you don’t understand the basic operations of python module path logic (which, we must admit, is confusing by itself), but using `pyproject.toml` with `poetry` to manage your dependencies and environments means you should never use `PYTHONPATH` ever again.
- BAD: you don’t realize `List[]` and `Tuple[]` and `Set[]` and `Optional[]` are deprecated in favor of just `list[]`, `tuple[]`, `set[]`, `deque[]`, `x | None` typing annotations (it is also difficult to teach current code LLMs to stop outputting legacy type annotations though).
- BAD: your python version is more than 1 version old. As of this writing, you are allowed to use Python 3.11+ and Python 3.12+. If you are still creating new systems targeting older versions, STOP IT. JOIN THE FUTURE. if you are “stuck” on your current version because your “system python” isn’t updated, you should be using pyenv so you control your language version directly.
- BAD: not using a standard code formatter: pick one or more of `black`, `ufmt` (groups/sorts imports too), `ruff format`
- BAD: using “anaconda” or “conda” for anything — those are just crutches for people who don’t know how to install their own packages, but if you can’t even install a package, you need to just learn more instead of operating at borderline-incompetent levels of understanding in your professional work.
- BAD: reliance on jupyter interfaces for much of anything. just be a real grown up human developer and create reusable interactive systems instead of buggy sprawling low quality “notebooks” everywhere.
- BAD: you aren’t using `@dataclass` everywhere. You should never write `def __init__(self, ...) -> None:` ever again. dataclasses allow you to define your instance variables once, then a constructor is auto-generated and you can hook into it with `def __post_init__(self) -> None:`, but basically every python class should be a `dataclass` going forward with no exceptions (only exception maybe if you need to inherit from a non-dataclass, but even such cases can be worked around too where a dataclass inherits from a non-dataclass even if you have to annotate with `@dataclass(unsafe_hash=True)` (using said trick you can create dataclasses inheriting from `nn.Module` for a very clean system) or something else fun).
- BAD: you use the built-in python `logging` module. It is a complete disaster of an API. You should use `loguru` everywhere instead.
- BAD: you write too much top-level code instead of behaviors being nicely encapsulated in self-contained dataclass components. python sadly encourages a lot of top-level unstructured code like how flask and fastapi expect to use a global object as their root decorator source, but this can be worked around with a couple clever levels of nesting using instance variables and closures.
- BAD: here’s a fun minor nit: you should never use an in-line generator because it is always faster to use a list comprehension instead: `max(x + 3 for x in range(100))` is always slower than `max([x + 3 for x in range(100)])` because, unless your source is truly infinite or truly generated-per-call, collecting and iterating a list is faster than building and running the generator state management logic on every call (plus, using inline generators introduces an entire class of bugs where you forget it’s a generator and you iterate it twice, but the second time, it has already fully generated all the data and then returns nothing instead of your data even though it “looks right” because you didn’t think it through abstractly enough — see the second sketch after this list).
- BAD: general unawareness or lack of experience using built-in features from things like collections and itertools or even helpful wrappers like more-itertools and boltons and standard useful things like sortedcontainers and diskcache and peewee minimal orm and httpx and loguru and orjson (at least if you aren’t using pypy) — many of those tools and APIs require actual practice/play/experience to remember when it’s appropriate to use them since they are often higher order abstractions which don’t feel intuitive until you’ve learned to think in more advanced data processing logic.
- OPTIONAL: using type checkers and/or linters via `mypy` or `ruff` too — depending on the size of your project and its purpose/complexity/cost/value, sometimes it isn’t worth spending an extra couple days or weeks tracking down all the proper types (especially if you are returning instances of classes with poorly defined types you have to code dive to find manually), but it’s always best if you can pass mypy strict just for peace of mind during updates and refactors. also, at least attempting proper type annotation everywhere feasible helps you write more self-documenting code and continually reflect on the appropriate use cases and structure of input types and response values (also, declaring type annotations on parameters helps you notice when maybe you are generating too many ad-hoc data structures via nested list/dicts/sets where instead you should be passing around better structured single-purpose encapsulated dataclasses, etc).
- contribute your own bugbear here!
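As referenced in the list above, a first sketch showing a few of the modern replacements together (pathlib, built-in generic annotations, and a dataclass; the class and file names here are hypothetical):

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(slots=True)
class Inventory:
    # modern annotations: list[...] and `| None` instead of List[...] / Optional[...]
    root: Path
    suffixes: list[str] | None = None

    def files(self) -> list[Path]:
        # pathlib replaces the os.path.join/isfile/splitext dance with one object
        return [
            p
            for p in self.root.rglob("*")
            if self.suffixes is None or p.suffix in self.suffixes
        ]


print(Inventory(root=Path("."), suffixes=[".py"]).files())
```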
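and the second sketch: the forget-it’s-a-generator bug in its natural habitat:

```python
squares = (x * x for x in range(10))

print(max(squares))   # 81 (this consumes the generator completely)
print(list(squares))  # [] (a second pass silently yields nothing)

# calling max(squares) again would even raise:
# ValueError: max() arg is an empty sequence
```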
if you’ve made it this far, i’m still soliciting ai indulgences for yourself and everyone you know (buy now before rates go up again in another step function!). also I still have some domains for sale as well if you have more money than sense: make.ai and god.ai are currently ready and waiting to live their best lives with you as soon as possible for the low low price of millions of dollars direct from you to me.