Bring joy to your fellow developers by making your software easy to use through environment and build automation. With code examples in Python and Hatch.
Most developers hate legacy software. Why? "Legacy" in our industry means a codebase that has been in service for many years; the original developers have usually left the company, and nobody really knows how to maintain it anymore.
Some important ingredients in the legacy software recipe are: lack of documentation, hard-to-understand code, difficulty in trying out modifications, and serious difficulty in building software for shipment.
However, legacy software is called that because the company relies on it: that is why it is still in use even though most developers do not know how to handle it. We inherit it from our past colleagues, and it becomes our duty to keep it working in the future as it did in the past.
Yet, there is software that meets all the criteria above without the redeeming quality of having served us for years. Some of that code is quite recent. In other words, people write code that is already "legacy" by its negative definition.
A bad working day
Today is Monday and you have been assigned to a new project. In a meeting at 9 o'clock, you learn that you are now working on Petty, the new company product, a search engine for pet images. Your first task is to work on Petognizer, a component that recognizes pets in images. It is currently running in production, but it can only recognize cats and dogs. Now management wants it to also recognize hamsters and rabbits.
You get a link to the repository in the company GitLab account and are ready to use it.
Except that you cannot.
The repository has no README, and no folder called docs or something similar. It looks like you have some figuring out to do on your own.
It is 10.30 and you start digging into the repo. After 30 minutes you understand where to find the training code, the data preprocessing scripts, and the inference code. It looks overly complicated, but this is not the moment to go into details; you just want a global idea of it.
You also find a Dockerfile that is likely used for production, though this is not written anywhere explicitly, and a requirements.txt. Of course this is a Python project, since it is about machine learning.
It is 11.30 now and you feel ready to try it out. You know where the new dataset is, so you want to start with the data preparation pipeline. You go to install the package, except that you notice the absence of both a setup.py and a pyproject.toml. You ask your colleagues whether the package might be on the company's private PyPI server, but the answer is no. After all, if there are no build instructions, how could it be packaged?
Apparently, you have to go the dirty way. You clone the repo, find the entry point for data preparation, which is not in a __main__.py file, and add the root folder of the project to your PYTHONPATH environment variable. Then you create a new Python environment with venv, run pip install -r requirements.txt to install all the necessary dependencies, run the script and… voilà!
ModuleNotFoundError: No module named 'xxx'
You thought that requirements.txt would provide the dev dependencies, but apparently not. You look at your watch and notice that it is already one o'clock, so you decide to go for lunch with your colleagues. You will figure it out later.
When you are back from your lunch break, you are again energized and ready to tackle the problem again.
Your job now is to install the missing dependencies one by one: every time you install a library, you run the script again, wait a little, and the next import error comes out. You finish installing everything by 3 o'clock, still wondering what requirements.txt is actually for.
You have spent several hours on this and you are still not sure you can actually run the software for what you need. Your day so far feels completely wasted.
In a parallel world, you simply ran
pip install petognizer

or maybe, if the training dependencies are provided as an extra,

pip install petognizer[train]

Then you found the command for data preparation explained in the README, and you were ready to go in about 10 minutes, well before your lunch break.
The presence of a pyproject.toml would have made it easy both to install the package (even easier if it were on the PyPI server) and to prepare a build environment. Proper documentation would also have made it easy to know how to actually use the code for the different tasks.
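For illustration, the train extra used in the command above could be declared in pyproject.toml with an optional-dependencies table like the following sketch; the package name and the dependency lists are hypothetical, not taken from any real Petognizer:

```toml
[project]
name = "petognizer"   # hypothetical package name from the story
dependencies = [
    "numpy>=1.21",    # version constraints use the same specifiers as pip
    "pillow",
]

[project.optional-dependencies]
# Installed only on request, with: pip install petognizer[train]
train = [
    "torch",
    "torchvision",
]
```

Regular users installing the package get only the core dependencies; anyone who needs to retrain the model opts into the heavier training stack with the extra.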
So many hours of talented developers all over the world go wasted every day just because some code maintainer did not take any time to set up documentation and automation to provide the basics for a good development experience.
Now that you are in charge of this codebase, you can decide to leave it as it is, gifting the same frustration to the next developer who is going to use it. Alternatively, you can decide to spark some joy in the organization and make the repository easy to use.
While writing documentation is a long-term process that requires understanding of the code (though some correct documentation is better than none), automating the build and the creation of the dev environment can free developers from a lot of hassle and make them productive with the codebase much faster.
Hatch
Hatch is an automation tool supported by the Python Packaging Authority (PyPA) that enables creating environments and building software with high simplicity.
It uses a file called pyproject.toml, which is the "new" unifying Python project settings file. Unifying because it is used by Hatch and other similar tools, like Poetry, and it can also specify the configuration of other tools, such as black or flake8.
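As a sketch, tool sections simply live side by side with the project metadata in the same file; the specific options below are hypothetical preferences, not defaults:

```toml
# One file holds both project metadata and tool configuration.
[project]
name = "my-project"

[tool.black]
line-length = 100   # hypothetical formatting preference

[tool.mypy]
strict = true       # hypothetical type-checking preference
```

Each tool reads only its own [tool.<name>] table and ignores the rest.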
We can use Hatch to create a new project with
hatch new "My Project"

and it will create a standard project structure like
my-project
├── src
│ └── my_project
│ ├── __about__.py
│ └── __init__.py
├── tests
│ └── __init__.py
├── LICENSE.txt
├── README.md
└── pyproject.toml

It contains two separate packages for the source code and the tests, a README.md filled with some default text, a LICENSE.txt, a file called __about__.py containing the software version, and pyproject.toml.
Hatch can also be used in an existing project by simply running
hatch new --init

in its root folder. It will generate a default pyproject.toml that then needs to be modified.
Hatch can be installed with pipx install hatch to make it available as a user command without "polluting" the default Python environment.
A quick glance at pyproject.toml
The Hatch documentation is quite thorough, although its clarity could sometimes be improved, so here we will stick to the parts relevant to this article's topic, plus some necessary context.
pyproject.toml starts with a part that is quite descriptive. Here we can find the project's name, description, authors, supported Python versions, and the version itself.
As an example, let's use this toy repo that I made for illustration: https://github.com/mattiadg/demo_it-analyze/blob/main/pyproject.toml (thanks to my friend Mario for helping with the front end of this little app).
[project]
name = "demo-it-analyze"
description = ''
readme = "README.md"
requires-python = ">=3.7"
license = "MIT"
keywords = []
authors = [
{ name = "Mattia Di Gangi", email = "mattiadg@users.noreply.github.com" },
]
classifiers = [
"Development Status :: 4 - Beta",
"Programming Language :: Python",
"Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: Implementation :: CPython",
"Programming Language :: Python :: Implementation :: PyPy",
]
dependencies = [
"flask",
"pydantic",
"spacy",
]
dynamic = ["version"]
[project.urls]
Documentation = "https://github.com/unknown/demo-it-analyze#readme"
Issues = "https://github.com/unknown/demo-it-analyze/issues"
Source = "https://github.com/unknown/demo-it-analyze"

It should be quite clear, except for two fields. dynamic = ["version"] means that the software version has to be computed dynamically, and later in the file it says it has to be found in the __about__.py file. The field dependencies, on the other hand, lists the package dependencies. These are the packages that will be installed together with this software when it is packaged. Here I have only listed the names, but version constraints are also applicable, such as ==, <=, etc.; they are the same as those used by pip.
Notice that these dependencies are specified under the [project] table, while other dependencies, as we will see, are specified only for some environments. The ones here are common to the package and all the environments: the codebase simply does not work without them.
Tools configuration
As mentioned previously, pyproject.toml can be used to specify the configuration of tools used in the project, and Hatch is one of those tools that can be configured here.
[tool.hatch.version]
path = "src/__about__.py"
[tool.hatch.envs.default.env-vars]
FLASK_APP = "demo_it_analyze/app.py"
[tool.hatch.envs.default]
extra-dependencies = [
"pytest",
"pytest-cov",
"mypy",
]
[tool.hatch.envs.default.scripts]
cov = "pytest --cov-report=term-missing --cov-config=pyproject.toml --cov=demo_it_analyze --cov=tests {args}"
no-cov = "cov --no-cov {args}"
download_ita = "python -m spacy download it_core_news_sm"
serve = "flask --app src.app run"
[tool.hatch.envs.train]
template = "default"
extra-dependencies = [
"spacy[cuda-autodetect,transformers,lookups]"
]
[tool.hatch.envs.test]
template = "default"
[[tool.hatch.envs.test.matrix]]
python = ["37", "38", "39", "310", "311"]

Notice that the fields here start with tool.hatch and not with project.
We start by telling Hatch where to find the software version. Then we set the FLASK_APP variable for the default Python environment. Hatch allows us to create many environments; default is the one that is always defined and from which all the others derive. We can define extra dependencies for our environments: these are the dev dependencies, since the environments exist only for development and are not part of the packaged code.
With Hatch we can create the default environment by just executing
hatch env create

and it will create the default environment with all its dependencies. We can also create another environment specified in the pyproject.toml (for instance, test is also defined in this example) by running
hatch env create test

However, one nice thing about Hatch is that we do not need to create environments explicitly. The command hatch run allows us to run any command, or script, inside an environment. In the file copied above we have this script definition:
[tool.hatch.envs.default.scripts]
cov = "pytest --cov-report=term-missing --cov-config=pyproject.toml --cov=demo_it_analyze --cov=tests {args}"
no-cov = "cov --no-cov {args}"
download_ita = "python -m spacy download it_core_news_sm"
serve = "flask --app src.app run"

So we can run, for example,
hatch run serve

and Hatch will create the default environment, if it does not exist, synchronize its dependencies with those specified in pyproject.toml at the moment of execution, and run flask --app src.app run inside the environment, starting our Flask server. We can also specify a particular environment to run the command in, for example test with
hatch run test:serve

And notice that we did not need to install anything. Hatch reads the pyproject.toml and recognizes the scripts that it can run.
You see what we did here? The project maintainers specify some configuration options in a file, and then it becomes super easy for anybody who uses the repo to create exactly the same environments. No more struggling to reproduce the dev environments!
Building
Hatch is also a decent build tool. It has its own build backend called hatchling, and the pyproject.toml comes with some configurations for building. From the example:
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[tool.hatch.build.targets.sdist]
exclude = [
"/.github",
"/docs",
]
[tool.hatch.build.targets.wheel]
packages = ["src"]

We have to specify the build backend because Hatch can also be used just as a front end for all the features it offers, while building with another tool. We can then specify some options about what to put into the two build formats, sdist and wheel. For a toy project like this one, however, there is not much to include.
Then we can simply run hatch build and it will build both targets. Or we can build just one with hatch build -t wheel, for instance.
When you also have credentials for a PyPI server, be it the official one, the test server, or a private server, you can publish your built package there by simply running hatch publish.
Of course, publishing a package is something that should be done with care, after ensuring that the software works. But then Hatch takes care of all the details, and we just need to run two simple commands.
And if you have a CI/CD pipeline enabled, like GitHub Actions, you can also specify these commands there for even more automation!
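As a sketch, a GitHub Actions job that builds and publishes on every tagged release could look like the following; the workflow name, trigger, and secret name are hypothetical choices, and this assumes Hatch's publish credentials are supplied through the HATCH_INDEX_USER and HATCH_INDEX_AUTH environment variables:

```yaml
# .github/workflows/publish.yml (hypothetical workflow)
name: publish
on:
  push:
    tags: ["v*"]          # run only on version tags, e.g. v1.2.0

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install hatch
      - run: hatch build   # builds both sdist and wheel
      - run: hatch publish
        env:
          HATCH_INDEX_USER: __token__
          HATCH_INDEX_AUTH: ${{ secrets.PYPI_TOKEN }}  # hypothetical secret name
```

Gating the job on tags keeps accidental publishes out, while the build-and-publish pair stays identical to what you would run locally.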
What we did not cover
Making a codebase easily usable can be a challenging problem and, while build automation is a useful step in the right direction, it is far from everything that is needed.
We already mentioned documentation in the introductory story and we will not dwell on it further.
Another important aspect is a thorough test suite, which serves three main purposes. First, it helps verify that the code is correct: anybody can read the tests and check whether they prove the right thing, and if the tests pass, we have some guarantees about code correctness. Second, it serves as a safety net against regressions. When we change a codebase, in particular one we do not know well, a thorough test suite can catch a modification that breaks some expected behavior. Third, if the documentation is not exhaustive enough, tests can show us how to use functions or classes we struggle to understand, since they are, by their very nature, code examples.
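As an illustration of the third point, a test can double as a usage example; the helper function and its behavior below are hypothetical, not taken from the Petognizer story:

```python
# Hypothetical helper: normalize a pet-breed label before a lookup.
def normalize_breed(name: str) -> str:
    """Lowercase a breed label, trim it, and join words with underscores."""
    return name.strip().lower().replace(" ", "_")


# A reader who finds no docs still learns the expected inputs and outputs here.
def test_normalize_breed():
    assert normalize_breed("  Golden Retriever ") == "golden_retriever"
    assert normalize_breed("cat") == "cat"
```

Running the suite with pytest both checks the behavior and keeps these examples honest: if the function changes, the "documentation" fails loudly.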
Another important missing piece is a continuous integration/continuous delivery (CI/CD) pipeline, which runs checks every time code is pushed to the remote repository and can also build and deploy the software. A CI/CD pipeline makes the expected code quality visible, since it can run linters or static type checkers (valuable for dynamic languages like Python), as well as tests. A build-and-deploy step in the pipeline also ensures that those actions are easy to perform and have a high enough level of automation.
Conclusions
Hatch is a great automation tool: it requires some maintenance and the specification of some configuration options, but it then pays off greatly by significantly simplifying environment creation and builds.
When developers can easily work with software, without spending hours figuring out how to set it up, it is easier for them to be productive and also to fill in the missing parts of the documentation, because they can run the code.
Thank you for reading this far. It was a long read, but I hope I have convinced you that, with tools like Hatch, you can greatly improve the developer experience for yourself and your fellow developers.
Bibliography
Some books definitely influenced this article. The most prominent is The Unicorn Project, written by the famous DevOps author Gene Kim and published by IT Revolution. In it, a senior developer is exiled to a dysfunctional department where developers cannot run the code locally. Over the course of the novel, she and a group of enlightened colleagues do their best to free their coworkers from thousands of blockers and from bureaucracy, letting them do what they are actually paid for.
Grokking Continuous Delivery by Christie Wilson, published by Manning, is a guide on how to build a CI/CD pipeline starting from scratch, improving your codebase in the process.
Re-Engineering Legacy Software by Chris Birchall, also published by Manning, takes a comprehensive look at legacy software: how to work with it when we are assigned to it, how to improve it, and how to make sure that we are not writing legacy software ourselves.
Medium Membership
Do you like my writing and are you considering a Medium Membership to get unlimited access to articles?
If you subscribe through this link, you will support me through your subscription at no additional cost to you: https://medium.com/@mattiadg/membership