Red Line Station: Lessons learned: Python packaging

So in the last few weeks I've slowly been putting all my code into a Python package on PyPI. (Links: GitHub project, PyPI.)

It's still pretty buggy--hence the pre-alpha designation--and it's not as extensively built-out as some of my older code, but it's under active development and hopefully will pass the old code before long.

I took on this project for two reasons:

As a public service for people who want to do everything in Python. (Python is a great general-purpose language to learn, though for any given specific task there's probably a superior option, like R, Julia, or C++.)
To teach myself some software engineering-related skills.

One of Python's weaknesses is that the process of writing documentation and building and uploading packages isn't clearly explained online.

I can't claim all my tips here are 100% good practice, but they got things to work. Getting things to work was surprisingly frustrating once my project became nontrivial.

Below, I list some pain points that I learned how to handle. Reminder: you can see the code on GitHub if you want specifics or to see the fixes in context.

Organizing a project

Use subdirectories. Each one should contain __init__.py. In each __init__, write
__all__ = []
and then fill the list with (quoted) names of modules in that folder. (You can do this manually or algorithmically.)
This ensures that when you import * from the subdirectory:
from mypackage.mydir import *
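every module named in __all__ gets imported, so you can call its methods as module.method(). (If you want the methods themselves available directly, import them in __init__.py and add their names to __all__ too.)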
Don't forget to update your top __init__.py either.
In your setup.py, use
packages=setuptools.find_packages()
as one of your kwargs in setuptools.setup(). As you build out your package, this will help ensure nothing gets lost. (You'll have to import setuptools, of course.)
Have a separate data folder. Don't commit data files to git unless you can't regenerate them easily. But you can include them in your MANIFEST.in file if you want.
Add everything you want to include in the package to your MANIFEST.in. Only Python files are included by default. (For example, start by including your LICENSE.txt and README.rst.)
Add files to your gitignore early. It's easy to start tracking a file, but annoying to stop tracking it and delete earlier tracking (and fooling around with git can lead to catastrophe-level deletion by accident).

Coding

Writing tests after the fact is hard. Use test-driven development if possible--it'll also help you structure your code to be as modular as possible. If it's too late for that, at least use assert statements liberally, wherever something should always hold true but you're not 100% sure it will (having only tested a basic case or two while writing the original code).
Make modules short and methods modular. Don't worry too much about efficiency/run time unless your package does computationally intensive work.
Use feather-format for fast data frame read/write.
It's beautifulsoup4, not just beautifulsoup.
Make decorators for tasks like error logging, rate-limiting requests, caching results of methods (via functools), trying to access a site a few times in case you get an unexpected timeout or something, etc. I confess I haven't quite figured out how to make all of this work, so I wrote methods to do similar tasks--they're really helpful.
The caching decorator from functools needs all the arguments to be hashable, which is annoying. Keep that in mind--you won't be able to pass dataframes, dictionaries, lists, or other mutable objects as arguments. Sometimes, you can use tuples, but other times, you'll need to refactor to avoid the situation.
Writing one-line methods is often worth it. For example, the method to access a particular file--if you decide to change the file path down the road, you only have one change to make.
Try to make syntax as parallel as possible. E.g. I use get_parsed_pbp, get_parsed_toi, get_raw_pbp, get_raw_toi, get_parsed_pbp_filename, etc. Makes method names easy to remember.
Use aliases and be consistent. If you have to import another module in the same package, you could use, for example

from mypackage.mysubdir import module as mod

or

import mypackage.mysubdir.module as mod

Whichever form you pick, use the same alias for that module everywhere.

Add a blank line between the intro text and :param lines of your docstring, and again between the :param and :return:. This ensures that Sphinx reads it in correctly.
If you need multiple lines for anything in :param and :return:, indent. That way Sphinx doesn't get confused.
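
To make the two docstring tips concrete, here's a small sketch. The function pbp_filename, its arguments, and the filename pattern are invented for illustration; the parts to look at are the blank lines and the indented continuation lines.

```python
def pbp_filename(season, game):
    """
    Build the filename where a game's parsed play-by-play is stored.

    :param season: int, the season the game was played in. Continuation
        lines like this one are indented so Sphinx keeps them attached
        to the same parameter.
    :param game: int, the game number, e.g. 20001

    :return: str, a filename of the form [season]_[game]_parsed_pbp.feather
    """
    return "{0:d}_{1:d}_parsed_pbp.feather".format(season, game)
```

There's one blank line after the intro sentence and another between the last :param and the :return:.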

Uploading to PyPI

Just follow this process:

Create an account with PyPI
pip install twine (if necessary)
pip install wheel (if necessary)
cd into your project's top folder
python setup.py sdist
python setup.py bdist_wheel
twine upload dist/* and enter your username and password

If you want to upload things later:

Change the package version in setup.py
Remove the old whl and tar.gz files from your dist folder (you can delete them, or if you really want, move them to a new folder called olddist)
Follow process above

Some of the commands don't overwrite your old stuff, so if you don't clear things out and change the version, when you try to install again, you'll just have the old version.

Also, you can install a package from whl using

pip install [filename].whl

So before uploading to PyPI, I recommend creating a virtualenv or conda environment, making sure your whl installs properly in there, and only then uploading to PyPI. Otherwise you'll have to increment your version again and again and again.
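
If you'd rather script that check than do it by hand, something like this sketch might work. It uses the standard library's venv module instead of virtualenv or conda, and the function names venv_python and wheel_installs_cleanly are my own. Note that pip will still try to download your package's dependencies, so it needs a network connection.

```python
import subprocess
import sys
import tempfile
import venv
from pathlib import Path


def venv_python(env_dir):
    """Return the path to the Python executable inside a virtual environment."""
    bin_dir = "Scripts" if sys.platform == "win32" else "bin"
    return Path(env_dir) / bin_dir / "python"


def wheel_installs_cleanly(wheel_path):
    """Install a wheel into a throwaway environment; True if pip succeeds."""
    with tempfile.TemporaryDirectory() as env_dir:
        # with_pip=True makes sure pip is available inside the new environment
        venv.EnvBuilder(with_pip=True).create(env_dir)
        result = subprocess.run(
            [str(venv_python(env_dir)), "-m", "pip", "install", str(wheel_path)]
        )
        return result.returncode == 0
```

You'd call it as wheel_installs_cleanly("dist/[filename].whl") and only run twine upload if it returns True. This only tells you the install worked, not that the package imports or behaves correctly.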

It's also good to use Travis CI. It'll require you to create a .travis.yml file in your project's main directory, where you specify Python versions, platforms, packages you need, etc. For example, the top of a .travis.yml could read:

language: python
python:
- "3.6"
install:
- pip install numpy

Writing documentation with Sphinx

Create a nice README.rst. My project wasn't working for weeks but at least that pretty readme made me feel good.
Create a docs folder in your project and set up Sphinx there. Make sure autodoc gets enabled and there are separate source and build directories.
You'll need to add paths to the rest of your project in conf.py. Near the top, where you see the part about adding paths in, just add a bunch of sys.path.insert statements using relative paths until things work. (You can clean it up later.)
In conf.py, also update the package version. Also change the pygments_style and html_theme to something you like.
Be careful about indentation in your .rst files. For example, PyCharm, if you let it auto-format before pushing code to a git repository, might change your three-space indent to four spaces or to zero spaces. (This kept messing up my toctree.)
In your toctree, the entries have to line up with the first colon in maxdepth. In general, line up text with that colon, not with the text after it.
Yes, you do have to outline all the pages of your docs. But it's not too hard. Just make some basic pages and use
.. automodule:: [module name]
   :members:
You can reorganize things from there.
You can include your readme in your index.rst:
.. include:: ../../README.rst
   :start-after: inclusion-marker-for-sphinx
Then I just inserted the marker before my first header (but after the code badges) in my readme:
.. inclusion-marker-for-sphinx

Introduction
============
[...]
If everything is building properly on your machine, you can try to upload to Read The Docs. There are some settings there you may have to fiddle with (like the virtualenv when you need numpy). If your build fails, or the docs don't look like they do on your system, check the build logs. (I found issues with absolute imports needing to be changed to relative and python package version and dependency issues, for example.)
Use your browser to view the docs you create. You can just refresh the page when you rebuild.
Pay attention to the error messages Sphinx gives you--it may still finish building even when the docs don't turn out quite right:

If a module can't import properly, automodule may not show anything.
Again, if you don't make proper use of indentation and blank lines, method docs (for example) may show up as a wall of text rather than a list of parameters.
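
Pulling the conf.py tips together, the relevant parts might look roughly like this. I'm assuming conf.py lives in docs/source (so the project root is two levels up); the version numbers are placeholders, and the style and theme are just ones that come with Sphinx.

```python
# docs/source/conf.py (excerpt)
import os
import sys

# Assumption: conf.py sits in docs/source, so the project root is two
# levels up. Add more sys.path.insert lines if autodoc can't find a module.
sys.path.insert(0, os.path.abspath(os.path.join("..", "..")))

extensions = ["sphinx.ext.autodoc"]  # lets automodule pull in docstrings

version = "0.1"    # short X.Y version (placeholder value)
release = "0.1.0"  # full version string (placeholder value)

pygments_style = "sphinx"  # code highlighting style
html_theme = "alabaster"   # a theme that ships with Sphinx
```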

BONUS: Flask

I'm building out a Flask app for this package as well. Flask was also tricky to get up and running beyond "Hello, World." Here's what I learned:

You need to create HTML templates for each page. Kind of like for Sphinx, this can just be an outline (HTML in this case) and you can use some special syntax to insert arguments Python will pass to it. You can also have if-else and for-loops, and more.
Beware of caching
You can generate and show an MPL image. The flow goes like this:

The user accesses a URL, like '/games/20001/'
That request triggers a method
The method tries to render a template
The template tries to get an image from another page (use img src = {{ url_for...)
That request triggers a method
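In your original .py file, you should have a method that does this:
import io
from flask import send_file
# Inside the method that serves the image:
fig = [get image]
img = io.BytesIO()  # in-memory buffer, so nothing is written to disk
fig.savefig(img, format='png')
img.seek(0)  # rewind so send_file reads from the start
return send_file(img, mimetype='image/png')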
So what essentially happens is when the user goes to a url, you render a template that requests a different url, and when that request comes in, the method returns an image.
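
Here's a minimal sketch of that whole flow in one file. It needs flask and matplotlib installed. The route names, the toy plot, and the inline template are my assumptions to keep it self-contained; a real app would keep the HTML in a templates folder and call render_template.

```python
import io

from flask import Flask, render_template_string, send_file
from matplotlib.figure import Figure

app = Flask(__name__)

# Inline stand-in for templates/game.html. The img tag asks a second URL
# for the picture, and url_for builds that URL from the method name.
GAME_PAGE = """
<h1>Game {{ game_id }}</h1>
<img src="{{ url_for('game_image', game_id=game_id) }}">
"""


@app.route("/games/<int:game_id>/")
def game_page(game_id):
    """First request: render the page that contains the img tag."""
    return render_template_string(GAME_PAGE, game_id=game_id)


@app.route("/games/<int:game_id>/image.png")
def game_image(game_id):
    """Second request (made by the browser): draw the figure, return PNG bytes."""
    fig = Figure()
    ax = fig.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])  # toy data standing in for a real chart
    ax.set_title("Game {0:d}".format(game_id))

    img = io.BytesIO()
    fig.savefig(img, format="png")
    img.seek(0)  # rewind so send_file reads from the start
    return send_file(img, mimetype="image/png")
```

Building the figure with matplotlib.figure.Figure rather than pyplot avoids keeping global figure state around between requests.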

That's it for now. I'll add to this later if anything else comes up. Hopefully, though, between the usual online references and this, you'll have a less painful experience writing and documenting a Python package.


Monday, November 6, 2017
