Building and Publishing Docker with GitHub Actions

In this post I am sharing my recipe for building and publishing Docker images using GitHub Actions. It concisely wraps up a few steps that beginners often find problematic. In particular:

  • use GitHub secrets to securely store credentials, such as $DOCKER_USER and $DOCKER_PASSWORD, for your Docker registry (such as DockerHub or the GitHub Container Registry)
  • log in to the Docker registry via the CLI rather than through a less transparent GitHub Action; it is as simple as docker login -u $DOCKER_USER -p $DOCKER_PASSWORD
  • use the correct tag pattern when pushing your image to a registry
    • DockerHub: $DOCKER_USER/$IMAGE_NAME:$IMAGE_VERSION
    • GitHub Container Registry: ghcr.io/$GITHUB_USER/$IMAGE_NAME:$IMAGE_VERSION

The sample code is shown below. See it in action on production here and in this template.

name: docker-image

on:
  push:
    branches: [ "main" ]
    paths: ["Dockerfile",".github/workflows/docker-image.yaml"]
  workflow_dispatch:


jobs:
  build-and-publish:
    runs-on: ubuntu-latest
    # Docker tags and credentials for DockerHub/GitHub Containers, customize!
    env:
      IMAGE_NAME: plantuml-docker
      IMAGE_VERSION: latest
      DOCKER_USER: ${{ secrets.DOCKER_USER }}
      DOCKER_PASSWORD: ${{ secrets.DOCKER_PASSWORD }}
      GITHUB_TOKEN: ${{ secrets.PAT }}
      GITHUB_USER: ${{ github.actor }}
    steps:
    - uses: actions/checkout@v3
    - name: Build and tag the image
      run: |
        docker build . \
        --tag $DOCKER_USER/$IMAGE_NAME:$IMAGE_VERSION \
        --tag ghcr.io/$GITHUB_USER/$IMAGE_NAME:$IMAGE_VERSION
    - name: Publish to DockerHub
      if: env.DOCKER_PASSWORD != ''
      run: |
        docker login -u $DOCKER_USER -p $DOCKER_PASSWORD
        docker push $DOCKER_USER/$IMAGE_NAME:$IMAGE_VERSION
    - name: Publish to GitHub Container registry
      if: env.GITHUB_TOKEN != ''
      run: |
        docker login ghcr.io -u $GITHUB_USER -p $GITHUB_TOKEN 
        docker push ghcr.io/$GITHUB_USER/$IMAGE_NAME:$IMAGE_VERSION

Effective Caching with GitHub Actions

GitHub Actions is great as a CI/CD platform. However, to be really efficient, workflows need to leverage optimization techniques such as caching or running tasks in parallel. In this note, I am sharing some thoughts on how to use caching effectively, with respect to multiple paths and sudo-installed APT packages. The discussion touches on a few non-trivial aspects that, in my opinion, are not well explained in other web materials.

My use case was simple: speeding up the build of a university course in Sphinx, hosted on GitHub Pages. Installing the Sphinx dependencies required multiple Python and Linux APT downloads, which took quite a long time. Caching indeed fixed the problem, and here are the key takeaways:

  • cache several paths in a single step by listing them under path (here, the Python virtual environment .venv and the APT cache directory .apt)
  • redirect the APT cache into the workspace with apt-get -o Dir::Cache=".apt", so that packages installed via sudo can be cached as well
  • key the cache on a hash of the workflow files, and skip reinstalling the Python dependencies on a cache hit (steps.cache_deps.outputs.cache-hit)

This repository demonstrates the working solution. The cache (Python and APT packages) is about 120 MB. This is what the job looks like:

name: docs

on: [push, pull_request, workflow_dispatch]

jobs:
  docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
      - name: prepare virtual environment
        run: |
          python -m venv .venv
          mkdir .apt
      - name: cache dependencies
        id: cache_deps
        uses: actions/cache@v3
        env:
            cache-name: cache-dependencies
        with:
          path: |
            .venv 
            .apt
          key: ${{ runner.os }}-build-${{ env.cache-name }}-${{ hashFiles('.github/workflows/*') }}
      - name: Install python dependencies
        if: ${{ steps.cache_deps.outputs.cache-hit != 'true' }}
        run: |
          source .venv/bin/activate
          pip install jupyter-book
          pip install sphinxcontrib-plantuml
      - name: Install sudo dependencies
        run: |
          sudo apt-get -o Dir::Cache=".apt" update
          sudo apt-get -o Dir::Cache=".apt" install plantuml
      - run: |
          apt-config dump | grep Dir::Cache
      - name: Compile Docs
        run: |
          source .venv/bin/activate
          jupyter-book build docs
      - name: Deploy to gh-pages
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_branch: gh-pages
          publish_dir: ./docs/_build/html

Making SSH work by proxy

It is a popular misconception that hiding encrypted connections (SSH) behind a proxy is a dark art reserved for criminal activity. You may need a Russian or Iranian proxy to get your coding job done when the firewall of your favourite coffee place, or the Wi-Fi you use while travelling, blocks SSH.

As this happens to me regularly – I travel a lot – I would like to share a solution here. The most effective approach is to proxy the traffic through port 443 (the default HTTPS port, which is typically open). Testing the list of free proxy servers [1], we find that the proxy 185.82.139.1 (Iranian) works well. It remains to add a proxy instruction to the SSH configuration as shown below. That's it, and I can work with GitHub in my favourite coffee place in France. Enjoy!

# content of .ssh/config
Host github.com
    HostName github.com
    User git
    Port 22
    IdentityFile ~/.ssh/id_rsa
    StrictHostKeyChecking no
    ProxyCommand ncat --proxy 185.82.139.1:443 github.com 22
  1. free-proxy.cz. List of free proxies for port 443. http://free-proxy.cz/en/proxylist/port/443/ping

Free and robust Tweets extraction

As anticipated by many, Twitter stopped offering its (limited!) API for free [1].

Now, what options do you have to programmatically access the public content for free?
In this context, it is worth mentioning the library snscrape, a (currently well-maintained) tool for extracting content from social media services such as Facebook, Instagram or Twitter [2]. I have just given it a go as part of a research project I am working on, and would love to share some thoughts and code.

The basic usage is pretty simple, but I added multithreading to speed things up by executing queries in parallel (an established way of handling I/O-bound operations). I also prefer a functional/pipeline style of composing Python commands, using generators, filter and map. The code snippet below (see also the Colab notebook) shows how to extract the tweets of some top futurists. Enjoy!

# install the social media scraper: !pip3 install snscrape
import snscrape.modules.twitter as sntwitter
import itertools
import multiprocessing.dummy as mp # for multithreading 
import datetime
import pandas as pd

start_date = datetime.datetime(2018,1,1,tzinfo=datetime.timezone.utc) # from when
attributes = ('date','url','rawContent') # what attributes to keep

def get_tweets(username,n_tweets=5000,attributes=attributes):
    tweets = itertools.islice(sntwitter.TwitterSearchScraper(f'from:{username}').get_items(),n_tweets) # invoke the scraper
    tweets = filter(lambda t:t.date>=start_date, tweets)
    tweets = map(lambda t: (username,)+tuple(getattr(t,a) for a in attributes),tweets) # keep only attributes needed
    tweets = list(tweets) # materialise the results inside the worker (generators are lazy)
    return tweets

# a list of accounts to scrape
user_names = ['kevin2kelly','briansolis','PeterDiamandis','michiokaku']

# parallelise queries for speed ! 
with mp.Pool(4) as p:
    results = p.map(get_tweets, user_names)
    # combine
    results = list(itertools.chain(*results))
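
Since pandas is already imported, the combined results can be put into a DataFrame for further analysis; a minimal sketch (the column names simply mirror the attributes defined above):

# assemble the scraped tweets into a DataFrame; columns follow the tuples built in get_tweets
df = pd.DataFrame(results, columns=('username',) + attributes)
print(df.shape)
df.head()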
  1. @TwitterDev. Twitter announces stopping free access to its API. Twitter Dev Team. Published February 3, 2023. Accessed February 15, 2023. https://twitter.com/TwitterDev/status/1621026986784337922?s=20
  2. snscrape. GitHub repository. Accessed February 15, 2023. https://github.com/JustAnotherArchivist/snscrape

Fourier integrals vanishing on large circles

When evaluating contour integrals, it is often of interest to prove that Fourier-type integrals vanish on large enough semicircles (see the figure). This holds under the following condition:

Theorem. Suppose that $$f(z)=O(|z|^{-a}), \quad a>0$$ for \(z\) in the upper half-plane. Then for any \(\lambda > 0\) we have $$\int_{\gamma_R} f(z)\mathrm{e}^{i\lambda z}\,\mathrm{d}z \rightarrow 0, \quad R\to+\infty,$$ where \(\gamma_R\) is the upper half-circle of radius \(R\).

This result is stronger than other criteria for integrals vanishing on contours in the upper half-plane; compare, for instance, with the MIT lecture notes by Jeremy Orloff [1]. The version above can be found in advanced books on Fourier transforms, for example [2].

To prove this, parametrize the upper half-circle \(\gamma_R\) by \(z=R\mathrm{e}^{i\theta} = R(\cos\theta + i\sin\theta)\), where \(0<\theta<\pi\). Under this parametrization, the Fourier factor becomes \(\mathrm{e}^{i\lambda z} = \mathrm{e}^{-\lambda R \sin \theta}\mathrm{e}^{i \lambda R \cos\theta}\). Thus, the integral can be bounded by $$ \left|\int_{\gamma_R} f(z)\mathrm{e}^{i\lambda z}\,\mathrm{d}z\right|\leqslant \int_{0}^{\pi} |f(R\mathrm{e}^{i\theta})|\, R\, \mathrm{e}^{-R\lambda \sin\theta} \,\mathrm{d}\theta \\
\leqslant C\int_{0}^{\pi} R^{1-a} \mathrm{e}^{-R\lambda \sin\theta} \,\mathrm{d}\theta = 2C\int_{0}^{\frac{\pi}{2}} R^{1-a} \mathrm{e}^{-R\lambda \sin\theta} \,\mathrm{d}\theta \\
\leqslant 2C\int_{0}^{\frac{\pi}{2}} R^{1-a} \mathrm{e}^{-2 R\lambda \theta / \pi} \,\mathrm{d}\theta \\
= C\cdot \frac{\pi R^{-a} \left(1 - \mathrm{e}^{- R \lambda}\right)}{\lambda},$$
where the equality in the second line uses the symmetry of \(\sin\theta\) about \(\theta=\pi/2\), and the last inequality uses Jordan's inequality \(\sin\theta\geqslant 2\theta/\pi\) on \([0,\pi/2]\). The bound tends to zero as \(R\to\infty\), since \(a>0\).
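
As a sanity check, the final elementary integral can be verified symbolically; a small SymPy sketch (the symbol names are mine):

import sympy as sp

# verify the closed form of 2*Integral_0^{pi/2} R^(1-a) * exp(-2*R*lambda*theta/pi) dtheta  (i.e. with C = 1)
R, lam, a, theta = sp.symbols('R lambda a theta', positive=True)
bound = 2 * sp.integrate(R**(1 - a) * sp.exp(-2*R*lam*theta/sp.pi), (theta, 0, sp.pi/2))
print(sp.simplify(bound))  # expected: pi*R**(-a)*(1 - exp(-R*lambda))/lambda, up to rearrangement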

  1. Orloff J. Definite integrals using the residue theorem. MIT Lecture Notes. Accessed 2023. https://math.mit.edu/~jorloff/18.04/notes/topic9.pdf
  2. Spiegel MR. Laplace Transforms. McGraw-Hill; 1965.

Expanding Inverse Functions

The problem of inverting the function $$y=f(x)$$ in the form of a power series $$x = a + \sum_{k=1}^{\infty} b_k (y-f(a))^k$$

around a point of interest \(x=a\) has a long history. Lagrange obtained a theoretical inversion formula [1], yet efficient implementations are relatively recent [2,3].

In this note I am sharing a somewhat simpler algorithm, which performs Newton-like updates:
$$x_{k} = x_{k-1}-f(x_{k-1})[(y-f(a))^{k}]\cdot \frac{(y-f(a))^{k}}{f'(a)},\quad x_1 = a+\frac{y-f(a)}{f'(a)}.$$
We can see that it gradually produces more and more accurate terms. More precisely, assuming w.l.o.g. \(a=f(a)=0\), suppose \(f(x) = y + O(y^{k})\). Then \(f(x + h) = f(x) + f'(0)h + O(y^{k+1})\) for \(h=O(y^k)\) (by Taylor's expansion, since \(f'(x)=f'(0)+O(y)\)), and so the update \(h=-\frac{f(x)[y^k]}{f'(0)}\,y^k\) (where \([z^k]\) denotes extracting the coefficient of \(z^k\)) gives \(f(x+h)=y+O(y^{k+1})\), as desired.
I implemented this as a proposal for SymPy.
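
A minimal SymPy sketch of this update rule, around \(a=0\) and assuming \(f(0)=0\), \(f'(0)\neq 0\) (the function and variable names are mine, not those of the SymPy proposal):

import sympy as sp

def invert_series(f, x, y, n=8):
    """Return x(y) such that f(x(y)) = y + O(y**(n+1)), expanding around a = 0."""
    fprime0 = sp.diff(f, x).subs(x, 0)
    inv = y / fprime0                                   # first-order guess x_1
    for k in range(2, n + 1):
        # expand f at the current approximation as a series in y
        fx = sp.series(f.subs(x, inv), y, 0, k + 1).removeO()
        c = sp.expand(fx).coeff(y, k)                   # residual coefficient at y**k
        inv -= c / fprime0 * y**k                       # Newton-like correction
    return sp.expand(inv)

# example: inverting f(x) = exp(x) - 1 recovers the series of log(1 + y)
x, y = sp.symbols('x y')
print(invert_series(sp.exp(x) - 1, x, y, n=6))
# -> y - y**2/2 + y**3/3 - y**4/4 + y**5/5 - y**6/6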

References

  1. Lagrange Inversion Theorem. Wolfram MathWorld. https://mathworld.wolfram.com/LagrangeInversionTheorem.html
  2. Brent, Richard P., and Hsiang T. Kung. "Fast algorithms for manipulating formal power series." Journal of the ACM (JACM) 25, no. 4 (1978): 581-595.
  3. Johansson, F., 2015. A fast algorithm for reversion of power series. Mathematics of Computation, 84(291), pp. 475-484.

Monitoring Azure Experiments

Azure Cloud is a popular work environment for many data scientists, yet many features remain poorly documented. This note shows how to monitor Azure experiments in a handier and more detailed way than through the web or CLI interface.

The trick is to create a dashboard of experiments and their respective runs, up to a desired level of detail, from Python. The workhorse is the following handy utility function:

from collections import namedtuple

def get_runs_summary(ws):
    """Summarise all runs under a given workspace, with experiment name, run id and run status
    Args:
        ws (azureml.core.Workspace): Azure workspace to look into
    """
    # NOTE: extend the scope of run details if needed
    record = namedtuple('Run_Description',['job_name','run_id','run_status'])
    for exp_name,exp_obj in ws.experiments.items():
        for run_obj in exp_obj.get_runs():
            yield record(exp_name, run_obj.id, run_obj.status)

Now it’s time to see it in action 😎

# get the default workspace
from azureml.core import Workspace
import pandas as pd

ws = Workspace.from_config()

# generate the job dashboard and inspect
runs = get_runs_summary(ws)
summary_df = pd.DataFrame(runs)
summary_df.head()
# count jobs by status
summary_df.groupby('run_status').size()

Use the dashboard to automatically manage experiments. For example, to kill running jobs:

from azureml.core import Experiment, Run

for exp_name,run_id in summary_df.loc[summary_df.run_status=='Running',['job_name','run_id']].values:
    exp = Experiment(ws,exp_name)
    run = Run(exp,run_id)
    run.cancel()

Check the Jupyter notebook in my repository for a one-click demo.

Repo Passwords in Poetry


Poetry, a popular Python package manager, prefers to use keyring to manage passwords for private code repositories. Storing passwords in plain text is a secondary option, but it may be needed in case of issues either in Poetry itself or with the keyring configuration (it may not be properly installed, may be locked, etc.). To disable the use of the system keyring by Poetry, set the null backend in the environment variable PYTHON_KEYRING_BACKEND:

(.venv) azureuser@sense-mskorski:~/projects/test_project$ export PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring
(.venv) azureuser@sense-mskorski:~/projects/test_project$ poetry config http-basic.private_repo *** '' -vvv
Loading configuration file /home/azureuser/.config/pypoetry/config.toml
Loading configuration file /home/azureuser/.config/pypoetry/auth.toml
Adding repository test_repo (https://***) and setting it as secondary
No suitable keyring backend found
No suitable keyring backends were found
Keyring is not available, credentials will be stored and retrieved from configuration files as plaintext.
Using a plaintext file to store credentials

Analogously, one of the working backends can be enabled (make sure it works correctly!). This is how to list the available backends:

(.venv) azureuser@sense-mskorski:~/projects/test_project$ keyring --list-backends
keyring.backends.chainer.ChainerBackend (priority: -1)
keyring.backends.fail.Keyring (priority: 0)
keyring.backends.SecretService.Keyring (priority: 5)
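
The backend actually in use can also be checked from Python; a small sketch with the keyring API:

import keyring

# print the backend currently selected by keyring (it respects PYTHON_KEYRING_BACKEND)
print(keyring.get_keyring())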

Marking Python Tests as Optional

Often, some tests should be run only on demand rather than on every CI/CD run: for instance, they may be slow, or they may work only locally with protected data. This note shows how to declare tests as optional in pytest, the leading testing framework for Python. The article is inspired by the pytest documentation on test markers.

The trick is to mark extra tests with a decorator and couple it with a runtime flag.
The first step, marking the relevant tests, is illustrated below:

import pytest

@pytest.mark.optional
def test_addition():
    assert 1+1==2

The second step is to teach pytest how to interpret the decorator and the new flag; it is also worth describing the marker in the help output (pytest --markers). The code below, placed in conftest.py, shows how to set this up:

import pytest

def pytest_addoption(parser):
    """Introduce a flag for optional tests"""
    parser.addoption(
        "--optional", action="store_true", default=False, help="run extra tests"
    )

def pytest_collection_modifyitems(config, items):
    """Instruct the testing framework to obey the decorator for optional tests"""
    if config.getoption("--optional"):
        # --optional given in the CLI: do not skip optional tests
        return
    # otherwise, skip tests marked as optional
    skip_extras = pytest.mark.skip(reason="need --optional flag to run")
    for item in items:
        if "optional" in item.keywords:
            item.add_marker(skip_extras)

def pytest_configure(config):
    """Make the decorator for optional tests visible in help: pytest --markers"""
    config.addinivalue_line("markers", "optional: mark test as optional to run")

To see it in action, run pytest -k battery --optional 😎

test_battery.py::test_addition PASSED

Debug CI/CD with SSH

CircleCI is a popular platform for continuous integration (CI) and continuous delivery (CD). While its job status reports are already useful, one can gain much more insight by debugging jobs in real time. Here I am sharing a real use case of debugging a failing job that deploys an app 🙂

Failing jobs are reported in red, and most of the errors caught during execution appear in the terminal. In this case, the environment is unable to locate Python:

The failed job and caught errors.

The error itself is not very informative. 😮 For more insight, let's re-run the job in SSH mode:

Re-running the failed CircleCI with SSH.

You will be welcomed with instructions on how to connect via SSH:


CircleCI showing SSH connection string.

Use these instructions to connect and inspect the environment at its failure stage:

mskorski@SHPLC-L0JH Documents % ssh -p 64535 aa.bbb.cc.dd
The authenticity of host '[aa.bbb.cc.dd]:64535 ([aa.bbb.cc.dd]:64535)' can't be established.
ED25519 key fingerprint is SHA256:LsMhHb5fUPLHI9dFdyig4VKw44GTqrA2dkEWT0sZx4k.
Are you sure you want to continue connecting (yes/no)? yes

circleci@bc95bb40fff3:~$ ls project/venv/bin -l
total 300
...
lrwxrwxrwx 1 circleci circleci    7 Aug  1 12:37 python -> python3
lrwxrwxrwx 1 circleci circleci   49 Aug  1 12:37 python3 -> /home/circleci/.pyenv/versions/3.8.13/bin/python3
lrwxrwxrwx 1 circleci circleci    7 Aug  1 12:38 python3.9 -> python3

Bingo! The terminal warns about broken symbolic links:

Terminal highlights broken symbolic links.

The solution in this case was to refresh the cache. Issues may be far more complex than that, but being able to debug them live comes to the rescue. 😎