Tracking Training Performance of LDA Models

In a recent open-source contribution, I enabled callbacks in the scalable (multi-core) implementation of Latent Dirichlet Allocation in the gensim library ​1​. This, in turn, allows users to tune the popular topic extraction model faster and more accurately.

An obvious use case is monitoring and early stopping of training with popular coherence metrics such as \(U_{mass}\) and \(C_V\) ​2​. On the 20 Newsgroups dataset, the training performance looks as follows:

Training performance of Multi-Core LDA on 20 Newsgroups data, monitored by callbacks.

The achieved scores are decent, actually better than those reported in the literature​3​, but this may be due to preprocessing rather than early stopping. A full example is shared in this Kaggle notebook.
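To make the early-stopping idea concrete, here is a minimal sketch in plain Python. It is not the gensim API: the function name `early_stopping`, the `patience` and `min_delta` parameters, and the sample scores are all hypothetical. In practice the scores would come from a coherence callback evaluated after each training pass.

```python
from typing import Iterable, List, Tuple


def early_stopping(
    scores: Iterable[float],
    patience: int = 2,
    min_delta: float = 0.0,
) -> Tuple[int, float, List[float]]:
    """Consume per-pass coherence scores and stop once they stop improving.

    Higher scores are treated as better, which holds for both U_mass and C_V.
    Returns (best_pass, best_score, history_of_consumed_scores).
    """
    best_pass, best_score = -1, float("-inf")
    history: List[float] = []
    for current_pass, score in enumerate(scores):
        history.append(score)
        if score > best_score + min_delta:
            # new best model: remember where it happened
            best_pass, best_score = current_pass, score
        elif current_pass - best_pass >= patience:
            # no improvement for `patience` consecutive passes
            break
    return best_pass, best_score, history


# made-up U_mass scores, one per training pass
best_pass, best_score, history = early_stopping([-2.0, -1.5, -1.2, -1.3, -1.25, -1.1])
print(best_pass, best_score, len(history))
```

With a patience of 2, the loop above stops after the fifth score and never sees the last one, which is the usual trade-off: a smaller patience saves training time but may miss a late improvement.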

  1. Řehůřek R, Sojka P. Software Framework for Topic Modelling with Large Corpora. Unpublished. Published online 2010. doi:10.13140/2.1.2393.1847
  2. Röder M, Both A, Hinneburg A. Exploring the Space of Topic Coherence Measures. Proceedings of the Eighth ACM International Conference on Web Search and Data Mining. Published online February 2, 2015. doi:10.1145/2684822.2685324
  3. Zhang Z, Fang M, Chen L, Namazi Rad MR. Is Neural Topic Modelling Better than Clustering? An Empirical Study on Clustering with Contextual Embeddings for Topics. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Published online 2022. doi:10.18653/v1/2022.naacl-main.285

On UML Approach to Management Antipatterns

Therefore speak I to them in parables, because seeing, they see not, and hearing, they hear not, neither do they understand.

Matthew 13:13

Ever wondered how miserable some “prestigious” businesses are, and how they manage to make their employees make up for poor project management? Me too! A classical situation that contributes to crisis is miscommunication to subcontractors or employees. Let’s see how UML can be used to study such antipatterns. They happen unintentionally, don’t they? 🤔

This is a real-world use-case from a prestigious legal office located in Warsaw, Poland. I have been asked to capture project management antipatterns, as an external observer and modeller.

One use case was: an expert subcontractor asked proactively, in fact several times, to be put in the communication loop with the client. But the office executives didn’t find it necessary (why would they, huh?). Until… Guess when? The deadline! The subcontractor was caught by surprise: please deliver for the customer by today! But wait, what customer…? 🤔

Another use case: the office rushed into promising the client something it couldn’t deliver, and reached out to its experts for help pretty late. Guess when? On the deadline day!

Here is the UML model that I promised, a good illustration of this poor management practice! I will use a sequence diagram, a powerful tool to explore interactions 💪

You certainly agree this is not professional, but would probably argue that this doesn’t happen to Ernst & Young, PwC and other big companies… Would you?

Working with Abstract Syntax Trees

Visualizing code as a syntax tree is both fun and useful, as seen from impressive applications such as building SQL lineage, which helps to understand complex business queries. Abstract syntax trees are not only widely used in industry but are still a subject of top academic research​1,2​.

This post demonstrates how to work with ASTs in Python by parsing C code with Clang/LLVM​3​ and visualizing the result with graphviz.

Parsing is relatively simple, particularly for users who already have experience with similar tree structures, such as parsed XML. My advice for beginners is to avoid heavy code factoring and instead leverage the functional coding features of Python. The example below shows how to extract function declarations and the details of their arguments:

from clang.cindex import Index, Config, CursorKind, TypeKind

SCRIPT_PATH = "./tcpdump/print-ppp.c"

# C99 is a proper C code standard for tcpdump, as per their docs
index = Index.create()
translation_unit = index.parse(SCRIPT_PATH, args=["-std=c99"])

# filter to nodes in the root script (ignore imported!)
script_node = translation_unit.cursor
all_nodes = script_node.get_children()
all_nodes = filter(
    lambda c: c.location.file and c.location.file.name == SCRIPT_PATH, all_nodes
)

# filter to function nodes; keep them in a list so they can be iterated more than once
func_nodes = list(filter(lambda c: c.kind == CursorKind.FUNCTION_DECL, all_nodes))

# print attributes and their types for each function
for fn in func_nodes:
    for arg in fn.get_arguments():
        t = arg.type
        # handle pointers by describing their pointees
        if t.kind == TypeKind.POINTER:
            declr = t.get_pointee().get_declaration()
        else:
            declr = t.get_declaration()
        # canonical type, type kind, argument location, declaration location
        print(
            t.get_canonical().spelling,
            t.kind,
            f'arg declared in {arg.location.file}:L{arg.extent.start.line},C{arg.extent.start.column}-L{arg.extent.end.line},C{arg.extent.end.column}',
            f'{declr.spelling} declared in {declr.location.file}:L{declr.location.line}',
        )

This gives the following output when tested on the tcpdump project:

     struct netdissect_options * TypeKind.POINTER arg declared in ./tcpdump/print-ppp.c:L403,C39-L403,C59 netdissect_options declared in ./tcpdump/netdissect.h:L161
     const unsigned char TypeKind.ELABORATED arg declared in ./tcpdump/print-ppp.c:L403,C61-L403,C73 u_char declared in /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk/usr/include/sys/_types/_u_char.h:L30
     const unsigned int TypeKind.ELABORATED arg declared in ./tcpdump/print-ppp.c:L403,C75-L403,C86 u_int declared in /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk/usr/include/sys/_types/_u_int.h:L30
     struct netdissect_options * TypeKind.POINTER arg declared in ./tcpdump/print-ppp.c:L1359,C10-L1359,C33 netdissect_options declared in ./tcpdump/netdissect.h:L161
     const unsigned char * TypeKind.POINTER arg declared in ./tcpdump/print-ppp.c:L1360,C10-L1360,C25 u_char declared in /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk/usr/include/sys/_types/_u_char.h:L30
     unsigned int TypeKind.ELABORATED arg declared in ./tcpdump/print-ppp.c:L1360,C27-L1360,C39 u_int declared in /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk/usr/include/sys/_types/_u_int.h:L30

However, the fun part comes from visualization. This is easy with graphviz:

from graphviz import Digraph

dot = Digraph(strict=True)
dot.attr(rankdir="LR", size="20,100", fontsize="6")

node_args = {"fontsize": "8pt", "edgefontsize": "6pt"}

for fn in func_nodes:
    fn_node_name = f"{fn.spelling}\nL{fn.location.line}"
    dot.node(fn_node_name, **node_args)
    for i, arg in enumerate(fn.get_arguments(), start=1):
        arg_node_name = arg.type.get_canonical().spelling
        dot.node(arg_node_name, **node_args)
        dot.edge(fn_node_name, arg_node_name)
        t = arg.type
        # handle pointers by describing their pointees
        if t.kind == TypeKind.POINTER:
            declr = t.get_pointee().get_declaration()
        else:
            declr = t.get_declaration()
        declr_file = f"{declr.location.file}"
        dot.node(declr_file, **node_args)
        # link the argument type to the file that declares it
        dot.edge(
            arg_node_name, declr_file, label=f"L{declr.location.line}", fontsize="6pt"
        )

from IPython.display import display_svg

# render the graph inline in the notebook
display_svg(dot)

We can now enjoy the pretty informative graph 😎 It shows that multiple functions share only a few types of arguments and gives precise information about their origin.

The fully working example is shared here as a Colab notebook.
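For readers who want to try the same filter-and-describe pattern without installing libclang, here is a small sketch that uses Python's built-in ast module on Python source instead of C. This is a swapped-in technique rather than the Clang workflow above, and the sample source and the helper name `function_signatures` are made up for illustration.

```python
import ast
from typing import Dict, List, Optional, Tuple

# made-up Python source standing in for the C file
SOURCE = '''
def ppp_print(ndo, p: bytes, length: int) -> None:
    pass

def helper(x: int, y: int = 0) -> int:
    return x + y
'''


def function_signatures(source: str) -> Dict[str, List[Tuple[str, Optional[str]]]]:
    """Map each function name to its (argument name, simple annotation) pairs."""
    tree = ast.parse(source)
    # filter to function nodes, like CursorKind.FUNCTION_DECL in the Clang example
    functions = filter(lambda n: isinstance(n, ast.FunctionDef), ast.walk(tree))
    result = {}
    for fn in functions:
        result[fn.name] = [
            # only plain-name annotations (e.g. `int`) are described in this sketch
            (a.arg, a.annotation.id if isinstance(a.annotation, ast.Name) else None)
            for a in fn.args.args
        ]
    return result


for name, args in function_signatures(SOURCE).items():
    print(name, args)
```

The structure mirrors the Clang code: parse once, filter the nodes by kind, then describe each argument.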

  1. Grafberger S, Groth P, Stoyanovich J, Schelter S. Data distribution debugging in machine learning pipelines. The VLDB Journal. Published online January 31, 2022:1103-1126. doi:10.1007/s00778-021-00726-w
  2. Fu H, Liu C, Wu B, Li F, Tan J, Sun J. CatSQL: Towards Real World Natural Language to SQL Applications. Proc VLDB Endow. Published online February 2023:1534-1547. doi:10.14778/3583140.3583165
  3. Lattner C, Adve V. LLVM: A compilation framework for lifelong program analysis & transformation. International Symposium on Code Generation and Optimization, 2004 CGO 2004. doi:10.1109/cgo.2004.1281665

Customized Jupyter environments on Google Cloud

Kaggle docker images come with a huge list of pre-installed machine-learning packages, including support for GPU computing. They run within a container as a Jupyter application that users access through its web interface. Running a custom image boils down to these steps:

  • 💡 pulling the right version from the container registry
  • ❗ publishing with appropriate parameters (--runtime flag important for GPU support)

Below we can see what it looks like:

(base) maciej.skorski@shared-notebooks:~$ docker pull
v128: Pulling from kaggle-gpu-images/python
d5fd17ec1767: Pulling fs layer 
(base) maciej.skorski@shared-notebooks:~$ sudo docker run \
>    --name "/payload-container" \
>    --runtime "nvidia" \
>    --volume "/home/jupyter:/home/jupyter" \
>    --mount type=bind,source=/opt/deeplearning/jupyter/,destination=/opt/jupyter/.jupyter/,readonly \
>    --log-driver "json-file" \
>    --restart "always" \
>    --publish "" \
>    --network "bridge" \
>    --expose "8080/tcp" \
>    --label "kaggle-lang"="python" \
>    --detach \
>    --tty \
>    --entrypoint "/" \
>    "" \
>    "/" 

The following test in a Python shell shows that we can indeed use the GPU 🙂

root@cf1b6f63d729:/# ipython
Python 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.33.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import torch

In [2]: torch.cuda.is_available()
Out[2]: True

In [3]: torch.Tensor([1,2,3]).to(0)
Out[3]: tensor([1., 2., 3.], device='cuda:0')

Repairing user-managed notebooks on Google Cloud

In this note, I am sharing a case study on debugging and fixing jupyter-lab access issues.

The diagnostic script can be run on a VM instance as shown below:

(base) maciej.skorski@shared-notebooks:~$ sudo /opt/deeplearning/bin/

Vertex Workbench Diagnostic Tool

Running system diagnostics...

Checking Docker service status...               [OK]
Checking Proxy Agent status...                  [OK]
Checking Jupyter service status in container...         [ERROR] Jupyter service is not running
Checking internal Jupyter API status...         [ERROR] Jupyter API is not active
Checking boot disk (/dev/sda1) space...         [OK]
Checking data disk (/dev/sdb) space...          [OK]
Checking DNS        [OK]
Checking DNS      [OK]

System's health status is degraded

Diagnostic tool will collect the following information: 

  System information
  System Log /var/log/
  Docker information
  Jupyter service status
  Network information
  Proxy configuration: /opt/deeplearning/proxy-agent-config.json
  Conda environment information
  pip environment information
  GCP instance information

Do you want to continue (y/n)? n

Jupyter service runs from a container, but it somehow stopped in this case 😳

(base) maciej.skorski@shared-notebooks:~$ docker container ls

Not a problem! We can restart the container, taking care to choose the parameters that expose it properly (ports, mounted folders, etc.). The appropriate docker command can be retrieved from a running container on a similar healthy instance with docker inspect:

(base) maciej.skorski@kaggle-test-shared:~$ docker inspect \
>   --format "$(curl -s" 3f5b6d709ccc

docker run \
  --name "/payload-container" \
  --runtime "runc" \
  --volume "/home/jupyter:/home/jupyter" \
  --mount type=bind,source=/opt/deeplearning/jupyter/,destination=/opt/jupyter/.jupyter/,readonly \
  --log-driver "json-file" \
  --restart "always" \
  --publish "" \
  --network "bridge" \
  --hostname "3f5b6d709ccc" \
  --expose "8080/tcp" \
  --env "TENSORBOARD_PROXY_URL=/proxy/%PORT%/" \
  --env "LIT_PROXY_URL=/proxy/%PORT%/" \
  --env "PATH=/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" \
  --env "LC_ALL=C.UTF-8" \
  --env "LANG=C.UTF-8" \
  --env "DL_ANACONDA_HOME=/opt/conda" \
  --env "SHELL=/bin/bash" \
  --env "LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64::/opt/conda/lib" \
  --env "CONTAINER_NAME=tf2-cpu/2-11" \
  --env "KMP_BLOCKTIME=0" \
  --env "KMP_AFFINITY=granularity=fine,verbose,compact,1,0" \
  --env "KMP_SETTINGS=false" \
  --env "NODE_OPTIONS=--max-old-space-size=4096" \
  --env "ENABLE_MULTI_ENV=false" \
  --env "LIBRARY_PATH=:/opt/conda/lib" \
  --env "TENSORFLOW_VERSION=2.11.0" \
  --env "KMP_WARNINGS=0" \
  --env "PROJ_LIB=/opt/conda/share/proj" \
  --env "TESSERACT_PATH=/usr/bin/tesseract" \
  --env "PYTHONPATH=:/opt/facets/facets_overview/python/" \
  --env "PYTHONUSERBASE=/root/.local" \
  --env "MPLBACKEND=agg" \
  --env "GIT_COMMIT=7e2b36e4a2ac3ef3df74db56b1fd132d56620e8a" \
  --env "BUILD_DATE=20230419-235653" \
  --label "build-date"="20230419-235653" \
  --label ""="Container: TensorFlow 2-11" \
  --label "git-commit"="7e2b36e4a2ac3ef3df74db56b1fd132d56620e8a" \
  --label "kaggle-lang"="python" \
  --label ""="ubuntu" \
  --label "org.opencontainers.image.version"="20.04" \
  --label "tensorflow-version"="2.11.0" \
  --detach \
  --tty \
  --entrypoint "/" \
  "" \

Now the check goes OK 🙂

(base) maciej.skorski@shared-notebooks:~$ sudo /opt/deeplearning/bin/

Vertex Workbench Diagnostic Tool

Running system diagnostics...

Checking Docker service status...               [OK]
Checking Proxy Agent status...                  [OK]
Checking Jupyter service status in container... [OK]
Checking internal Jupyter API status...         [OK]
Checking boot disk (/dev/sda1) space...         [OK]
Checking data disk (/dev/sdb) space...          [OK]
Checking DNS        [OK]
Checking DNS      [OK]
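The idea of rebuilding the run command from a healthy container can also be sketched in Python. This is not the template used above, just a minimal illustration under stated assumptions: it reads a handful of well-known fields from one `docker inspect` JSON entry, the function name `docker_run_args` is hypothetical, and the sample values (including the image name) are made up.

```python
import shlex
from typing import Any, Dict, List


def docker_run_args(info: Dict[str, Any]) -> List[str]:
    """Rebuild a partial `docker run` argument list from one `docker inspect` entry."""
    host = info.get("HostConfig") or {}
    cfg = info.get("Config") or {}
    # inspect reports the name with a leading slash
    args = ["docker", "run", "--name", info["Name"].lstrip("/")]
    if host.get("Runtime"):
        args += ["--runtime", host["Runtime"]]
    for bind in host.get("Binds") or []:
        args += ["--volume", bind]
    restart = (host.get("RestartPolicy") or {}).get("Name")
    if restart:
        args += ["--restart", restart]
    if host.get("NetworkMode"):
        args += ["--network", host["NetworkMode"]]
    for container_port, bindings in (host.get("PortBindings") or {}).items():
        for b in bindings or []:
            parts = [p for p in (b.get("HostIp"), b.get("HostPort")) if p]
            args += ["--publish", ":".join(parts + [container_port])]
    for env in cfg.get("Env") or []:
        args += ["--env", env]
    for key, value in (cfg.get("Labels") or {}).items():
        args += ["--label", f"{key}={value}"]
    args += ["--detach", cfg["Image"]]
    return args


# made-up sample; real input would be json.loads(<output of `docker inspect ID`>)[0]
sample = {
    "Name": "/payload-container",
    "HostConfig": {
        "Runtime": "runc",
        "Binds": ["/home/jupyter:/home/jupyter"],
        "RestartPolicy": {"Name": "always"},
        "NetworkMode": "bridge",
        "PortBindings": {"8080/tcp": [{"HostIp": "127.0.0.1", "HostPort": "8080"}]},
    },
    "Config": {
        "Env": ["LANG=C.UTF-8"],
        "Labels": {"kaggle-lang": "python"},
        "Image": "example/image:tag",
    },
}
print(" ".join(shlex.quote(a) for a in docker_run_args(sample)))
```

The sketch deliberately skips mounts, entrypoint and command, so its output is a starting point to review by hand rather than a drop-in replacement.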

ML Prototyping Environment on Cloud

Teams that collaborate on data-science tasks using cloud platforms often choose to share a preconfigured ML environment, such as the Kaggle Docker Python image. This resolves reproducibility and dependency issues, while individual team members can add custom packages on top using local virtual environments, for example less common packages for computer vision.

This robust setup requires passing --system-site-packages when creating the local virtual environment, so that it can see the packages of the base environment. Below, we see an example of a local environment with the package DeepForest (not present in the Kaggle image).

root@cf1b6f63d729:/home/jupyter/src/tree_counting# python -m venv .deepforest --system-site-packages
root@cf1b6f63d729:/home/jupyter/src/tree_counting# pip install --upgrade pip --quiet
root@cf1b6f63d729:/home/jupyter/src/tree_counting# pip install deepforest --quiet

The local environment can be further exposed to jupyter as a custom kernel.

root@cf1b6f63d729:/home/jupyter/src/tree_counting# source .deepforest/bin/activate
(.deepforest) root@cf1b6f63d729:/home/jupyter/src/tree_counting# python -m ipykernel install --user --name .deepforest --display-name "Kaggle+DeepForest"
Installed kernelspec .deepforest in /root/.local/share/jupyter/kernels/.deepforest

The architecture is shown below.

Dev Environment Architecture, generated with plantuml.

This script demonstrates the difference between system-level and local packages.

(.deepforest) root@cf1b6f63d729:/home/jupyter/src/tree_counting# python
Python 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import tensorflow
>>> import deepforest
>>> tensorflow.__file__
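The same check can be scripted. Below is a minimal sketch, assuming the usual venv layout where `sys.prefix` points to the local environment and `sys.base_prefix` to the base interpreter; the helper name `package_origin` is hypothetical.

```python
import importlib.util
import os
import sys


def package_origin(module_name: str) -> str:
    """Classify where a module would be imported from: 'local', 'system' or 'missing'."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        spec = None
    if spec is None:
        return "missing"
    origin = spec.origin or ""
    # inside a virtual environment, sys.prefix differs from sys.base_prefix
    in_venv = sys.prefix != sys.base_prefix
    if in_venv and origin.startswith(sys.prefix + os.sep):
        return "local"
    return "system"


for name in ("json", "torch", "deepforest"):
    print(name, package_origin(name))
```

In the setup above, one would expect deepforest to be reported as local and torch or tensorflow as system, since the latter come from the shared Kaggle image.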

Finally, it is worth mentioning the Dev Containers extension, which connects an IDE to a running container. Then we can enjoy all the VS Code features 🙂

IDE connected to a container.

Efficient Pre-Commit Hooks with GitHub Actions

Pre-commit is a great tool for running various sanity checks (formatting, linting) on the code base. However, such scanning may be time-consuming (particularly on certain content like notebooks), which hurts both the user experience and the CI/CD bill (minutes are usually paid for, except for public repos or very small projects).

Below, I demonstrate how to optimize running pre-commit on GitHub Actions. The key is to cache both the pre-commit package and the dependent hooks. Note that, as of now (April 2023), pre-commit native caching handles only the second part. Fortunately, managing its cache is as simple as calling the GitHub Cache Action on ~/.cache/pre-commit.

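name: pre-commit

# NOTE: the event names are assumed; only the branch filters survive from the original
on:
  pull_request:
    branches: [experiments]
  push:
    branches: [experiments, main]

jobs:
  pre-commit:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - uses: actions/setup-python@v4
      with:
        python-version: 3.7
    - name: cache pre-commit deps
      id: cache_pre_commit
      uses: actions/cache@v3
      env:
          cache-name: cache-pre-commit
      with:
        # the pre-commit package (virtual environment) and the hooks cache
        path: |
          .pre_commit_venv
          ~/.cache/pre-commit
        key: ${{ env.cache-name }}-${{ hashFiles('.pre-commit-config.yaml','~/.cache/pre-commit/*') }}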
    - name: install pre-commit
      if: steps.cache_pre_commit.outputs.cache-hit != 'true'
      run: |
        python -m venv .pre_commit_venv
        . .pre_commit_venv/bin/activate
        pip install --upgrade pip
        pip install pre-commit
        pre-commit install --install-hooks
        pre-commit gc
    - name: run pre-commit hooks
      run: |
        . .pre_commit_venv/bin/activate  
        pre-commit run --color=always --all-files
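To illustrate why the cache key is derived from the configuration file, here is a rough Python analogue of a content-based key. It is only a sketch of the idea, not a reimplementation of the `hashFiles` expression, and the helper name `cache_key` is hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def cache_key(name: str, files: Iterable[str]) -> str:
    """Build a key that changes only when the content of the listed files changes."""
    digest = hashlib.sha256()
    for path in sorted(files):
        p = Path(path)
        if p.is_file():
            # hash each file separately, then combine the per-file hashes
            digest.update(hashlib.sha256(p.read_bytes()).digest())
    return f"{name}-{digest.hexdigest()}"


print(cache_key("cache-pre-commit", [".pre-commit-config.yaml"]))
```

As long as the hook configuration is untouched, the key stays the same and the install step is skipped; editing the configuration produces a new key and triggers a fresh install.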

Building and Publishing Docker with GitHub Actions

In this post I am sharing my recipe for building and publishing Docker using GitHub Actions. It concisely wraps up a few steps that beginners often find problematic. In particular:

  • use GitHub secrets to securely store credentials, such as $DOCKER_USER and $DOCKER_PASSWORD, for your docker registry (such as DockerHub or GitHub Container Registry)
  • I recommend logging in to the docker registry via the CLI, rather than using a less transparent GitHub Action; this is as simple as docker login -u $DOCKER_USER -p $DOCKER_PASSWORD
  • use the correct tag pattern when pushing your docker to a registry

The sample code is shown below. See it in action in production here and in this template.

name: docker-image

# NOTE: the push trigger is assumed; the branch and path filters are from the original
on:
  push:
    branches: [ "main" ]
    paths: ["Dockerfile",".github/workflows/docker-image.yaml"]

jobs:
  build:
    runs-on: ubuntu-latest
    # Docker tags and credentials for DockerHub/GitHub Containers, customize!
    env:
      IMAGE_NAME: plantuml-docker
      IMAGE_VERSION: latest
      DOCKER_USER: ${{ secrets.DOCKER_USER }}
      DOCKER_PASSWORD: ${{ secrets.DOCKER_PASSWORD }}
      GITHUB_TOKEN: ${{ secrets.PAT }}
      GITHUB_USER: ${{ }}
    steps:
    - uses: actions/checkout@v3
    - name: Build and tag the image
      run: |
        docker build . \
    - name: Publish to DockerHub
      if: env.DOCKER_PASSWORD != ''
      run: |
        docker login -u $DOCKER_USER -p $DOCKER_PASSWORD
    - name: Publish to GitHub Container registry
      if: env.GITHUB_TOKEN != ''
      run: |
        docker login -u $GITHUB_USER -p $GITHUB_TOKEN 
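The tag pattern mentioned in the list above can be spelled out with a small helper. This is a sketch rather than the workflow's own code: the function name `image_refs` and the sample user names are hypothetical, while the two patterns (`<user>/<image>:<version>` for DockerHub and `ghcr.io/<owner>/<image>:<version>` for GitHub Container Registry) are the standard reference formats.

```python
from typing import Dict


def image_refs(image: str, version: str, docker_user: str, github_user: str) -> Dict[str, str]:
    """Return the full image references expected by each registry."""
    # repository names must be lowercase, so normalize user and image names
    image = image.lower()
    return {
        "dockerhub": f"{docker_user.lower()}/{image}:{version}",
        "ghcr": f"ghcr.io/{github_user.lower()}/{image}:{version}",
    }


# made-up user names for illustration
refs = image_refs("plantuml-docker", "latest", "SomeUser", "SomeUser")
for registry, ref in refs.items():
    print(registry, ref)
```

Each reference can then be passed to docker build with -t and later to docker push, one per registry.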

Effective Caching with GitHub Actions

GitHub Actions is great as a CI/CD platform. However, to be really efficient, workflows need to leverage some optimization techniques, such as caching or running tasks in parallel. In this note, I am sharing some thoughts on how to use cache effectively, with respect to multiple paths and sudo-installed APT packages. The discussion will touch on a few non-trivial aspects that, in my opinion, are not well-explained in other web materials.

My use case was simple: speed up building a university course in Sphinx, to be hosted on GitHub Pages. Installing the Sphinx dependencies required multiple Python and Linux APT downloads, which took quite long. The caching solution indeed fixed the problem, and here are the key takeaways:

This repository demonstrates the working solution. The cache size (Python and APT packages) is about 120 MB. And this is what the job looks like:

name: docs

on: [push, pull_request, workflow_dispatch]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
      - name: prepare virtual environment
        run: |
          python -m venv .venv
          mkdir .apt
      - name: cache dependencies
        id: cache_deps
        uses: actions/cache@v3
        env:
            cache-name: cache-dependencies
        with:
          # both the Python virtual environment and the APT cache folder
          path: |
            .venv
            .apt
          key: ${{ runner.os }}-build-${{ env.cache-name }}-${{ hashFiles('.github/workflows/*') }}
      - name: Install python dependencies
        if: ${{ steps.cache_deps.outputs.cache-hit != 'true' }}
        run: |
          source .venv/bin/activate
          pip install jupyter-book
          pip install sphinxcontrib-plantuml
      - name: Install sudo dependencies
        run: |
          sudo apt-get -o Dir::Cache=".apt" update
          sudo apt-get -o Dir::Cache=".apt" install plantuml
      - run: |
          apt-config dump | grep Dir::Cache
      - name: Compile Docs
        run: |
          source .venv/bin/activate
          jupyter-book build docs
      - name: Deploy to gh-pages
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_branch: gh-pages
          publish_dir: ./docs/_build/html

Making SSH work by proxy

It is a common misconception that hiding encrypted connections (SSH) behind a proxy is a dark domain reserved for criminal activity. You may need a Russian or Iranian proxy to get your coding job done when the firewall of your favourite coffee place, or the wifi you use while travelling, forbids the use of SSH.

As this happens to me regularly (I travel a lot), I would like to share a solution here. The most effective approach is to proxy the traffic through port 443 (the default for HTTPS, typically enabled). Testing the list of free proxy servers ​1​ we find that the proxy (Iranian) is working well. It remains to add a proxy instruction to the ssh configuration as shown below. That’s it, and I can work with GitHub in my favourite coffee place in France. Enjoy!

# content of .ssh/config
    User git
    Port 22
    IdentityFile ~/.ssh/id_rsa
    StrictHostKeyChecking no
    ProxyCommand ncat --proxy 22
  1. List of free proxies. Proxy list for port 443.