Lego Bricks in LaTeX

Who doesn't enjoy Lego bricks? Raise a hand! In this post, I share an elegant and efficient way of plotting bricks in a 3D view in TikZ. Briefly speaking, it uses canvas transforms to draw the facets and describes the boundaries of the studs in a simple way with cylindrical coordinates based on the azimuth angle (localizing the extreme edges is a challenge on its own).
While there are other packages, like TikZbricks, this method seems simpler and brings some educational value regarding the geometry of cylinders.

And here is the code (see also this online note):

\documentclass[12pt]{standalone}

\usepackage{pgfplots}
\usepackage{tikz-3dplot}

\begin{document}

\pgfmathsetmacro{\pinradius}{0.25}

%  elevation and azimuth for 3D-view
\def\rotx{60}
\def\rotz{120}

\newcommand{\brick}[8]{
    \pgfmathsetmacro{\posx}{#1}
    \pgfmathsetmacro{\posy}{#2}
    \pgfmathsetmacro{\posz}{#3}
    \pgfmathsetmacro{\cubex}{#4}
    \pgfmathsetmacro{\cubey}{#5}
    \pgfmathsetmacro{\cubez}{#6}

    % cube by rectangle facets
    \begin{scope}
    \begin{scope}[canvas is yx plane at z=\posz,transform shape]
    \draw[fill=#8] (\posy,\posx) rectangle ++(\cubey,\cubex);
    \end{scope}
    \begin{scope}[canvas is yx plane at z=\posz+\cubez,transform shape]
    \draw[fill=#8] (\posy,\posx) rectangle ++(\cubey,\cubex);
    \end{scope}
    \begin{scope}[canvas is yz plane at x=\posx+\cubex,transform shape]
    \draw[fill=#8] (\posy,\posz) rectangle ++(\cubey,\cubez) node[pos=.5] {#7};
    \end{scope}
    \begin{scope}[canvas is xz plane at y=\posy+\cubey,transform shape]
    \draw[fill=#8] (\posx,\posz) rectangle ++(\cubex,\cubez);
    \end{scope}
    \end{scope}

    % studs by arcs and extreme edges
    \foreach \i in {1,...,\cubey}{
        \foreach \j in {1,...,\cubex}{
            % upper part - full circle
            \draw [thin] (\posx-0.5+\j,\posy-0.5+\i,\posz+\cubez+0.15) circle (\pinradius);
            % lower part - arc
            \begin{scope}[canvas is xy plane at z=\posz+\cubez]
            \draw[thin] ([shift=(\rotz:\pinradius)] \posx-0.5+\j,\posy-0.5+\i) arc (\rotz:\rotz-180:\pinradius);
            \end{scope}
            \begin{scope}[shift={(\posx-0.5+\j,\posy-0.5+\i)}]
                % edges easily identified in cylindrical coordinates! 
                \pgfcoordinate{edge1_top}{ \pgfpointcylindrical{\rotz}{\pinradius}{\posz+\cubez+0.15} };
                \pgfcoordinate{edge1_bottom}{ \pgfpointcylindrical{\rotz}{\pinradius}{\posz+\cubez} };
                \draw[] (edge1_top) -- (edge1_bottom);
                \pgfcoordinate{edge1_top}{ \pgfpointcylindrical{\rotz+180}{\pinradius}{\posz+\cubez+0.15} };
                \pgfcoordinate{edge1_bottom}{ \pgfpointcylindrical{\rotz+180}{\pinradius}{\posz+\cubez} };
                \draw[] (edge1_top) -- (edge1_bottom);
           \end{scope}
        }
    }
}

\tdplotsetmaincoords{\rotx}{\rotz}
\begin{tikzpicture}[tdplot_main_coords,]
    % draw axes
    \coordinate (O) at (0,0,0);
    \coordinate (A) at (5,0,0);
    \coordinate (B) at (0,5,0);
    \coordinate (C) at (0,0,5);
    \draw[-latex] (O) -- (A) node[below] {$x$};
    \draw[-latex] (O) -- (B) node[above] {$y$};
    \draw[-latex] (O) -- (C) node[left] {$z$};
    % draw brick
    \brick{0}{1}{0}{3}{3}{1}{Lego}{blue!50};
    \brick{0}{1}{2}{2}{3}{1}{Enjoys}{green!50};
    \brick{0}{1}{4}{1}{3}{1}{Everybody}{red!50};
\end{tikzpicture}


\end{document}

Cylinders in LaTeX the Easy and Correct Way

Drawing cylinders in vector graphics is a common task. It is less trivial than it looks at first glance, due to the challenge of finding a proper projection. In this post, I share a simple and robust recipe using the tikz-3dplot package for LaTeX. As opposed to many examples shared online, this approach automatically identifies the boundary of a cylinder under a given perspective. The trick is to identify the edges using the azimuth angle in cylindrical coordinates 💪.
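
As a quick reminder, \pgfpointcylindrical takes an azimuth angle, a radius and a height, and resolves to the Cartesian point

\[ (x, y, z) = (r\cos\theta,\; r\sin\theta,\; z), \]

so in the code below the two vertical edges are placed at the azimuths \rotz and \rotz+180, that is, exactly where the solid (visible) and dashed (hidden) halves of the base circle meet.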

\documentclass{standalone}
\usepackage{tikz,tikz-3dplot}
\begin{document}

\def\rotx{70}
\def\rotz{110}
\tdplotsetmaincoords{\rotx}{\rotz}
\begin{tikzpicture}[tdplot_main_coords]
\tikzstyle{every node}=[font=\small]
\draw[ultra thin,-latex] (0,0,0) -- (6,0,0) node[anchor=north east]{$x$};
\draw[ultra thin,-latex] (0,0,0) -- (0,6,0) node[anchor=north west]{$y$};
\draw[ultra thin,-latex] (0,0,0) -- (0,0,6) node[anchor=south]{$z$};
\draw [thick](0,0,4) circle (3);
\begin{scope}[canvas is xy plane at z=0]
\draw[thick, dashed] ([shift=(\rotz:3)] 0,0,0) arc (\rotz:\rotz+180:3);
\draw[thick] ([shift=(\rotz:3)] 0,0,0) arc (\rotz:\rotz-180:3);
\end{scope}
% manual edges
\draw [dotted, red](1.9,-2.34,0) -- (1.9,-2.34,4) node[midway, left]{};
\draw [dotted, red](-1.9,2.35,0) -- (-1.9,2.35,4);
% automatic edges !
\pgfcoordinate{edge1_top}{ \pgfpointcylindrical{\rotz}{3}{4} };
\pgfcoordinate{edge1_bottom}{ \pgfpointcylindrical{\rotz}{3}{0} };
\draw[thick] (edge1_top) -- (edge1_bottom);
\pgfcoordinate{edge2_top}{ \pgfpointcylindrical{\rotz+180}{3}{4} };
\pgfcoordinate{edge2_bottom}{ \pgfpointcylindrical{\rotz+180}{3}{0} };
\draw[thick] (edge2_top) -- (edge2_bottom);
\end{tikzpicture}

\end{document}

And we can enjoy the output below, compared with the manual edges adapted from this post:

Properly drawn cylinder

Automating Accessor and Mutator Tests

In object-oriented programming, there are plenty of accessors and mutators to test. This post demonstrates that this effort can be automated with reflection 🚀. The inspiration came from discussions I had with my students during our software-engineering class: how to increase code coverage without lots of manual effort? 🤔

Roughly speaking, the reflection mechanism allows code to analyse itself. At runtime, we are able to construct calls based on the extracted class properties. The idea is not novel; see, for instance, this gist. To add value and improve the presentation, I modernized and completed the code into a fully featured project on GitHub, with CI/CD on GitHub Actions and code coverage connected 😎.

Here is what the testing class looks like. Java reflection accesses the classes, extracts the fields and their types, and constructs calls with type-matching values accordingly:

// tester class (imports assume JUnit 5 and AssertJ)
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import org.junit.jupiter.api.Test;
import static org.assertj.core.api.Assertions.assertThat;

class AutoTests {
  
    private static final Class[] classToTest = new Class[]{ 
        // the list of classes to test
        PersonClass.class, AnimalClass.class
    };

   @Test 
   public void correctGettersSetters() {
      for (Class aClass : classToTest) {
        Object instance;
        try {      
           instance = aClass.getDeclaredConstructor().newInstance();
           Field[] declaredFields = aClass.getDeclaredFields();
           for(Field f: declaredFields) {
              // get the field getter and setter, following the Java naming convention (!)
              // www.theserverside.com/feature/Java-naming-conventions-explained
              String name = f.getName();
              name = name.substring(0,1).toUpperCase() + name.substring(1);
              String getterName = "get" + name;
              String setterName = "set" + name;
              Method getterMethod = aClass.getMethod(getterName);
              Method setterMethod = aClass.getMethod(setterName, getterMethod.getReturnType());
              // prepare a test value based on the field type
              Object testVal = null;
              Class<?> fType = f.getType();
              if (fType.isAssignableFrom(Integer.class)) {
                  testVal = 1234;
              } else if (fType.isAssignableFrom(String.class)) {
                  testVal = "abcd";
              }
              // test by composing the setter and getter
              setterMethod.invoke(instance, testVal);
              Object result = getterMethod.invoke(instance);
              System.out.printf("Testing class=%s field=%s...\n", aClass.getName(), f.getName());
              assertThat(result).as("in class %s field %s", aClass.getName(), f.getName()).isEqualTo(testVal);
           }
        } catch(Exception e) {
           System.out.println(e.toString());
        }
      }
   }
}

And here is one more demo available online.

Finally, a disclaimer: accessors and mutators may deserve smarter tests than those presented here, depending on the use case.

Tracking Training Performance of LDA Models

In my recent open-source contribution I enabled callbacks in the scalable (multi-core) implementation of Latent Dirichlet Allocation in the gensim library [1]. This will, in turn, allow users to tune this popular topic-extraction model faster and more accurately.

An obvious use case is monitoring and early stopping of training with popular coherence metrics such as \(U_{mass}\) and \(C_V\) [2]. On the 20 Newsgroups dataset, the training performance looks as follows:

Training performance of Multi-Core LDA on 20 Newsgroups data, monitored by callbacks.
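
For illustration, here is a minimal sketch of how such monitoring can be wired up with gensim's callbacks module (the corpus, dictionary and tokenized_docs variables, as well as the model settings, are placeholders rather than the exact notebook setup):

from gensim.models import LdaMulticore
from gensim.models.callbacks import CoherenceMetric

# coherence metrics evaluated after every pass and logged to the console
umass = CoherenceMetric(corpus=corpus, dictionary=dictionary, coherence="u_mass", logger="shell")
cv = CoherenceMetric(texts=tokenized_docs, dictionary=dictionary, coherence="c_v", logger="shell")

lda = LdaMulticore(
    corpus=corpus,
    id2word=dictionary,
    num_topics=20,
    passes=10,
    callbacks=[umass, cv],  # callbacks in the multi-core model, enabled by the contribution described above
)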

The achieved scores are decent, actually better than those reported in the literature [3], although this may be due to preprocessing rather than early stopping. A full example is shared in this Kaggle notebook.

  1. Řehůřek R, Sojka P. Software Framework for Topic Modelling with Large Corpora. Published online 2010. doi:10.13140/2.1.2393.1847
  2. Röder M, Both A, Hinneburg A. Exploring the Space of Topic Coherence Measures. Proceedings of the Eighth ACM International Conference on Web Search and Data Mining. Published online February 2, 2015. doi:10.1145/2684822.2685324
  3. Zhang Z, Fang M, Chen L, Namazi Rad MR. Is Neural Topic Modelling Better than Clustering? An Empirical Study on Clustering with Contextual Embeddings for Topics. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Published online 2022. doi:10.18653/v1/2022.naacl-main.285

On a UML Approach to Management Antipatterns

Therefore speak I to them in parables, because seeing, they see not, and hearing, they hear not, neither do they understand.

Matthew 13:13

Ever wondered how miserable some “prestigious” businesses are, and how they manage to make their employees make up for poor project management? Me too! A classical situation that contributes to a crisis is miscommunication with subcontractors or employees. Let’s see how UML can be used to study such antipatterns. They happen unintentionally, don’t they? 🤔

This is a real-world use case from a prestigious legal office located in Warsaw, Poland. I was asked to capture project-management antipatterns as an external observer and modeller.


One use case: an expert subcontractor proactively asked, in fact several times, to be put in the communication loop with the client. But the office executives didn’t find it necessary (why would they, huh?). Until… Guess when? The deadline! The subcontractor was caught by surprise: please deliver for the customer by today! But wait, which customer…? 🤔

Another use case: the office rushed to promise the client something it couldn’t deliver, and reached out to its experts for help pretty late… Guess when? On the deadline day!

Here is the UML model that I promised, a good illustration of this poor management practice! I will use a sequence diagram, a powerful tool to explore interactions 💪

You certainly agree this is not professional, but would probably argue that this doesn’t happen at Ernst & Young, PwC and other big companies… Would you?

Working with Abstract Syntax Trees

Visualizing code as a syntax tree is both fun and useful, as seen in impressive applications such as building the lineage of SQL, which helps to understand complex queries in business. Abstract syntax trees are not only widely used in industry but are also a subject of top academic research [1,2].

This post demonstrates how to work with ASTs in Python, by parsing C code with Clang/LLVM [3] and visualizing it with graphviz.

Parsing is relatively simple, particularly for users who already have similar experience with abstract trees, such as parsing XML. My advice for beginners is to avoid refactoring the code too early and to leverage Python's functional features instead. The example below shows how to extract function declarations and the details of their arguments:

from clang.cindex import Index, Config, CursorKind, TypeKind

SCRIPT_PATH = "./tcpdump/print-ppp.c"

# C99 is a proper C code standard for tcpdump, as per their docs
index = Index.create()
translation_unit = index.parse(SCRIPT_PATH, args=["-std=c99"])

# filter to nodes in the root script (ignore imported!)
script_node = translation_unit.cursor
all_nodes = script_node.get_children()
all_nodes = filter(lambda c: c.location.file.name == SCRIPT_PATH, all_nodes)

# filter to function nodes (materialized as a list, so it can be iterated over again later)
func_nodes = list(filter(lambda c: c.kind == CursorKind.FUNCTION_DECL, all_nodes))

# print attributes and their types for each function
for fn in func_nodes:
    print(fn.spelling)
    for arg in fn.get_arguments():
        t = arg.type
        # handle pointers by describing their pointees
        if t.kind == TypeKind.POINTER:
            declr = t.get_pointee().get_declaration()
        else:
            declr = t.get_declaration()
        print('\t', 
            t.get_canonical().spelling,
            t.kind,
            f'arg declared in {arg.location.file}:L{arg.extent.start.line},C{arg.extent.start.column}-L{arg.extent.end.line},C{arg.extent.end.column}',
            f'{declr.spelling} declared in {declr.location.file}:L{declr.location.line}'
        )

This gives the following output when tested on the tcpdump project:

print_lcp_config_options
     struct netdissect_options * TypeKind.POINTER arg declared in ./tcpdump/print-ppp.c:L403,C39-L403,C59 netdissect_options declared in ./tcpdump/netdissect.h:L161
     const unsigned char TypeKind.ELABORATED arg declared in ./tcpdump/print-ppp.c:L403,C61-L403,C73 u_char declared in /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk/usr/include/sys/_types/_u_char.h:L30
     const unsigned int TypeKind.ELABORATED arg declared in ./tcpdump/print-ppp.c:L403,C75-L403,C86 u_int declared in /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk/usr/include/sys/_types/_u_int.h:L30
ppp_hdlc
     struct netdissect_options * TypeKind.POINTER arg declared in ./tcpdump/print-ppp.c:L1359,C10-L1359,C33 netdissect_options declared in ./tcpdump/netdissect.h:L161
     const unsigned char * TypeKind.POINTER arg declared in ./tcpdump/print-ppp.c:L1360,C10-L1360,C25 u_char declared in /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk/usr/include/sys/_types/_u_char.h:L30
     unsigned int TypeKind.ELABORATED arg declared in ./tcpdump/print-ppp.c:L1360,C27-L1360,C39 u_int declared in /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk/usr/include/sys/_types/_u_int.h:L30
...

However, the fun part comes with the visualization. This is easy with graphviz:

from graphviz import Digraph

dot = Digraph(strict=True)
dot.attr(rankdir="LR", size="20,100", fontsize="6")

node_args = {"fontsize": "8pt", "edgefontsize": "6pt"}

for fn in func_nodes:
    fn_node_name = f"{fn.spelling}\nL{fn.location.line}"
    dot.node(fn_node_name, **node_args)
    for i, arg in enumerate(fn.get_arguments(), start=1):
        arg_node_name = arg.type.get_canonical().spelling
        dot.node(arg_node_name, **node_args)
        dot.edge(fn_node_name, arg_node_name)
        t = arg.type
        # handle pointers by describing their pointees
        if t.kind == TypeKind.POINTER:
            declr = t.get_pointee().get_declaration()
        else:
            declr = t.get_declaration()
        declr_file = f"{declr.location.file}"
        dot.node(declr_file, **node_args)
        dot.edge(
            arg_node_name, declr_file, label=f"L{declr.location.line}", fontsize="6pt"
        )

from IPython.display import display_svg
display_svg(dot)
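
Outside a notebook, the same graph can also be written straight to disk (the file name here is arbitrary):

# writes ast.svg; cleanup=True removes the intermediate DOT source file
dot.render("ast", format="svg", cleanup=True)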

We can now enjoy the pretty informative graph 😎 It shows that the functions share only a few types of arguments and gives precise information about their origin.

The fully working example is shared here as a Colab notebook.

  1. Grafberger S, Groth P, Stoyanovich J, Schelter S. Data distribution debugging in machine learning pipelines. The VLDB Journal. Published online January 31, 2022:1103-1126. doi:10.1007/s00778-021-00726-w
  2. Fu H, Liu C, Wu B, Li F, Tan J, Sun J. CatSQL: Towards Real World Natural Language to SQL Applications. Proc VLDB Endow. Published online February 2023:1534-1547. doi:10.14778/3583140.3583165
  3. Lattner C, Adve V. LLVM: A compilation framework for lifelong program analysis & transformation. International Symposium on Code Generation and Optimization, 2004 (CGO 2004). doi:10.1109/cgo.2004.1281665

Customized Jupyter environments on Google Cloud

Kaggle Docker images come with a huge list of pre-installed packages for machine learning, including support for GPU computing. They run within a container as a Jupyter application that users access through its web interface. Running a custom image boils down to these steps:

  • 💡 pulling the right version from the container registry
  • ❗ running it with appropriate parameters (the --runtime flag is important for GPU support)

Below we can see how this looks:

(base) maciej.skorski@shared-notebooks:~$ docker pull gcr.io/kaggle-gpu-images/python:v128
v128: Pulling from kaggle-gpu-images/python
d5fd17ec1767: Pulling fs layer 
...
(base) maciej.skorski@shared-notebooks:~$ sudo docker run \
>    --name "/payload-container" \
>    --runtime "nvidia" \
>    --volume "/home/jupyter:/home/jupyter" \
>    --mount type=bind,source=/opt/deeplearning/jupyter/jupyter_notebook_config.py,destination=/opt/jupyter/.jupyter/jupyter_notebook_config.py,readonly \
>    --log-driver "json-file" \
>    --restart "always" \
>    --publish "127.0.0.1:8080:8080/tcp" \
>    --network "bridge" \
>    --expose "8080/tcp" \
>    --label "kaggle-lang"="python" \
>    --detach \
>    --tty \
>    --entrypoint "/entrypoint.sh" \
>    "gcr.io/kaggle-gpu-images/python:v128" \
>    "/run_jupyter.sh" 
cf1b6f63d729d357ef3a320dfab076001a3513c54344be7ae3a5af9789395e63

The following test in a Python shell shows that we can indeed use the GPU 🙂

root@cf1b6f63d729:/# ipython
Python 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.33.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import torch

In [2]: torch.cuda.is_available()
Out[2]: True

In [3]: torch.Tensor([1,2,3]).to(0)
Out[3]: tensor([1., 2., 3.], device='cuda:0')

Repairing user-managed notebooks on Google Cloud

In this note, I share a case study on debugging and fixing JupyterLab access issues.

The diagnostic script can be run on a VM instance as shown below:

(base) maciej.skorski@shared-notebooks:~$ sudo /opt/deeplearning/bin/diagnostic_tool.sh

Vertex Workbench Diagnostic Tool


Running system diagnostics...

Checking Docker service status...               [OK]
Checking Proxy Agent status...                  [OK]
Checking Jupyter service status in container...         [ERROR] Jupyter service is not running
Checking internal Jupyter API status...         [ERROR] Jupyter API is not active
Checking boot disk (/dev/sda1) space...         [OK]
Checking data disk (/dev/sdb) space...          [OK]
Checking DNS notebooks.googleapis.com...        [OK]
Checking DNS notebooks.cloud.google.com...      [OK]

System's health status is degraded

Diagnostic tool will collect the following information: 

  System information
  System Log /var/log/
  Docker information
  Jupyter service status
  Network information
  Proxy configuration: /opt/deeplearning/proxy-agent-config.json
  Conda environment information
  pip environment information
  GCP instance information

Do you want to continue (y/n)? n

The Jupyter service runs from a container, but in this case it had somehow stopped 😳

(base) maciej.skorski@shared-notebooks:~$ docker container ls
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

Not a problem! We can restart the container, carefully choosing the parameters to expose it properly (ports, mounted folders, etc.). The appropriate docker command can be retrieved from a running container on a similar, healthy instance with docker inspect:

(base) maciej.skorski@kaggle-test-shared:~$ docker inspect \
>   --format "$(curl -s https://gist.githubusercontent.com/ictus4u/e28b47dc826644412629093d5c9185be/raw/run.tpl)" 3f5b6d709ccc

docker run \
  --name "/payload-container" \
  --runtime "runc" \
  --volume "/home/jupyter:/home/jupyter" \
  --mount type=bind,source=/opt/deeplearning/jupyter/jupyter_notebook_config.py,destination=/opt/jupyter/.jupyter/jupyter_notebook_config.py,readonly \
  --log-driver "json-file" \
  --restart "always" \
  --publish "127.0.0.1:8080:8080/tcp" \
  --network "bridge" \
  --hostname "3f5b6d709ccc" \
  --expose "8080/tcp" \
  --env "NOTEBOOK_DISABLE_ROOT=" \
  --env "TENSORBOARD_PROXY_URL=/proxy/%PORT%/" \
  --env "LIT_PROXY_URL=/proxy/%PORT%/" \
  --env "PATH=/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" \
  --env "GCSFUSE_METADATA_IMAGE_TYPE=DLC" \
  --env "LC_ALL=C.UTF-8" \
  --env "LANG=C.UTF-8" \
  --env "ANACONDA_PYTHON_VERSION=3.10" \
  --env "DL_ANACONDA_HOME=/opt/conda" \
  --env "SHELL=/bin/bash" \
  --env "LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/lib/x86_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64::/opt/conda/lib" \
  --env "CONTAINER_NAME=tf2-cpu/2-11" \
  --env "KMP_BLOCKTIME=0" \
  --env "KMP_AFFINITY=granularity=fine,verbose,compact,1,0" \
  --env "KMP_SETTINGS=false" \
  --env "NODE_OPTIONS=--max-old-space-size=4096" \
  --env "ENABLE_MULTI_ENV=false" \
  --env "LIBRARY_PATH=:/opt/conda/lib" \
  --env "TENSORFLOW_VERSION=2.11.0" \
  --env "KMP_WARNINGS=0" \
  --env "PROJ_LIB=/opt/conda/share/proj" \
  --env "TESSERACT_PATH=/usr/bin/tesseract" \
  --env "PYTHONPATH=:/opt/facets/facets_overview/python/" \
  --env "MKL_THREADING_LAYER=GNU" \
  --env "PYTHONUSERBASE=/root/.local" \
  --env "MPLBACKEND=agg" \
  --env "GIT_COMMIT=7e2b36e4a2ac3ef3df74db56b1fd132d56620e8a" \
  --env "BUILD_DATE=20230419-235653" \
  --label "build-date"="20230419-235653" \
  --label "com.google.environment"="Container: TensorFlow 2-11" \
  --label "git-commit"="7e2b36e4a2ac3ef3df74db56b1fd132d56620e8a" \
  --label "kaggle-lang"="python" \
  --label "org.opencontainers.image.ref.name"="ubuntu" \
  --label "org.opencontainers.image.version"="20.04" \
  --label "tensorflow-version"="2.11.0" \
  --detach \
  --tty \
  --entrypoint "/entrypoint.sh" \
  "gcr.io/kaggle-images/python:latest" \
  "/run_jupyter.sh" 

Now the check goes OK 🙂

(base) maciej.skorski@shared-notebooks:~$ sudo /opt/deeplearning/bin/diagnostic_tool.sh

Vertex Workbench Diagnostic Tool


Running system diagnostics...

Checking Docker service status...               [OK]
Checking Proxy Agent status...                  [OK]
Checking Jupyter service status in container... [OK]
Checking internal Jupyter API status...         [OK]
Checking boot disk (/dev/sda1) space...         [OK]
Checking data disk (/dev/sdb) space...          [OK]
Checking DNS notebooks.googleapis.com...        [OK]
Checking DNS notebooks.cloud.google.com...      [OK]

ML Prototyping Environment in the Cloud

Teams that collaborate on data-science tasks on cloud platforms often choose to share a preconfigured ML environment, such as the Kaggle Python Docker image. This resolves reproducibility and dependency issues, while individual team members can add custom packages on top using local virtual environments, for example less common packages for computer vision.

This robust setup requires pointing to the base environment with --system-site-packages when creating the local virtual environment. Below, we see an example of a local environment with the DeepForest package (not present in the Kaggle image).

root@cf1b6f63d729:/home/jupyter/src/tree_counting# python -m venv .deepforest --system-site-packages
root@cf1b6f63d729:/home/jupyter/src/tree_counting# .deepforest/bin/pip install --upgrade pip --quiet
root@cf1b6f63d729:/home/jupyter/src/tree_counting# .deepforest/bin/pip install deepforest --quiet

The local environment can be further exposed to Jupyter as a custom kernel:

root@cf1b6f63d729:/home/jupyter/src/tree_counting# source .deepforest/bin/activate
(.deepforest) root@cf1b6f63d729:/home/jupyter/src/tree_counting# python -m ipykernel install --user --name .deepforest --display-name "Kaggle+DeepForest"
Installed kernelspec .deepforest in /root/.local/share/jupyter/kernels/.deepforest

The architecture is shown below.

Dev Environment Architecture, generated with plantuml.

The following session demonstrates the difference between system-level and local packages:

(.deepforest) root@cf1b6f63d729:/home/jupyter/src/tree_counting# python
Python 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import tensorflow
>>> import deepforest
>>> deepforest.__file__
'/home/jupyter/src/tree_counting/.deepforest/lib/python3.7/site-packages/deepforest/__init__.py'
>>> tensorflow.__file__
'/opt/conda/lib/python3.7/site-packages/tensorflow/__init__.py'

Finally, it is worth mentioning the Dev Containers extension, which connects the IDE to a running container. Then we can enjoy all the VS Code features 🙂

IDE connected to a container.

Efficient Pre-Commit Hooks with GitHub Actions

Pre-commit is a great tool for running various sanity checks (formatting, linting) on a code base. However, such scanning may be time-consuming (particularly on content like notebooks), which hurts both user experience and CI/CD billing (minutes are usually paid, except for public repos and very small projects).

Below, I demonstrate how to effectively optimize running pre-commit on GitHub Actions. The key is to cache both the pre-commit package and the dependent hook environments. Note that, as of now (April 2023), pre-commit's native caching covers only the second part. Fortunately, managing this cache is as simple as calling the GitHub cache action on ~/.cache/pre-commit.

name: pre-commit

on:
  push:
    branches: [experiments]
  pull_request:
    branches: [experiments, main]
  workflow_dispatch:

jobs:
  pre-commit:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - uses: actions/setup-python@v4
      with:
        python-version: 3.7
    - name: cache pre-commit deps
      id: cache_pre_commit
      uses: actions/cache@v3
      env:
          cache-name: cache-pre-commit
      with:
        path: |
          .pre_commit_venv 
          ~/.cache/pre-commit
        key: ${{ env.cache-name }}-${{ hashFiles('.pre-commit-config.yaml','~/.cache/pre-commit/*') }}
    - name: install pre-commit
      if: steps.cache_pre_commit.outputs.cache-hit != 'true'
      run: |
        python -m venv .pre_commit_venv
        . .pre_commit_venv/bin/activate
        pip install --upgrade pip
        pip install pre-commit
        pre-commit install --install-hooks
        pre-commit gc
    - name: run pre-commit hooks
      run: |
        . .pre_commit_venv/bin/activate  
        pre-commit run --color=always --all-files