Maciej Skorski – Page 4 – personal webpage

Expanding Inverse Functions

The problem of inverting the implicit function $$y=f(x)$$ in the form of power-series $$x = a + \sum_{k=1}^{\infty} b_k (y-f(a))^k$$

around a point of interest $x=a$, has a long history. Lagrange obtained a theoretical inversion formula ((https://mathworld.wolfram.com/LagrangeInversionTheorem.html)), yet efficient implementations are relatively recent ((Brent, Richard P., and Hsiang T. Kung. “Fast algorithms for manipulating formal power series.” Journal of the ACM (JACM) 25, no. 4 (1978): 581-595.)) ((Johansson, F., 2015. A fast algorithm for reversion of power series. Mathematics of Computation, 84(291), pp.475-484.)).

I this note I am sharing a bit simpler algorithm, performing Newton-like updates:
$$x_{k} = x_k-f(x_{k-1})[(y-f(a))^{k}]\cdot \frac{(y-f(a))^{k}}{f'(a)},\quad x_1 = a+\frac{y-f(a)}{f'(a)}.$$
We can see that it gradually produces more and more accurate terms. More precisely, assuming w.l.o.g. $0=a=f(a)$, suppose $f(x) =y + O(y^{k})$, then $f(x + h ) = f(x)+f'(0) h + O(h^2)$ by Taylor’s expansion, and so the update $h=-\frac{f(x) [y^k] }{f'(0)}$ (where $[z^k]$ is the operation of extracting the coefficient with $z^k$) gives $f(x+h)=y+O(y^{k+1})$ as desired.
I implemented this a proposal for Sympy.

Monitoring Azure Experiments

Azure Cloud is a popular work environment for many data scientists, yet many features remain poorly documented. This note shows how to monitor Azure experiments in a more handy and detailed way than through web or cl interface.

The trick is to create a dashborad of experiments and their respective runs, up to a desired level of detail, from Python. The workhorse is the following handy utility function:

from collections import namedtuple

def get_runs_summary(ws):
    """Summarise all runs under a given workspace, with experiment name, run id and run status
    Args:
        ws (azureml.core.Workspace): Azure workspace to look into
    """
    # NOTE: extend the scope of run details if needed
    record = namedtuple('Run_Description',['job_name','run_id','run_status'])
    for exp_name,exp_obj in ws.experiments.items():
        for run_obj in exp_obj.get_runs():
            yield(record(exp_name,run_obj.id,run_obj.status))

Now it’s time to see it in action 😎

# get the default workspace
from azureml.core import Workspace
import pandas as pd

ws = Workspace.from_config()

# generate the job dashboard and inspect
runs = get_runs_summary(ws)
summary_df = pd.DataFrame(runs)
summary_df.head()

# count jobs by status
summary_df.groupby('run_status').size()

Use the dashboard for to automatically manage experiments. For example, to kill running jobs:

from azureml.core import Experiment, Run

for exp_name,run_id in summary_df.loc[summary_df.run_status=='Running',['job_name','run_id']].values:
    exp = Experiment(ws,exp_name)
    run = Run(exp,run_id)
    run.cancel()

Check the jupyter notebook in my repository for a one-click demo.

Repo Passwords in Poetry

Poetry, a popular Python package manager, prefers to use keyring to manage passwords for private code repositories. Storing passwords in plain text is a secondary option, but may be needed in case of either issues in poetry itself or with keyring configuration (may not be properly installed, be locked etc). To disable the use of system keyring by poetry, set the null backend in the environmental variable PYTHON_KEYRING_BACKEND:

(.venv) azureuser@sense-mskorski:~/projects/test_project$ export PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring
(.venv) azureuser@sense-mskorski:~/projects/test_project$ poetry config http-basic.private_repo *** '' -vvv
Loading configuration file /home/azureuser/.config/pypoetry/config.toml
Loading configuration file /home/azureuser/.config/pypoetry/auth.toml
Adding repository test_repo (https://***) and setting it as secondary
No suitable keyring backend found
No suitable keyring backends were found
Keyring is not available, credentials will be stored and retrieved from configuration files as plaintext.
Using a plaintext file to store credentials

Analogously, one of working backends can be enabled (make sure it works correctly!) . This is how to list available backends:

(.venv) azureuser@sense-mskorski:~/projects/test_project$ keyring --list-backends
keyring.backends.chainer.ChainerBackend (priority: -1)
keyring.backends.fail.Keyring (priority: 0)
keyring.backends.SecretService.Keyring (priority: 5)

Marking Python Tests as Optional

Often code tests are to be run on special demand, rather than in a CI/CD manner: for instance, they may be slow or work only in a local mode with protected data. This note shows how to declare code tests optional in pytest, the leading testing framework for Python. The article is inspired by the pytest documentation on test markers.

The trick is to mark extra tests with a decorator and couple it with a runtime flag.
The first step is to mark the relevant code blocks and is illustrated below:

import pytest

@pytest.mark.optional
def test_addition():
    assert 1+1==2

The second step is to instruct the config flag on how to interpret the decorator; it is also worth describing this action in the helper. The code below shows how to set this up:

import pytest

def pytest_addoption(parser):
    """Introduce a flag for optional tests"""
    parser.addoption(
        "--optional", action="store_true", default=False, help="run extra tests"
    )

def pytest_collection_modifyitems(config, items):
    """Instruct the testing framework to obey the decorator for optional tests"""
    if config.getoption("--optional"):
        # --extras given in cli: do not skip extra tests
        return
    # otherwise, execute
    skip_extras = pytest.mark.skip(reason="need --optional flag to run")
    for item in items:
        if "optional" in item.keywords:
            item.add_marker(skip_extras)

def pytest_configure(config):
    """Make the decorator for optional tests visible in help: pytest --markers"""
    config.addinivalue_line("markers", "optional: mark test as optional to run")

To see it in action, run pytest -k battery --optional 😎

test_battery.py::test_addition PASSED

Debug CI/CD with SSH

CircleCI is a popular platform Continuous integration (CI) and continuous delivery (CD). While its job status reports are already useful, one can do much more insights by debugging it in real time. Here I am sharing a real use-case of debugging a failing job deploing an app 🙂

Failing job are reported in red and the (most of) errors caught during the execution appear in the terminal. In this case, the environment is unable to locate Python:

The error is not very informative. 😮 For more insights, let’s re-run the job in SSH mode:

Re-running the failed CircleCI with SSH.

You will be welcomed with instructions on how to connect via SSH:

Use this instruction to connect to inspect the environment at its failure stage:

mskorski@SHPLC-L0JH Documents % ssh -p 64535 aa.bbb.cc.dd
The authenticity of host '[aa.bbb.cc.dd]:64535 ([aa.bbb.cc.dd]:64535)' can't be established.
ED25519 key fingerprint is SHA256:LsMhHb5fUPLHI9dFdyig4VKw44GTqrA2dkEWT0sZx4k.
Are you sure you want to continue connecting (yes/no)? yes

circleci@bc95bb40fff3:~$ ls project/venv/bin -l
total 300
...
lrwxrwxrwx 1 circleci circleci    7 Aug  1 12:37 python -> python3
lrwxrwxrwx 1 circleci circleci   49 Aug  1 12:37 python3 -> /home/circleci/.pyenv/versions/3.8.13/bin/python3
lrwxrwxrwx 1 circleci circleci    7 Aug  1 12:38 python3.9 -> python3

Bingo! The terminal warns about broken symbolic links:

Terminal highlights broken symbolic links.

The solution in this case was to update cache. The issues may be far more complex than that, but being able to debug them live comes to the rescue. 😎

Prototype in Jupyter on Multiple Kernels

For data scientists, it is a must to prototype in multiple virtual environments which isolate different (and often very divergent) sets of Python packages. This can be achieved by linking one Jupyter installation with multiple Python environments.

Use the command <code>which jupyter</code> to show the Jupyter location and <jupyter kernelspec list> to show available kernels, as shown below:

ubuntu@ip-172-31-36-77:~/projects/eye-processing$ which jupyter
/usr/local/bin/jupyter
ubuntu@ip-172-31-36-77:~/projects/eye-processing$ jupyter kernelspec list
Available kernels:
  .eyelinkparser    /home/ubuntu/.local/share/jupyter/kernels/.eyelinkparser
  eye-processing    /home/ubuntu/.local/share/jupyter/kernels/eye-processing
  pypupilenv        /home/ubuntu/.local/share/jupyter/kernels/pypupilenv
  python3           /usr/local/share/jupyter/kernels/python3

To make an environment a Jupyter kernel, first activate it and install ipykernel inside.

ubuntu@ip-172-31-36-77:~/projects/eye-processing$ source .venv/bin/activate
(.venv) ubuntu@ip-172-31-36-77:~/projects/eye-processing$ pip install ipykernel
Collecting ipykernel
...
Successfully installed...

Then use ipykernel to register the active environment as a jupyter kernel (choose a name and the destination, e.g. user space; consult python -m ipykernel install –help for more options).

(.venv) ubuntu@ip-172-31-36-77:~/projects/eye-processing$ python -m ipykernel install --name eye-processing --user
Installed kernelspec eye-processing in /home/ubuntu/.local/share/jupyter/kernels/eye-processing

The kernel should appear in the jupyter kernel lists. Notebooks may need restarting to notice it.

The environment appears in Jupyter kernels (VS Code).

Open kernel.json file under kernel’s path to inspect config details:

{
 "argv": [
  "/home/ubuntu/projects/eye-processing/.venv/bin/python",
  "-m",
  "ipykernel_launcher",
  "-f",
  "{connection_file}"
 ],
 "display_name": "eye-processing",
 "language": "python",
 "metadata": {
  "debugger": true
 }
}
"~/.local/share/jupyter/kernels/eye-processing/kernel.json" [noeol] 14L, 234B

Modern Bibliography Management

A solid bibliography database is vital for every research project, yet building it is considered an ugly manual task by many – particularly by old-school researchers. But this does not have to be painful if we use modern toolkit. The following features appear particularly important:

collaborative work (sharing etc)
extraction magic (online search, automated record population)
tag-annotation support (organize records with keywords or comments)
multi-format support (export and import using BibTeX or other formats)
plugins to popular editors (Office, GoogleDocs)

I have had a particularly nice experience with Zotero (many thanks to my work colleagues from SensyneHealth for recommending this!). Let the pictures below demonstrate it in action!

tag and search…

organize notes…

automatically extract from databases…

format and export…

National Bank of Poland leads on helping Ukraine’s currency

Ukrainian refugees see their currency hardly convertible at fair rates or even accepted. The lack of liquidity will also hit Ukraine’s government urgent expenses (medical supplies or weapons).

The National Bank of Poland (NBP) offered a comprehensive financial package to address these problems. Firstly, it enabled the refugees to exchange the currency at a nearly official rate, through the agreement with the National Bank of Ukraine (NBU) signed on March 18.

Ukrainians to Be Able to Exchange Hryvnia Cash in Polish Banks

Secondly, in a follow-up agreement the NBP offered a FX Swap for USD 1 billion which will provide the Ukraine central bank with more liquidity.

As Reuters announced on March 24, EU countries are close to follow and agree on a scheme enabling exchanging Ukraine’s cash. The rules disclosed in the draft are close to the NBP scheme (exchange capped at around 10,000 UAH per individual)

Evidence-based quick wins in reducing oil demand

10-point action plan on cutting the oil demand by IEA.

Following the Russia invasion on Ukraine an oil supply shock is expected and what is worse there may be no supply increase from OPEC+ . A way out is to cut the demand as advised by IEA. The proposed steps are quick-wins: easy to implement and reverse while having measurable and significant impact. The overall impact should balance the supply gap.

Quantitative estimates can be backed with empirical or even statistical evidence. A good example is the speed reduction, as recent research advances demonstrate quantitatively what drivers learn by experience: fuel consumption considerably increases with speed beyond a turning point. Turning this statement around: you can likely save lots of petrol driving at a lower speed. The study of He at al. finds cubic approximations fit well and enables several optimisation considerations for various groups of car users.

Source: “Study on a Prediction Model of Superhighway Fuel Consumption Based on the Test of Easy Car Platform”, He at al.

Robust Azure ETLs with Python

Microsoft Azure cloud computing platform faces criticism due to its Python API being little customisable and poorly documented. Still it is a popular choice for many companies, thus data scientists need to squeeze maximum of features and performance out of it rather than complain. Below I am sharing thoughts on creating robust Extract-Transform-Load processes in this setup.

Extract from database: consumer-producer!

The key trick is to enable the consumer-producer pattern: you want to process your data in batches, for the sake of robustness and efficiency. The popular pandas library dedicated for tabular data is not enough for this task, as it hinders useful features. Gain more control using dedicated database drivers, e.g. psycogp driver for the popular Postgres database.

import psycopg2

dsn = f'user={db_user} password={db_password} dbname={db_name} host={db_host}'
conn = psycopg2.connect(dsn)

query = """
    SELECT *
    FROM sales
    WHERE date > '2021'
    """


def process_rows(row_group):
    '''do your stuff, e.g. filter and append to a file'''
    pass


n_prefetch = 10000


# mind the server-side cursor! 
with conn.cursor(name='server_side_cursor') as cur:
    cur.itersize = n_prefetch
    cur.execute(query)
    while True:
        row_group = cur.fetchmany(n_prefetch)
        if len(row_group) > 0:
            process_rows(row_group)
        else:
            break

Cast datatypes

Cast datatypes early and explicitly having these point in mind

avoid data types inferring if possible, because it can be very slow and not 100% accurate
align datatypes to your destination format, for example pandas work better with date and time combined into datetime due to rich features of the latter, while Apache Parquet benefits from aligning numerical precision because it is stored in chunks.

Best to do this casting upstream, adapting at the database driver level, like in this example:

import psycopg2

# cast some data types upon receiving from database
datetype_casted = psycopg2.extensions.new_type(
    psycopg2.extensions.DATE.values, "date", psycopg2.DATETIME
)
psycopg2.extensions.register_type(datetype_casted)
decimal_casted = psycopg2.extensions.new_type(
    psycopg2.extensions.DECIMAL.values, "decimal", psycopg2.extensions.FLOAT
)
psycopg2.extensions.register_type(decimal_casted)

Use Parquet to store tabular data

The Apache Parquet is invaluable to efficient work with large data in tabular format. There are two major drivers for Pyhon: pyarrow and fastparquet. The first one can be integrated with Azure data flows (e.g. you can stream and filter data), although it has been limited in supporting more sophisticated data such as timedelta. Remember that writing Parquet incurs memory overhead so better to do this in batches. The relevant code may look as below:

def sql_to_parquet(
    conn, query, column_names, target_dir, n_prefetch=1000000, **parquet_kwargs
):
    """Writes the result of a SQL query to a Parquet file (in chunks).

    Args:
        conn: Psycopg connection object (must be open)
        query: SQL query of "select" type
        column_names: column names given to the resulting SQL table
        target_dir: local directory where Parquet is written to; must exist, data is overwritten
        n_prefetch: chunk of SQL data processed (read from SQL and dumped to Parquet) at a time. Defaults to 1000000.
    """
    with conn.cursor(name="server_side_cursor") as cur:
        # start query
        cur.itersize = n_prefetch
        cur.execute(query)
        # set up consumer
        chunk = 0
        # consume until stream is empty
        while True:
            # get and process one batch
            row_group = cur.fetchmany(n_prefetch)
            chunk += 1
            if len(row_group) > 0:
                out = pd.DataFrame(data=row_group, columns=column_names)
                fname = os.path.join(target_dir, f"part_{chunk:04d}.parquet")
                out.to_parquet(fname, engine="pyarrow", **parquet_kwargs)
            else:
                break


def df_to_parquet(df, target_dir, chunk_size=100000, **parquet_kwargs):
    """Writes pandas DataFrame to parquet format with pyarrow.

    Args:
        df: pandas DataFrame
        target_dir: local directory where parquet files are written to
        chunk_size: number of rows stored in one chunk of parquet file. Defaults to 100000.
    """
    for i in range(0, len(df), chunk_size):
        slc = df.iloc[i : i + chunk_size]
        chunk = int(i / chunk_size)
        fname = os.path.join(target_dir, f"part_{chunk:04d}.parquet")
        slc.to_parquet(fname, engine="pyarrow", **parquet_kwargs)

Leverage Azure API properly!

The Azure API is both cryptic in documentation and under-explained in data science blogs. Below I am sharing a comprehensive receipt on how to upload and register Parquet data with minimum effort. While uploading makes the data persist, registering enables further Azure features (such as pipe-lining). Best to dump your Parquet file to a directory (large Parquet files should be chunked) with tempfile then use the Azure Dataset class to both upload to the storage (use universally unique identifier for reproducibility/avoiding path collisions) and register to the workspace (use Tabular subclass). Use classes Workspace and Datastore to facilitate interaction with Azure.

from azureml.core import Workspace, Dataset
import tempfile
from uuid import uuid4
import psycopg2

ws = Workspace.from_config()
dstore = ws.datastores.get("my_datastore")

with tempfile.TemporaryDirectory() as tempdir:
    sql_to_parquet(conn, query, query_cols, tempdir)
    target = (dstore, f"raw/{tab}/{str(uuid4())}")
    Dataset.File.upload_directory(tempdir, target)
    ds = Dataset.Tabular.from_parquet_files(target)
    ds.register(
        ws,
        name=f"dataset_{tab}",
        description="created from Sales table",
        tags={"JIRA": "ticket-01"},
        create_new_version=True,
    )