Working with Abstract Syntax Trees

Visualizing code as a syntax tree is both funny and useful, as seen from impressive applications such as creating lineage of SQL which helps to understand complex queries in business. Abstract syntax trees are not only widely used in industry but are still a subject of top academic research​1,2​.

This post demonstrates how to work with AST in Python by parsing C code with CLang/LLVM​3​ and visualizing by graphviz.

Parsing is relatively simple, particularly to users that have had already similar experiences with abstract trees, such as parsing XMLs. My advice for beginners is to avoid code factoring, but leverage functional coding features in Python. The example below shows how to extract declarations of functions and details of arguments:

from clang.cindex import Index, Config, CursorKind, TypeKind

SCRIPT_PATH = "./tcpdump/print-ppp.c"

# C99 is a proper C code standard for tcpdump, as per their docs
index = Index.create()
translation_unit = index.parse(SCRIPT_PATH, args=["-std=c99"])

# filter to nodes in the root script (ignore imported!)
script_node = translation_unit.cursor
all_nodes = script_node.get_children()
all_nodes = filter(lambda c: c.location.file.name == SCRIPT_PATH, all_nodes)

# filter to function nodes
func_nodes = filter(lambda c: c.kind == CursorKind.FUNCTION_DECL, all_nodes)

# print attributes and their types for each function
for fn in func_nodes:
    print(fn.spelling)
    for arg in fn.get_arguments():
        t = arg.type
        # handle pointers by describing their pointees
        if t.kind == TypeKind.POINTER:
            declr = t.get_pointee().get_declaration()
        else:
            declr = t.get_declaration()
        print('\t', 
            t.get_canonical().spelling,
            t.kind,
            f'arg declared in {arg.location.file}:L{arg.extent.start.line},C{arg.extent.start.column}-L{arg.extent.end.line},C{arg.extent.end.column}',
            f'{declr.spelling} declared in {declr.location.file}:L{declr.location.line}'
        )

Which gives the following output when tested on the tcpdump project

print_lcp_config_options
     struct netdissect_options * TypeKind.POINTER arg declared in ./tcpdump/print-ppp.c:L403,C39-L403,C59 netdissect_options declared in ./tcpdump/netdissect.h:L161
     const unsigned char TypeKind.ELABORATED arg declared in ./tcpdump/print-ppp.c:L403,C61-L403,C73 u_char declared in /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk/usr/include/sys/_types/_u_char.h:L30
     const unsigned int TypeKind.ELABORATED arg declared in ./tcpdump/print-ppp.c:L403,C75-L403,C86 u_int declared in /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk/usr/include/sys/_types/_u_int.h:L30
ppp_hdlc
     struct netdissect_options * TypeKind.POINTER arg declared in ./tcpdump/print-ppp.c:L1359,C10-L1359,C33 netdissect_options declared in ./tcpdump/netdissect.h:L161
     const unsigned char * TypeKind.POINTER arg declared in ./tcpdump/print-ppp.c:L1360,C10-L1360,C25 u_char declared in /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk/usr/include/sys/_types/_u_char.h:L30
     unsigned int TypeKind.ELABORATED arg declared in ./tcpdump/print-ppp.c:L1360,C27-L1360,C39 u_int declared in /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk/usr/include/sys/_types/_u_int.h:L30
...

However, the funny part comes from visualization. This is easy with graphviz

from graphviz import Digraph

dot = Digraph(strict=True)
dot.attr(rankdir="LR", size="20,100", fontsize="6")

node_args = {"fontsize": "8pt", "edgefontsize": "6pt"}

for fn in func_nodes:
    fn_node_name = f"{fn.spelling}\nL{fn.location.line}"
    dot.node(fn_node_name, **node_args)
    for i, arg in enumerate(fn.get_arguments(), start=1):
        arg_node_name = arg.type.get_canonical().spelling
        dot.node(arg_node_name, **node_args)
        dot.edge(fn_node_name, arg_node_name)
        t = arg.type
        # handle pointers by describing their pointees
        if t.kind == TypeKind.POINTER:
            declr = t.get_pointee().get_declaration()
        else:
            declr = t.get_declaration()
        declr_file = f"{declr.location.file}"
        dot.node(declr_file, **node_args)
        dot.edge(
            arg_node_name, declr_file, label=f"L{declr.location.line}", fontsize="6pt"
        )

from IPython.display import display_svg
display_svg(dot)

We can now enjoy the pretty informative graph 😎 It shows that multiple functions share only few types of arguments and gives precise information about their origin.

The fully working example is shared here as a Colab notebook.

  1. 1.
    Grafberger S, Groth P, Stoyanovich J, Schelter S. Data distribution debugging in machine learning pipelines. The VLDB Journal. Published online January 31, 2022:1103-1126. doi:10.1007/s00778-021-00726-w
  2. 2.
    Fu H, Liu C, Wu B, Li F, Tan J, Sun J. CatSQL             : Towards Real World Natural Language to SQL Applications. Proc VLDB Endow. Published online February 2023:1534-1547. doi:10.14778/3583140.3583165
  3. 3.
    Lattner C, Adve V. LLVM: A compilation framework for lifelong program analysis & transformation. International Symposium on Code Generation and Optimization, 2004 CGO 2004. doi:10.1109/cgo.2004.1281665

Published by mskorski

Scientist, Consultant, Learning Enthusiast

Leave a comment

Your email address will not be published. Required fields are marked *