Visualizing code as a syntax tree is both fun and useful, as shown by impressive applications such as building SQL lineage, which helps to understand complex business queries. Abstract syntax trees (ASTs) are not only widely used in industry but are still a subject of top academic research [1,2].
This post demonstrates how to work with ASTs in Python by parsing C code with Clang/LLVM [3] and visualizing the result with graphviz.
Parsing is relatively simple, particularly for users who already have experience with tree structures, for example from parsing XML. My advice for beginners is to avoid code factoring and instead leverage the functional features of Python. The example below shows how to extract function declarations and the details of their arguments:
from clang.cindex import Index, Config, CursorKind, TypeKind

SCRIPT_PATH = "./tcpdump/print-ppp.c"

# C99 is the appropriate C standard for tcpdump, as per their docs
index = Index.create()
translation_unit = index.parse(SCRIPT_PATH, args=["-std=c99"])

# filter to nodes in the root script (ignore imported!)
script_node = translation_unit.cursor
all_nodes = script_node.get_children()
all_nodes = filter(
    # some nodes may have no file attached, so check for None first
    lambda c: c.location.file is not None and c.location.file.name == SCRIPT_PATH,
    all_nodes,
)

# filter to function nodes; keep them in a list so they can be iterated more than once
func_nodes = list(filter(lambda c: c.kind == CursorKind.FUNCTION_DECL, all_nodes))

# print arguments and their types for each function
for fn in func_nodes:
    print(fn.spelling)
    for arg in fn.get_arguments():
        t = arg.type
        # handle pointers by describing their pointees
        if t.kind == TypeKind.POINTER:
            declr = t.get_pointee().get_declaration()
        else:
            declr = t.get_declaration()
        print('\t',
            t.get_canonical().spelling,
            t.kind,
            f'arg declared in {arg.location.file}:L{arg.extent.start.line},C{arg.extent.start.column}-L{arg.extent.end.line},C{arg.extent.end.column}',
            f'{declr.spelling} declared in {declr.location.file}:L{declr.location.line}'
        )
This gives the following output when run on the tcpdump project:
print_lcp_config_options
    struct netdissect_options * TypeKind.POINTER arg declared in ./tcpdump/print-ppp.c:L403,C39-L403,C59 netdissect_options declared in ./tcpdump/netdissect.h:L161
    const unsigned char TypeKind.ELABORATED arg declared in ./tcpdump/print-ppp.c:L403,C61-L403,C73 u_char declared in /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk/usr/include/sys/_types/_u_char.h:L30
    const unsigned int TypeKind.ELABORATED arg declared in ./tcpdump/print-ppp.c:L403,C75-L403,C86 u_int declared in /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk/usr/include/sys/_types/_u_int.h:L30
ppp_hdlc
    struct netdissect_options * TypeKind.POINTER arg declared in ./tcpdump/print-ppp.c:L1359,C10-L1359,C33 netdissect_options declared in ./tcpdump/netdissect.h:L161
    const unsigned char * TypeKind.POINTER arg declared in ./tcpdump/print-ppp.c:L1360,C10-L1360,C25 u_char declared in /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk/usr/include/sys/_types/_u_char.h:L30
    unsigned int TypeKind.ELABORATED arg declared in ./tcpdump/print-ppp.c:L1360,C27-L1360,C39 u_int declared in /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk/usr/include/sys/_types/_u_int.h:L30
...
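One Python detail matters when following the functional style used here: `filter` returns a lazy iterator that can be consumed only once, so a second loop over the same object sees nothing. The sketch below illustrates this with plain tuples standing in for clang cursors (the `kinds` list and the `is_function` helper are made up for illustration, using names from the output above), together with the usual remedy of materializing the result with `list`.

```python
# Stand-in data: (kind, name) pairs instead of real clang cursors.
kinds = [
    ("FUNCTION_DECL", "print_lcp_config_options"),
    ("TYPEDEF_DECL", "u_char"),
    ("FUNCTION_DECL", "ppp_hdlc"),
]


def is_function(node):
    """Hypothetical predicate mirroring the CursorKind.FUNCTION_DECL check."""
    return node[0] == "FUNCTION_DECL"


# A bare filter object is exhausted after the first pass.
lazy_nodes = filter(is_function, kinds)
first_pass = [name for _, name in lazy_nodes]
second_pass = [name for _, name in lazy_nodes]

# Wrapping it in list() keeps the results around for later loops.
stored_nodes = list(filter(is_function, kinds))
stored_first = [name for _, name in stored_nodes]
stored_second = [name for _, name in stored_nodes]

print(first_pass, second_pass)
print(stored_first, stored_second)
```

This is worth keeping in mind whenever the same set of function nodes is used both for printing and for drawing.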
The fun part, however, is the visualization, which is easy with graphviz:
from graphviz import Digraph
from IPython.display import display_svg

dot = Digraph(strict=True)
dot.attr(rankdir="LR", size="20,100", fontsize="6")
node_args = {"fontsize": "8pt", "edgefontsize": "6pt"}

for fn in func_nodes:
    # one node per function, labelled with its name and line number
    fn_node_name = f"{fn.spelling}\nL{fn.location.line}"
    dot.node(fn_node_name, **node_args)
    for arg in fn.get_arguments():
        # one node per canonical argument type, linked to the function
        arg_node_name = arg.type.get_canonical().spelling
        dot.node(arg_node_name, **node_args)
        dot.edge(fn_node_name, arg_node_name)
        t = arg.type
        # handle pointers by describing their pointees
        if t.kind == TypeKind.POINTER:
            declr = t.get_pointee().get_declaration()
        else:
            declr = t.get_declaration()
        # link the type to the file that declares it
        declr_file = f"{declr.location.file}"
        dot.node(declr_file, **node_args)
        dot.edge(
            arg_node_name, declr_file, label=f"L{declr.location.line}", fontsize="6pt"
        )

display_svg(dot)
We can now enjoy a pretty informative graph 😎 It shows that multiple functions share only a few argument types and gives precise information about where those types are declared.
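The observation about shared argument types can also be checked without drawing anything, by collecting the graph edges as plain data first. The following is a minimal sketch, not the code from the notebook: `signatures` is copied by hand from the printed output earlier in the post for two functions only, and `build_edges` and `type_usage` are hypothetical helper names. Storing the edges in a set mirrors the effect of `Digraph(strict=True)`, which merges duplicate edges.

```python
from collections import Counter

# Canonical argument types per function, copied by hand from the printed output.
signatures = {
    "print_lcp_config_options": [
        "struct netdissect_options *",
        "const unsigned char",
        "const unsigned int",
    ],
    "ppp_hdlc": [
        "struct netdissect_options *",
        "const unsigned char *",
        "unsigned int",
    ],
}


def build_edges(signatures):
    """Return the set of (function, argument type) edges."""
    return {
        (fn, arg_type)
        for fn, arg_types in signatures.items()
        for arg_type in arg_types
    }


def type_usage(edges):
    """Count how many functions use each argument type."""
    return Counter(arg_type for _, arg_type in edges)


edges = build_edges(signatures)
usage = type_usage(edges)
for arg_type, count in usage.most_common():
    print(f"{count} function(s) take {arg_type}")
```

Separating the data step from the rendering step like this also makes it easier to test the extraction logic on its own.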
The fully working example is shared here as a Colab notebook.
- 1. Grafberger S, Groth P, Stoyanovich J, Schelter S. Data distribution debugging in machine learning pipelines. The VLDB Journal. Published online January 31, 2022:1103-1126. doi:10.1007/s00778-021-00726-w
- 2. Fu H, Liu C, Wu B, Li F, Tan J, Sun J. CatSQL: Towards Real World Natural Language to SQL Applications. Proc VLDB Endow. Published online February 2023:1534-1547. doi:10.14778/3583140.3583165
- 3. Lattner C, Adve V. LLVM: A compilation framework for lifelong program analysis & transformation. International Symposium on Code Generation and Optimization, 2004 CGO 2004. doi:10.1109/cgo.2004.1281665