Static Analysis of Data Transformations in Jupyter Notebooks

Jupyter notebooks used to pre-process and polish raw data for data science and machine learning processes are challenging to analyze. Their data-centric code manipulates dataframes through calls to library functions with complex semantics, and the properties to track over it vary widely depending on the verification task. This paper presents a novel abstract domain that simplifies writing analyses for such programs by extracting from the notebook a single CFG that contains all transformations applied to the data. Several properties can then be determined by analyzing this CFG, which is simpler than the original Python code. We present a first use case that exploits our analysis to infer the required shape of the dataframes manipulated by the notebook.


Introduction
The ever-increasing usage of data-driven decision processes has led to data science (DS) and machine learning (ML) permeating several areas of everyday life, reaching outside the boundaries of computer science and software engineering. Ensuring correctness of these processes is particularly important when they are employed in critical areas like medicine, public policy, or finance. In contrast to robustness verification of trained ML models [9], the data pre-processing stage of the DS/ML pipeline has received little attention. As raw data is often inconsistent or incomplete, pre-processing programs, typically implemented in Jupyter notebooks, apply transformations to polish it to the point where it can be visualized or used for training. Errors and inconsistencies at this stage can silently propagate down the DS/ML chain, leading to incorrect conclusions and below-par models [1].
Verification of notebooks can take different directions, such as detecting data leakages (i.e., sharing of information between the training and test datasets [7]) or warning about the introduction of bias or skews [11]. Regardless, Jupyter notebooks are challenging to analyze: code comes in blocks that can be executed in any order with repetitions, and data is manipulated through calls to a vast number of library functions with complex and possibly overlapping semantics.
This paper proposes an abstract interpretation approach to simplify the implementation of verification techniques for Jupyter notebooks containing DS/ML programs. We propose an abstract domain that tracks transformations made to dataframes, that is, the in-memory tables containing the input data, in a single graph. The latter contains data transformations as nodes that are linked by edges encoding the order in which they are applied. The final graph produced at the end of the analysis is a control flow graph (CFG) containing only dataframe transformations: analyses such as the ones mentioned above can be implemented as fixpoints over this CFG, instead of tackling the more complex Python code.
This paper is structured as follows. Section 2 discusses related work. Section 3 introduces PyLiSA, the analyzer used to evaluate our domain, and LiSA, the framework it relies on. Section 4 defines the abstraction for dataframe values. We then explore a first use case in Section 5, where we use our abstraction for inferring the shape of the dataframes used by a program. We conclude with a preliminary experiment on a real DS notebook in Section 6.

Related Work
Obtaining formal guarantees on the safety and fairness of ML models has been a subject of recent widespread interest [9].
Our work builds upon this large ecosystem and proposes a verification framework for the data pre-processing stage of the ML pipeline. The closest body of work is [7], which also analyses DS notebooks along with their peculiar execution semantics and proposes an abstraction to detect data leakage. Our approach towards the shape inference of input data follows the work of [8] and extends it to support inputs to programs which contain datasets. Similarly, our objective of inferring input data usage directly derives from the compound data structure usage analysis presented in [10] and adds the ability to track the usage of selections of datasets. The objective of detecting bias/skew introduction is inspired by the mlinspect tool proposed in [4]. This tool builds a directed acyclic graph (DAG) of operations (like filters or projections) applied to the data by analyzing the code and using framework-specific backends (like scikit-learn). After analyzing the DAG, it suggests potential sources of bias/skew. Although this is promising, it only performs syntactic checks and cannot concretely detect which operations cause these problems. Lastly, [5] is an automated data provenance system for Python. However, it requires executing the code, which is not always feasible when large datasets are involved.

LiSA and PyLiSA
LiSA [3,6] (Library for Static Analysis) is a modular framework for developing static analyzers based on the theory of abstract interpretation. LiSA analyzes CFGs whose statements do not have predefined semantics: instead, users of the framework define custom statement instances implementing language-specific semantic functions, enabling the analysis of a wide range of programming languages and the development of multilanguage analyses. The analysis infrastructure is partitioned into three main areas: call evaluation, memory modeling, and value analysis. Each area corresponds to a separate analysis component that operates agnostically w.r.t. how the others are implemented. At first, calls are abstracted by the Interprocedural Analysis, which leaves the remaining components with call-less programs. Then, memory-related expressions are abstracted by the Heap Domain, yielding call- and memory-less programs for the Value Domain to analyze. Code parsing and semantics are defined in Frontends, which can also provide implementations for LiSA's components. In this paper, we employ the Python frontend PyLiSA.

The Dataframe Graph Domain
We present an abstraction able to capture the structure of the dataframes manipulated in Python code. Intuitively, we employ a graph structure to keep track of all operations that involve dataframes, with edges encoding the order in which they are performed. The graph thus represents the state of each dataframe at a given program point. Nodes of this graph can be referenced by variables of the program. Each such variable thus refers to the dataframe corresponding to the sub-graph obtained with a backward DFS starting from the node. We adopt a two-level mapping: program variables point to labels, and the latter are mapped to the nodes. This enables simple handling of dataframe aliasing (i.e., two variables will be mapped to the same label) and updates (i.e., by changing the nodes pointed to by a label, we indirectly update all variables pointing to the label).
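The two-level mapping can be illustrated with a minimal Python sketch. All names here (`variables`, `labels`, `nodes_of`, the label `l1`) are hypothetical and chosen for illustration only; they are not taken from PyLiSA.

```python
# Hypothetical illustration of the two-level mapping; not PyLiSA code.
variables = {}  # program variable -> set of labels
labels = {}     # label -> set of node identifiers

# df1 = <read of some file>: a fresh label l1 names the read node 1.
labels["l1"] = {1}
variables["df1"] = {"l1"}

# alias = df1: aliasing amounts to sharing the same label.
variables["alias"] = set(variables["df1"])


def nodes_of(var):
    """Resolve a variable to the graph nodes it may refer to."""
    return set().union(*(labels[label] for label in variables[var]))


before = nodes_of("alias")

# An update of the dataframe re-targets the label to a new node 2:
# every variable pointing to l1 observes the change.
labels["l1"] = {2}
after = nodes_of("alias")
```

Since both `df1` and `alias` hold the label rather than the node, a single change to the label map is enough to update both.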
Our domain $D^\#$ is meant to be an abstraction of the portion $D$ of the program state that stores information about dataframes: we rely on an auxiliary domain to reason about the remainder of the state. As, in our experience, non-dataframe values appearing in the notebooks we target are mostly constants, $D^\#$ cooperates with a simple constant propagation domain $CP^\#$. We rely on the semantics of $CP^\#$ to abstract:
• a string expression s as a constant string $\sigma$;
• a list of strings cl as a constant list of constant strings $\langle \sigma_1, \dots, \sigma_n \rangle$;
• a list of dataframes dl as a constant list of abstract labels $\langle \ell_1, \dots, \ell_n \rangle$ that will be introduced shortly;
• the left-hand side of an assignment x to a set of identifiers $\{x_1, \dots, x_n\}$.
We begin defining $D^\#$ by introducing the graph structure, where a graph $g^\# = (N, E) \in G^\#$ is composed of a set of nodes $N \subseteq \mathcal{N}$ and a set of edges $E \subseteq \mathcal{E}$. Elements of $\mathcal{N}$ are:
• $\mathtt{read}(\sigma)$, initializing a dataframe with the contents of file $\sigma$;
• $\mathtt{access}(c_1, \dots, c_k)$, accessing columns $c_1, \dots, c_k$, $k \in \mathbb{N}$;
• $\mathtt{transform}(f)$, transforming values through an auxiliary function $f$;
• $\mathtt{concat}$, concatenating multiple dataframes;
• $\mathtt{filter}(c, \mathit{op}, v)$, selecting rows where $r_c \;\mathit{op}\; v$ holds (with $r_c$ being the value of column $c$);
• $\mathtt{assign}(c_1, \dots, c_k)$, assigning columns $c_1, \dots, c_k$, $k \in \mathbb{N}$, to a new value;
where $\sigma$ and $v$ are string and value abstractions in $CP^\#$, $\mathit{op} \in \{=, \neq, >, \geq, <, \leq\}$, and $f$ is the signature of a Python function. Elements of $\mathcal{E}$ are instead (with $n, n' \in \mathcal{N}$, $i \in \mathbb{N}$):
• $n \to n'$, an edge encoding the sequential order of operations;
• $n \rightsquigarrow_i n'$, a concatenation edge, where $n$ is the $i$-th dataframe in the concatenation that builds $n'$ (note that $n'$ can have more incoming concatenation edges using the same $i$, indicating multiple candidates for the same index);
• $n \twoheadrightarrow n'$, an assign edge, where $n$ is the right-hand side of the assignment $n'$ (once more, $n'$ can have more incoming assign edges, indicating multiple candidates for the right-hand side).
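As a rough illustration of this structure, the sketch below encodes node and edge kinds as plain Python dataclasses. The class names, the string tags, and the file names `a.csv` and `b.csv` are assumptions made for the example; the actual representation used by the implementation may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Node:
    ident: int
    kind: str          # "read", "access", "transform", "concat", "filter", "assign"
    args: tuple = ()   # e.g. the file name or the accessed columns


@dataclass(frozen=True)
class Edge:
    src: Node
    dst: Node
    kind: str = "seq"            # "seq", "concat" or "assign"
    index: Optional[int] = None  # position in the concatenation, concat edges only


@dataclass
class Graph:
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)

    def add_edge(self, src, dst, kind="seq", index=None):
        self.nodes |= {src, dst}
        self.edges.add(Edge(src, dst, kind, index))

    def backward_closure(self, node):
        """Nodes reachable from `node` by following edges backwards (DFS)."""
        seen, stack = set(), [node]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            stack.extend(e.src for e in self.edges if e.dst == current)
        return seen


# Hypothetical program: read a.csv, access column x, concatenate with b.csv.
g = Graph()
r1 = Node(1, "read", ("a.csv",))
a1 = Node(2, "access", ("x",))
r2 = Node(3, "read", ("b.csv",))
c = Node(4, "concat")
g.add_edge(r1, a1)
g.add_edge(a1, c, "concat", 0)
g.add_edge(r2, c, "concat", 1)
history = g.backward_closure(c)
```

The backward closure of a node plays the role of the sub-graph describing the dataframe that the node stands for.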
$D^\#$ also contains two maps, both relying on abstract labels. A label $\ell \in \mathcal{L}$ is an arbitrary synthetic identifier that serves as an abstract name for a set of nodes in $\mathcal{N}$, where $\mathcal{L}$ is the finite set of all possible labels. While we do not impose any specific structure on $\mathcal{L}$, a common characterization of labels is to have one for each program point. $D^\#$ contains (i) a function $\mathcal{L} \to \wp(\mathcal{N}) \in L^\#$ from labels to sets of nodes, and (ii) a map $V \to \wp(\mathcal{L}) \in V^\#$ keeping track of which possible labels a variable can refer to. Notice that, depending on the analyzer's infrastructure, variables can correspond to abstract memory locations or program variables.
We can now define $D^\#$ as the Cartesian product $V^\# \times L^\# \times G^\#$, which is a complete lattice since $G^\#$ is a Cartesian product of powersets, and $V^\#$ and $L^\#$ are functional lifts of powersets. One concern with infinite lattices such as $D^\#$ is the convergence of fixpoint iterations over them. As $G^\#$ intuitively does not satisfy the ascending chain condition (ACC), a widening operator is required. Since, in our experience, the DS notebooks that this domain targets mostly contain sequential code with very few loops that stabilize in a few iterations, we employ a naive widening operator. With such an operator we ensure termination of the analysis, and we leave the study of a more precise widening as future work.
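The lattice structure can be sketched in Python as follows. The ordering and the join are taken pointwise, as suggested by the product-of-powersets construction. The `widen` function is an assumption: it follows the textbook naive widening, which keeps the previous state when the new one adds nothing and otherwise jumps to a top element, and it is not necessarily the exact operator used by the domain. All names (`State`, `TOP`, `widen`) are hypothetical.

```python
from dataclasses import dataclass

TOP = object()  # assumed "no information" element of the lattice


def join_maps(a, b):
    """Pointwise union of two maps whose values are frozensets."""
    return {k: a.get(k, frozenset()) | b.get(k, frozenset())
            for k in a.keys() | b.keys()}


def leq_maps(a, b):
    """Pointwise inclusion of two maps whose values are frozensets."""
    return all(v <= b.get(k, frozenset()) for k, v in a.items())


@dataclass
class State:
    variables: dict   # variable -> frozenset of labels
    labels: dict      # label -> frozenset of nodes
    nodes: frozenset
    edges: frozenset

    def leq(self, other):
        return (leq_maps(self.variables, other.variables)
                and leq_maps(self.labels, other.labels)
                and self.nodes <= other.nodes
                and self.edges <= other.edges)

    def join(self, other):
        return State(join_maps(self.variables, other.variables),
                     join_maps(self.labels, other.labels),
                     self.nodes | other.nodes,
                     self.edges | other.edges)


def widen(old, new):
    """Assumed naive widening: stable iterates are kept, anything else goes to TOP."""
    if old is TOP or new is TOP:
        return TOP
    return old if new.leq(old) else TOP


s1 = State({"df": frozenset({"l1"})}, {"l1": frozenset({1})},
           frozenset({1}), frozenset())
s2 = State({"df": frozenset({"l2"})}, {"l2": frozenset({2})},
           frozenset({1, 2}), frozenset({(1, 2)}))
joined = s1.join(s2)
```

Such an operator trades precision for guaranteed termination, which is acceptable when loops are rare and stabilize quickly.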
Example. Figure 2 reports the $d^\#$ instance abstracting the code of Figure 1. For the sake of clarity, nodes of $g^\#$ are enriched with a numerical identifier in the top-left corner to easily identify them. Such identifiers are used in the codomain of $l^\#$ to represent them. We show how this graph is constructed while defining the abstract semantics.
The connection between $D$ and $D^\#$ is established by the abstraction function $\alpha$ and the concretization function $\gamma$. A set of functions $\{d_1, \dots, d_n\}$, $d_i \in D$, $1 \le i \le n$, $n \in \mathbb{N}^\infty$, can be abstracted to an element $d^\# \in D^\#$ through the function $\alpha : \wp(D) \to D^\#$, defined as the lub of the abstractions of each individual $d_i$. The abstraction of a single function is defined as $\alpha(d) = (v^\#, l^\#, g^\#)$. The abstraction of a single dataframe map exploits $\mathit{shape} : D \to G^\# \times \mathcal{N}$, an auxiliary function that extracts the shape of a concrete dataframe as a single-path graph containing only a read node followed by an access node reporting all the existing columns, and that returns the graph itself and its unique leaf. The abstraction of a set of states is thus the union of the abstractions of each individual state, each generated by creating the graph $g^\#$ containing the shape (that is, the access to all columns, optionally preceded by the reading of the source file) of all existing dataframes, and having each variable refer to the corresponding node in $g^\#$.

Figure 2: Example $d^\#$ abstracting the code of Figure 1.
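To make the role of the shape function more concrete, here is a minimal sketch that abstracts one concrete state. As a simplifying assumption, a concrete dataframe is modeled as a pair (source file or `None`, list of column names) rather than as a real pandas object, and nodes are plain tuples. The names `shape` and `alpha_single`, the label scheme, and the file name `data.csv` are hypothetical.

```python
from itertools import count

_ids = count(1)  # generator of fresh node identifiers


def shape(source, columns):
    """Single-path graph: optional read node followed by an access node.

    Returns (nodes, edges, leaf), where the leaf is the access node.
    """
    access = ("access", next(_ids), tuple(columns))
    if source is None:
        return {access}, set(), access
    read = ("read", next(_ids), source)
    return {read, access}, {(read, access)}, access


def alpha_single(state):
    """Abstract one concrete state, given as variable -> (source, columns)."""
    variables, labels, nodes, edges = {}, {}, set(), set()
    for i, var in enumerate(sorted(state), start=1):
        source, columns = state[var]
        n, e, leaf = shape(source, columns)
        label = f"l{i}"          # one fresh label per dataframe
        nodes |= n
        edges |= e
        labels[label] = {leaf}   # the label names the leaf of the shape
        variables[var] = {label}
    return variables, labels, nodes, edges


v, l, n, e = alpha_single({"df1": ("data.csv", ["a", "b"])})
leaf = next(iter(l["l1"]))
```

Abstracting a set of states would then amount to joining the results of `alpha_single` componentwise.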
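As $\alpha$ is join-preserving, the concretization function $\gamma$ can be defined in terms of $\alpha$, according to Proposition 7 of [2], also inducing the Galois connection $\langle \wp(D), \subseteq \rangle \overset{\gamma}{\underset{\alpha}{\leftrightarrows}} \langle D^\#, \sqsubseteq \rangle$, where $\sqsubseteq$ is the lift of $\subseteq$ to functions and Cartesian products.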
Abstract Semantics. The abstract semantics of $D^\#$ is defined w.r.t. that of $CP^\#$, which is used to evaluate non-dataframe expressions. $D^\#$ evaluates dataframe expressions to a set of labels, identifying the nodes that represent the dataframes corresponding to the expression. In the following, we give intuitive definitions of the semantics of expressions that involve dataframes.
Assignment. Whenever the right-hand side of an assignment x = df is a dataframe expression, $v^\#$ must be updated for the corresponding variable. Specifically, as df evaluates to a set of labels $\{\ell_1, \dots, \ell_m\}$, and x evaluates to a set of identifiers $\{x_1, \dots, x_k\}$, $v^\#$ can be updated by binding these identifiers to those labels.
Example. When evaluating the assignment at line 2 of Figure 1, the semantics stores the label pointing to the read node (whose creation is dictated by the abstract semantics of read). $v^\#$ is thus extended with the pair $(\mathtt{df1}, \{\ell_1\})$, where $\ell_1$ is the label returned by the semantics of read.
Variable evaluation. Whenever a variable is referenced throughout the program, our semantics must evaluate it to the corresponding labels if it refers to a dataframe, while the remaining variables are handled by $CP^\#$. Thus, when x resolves to a dataframe, it evaluates to $v^\#(x)$.
Example. When evaluating line 6 of Figure 1, df1 and df2 must first be resolved to the dataframes they represent: df1 evaluates to $\{\ell_1\}$ while df2 evaluates to $\{\ell_4\}$, as the two variables are mapped to $\{\ell_1\}$ and $\{\ell_4\}$ in $v^\#$, respectively.
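The assignment and variable evaluation rules can be mirrored by the short sketch below. It assumes a strong update (the previous binding of each target is overwritten) and models the constant propagation domain as a plain dictionary. Both choices, as well as the function names, are illustrative assumptions rather than the implemented semantics.

```python
def assign(variables, targets, labels):
    """x = df: bind every identifier in `targets` to the labels of df.

    Assumption: a strong update, overwriting any previous binding.
    """
    updated = dict(variables)
    for x in targets:
        updated[x] = set(labels)
    return updated


def evaluate(variables, constants, x):
    """Resolve x to its labels if it is a dataframe, else defer to constants."""
    if x in variables:
        return variables[x]
    return constants.get(x)


# Mirrors the running example: df1 is bound to l1 and df2 to l4.
v = assign({}, {"df1"}, {"l1"})
v = assign(v, {"df2"}, {"l4"})
constants = {"n": 3}  # hypothetical non-dataframe variable

df1_labels = evaluate(v, constants, "df1")
df2_labels = evaluate(v, constants, "df2")
n_value = evaluate(v, constants, "n")
```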
As a first experiment, we selected the "Coronavirus (COVID-19) Visualization & Prediction" notebook, one of the most popular notebooks aggregating data from different sources on Kaggle, a public repository of Jupyter notebooks for DS. The graph produced when analyzing such code is published in a GitHub Gist, as it is too large for this manuscript. Note that the implemented analysis supports additional pandas constructs w.r.t. the ones presented in this paper, which have been omitted as they do not contribute further to the intuition behind the domain. In the graph, these take the form of additional node kinds, whose intuitive meaning is explained in the Gist's introduction. The analysis generates the following warnings (where URLs of CSV files have been trimmed for compactness), correctly identifying all column names that appear in the notebook: [File:

Conclusion
This paper presents an abstract interpretation approach to analyze Python programs employed in data science and machine learning. Such programs manipulate dataframes, that is, complex in-memory tables collecting data that can be used to guide decision processes or train machine learning models. We designed an abstract domain that extracts the operations performed over dataframes, building a graph that encodes the order in which they are performed. Such a graph can be the subject of further analyses, inferring several properties such as the shape of the dataframes read by the program, or the absence of data leakages between training and testing phases of a machine learning process. As a guiding example of how to exploit our domain, we defined a simple abstract interpretation that computes, for each file read by the source program (and thus present inside the graph), the set of columns that are either accessed before being assigned, or defined through an assignment. We provided an early implementation of both domains in PyLiSA, a LiSA frontend for Python programs.
There are plenty of future directions that our work can take. As this work is still ongoing, the obvious first direction is to prove the soundness of the proposed semantics