Dissertation Defense

Provenance in Modifiable Datasets

Jing Zhang

The provenance of derived data, which explains the derivation and retrieves or
captures the source data, is valuable to data consumers for a variety of
purposes, e.g., audit requirements, error tracing, and data reproduction.
The provenance of a derived datum should include all
the details about how it is derived, including in particular, the source data
used in its derivation. The provenance of a derived datum can be recorded during
the original derivation process, but storing it explicitly can incur very high
storage cost. Therefore, techniques have been developed to record only a small
amount of information, from which the full provenance can later be retrieved
from the source dataset. Such retrieval relies on the provenance still being
present in the dataset so that tracing queries can retrieve it.
However, many datasets are subject to modifications, e.g., when new experimental
data is collected and stored.

In this thesis, we investigate the retrieval of the provenance of a
derived datum from a modifiable dataset. Specifically, we consider the following
four questions: (1) can we explain what a particular derived datum depends on,
even if a value used in its derivation has since been modified; (2) can we determine
whether a particular derived datum is still valid after the source dataset is modified,
by examining its provenance rather than performing full view maintenance;
(3) can we retrieve only part of the provenance of a given datum, either at the user's
request or because the rest of the provenance is missing; (4) can we retrieve the
provenance of a derived datum without predefined granularity in an unstructured
dataset. We provide affirmative answers to these questions in the form
of new techniques that use limited space and computational effort.

Sponsored by

H. V. Jagadish