Dissertation Defense

Information Extraction on Para-Relational Data

Shirley Zhe Chen

Para-relational data refers to nearly relational data that shares the important
qualities of relational data but does not present itself in a relational format.
Para-relational data often conveys highly valuable information and is widely used in many
different areas. If we are able to convert para-relational data into the relational
format, many existing tools can be leveraged for a variety of interesting applications,
such as data analysis with relational queries and data integration.

In response, we have developed four standalone systems, each of which
addresses a specific type of para-relational data. Senbazuru is a prototype spreadsheet
database management system that is able to extract relational information from a
large number of spreadsheets; Anthias extends Senbazuru
to convert a broader range of spreadsheets into a relational format; Lyretail
is an extraction system that aims to detect long-tail dictionary entities on webpages;
finally, DiagramFlyer is a web-based search system for a large number of
diagrams automatically extracted from web-crawled PDFs. Together, these four
systems demonstrate that converting para-relational data into the relational format
is possible today, and also suggest directions for future systems.

Sponsored by

Michael Cafarella