Dissertation Defense

Streaming Sketching: Mathematical Theory and Practical Algorithms

Dingyu WangPh.D. Candidate
WHERE:
3725 Beyster Building
SHARE:

Hybrid Event: 3725 BBB / Zoom

Abstract: Streaming sketches are data structures that can approximately answer queries of a data stream, whose size is much smaller than the stream length or the domain size. This thesis studies a wide range of topics on streaming sketching including approximate counting, cardinality estimation, tractability of queries, universality of sketches, sequential sketches, as well as sampling. The investigation leads to progress in both theoretical understanding of streaming sketches as well as the development of practical algorithms.

For the theoretical direction I bring new tools to help understanding of streaming computation with suitable simplifying assumptions.  I connect streaming sketches with the theory of Lévy processes. The Lévy-Khintchine representation theorem shades new light on the tractability of queries. It also implies the way that fresh randomness and indexed randomness should be mixed for obtaining streaming $G$-samplers with precisely correct sampling probabilities. I also construct a new universal sketch whose error is determined by the harmonic structure of a query. Furthermore, I study cardinality sketches through two information measures—Shannon’s entropy and Fisher information. A natural goodness metric of such sketches can be defined based on these two measures.

For the practical direction I strive to develop and analyze algorithms that are nearly optimal and simple to implement. I invent a class of statistics for cardinality sketches, called generalized remaining areas, based on which simple cardinality estimators can be constructed that are close to statistically efficient. A very simple $F_{1/2}$-sampler is constructed based on its connection to subordinators. I also find a simple way to patch the classic HyperLogLog sketch to estimate Fp-moments with exactly the same error distribution. Finally, I invent a simple but optimal multi-dimensional approximate counter using variable-length encoding.

Organizer

CSE Graduate Programs Office

Faculty Host

Prof. Seth Pettie