Many of the techniques and algorithms used in machine learning and data science assume that the empirical distribution of the available data is an accurate approximation of the underlying phenomenon being investigated.
However, when dealing with complex or high-dimensional distributions, even large datasets can fail to accurately represent the distribution's core.
For example, in large genomic datasets many rare genetic variants go unobserved, and in a large natural-language corpus, many plausible five-word sequences may never appear.
In this webinar, Stanford's Dr. Gregory Valiant discusses the challenges of, and solutions for, making accurate inferences in this difficult regime, where the empirical distribution of the available data is misleading.
Learn how to extract accurate information about the underlying distribution, including information about the portion that has not been observed in the given dataset.
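To make the idea of reasoning about the unobserved portion of a distribution concrete, here is a minimal sketch of the classic Good-Turing "missing mass" estimate. Note this is an illustrative example of the general problem area, not necessarily the specific method presented in the webinar:

```python
from collections import Counter

def good_turing_missing_mass(samples):
    """Estimate the total probability of all outcomes never observed.

    The Good-Turing insight: the fraction of samples that are
    'singletons' (outcomes seen exactly once) is a good estimate of
    the combined probability mass of outcomes seen zero times.
    """
    counts = Counter(samples)
    n = len(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n if n else 0.0

# Example: draws from a distribution with several rare outcomes.
samples = ["a", "a", "a", "b", "b", "c", "d", "e"]
print(good_turing_missing_mass(samples))  # 3 singletons / 8 samples = 0.375
```

In a genomics setting, this corresponds to estimating how much of the variant frequency mass belongs to variants absent from the dataset entirely.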
You will learn: