Home
Welcome to the DeepDive wiki! Also check out the DeepDive webpage.
DeepDive is a new type of system that enables developers to analyze data on a deeper level than ever before. DeepDive is a trained system, which means that it uses machine learning techniques to incorporate domain-specific knowledge and user feedback to improve the quality of its analysis. DeepDive is different from traditional systems in several ways:
- DeepDive is aware that useful data is often noisy and imprecise: names are misspelled, natural language is ambiguous, and humans make errors. To help deal with such imprecision, DeepDive produces a calibrated probability for every assertion it makes. For example, if DeepDive produces a fact with probability 0.9, the fact is 90% likely to be correct.
- DeepDive is able to use large amounts of data from a variety of sources. Applications built using DeepDive have extracted data from millions of documents, web pages, PDFs, tables, and figures.
- DeepDive allows developers to use their knowledge of a given domain to improve the quality of the results, by writing simple rules and by giving feedback on whether predictions are correct.
- DeepDive is able to learn “distantly” from your data. In contrast, most machine learning systems require tedious, hand-supplied training for each prediction. In fact, first versions of DeepDive-based systems often do not have any traditional training data at all!
- DeepDive’s secret is a scalable, high-performance inference and learning engine. For the past few years, we have been working to make the underlying algorithms run as fast as possible. Techniques pioneered in this project are part of commercial and open source tools, including MADlib, Impala, and a product from Oracle. Low-level techniques, such as Hogwild!, have been adopted in Google Brain.
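To make the calibration claim in the first bullet concrete, here is a minimal sketch of how one might check it. It is not DeepDive's own code or API: the function name `calibration_table`, the bucket count, and the sample data are all assumptions made for illustration. The idea is to group predictions by reported probability and compare each group with the fraction of predictions that turned out to be correct.

```python
from collections import defaultdict


def calibration_table(predictions, num_buckets=10):
    """Bucket (probability, is_correct) pairs by probability and return
    {bucket_index: (count, observed_accuracy)}.

    Hypothetical helper for illustration; not part of DeepDive.
    """
    buckets = defaultdict(list)
    for prob, is_correct in predictions:
        # Clamp so that a probability of exactly 1.0 lands in the last bucket.
        index = min(int(prob * num_buckets), num_buckets - 1)
        buckets[index].append(is_correct)

    table = {}
    for index in sorted(buckets):
        outcomes = buckets[index]
        # True counts as 1 and False as 0, so the mean is the accuracy.
        table[index] = (len(outcomes), sum(outcomes) / len(outcomes))
    return table


# Made-up data: ten facts reported with probability 0.9, nine of them correct.
example = [(0.9, True)] * 9 + [(0.9, False)]
table = calibration_table(example)
print(table)  # {9: (10, 0.9)}
```

In a well-calibrated system, the observed accuracy in each bucket should be close to the probabilities that fall into it, as in the made-up example where facts reported at 0.9 are correct 90% of the time.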
Over the last few years, we have built applications both for broad domains, such as reading the Web, and for specific domains, such as paleobiology. In collaboration with Shanan Peters (http://paleobiodb.org/), we built a system that reads documents with higher accuracy and from larger corpora than expert human volunteers. This is exciting because it demonstrates that trained systems may have the ability to change the way science is conducted.
In research papers, we have demonstrated DeepDive on financial documents, oil and gas documents, and NMR data. For example, we have shown that DeepDive is able to understand tabular data by reading the text of the reports. We are using DeepDive to support our own research into how knowledge can be used to build the next generation of data processing systems.