Skip to content
ZhengYang edited this page Aug 25, 2012 · 88 revisions

Introduction

Document Collection Foreign-data Wrapper (DC FDW) allows users to map an entire directory of documents (e.g. Reuters Corpora RCV1) as a single foreign table in PostgreSQL database. The FDW supports building inverted index and postings file in a user specific location. And then use the index and postings file to support various types of information retrieval tasks in the context of PostgreSQL database. This project is a Google Summer of Code 2012 project. Original project proposal can be found here.

Content:

Note:
1.The code uses some PostgreSQL built-in libraries, see PostgreSQL Source Code Documentation for more details.
2.More about foreign-data wrappers, see PostgreSQL SQL/MED wiki page for more information.

Design Details

Document Collection Foreign-data Wrapper allows users to map an entire directory of text files into a foreign table.
In Information Retrieval (IR) context, a document collection is a set of document from which users are interested in
extracting information (searching). By mapping a document collection to a foreign table, a user can easily search through
the set of documents by issuing an SQL query to the PostgreSQL server.

Let's take a look at what a typical document collection looks like:

For example: The Reuters-21578 benchmark corpus, ApteMod version in NLTK

$ cd ~/nltk_data/corpora/reuters/test/
$ ls
14826 15312 15751 16179 16604 17540 18095 18810 19471 20112 20651 21146
14828 15313 15753 16180 16606 17543 18096 18811 19473 20114 20652 21148
14829 15314 15757 16185 16607 17544 18099 18816 19474 20116 20653 21149
...
$ cat 16179
LME CLARIFIES NEW ALUMINIUM CONTRACT DETAILS
The London Metal Exchange (LME) has
issued a note clarifying details on its new high grade
aluminium contract, in response to questions from members
following the announcement of the contract, due to start June
1.
All deliverable shapes of aluminium under the high grade
primary aluminium contract (minimum 99.7 pct purity) will also
be deliverable against the standard primary aluminum contract
(min 99.5 pct), the LME said.
Sows will not constitute good delivery against the standard
contract until September 1, and 99.5 pct purity sows are not
good delivery and cannot be placed on LME warrant.
The dollar quotation for the high grade contract will be in
multiples of one U.S. Dollar but carries may be made at 50
cents for even tonnages only.
Singapore, which is the first port warehouse outside Europe
to be used as an LME delivery point, will be used for high
grade metal only and the rent imposed by owners Steinweg will
be 1.05 U.S. Dlr a tonne per week, the LME said.
The LME Board, in response to representation from the
trade, agreed to annul from LME contracts the minimum weight
requirements of 450 kilos for T-bars and 250 kilos for sows,
effective for high grade on June 1 and for standard on July 24.

The text document can be in any language. In this version, we only look at English. A text document in a collection consists of 2 parts: id and content.

Therefore, in foreign table options, we need to provide the following information:

data_dir      [where the files in a document collection are located]
index_dir     [where the index files are located: postings file, dictionary file]
index_method  [either In-memory(IM) Indexing or Single-pass in-memory(SPIM) indexing]
buffer_size   [when using SPIM indexing, this is the limit of memory (in MB) available]
id_col        [the column name for mapping doc id]
text_col      [the column name for mapping doc content]

Indexing

In indexing phase, the indexer will take each file in the document collection and perform the following actions:

  1. Load file content into memory
  2. Convert the content of the file into a TSVector (essentially a table of terms)
  3. Add doc_id to the postings list of each term that is appearing in its TSVector in the dictionary (implemented as a hashtable HTAB).

There are 2 indexing methods: In-memory indexing (IM) where a single global dictionary is used and postings lists are stored in memory; Single-pass-in-memory (SPIM) where the collection will be chopped into parts, each part will have its dictionaries and postings lists in memory when it's being indexed. Upon finishing, they are serialized to disk. In the end, all the dictionaries will be loaded into memory and a global postings list for each term will be produced by joining all the intermediate posting lists of the same term.

Note: More information on Single-pass-in-memory indexing can be found here.

Searching

The searcher will load the global dictionary file into memory as a hashtable HTAB. The key of the dictionary is term. Each entries in the dictionary will store the pointer to the term's postings list in the postings file. Postings lists of a term will be loaded when it's searched.

A tree structure is constructed by extracting quals from baserestrictinfo in RelOptInfo. The tree is then evaluated by recursively performing searching term (leaf node) and boolean operations (internal nodes), i.e. intersect, union, negate on postings lists. The details for boolean operation algorithms (intersect via skip pointers) can be found here

Quals that can be pushed down:

1. id = <integer>
2. content @@ <term>
3. to_tsquery( <tsquery text> )
4. plainto_tsquery ( <free text> )

Note: If no quals can be pushed down, a sequential scan on all the documents in the collection will be conducted.