Module interface design

asaparov edited this page Mar 20, 2013 · 2 revisions

Module isolation

The modules that parse the unstructured product information HTML are isolated from the rest of the machine when they run locally. Our initial approach is to use SELinux together with other Linux utilities such as cpulimit to ensure that a module cannot access the network or the file system, and cannot use too much memory or CPU. We will also impose a time limit on each module and a limit on the size of the data it can return to the data analysis core. If this approach proves problematic, we will explore other options, such as restricting the language in which modules can be written, or virtualization.
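As a rough illustration of the time, memory, CPU, and output-size limits described above, the following Python sketch launches a local module as a child process with POSIX resource limits applied. It is only a sketch under stated assumptions: `run_module`, the default limit values, and the use of `setrlimit` are hypothetical choices made for the example, and it does not cover the network or file system isolation that SELinux is meant to provide.

```python
import resource
import subprocess
import sys


def _cap(kind, value):
    """Lower one resource limit for the current process (never raise it)."""
    _soft, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, value))


def run_module(argv, request, time_limit=5,
               memory_limit=512 * 1024 * 1024, max_output=64 * 1024):
    """Run a module with limits and return what it wrote to standard output.

    All names and default values here are assumptions for illustration.
    """
    def apply_limits():
        # Runs in the child just before exec (POSIX only).
        _cap(resource.RLIMIT_AS, memory_limit)   # address space, in bytes
        _cap(resource.RLIMIT_CPU, time_limit)    # CPU time, in seconds

    proc = subprocess.Popen(
        argv,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        preexec_fn=apply_limits,
    )
    try:
        # The timeout bounds wall-clock time, which RLIMIT_CPU does not.
        out, _ = proc.communicate(request, timeout=time_limit)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.communicate()  # reap the killed child
        raise
    if len(out) > max_output:
        raise ValueError("module returned too much data")
    return out
```

Note that `communicate` buffers the whole output before the size check runs, so a stricter version would read from the pipe incrementally and stop the module as soon as the cap is exceeded.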

General architecture

We separate the user search/interface functionality from our product information queries. Our database is populated by the product information workflow, in which the data analysis core periodically queries the modules for product lists. The modules can be either local or remote.

To communicate with a local module, the data analysis core sends the module a request indicating what kind of information is needed: detailed information about a specific product, or a listing of all products. The request is passed to the module via standard input, using either a delimiter or content-length encoding. The module is kept alive while the data analysis core executes the URL request for the unstructured data. This data is then passed back to the module, which parses it and returns structured data (key-value pairs) to the data analysis core. The core then stores the information in the database, after doing additional processing such as product name recognition.

Remote modules work similarly, except that they do not have to be isolated, since they are not running on our machine. Instead, we make the request via a URL. Note that only a single exchange is necessary with a remote module, as it can make the URL request for the unstructured product information itself. We therefore expect the remote module to return the structured data in its response to that one request.
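The content-length option for messages on standard input could look roughly like the sketch below. The design above does not fix a wire format, so everything here is an assumption made for illustration: the 4-byte big-endian length prefix, the JSON payload, and the names `write_frame`, `read_frame`, and `encode_request`.

```python
import io
import json
import struct


def write_frame(stream, payload):
    """Write one message: a 4-byte big-endian length, then the payload."""
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)
    stream.flush()


def _read_exact(stream, n):
    """Read exactly n bytes, or raise EOFError if the stream ends early."""
    chunks = []
    remaining = n
    while remaining:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError("stream closed in the middle of a message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_frame(stream, max_size=1 << 20):
    """Read one message, rejecting payloads larger than max_size bytes."""
    (length,) = struct.unpack(">I", _read_exact(stream, 4))
    if length > max_size:
        raise ValueError("message exceeds the size limit")
    return _read_exact(stream, length)


def encode_request(kind, product_id=None):
    """Build a request; kind is 'list' or 'detail' (hypothetical names)."""
    if kind not in ("list", "detail"):
        raise ValueError("unknown request kind")
    message = {"kind": kind}
    if kind == "detail":
        message["product_id"] = product_id
    return json.dumps(message).encode("utf-8")
```

In use, the core would call `write_frame` on the module's standard input with the output of `encode_request`, and later with the fetched HTML, then call `read_frame` on the module's standard output to get the key-value pairs. The `max_size` check is one place where the limit on returned data could be enforced. A length prefix avoids having to escape a delimiter that might appear inside the HTML.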

Image hosting

We handle images differently, as we cannot simply hotlink images from the websites that we parse. Instead, we select a number of 'authoritative image sources' from which we can download the images and store them. We download these images much less often than we download product listings and information, since product images themselves change very rarely. For our project, we will store the images locally (possibly on the H drive). To scale, however, we would need a third-party image hosting service to remain cost-effective.
