forked from Findwise/Hydra
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME
42 lines (32 loc) · 2.31 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
Hydra Processing Framework
Getting started
1. Install MongoDB, and start it
2. Check out and build hydra-core
3. Start the built jar with
java -jar hydra-core-jar-with-dependencies.jar
or just run the Main class from your IDE
You now have a running pipeline system. Try adding some documents using the API classes or the input-stages and play around with it.
For a basic pipeline you will need the following jars:
hydra-core-jar-with-dependencies - the framework itself
basic-stages-jar-with-dependencies - a library of reusable stages
solr-out-jar-with-dependencies - a library containing the SolrOutputStage which sends documents to Solr
Configuring your pipeline
Let's say we want to change the field source in all documents that have been touched by the stage extractPDFMetadata to pdf. It should modify documents in the database intranet. Since our stage makes static changes to fields, we can use the stage SetStaticFieldStage (included in basic-stages) instead of building our own. Since all configuration is done using Json, you'll need a Json representation of your configuration for the stage. It could look like this:
{
stageClass: "com.findwise.hydra.stage.SetStaticFieldStage",
queryOptions: ["touched(extractTitles,true)"],
fieldNames: ["source"],
fieldValues: ["web"]
}
The easiest way to do this without coding is to use the CmdlineInserter-class found in the examples library:
1. Add the library with the stage you want to add, in this case the basic-stages jar:
Program arguments: -a -p pipeline -l -i myLibaryId basic-stages-jar-with-dependencies.jar
You should see the following output:
> Added stage library with id: myLibaryId
2. Now, add your stage, by referencing the id to your library of stages, again using CmdlineInserter:
Program arguments: -a -p pipeline -s -i myLibraryId -n myStage myStage.properties
In this example, myStage.properties contains the configuration of the stage.
It is also possible to combine these two operations into one:
Program arguments: -a -p pipeline -l -i myLibraryId basic-stages-jar-with-dependencies.jar -s -n myStage myStage.properties
For more information on the CmdlineInserter class, try invoking it and it should be fairly self-explanatory.
For further information, refer to the Wiki on http://github.com/Findwise/Hydra/wiki