-
Notifications
You must be signed in to change notification settings - Fork 1
Component: Bolts
This page contains the list of all bolts currently available in the project. The standard qtopology bolts can be found here.
The basic bolt is the template class from which all other bolts are extended.
It contains the get
and set
methods - used for getting and setting attributes of the message.
GET. The get
method enables retrieving attributes from the message object. It accepts two
parameters:
Parameter | Description |
---|---|
object | The JSON object from which we want to retrieve an attribute |
path | The path to the attribute of the path.to.attribute format, where the dots signify the level of the JSON object |
For example, we have an object
:
{
"title": "document name",
"language": "en",
"url": "http://www.example.com/document.pdf",
"metadata": {
"source": {
"domain": "source name"
}
}
}
To retrieve the "title"
of the object
we would invoke this.get(object, 'title')
and to
retrieve the "domain"
attribute we would use this.get(object, 'metadata.source.domain')
.
SET. The set
method enables setting attributes inside the message object. It accepts three
parameters:
Parameter | Description |
---|---|
object | The JSON object from which we want to retrieve an attribute |
path | The path to the attribute of the path.to.attribute format, where the dots signify the level of the JSON object |
value | The value we wish to set |
If we take the previous object
example, to set a new title we invoke this.set(object, 'title', 'new document name')
and to
create a new attribute inside the metadata
attribute, we can use this.set(object, 'metadata.new_attribute', 0)
.
In addition, if the object doesn't already have an existing attribute object, the set
method will create it. For example,
this.set(object, 'path.to.new.attribute', 'hey')
will change the object
to:
{
"title": "document name",
"language": "en",
"url": "http://www.example.com/document.pdf",
"metadata": {
"source": {
"domain": "source name"
}
},
"path": {
"to": {
"new": {
"attribute": "hey"
}
}
}
}
The doc type bolt is able to extract the type of the document based URL address of the document. It requires the following parameters:
Parameter | Description |
---|---|
document_location_path | The path to the document location |
document_type_path | The path to the type object containing the extension (ext ) and mimetype (mime ) of the document |
document_error_path | (optional) The path to store the error message (Default: error ) |
The schema for this bolt in the ontology is:
{
"name": "document-type-component-name",
"type": "inproc",
"working_dir": "./components/bolts",
"cmd": "doc_type_bolt.js",
"inputs": [{
"source": "source-spout-or-bolt-name",
}],
"init": {
"document_location_path": "location-path",
"document_type_path": "type-path",
"document_error_path": "error"
}
}
The ocr bolt uses tesseract.js to identify the text in the PDF spans. (NOTE: requires pdf-image conditions for converting the PDF into images, which are then processed by tesseract to extract the content). It requires the following parameters:
Parameter | Description |
---|---|
document_location_path | The path to the document location |
document_location_type | The type of the document location. Options: local - for local documents, remote - for documents provided by an URL. (default: remote ) |
document_language_path | The path to the document language |
document_ocr_path | The path where the OCR content is stored |
ocr_data_folder | (optional) The path to where the OCR datasets are cached (default: ../data/ocr-data ) |
ocr_verbose | (optional) The boolean telling if the OCR component should log out the progress (default: boolean ) |
document_error_path | (optional) The path to store the error message (default: error ) |
temporary_folder | (optional) The folder path containing the temporary files generated when extracting the content via OCR (default: ../tmp ) |
The schema for this bolt in the ontology is:
{
"name": "ocr-component-name",
"type": "inproc",
"working_dir": "./components/bolts",
"cmd": "ocr_bolt.js",
"inputs": [{
"source": "source-spout-or-bolt-name",
}],
"init": {
"document_location_path": "location-path",
"document_location_type": "remote",
"document_language_path": "language-path",
"document_ocr_path": "ocr-path",
"ocr_data_folder": "../data/ocr-data",
"ocr_verbose": false,
"document_error_path": "error",
"temporary_folder": "../tmp"
}
}
The pdf bolt extracts the requested PDF metadata and content. In addition, it is able to convert a microsoft office file into PDF before extracting the PDF metadata and content (NOTE: requires libreoffice to use the conversion feature). It requires the following parameters:
Parameter | Description |
---|---|
document_location_path | The path to the document location |
document_location_type | The type of the document location. Options: local - for local documents, remote - for documents provided by an URL. (default: remote ) |
document_pdf_path | The path where the pdf metadata and content is stored |
document_error_path | (optional) The path to store the error message (default: error ) |
pdf_extract_metadata | (optional) The PDF values to be extracted (default: ["pages", "info", "metadata", "text"] ) |
pdf_trim_text | (optional) Trims the extracted PDF text (default: false ) |
convert_to_pdf | (optional) The boolean value if the files should be converted to PDF; requires installed libreoffice (default: false ) |
The schema for this bolt in the ontology is:
{
"name": "pdf-component-name",
"type": "inproc",
"working_dir": "./components/bolts",
"cmd": "pdf_bolt.js",
"inputs": [{
"source": "source-spout-or-bolt-name",
}],
"init": {
"document_location_path": "location-path",
"document_location_type": "remote",
"document_pdf_path": "text-path",
"document_error_path": "error",
"pdf_extract_metadata": ["pages", "info", "metadata", "text"],
"pdf_trim_text": false,
"convert_to_pdf": false
}
}
The text bolt is able to extract the document content in text format. The content is extracted through the document URL address using the textract library and requires the following parameters:
Parameter | Description |
---|---|
document_location_path | The path to the document location |
document_location_type | The type of the document location. Options: local - for local documents, remote - for documents provided by an URL. (default: remote ) |
document_text_path | The path where the document content text is stored |
document_error_path | (optional) The path to store the error message (default: error ) |
textract_config | (optional) The textract configuration files (default: {} ) |
textract_config.preserve_line_breaks | (optional) Pass this in as true and textract will not strip any line breaks (default: false ) |
textract_config.preserve_only_multiple_line_breaks | (optional) Some extractors, like PDF, insert line breaks at the end of every line, even if the middle of a sentence. If this option is set to true , then any instances of a single line break are removed but multiple line breaks are preserved. Check your output with this option, though, this doesn't preserve paragraphs unless there are multiple breaks (default: false ) |
textract_config.include_alt_text | (optional) If true , when extracting HTML whether or not to include alt text with the extracted text. (default: false ) |
The schema for this bolt in the ontology is:
{
"name": "text-component-name",
"type": "inproc",
"working_dir": "./components/bolts",
"cmd": "text_bolt.js",
"inputs": [{
"source": "source-spout-or-bolt-name",
}],
"init": {
"textract_config": {
"preserve_line_breaks": false,
"preserve_only_multiple_line_breaks": false,
"include_alt_text": false
},
"document_location_path": "location-path",
"document_location_type": "remote",
"document_text_path": "text-path",
"document_error_path": "error"
}
}
The text ttp bolt sends a request to MLLP, the media transcription and translation platform, and gets the translations of the provided text. The service is payable but also provide an experimental account. To send the request, the following parameters need to be provided:
Parameter | Description |
---|---|
document_language_path | The path to the document language |
document_title_path | The path to the document title |
document_text_path | The path to the document text |
document_transcriptions_path | The path to where the transcriptions are saved |
ttp_id_path | The path to where the MLLP (TTP) id used for retrieving the metadata from MLLP is saved |
temporary_folder | The folder path containing the temporary files generated when requesting the translations |
document_error_path | (optional) The path to store the error message (default: error ) |
ttp | The TTP configuration files |
ttp.user | The TTP user name |
ttp.token | The TTP token used for sending requests |
ttp.url | (optional) The base URL for sending requests to get text translation (default: https://ttp.mllp.upv.es/api/v3/text ) |
ttp.languages | (optional) The object containing the ISO 639-1 code languages in which the text would be translated (default: { es: {}, en: {}, sl: {}, de: {}, fr: {}, it: {}, pt: {}, ca: {} } ) |
ttp.formats | (optional) The mapping of the formats in which the translations would be provided. The format options are available on the TTP documentation page (default: { 3: "plain" } ) |
ttp.timeout_millis | (optional) The duration between to requests on the translation status (default: 120000 ) |
The schema for this bolt in the ontology is:
{
"name": "text-ttp-component-name",
"type": "inproc",
"working_dir": "./components/bolts",
"cmd": "text_ttp_bolt.js",
"inputs": [{
"source": "source-spout-or-bolt-name",
}],
"init": {
"ttp": {
"user": "ttp-username",
"token": "ttp-token",
"url": "https://ttp.mllp.upv.es/api/v3/text",
"languages": {
"es": {},
"en": {},
"sl": {},
"de": {},
"fr": {},
"it": {},
"pt": {},
"ca": {}
},
"formats": {
"3": "plain"
},
"timeout_millis": 120000,
},
"document_language_path": "language-path",
"document_title_path": "title-path",
"document_text_path": "text-path",
"document_transcriptions_path": "transcriptions-path",
"document_error_path": "error",
"ttp_id_path": "ttp-id-path",
"temporary_folder": "./tmp"
}
}
The video ttp bolt sends a request to MLLP, the media transcription and translation platform, and gets the transcriptions and translations of the provided video. The service is payable but also provide an experimental account To send the request, the following parameters need to be provided:
Parameter | Description |
---|---|
document_language_path | The path to the document language |
document_location_path | The path to the document location |
document_authors_path | (optional) The path to the document authors (default: null ) |
document_title_path | The path to the document title |
document_text_path | The path to the document text |
document_transcriptions_path | The path to where the transcriptions are saved |
ttp_id_path | The path to where the MLLP (TTP) id used for retrieving the metadata from MLLP is saved |
temporary_folder | The folder path containing the temporary files generated when requesting the translations |
document_error_path | (optional) The path to store the error message (default: error ) |
ttp | The TTP configuration files |
ttp.user | The TTP user name |
ttp.token | The TTP token used for sending requests |
ttp.url | (optional) The base URL for sending requests to get video transcription and translation (default: https://ttp.mllp.upv.es/api/v3/speech ) |
ttp.languages | (optional) The object containing the ISO 639-1 code languages in which the document would be translated (default: { es: { sub: {} }, en: { sub: {} }, sl: { sub: {} }, de: { sub: {} }, fr: { sub: {} }, it: { sub: {} }, pt: { sub: {} }, ca: { sub: {} } } ) |
ttp.formats | (optional) The mapping of the formats in which the transcriptions and translations would be provided. The format options are available on the TTP documentation page (default: { 0: "dfxp", 3: "webvtt", 4: "plain" } ) |
ttp.timeout_millis | (optional) The duration between to requests on the translation status (default: 120000 ) |
The schema for this bolt in the ontology is:
{
"name": "video-ttp-component-name",
"type": "inproc",
"working_dir": "./components/bolts",
"cmd": "video_ttp_bolt.js",
"inputs": [{
"source": "source-spout-or-bolt-name",
}],
"init": {
"ttp": {
"user": "ttp-username",
"token": "ttp-token",
"url": "https://ttp.mllp.upv.es/api/v3/speech",
"languages": {
"es": { "sub": {} },
"en": { "sub": {} },
"sl": { "sub": {} },
"de": { "sub": {} },
"fr": { "sub": {} },
"it": { "sub": {} },
"pt": { "sub": {} },
"ca": { "sub": {} }
},
"formats": {
"0": "dfxp",
"3": "webvtt",
"4": "plain"
},
"timeout_millis": 120000,
},
"document_language_path": "language-path",
"document_location_path": "location-path",
"document_authors_path": null,
"document_title_path": "title-path",
"document_text_path": "text-path",
"document_transcriptions_path": "transcriptions-path",
"document_error_path": "error",
"ttp_id_path": "ttp-id-path"
}
}
The wikipedia bolt leverages the Wikifier service for annotating the document text with Wikipedia concepts. It requires the following parameters:
Parameter | Description |
---|---|
document_text_path | The path to the document content text |
wikipedia_concept_path | The path to the document Wikipedia concept |
document_error_path | (optional) The path to store the error message (default: error ) |
wikifier | The wikifier configuration object |
wikifier.user_key | The wikifier user key (can be acquired here) |
wikifier.wikifier_url | (optional) The wikifier URL endpoint (default: 'http://www.wikifier.org' ) |
wikifier.max_length | (optional) For longer text, the bolt will slice the text into chunks and aggregate the wikifier output into a single object. This parameter will setup the max_length of the text chunks. Note: it cannot be greater than 20000 , due to Wikifier restrictions (default: 10000 ) |
The schema for this bolt in the ontology is:
{
"name": "wikipedia-component-name",
"type": "inproc",
"working_dir": "./components/bolts",
"cmd": "wikipedia_bolt.js",
"inputs": [{
"source": "source-spout-or-bolt-name",
}],
"init": {
"wikifier": {
"user_key": "wikifier-user-key",
"wikifier_url": "http://wikifier.org",
"max_length": 10000
},
"document_text_path": "text-path",
"wikipedia_concept_path": "wikipedia-concept-path",
"document_error_path": "error"
}
}
The kafka bolt sends the message to the given Kafka system, to a specific topic. It requires the following parameters:
Parameter | Description |
---|---|
format_message | (optional) The function used to format the message (default: null ) |
kafka | The kafka configurations used in the component |
kafka.host | The kafka host |
kafka.topic | The kafka topic to which the message is sent |
The schema for this bolt in the ontology is:
{
"name": "kafka-component-name",
"type": "inproc",
"working_dir": "./components/bolts",
"cmd": "kafka_bolt.js",
"inputs": [{
"source": "source-spout-or-bolt-name",
}],
"init": {
"kafka": {
"host": "kafka-host",
"topic": "kafka-topic"
},
"format_message": null
}
}
The log postgresql bolt updated a table in PostgreSQL database with the given messages and literal attributes. It requires the following parameters:
Parameter | Description |
---|---|
pg | The postgresql configurations used to connect to the database |
pg.host | The postgresql host |
pg.port | The postgresql port number |
pg.user | The user connecting to the postgresql database |
pg.password | The password used to connect to the postgresql database |
pg.database | The postgresql database name |
pg.max | (optional) The maximum number of connections with postgresql |
pg.idleTimeoutMillis | (optional) The timeout milliseconds must pass before the connection becomes idle |
pg.schema | (optional) The schema name of the database tables to access the database |
pg.version | (optional) The version of the database to which we want to connect |
postgres_table | The name of the postgresql table for updating the records |
postgres_method | (optional) The method used for updating the records: update - update the record, upsert - update or insert the record (default: update ) |
postgres_primary_id | The column in the postgresql table containing the primary IDs of the records |
message_primary_id | The attribute of the message containing the primary ID |
postgres_message_attrs | The dictionary telling which postgresql table column should be updated with which message attribute. Used for updating the table with message attributes. |
postgres_literal_attrs | The dictionary telling which postgresql table column should be updated with which literal/static value. Used for updating the table with literal values. |
postgres_time_attrs | The dictionary telling which postgresql table column should be updated with the current time. |
final_bolt | (optional) Tells if the component is the final bolt or not (default: false ) |
document_error_path | (optional) The path to store the error message (default: error ) |
The schema for this bolt in the ontology is:
{
"name": "log-postgresql-component-name",
"type": "inproc",
"working_dir": "./components/bolts",
"cmd": "log_postgresql_bolt.js",
"inputs": [{
"source": "source-spout-or-bolt-name",
}],
"init": {
"pg": {
"host": "pg-host",
"port": "pg-port",
"user": "pg-user",
"password": "pg-password",
"database": "pg-database",
"max": 10,
"idleTimeoutMillis": 10000,
"schema": null,
"version": null,
},
"postgres_table": "pg-table-name",
"postgres_method": "update",
"postgres_primary_id": "pg-primary-id",
"message_primary_id": "message-primary-id",
"postgres_message_attrs": {
"column_name1": "message_attr1",
"column_name2": "message_attr2",
},
"postgres_literal_attrs": {
"column_name3": "give a specific message",
"column_name4": 10,
},
"postgres_time_attrs": {
"column_name5": true,
},
"postgres_time_attrs": {
"column_name5": true,
},
"final_bolt": false,
"document_error_path": "error"
}
}
The validate bolt validates the if the message object has the structure and values specified in the given json schema. The bolt adopts the json schema format and accepts the following parameters:
Parameter | Description |
---|---|
json_schema | The JSON schema used to validate the message structure. Must be of the json schema format |
document_error_path | (optional) The path to store the error message (default: error ) |
The schema for this bolt in the ontology is:
{
"name": "validate-component-name",
"type": "inproc",
"working_dir": "./components/bolts",
"cmd": "validate_bolt.js",
"inputs": [{
"source": "source-spout-or-bolt-name",
}],
"init": {
"json_schema": { },
"document_error_path": "error"
}
}
This project has received funding from the European Union’s Horizon new policy 2020 research and innovation programme under grant agreement No 761758 (X5GON).