The Parameters of the Configuration file
eiglesias34 edited this page Jul 6, 2021 · 12 revisions
Section [default]:
- main_directory: The directory where the data sources and mapping are located.
Section [datasets]:
- number_of_datasets: How many datasets will be converted to a knowledge graph.
- output_folder: The location where the output file will be generated.
- all_in_one_file: By default, when multiple data sources are converted, each dataset gets its own output file. Setting this option to "yes" writes all datasets to a single output file instead.
- remove_duplicate: Remove duplicates from the knowledge graph.
- name: Name of the stats file. If the option all_in_one_file is "yes", it is also the name of the output file.
- enrichment: Selects the data structure the RDFizer uses when removing duplicates. When the option is "yes", the RDFizer stores the generated triples in a hash table, which gives the best possible performance. When the option is "no", the RDFizer stores the generated triples in an array-based data structure.
- dbType: Indicates what type of database will be used. This can only be "mysql" or "postgres".
- ordered: When the option is "yes", the SDM-RDFizer will organize the triples maps in such a way that the minimum amount of memory is consumed. When the option is "no", the SDM-RDFizer will not organize the triples maps. (Note: This parameter can only be used with the version of the SDM-RDFizer present in the beta branch.)
- large_file: This parameter can only be used for CSV files. When the option is "false", the SDM-RDFizer uses the pandas library to remove duplicate rows from the data source. pandas does not work properly with files larger than 1.2 GB, so for such files this parameter must be set to "true", which makes the SDM-RDFizer process the file with a different library. Unfortunately, that library does not remove duplicate rows from the data source. (Note: This parameter can only be used with the version of the SDM-RDFizer present in the beta branch.)
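The difference between the two enrichment settings can be illustrated with a minimal Python sketch. This is not the SDM-RDFizer's internal code; the function names and the sample triples are hypothetical and only show why a hash-based structure tends to be faster than an array-based one when checking for duplicates.

```python
def dedup_with_hash_table(triples):
    """Keep the first occurrence of each triple.

    Membership checks against a set take constant time on average,
    which corresponds to the hash table behaviour described for "yes".
    """
    seen = set()
    unique = []
    for triple in triples:
        if triple not in seen:
            seen.add(triple)
            unique.append(triple)
    return unique


def dedup_with_array(triples):
    """Keep the first occurrence of each triple.

    Membership checks scan the whole list, so the cost grows with the
    number of stored triples, as with an array-based structure ("no").
    """
    unique = []
    for triple in triples:
        if triple not in unique:
            unique.append(triple)
    return unique


# Hypothetical triples; the second one duplicates the first.
triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:knows", "ex:carol"),
]

hashed = dedup_with_hash_table(triples)
scanned = dedup_with_array(triples)
print(hashed == scanned, len(hashed))
```

Both functions return the same triples in the same order; only the cost of each duplicate check differs.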
Section [dataset1] or [dataset2] or ...:
- name: Name of the output file.
- mapping: Location of the mapping.
- user: User for the database. (This option is only necessary if a MySQL or Postgres database is being used)
- password: Password for the database. (This option is only necessary if a MySQL or Postgres database is being used)
- host: Host for the database. (This option is only necessary if a MySQL or Postgres database is being used)
- port: Port for the database. (This option is only necessary if a MySQL database is being used)
- db: Name of the database. (This option is only necessary if a MySQL or Postgres database is being used)
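The sections above can be combined into a single configuration file. The sketch below embeds a hypothetical example and reads it with Python's standard configparser module to show how the sections relate to each other. Every path, name, and credential is a placeholder, and the helper load_dataset_sections is an illustration, not part of the SDM-RDFizer.

```python
from configparser import ConfigParser

# Hypothetical configuration: all values below are placeholders.
EXAMPLE_CONFIG = """
[default]
main_directory: /path/to/project

[datasets]
number_of_datasets: 1
output_folder: /path/to/project/graph
all_in_one_file: no
remove_duplicate: yes
enrichment: yes
name: example_output
dbType: mysql
ordered: yes
large_file: false

[dataset1]
name: example_dataset
mapping: /path/to/project/mappings/example.ttl
user: example_user
password: example_password
host: localhost
port: 3306
db: example_db
"""


def load_dataset_sections(text):
    """Parse the configuration text and collect one dict per [datasetN] section.

    Note that ConfigParser lowercases option names, so "dbType" is
    stored as "dbtype" (lookups remain case-insensitive).
    """
    parser = ConfigParser()
    parser.read_string(text)
    count = parser.getint("datasets", "number_of_datasets")
    sections = [dict(parser[f"dataset{i}"]) for i in range(1, count + 1)]
    return parser, sections


parser, datasets = load_dataset_sections(EXAMPLE_CONFIG)
print(parser.get("default", "main_directory"))
print(parser.getboolean("datasets", "remove_duplicate"))
print(datasets[0]["mapping"])
```

The number_of_datasets value determines how many [datasetN] sections are read, so it should match the number of dataset sections present in the file.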