Scior-Dataset

Dataset with results of Scior tests using the Scior-Tester automation tool performed on the OntoUML/UFO Catalog.

Description

The Scior-Dataset is composed of files with results of Scior tests performed via the Scior-Tester on the OntoUML/UFO Catalog.

The FAIR Model Catalog for Ontology-Driven Conceptual Modeling Research, short-named OntoUML/UFO Catalog, is a structured and open-source catalog that contains OntoUML and UFO ontology models. The catalog was conceived to allow collaborative work and to be easily accessible to all its users. Its goal is to support empirical research in OntoUML and UFO, as well as in the general conceptual modeling area, by providing high-quality curated, structured, and machine-processable data on why, where, and how different modeling approaches are used. The catalog offers a diverse collection of conceptual models, created by modelers with varying modeling skills, for a range of domains, and for different purposes.

The tests were performed using the automation tool Scior-Tester, which runs over Scior. Scior is the abbreviated name for Identification of Ontological Categories for OWL Ontologies, software that aims to support the semi-automatic semantic improvement of lightweight web ontologies. This semantic improvement is pursued by associating concepts from gUFO, a lightweight implementation of the Unified Foundational Ontology (UFO), with the OWL entities. The aim of gUFO is "to provide a lightweight implementation of the Unified Foundational Ontology (UFO) suitable for Semantic Web OWL 2 DL applications".

This document presents the structure of the files generated during the Scior-Tester execution. For a complete understanding of the tests (regarding scope, objectives, implementation, etc.), please refer to the Scior-Tester description file.

The resulting datasets are published to share with the community data that can be analyzed in different ways, even though all executed tests are fully reproducible.

Nomenclature of Files and Folders

To avoid long names for files and directories, all content available in the datasets in this repository follows the nomenclature presented here:

  1. Numbers with up to three digits are always presented with three digits (e.g., 001). Numbers with more than three digits are presented without additional padding
  2. All numbers must be attached directly to their corresponding item (e.g., test, execution, etc.)
  3. The following words must be changed for the corresponding simplifications:
    • test: tt
    • taxonomy: tx
    • execution: ex
    • percentage: pc
  4. The Scior parameters must be represented using the following simplifications:
    • automatic: a
    • interactive: i
    • complete: c
    • incomplete: n
  5. The automation parameter (a or i) must come first, and the completion parameter must follow it (c or n)
  6. The parameters must be displayed integrated (e.g., ac, in, etc.)
  7. File names must not contain spaces, which must be replaced by hyphens
  8. Different items in the file name must be separated by underscores
  9. The following item order must be used whenever possible: file name, dataset name, test name/number, test parameters, taxonomy number, execution number, percentage number
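
The rules above can be sketched as a small helper. This is only an illustration of the nomenclature, not the Scior-Tester's own code: the function names `pad` and `build_name`, their parameters, and the `results` and `some dataset` values in the usage note are hypothetical.

```python
# Abbreviations taken from rule 3 of the nomenclature.
ABBREVIATIONS = {"test": "tt", "taxonomy": "tx", "execution": "ex", "percentage": "pc"}


def pad(number: int) -> str:
    """Rule 1: zero-pad to three digits; longer numbers are left unchanged."""
    return f"{number:03d}"


def build_name(file_name, dataset, *, test=None, automatic=None, complete=None,
               taxonomy=None, execution=None, percentage=None, extension=""):
    """Hypothetical helper that assembles a name following rules 1 to 9."""
    # Rule 9: file name, dataset name, test, parameters, taxonomy, execution, percentage.
    parts = [part for part in (file_name, dataset) if part]
    if test is not None:
        parts.append(ABBREVIATIONS["test"] + pad(test))  # rule 2: number attached to item
    if automatic is not None and complete is not None:
        # Rules 4 to 6: automation letter first, completion letter second, no separator.
        parts.append(("a" if automatic else "i") + ("c" if complete else "n"))
    if taxonomy is not None:
        parts.append(ABBREVIATIONS["taxonomy"] + pad(taxonomy))
    if execution is not None:
        parts.append(ABBREVIATIONS["execution"] + pad(execution))
    if percentage is not None:
        parts.append(ABBREVIATIONS["percentage"] + pad(percentage))
    # Rule 7: spaces become hyphens. Rule 8: items are joined with underscores.
    name = "_".join(part.replace(" ", "-") for part in parts)
    return f"{name}.{extension}" if extension else name
```

For example, `build_name("data", "aguiar2018rdbs-o", taxonomy=1, extension="csv")` yields `data_aguiar2018rdbs-o_tx001.csv`, and the made-up call `build_name("results", "some dataset", test=1, automatic=True, complete=False, taxonomy=2, execution=3, percentage=10)` yields `results_some-dataset_tt001_an_tx002_ex003_pc010`.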

Build Generated Files

The Scior-Tester creates a directory for each of the catalog's datasets that is tested. Each directory contains folders with the results of the performed tests, as well as two files generated by the Scior-Tester to be used as input for the tests. To generate these files, the Tester decomposes the original taxonomy of a dataset into its (possibly multiple) independent taxonomies, i.e., isolated groups of classes related to each other via specialization/generalization relations. Both files are presented in this document, as well as a hashes register file.
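
One way to picture the decomposition into independent taxonomies is as a search for connected components over the specialization relations. The sketch below is an assumption about how such a split could be done, not the Scior-Tester implementation; the function name `split_taxonomies` and the `(subclass, superclass)` pair representation are hypothetical.

```python
from collections import defaultdict


def split_taxonomies(subclass_pairs):
    """Group classes into isolated taxonomies (connected components).

    subclass_pairs is an iterable of (subclass, superclass) name pairs.
    Direction is ignored here, since two classes belong to the same
    taxonomy whenever any chain of specializations links them.
    """
    neighbours = defaultdict(set)
    for sub, sup in subclass_pairs:
        neighbours[sub].add(sup)
        neighbours[sup].add(sub)

    seen = set()
    groups = []
    for start in sorted(neighbours):  # sorted only to make the output order stable
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(group)
    return groups
```

With the made-up pairs `[("Dog", "Animal"), ("Cat", "Animal"), ("Car", "Vehicle")]`, the function returns two groups, which would correspond to two generated taxonomies.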

Taxonomical Graph ttl File

Each XXX_txYYY.ttl file (with XXX being the dataset name and YYY ranging from 001 to the number of independent taxonomies available in the dataset's OntoUML model) contains an isolated taxonomical graph in OWL (in Turtle syntax) extracted from the OWL taxonomy provided in the catalog's dataset to be tested. An example of a generated taxonomy file is: aguiar2018rdbs-o_tx001.ttl.

For instance, a single model that has two disconnected hierarchical structures of concepts will generate two files, each one containing only the following terms: rdfs:subClassOf, owl:Class, and rdf:type.

To generate the concepts' URIs, the Scior-Tester uses the following namespace for all taxonomies generated for all datasets: http://taxonomy.model/
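
To give an idea of what such a file may look like, the sketch below renders a taxonomy as Turtle text using only rdfs:subClassOf, owl:Class, and rdf:type (written as `a`) under the namespace above. The function `to_turtle` is hypothetical, the exact serialization produced by the Scior-Tester may differ, and class names are assumed to be safe to use in an IRI without escaping.

```python
NAMESPACE = "http://taxonomy.model/"


def to_turtle(subclass_pairs):
    """Render (subclass, superclass) name pairs as a minimal Turtle document.

    Assumption: class names contain no characters that need IRI escaping.
    """
    pairs = sorted(set(subclass_pairs))
    classes = sorted({name for pair in pairs for name in pair})
    lines = [
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
    ]
    # In Turtle, "a" is shorthand for rdf:type.
    for name in classes:
        lines.append(f"<{NAMESPACE}{name}> a owl:Class .")
    for sub, sup in pairs:
        lines.append(f"<{NAMESPACE}{sub}> rdfs:subClassOf <{NAMESPACE}{sup}> .")
    return "\n".join(lines) + "\n"
```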

Taxonomical Graph Information csv File

Each data_XXX_txYYY.csv file (with XXX being the dataset name and with YYY ranging from 001 to the number of independent taxonomies available in the dataset's OntoUML model) contains information about all the classes that are part of the taxonomical graph with the corresponding number (e.g., the file data_aguiar2018rdbs-o_tx001.csv refers to the taxonomy saved in the file aguiar2018rdbs-o_tx001.ttl). Because this file contains the source data, it should be used as the reference when computing the difference between the results of a test and the input data.

The generated csv file contains the following columns:

  • class_name: name of the OntoUML class as it is in the original model (i.e., without namespace)
  • ontouml_stereotype: the class's OntoUML stereotype as was attributed by its modeler
  • gufo_classification: the class's OntoUML stereotype mapped to a gUFO endurant type (click here for more information)
  • is_root: Boolean value that shows if the class is a root node in the taxonomical graph (i.e., if it has no superclasses)
  • is_leaf: Boolean value that shows if the class is a leaf node in the taxonomical graph (i.e., if it has no subclasses)
  • is_intermediate: Boolean value that shows if the class is an intermediate node in the taxonomical graph (i.e., if it has subclasses and superclasses)
  • number_superclasses: the total number of direct and indirect superclasses that the class has
  • number_subclasses: the total number of direct and indirect subclasses that the class has
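
The structural columns can be derived from the specialization relations alone. The following is a minimal sketch under that assumption; `class_stats` and the pair representation are hypothetical and do not come from the Scior-Tester.

```python
from collections import defaultdict


def class_stats(subclass_pairs):
    """Return {class_name: row}, where each row mirrors the structural csv columns."""
    parents, children = defaultdict(set), defaultdict(set)
    classes = set()
    for sub, sup in subclass_pairs:
        parents[sub].add(sup)
        children[sup].add(sub)
        classes.update((sub, sup))

    def reachable(start, edges):
        # Follow the edges transitively to collect direct and indirect relatives.
        seen, stack = set(), list(edges.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(edges.get(node, ()))
        seen.discard(start)  # a class is never counted as its own relative
        return seen

    rows = {}
    for name in classes:
        has_super = bool(parents.get(name))
        has_sub = bool(children.get(name))
        rows[name] = {
            "is_root": not has_super,
            "is_leaf": not has_sub,
            "is_intermediate": has_super and has_sub,
            "number_superclasses": len(reachable(name, parents)),
            "number_subclasses": len(reachable(name, children)),
        }
    return rows
```

For the made-up chain Dog → Mammal → Animal with Cat → Mammal, Animal is a root with three subclasses, Mammal is intermediate, and Dog is a leaf with two superclasses.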

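As every class must be a root, a leaf, or an intermediate node, and no class can be all three at once, note that this file would be inconsistent if, for any class:

  • (is_root OR is_leaf OR is_intermediate) != True, i.e., none of the three flags is set, or if
  • (is_root AND is_leaf AND is_intermediate) != False, i.e., all three flags are set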
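
The two conditions can be checked mechanically. The sketch below assumes a comma-separated file with a header row and Boolean values written as the text True or False; both the function name `inconsistent_rows` and that encoding are assumptions, so adjust the parsing to the actual files.

```python
import csv
import io

FLAGS = ("is_root", "is_leaf", "is_intermediate")


def parse_bool(text):
    # Assumption: Booleans are stored as "True"/"False" (case-insensitive).
    return text.strip().lower() == "true"


def inconsistent_rows(csv_text):
    """Return the class names whose three flags are all False or all True."""
    bad = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        flags = [parse_bool(row[name]) for name in FLAGS]
        if not any(flags) or all(flags):
            bad.append(row["class_name"])
    return bad
```

To check a file on disk, read its text first, for instance with `pathlib.Path(path).read_text(encoding="utf-8")`, and pass the result to `inconsistent_rows`.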

Taxonomies Resume csv File

This file, named taxonomies.csv, contains information about all taxonomies created in all datasets during the build function. Its aim is to present this information in a simple way, so that users can analyze it when creating tests or manipulating test results.

The generated csv file contains the following columns:

  • taxonomy_name: a string with the name of the taxonomy file (e.g., abrahao2018agriculture-operations_tx001.ttl)
  • dataset_name: a string with the name of the dataset that contains this taxonomy (e.g., abrahao2018agriculture-operations)
  • num_mapped_classes: an integer representing the number of classes in the taxonomy whose classification is different from the string "other"
  • num_other_classes: an integer representing the number of classes in the taxonomy that are classified with the string "other"
  • num_classes: an integer representing the total number of classes in the taxonomy

Note that the sum of num_mapped_classes and num_other_classes must equal num_classes. The classifications in these fields result from the mapping process (described here).

A single taxonomies.csv file, located in the /catalog folder, is created after the build function is completed.
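
The sum constraint is easy to verify automatically. A minimal sketch, assuming a comma-separated file with a header row that uses the column names listed above; the function name `mismatched_taxonomies` is hypothetical.

```python
import csv
import io


def mismatched_taxonomies(csv_text):
    """Return the taxonomy names where mapped + other does not equal the total."""
    bad = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        total = int(row["num_mapped_classes"]) + int(row["num_other_classes"])
        if total != int(row["num_classes"]):
            bad.append(row["taxonomy_name"])
    return bad
```

An empty result means every row of the provided text satisfies the constraint.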

Hashes Register CSV File

For traceability, the Scior-Tester provides a function for generating SHA256 hashes of its generated files and of the files from which they originate. The whole dataset contains a single csv register file named hash_sha256_register.csv, with four columns, to which new rows are appended every time the Tester creates new files. The columns are:

  • file_name: complete path of the file being hashed
  • file_hash: SHA256 hash of the file
  • source_file_name: file used as a source for the generation of the file being hashed
  • source_file_hash: SHA256 hash of the source file

As an example of use, a user who wants to confirm that they are using the same source data to generate their results can compute the SHA256 hash of the files they are using and check whether it exists in the hashes register file.
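
This check can be sketched with Python's standard library. The functions `sha256_of` and `is_registered` are hypothetical helpers, and the register is assumed to be comma-separated with a header row matching the column names above.

```python
import csv
import hashlib


def sha256_of(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_registered(path, register_path="hash_sha256_register.csv"):
    """Return True if the file's hash appears in either hash column of the register."""
    file_hash = sha256_of(path)
    with open(register_path, newline="", encoding="utf-8") as handle:
        return any(
            file_hash in (row["file_hash"], row["source_file_hash"])
            for row in csv.DictReader(handle)
        )
```

Matching on the hash rather than on file_name keeps the check independent of where the user stored the files locally.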

Tests – Generated Files and their Descriptions

Currently, datasets generated from the execution of two tests are available. Please use the following links for accessing the test descriptions and results.

Related Repositories

  • Scior: software for identification of ontological categories for OWL ontologies.
  • Scior-Tester: used for automating tests on Scior.
  • Scior-Dataset: contains data resulting from the Scior-Tester.
  • OntoUML/UFO Catalog: source of models used for the performed tests.

Contributors

Acknowledgements

This work is a collaboration between the Free University of Bozen-Bolzano, the University of Twente, and Accenture Israel Cybersecurity Labs.