The segmented data presented here applies the SegmOnto zone-naming ontology, extended with the zone Entry, which was created for the catalogs. If you need segmented data that uses only the zones established by the SegmOnto initiative, you can use the ALTO data with the XSL Transformation Sheets available here.
The idea was to train a segmentation model that could automatically tag the different zones and lines of an image, so that the TEI transformation would be easier. For this reason, the dataset was prepared in eScriptorium, using its system for tagging the lines and regions of an image. To do that, we rely on the work done by the SegmOnto initiative, a group aiming to create a TEI-based ontology for HTR.
While the lines in this dataset are all of the default type, various types of zones are represented. Here is a list of them, with the explanations given by SegmOnto:
- main : the main area designed to contain text, in either a single column or several columns
- title : characterises a zone containing a title distinct from the main text
- numbering : characterises a zone containing the page number
- running title : characterises a zone containing a running title
- figure : characterises a zone containing a figure
- stamp : characterises a zone containing a stamp
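To check how often each zone type occurs in a page, the zone labels can be counted directly from an ALTO file. The sketch below is only an illustration: it assumes the ALTO export declares zone types as `OtherTag` elements (with `ID` and `LABEL` attributes) and that each `TextBlock` points to one of them through `TAGREFS`, which is how ALTO 4 expresses tags. The function name `count_zone_labels` and the inline sample are hypothetical and not part of the dataset.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag: str) -> str:
    """Strip the XML namespace, e.g. '{ns}TextBlock' -> 'TextBlock'."""
    return tag.rsplit("}", 1)[-1]


def count_zone_labels(alto_xml: str) -> Counter:
    """Count zones per label in an ALTO document given as a string.

    Assumption: zone types are declared as <OtherTag ID=... LABEL=...>
    and each <TextBlock> references one through its TAGREFS attribute.
    """
    root = ET.fromstring(alto_xml)

    # Map tag identifiers (e.g. "BT1") to their human-readable label.
    labels = {}
    for element in root.iter():
        if local_name(element.tag) == "OtherTag":
            labels[element.get("ID")] = element.get("LABEL")

    # Count every zone, resolving its tag reference to a label.
    counts = Counter()
    for element in root.iter():
        if local_name(element.tag) == "TextBlock":
            for ref in element.get("TAGREFS", "").split():
                counts[labels.get(ref, "unknown")] += 1
    return counts


# Made-up minimal document, for illustration only.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Tags>
    <OtherTag ID="BT1" LABEL="main"/>
    <OtherTag ID="BT2" LABEL="title"/>
  </Tags>
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1" TAGREFS="BT1"/>
    <TextBlock ID="b2" TAGREFS="BT2"/>
    <TextBlock ID="b3" TAGREFS="BT1"/>
  </PrintSpace></Page></Layout>
</alto>"""

if __name__ == "__main__":
    print(count_zone_labels(SAMPLE))
```

Matching on local names rather than a fixed namespace keeps the sketch usable across ALTO versions; the exact label strings in a real export may differ from the ones listed above.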
Here are typical examples of zone tagging.
The main area is in purple, the title in pink, the numbering in green, the running title in yellow and the figure in blue.
Titles are also a less recurrent type of element in the dataset. As can be seen in the two images below, it was decided that only the biggest titles would be tagged as title in the dataset. Smaller titles are therefore included in a main area.
The Entry zone has been added to the SegmOnto zones in the dataset in order to obtain a segmentation that is more representative of our corpus.
From left to right: an entry for Exhibition Catalogs, an entry for Annuaires and an entry for Fair's Manuscripts Catalogs.
An example of an entry spanning different pages or columns. An entry is contained in a main zone and can contain other zones such as figure, as shown in the image. Purple: EntryEnd, Blue: Entry
When an entry straddles multiple columns or pages, as in the image above, the beginning of the entry is tagged as a basic Entry, and the zone EntryEnd is used for the remaining part.
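For downstream processing, an EntryEnd zone usually has to be reattached to the Entry it continues. The following sketch shows one possible way to do this, assuming the zones are already available in reading order and that an EntryEnd always continues the most recent Entry. The `Zone` class, the `merge_entries` function and the sample texts are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Zone:
    label: str  # e.g. "Entry" or "EntryEnd"; other labels are ignored here
    text: str   # transcription of the zone


def merge_entries(zones: List[Zone]) -> List[str]:
    """Rebuild complete entries from zones given in reading order.

    Assumption: an EntryEnd zone continues the closest preceding Entry.
    """
    entries: List[str] = []
    for zone in zones:
        if zone.label == "Entry":
            # A new entry starts here.
            entries.append(zone.text)
        elif zone.label == "EntryEnd":
            if entries:
                # Append the continuation to the entry it belongs to.
                entries[-1] = entries[-1] + " " + zone.text
            else:
                # No preceding Entry (e.g. it sits on an earlier page that
                # was not loaded): keep the text instead of dropping it.
                entries.append(zone.text)
    return entries


if __name__ == "__main__":
    # Made-up sample: the first entry continues in the next column.
    page = [
        Zone("Entry", "first entry, beginning"),
        Zone("EntryEnd", "first entry, end"),
        Zone("Entry", "second entry"),
    ]
    for entry in merge_entries(page):
        print(entry)
```

When an entry continues on the next page, the zones of both pages would need to be concatenated in reading order before calling the function.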