Segmentation of the dataset

The segmented data presented here applies the SegmOnto zone-naming ontology, extended with the Entry zone created for the catalogs. If you need segmented data that uses only the zones established by the SegmOnto initiative, you can use the ALTO data with the XSL Transformation Sheets available here.

Naming zones and lines with SegmOnto

The idea was to train a segmentation model that could automatically tag the different zones and lines of an image, so that the TEI transformation would be easier. For this reason, the dataset was prepared in eScriptorium using a system of tags for the lines and regions of an image. To do that, we rely on the work done by the SegmOnto initiative, a group aiming to create a TEI-based ontology for HTR.

While all the lines in this dataset use the default line type, several types of zones are represented. Here is a list of them, with the explanations given by SegmOnto:

  • main : the main area designed to contain text, either as a single column or as several columns
  • title : characterises a zone containing a title distinct from the main text
  • numbering : characterises a zone containing the page number
  • running title : characterises a zone containing a running title
  • figure : characterises a zone containing a figure
  • stamp : characterises a zone containing a stamp
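To check which zone types actually occur in a page, the zone labels can be counted directly from an ALTO export. The following is a minimal sketch, not the project's tooling: it assumes the export declares zone types as `OtherTag` elements and references them from `TextBlock` elements through `TAGREFS`, and that the labels are spelled exactly as in the list above. The sample document and the `count_zones` helper are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Zone labels from the list above (the spelling in real files is an assumption).
SEGMONTO_ZONES = {"main", "title", "numbering", "running title", "figure", "stamp"}

# Hypothetical, heavily reduced ALTO page used only for illustration.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Tags>
    <OtherTag ID="BT1" LABEL="main"/>
    <OtherTag ID="BT2" LABEL="numbering"/>
  </Tags>
  <Layout>
    <Page>
      <PrintSpace>
        <TextBlock ID="b1" TAGREFS="BT2"/>
        <TextBlock ID="b2" TAGREFS="BT1"/>
        <TextBlock ID="b3" TAGREFS="BT1"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip the XML namespace so the sketch works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def count_zones(alto_xml: str) -> Counter:
    """Count TextBlock zones per label, resolving TAGREFS to tag labels."""
    root = ET.fromstring(alto_xml)
    # Map each tag ID to its human-readable label.
    labels = {
        el.get("ID"): el.get("LABEL")
        for el in root.iter()
        if local_name(el.tag) == "OtherTag"
    }
    counts = Counter()
    for el in root.iter():
        if local_name(el.tag) != "TextBlock":
            continue
        # TAGREFS may hold several space-separated IDs; unknown IDs are skipped.
        for ref in (el.get("TAGREFS") or "").split():
            if ref in labels:
                counts[labels[ref]] += 1
    return counts


counts = count_zones(SAMPLE_ALTO)
unexpected = set(counts) - SEGMONTO_ZONES
```

Here `counts` reports one numbering zone and two main zones, and `unexpected` collects any label that is not in the list, which is a quick way to spot typos or dataset-specific zones.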

Application and choices

Here are typical examples of zone tagging.
The main area is in purple, title in pink, numbering in green, running title in yellow and figure in blue.

The first image is from an exhibition catalog and presents the most basic structure in the dataset: a page containing a numbering area at the top and a single large main zone. The second is a page from an annuaire and looks much the same, except that there are two columns of text, hence two main zones. This is the second most common type of page in the dataset. Lastly, the third image, a page from a manuscript fair catalog, shows the least frequent structure in the dataset, with figures nested inside the main zone, plus a running title and numbering.

Titles are also a less frequent type of element in the dataset. As can be seen in the two images below, it was decided that only the largest titles would be tagged as title. Smaller titles are therefore included in a main area.

The Entry Zone

The Entry zone has been added to the SegmOnto zones in the dataset in order to obtain a segmentation that better represents our corpus.

From left to right: an entry from an exhibition catalog, an entry from an annuaire and an entry from a manuscript fair catalog.

An example of an entry spanning several pages or columns.
An entry is contained in a main zone and can itself contain other zones such as figure, as shown in the image.

Purple: EntryEnd, Blue: Entry
When an entry straddles multiple columns or pages, as in the image above, the beginning of the entry is tagged as a regular Entry, and the EntryEnd zone is used for the remaining part.
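
For downstream processing, the Entry / EntryEnd convention suggests a simple way to rebuild complete entries: walk through the zones in reading order and attach each EntryEnd to the Entry that precedes it. The sketch below illustrates that idea under stated assumptions and is not the project's own pipeline: the `Zone` class and the `merge_entries` function are hypothetical, the zones are assumed to be already sorted in reading order across columns and pages, and the sample texts are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Zone:
    label: str  # "Entry" or "EntryEnd"; other labels are ignored here
    text: str   # transcription of the zone


def merge_entries(zones: List[Zone]) -> List[str]:
    """Join each EntryEnd zone to the entry that precedes it in reading order."""
    entries: List[str] = []
    for zone in zones:
        if zone.label == "Entry":
            # A new entry starts here.
            entries.append(zone.text)
        elif zone.label == "EntryEnd":
            if entries:
                # Continuation of the previous entry (next column or page).
                entries[-1] = f"{entries[-1]} {zone.text}"
            else:
                # No entry start was seen (for example the start is on a
                # page that is not in the input), so keep the fragment alone.
                entries.append(zone.text)
    return entries


# Invented sample: the second entry continues in the next column.
zones = [
    Zone("Entry", "1. Portrait of a man."),
    Zone("Entry", "2. Landscape,"),
    Zone("EntryEnd", "oil on canvas."),
]
merged = merge_entries(zones)
```

With this input, `merged` holds two entries, the second one being the concatenation of the Entry and its EntryEnd. Joining with a single space is a simplification; words hyphenated across the break would need extra handling.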