
The merge process

sholzer edited this page May 20, 2016 · 6 revisions

This section covers the merge algorithm and the way it uses the merge schemas.

Choosing the merge schema

The merge algorithm chooses the merge schemas on its own, depending on the namespaces it encounters during the merge. As shown on the Home page, the API expects the path to a folder containing the merge schemas. When the algorithm encounters a namespace, it chooses the first merge schema for that namespace from the folder.

Warning
It is possible, although not encouraged, to have multiple merge schemas for one namespace URI in the folder. In that case it is not specified which merge schema will be chosen.
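The selection rule can be illustrated with a minimal Python sketch. The schema list, the file names and the function `choose_schema` are hypothetical and only model the rule "first schema found for a namespace"; they are not part of the actual API.

```python
# Hypothetical stand-in for the schema folder: (file name, namespace URI) pairs
# in the order in which the folder happens to be listed.
SCHEMAS = [
    ("beans-schema.xml", "http://www.springframework.org/schema/beans"),
    ("xhtml-schema.xml", "http://www.w3.org/1999/xhtml"),
    ("xhtml-schema-alt.xml", "http://www.w3.org/1999/xhtml"),
]


def choose_schema(namespace_uri, schemas):
    """Return the first schema file registered for namespace_uri, or None."""
    for file_name, uri in schemas:
        if uri == namespace_uri:
            return file_name
    return None


chosen = choose_schema("http://www.w3.org/1999/xhtml", SCHEMAS)
```

In this sketch the result depends entirely on the order of the list, which mirrors why keeping several schemas for one namespace in the folder is discouraged.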

Conflict handling

The algorithm supports four conflict handling modes for the merge process.

Text handling         Prefer values from Base    Prefer values from Patch
No text attachment    BASEOVERWRITE              PATCHOVERWRITE
Text attachment       BASEATTACHOROVERWRITE      PATCHATTACHOROVERWRITE

The conflict handling affects how the textual content and the attributes of elements are merged. The elements themselves are always merged or attached. The BASEOVERWRITE mode uses the value present in the base document. PATCHOVERWRITE works analogously but prefers the patch document. The ATTACH modes use the attachable and separationString attributes of the <attribute> elements and the attachable-text attribute of the <handling> elements. If an ATTACH mode is used, the patch value is attached to the base value in all attributes or textual contents marked as attachable. The value of separationString is used to separate the base and patch attribute values.

Merging two elements

The merge algorithm works recursively on the input documents. Therefore we will only cover the merge of two elements in detail here.

Preparations

Document wide preparations

After starting the merge process with a base document and a patch document, we first determine the namespaces in use. Since each document can define its own namespace prefixes, a unique mapping between prefixes and namespaces is created. The prefix used in the base is preferred, and the mapping is then applied to both input documents. If the namespaces of the base and the patch document do not match, the merge is considered impossible and an exception is thrown.
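A possible way to build such a mapping is sketched below in Python. The function `unify_prefixes`, the generated `ns0`, `ns1`, ... fallback prefixes and the example URIs are assumptions made for illustration; the source only states that the base prefix is preferred. The namespace mismatch check is left out.

```python
def unify_prefixes(base_prefixes, patch_prefixes):
    """Map every namespace URI to exactly one prefix, preferring the base's prefix.

    Both arguments are dicts of the form {prefix: namespace_uri}.
    """
    uri_to_prefix = {}
    used = set()
    # Prefixes declared in the base win.
    for prefix, uri in base_prefixes.items():
        if uri not in uri_to_prefix:
            uri_to_prefix[uri] = prefix
            used.add(prefix)
    # Namespaces only known to the patch keep their prefix unless it is taken.
    counter = 0
    for prefix, uri in patch_prefixes.items():
        if uri in uri_to_prefix:
            continue
        candidate = prefix
        while candidate in used:
            candidate = f"ns{counter}"  # assumed fallback naming
            counter += 1
        uri_to_prefix[uri] = candidate
        used.add(candidate)
    return uri_to_prefix


mapping = unify_prefixes(
    {"b": "urn:example:beans", "c": "urn:example:context"},
    {"beans": "urn:example:beans", "b": "urn:example:util"},
)
```

Here the patch prefix `beans` is dropped in favour of the base prefix `b`, and the clashing patch prefix for `urn:example:util` is replaced by a generated one.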

Element specific preparations

Before the actual merge of two elements, the list of visible <handling> elements is updated (see The Merge Schema). This includes replacing overwritten <handling> elements as well as adding the <handling> elements referenced by the namespace-ref and label-ref attributes of the <handling> element for the input elements. The <handling> elements for the recursive calls of the merge algorithm are chosen from this list.

Merging the attributes

The algorithm starts by merging the attributes of the element from the base document and the element from the patch document (for simplicity we’ll refer to these two elements as the base element and the patch element). The value of a given attribute in the merged document (the merge result) depends on the conflict handling in use and on whether the attribute exists in the base and the patch element. The following table shows the possible outcomes when merging attributes. In the table, "Base and Patch exist" means that the attribute is set in both the base and the patch element. B refers to the value of the attribute in the base element, P to its value in the patch element, and S is the value of the <attribute> element’s separation-string attribute.

Table 1. Merge results for attributes

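                                           ConflictHandlingType
<attribute>        Existence               BASEOVERWRITE   BASEATTACHOROVERWRITE   PATCHOVERWRITE   PATCHATTACHOROVERWRITE
attachable=true    Base and Patch exist    B               B+S+P                   P                B+S+P
attachable=true    only Base exists        B               B                       B                B
attachable=true    only Patch exists       P               P                       P                P
attachable=false   Base and Patch exist    B               B                       P                P
attachable=false   only Base exists        B               B                       B                B
attachable=false   only Patch exists       P               P                       P                P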
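Read as a function, the attribute rules can be sketched in Python as follows. The function `merge_attribute` and its parameters are hypothetical. The sketch assumes that the attaching modes fall back to the corresponding overwrite behavior when the attribute is not attachable, and that `None` stands for an attribute that is not set.

```python
ATTACH_MODES = ("BASEATTACHOROVERWRITE", "PATCHATTACHOROVERWRITE")


def merge_attribute(base, patch, mode, attachable=False, separator=""):
    """Return the merged attribute value for one attribute.

    base, patch: attribute values (None if the attribute is not set)
    mode:        name of the conflict handling type
    attachable:  value of the attribute's attachable flag in the merge schema
    separator:   string placed between base and patch value when attaching
    """
    if base is None:
        return patch  # only Patch exists (or neither)
    if patch is None:
        return base  # only Base exists
    if attachable and mode in ATTACH_MODES:
        return base + separator + patch  # B+S+P
    if mode.startswith("BASE"):
        return base
    return patch


attached = merge_attribute("a", "b", "BASEATTACHOROVERWRITE", attachable=True, separator=";")
```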

Merging child nodes

Preparation

The merge of the child nodes (elements and text nodes) uses the base element as its basis. Depending on the conflict handling in use, the text nodes of either the base or the patch element are removed: with BASEOVERWRITE the text nodes of the patch element are removed, and with PATCHOVERWRITE the text nodes of the base element are removed. With an attaching conflict handling no text nodes are discarded.
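The preparation step can be sketched with Python's `xml.etree.ElementTree`, where the text nodes directly below an element are stored in its `text` attribute and in the `tail` attributes of its children. The helper names `clear_text_nodes` and `prepare_text` are hypothetical, and the sample documents are made up.

```python
import xml.etree.ElementTree as ET


def clear_text_nodes(element):
    """Remove the text nodes that are direct children of element."""
    element.text = None
    for child in element:
        child.tail = None


def prepare_text(base, patch, mode):
    """Discard the text nodes of the side that loses under the given mode."""
    if mode == "BASEOVERWRITE":
        clear_text_nodes(patch)
    elif mode == "PATCHOVERWRITE":
        clear_text_nodes(base)
    # Attaching modes keep the text nodes of both elements.


base = ET.fromstring("<note>base text<b/>base tail</note>")
patch = ET.fromstring("<note>patch text<b/>patch tail</note>")
prepare_text(base, patch, "BASEOVERWRITE")
```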

Matching and Merging of Elements

Next, a match in the base element is searched for each child element of the patch element. To do this, the <handling> element for each child element p of the patch is retrieved. The namespace of p determines how the recursive element merge is invoked, but it is of no interest in this section. To determine whether a base element b matches p, the <criterion> of the retrieved <handling> element is considered. Depending on the evaluation of its XPath expression, b and p can be declared to match and to represent the same object. The merge algorithm is then invoked on b and p, and its result replaces b in the base. p is then removed from the patch.
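The matching loop can be sketched in Python as follows. As a simplifying assumption, the <criterion> XPath is reduced to the name of a single attribute whose value must be equal in b and p; the real criterion can be an arbitrary XPath expression. `find_match`, `merge_children` and the `merge` callback (a stand-in for the recursive merge) are hypothetical names.

```python
import xml.etree.ElementTree as ET


def find_match(base_parent, patch_child, criterion_attribute):
    """Return the first base child with the same tag and criterion value."""
    key = patch_child.get(criterion_attribute)
    for candidate in base_parent:
        if candidate.tag == patch_child.tag and candidate.get(criterion_attribute) == key:
            return candidate
    return None


def merge_children(base_parent, patch_parent, criterion_attribute, merge):
    """Merge every matching pair; unmatched patch children stay in the patch."""
    for patch_child in list(patch_parent):  # copy, because we remove while looping
        match = find_match(base_parent, patch_child, criterion_attribute)
        if match is None:
            continue
        index = list(base_parent).index(match)
        base_parent[index] = merge(match, patch_child)  # result replaces b
        patch_parent.remove(patch_child)  # p is removed from the patch


def prefer_base(b, p):
    """Toy stand-in for the recursive merge: add attributes missing in b."""
    for name, value in p.attrib.items():
        b.attrib.setdefault(name, value)
    return b


base = ET.fromstring('<beans><bean id="a" class="X"/><bean id="b"/></beans>')
patch = ET.fromstring('<beans><bean id="b" scope="s"/><bean id="c"/></beans>')
merge_children(base, patch, "id", prefer_base)
```

After the call, the base child with `id="b"` carries the `scope` attribute from the patch, and only the unmatched child with `id="c"` remains in the patch.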

For this procedure, text nodes are considered part of the preceding element. If the first child nodes of the patch element are text nodes, they are considered part of the subsequent element. Text nodes coming from the patch are always placed behind text nodes from the base. If text nodes from the matched base element and the matched patch element contain the same text, they are considered to be the same and only one of them is put into the merge result.
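The ordering and de-duplication of text nodes can be sketched as below. The function `merge_text_nodes` is hypothetical and assumes that "the same text" means exact string equality, which the source does not spell out.

```python
def merge_text_nodes(base_texts, patch_texts):
    """Place patch text behind base text, dropping text already present."""
    merged = list(base_texts)
    for text in patch_texts:
        if text not in merged:  # assumed: exact string comparison
            merged.append(text)
    return merged


texts = merge_text_nodes(["intro", "shared"], ["shared", "extra"])
```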

Handling the Remaining Patch Elements

If the patch element still has child nodes, they are now added to the base element. To preserve as much of the ordering of the base and the patch elements as possible, a mapping is created between the original base and patch elements. The mapping can be understood as an alignment of the base and the patch element. Such an alignment can be seen below.

Figure 1. Example for the document alignment

As shown in the figure above, the element <a> that precedes <b> also precedes it in the merge result. The elements between <b> and <d> are kept between the two, with the elements from the patch inserted after the elements from the base. This behavior is independent of the chosen ConflictHandlingType.

This approach works for simple structures such as the one in the figure above. A more intertwined structure cannot be merged without losing some of the structure. Structure preservation should therefore be seen as a best-effort service.
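The alignment idea can be sketched on plain lists of element names. The function `align` is hypothetical. It assumes that names are unique within each list and that the matched elements (the anchors) appear in the same relative order in base and patch, which corresponds to the simple case described above. The sample lists are made up and do not reproduce the figure.

```python
def align(base, patch):
    """Merge two ordered lists of names, using the shared names as anchors.

    Between two anchors, base items come first and patch items follow.
    """
    patch_names = set(patch)
    anchors = [name for name in base if name in patch_names]
    merged = []
    b = p = 0
    for anchor in anchors:
        bi = base.index(anchor, b)
        pi = patch.index(anchor, p)  # raises ValueError for intertwined orders
        merged.extend(base[b:bi])  # base items before the anchor
        merged.extend(patch[p:pi])  # then patch items before the anchor
        merged.append(anchor)
        b, p = bi + 1, pi + 1
    merged.extend(base[b:])
    merged.extend(patch[p:])
    return merged


order = align(["a", "b", "c", "d"], ["b", "x", "d", "e"])
```

If the anchors are ordered differently in the two lists, this sketch gives up with a `ValueError` instead of attempting a lossy merge.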

Optional Validation

The main goal of this merge algorithm is to preserve as much information from the input documents as possible in the merge result. The document’s structure, as defined in its document definition, is part of that information. In the current state of the implementation, some assumptions are made about the possible structure of documents. We assume that

  • an element occurs once or arbitrarily often as a child element

  • an element only occurs in a sequence (i.e. abbbc but not ababac), or its position isn’t relevant to the document definition (in XSD this is declared with the <xsd:choice/> element)

With these assumptions some structures cannot be preserved, and for some valid input documents a valid merge won’t be possible. However, these assumptions are met by most popular document definitions and lead to a valid merge result. To check the conformity of the merge result with the document definition, the algorithm ends with a validation.
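The sequence assumption (abbbc is fine, ababac is not) can be checked with a small Python sketch. The function `occurs_in_single_run` is a hypothetical helper for illustration and is not part of the merge algorithm itself.

```python
def occurs_in_single_run(child_names):
    """Return True if every name occupies one contiguous run of the sequence."""
    seen = set()
    previous = None
    for name in child_names:
        if name != previous and name in seen:
            return False  # the name reappears after a different name
        seen.add(name)
        previous = name
    return True


ok = occurs_in_single_run("abbbc")
broken = occurs_in_single_run("ababac")
```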

The validation is performed only once per document merge and, in the current implementation, is only possible for documents defined by an XSD. If the merge schema specifies an XSD in its <definition> element, this XSD is used for validation. Otherwise the algorithm tries to fetch the XSD from the schema-location attribute of the document itself. By default the validation is soft. This means that the validation is performed but its result is only written to the log: a failing validation causes a WARN log message and nothing more. If the validation is set to hard, a failing validation causes an exception in the merge process and no result is returned.
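The difference between soft and hard validation can be sketched in Python as follows. `finish_merge`, `MergeValidationError` and the `is_valid` callback are hypothetical; the callback stands in for the XSD validation, which this sketch does not implement.

```python
import logging

logger = logging.getLogger("merge")


class MergeValidationError(Exception):
    """Raised when hard validation fails (hypothetical exception type)."""


def finish_merge(result, is_valid, hard=False):
    """Validate the merge result once and react according to the mode."""
    if not is_valid(result):
        if hard:
            # Hard validation: abort, no result is returned.
            raise MergeValidationError("merge result does not conform to its XSD")
        # Soft validation (default): log a warning and carry on.
        logger.warning("merge result does not conform to its XSD")
    return result


soft_result = finish_merge("<a/>", lambda document: False)
```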