The parallel corpus representation

The graph is stored in a data type which has the following type:

interface Graph {
  source: Token[]
  target: Token[]
  edges: Record<string, Edge>
  comment?: string
}

interface Token {
  text: string
  /** an identifier that is unique over the whole graph */
  id: string
}

interface Edge {
  /** a convenience copy of the identifier used in the edges object of the graph */
  id: string
  /** these are ids to source and target tokens */
  ids: string[]
  /** labels on this edge */
  labels: string[]
  /** is this manually or automatically aligned */
  manual: boolean
  comment?: string
}

Invariant

The graph is subject to an invariant (checked with the function check_invariant):

all defined identifiers are unique over the graph
all referenced identifiers exist
each tokens is referenced in one edge
text tokens match the regex /\s*\S+\s+/
the graph is aligned
each of the graph comment and edge comments are non-empty strings if present

Example

A graph with source and target texts each being w1 w2:

{
  "source": [{"id": "s0", "text": "w1 "}, {"id": "s1", "text": "w2 "}],
  "target": [{"id": "t0", "text": "w1 "}, {"id": "t1", "text": "w2 "}],
  "edges": {
    "e-s0-t0": {
      "id": "e-s0-t0",
      "ids": ["s0", "t0"],
      "labels": [],
      "manual": false
    },
    "e-s1-t1": {
      "id": "e-s1-t1",
      "ids": ["s1", "t1"],
      "labels": [],
      "manual": false
    }
  }
}

The source word apa automatically aligned with bepa, with the label "A":

{
  "source": [{"id": "s0", "text": "apa "}],
  "target": [{"id": "t0", "text": "bepa "}],
  "edges": {
    "e-s0-t0": {
      "id": "e-s0-t0",
      "ids": ["s0", "t0"],
      "labels": ["A"],
      "manual": false
    }
  }
}

The diff view

There is a derived form of looking at the data which is used to draw the graph in the interface. The data types look like this:

interface Dropped {
  edit: 'Dropped'
  target: Token
  id: string
  manual: boolean
}

interface Dragged {
  edit: 'Dragged'
  source: Token
  id: string
  manual: boolean
}

interface Edited {
  edit: 'Edited'
  source: Token[]
  target: Token[]
  id: string
  manual: boolean
}

The names are inspired by that Edited have a fixed position and the displaced tokens have been Dragged from somewhere in the source text and Dropped somewhere in the target text.

Additionally there is an enriched version that gives intra-token character-diffs:

type RichDiff =
  | Edited & {index: number} & {target_diffs: TokenDiff[]; source_diffs: TokenDiff[]}
  | Dragged & {index: number} & {source_diff: TokenDiff}
  | Dropped & {index: number} & {target_diff: TokenDiff}

type TokenDiff = [-1 | 0 | 1 /* deleted, unmodified, inserted */, string][]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Representation.md

Representation.md

The parallel corpus representation

Invariant

Example

The diff view

Files

Representation.md

Latest commit

History

Representation.md

File metadata and controls

The parallel corpus representation

Invariant

Example

The diff view