Real world text editing traces for benchmarking CRDT and Rope data structures

josephg/editing-traces

What is this?

This repository contains character-by-character editing histories recorded from real-world editing sessions. The goal is to provide standard benchmarks for comparing the performance of rope libraries and various OT / CRDT implementations.

Where is the data?

This repository stores two kinds of data, in two subdirectories:

The sequential_traces folder contains a set of simple editing traces where all the edits can be applied in sequence to produce a final text document.

Most of these data sets come from individual users typing into text documents. Each editing event (keystroke) has been recorded so it can be replayed later.

Some of these traces are generated by linearizing ("flattening") the concurrent traces (below). Regardless, the data format is the same.

These traces are super simple to replay - just apply each change, one by one, into an empty document and you'll get the expected output.

See sequential_traces/README.md for details on the data format used and other notes.
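As a rough illustration of what "apply each change, one by one" means, here is a minimal Python sketch. It does not read the actual trace files: the edit shape used here, a hypothetical `(position, delete_count, inserted_text)` tuple, is an assumption made for the example, and `replay` is an illustrative name. The real file format is described in sequential_traces/README.md.

```python
from typing import Iterable, List, Tuple

# Hypothetical edit shape, assumed for this sketch only:
# (position, number of characters to delete, text to insert)
Edit = Tuple[int, int, str]


def replay(edits: Iterable[Edit], start: str = "") -> str:
    """Apply each edit in order, starting from `start` (empty by default)."""
    # A list of characters, so positions index characters rather than bytes.
    doc: List[str] = list(start)
    for pos, del_count, ins in edits:
        # Replace the deleted range with the characters of the inserted text.
        doc[pos:pos + del_count] = ins
    return "".join(doc)


edits = [
    (0, 0, "hello"),   # type "hello" into the empty document
    (5, 0, " world"),  # append " world"
    (0, 1, "H"),       # replace the first character with "H"
]
result = replay(edits)
```

A plain list is fine for a correctness check, but each insert is O(n). That cost is exactly what a rope library is meant to avoid, so a benchmark would swap the list for the data structure under test.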

These traces are useful for benchmarking rope libraries, and for benchmarking how CRDTs behave when only a single user is making changes to a text document.

These data sets describe their editing positions using Unicode character offsets. If you don't want to think about Unicode offsets while benchmarking, use the ascii_only variants of these traces. In the ASCII variants, all non-ASCII inserts have been replaced with the underscore character.
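The offset issue matters when your implementation indexes text by something other than characters. The sketch below assumes that "Unicode character offsets" means code point offsets (check sequential_traces/README.md to confirm). It shows how such an offset could be converted to a UTF-8 byte offset or a UTF-16 code unit offset. The helper names are illustrative, and `to_ascii_only` only mimics the underscore replacement described above. It is not the script used to produce the ascii_only files.

```python
def char_to_utf8_offset(text: str, char_pos: int) -> int:
    """Byte offset in UTF-8 of the character at code point offset char_pos."""
    return len(text[:char_pos].encode("utf-8"))


def char_to_utf16_offset(text: str, char_pos: int) -> int:
    """UTF-16 code unit offset of the character at code point offset char_pos."""
    return len(text[:char_pos].encode("utf-16-le")) // 2


def to_ascii_only(inserted: str) -> str:
    """Replace every non-ASCII character with an underscore."""
    return "".join(c if ord(c) < 128 else "_" for c in inserted)


text = "a\U0001F600b"  # "a", an emoji outside the BMP, then "b"
utf8_pos = char_to_utf8_offset(text, 2)    # offset of "b" in bytes
utf16_pos = char_to_utf16_offset(text, 2)  # offset of "b" in UTF-16 units
ascii_text = to_ascii_only(text)
```

Here "b" sits at character offset 2, but at byte offset 5 in UTF-8 and code unit offset 3 in UTF-16. In the ASCII variants all three numbers coincide, which is why they are convenient for benchmarking.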

The concurrent_traces folder contains editing traces where multiple users typed into a shared text document concurrently. (Concurrently means they were typing at the same time.)

These traces are much harder to replay, because each editing position listed in the file is relative to the version of the document on that user's computer when they were typing. This complexity is, unfortunately, necessary to replay a collaborative editing session between multiple users, which is what we need when benchmarking text-based CRDTs.

See concurrent_traces/README.md for details on the data format used and other notes.
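To see why positions relative to each user's own version make replay harder, consider this small Python sketch. It is not the concurrent trace format and not a CRDT. It only shows two hypothetical users, Alice and Bob, inserting into the same base document at the same time, and what goes wrong if the second edit is applied without adjusting its position. The `transform_pos` helper is an illustrative, insert-only adjustment with an arbitrary tie-break.

```python
def insert(doc: str, pos: int, text: str) -> str:
    """Insert text into doc at character offset pos."""
    return doc[:pos] + text + doc[pos:]


def transform_pos(pos: int, other_pos: int, other_len: int) -> int:
    """Shift pos to account for a concurrent insert of other_len characters.

    Ties (other_pos == pos) are resolved by placing the other insert first.
    That is an arbitrary choice made for this sketch.
    """
    return pos + other_len if other_pos <= pos else pos


base = "hello"
alice = (5, "!")    # Alice appends "!" to her copy of "hello"
bob = (0, ">> ")    # Bob, at the same time, prepends ">> " to his copy

after_bob = insert(base, *bob)

# Applying Alice's edit as-is uses a position that was only valid in her copy.
naive = insert(after_bob, *alice)

# Adjusting her position for Bob's concurrent insert gives the intended text.
adjusted_pos = transform_pos(alice[0], bob[0], len(bob[1]))
adjusted = insert(after_bob, adjusted_pos, alice[1])
```

The naive replay puts "!" in the middle of the word, while the adjusted replay appends it as Alice intended. A real OT or CRDT implementation must handle this for deletes and for many interleaved edits, which is the work these traces are designed to exercise.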