A well-defined delimited text format that supports arbitrary data and works well with existing tools. This is an alternative to CSV or tab-delimited text files.
- CSV is a pain because the standard Unix command set can't deal with it.
- Tab-delimited is a pain because it can't represent text with embedded newlines or tabs.
- Both of those are a pain because many format details vary among tools.
- Both of those are a pain when it's unclear what character encoding is used.
- Accurately store and transmit any 2-dimensional matrix of strings.
- Work with the Unix toolset (cat, sed, awk, cut, tr, sort, join, etc.) in data-safe, obvious ways.
- Always use UTF-8 without BOM so there is no question of character encoding.
- Have a clear specification so there is no question of what is and is not valid.
- Be simple enough to implement quickly in any language.
A file is a sequence of rows; a row is a sequence of fields; a field is a string of bytes. Rows are terminated by either \n or \r\n. Fields within a row are separated by \t. These bytes ([\r\n\t]) never appear inside a field value, because field values are escaped using a backslash-based scheme nearly identical to the one used by Go. Every field value is escaped; there are no quoted fields.
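To make the escaping concrete, here is a minimal sketch of the encoding side in Go. escapeField is a hypothetical helper name, not part of any implementation here, and it escapes only the characters that must never appear raw in a field; a complete writer would also escape NUL, U+FEFF, and any bytes that are not valid UTF-8.

```go
package otab

import "strings"

// escapeField is a hypothetical sketch of field encoding. It rewrites only
// the characters that must never appear raw inside a field: backslash, tab,
// CR, and LF. A complete writer would also escape NUL, U+FEFF, and any
// non-UTF-8 bytes (e.g. as \xNN) so the file itself stays valid UTF-8.
func escapeField(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		switch s[i] {
		case '\\':
			b.WriteString(`\\`)
		case '\t':
			b.WriteString(`\t`)
		case '\r':
			b.WriteString(`\r`)
		case '\n':
			b.WriteString(`\n`)
		default:
			b.WriteByte(s[i])
		}
	}
	return b.String()
}
```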
Tools that don't need to understand field values byte-for-byte need not understand the escaping. A tool can safely manipulate data at the file, row, and field level as simple delimited text, with no knowledge of the field encoding. This enables almost every existing delimited-text tool to read and write OTAB in a reasonable, if not perfect, way.
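For example, a tool that only filters or reorders columns can work entirely on the escaped form. A rough Go sketch, with splitFields as an assumed name:

```go
package otab

import "strings"

// splitFields is a sketch of the structure-only view: strip the row
// terminator (LF or CR LF), then split on tabs. The returned fields are
// still in their escaped form and can be written back out verbatim.
func splitFields(line string) []string {
	line = strings.TrimSuffix(line, "\n")
	line = strings.TrimSuffix(line, "\r")
	return strings.Split(line, "\t")
}
```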
By convention, many tools will treat the first row in a file as a header with column names. This is an expected and encouraged convention, but not part of the specification.
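For instance, a hypothetical two-column file with a header row might look like this, written as a Go string literal so the tab separators and the escaped newline are visible:

```go
// example is a hypothetical OTAB file: a header row naming the columns,
// then one data row whose second field contains an embedded newline,
// written in the file as the two characters `\` and `n`.
const example = "id\tnote\n" +
	"1\tfirst line\\nsecond line\n"
```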
This specification was last modified July 9, 2025.
The spec is not expected to change, but I would like to let it bake a while and maybe talk to other users before declaring 1.0.
Below is a grammar for the OTAB language in Wirth Syntax Notation with some conventions taken from the Go Specification.
OTAB is non-canonicalized Unicode text encoded in UTF-8. The word "character" below refers to a single Unicode code point.
/* Characters we care about. */
LF = /* Unicode code point U+000A */ .
CR = /* Unicode code point U+000D */ .
TAB = /* Unicode code point U+0009 */ .
/* These characters are defined so we can exclude them explicitly. */
NUL = /* Unicode code point U+0000 */ .
BOM = /* Unicode code point U+FEFF */ .
/*
We consider special the LF, CR, TAB, NUL, BOM, and `\` characters.
Basic characters are all other Unicode code points.
Note that NUL and BOM never appear in valid OTAB.
*/
BASIC_CHAR = /* one basic character as described above */ .
/* A file is a sequence of zero or more lines. */
file = { line } .
/*
A line is a sequence of one or more fields.
Fields are separated by a single tab character.
A line is terminated by either LF or CR LF.
*/
line = field { TAB field } ( LF | CR LF ) .
/*
A field is an encoded sequence of values.
The encoding borrows much from Go's string literal syntax.
Unicode values represent a single Unicode character
and decode to the UTF-8 byte sequence for that character.
Byte values represent a single byte.
Note that this encoding allows arbitrary data in fields.
The data need not be valid UTF-8, even though the file is.
*/
field = { unicode_value | byte_value } .
unicode_value = BASIC_CHAR | little_u_value | big_u_value | escaped_char .
byte_value = octal_byte_value | hex_byte_value .
octal_byte_value = `\` octal_digit octal_digit octal_digit .
hex_byte_value = `\` "x" hex_digit hex_digit .
little_u_value = `\` "u" hex_digit hex_digit hex_digit hex_digit .
big_u_value = `\` "U" hex_digit hex_digit hex_digit hex_digit
hex_digit hex_digit hex_digit hex_digit .
escaped_char = `\` ( "a" | "b" | "f" | "n" | "r" | "t" | "v" | `\` ) .
hex_digit = "0" … "9" | "A" … "F" | "a" … "f" .
octal_digit = "0" … "7" .
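Because the escape syntax is so close to Go's string-literal syntax, a decoder can lean on Go's standard library. Here is a minimal sketch (decodeField is a hypothetical name); strconv.UnquoteChar with a zero quote byte accepts the escape forms in the grammar above, lets ' and " pass through unescaped, and, being Go's own decoder, additionally rejects \u and \U values that are not valid code points.

```go
package otab

import (
	"strconv"
	"unicode/utf8"
)

// decodeField is a sketch that decodes one escaped field into raw bytes.
// Unicode escapes (and basic characters) become UTF-8 byte sequences;
// octal and hex escapes become single bytes, so the result need not be
// valid UTF-8. Error reporting is deliberately minimal here.
func decodeField(field string) ([]byte, error) {
	out := make([]byte, 0, len(field))
	for len(field) > 0 {
		r, multibyte, rest, err := strconv.UnquoteChar(field, 0)
		if err != nil {
			return nil, err
		}
		if r < utf8.RuneSelf || !multibyte {
			out = append(out, byte(r))
		} else {
			out = utf8.AppendRune(out, r)
		}
		field = rest
	}
	return out, nil
}
```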
OTAB is a regular language. Here is a regular expression that exactly matches OTAB. White space in the regexp is only for readability.
(
(
[^\\\t\r\n\0\x{feff}]
|\\\\
|\\[abfnrtv]
|\\[0-7][0-7][0-7]|\\x[0-9A-Fa-f][0-9A-Fa-f]
|\\u[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]
|\\U[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]
[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]
)*
(\t(
[^\\\t\r\n\0\x{feff}]
|\\\\
|\\[abfnrtv]
|\\[0-7][0-7][0-7]|\\x[0-9A-Fa-f][0-9A-Fa-f]
|\\u[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]
|\\U[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]
[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]
)*)*
\r?\n)*
A valid UTF-8 string matching the expression above is a valid OTAB file.
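For reference, here is a sketch of checking input against this grammar with Go's regexp package. The pattern is the one above, rebuilt with counted repetition and \0 written as \x00 for brevity, and anchored with \A and \z so the entire input must match; matchesGrammar is a hypothetical name.

```go
package otab

import (
	"regexp"
	"unicode/utf8"
)

// fieldPat is one encoded field: basic characters or backslash escapes.
const fieldPat = `(?:[^\\\t\r\n\x00\x{feff}]` +
	`|\\\\` +
	`|\\[abfnrtv]` +
	`|\\[0-7]{3}` +
	`|\\x[0-9A-Fa-f]{2}` +
	`|\\u[0-9A-Fa-f]{4}` +
	`|\\U[0-9A-Fa-f]{8})*`

// otabPat is zero or more lines; each line is one or more tab-separated
// fields terminated by LF or CR LF.
var otabPat = regexp.MustCompile(`\A(?:` + fieldPat + `(?:\t` + fieldPat + `)*\r?\n)*\z`)

// matchesGrammar reports whether data is syntactically valid OTAB: it must
// be valid UTF-8 and match the pattern above (which already excludes NUL
// and a literal BOM code point).
func matchesGrammar(data []byte) bool {
	return utf8.Valid(data) && otabPat.Match(data)
}
```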
An empty file contains zero lines, not one line with an empty field.
An empty line contains one field, which is the empty string.
A file with a UTF-8 BOM is invalid OTAB. Tools that read OTAB may skip a leading BOM for compatibility with broken processors, but tools claiming to write valid OTAB must not put a BOM in an OTAB file. An OTAB field can contain a BOM encoded as \uFEFF or similar.
A file containing NUL characters is invalid OTAB. Many tools cannot deal with NULs properly, so we disallow them. OTAB fields may contain NUL bytes, but they must be encoded. Programs which decode OTAB fields must handle NUL or risk corrupting valid files.
A non-empty file that ends without a newline character is invalid OTAB. Tools that read OTAB may infer a final newline for compatibility with broken processors, but tools claiming to write valid OTAB must terminate every line, including the last.
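Here is a sketch of those two reader-side allowances applied before strict parsing (lenientPreprocess is a hypothetical name):

```go
package otab

import "bytes"

// lenientPreprocess applies the compatibility allowances described above:
// drop one leading UTF-8 BOM and supply a missing final newline. Writers
// must not rely on this; valid OTAB never needs either repair.
func lenientPreprocess(data []byte) []byte {
	data = bytes.TrimPrefix(data, []byte("\uFEFF"))
	if len(data) > 0 && data[len(data)-1] != '\n' {
		data = append(data, '\n')
	}
	return data
}
```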
For now, I just have a Go implementation that isn't quite feature-complete, and a CLI tool based on it.
These already make my data work easier.
OTAB is placed in the public domain. See LICENSE for details.
The Go code in this repository is also in the public domain, except for escape.go, which derives from Go itself and contains its own license.