A well-defined delimited text format that supports arbitrary data and works well with existing tools. This is an alternative to CSV or tab-delimited text files.
- CSV is a pain because the standard Unix command set can't deal with it.
- Tab-delimited is a pain because it can't represent text with embedded newlines or tabs.
- Both of those are a pain because many format details vary among tools.
- Both of those are a pain when it's unclear what character encoding is used.
- Accurately store and transmit any 2-dimensional matrix of strings.
- Work with the Unix toolset (cat, sed, awk, cut, tr, sort, join, etc.) in data-safe, obvious ways.
- Always use UTF-8 without BOM so there is no question of character encoding.
- Have a clear specification so there is no question of what is and is not valid.
- Be simple enough to implement quickly in any language.
A file is a sequence of rows; a row is a sequence of fields; a field is a string of bytes. Rows are terminated by either \n or \r\n. Fields within a row are separated by \t. These bytes ([\r\n\t]) never appear inside a field value, because field values are escaped using a backslash-based scheme nearly identical to the one used by Go. Every field value is escaped; there are no quoted fields.
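To make the escaping concrete, here is a minimal sketch of the encoding side in Go. escapeField is a hypothetical helper name, not part of any implementation here, and it escapes only the characters that must never appear raw in a field; a complete writer would also escape NUL, U+FEFF, and any bytes that are not valid UTF-8.

```go
package otab

import "strings"

// escapeField is a hypothetical sketch of field encoding. It rewrites only
// the characters that must never appear raw inside a field: backslash, tab,
// CR, and LF. A complete writer would also escape NUL, U+FEFF, and any
// non-UTF-8 bytes (e.g. as \xNN) so the file itself stays valid UTF-8.
func escapeField(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		switch s[i] {
		case '\\':
			b.WriteString(`\\`)
		case '\t':
			b.WriteString(`\t`)
		case '\r':
			b.WriteString(`\r`)
		case '\n':
			b.WriteString(`\n`)
		default:
			b.WriteByte(s[i])
		}
	}
	return b.String()
}
```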
Tools that don't need to understand field values byte-for-byte need not understand the escaping. A tool can safely manipulate data at the file, row, and field level as simple delimited text, with no knowledge of the field encoding. This enables almost every existing delimited-text tool to read and write OTAB in a reasonable, if not perfect, way.
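For example, a tool that only filters or reorders columns can work entirely on the escaped form. A rough Go sketch, with splitFields as an assumed name:

```go
package otab

import "strings"

// splitFields is a sketch of the structure-only view: strip the row
// terminator (LF or CR LF), then split on tabs. The returned fields are
// still in their escaped form and can be written back out verbatim.
func splitFields(line string) []string {
	line = strings.TrimSuffix(line, "\n")
	line = strings.TrimSuffix(line, "\r")
	return strings.Split(line, "\t")
}
```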
By convention, many tools will treat the first row in a file as a header with column names. This is an expected and encouraged convention, but not part of the specification.
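For instance, a hypothetical two-column file with a header row might look like this, written as a Go string literal so the tab separators and the escaped newline are visible:

```go
// example is a hypothetical OTAB file: a header row naming the columns,
// then one data row whose second field contains an embedded newline,
// written in the file as the two characters `\` and `n`.
const example = "id\tnote\n" +
	"1\tfirst line\\nsecond line\n"
```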
This specification was last modified July 9, 2025.
The spec is not expected to change, but I would like to let it bake a while and maybe talk to other users before declaring 1.0.
Below is a grammar for the OTAB language in Wirth Syntax Notation with some conventions taken from the Go Specification.
OTAB is non-canonicalized Unicode text encoded in UTF-8. The word "character" below refers to a single Unicode code point.
/* Characters we care about. */
LF = /* Unicode code point U+000A */ .
CR = /* Unicode code point U+000D */ .
TAB = /* Unicode code point U+0009 */ .
/* These characters are defined so we can exclude them explicitly. */
NUL = /* Unicode code point U+0000 */ .
BOM = /* Unicode code point U+FEFF */ .
/*
We consider special the LF, CR, TAB, NUL, BOM, and `\` characters.
Basic characters are all other Unicode code points.
Note that NUL and BOM never appear in valid OTAB.
*/
BASIC_CHAR = /* one basic character as described above */ .
/* A file is a sequence of zero or more lines. */
file = { line } .
/*
A line is a sequence of one or more fields.
Fields are separated by a single tab character.
A line is terminated by either LF or CR LF.
*/
line = field { TAB field } ( LF | CR LF ) .
/*
A field is an encoded sequence of values.
The encoding borrows much from Go's string literal syntax.
Unicode values represent a single Unicode character
and decode to the UTF-8 byte sequence for that character.
Byte values represent a single byte.
Note that this encoding allows arbitrary data in fields.
The data need not be valid UTF-8, even though the file is.
*/
field = { unicode_value | byte_value } .
unicode_value = BASIC_CHAR | little_u_value | big_u_value | escaped_char .
byte_value = octal_byte_value | hex_byte_value .
octal_byte_value = `\` octal_digit octal_digit octal_digit .
hex_byte_value = `\` "x" hex_digit hex_digit .
little_u_value = `\` "u" hex_digit hex_digit hex_digit hex_digit .
big_u_value = `\` "U" hex_digit hex_digit hex_digit hex_digit
hex_digit hex_digit hex_digit hex_digit .
escaped_char = `\` ( "a" | "b" | "f" | "n" | "r" | "t" | "v" | `\` ) .
hex_digit = "0" … "9" | "A" … "F" | "a" … "f" .
octal_digit = "0" … "7" .
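Because the escape syntax is so close to Go's string-literal syntax, a decoder can lean on Go's standard library. Here is a minimal sketch (decodeField is a hypothetical name); strconv.UnquoteChar with a zero quote byte accepts the escape forms in the grammar above, lets ' and " pass through unescaped, and, being Go's own decoder, additionally rejects \u and \U values that are not valid code points.

```go
package otab

import (
	"strconv"
	"unicode/utf8"
)

// decodeField is a sketch that decodes one escaped field into raw bytes.
// Unicode escapes (and basic characters) become UTF-8 byte sequences;
// octal and hex escapes become single bytes, so the result need not be
// valid UTF-8. Error reporting is deliberately minimal here.
func decodeField(field string) ([]byte, error) {
	out := make([]byte, 0, len(field))
	for len(field) > 0 {
		r, multibyte, rest, err := strconv.UnquoteChar(field, 0)
		if err != nil {
			return nil, err
		}
		if r < utf8.RuneSelf || !multibyte {
			out = append(out, byte(r))
		} else {
			out = utf8.AppendRune(out, r)
		}
		field = rest
	}
	return out, nil
}
```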
OTAB is a regular language. Here is a regular expression that exactly matches OTAB. White space in the regexp is only for readability.
(
(
[^\\\t\r\n\0\x{feff}]
|\\\\
|\\[abfnrtv]
|\\[0-7][0-7][0-7]|\\x[0-9A-Fa-f][0-9A-Fa-f]
|\\u[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]
|\\U[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]
[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]
)*
(\t(
[^\\\t\r\n\0\x{feff}]
|\\\\
|\\[abfnrtv]
|\\[0-7][0-7][0-7]|\\x[0-9A-Fa-f][0-9A-Fa-f]
|\\u[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]
|\\U[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]
[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]
)*)*
\r?\n)*
A valid UTF-8 string matching the expression above is a valid OTAB file.
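For reference, here is a sketch of checking input against this grammar with Go's regexp package. The pattern is the one above, rebuilt with counted repetition and \0 written as \x00 for brevity, and anchored with \A and \z so the entire input must match; matchesGrammar is a hypothetical name.

```go
package otab

import (
	"regexp"
	"unicode/utf8"
)

// fieldPat is one encoded field: basic characters or backslash escapes.
const fieldPat = `(?:[^\\\t\r\n\x00\x{feff}]` +
	`|\\\\` +
	`|\\[abfnrtv]` +
	`|\\[0-7]{3}` +
	`|\\x[0-9A-Fa-f]{2}` +
	`|\\u[0-9A-Fa-f]{4}` +
	`|\\U[0-9A-Fa-f]{8})*`

// otabPat is zero or more lines; each line is one or more tab-separated
// fields terminated by LF or CR LF.
var otabPat = regexp.MustCompile(`\A(?:` + fieldPat + `(?:\t` + fieldPat + `)*\r?\n)*\z`)

// matchesGrammar reports whether data is syntactically valid OTAB: it must
// be valid UTF-8 and match the pattern above (which already excludes NUL
// and a literal BOM code point).
func matchesGrammar(data []byte) bool {
	return utf8.Valid(data) && otabPat.Match(data)
}
```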
An empty file contains zero lines, not one line with an empty field.
An empty line contains one field, which is the empty string.
A file with a UTF-8 BOM is invalid OTAB. Tools that read OTAB may skip a leading BOM for compatibility with broken processors, but tools claiming to write valid OTAB must not put a BOM in an OTAB file. An OTAB field can contain a BOM encoded as \uFEFF or similar.
A file containing NUL characters is invalid OTAB. Many tools cannot deal with NULs properly, so we disallow them. OTAB fields may contain NUL bytes, but they must be encoded. Programs which decode OTAB fields must handle NUL or risk corrupting valid files.
A non-empty file that ends without a newline character is invalid OTAB. Tools that read OTAB may infer a final newline for compatibility with broken processors, but tools claiming to write valid OTAB must terminate every line, including the last.
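Here is a sketch of those two reader-side allowances applied before strict parsing (lenientPreprocess is a hypothetical name):

```go
package otab

import "bytes"

// lenientPreprocess applies the compatibility allowances described above:
// drop one leading UTF-8 BOM and supply a missing final newline. Writers
// must not rely on this; valid OTAB never needs either repair.
func lenientPreprocess(data []byte) []byte {
	data = bytes.TrimPrefix(data, []byte("\uFEFF"))
	if len(data) > 0 && data[len(data)-1] != '\n' {
		data = append(data, '\n')
	}
	return data
}
```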
For now, I just have a Go implementation that isn't quite feature-complete, and a CLI tool based on it.
These already make my data work easier.
OTAB is placed in the public domain. See LICENSE for details.
The Go code in this repository is also in the public domain, except for escape.go, which derives from Go itself and contains its own license.