chopdoc

A command-line tool for splitting documents into chunks, optimized for RAG (Retrieval-Augmented Generation) and LLM applications.

Features

Chunking methods: char, word, sentence, recursive, markdown
Configurable chunk size and overlap
Text cleaning and normalization
JSONL output format
Supported formats: txt (or any plain text)

Installation

Homebrew:

brew tap mirpo/homebrew-tools
brew install chopdoc

Using go install:

go install github.com/mirpo/chopdoc@latest

Local Build

git clone https://github.com/mirpo/chopdoc.git
cd chopdoc
make build

Usage

chopdoc -input pg_essay.txt -output chunks.jsonl -size 1000 -clean aggressive
chopdoc -input pg_essay.txt -output chunks.jsonl -size 1000 -overlap 100
chopdoc -input pg_essay.txt -output chunks.jsonl -size 1000 -overlap 100 -method char -clean aggressive
chopdoc -input pg_essay.txt -output chunks.jsonl -size 1000 -overlap 100 -method word
chopdoc -input pg_essay.txt -output chunks.jsonl -size 10   -overlap 1   -method sentence
chopdoc -input pg_essay.txt -output chunks.jsonl -size 100  -overlap 0   -method recursive
chopdoc -input pg_essay.txt -output chunks.jsonl                         -method markdown -strip-headers
chopdoc -input pg_essay.txt -output chunks.jsonl                         -method markdown -headers 1-2 -add-metadata

chopdoc can be piped:

cat pg_essay.txt | chopdoc -size 1 -method sentence
cat pg_essay.txt | chopdoc -size 1 -method sentence > piped.jsonl
cat pg_essay.txt | chopdoc -size 1 -method sentence -output output_as_arg.jsonl
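The same pipe can be driven from a script. The following is a minimal Python sketch, assuming chopdoc is on your PATH and prints JSONL to stdout when no -output is given (as the first piped example suggests). The helper names build_command and chop are made up for illustration and are not part of chopdoc.

```python
import subprocess


def build_command(size: int, method: str) -> list[str]:
    """Build the argument list, using only the -size and -method flags."""
    return ["chopdoc", "-size", str(size), "-method", method]


def chop(text: str, size: int = 1, method: str = "sentence") -> list[str]:
    """Pipe text to chopdoc on stdin and return its stdout, one entry per line.

    Assumption: chopdoc is installed and writes one JSON object per line
    to stdout when -output is omitted.
    """
    result = subprocess.run(
        build_command(size, method),
        input=text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.splitlines()
```

Calling chop(open("pg_essay.txt").read()) would then be roughly equivalent to the first piped command above.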

Options

  -add-metadata
        Include header metadata in output (default false, markdown method only)
  -clean string
        Cleaning mode: none, normal, aggressive (default "none")
  -headers string
        Header levels to use for markdown method (e.g. 1-6, 2-4) (default "1-6")
  -input string
        Input file path
  -method string
        Chunking method: char, word, sentence, recursive, markdown (default "char")
  -output string
        Output file path (must end with .jsonl)
  -overlap int
        Overlap size in characters
  -size int
        Chunk size in characters (default 1000)
  -strip-headers
        Remove headers from content (default false, markdown method only)
  -version
        Print the current version of chopdoc
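
To make -size and -overlap concrete for the char method, here is a minimal Python sketch of a sliding window over characters. It is not chopdoc's implementation: the function name chunk_chars, the validation rules, and the handling of the final short chunk are assumptions chosen for illustration, and chopdoc's actual output may differ at the edges.

```python
def chunk_chars(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into windows of `size` characters.

    Each window starts `size - overlap` characters after the previous one,
    so consecutive chunks share `overlap` characters.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        # Stop once a window has reached the end of the text, so the
        # tail is not emitted again as a shorter, redundant chunk.
        if start + size >= len(text):
            break
    return chunks
```

With size 4 and overlap 1, the string "abcdefghij" yields "abcd", "defg", "ghij": each chunk begins with the last character of the previous one.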

Output Format

Each chunk is written as a JSON line:

{"chunk": "content here"}
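
For downstream use, such as feeding chunks to an embedding step, the output can be read back line by line. A minimal Python sketch, assuming each line is a JSON object with a "chunk" key as shown above. The name iter_chunks is hypothetical, and any extra fields (for example those written with -add-metadata) are ignored here.

```python
import json
from typing import Iterable, Iterator


def iter_chunks(lines: Iterable[str]) -> Iterator[str]:
    """Yield the "chunk" text from each JSONL line, skipping blank lines."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            # Report the offending line so a bad file is easy to debug.
            raise ValueError(f"line {number}: invalid JSON: {exc}") from exc
        yield record["chunk"]
```

Because the function accepts any iterable of strings, an open file works directly: with open("chunks.jsonl", encoding="utf-8") as fh, call list(iter_chunks(fh)).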

Contributing

Fork the repository
Create your feature branch
Run tests: go test ./...
Submit a pull request

License

MIT
