Codecrawl

Empower your AI apps with clean data from any repository. Featuring advanced codebase file-trees, semantic search, llms.txt and data extraction capabilities.

What is Codecrawl?

Codecrawl is an API service that takes a repository URL, crawls it, and converts it into clean markdown or structured data, generate embeddings then store them in a vector database. We currently support only public codebases accessible on different codehosts like GitHub and Gitlab and give you clean data for each.

How to use it?

We will provide an easy to use API with our hosted version. You can find the playground and documentation here. You can also self host the backend if you'd like.

Check out the following resources to get started:

API: Documentation
SDKs: Node
Want an SDK or Integration? Let us know by opening an issue.

To run locally, refer to guide here.

API Key

To use the API, you need to sign up on Codecrawl and get an API key.

Features

File Structure: Get repository file structure to feed to LLMs.
LLMs.txt: Generate a Llms.txt to feed directly to any model
Extract: Get structured data from single repo, multiple repo with the help AI.
Search: Search repository content with semantic understanding
Batch: Process multiple repositories simultaneously

Powerful Capabilities

Multiple Output Formats: Convert repository content to markdown, XML, plain text
Structured Data: Extract metadata like file stats, token counts, and repository info
Advanced Search: Find relevant files and content with semantic search
Repository Analytics: Get insights on file sizes, token counts and top files
Scalable Processing: Handle large codebases with configurable limits and batch operations
Clean Data: Remove comments, empty lines and get compressed output as needed

LLMs.txt

Generate a llms.txt file for a repository, optimized for feeding Language Model training. This endpoint initiates a job to create the llms.txt file and returns a job ID to track its progress.

curl -X POST https://api.irere.dev/v1/llmstxt \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "url": "https://github.com/irere123/run-lang"
  }'

Returns a job ID to check the status of the llms.txt generation.

{
  "success": true,
  "id": "123-456-789"
}

Check LLMs.txt Job

Check the status and retrieve the content of a llms.txt generation job using the job ID.

curl -X GET https://api.irere.dev/v1/llmstxt/123-456-789 \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY'

Returns the status and data of the llms.txt generation job.

{
  "success": true,
  "status": "completed",
  "data": {
    "llmstxt": "Content of the llms.txt file..."
  }
}

Generate FileTree

Used to get the file tree of the whole repository using its URL. This returns plain tree for given repository.

curl -X POST https://api.irere.dev/v1/tree \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://github.com/irere123/run-lang", 
    }'

Check FileTree Job

Check the status and retrieve the content of a /v1/tree generation job using the job ID.

curl -X GET https://api.irere.dev/v1/tree/123-456-789 \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY'

Returns the status and data of the /v1/tree generation job.

{
  "success": true,
  "status": "completed",
  "data": {
    "tree": "Repository filetree..."
  }
}

Contributing

We love contributions! Please read our contributing guide before submitting a pull request. If you'd like to self-host, refer to the self-hosting guide.

It is the sole responsibility of the end users to respect repositories' policies when indexing, searching and crawling with Codecrawl. Users are advised to adhere to the applicable privacy policies and Licenses prior to initiating any indexing activities. By default, Codecrawl respects the directives specified in the repository's .gitignore files when indexing. By utilizing Codecrawl, you expressly agree to comply with these conditions.

Contributors

Credits

Built with inspiration and techniques from **Firecrawl. Special thanks to their contributors for pioneering web crawling.

Name		Name	Last commit message	Last commit date
Latest commit History 70 Commits
.github		.github
.vscode		.vscode
apps		apps
assets		assets
packages/sdk		packages/sdk
.gitignore		.gitignore
ARCHITECTURE.md		ARCHITECTURE.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
SELF_HOST.md		SELF_HOST.md
biome.jsonc		biome.jsonc
package.json		package.json
pnpm-lock.yaml		pnpm-lock.yaml
pnpm-workspace.yaml		pnpm-workspace.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Codecrawl

What is Codecrawl?

How to use it?

API Key

Features

Powerful Capabilities

LLMs.txt

Check LLMs.txt Job

Generate FileTree

Check FileTree Job

Contributing

Contributors

Credits

About

Contributors 2

Languages

License

Idee8/codecrawl

Folders and files

Latest commit

History

Repository files navigation

Codecrawl

What is Codecrawl?

How to use it?

API Key

Features

Powerful Capabilities

LLMs.txt

Check LLMs.txt Job

Generate FileTree

Check FileTree Job

Contributing

Contributors

Credits

About

Topics

Resources

License

Stars

Watchers

Forks

Contributors 2

Languages