Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
Bitextor generates translation memories from multilingual websites
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Python library for handling audio datasets.
OpusFilter - Parallel corpus processing toolkit
Utilities for Processing the Switchboard Dialogue Act Corpus
An open source reimplementation of Benny Brodda's BETA in Python
An advanced, extensible web front-end for the Manatee-open corpus search engine
SpeCT - Speech Corpus Toolkit for Praat. Documentation: https://lennes.github.io/spect/
A set of workflows for corpus building through OCR, post-correction and normalisation
Multi-Language Dataset Cleaner/Creator for Mozilla's DeepSpeech Framework
A parser for annotated MuseScore 3 files.
Tools for filtering and cleaning parallel and monolingual corpora for machine translation and other natural language processing tasks.
Python library for extracting quantitative, reproducible metrics of multi-level alignment between speakers in naturalistic language corpora.
Reading the data from OPIEC - an Open Information Extraction corpus
Rezonator: Dynamics of human engagement
Utilities for Processing the Meeting Recorder Dialogue Act Corpus
Praaline is an open-source system to manage, annotate, visualise and analyse spoken language corpora