tokenizer
A grammar describes the syntax of a programming language, and might be defined in Backus-Naur form (BNF). A lexer performs lexical analysis, turning text into tokens. A parser takes tokens and builds a data structure like an abstract syntax tree (AST). The parser is concerned with context: does the sequence of tokens fit the grammar? The front end of a compiler typically combines a lexer and a parser built for a specific grammar.
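As a rough illustration of the lexer and parser split described above, here is a minimal Python sketch for a toy arithmetic grammar. The grammar, the `Token` type, and the `tokenize` and `parse` functions are all hypothetical names invented for this example; none of them come from the repositories listed on this page.

```python
import re
from typing import Iterator, List, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str


# Toy grammar (an assumption for this sketch), in BNF-like notation:
#   expr   ::= term (("+" | "-") term)*
#   term   ::= factor (("*" | "/") factor)*
#   factor ::= NUMBER | "(" expr ")"
TOKEN_RE = re.compile(r"(?P<WS>\s+)|(?P<NUMBER>\d+)|(?P<OP>[-+*/()])")


def tokenize(source: str) -> Iterator[Token]:
    """Lexer: turn raw text into tokens, skipping whitespace."""
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "WS":
            yield Token(match.lastgroup, match.group())
        pos = match.end()


def parse(tokens: List[Token]):
    """Parser: check the token sequence against the grammar and build an AST.

    The AST is a nested tuple (operator, left, right); numbers are ints.
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def expr():
        node = term()
        while (tok := peek()) and tok.text in ("+", "-"):
            advance()
            node = (tok.text, node, term())
        return node

    def term():
        node = factor()
        while (tok := peek()) and tok.text in ("*", "/"):
            advance()
            node = (tok.text, node, factor())
        return node

    def factor():
        tok = peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        advance()
        if tok.kind == "NUMBER":
            return int(tok.text)
        if tok.text == "(":
            node = expr()
            closing = peek()
            if closing is None or closing.text != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        raise SyntaxError(f"unexpected token {tok.text!r}")

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek().text!r}")
    return tree
```

Calling `parse(list(tokenize("1 + 2 * 3")))` yields a nested tuple in which the multiplication sits below the addition, reflecting the precedence encoded in the grammar. The lexer only rejects characters it cannot classify; deciding whether the token order is valid is left entirely to the parser.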
Here are 1,240 public repositories matching this topic...
Parser Building Toolkit for JavaScript
Updated Jan 10, 2025 - TypeScript
Persian NLP Toolkit
Updated Jul 16, 2024 - Python
Solves basic Russian NLP tasks, API for lower level Natasha projects
Updated Oct 17, 2024 - Python
A Python library for Korean natural language processing. It provides word extraction, tokenization, part-of-speech tagging, and preprocessing.
Updated Apr 13, 2024 - Python
Self-contained Japanese Morphological Analyzer written in pure Go
Updated Oct 24, 2024 - Go
Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.
Updated May 16, 2023 - JavaScript
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
Updated Jan 8, 2025 - Python
🌭 Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it.
Updated May 14, 2018 - Swift
Ekphrasis is a text processing tool geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (English Wikipedia and Twitter, 330 million English tweets).
Updated Feb 27, 2024 - Python
An NLP Toolset With A Focus on Explainable Inference, dedicated to explainable NLP techniques
Updated Feb 3, 2021 - Java
Open Korean Text Processor - An Open-source Korean Text Processor
Updated Mar 12, 2024 - Scala
The fast scanner generator for Java™ with full Unicode support
Updated Jan 1, 2025 - Java
Ungreedy subword tokenizer and vocabulary trainer for Python, Go & JavaScript
Updated Jul 2, 2024 - Go
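Data Annotation (数据标注) is a tool dedicated to processing and annotating text data. Through a streamlined, fast annotation workflow and dynamic algorithmic feedback, it lets users quickly label keywords and uses algorithms to keep reducing the cost and time of manual annotation. The process starts with manual annotation to build a foundation, then automatic annotation feeds back into manual annotation, and finally manual annotation corrects any deviations, greatly improving annotation accuracy and efficiency. Data Annotation is a fully open-source project with no commercial edition, but it depends on an open-source digital base platform for managing personnel and roles. Results for the various lexicons are published periodically on this platform.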
Updated Dec 13, 2024 - Java
🌿 NodeJS PHP Parser - extract AST or tokens
Updated Dec 31, 2024 - JavaScript