Releases: SulRash/huggingface-text-data-analyzer
v1.1.0
v1.0.0
Fixed many bugs to make the tool release-ready and added several important features:
- Supports graph visualization of results.
- Removed the dependency on fastText in favor of Hugging Face models.
- Added more command-line arguments.
- Fixed tons of bugs.
- Cleaned up files.
The repository also has an image now :)
v0.1.0-alpha
HuggingFace Text Data Analyzer v0.1.0
Initial release of a comprehensive tool for analyzing text datasets from HuggingFace's datasets library. This release provides both command-line and programmatic interfaces for performing detailed analysis of text datasets.
Installation
pip install huggingface-text-data-analyzer
Key Features
Basic Analysis
- Text length statistics with field-specific analysis
- Word distribution analysis and visualization
- Junk text detection (HTML tags, special characters)
- Batch-processed tokenizer analysis
- Chat template support for conversational datasets
- Configurable field analysis
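As a rough illustration of what field-specific length statistics and junk-text detection can look like, here is a minimal sketch using only the Python standard library. It is not the tool's implementation: the function names `length_stats` and `junk_ratio`, the regular expressions, and the sample records are all assumptions made for illustration.

```python
import re
import statistics

# Assumed patterns for illustration; the tool's actual rules may differ.
HTML_TAG = re.compile(r"<[^>]+>")
SPECIAL_CHARS = re.compile(r"[^\w\s.,;:!?()'-]")


def length_stats(records, field):
    """Return basic word-count statistics for one text field."""
    lengths = [len(r[field].split()) for r in records if r.get(field)]
    if not lengths:
        return {"count": 0}
    return {
        "count": len(lengths),
        "mean": statistics.mean(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }


def junk_ratio(text):
    """Fraction of characters that belong to HTML tags or special characters."""
    if not text:
        return 0.0
    tag_chars = sum(len(tag) for tag in HTML_TAG.findall(text))
    stripped = HTML_TAG.sub("", text)
    special = len(SPECIAL_CHARS.findall(stripped))
    return (tag_chars + special) / len(text)


# Hypothetical records with the two fields used in the usage example.
records = [
    {"instruction": "Summarize the text", "response": "<p>A short summary</p>"},
    {"instruction": "Translate to French", "response": "Bonjour le monde"},
]
stats = length_stats(records, "instruction")
ratios = [junk_ratio(r["response"]) for r in records]
```

Computing statistics per field keeps short prompts and long responses from being averaged together, which is presumably the point of the configurable field analysis.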
Advanced Analysis
- Part-of-Speech (POS) tagging
- Named Entity Recognition (NER)
- Language detection
- Sentiment analysis
Performance Optimizations
- Batch processing for tokenization
- Progress tracking with rich console output
- Tokenizer parallelism
- Caching support for tokenized texts
- Memory-efficient large dataset processing
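The batching and caching ideas above can be sketched in a few lines. This is only a sketch under stated assumptions: the whitespace `tokenize` function is a stand-in for a real Hugging Face tokenizer, and the names `batched` and `count_tokens` are hypothetical rather than part of the package.

```python
from functools import lru_cache


def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


@lru_cache(maxsize=None)
def tokenize(text):
    """Stand-in whitespace tokenizer; results are cached per unique text."""
    return tuple(text.lower().split())


def count_tokens(texts, batch_size=2):
    """Tokenize texts batch by batch and return the total token count."""
    total = 0
    for batch in batched(texts, batch_size):
        total += sum(len(tokenize(t)) for t in batch)
    return total


texts = ["Hello world", "Hello world", "Batching keeps memory bounded", "ok"]
total = count_tokens(texts)
batches = list(batched(texts, 3))
```

Only one batch is held in flight at a time, and repeated texts hit the cache instead of being tokenized again.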
Usage
Basic analysis:
analyze-dataset "dataset_name" --split "train" --output-dir "results"
Full analysis with all features:
analyze-dataset "dataset_name" \
    --advanced \
    --use-pos \
    --use-ner \
    --use-lang \
    --use-sentiment \
    --tokenizer "bert-base-uncased" \
    --fields instruction response
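If you want to drive the CLI from a script, one option is to assemble the argument list in Python and hand it to `subprocess`. The helper below is a hypothetical convenience, not part of the package. It uses only the flags shown in the commands above, and it assumes that `--split` and `--output-dir` can be combined with the advanced flags.

```python
import shlex


def build_command(dataset, split=None, output_dir=None, tokenizer=None,
                  fields=(), advanced_flags=()):
    """Assemble an analyze-dataset argument list from the documented flags."""
    cmd = ["analyze-dataset", dataset]
    if split:
        cmd += ["--split", split]
    if output_dir:
        cmd += ["--output-dir", output_dir]
    if advanced_flags:
        # --advanced is followed by one --use-<name> flag per analysis.
        cmd.append("--advanced")
        cmd += [f"--use-{name}" for name in advanced_flags]
    if tokenizer:
        cmd += ["--tokenizer", tokenizer]
    if fields:
        cmd += ["--fields", *fields]
    return cmd


cmd = build_command(
    "dataset_name",
    tokenizer="bert-base-uncased",
    fields=("instruction", "response"),
    advanced_flags=("pos", "ner", "lang", "sentiment"),
)
printable = shlex.join(cmd)  # shell-safe string, handy for logging

# To actually run it (requires the package to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when dataset names or field names contain spaces.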
Requirements
- Python 3.8+
- Key dependencies: transformers, datasets, spacy, rich, torch, pandas, numpy, scikit-learn
Documentation
Full documentation and usage examples are available in the README.
Notes
- First public release
- Apache License 2.0
- Contributions welcome