HuggingFace Text Data Analyzer v0.1.0

Initial release of a tool for analyzing text datasets loaded through the Hugging Face datasets library. It provides both a command-line interface and a programmatic interface for per-field analysis of text datasets.

Installation

pip install huggingface-text-data-analyzer

Key Features

Basic Analysis

Text length statistics with field-specific analysis
Word distribution analysis and visualization
Junk text detection (HTML tags, special characters)
Batch-processed tokenizer analysis
Chat template support for conversational datasets
Configurable field analysis
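
To make the field-specific statistics and junk detection above concrete, here is a minimal sketch using only the Python standard library. It is not the analyzer's own code: the function names (length_stats, junk_report, analyze_fields) and the two regular expressions are assumptions chosen for illustration, and the tool's actual junk rules may differ.

```python
import re
import statistics

# Hypothetical patterns; the analyzer's real junk rules are not documented here.
HTML_TAG = re.compile(r"<[^>]+>")
SPECIAL_CHARS = re.compile(r"[^\w\s.,;:!?'\"()-]")


def length_stats(texts):
    """Return basic word-count statistics for a list of strings."""
    lengths = [len(t.split()) for t in texts]
    if not lengths:
        return {"count": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
    }


def junk_report(text):
    """Count HTML tags and special characters in one string."""
    return {
        "html_tags": len(HTML_TAG.findall(text)),
        "special_chars": len(SPECIAL_CHARS.findall(text)),
    }


def analyze_fields(rows, fields):
    """Compute per-field length statistics and total junk counts."""
    report = {}
    for field in fields:
        # Skip rows where the field is missing or empty.
        texts = [row[field] for row in rows if row.get(field)]
        junk = [junk_report(t) for t in texts]
        report[field] = {
            "length": length_stats(texts),
            "html_tags": sum(j["html_tags"] for j in junk),
            "special_chars": sum(j["special_chars"] for j in junk),
        }
    return report
```

Calling analyze_fields(rows, ["instruction", "response"]) on a list of dictionaries returns one report entry per field, which mirrors the idea of configurable field analysis.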

Advanced Analysis

Part-of-Speech (POS) tagging
Named Entity Recognition (NER)
Language detection
Sentiment analysis

Performance Optimizations

Batch processing for tokenization
Progress tracking with rich console output
Tokenizer parallelism
Caching support for tokenized texts
Memory-efficient large dataset processing
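
The combination of batch processing and caching listed above can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: whitespace_tokenize stands in for a real tokenizer, and the cache layout (one JSON file per batch, keyed by a SHA-256 hash of the batch contents) is hypothetical.

```python
import hashlib
import json
from pathlib import Path


def batched(items, batch_size):
    """Yield successive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def whitespace_tokenize(batch):
    """Stand-in tokenizer (assumption): splits each text on whitespace."""
    return [text.split() for text in batch]


def tokenize_with_cache(texts, cache_dir, batch_size=2,
                        tokenize=whitespace_tokenize):
    """Tokenize texts batch by batch, reusing on-disk results when present."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for batch in batched(texts, batch_size):
        # Key each batch by a hash of its contents so reruns hit the cache.
        key = hashlib.sha256(json.dumps(batch).encode("utf-8")).hexdigest()
        path = cache_dir / f"{key}.json"
        if path.exists():
            tokens = json.loads(path.read_text(encoding="utf-8"))
        else:
            tokens = tokenize(batch)
            path.write_text(json.dumps(tokens), encoding="utf-8")
        results.extend(tokens)
    return results
```

Processing one batch at a time keeps only that batch's tokens in flight before they are appended, and a second run over the same texts reads the cached files instead of calling the tokenizer again.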

Usage

Basic analysis:

analyze-dataset "dataset_name" --split "train" --output-dir "results"

Full analysis with all features:

analyze-dataset "dataset_name" \
    --advanced \
    --use-pos \
    --use-ner \
    --use-lang \
    --use-sentiment \
    --tokenizer "bert-base-uncased" \
    --fields instruction response
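
For scripted runs, the command lines above can be assembled from Python. The sketch below only builds the argument list from the flags shown in this release note; build_command is a hypothetical helper, not part of the package, and it assumes that --advanced should accompany any --use-* flag, as in the example above.

```python
# Advanced analyses named in this release note, mapped to their CLI flags.
ADVANCED_FLAGS = {
    "pos": "--use-pos",
    "ner": "--use-ner",
    "lang": "--use-lang",
    "sentiment": "--use-sentiment",
}


def build_command(dataset, split=None, output_dir=None, tokenizer=None,
                  fields=(), advanced=()):
    """Return an argv list for the analyze-dataset command."""
    cmd = ["analyze-dataset", dataset]
    if split:
        cmd += ["--split", split]
    if output_dir:
        cmd += ["--output-dir", output_dir]
    if advanced:
        # Assumption: --advanced is passed together with the --use-* flags.
        cmd.append("--advanced")
        for name in advanced:
            cmd.append(ADVANCED_FLAGS[name])  # KeyError on an unknown name
    if tokenizer:
        cmd += ["--tokenizer", tokenizer]
    if fields:
        cmd += ["--fields", *fields]
    return cmd


# Example (not executed here):
#   import subprocess
#   subprocess.run(build_command("dataset_name", split="train",
#                                output_dir="results"), check=True)
```

Passing the list form to subprocess.run avoids shell quoting issues with dataset names or field names that contain spaces.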

Requirements

Python 3.8+
Key dependencies: transformers, datasets, spacy, rich, torch, pandas, numpy, scikit-learn

Documentation

Full documentation and usage examples are available in the README.

Notes

First public release
Apache License 2.0
Contributions welcome

v0.1.0-alpha