
v0.1.0-alpha

Pre-release
@SulRash SulRash released this 05 Dec 13:48

HuggingFace Text Data Analyzer v0.1.0

Initial release of a tool for analyzing text datasets loaded through HuggingFace's datasets library. It can be used from the command line or programmatically, and covers the basic and advanced analyses listed below.

Installation

pip install huggingface-text-data-analyzer

Key Features

Basic Analysis

  • Text length statistics with field-specific analysis
  • Word distribution analysis and visualization
  • Junk text detection (HTML tags, special characters)
  • Batch-processed tokenizer analysis
  • Chat template support for conversational datasets
  • Configurable field analysis
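
To make the basic analyses above concrete, here is a minimal sketch in plain Python of per-field length statistics, word counts, and a junk-text ratio. It is not the package's implementation: the function names (length_stats, word_counts, junk_ratio, analyze_fields) and both regular expressions are assumptions chosen for illustration, and the tool's actual junk rules may differ.

```python
import re
import statistics
from collections import Counter

# Hypothetical patterns: the tool's real junk-detection rules are not
# described in these notes.
HTML_TAG = re.compile(r"<[^>]+>")
SPECIAL_CHARS = re.compile(r"[^\w\s.,!?;:'()-]")


def length_stats(texts):
    """Character-length summary for a list of strings."""
    lengths = [len(t) for t in texts]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
    }


def word_counts(texts):
    """Case-insensitive word frequencies across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


def junk_ratio(text):
    """Fraction of characters that are HTML tags or special characters."""
    if not text:
        return 0.0
    without_tags = HTML_TAG.sub("", text)
    tag_chars = len(text) - len(without_tags)
    special = len(SPECIAL_CHARS.findall(without_tags))
    return (tag_chars + special) / len(text)


def analyze_fields(records, fields):
    """Run the analyses above separately for each requested field."""
    report = {}
    for field in fields:
        texts = [r.get(field) or "" for r in records]
        report[field] = {
            "length": length_stats(texts),
            "top_words": word_counts(texts).most_common(5),
            "mean_junk_ratio": statistics.mean(junk_ratio(t) for t in texts),
        }
    return report


records = [
    {"instruction": "Say hi", "response": "<b>Hi</b> there"},
    {"instruction": "Count to three", "response": "one two three"},
]
report = analyze_fields(records, ["instruction", "response"])
```

Keeping each analysis as a separate function over a list of strings is what makes the field selection configurable here: analyze_fields only decides which column of text each function sees.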

Advanced Analysis

  • Part-of-Speech (POS) tagging
  • Named Entity Recognition (NER)
  • Language detection
  • Sentiment analysis

Performance Optimizations

  • Batch processing for tokenization
  • Progress tracking with rich console output
  • Tokenizer parallelism
  • Caching support for tokenized texts
  • Memory-efficient large dataset processing
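
The batching and caching ideas above can be sketched as follows. This is an illustration under stated assumptions, not the package's code: whitespace_tokenize is a stand-in for a real tokenizer, and the cache layout (one JSON file keyed by a SHA-256 hash of the tokenizer name and the texts) is hypothetical.

```python
import hashlib
import json
from pathlib import Path


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def whitespace_tokenize(batch):
    """Stand-in tokenizer (assumption): split each text on whitespace."""
    return [text.split() for text in batch]


def cache_key(texts, tokenizer_name):
    """Hash the tokenizer name and every text into one hex digest."""
    digest = hashlib.sha256()
    digest.update(tokenizer_name.encode("utf-8"))
    for text in texts:
        digest.update(b"\x00")  # separator so ["ab", "c"] != ["a", "bc"]
        digest.update(text.encode("utf-8"))
    return digest.hexdigest()


def tokenize_with_cache(texts, cache_dir, tokenizer_name="whitespace",
                        batch_size=2, tokenize=whitespace_tokenize):
    """Tokenize in batches, reusing a cached result when one exists."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(texts, tokenizer_name)}.json"
    if path.exists():
        # Cache hit: skip tokenization entirely.
        return json.loads(path.read_text(encoding="utf-8"))
    tokens = []
    for batch in batched(texts, batch_size):
        tokens.extend(tokenize(batch))
    path.write_text(json.dumps(tokens), encoding="utf-8")
    return tokens
```

Including the tokenizer name in the key means that switching tokenizers does not return stale tokens from an earlier run over the same texts.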

Usage

Basic analysis:

analyze-dataset "dataset_name" --split "train" --output-dir "results"

Full analysis with all features:

analyze-dataset "dataset_name" \
    --advanced \
    --use-pos \
    --use-ner \
    --use-lang \
    --use-sentiment \
    --tokenizer "bert-base-uncased" \
    --fields instruction response
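
For scripted runs, the same commands can be assembled from Python. The sketch below uses only the flags shown above. The helper build_analyze_command and its parameter names are hypothetical, and it drives the CLI rather than the package's programmatic interface, which these notes do not describe.

```python
import shlex


def build_analyze_command(dataset, split=None, output_dir=None,
                          tokenizer=None, fields=(), advanced=False,
                          use_pos=False, use_ner=False, use_lang=False,
                          use_sentiment=False):
    """Return the analyze-dataset argument list for the given options."""
    cmd = ["analyze-dataset", dataset]
    if split:
        cmd += ["--split", split]
    if output_dir:
        cmd += ["--output-dir", output_dir]
    # Boolean switches, in the order used in the example above.
    switches = [
        ("--advanced", advanced),
        ("--use-pos", use_pos),
        ("--use-ner", use_ner),
        ("--use-lang", use_lang),
        ("--use-sentiment", use_sentiment),
    ]
    cmd += [flag for flag, enabled in switches if enabled]
    if tokenizer:
        cmd += ["--tokenizer", tokenizer]
    if fields:
        cmd += ["--fields", *fields]
    return cmd


basic = build_analyze_command("dataset_name", split="train", output_dir="results")
# shlex.join (Python 3.8+) quotes arguments safely for display or logging.
print(shlex.join(basic))
# To execute: subprocess.run(basic, check=True), with the package installed.
```

Passing the command as a list rather than a single string avoids shell quoting problems when dataset names or output paths contain spaces.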

Requirements

  • Python 3.8+
  • Key dependencies: transformers, datasets, spacy, rich, torch, pandas, numpy, scikit-learn

Documentation

Full documentation and usage examples are available in the README.

Notes

  • First public release
  • Apache License 2.0
  • Contributions welcome