Releases: SulRash/huggingface-text-data-analyzer

v1.1.0

06 Dec 03:05

New features and fixes:

  • Better caching: the cache is now granular to each field and analysis type
  • Prompts before reusing cached data
  • More args for finer control, such as skipping basic analysis or always using cached data
  • Bug fixes
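
As a rough illustration of what "granular to your field and analysis type" can mean, here is a minimal sketch of a cache keyed on dataset, split, field, and analysis type. The names cache_path and load_or_compute and the JSON file layout are assumptions made for this example, not the tool's actual implementation.

```python
import hashlib
import json
from pathlib import Path


def cache_path(cache_dir, dataset, split, field, analysis):
    """Return a cache file path unique to one (dataset, split, field, analysis) combination."""
    # Serialize the identifying parts deterministically before hashing,
    # so the same combination always maps to the same file.
    key = json.dumps([dataset, split, field, analysis])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return Path(cache_dir) / f"{digest}.json"


def load_or_compute(path, compute, use_cache=True):
    """Reuse a cached result when allowed; otherwise compute it and store it."""
    if use_cache and path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

With a key like this, changing only the field or only the analysis type produces a different cache file, so re-running one analysis does not invalidate the others.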

v1.0.0

05 Dec 17:20

Fixed a ton of bugs to make it release-ready and added important features:

  • Supports graph visualization of results.
  • Removed the dependency on fast_text in favor of Hugging Face models.
  • Added more args.
  • Fixed tons of bugs.
  • Cleaned up files.

Also have an image for the repository now :)

v0.1.0-alpha

05 Dec 13:48
Pre-release

HuggingFace Text Data Analyzer v0.1.0

Initial release of a comprehensive tool for analyzing text datasets from HuggingFace's datasets library. This release provides both command-line and programmatic interfaces for performing detailed analysis of text datasets.

Installation

pip install huggingface-text-data-analyzer

Key Features

Basic Analysis

  • Text length statistics with field-specific analysis
  • Word distribution analysis and visualization
  • Junk text detection (HTML tags, special characters)
  • Batch-processed tokenizer analysis
  • Chat template support for conversational datasets
  • Configurable field analysis
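
To make the first and third items concrete, here is a minimal sketch of per-field length statistics and a simple junk-text score. The function names, the regular expressions, and the definition of "junk" as HTML tags plus non-alphanumeric characters are assumptions for illustration; the tool's own rules may differ.

```python
import re
import statistics

HTML_TAG = re.compile(r"<[^>]+>")
# Anything that is not a letter, digit, underscore, or whitespace counts as "special" here.
SPECIAL_CHAR = re.compile(r"[^\w\s]")


def length_stats(texts):
    """Summarize character lengths for one field (assumes at least one text)."""
    lengths = [len(t) for t in texts]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
    }


def junk_ratio(text):
    """Fraction of characters that belong to HTML tags or are special characters."""
    if not text:
        return 0.0
    tag_chars = sum(len(match) for match in HTML_TAG.findall(text))
    # Count special characters only in what remains after the tags are removed,
    # so the angle brackets are not counted twice.
    stripped = HTML_TAG.sub("", text)
    special_chars = len(SPECIAL_CHAR.findall(stripped))
    return (tag_chars + special_chars) / len(text)
```

Running length_stats once per field, for example over every "instruction" value and then every "response" value, gives the field-specific view described above.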

Advanced Analysis

  • Part-of-Speech (POS) tagging
  • Named Entity Recognition (NER)
  • Language detection
  • Sentiment analysis

Performance Optimizations

  • Batch processing for tokenization
  • Progress tracking with rich console output
  • Tokenizer parallelism
  • Caching support for tokenized texts
  • Memory-efficient large dataset processing
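
The batching idea behind these items can be sketched as follows. This is a generic pattern, not the tool's code: batched, count_tokens, and whitespace_tokenize_batch are hypothetical names, and the whitespace tokenizer is a stand-in so the example runs without downloading a model.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items without materializing the input."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_tokens(texts, tokenize_batch, batch_size=1000):
    """Total token count, calling the tokenizer once per batch instead of once per text."""
    total = 0
    for batch in batched(texts, batch_size):
        total += sum(len(tokens) for tokens in tokenize_batch(batch))
    return total


def whitespace_tokenize_batch(batch):
    # Stand-in tokenizer for illustration only; a real run would use a model tokenizer.
    return [text.split() for text in batch]


if __name__ == "__main__":
    sample = ["a b", "c", "d e f"]
    print(count_tokens(sample, whitespace_tokenize_batch, batch_size=2))
```

Because batched consumes its input lazily, only one batch of texts is held in memory at a time, which is what keeps this approach workable for large datasets.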

Usage

Basic analysis:

analyze-dataset "dataset_name" --split "train" --output-dir "results"

Full analysis with all features:

analyze-dataset "dataset_name" \
    --advanced \
    --use-pos \
    --use-ner \
    --use-lang \
    --use-sentiment \
    --tokenizer "bert-base-uncased" \
    --fields instruction response
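
When driving the command from a script, the flags shown above can be assembled into an argument list. The helper below is a hypothetical convenience written for this example (it is not part of the package), and it uses only the flags that appear in the commands above.

```python
import shlex


def build_command(dataset, fields=(), tokenizer=None, split=None,
                  output_dir=None, advanced_flags=()):
    """Assemble an analyze-dataset argument list from the flags shown above."""
    cmd = ["analyze-dataset", dataset]
    if split:
        cmd += ["--split", split]
    if output_dir:
        cmd += ["--output-dir", output_dir]
    if advanced_flags:
        # The individual --use-* switches are passed together with --advanced.
        cmd.append("--advanced")
        cmd += list(advanced_flags)
    if tokenizer:
        cmd += ["--tokenizer", tokenizer]
    if fields:
        cmd += ["--fields", *fields]
    return cmd


if __name__ == "__main__":
    command = build_command(
        "dataset_name",
        fields=["instruction", "response"],
        tokenizer="bert-base-uncased",
        advanced_flags=["--use-pos", "--use-ner", "--use-lang", "--use-sentiment"],
    )
    # Print a shell-safe version; pass the list to subprocess.run to execute it.
    print(shlex.join(command))
```

Building a list rather than a single string avoids shell quoting problems when dataset or field names contain spaces.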

Requirements

  • Python 3.8+
  • Key dependencies: transformers, datasets, spacy, rich, torch, pandas, numpy, scikit-learn

Documentation

Full documentation and usage examples are available in the README.

Notes

  • First public release
  • Apache License 2.0
  • Contributions welcome