
Release v0.2.0: Multimodal Support & DJ-SORA

@HYLcool HYLcool released this 07 Mar 12:24
· 173 commits to main since this release
156ed20

New Features

  • 🚀 We introduce DJ-SORA to provide open large-scale, high-quality datasets for SORA-like models. #227
  • 🚀 We introduce hundreds of dedicated video, image, audio, text, and other multi-modal data processing operators and tools.
  • 💥 Our paper has been accepted by SIGMOD'24 industrial track! #211
  • 💥 "BetterMixture": our second data-centric LLM competition has kicked off and will end soon. #174

New OPs

Multimodal

  • video_frames_text_similarity_filter: keeps samples whose similarities between sampled video frame images and text are within a specific range. #227
  • video_tagging_from_frames_mapper: generates video tags from frames extracted from the video. #227
  • video_tagging_from_audio_mapper: generates video tags from audio streams extracted from videos. #227
  • video_captioning_from_video_mapper: generates captions from frame images extracted from video to augment datasets. #227
  • video_captioning_from_audio_mapper: captions a video according to its audio streams. #227
  • image_captioning_mapper: generates captions based on a language model and the image. This OP will increase the number of samples in the dataset. #131 #191 #227
  • image_captioning_from_gpt4v_mapper: generates captions based on GPT-4-Vision and the image. This OP will increase the number of samples in the dataset. #214 #227
  • image_diffusion_mapper: generates and augments the images based on the Stable Diffusion model and their original images and texts. This OP will increase the number of samples in the dataset. #200

Video

Filter

  • video_duration_filter: keeps samples whose videos' durations are within a specified range. #227
  • video_aspect_ratio_filter: filters samples according to the aspect ratios (width divided by height, r=w/h) of their videos. #227
  • video_resolution_filter: filters samples according to the resolution of videos in them. #227
  • video_ocr_area_ratio_filter: keeps samples whose detected text area ratios for specified frames in the video are within a specified range. #227
  • video_aesthetics_filter: filters samples according to the aesthetics score of frame images extracted from videos. #227
  • video_motion_score_filter: keeps samples with video motion scores within a specific range. #227
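
As a rough illustration of the aspect ratio check described in this list, the sketch below assumes the filter keeps a video when r=w/h falls inside a configured range, as the other range-based filters here do. The function name `keep_by_aspect_ratio` and the `min_ratio`/`max_ratio` parameters with their defaults are hypothetical and are not Data-Juicer's actual API.

```python
from fractions import Fraction


def keep_by_aspect_ratio(width, height,
                         min_ratio=Fraction(9, 21),
                         max_ratio=Fraction(21, 9)):
    """Return True if r = width / height lies in [min_ratio, max_ratio].

    Hypothetical helper for illustration only; not part of Data-Juicer.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Exact rational arithmetic avoids float rounding right at the bounds.
    ratio = Fraction(width, height)
    return min_ratio <= ratio <= max_ratio


print(keep_by_aspect_ratio(1920, 1080))  # 16:9 is inside the assumed default range
print(keep_by_aspect_ratio(4000, 500))   # 8:1 is wider than 21:9
```

Using `Fraction` keeps a video whose ratio equals a bound exactly, which a float comparison might not.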

Mapper

  • video_split_by_scene_mapper: splits videos into scene clips. #227
  • video_split_by_duration_mapper: splits videos by specified duration interval. #227
  • video_split_by_key_frame_mapper: splits videos by their keyframes. #227
  • video_resize_aspect_ratio_mapper: resizes videos so that their aspect ratios (width divided by height, r=w/h) fall within a specified range. #227
  • video_resize_resolution_mapper: maps videos to ones with a given resolution range. #227
  • video_ffmpeg_wrapped_mapper: a wrapper to apply ffmpeg to video data more conveniently. #227
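
Splitting by a fixed duration interval, as one of the mappers above does, comes down to computing clip boundaries. The sketch below shows only that arithmetic. The name `split_intervals` and the `min_last_duration` option for dropping a short trailing clip are assumptions made for illustration, not the operator's real parameters.

```python
def split_intervals(total_duration, split_duration, min_last_duration=0.0):
    """Return (start, end) pairs that cut [0, total_duration] into fixed-length clips.

    Hypothetical helper for illustration only; not part of Data-Juicer.
    Clips shorter than min_last_duration are dropped; with a fixed
    split_duration that can only be the trailing remainder.
    """
    if split_duration <= 0:
        raise ValueError("split_duration must be positive")
    intervals = []
    index = 0
    # Multiply by the index instead of accumulating to limit float drift.
    while index * split_duration < total_duration:
        start = index * split_duration
        end = min(start + split_duration, total_duration)
        if end - start >= min_last_duration:
            intervals.append((start, end))
        index += 1
    return intervals


print(split_intervals(25, 10))
print(split_intervals(25, 10, min_last_duration=6))
```

The resulting boundaries could then be handed to a tool such as ffmpeg to cut the actual clips.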

Deduplicator

  • video_deduplicator: deduplicates samples at document-level using exact matching of videos between documents. #227
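
Exact-match deduplication can be sketched by hashing each video's raw bytes and keeping only the first sample per digest. The sample layout (a dict with a `video_bytes` key), the function name `dedup_by_content`, and the choice of SHA-256 are all assumptions for illustration; the release notes do not say how video_deduplicator compares videos internally.

```python
import hashlib


def dedup_by_content(samples):
    """Keep the first sample for each distinct video byte content.

    Hypothetical helper for illustration only; not part of Data-Juicer.
    Each sample is assumed to be a dict holding raw bytes under "video_bytes".
    """
    seen = set()
    kept = []
    for sample in samples:
        digest = hashlib.sha256(sample["video_bytes"]).hexdigest()
        if digest in seen:
            continue  # byte-identical video already kept
        seen.add(digest)
        kept.append(sample)
    return kept


samples = [
    {"id": "a", "video_bytes": b"\x00\x01\x02"},
    {"id": "b", "video_bytes": b"\x00\x01\x02"},  # exact duplicate of "a"
    {"id": "c", "video_bytes": b"\x00\x01\x03"},
]
print([s["id"] for s in dedup_by_content(samples)])
```

Hashing catches only byte-identical files; re-encoded or trimmed copies of the same video would not match.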

Audio

  • audio_duration_filter: keeps samples whose audios' durations are within a specified range. #177
  • audio_size_filter: keeps samples whose audios' sizes are within a specified range. #184
  • audio_nmf_snr_filter: keeps samples whose audios' signal-to-noise ratios (SNRs, computed with the Non-negative Matrix Factorization algorithm) are within a specified range. #189
  • audio_ffmpeg_wrapped_mapper: a wrapper to apply ffmpeg to audio data more conveniently. #227

Image

  • image_blur_mapper: adds random noise to images to blur them. #180
  • image_aesthetics_filter: filters samples according to the aesthetics scores of images. #227

Document Updates

  • "Bad" Data Exhibition EN ZH: shows how Data-Juicer finds "bad" data and what they look like.
  • Awesome LLM Data EN: a collection of awesome LLM datasets with fine-grained tags.
  • Developer Guide enhancement EN ZH: adds guides on how to accelerate the models in your OP with GPUs and how to implement a batched OP for sample augmentation. #203 #220
  • OP Insight Visualization Demo code: adds a demo to visualize how each OP works.

Bugs Fixed

  • Fix stats computation error in the ray mode due to the inappropriate initialization method. #173
  • Fix the bug that some images will be lost when converting their paths to absolute paths. #178
  • Fix the dependency problems of OPs that depend on other OPs. #181
  • Fix the bug that the predict.py tool gets stuck on the help page. #183
  • Fix face_area_filter: constrains the detection coordinates within the image. #202
  • Fix MMC4 conversion tools: resolves the situation where multiple images match the same sentence. #195
  • Fix or update invalid links in Data-Juicer. #201 #219
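
The face_area_filter fix in this list constrains detection coordinates to the image. A minimal sketch of that idea follows, assuming boxes are given as `(x1, y1, x2, y2)` pixel corners. The helpers `clamp_box` and `area_ratio` are hypothetical and do not reproduce the actual patch in #202.

```python
def clamp_box(box, image_width, image_height):
    """Clip an (x1, y1, x2, y2) box so every corner lies inside the image.

    Hypothetical helper for illustration only; not part of Data-Juicer.
    """
    x1, y1, x2, y2 = box
    x1 = max(0, min(x1, image_width))
    x2 = max(0, min(x2, image_width))
    y1 = max(0, min(y1, image_height))
    y2 = max(0, min(y2, image_height))
    return (x1, y1, x2, y2)


def area_ratio(box, image_width, image_height):
    """Fraction of the image covered by the clipped box."""
    x1, y1, x2, y2 = clamp_box(box, image_width, image_height)
    return (x2 - x1) * (y2 - y1) / (image_width * image_height)


# A detection that spills over the left and right edges of a 100x80 image.
box = (-10, 20, 120, 80)
print(clamp_box(box, 100, 80))
print(area_ratio(box, 100, 80))
```

Without clipping, the example box would give (130 * 60) / (100 * 80) = 0.975 instead of 0.75, and a larger overflow could push the ratio above 1.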

Others

  • Optimize the model management module. #196 #227
  • Optimize the unit test actions. #195 #196 #216 #227
  • Optimize the multiprocessing strategy; GPU support can also increase model inference efficiency. #203 #217 #222 #227
  • Update the docker image with JDK. #208
  • Support more multimodal (video) dataset conversion tools: #227
    • InternVid: 234M video-caption data
    • Youku-mPLUG: 36TB video-caption data
    • Video-ChatGPT: 100k video-instruction data
  • Optimize the generated multimodal data storage. #227
  • Support running data-juicer process jobs on Aliyun PAI-DLC. #227
  • Better support for multi-machine distributed data processing in Ray mode. #227

Acknowledgment

Here we thank public contributors for their PRs to make Data-Juicer better!