Releases: huggingface/datasets
2.18.0
Dataset features
- Make JSON builder support an array of strings by @albertvillanova in #6696
- Base parquet batch_size on parquet row group size by @lhoestq in #6701
  - Faster cold start for streaming
- Change default compression argument for JsonDatasetWriter by @Rexhaif in #6659
- Automatic Conversion for uint16/uint32 to Compatible PyTorch Dtypes by @mohalisad in #6660
- fsspec: support fsspec>=2023.12.0 glob changes by @pmrowla in #6687
  - Support latest fsspec up to 2024.2.0
General improvements and bug fixes
- Fix for Incorrect ex_iterable used with multi num_worker by @kq-chen in #6582
  - Previously, using PyTorch DDP together with num_workers could lead to incorrect shard assignments to workers and cause errors
- Fix imagefolder dataset url by @mariosasko in #6683
- Improve error message for gated datasets on load by @lewtun in #6684
- Updated Quickstart Notebook link by @Codeblockz in #6685
- Update the print message for chunked_dataset in process.mdx by @gzbfgjf2 in #6693
- Faster xlistdir by @mariosasko in #6698
- Update GitHub Actions to Node 20 by @albertvillanova in #6682
- Update release instructions by @albertvillanova in #6681
- Pass through information about location of cache directory. by @stridge-cruxml in #6677
- Allow SplitDict setitem to replace existing SplitInfo by @lhoestq in #6665
- Update ruff by @lhoestq in #6706
- Silence ruff deprecation messages by @mariosasko in #6707
- fix: show correct package name to install biopython by @BioGeek in #6662
- Fix data_files when passing data_dir by @lhoestq in #6705
- Release: 2.18.0 by @lhoestq in #6708
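The DDP fix in #6582 concerns how shards are divided among ranks and DataLoader workers. As a rough illustration of what a correct assignment has to guarantee (every shard is read exactly once), the sketch below splits shard indices first by rank and then by worker. The function name `assign_shards` and the round-robin scheme are assumptions made for illustration, not the library's actual implementation.

```python
from typing import List


def assign_shards(num_shards: int, world_size: int, rank: int,
                  num_workers: int, worker_id: int) -> List[int]:
    """Hypothetical helper: shard indices for one (rank, worker) pair.

    Assumes a simple round-robin split, first across ranks and then
    across the workers that belong to each rank.
    """
    # Shards owned by this rank: rank, rank + world_size, ...
    rank_shards = list(range(rank, num_shards, world_size))
    # Among those, this worker takes every num_workers-th shard.
    return rank_shards[worker_id::num_workers]


if __name__ == "__main__":
    # Print the assignment for 8 shards, 2 ranks and 2 workers per rank.
    for rank in range(2):
        for worker_id in range(2):
            print(rank, worker_id, assign_shards(8, 2, rank, 2, worker_id))
```

Whatever scheme is used, the assignments across all (rank, worker) pairs should be disjoint and together cover every shard.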
New Contributors
- @Codeblockz made their first contribution in #6685
- @gzbfgjf2 made their first contribution in #6693
- @stridge-cruxml made their first contribution in #6677
- @pmrowla made their first contribution in #6687
- @BioGeek made their first contribution in #6662
- @Rexhaif made their first contribution in #6659
- @mohalisad made their first contribution in #6660
- @kq-chen made their first contribution in #6582
Full Changelog: 2.17.1...2.18.0
2.17.1
Bug Fixes
- Revert the changes in arrow_writer.py from #6636 by @bryant1410 in #6664
- Remove deprecated verbose parameter from CSV builder by @albertvillanova in #6672
Full Changelog: 2.17.0...2.17.1
2.17.0
Dataset Features
- [WebDataset] Audio support and bug fixes by @lhoestq in #6573
- Add concurrent loading of shards to datasets.load_from_disk by @kkoutini in #6464
- Support data_dir parameter in push_to_hub by @albertvillanova in #6634
- Support push_to_hub without org/user to default to logged-in user by @albertvillanova in #6629
- Allow concatenation of datasets with mixed structs by @Dref360 in #6587
General improvements and bug fixes
- Fix parallel downloads for datasets without scripts by @lhoestq in #6551
- Fix imagefolder with one image by @lhoestq in #6556
- Fix tests based on datasets that used to have scripts by @lhoestq in #6574
- remove eli5 test by @lhoestq in #6583
- [IterableDataset] Fix drop_last_batch in map after shuffling or sharding by @lhoestq in #6575
- Support standalone yaml by @lhoestq in #6557
- Drop redundant None guard. by @xkszltl in #6596
- fix os.listdir return name is empty string by @d710055071 in #6581
- Fix CI: pyarrow 15, pandas 2.2 and sqlalchemy by @lhoestq in #6617
- Dedicated RNG object for fingerprinting by @mariosasko in #6606
- Migrate from setup.cfg to pyproject.toml by @mariosasko in #6619
- keep more info in DatasetInfo.from_merge #6585 by @JochenSiegWork in #6586
- Read GeoParquet files using parquet reader by @weiji14 in #6508
- Use schema metadata only if it matches features by @lhoestq in #6616
- Raise error on bad split name by @lhoestq in #6626
- Disable tqdm bars in non-interactive environments by @mariosasko in #6627
- Add with_rank param to Dataset.filter by @mariosasko in #6608
- Bump max range of dill to 0.3.8 by @ringohoffman in #6630
- Fix filelock: use current umask for filelock >= 3.10 by @lhoestq in #6631
- Faster webdataset streaming by @lhoestq in #6578
- Multi gpu docs by @lhoestq in #6550
- dataset viewer requires no-script by @severo in #6633
- Make split slicing consistent with list slicing by @mariosasko in #5891
- Do not use Parquet exports if revision is passed by @albertvillanova in #6555
- Make CLI test support multi-processing by @albertvillanova in #6628
- Fix reload cache with data dir by @lhoestq in #6632
- Fix array cast/embed with null values by @mariosasko in #6283
- Faster column validation and reordering by @psmyth94 in #6636
- Better multi-gpu example by @lhoestq in #6646
- Fix missing info when loading some datasets from Parquet export by @lhoestq in #6635
- Minor multi gpu doc improvement by @lhoestq in #6649
- Document usage of hfh cli instead of git by @lhoestq in #6648
New Contributors
- @xkszltl made their first contribution in #6596
- @kkoutini made their first contribution in #6464
- @JochenSiegWork made their first contribution in #6586
- @weiji14 made their first contribution in #6508
- @ringohoffman made their first contribution in #6630
- @psmyth94 made their first contribution in #6636
Full Changelog: 2.16.1...2.17.0
2.16.1
Bug fixes
- Fix dl_manager.extract returning FileNotFoundError by @lhoestq in #6543
  - Fix bug causing FileNotFoundError when passing a relative directory as cache_dir to load_dataset
- Fix custom configs from script by @lhoestq in #6544
  - Fix bug where loading a dataset with a loading script using custom arguments would fail
    - e.g. load_dataset("ted_talks_iwslt", language_pair=("ja", "en"), year="2015")
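On versions affected by the relative cache_dir bug above, one possible workaround is to resolve the directory to an absolute path before passing it to load_dataset. This is a minimal sketch: the helper name `absolute_cache_dir` is hypothetical, and the assumption that an absolute path avoids the problem has not been verified against every affected version.

```python
from pathlib import Path


def absolute_cache_dir(cache_dir: str) -> str:
    """Hypothetical helper: expand ~ and make cache_dir absolute."""
    # resolve() works even if the directory does not exist yet.
    return str(Path(cache_dir).expanduser().resolve())


# The result could then be passed as load_dataset(..., cache_dir=...).
print(absolute_cache_dir("my_cache"))
```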
Full Changelog: 2.16.0...2.16.1
2.16.0
Security features
- Add trust_remote_code argument by @lhoestq in #6429
  - Some Hugging Face datasets contain custom code which must be executed to correctly load the dataset. The code can be inspected in the repository content at https://hf.co/datasets/<repo_id>. A warning is shown to let the user know about the custom code, and they can avoid this message in the future by passing the argument trust_remote_code=True.
  - Passing trust_remote_code=True will be mandatory to load these datasets from the next major release of datasets.
  - Using the environment variable HF_DATASETS_TRUST_REMOTE_CODE=0 you can already disable custom code by default without waiting for the next release of datasets.
- Use parquet export if possible by @lhoestq in #6448
  - This allows loading most old datasets based on custom code by downloading the Parquet export provided by Hugging Face
  - You can see a dataset's Parquet export at https://hf.co/datasets/<repo_id>/tree/refs%2Fconvert%2Fparquet
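The two security items above can be sketched as follows. The helper names `parquet_export_url` and `load_with_custom_code` are hypothetical; only the trust_remote_code argument, the HF_DATASETS_TRUST_REMOTE_CODE variable and the URL pattern come from these notes. The import of datasets is kept inside the function so that the snippet runs even where the library is not installed, and it is an assumption that the environment variable needs to be set before datasets is first imported.

```python
import os
from urllib.parse import quote

# Branch that holds the Parquet export; it is percent-encoded in the URL.
PARQUET_BRANCH = "refs/convert/parquet"


def parquet_export_url(repo_id: str) -> str:
    """Hypothetical helper: URL of a dataset's Parquet export."""
    return f"https://hf.co/datasets/{repo_id}/tree/{quote(PARQUET_BRANCH, safe='')}"


def load_with_custom_code(repo_id: str):
    """Hypothetical helper: load a script-based dataset, opting in to its code."""
    from datasets import load_dataset  # needs datasets >= 2.16.0

    return load_dataset(repo_id, trust_remote_code=True)


# Disable custom code by default for this process (assumed to take effect
# only if set before datasets is imported).
os.environ["HF_DATASETS_TRUST_REMOTE_CODE"] = "0"

print(parquet_export_url("<repo_id>"))
```

Inspecting the repository content before opting in with trust_remote_code=True remains the safer order of operations.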
Features
- Webdataset dataset builder by @lhoestq in #6391
- Implement get dataset default config name by @albertvillanova in #6511
- Lazy data files resolution and offline cache reload by @lhoestq in #6493
  - This speeds up the load_dataset step that lists the data files of big repositories (up to 100x) but requires huggingface_hub 0.20 or newer
  - Fix load_dataset that used to reload data from cache even if the dataset was updated on Hugging Face
  - Reload a dataset from your cache even if you don't have an internet connection
  - New cache directory scheme for no-script datasets: ~/.cache/huggingface/datasets/username___dataset_name/config_name/version/commit_sha
  - Backward compatibility: cached datasets from datasets 2.15 (using the old scheme) are still reloaded from cache
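To make the new cache directory scheme for no-script datasets concrete, the sketch below assembles a path with the layout given above. The helper `no_script_cache_dir` and the example values are hypothetical; the library computes these paths internally, and this only mirrors the documented layout.

```python
from pathlib import Path


def no_script_cache_dir(username: str, dataset_name: str, config_name: str,
                        version: str, commit_sha: str,
                        base: str = "~/.cache/huggingface/datasets") -> Path:
    """Hypothetical helper mirroring the documented directory layout."""
    # The user name and dataset name are joined by three underscores.
    repo_dir = f"{username}___{dataset_name}"
    return Path(base).expanduser() / repo_dir / config_name / version / commit_sha


# Example values are placeholders, not a real dataset.
print(no_script_cache_dir("user", "my_dataset", "default", "0.0.0", "abc123"))
```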
General improvements and bug fixes
- Remove unused argument in _get_data_files_patterns by @lhoestq in #6343
- Set usedforsecurity=False in hashlib methods (FIPS compliance) by @Wauplin in #6414
- Use ruff for formatting by @mariosasko in #6434
- Create DatasetNotFoundError and DataFilesNotFoundError by @albertvillanova in #6431
- Fix multi gpu map example by @lhoestq in #6415
- Better tqdm wrapper by @mariosasko in #6433
- Remove Table.__getstate__ and Table.__setstate__ by @LZHgrla in #6444
- Use filelock package for file locking by @mariosasko in #6445
- Fix metadata file resolution when inferred pattern is ** by @mariosasko in #6449
- Update hub-docs reference by @mishig25 in #6453
- Refactor dill logic by @mariosasko in #6454
- Don't require trust_remote_code in inspect_dataset by @lhoestq in #6456
- [docs] troubleshooting guide by @MKhalusova in #6424
- Missing DatasetNotFoundError by @lhoestq in #6462
- Disable benchmarks in PRs by @lhoestq in #6463
- More robust temporary directory deletion by @mariosasko in #6426
- Fix shard retry mechanism in push_to_hub by @mariosasko in #6461
- Use auth to get parquet export by @lhoestq in #6468
- Remove delete doc CI by @lhoestq in #6471
- Fix CI quality by @albertvillanova in #6473
- Fix PermissionError on Windows CI by @albertvillanova in #6477
- More robust preupload retry mechanism by @mariosasko in #6479
- Add IterableDataset __repr__ by @lhoestq in #6480
- Fix max lock length on unix by @lhoestq in #6482
- Fix ArrayXD YAML conversion by @mariosasko in #6168
- Fix docs phrasing about supported formats when sharing a dataset by @albertvillanova in #6486
- Fix deprecation warning when building conda package by @albertvillanova in #6425
- Make push_to_hub return CommitInfo by @albertvillanova in #6492
- docs: add reference Git over SSH by @severo in #6499
- Fallback on dataset script if user wants to load default config by @lhoestq in #6498
- Don't expand_info in HF glob by @lhoestq in #6469
- Fix streaming xnli by @lhoestq in #6503
- Pickle support for torch.Generator objects by @mariosasko in #6502
- Enable setting config as default when push_to_hub by @albertvillanova in #6500
- Better cast error when generating dataset by @lhoestq in #6509
- Replace list_files_info with list_repo_tree in push_to_hub by @mariosasko in #6510
- Remove deprecated HfFolder by @lhoestq in #6512
- Support huggingface-hub pre-releases by @albertvillanova in #6516
- Support push_to_hub canonical datasets by @albertvillanova in #6519
- Support commit_description parameter in push_to_hub by @albertvillanova in #6520
- fix get_metadata_patterns function args error by @d710055071 in #6518
- Fix metrics dead link by @qgallouedec in #6491
- fix tests by @lhoestq in #6523
- Cache backward compatibility with 2.15.0 by @lhoestq in #6514
- Preserve order of configs and splits when using Parquet exports by @albertvillanova in #6526
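The usedforsecurity=False change (#6414) in the list above relies on a keyword that Python's hashlib constructors accept since Python 3.9. The sketch below shows the general pattern; the helper name `fingerprint` and the choice of MD5 are assumptions for illustration and do not reflect which hash functions the library uses internally.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Hypothetical helper: MD5 digest used as a non-security fingerprint."""
    # usedforsecurity=False (Python 3.9+) declares that the digest is not
    # used in a security context, so restricted builds (for example FIPS
    # ones) can still allow the algorithm.
    return hashlib.md5(data, usedforsecurity=False).hexdigest()


print(fingerprint(b"hello"))
```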
New Contributors
- @LZHgrla made their first contribution in #6444
- @d710055071 made their first contribution in #6518
Full Changelog: 2.15.0...2.16.0
2.15.0
What's Changed
- Fix typo in Audio dataset documentation by @prassanna-ravishankar in #6222
- Add push_to_hub with multiple configs docs by @lhoestq in #6226
- Remove RGB -> BGR image conversion in Object Detection tutorial by @mariosasko in #6228
- Update README.md by @NinoRisteski in #6233
- Don't skip hidden files in dl_manager.iter_files when they are given as input by @mariosasko in #6230
- Update README.md by @NinoRisteski in #6223
- Remove unused global variables in audio.py by @mariosasko in #6241
- Improve error message for missing function parameters by @suavemint in #6232
- Fix cast from fixed size list to variable size list by @mariosasko in #6243
- Update create_dataset.mdx by @EswarDivi in #6247
- [DOCS] Fix typo: Elasticsearch by @leemthompo in #6258
- Support streaming datasets with pyarrow.parquet.read_table by @albertvillanova in #6251
- Temporarily pin tensorflow < 2.14.0 by @albertvillanova in #6264
- Fix CI 404 errors by @albertvillanova in #6262
- Remove apache_beam import in BeamBasedBuilder._save_info by @mariosasko in #6265
- Improve documentation of dataset.from_generator by @hartmans in #6281
- Fix parquet columns argument in streaming mode by @lhoestq in #6295
- Doc readme improvements by @mariosasko in #6298
- Unpin tensorflow maximum version by @mariosasko in #6301
- Unpin jax maximum version by @mariosasko in #6300
- Fix ArrayXD cast by @mariosasko in #6297
- Reduce the number of commits in push_to_hub by @mariosasko in #6269
- Fix typo in code example in docs by @bryant1410 in #6307
- Update README.md by @smty2018 in #6304
- Deterministic set hash by @lhoestq in #6318
- docs: resolving namespace conflict, refactored variable by @smty2018 in #6312
- Fix typos by @python273 in #6321
- Fix commit message formatting in multi-commit uploads by @qgallouedec in #6313
- Temporarily pin fsspec < 2023.10.0 by @albertvillanova in #6331
- Unpin fsspec by @lhoestq in #6336
- Fix use_dataset.mdx by @angel-luis in #6351
- Add fsspec version to the datasets-cli env command output by @mariosasko in #6356
- Expanduser in save_to_disk() by @Unknown3141592 in #6098
- Fix time measuring snippet in docs by @mariosasko in #6367
- Temporarily pin pyarrow < 14.0.0 by @albertvillanova in #6375
- Fix typo in Dataset.map docstring by @bryant1410 in #6373
- Avoid redundant warning when encoding NumPy array as Image by @mariosasko in #6379
- Replace deprecated license_file in setup.cfg by @albertvillanova in #6332
- Minor release step improvement by @lhoestq in #6339
- Fix dependency conflict within CI build documentation by @albertvillanova in #6411
- Remove redundant condition in builders by @albertvillanova in #6398
- Handle future deprecation argument by @winglian in #6390
- Remove token value from warnings by @mariosasko in #6418
- Rename audio_classificiation.py to audio_classification.py by @carlthome in #6416
- Add pyarrow-hotfix to release docs by @albertvillanova in #6421
- Simplify filesystem logic by @mariosasko in #6362
- Fix conda release by adding pyarrow-hotfix dependency by @albertvillanova in #6423
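The "Deterministic set hash" entry (#6318) in the list above addresses the fact that a set's iteration order can vary between Python processes, which makes naive hashing of its contents unstable. One common way to get a stable hash, shown here as a sketch and not necessarily the approach taken in that PR, is to put the elements in a canonical order before hashing. The helper name `stable_set_hash` is hypothetical.

```python
import hashlib


def stable_set_hash(items: set) -> str:
    """Hypothetical helper: hash a set independently of iteration order."""
    # Sorting the elements' reprs fixes the order before hashing, and also
    # works for sets whose elements cannot be compared with each other.
    canonical = repr(sorted(repr(item) for item in items))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two sets with the same elements hash identically.
print(stable_set_hash({"b", "a", "c"}) == stable_set_hash({"c", "a", "b"}))
```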
New Contributors
- @prassanna-ravishankar made their first contribution in #6222
- @NinoRisteski made their first contribution in #6233
- @suavemint made their first contribution in #6232
- @EswarDivi made their first contribution in #6247
- @leemthompo made their first contribution in #6258
- @hartmans made their first contribution in #6281
- @smty2018 made their first contribution in #6304
- @python273 made their first contribution in #6321
- @angel-luis made their first contribution in #6351
- @Unknown3141592 made their first contribution in #6098
- @winglian made their first contribution in #6390
- @carlthome made their first contribution in #6416
Full Changelog: 2.14.7...2.15.0
2.14.7
Bug Fixes
- Fix UnboundLocalError if preprocessing returns an empty list by @cwallenwein in #6346
- Fix python formatting for complex types in format_table by @mariosasko in #6368
- Support pyarrow 14.0.0 by @albertvillanova in #6378
- Do not try to download from HF GCS for generator by @yundai424 in #6372
- Support pyarrow 14.0.1 and fix vulnerability CVE-2023-47248 by @albertvillanova in #6404
New Contributors
- @cwallenwein made their first contribution in #6346
- @yundai424 made their first contribution in #6372
Full Changelog: 2.14.6...2.14.7
2.14.6
What's Changed
- Ignore dataset_info.json in data files resolution by @mariosasko in #6224
- Check builder cls default config name in inspect by @lhoestq in #6253
- Add support for fsspec>=2023.9.0 by @mariosasko in #6244
- Create DefunctDatasetError by @albertvillanova in #6286
- Fix get_data_patterns for directories with the word data twice by @albertvillanova in #6309
- Fix loading Hub datasets with CSV metadata file by @albertvillanova in #6316
- datasets.filesystems: fix is_remote_filesystems by @ap-- in #6334
- Pin upper version of fsspec by @albertvillanova in #6337
- Fix regex get_data_files formatting for base paths by @ZachNagengast in #6322
New Contributors
- @ap-- made their first contribution in #6334
- @ZachNagengast made their first contribution in #6322
Full Changelog: 2.14.5...2.14.6
2.14.5
Bug fixes
- Bump fsspec from 2021.11.1 to 2022.3.0 by @mariosasko in #6091
- Minor fix in iter_files for hidden files by @mariosasko in #6092
- Use yaml instead of get data patterns when possible by @lhoestq in #6154
- Fix Parquet loading with columns by @mariosasko in #6160
- Fix: Missing a MetadataConfigs init when the repo has a datasets_info.json but no README by @clefourrier in #6164
- PyArrow 13 CI fixes by @mariosasko in #6175
- Don't alter input in Features.from_dict by @lhoestq in #6189
- Fix multiprocessing with spawn in iterable datasets by @Hubert-Bonisseur in #6165
- Set minimal fsspec version requirement to 2023.1.0 by @mariosasko in #6192
- Temporarily pin pandas < 2.1.0 by @albertvillanova in #6200
- Preserve split order in DataFilesDict by @albertvillanova in #6198
- Add missing revision argument by @qgallouedec in #6191
- Temporarily pin fsspec < 2023.9.0 by @albertvillanova in #6210
- Do not filter out .zip extensions from no-script datasets by @albertvillanova in #6208
- Fix empty splitinfo json by @lhoestq in #6211
- Fix to_json ValueError and remove pandas pin by @albertvillanova in #6201
- Fix checking patterns to infer packaged builder by @polinaeterna in #6215
- Rename old push_to_hub configs to "default" in dataset_infos by @lhoestq in #6218
Other improvements
- Deprecate Dataset.export by @mariosasko in #6081
- Deprecate download_custom by @mariosasko in #6093
- Ignore CI lint rule violation in Pickler.memoize by @albertvillanova in #6138
- Remove unused allowed_extensions param by @albertvillanova in #6135
- Export to_iterable_dataset to document by @npuichigo in #6145
- [Docs] Add description of select_columns to guide by @unifyh in #6119
- Ignore parallel warning in map_nested by @lhoestq in #6148
- [docs] Complete to_iterable_dataset by @stevhliu in #6158
- Raise FileNotFoundError when passing data_files that don't exist by @lhoestq in #6155
- Fix typo in about_mapstyle_vs_iterable.mdx by @lhoestq in #6171
- Document BUILDER_CONFIG_CLASS by @lhoestq in #6166
- Fix import in image_load doc by @mariosasko in #6181
- Use object detection images from huggingface/documentation-images by @mariosasko in #6177
- Use hf-internal-testing repos for hosting test dataset repos by @mariosasko in #6180
New Contributors
- @npuichigo made their first contribution in #6145
- @unifyh made their first contribution in #6119
Full Changelog: 2.14.4...2.14.5
2.13.2
Bug fixes
- Do not filter out .zip extensions from no-script datasets by @albertvillanova in #6208
Full Changelog: 2.13.1...2.13.2