Improve performance of JSON loader #6867
Comments
Thanks! Feel free to ping me for examples. May not respond immediately because we're all busy but would like to help.
Hi @natolambert, could you please give some examples of JSON files to benchmark? Please note that this JSON file (https://huggingface.co/datasets/allenai/reward-bench-results/blob/main/eval-set-scores/Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback.json) is not in "records" orient; instead it has the following structure:
{
"chat_template": "tulu",
"id": [30, 34, 35,...],
"model": "Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback",
"model_type": "Seq. Classifier",
"results": [1, 1, 1, ...],
"scores_chosen": [4.421875, 1.8916015625, 3.8515625,...],
"scores_rejected": [-2.416015625, -1.47265625, -0.9912109375,...],
"subset": ["alpacaeval-easy", "alpacaeval-easy", "alpacaeval-easy",...],
"text_chosen": ["<s>[INST] How do I detail a...",...],
"text_rejected": ["<s>[INST] How do I detail a...",...]
}
Note that "records" orient should be a list (not a dict) with each row as one item of the list:
[
{"chat_template": "tulu", "id": 30,... },
{"chat_template": "tulu", "id": 34,... },
...
]
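To make the difference between the two layouts concrete, here is a minimal sketch that turns a column-oriented dict like the one above into "records" orient. The helper name columns_to_records and the toy data are made up for illustration; it assumes every list-valued key is a column of equal length and every scalar key (such as "model") should be repeated in each row.

```python
import json


def columns_to_records(obj):
    """Convert a column-oriented dict into a list of row dicts.

    List values are treated as columns; scalar values are copied
    into every row. (Hypothetical helper, not part of any library.)
    """
    columns = {k: v for k, v in obj.items() if isinstance(v, list)}
    scalars = {k: v for k, v in obj.items() if not isinstance(v, list)}

    lengths = {len(v) for v in columns.values()}
    if len(lengths) > 1:
        raise ValueError(f"columns have different lengths: {sorted(lengths)}")
    n_rows = lengths.pop() if lengths else 1

    return [
        {**scalars, **{k: v[i] for k, v in columns.items()}}
        for i in range(n_rows)
    ]


# Toy data shaped like the file above (values shortened for illustration)
column_oriented = {
    "chat_template": "tulu",
    "id": [30, 34, 35],
    "results": [1, 1, 1],
}

records = columns_to_records(column_oriented)
records_json = json.dumps(records)  # a JSON array, i.e. "records" orient
```

The resulting string starts with "[" and holds one object per row, which is the shape described above.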
We use a mix (which is a mess). Here's an example with the records orient. There are more in that folder, ~40 MB maybe?
@albertvillanova here's a snippet so you don't need to click
Thanks again for your feedback, @natolambert. However, strictly speaking, the last file is not in JSON format but in a kind of JSON-Lines-like format (although not properly that either, because there are multiple newline characters within each object). Not even pandas can read that file format.
Anyway, for JSON-Lines, I would expect that [...]
A proper JSON file in records orient should be a list (a JSON array): the first character should be "[".
Anyway, I am generating a JSON file from your JSON-Lines file to test performance.
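For reference, one way such a conversion could be done (a sketch only, not necessarily how the test file was actually generated): the standard library's json.JSONDecoder.raw_decode can read back-to-back JSON objects even when each object spans several lines. The helper name iter_concatenated_json and the sample string are invented for illustration.

```python
import json


def iter_concatenated_json(text):
    """Yield each top-level JSON value from a string of back-to-back values."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so skip it here
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # raw_decode returns the parsed value and the index where it ended
        value, pos = decoder.raw_decode(text, pos)
        yield value


# Two pretty-printed objects, one after the other (not valid JSON as a whole,
# and not valid JSON-Lines either, since each object contains newlines)
sample = '{\n  "id": 30,\n  "text": "a"\n}\n{\n  "id": 34,\n  "text": "b"\n}\n'

records = list(iter_concatenated_json(sample))
as_json_array = json.dumps(records)  # proper "records" orient: starts with "["
```

This reads the whole input into memory, so it is only meant for files of modest size like the ones discussed here.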
As reported by @natolambert, loading regular JSON files with datasets shows poor performance.
The cause is that we use the json Python standard library instead of other faster libraries. See my old comment: #2638 (review)
I remember having a discussion about this and it was decided that it was better not to include an additional dependency on a 3rd-party library.
However:
- we depend on pandas, and pandas depends on ujson: so we have an indirect dependency on ujson
- we could add ujson as an optional extra dependency, and check at runtime if it is installed to decide which library to use, either json or ujson
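The runtime check in the second point could look roughly like the sketch below. This is only an illustration of the optional-dependency pattern, not the actual datasets implementation; the names loads_fast and JSON_BACKEND are hypothetical.

```python
import json

try:
    import ujson  # optional third-party package; may not be installed
except ImportError:
    ujson = None

# Record which backend was picked, e.g. for logging or debugging
JSON_BACKEND = "ujson" if ujson is not None else "json"


def loads_fast(data):
    """Parse a JSON string with ujson when available, else with json."""
    if ujson is not None:
        return ujson.loads(data)
    return json.loads(data)


parsed = loads_fast('{"id": [30, 34, 35], "chat_template": "tulu"}')
```

With this pattern the standard library remains the default, and installing ujson only changes which parser is used, not the calling code.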