Get random access to the rows #13

severo · 2021-08-26T08:21:34Z

Currently, only the first rows can be obtained with /rows. We want to get access to slices of the rows through pagination, eg /rows?from=40000&rows=10
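To make the request shape concrete, here is a minimal client-side sketch of building such a paginated URL. The parameter names `from` and `rows` come from the example above; the helper name `rows_page_url` and the host `example.invalid` are placeholders for illustration, not part of any existing API.

```python
from urllib.parse import urlencode


def rows_page_url(base_url: str, start: int, count: int) -> str:
    """Build a /rows URL asking for `count` rows starting at row index `start`.

    Hypothetical helper: only the `from` and `rows` parameter names are taken
    from the proposal in this comment.
    """
    if start < 0 or count <= 0:
        raise ValueError("start must be >= 0 and count must be > 0")
    query = urlencode({"from": start, "rows": count})
    return f"{base_url}/rows?{query}"


# Placeholder host, used only for the example.
url = rows_page_url("https://example.invalid", 40000, 10)
print(url)
```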

severo · 2022-01-31T21:41:52Z

Currently, some datasets are stored entirely on disk (those that cannot be streamed and are small), and they could be queried for ranges of rows or used to produce statistics.
But that is a fallback, not the normal mode, and if we implement this, we would want it to work for all the datasets if possible.

severo · 2022-06-17T15:41:14Z

Another Hub that gives random access to the data: https://app.activeloop.ai/activeloop/coco-train (right panel)

github-actions · 2022-09-16T15:20:15Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

severo · 2022-12-06T10:39:33Z

Design thoughts (thanks to discussion with @lhoestq):

the endpoint will be GET /rows, with required parameters dataset, config, split and optional parameters to control the number of results, eg: ~~from and limit?~~ offset and length (see Get random access to the rows #13 (comment))
the API service will request the parquet files (if they exist), and return the formatted JSON. Beware: it might take some time, in particular if the rows are very big (every row can contain several MB of JSON) so we should not impose a timeout on the response.
to be discussed: what to do with the audio files and images? They are stored as a struct with two fields: bytes (the raw bytes) and path (the filename). We could return a temporary URL in the /rows JSON, and then generate a temporary static file when the client calls that URL (again: without a timeout, because generating the temporary file can take some time). For images, we could maybe depend on an external service like Cloudinary.
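One way to picture how the API service could turn offset and length into reads on the parquet files is sketched below. It assumes the service already knows the number of rows in each parquet file of the split (for instance from the parquet metadata). The names `FileSlice` and `plan_slices` are hypothetical, and this is only an illustration of the arithmetic, not the implementation that was eventually shipped.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileSlice:
    file_index: int    # position of the parquet file within the split
    local_offset: int  # first row to read inside that file
    local_length: int  # number of rows to read from that file


def plan_slices(rows_per_file: List[int], offset: int, length: int) -> List[FileSlice]:
    """Map a global (offset, length) range onto per-file row ranges."""
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    end = offset + length
    slices: List[FileSlice] = []
    file_start = 0
    for index, num_rows in enumerate(rows_per_file):
        file_end = file_start + num_rows
        # Intersect the requested range with the rows held by this file.
        lo = max(offset, file_start)
        hi = min(end, file_end)
        if lo < hi:
            slices.append(FileSlice(index, lo - file_start, hi - lo))
        file_start = file_end
        if file_start >= end:
            break  # every later file starts past the requested range
    return slices


# Three files of 100, 50 and 200 rows; ask for 70 rows starting at row 90.
for piece in plan_slices([100, 50, 200], offset=90, length=70):
    print(piece)
```

Only the files that intersect the requested range need to be opened, which matters when a single row can hold several MB of data.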

lhoestq · 2022-12-07T11:43:34Z

> the endpoint will be GET /rows, with required parameters dataset, config, split and optional parameters to control the number of results, eg: from and limit?

for consistency with the arrow terminology you can use offset and length

mariosasko · 2023-01-24T13:47:46Z

> the API service will request the parquet files (if they exist), and return the formatted JSON. Beware: it might take some time, in particular if the rows are very big (every row can contain several MB of JSON) so we should not impose a timeout on the response.

Maybe we could eventually switch to Apache Flight to make this performant.

A nice article that compares Apache Flight and the standard approach based on returning a binary format over HTTP: https://voltrondata.com/resources/data-transfer-at-the-speed-of-flight

severo · 2023-01-24T14:31:45Z

We could provide two response formats for the /rows endpoint, depending on what the client wants:

JSON
Arrow Flight

We could use the Accept header, or a query parameter, to select the format of the response.
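A rough sketch of that selection logic follows. Several things here are assumptions rather than decisions from this thread: the query parameter values `json` and `arrow`, the rule that the query parameter takes precedence over the Accept header, the JSON default, and the use of the Arrow IPC stream media type `application/vnd.apache.arrow.stream` as a stand-in for the Arrow option (Arrow Flight itself is an RPC protocol, so it would not be negotiated this way over plain HTTP).

```python
from typing import Optional

JSON = "application/json"
ARROW = "application/vnd.apache.arrow.stream"  # stand-in for the Arrow option
SUPPORTED = (JSON, ARROW)
ALIASES = {"json": JSON, "arrow": ARROW}  # hypothetical query parameter values


def select_format(accept: Optional[str], query_format: Optional[str] = None) -> str:
    """Pick the response media type from a query parameter or the Accept header."""
    # Assumption: an explicit query parameter wins over the Accept header.
    if query_format is not None:
        try:
            return ALIASES[query_format.lower()]
        except KeyError:
            raise ValueError(f"unsupported format: {query_format!r}")
    if not accept:
        return JSON
    candidates = []
    for position, part in enumerate(accept.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        quality = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0
        # Sort by descending quality, then by order of appearance.
        candidates.append((-quality, position, media_type))
    for neg_quality, _, media_type in sorted(candidates):
        if neg_quality == 0:
            continue  # q=0 means "not acceptable"
        if media_type in SUPPORTED:
            return media_type
        if media_type in ("*/*", "application/*"):
            return JSON
    # Nothing matched; a stricter server could answer 406 instead.
    return JSON


print(select_format("application/json;q=0.5, application/vnd.apache.arrow.stream"))
```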

severo · 2023-01-24T15:06:07Z

Another option would be to create an Arrow Flight server that can be accessed on a dedicated endpoint (it's RPC, not a REST API); the /rows endpoint would then use that RPC endpoint to serve a JSON-serialized response.

severo · 2023-06-14T12:16:22Z

Done

severo added the enhancement label Aug 26, 2021

severo mentioned this issue Sep 23, 2021

Add a parameter to specify the number of rows #33

Closed

3 tasks

severo changed the title ~~Get access to a large number of rows~~ Get random access to the rows Sep 23, 2021

severo added the low-priority label Jan 26, 2022

severo added the move-to-datasets-server label Feb 4, 2022

severo closed this as completed Feb 4, 2022

severo reopened this May 3, 2022

severo removed the move-to-datasets-server label May 3, 2022

severo added this to the Random access to the rows milestone Jun 17, 2022

severo mentioned this issue Jun 29, 2022

Deprecate /rows, and replace /splits with the current /splits-next #427

Closed

severo added the feature request label and removed the enhancement label Sep 16, 2022

severo added the new processing step label Sep 29, 2022

severo mentioned this issue Jan 6, 2023

Enable the private datasets #39

Closed

3 tasks

severo mentioned this issue Jan 31, 2023

Dataset Viewer issue for jonas/undp_jobs_raw #731

Closed

severo closed this as completed Jun 14, 2023
