Get random access to the rows #13

Closed
severo opened this issue Aug 26, 2021 · 9 comments
Labels
feature request (Request for a new feature)
new processing step (Processing steps: /splits, /first-rows, /parquet...)

Comments

@severo
Collaborator

severo commented Aug 26, 2021

Currently, only the first rows can be obtained with /rows. We want access to arbitrary slices of the rows through pagination, e.g. /rows?from=40000&rows=10
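As an illustration of the pagination proposed above, here is a minimal sketch of how a client might build such a request URL. The `from` and `rows` parameter names come from the example in this comment; the helper name `rows_url` and the base URL are hypothetical, and the final API may use different parameter names.

```python
from urllib.parse import urlencode


def rows_url(base_url: str, start: int, count: int) -> str:
    """Build a paginated /rows URL (hypothetical helper).

    `start` is the index of the first row, `count` the number of rows.
    """
    if start < 0 or count <= 0:
        raise ValueError("start must be >= 0 and count must be > 0")
    # `from` and `rows` are the parameter names used in the example above.
    query = urlencode({"from": start, "rows": count})
    return f"{base_url}/rows?{query}"


# Placeholder host, not a real service.
print(rows_url("https://example.invalid", 40000, 10))
# prints https://example.invalid/rows?from=40000&rows=10
```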

@severo changed the title from "Get access to a large number of rows" to "Get random access to the rows" on Sep 23, 2021
@severo
Collaborator Author

severo commented Jan 31, 2022

Currently, some datasets are stored fully on disk (those that cannot be streamed and are small) and could be queried for ranges of rows, or used to produce statistics.
But that is a fallback, not the normal mode, and if we implement this, we would want it to work for all datasets if possible.

@severo
Collaborator Author

severo commented Jun 17, 2022

Another Hub that gives random access to the data: https://app.activeloop.ai/activeloop/coco-train (right panel)

[Screenshot, 2022-06-17 17:42:12]

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@severo added the "feature request" label and removed the "enhancement" label on Sep 16, 2022
@severo added the "new processing step" label on Sep 29, 2022
@severo
Collaborator Author

severo commented Dec 6, 2022

Design thoughts (thanks to discussion with @lhoestq):

  • the endpoint will be GET /rows, with required parameters dataset, config and split, and optional parameters to control the number of results, e.g. offset and length (originally proposed as from and limit; see Get random access to the rows #13 (comment))
  • the API service will request the parquet files (if they exist) and return the formatted JSON. Beware: this might take some time, in particular if the rows are very big (a single row can contain several MB of JSON), so we should not impose a timeout on the response.
  • to be discussed: what to do with audio files and images? They are stored as a struct with two fields: bytes (the raw bytes) and path (the filename). We could return a temporary URL in the /rows JSON, and then generate a temporary static file when the client calls that URL (again without a timeout, because generating the temporary file can take some time). For images, we could maybe depend on an external service like Cloudinary.
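To make the second point more concrete, here is a rough sketch of how a service might translate an (offset, length) request into per-file reads. It assumes, and this is only an assumption, that a split is stored as an ordered list of parquet files whose row counts are known in advance. The function name `locate_rows` is hypothetical and the actual parquet reading is left out.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_rows(file_row_counts, offset, length):
    """Map a global (offset, length) slice onto per-file slices.

    `file_row_counts` lists the number of rows in each parquet file, in order
    (assumed to be known, e.g. from the file metadata).
    Returns a list of (file_index, local_offset, local_length) tuples.
    """
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    # ends[i] is the global index just past the last row of file i.
    ends = list(accumulate(file_row_counts))
    total = ends[-1] if ends else 0
    stop = min(offset + length, total)  # clamp the slice to the split size
    pieces = []
    i = bisect_right(ends, offset)  # first file that ends after `offset`
    pos = offset
    while pos < stop:
        file_start = ends[i] - file_row_counts[i]
        take = min(ends[i], stop) - pos
        if take > 0:  # skip empty files
            pieces.append((i, pos - file_start, take))
        pos += take
        i += 1
    return pieces


# Three files of 100, 50 and 200 rows; request rows 90..159.
print(locate_rows([100, 50, 200], offset=90, length=70))
# prints [(0, 90, 10), (1, 0, 50), (2, 0, 10)]
```

With this mapping, only the files that overlap the requested slice need to be opened, which matters when individual rows are large.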

@lhoestq
Member

lhoestq commented Dec 7, 2022

the endpoint will be GET /rows, with required parameters dataset, config, split and optional parameters to control the number of results, eg: from and limit?

For consistency with the Arrow terminology, you can use offset and length.

@severo severo mentioned this issue Jan 6, 2023
3 tasks
@mariosasko
Contributor

the API service will request the parquet files (if they exist), and return the formatted JSON. Beware: it might take some time, in particular if the rows are very big (every row can contain several MB of JSON) so we should not impose a timeout on the response.

Maybe we could eventually switch to Apache Flight to make this performant.

A nice article that compares Apache Flight and the standard approach based on returning a binary format over HTTP: https://voltrondata.com/resources/data-transfer-at-the-speed-of-flight

@severo
Collaborator Author

severo commented Jan 24, 2023

We could provide two response formats for the /rows endpoint, depending on what the client wants:

  • JSON
  • Arrow Flight

We could use the Accept header, or a query parameter, to select the format of the response.
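A minimal sketch of that selection logic, under several assumptions: the query parameter is called `format` (hypothetical name), the Arrow option is requested over HTTP with the Arrow IPC stream media type `application/vnd.apache.arrow.stream` (standing in for Arrow Flight, which is gRPC-based and would not go through an Accept header), and q-values in the Accept header are ignored for brevity.

```python
JSON_TYPE = "application/json"
# Assumption: Arrow IPC stream media type used as the HTTP stand-in for the Arrow option.
ARROW_TYPE = "application/vnd.apache.arrow.stream"


def choose_format(accept=None, format_param=None):
    """Return "json" or "arrow" for a /rows request (hypothetical helper).

    An explicit query parameter wins over the Accept header; JSON is the default.
    """
    if format_param is not None:
        key = format_param.strip().lower()
        if key in ("json", "arrow"):
            return key
        raise ValueError(f"unsupported format: {format_param!r}")
    # Walk the Accept header in order, ignoring parameters such as q-values.
    for item in (accept or "").split(","):
        media_type = item.split(";")[0].strip().lower()
        if media_type == ARROW_TYPE:
            return "arrow"
        if media_type in (JSON_TYPE, "application/*", "*/*"):
            return "json"
    return "json"


print(choose_format(accept="application/vnd.apache.arrow.stream"))  # prints arrow
print(choose_format(accept="text/html, application/json"))  # prints json
```

Letting the query parameter override the header keeps the format easy to choose from a browser or a plain link, where setting headers is inconvenient.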

@severo
Collaborator Author

severo commented Jan 24, 2023

Another option would be to create an Arrow Flight server that can be accessed on a dedicated endpoint (it's RPC, not a REST API), and the /rows endpoint would use that RPC endpoint to serve a JSON-serialized response.

@severo
Collaborator Author

severo commented Jun 14, 2023

Done

@severo severo closed this as completed Jun 14, 2023