Retrieving large amounts of data from a database #24

leamingrad · 2019-05-14T08:48:54Z

Firstly, thanks for a great package, caqti has proved to be super-useful for handling database interactions in ocaml.

I'm currently trying to work out how to efficiently extract large amounts of data (multiple GB) from a postgresql database, which I then want to fold over using lwt. Unfortunately, naively using a single SELECT query and fold_s does not work, as caqti loads all of the data in memory, and causes my machine to run out of RAM).

I have had a look through the postgresql documentation, and it seems like there are two ways to do this (if you can think of others I would be happy to hear them):

Break the query into multiple smaller queries using OFFSET /LIMIT batching, potentially with an extra query at the start to count the rows that will be returned
Making use of postgresql's single row mode in order to stream through the results. This is supported by the ocaml postgres package, but isn't currently used by caqti as far as I can tell

I could potentially implement the OFFSET/LIMIT behaviour in my application, but that would mean that I could no-longer make use of the fold_s helper function, which is very useful IMO. Similarly, it probably does not make sense for caqti to implement this sort of batching, as the API does not provide enough detail about the SQL statements being executed to make sure that adding OFFSET and LIMIT parameters would not conflict.

Do you think it would be worth adding the single row mode behaviour to caqti when using the postgres driver - either by default or (more likely) as an option?

I would be happy to put in a PR to add this function.

The text was updated successfully, but these errors were encountered:

paurkedal · 2019-05-14T20:26:07Z

I think your assessment is right. I consider adding streaming retrieval for the PostgreSQL driver, but haven't looked into the details yet. On the other hand the OFFSET/LIMIT solution is best done in the application. It should be possible to compose the fold_s if you first determine the number of batches:

let rec loop batch acc =
  if batch >= batch_count then Lwt.return_ok acc else
  Db.fold_s request f (params, batch) acc >>=? loop (batch + 1)
in
loop 0

where Db is the connection module and (>>=?) is the combined bind for lwt and result monads.

leamingrad · 2019-05-16T20:45:45Z

Thanks for the prompt reply. I have had a go at implementing the batching solution in my application and it seems to work quite nicely.

I would still be happy to implement single row mode as it would useful going forwards, particularly when using the stream functions added by #22.

I haven't looked into it in detail, but my guess would be that there are three main ways this could be added:

Write a second postgresql driver that just uses single row mode for every query, and can then be selected at connection time
Provide a parameter when building a Caqti_request.t to indicate it should be executed in single row mode
Provide parameters to the fold_s, iter_s and to_stream methods to indicate that single row mode should be used

Do you have a preference for which way would be best? 1 seems like it might be overkill, but it would mean that the other drivers do not have to care about the single row mode config. I think that 2 and 3 would both work, but I think 2 might be a cleaner API.

paurkedal · 2019-05-17T17:00:21Z

A PR would be much appreciated, and discussing the details now is already useful even if I implement it myself later.

I agree that 1 is an overkill, but see my option 5 below which would provide the same functionality from the user point of view.

You option 2 offers selection per query, while 3 offers selection per query and per query parameters. Both gets a bit invasive on the API, since the parameter must be duplicated across convenience functions, but I think we can find an acceptable solution. I have a slight preference for 3 due to the finer granularity, supplementing the Caqti_connection_sig API (call, iter_s, fold, fold_s) with an optional flag, but let's discuss the details if we go with this.

I think there are two other options of selecting single-row-mode:

Always use single-row-mode when the multiplicity is `Many.
Let the driver accept a query parameter in the URI which activates single-row-mode when the multiplicity is `Many.

These are nice from an API point of view, but if the segmentation of transfers implied by single-row-mode has a significant impact on performance, having better granularity would be good. Option 5 would allow exploring performance impact, as well as allowing the sysadmin to configure the mode according to DB size vs memory size.

So, I think one way forward would be to implement 5, and supplement with 2 or 3 only when/if needed, in which case the URI parameter would act as a default.

leamingrad · 2019-05-24T07:38:49Z

5 sounds sensible to me. Just to check, you would mean that the user could pass a connection URL that looks like:

postgresql://localhost?single_row_mode=true

to enable single row mode for queries with Many multiplicity. This seems fine from a usability point of view, although it would have to be documented carefully somewhere.

Looking at the documentation here my understanding is that the parameter would be ignored by other clients, which would print a warning to stderr if the URL was accidentally passed somewhere else. I think this is acceptable behaviour.

paurkedal · 2019-05-24T19:13:52Z

Yes, this was my though precisely. I see your point about the URI being reused, that could even happen if another part of a project use the same configuration parameter to connect directly. But I think we can leave it up to the application to split the URI into two parts, e.g. like

db_uri: "postgresql://localhost/foo?connect_timeout=10"
db_caqti_options: "?single_row_mode=true"

and merge in the second query string before passing the URI to the Caqti connector.

The sqlite3 driver already has some facilities to decode query parameters like get_uri_bool which could be moved to Caqti_driver_lib.

At the moment the URI documentation is in the mli files of the drivers, though would be good to make it more visible. Candidate places to document it would be in Caqti_connect_sig or a separate .mld file.

vog · 2019-09-23T12:08:54Z

I'd like to argue in favor of point "4.":

4 Always use single-row-mode when the multiplicity is `Many.

The main concern is:

if the segmentation of transfers implied by single-row-mode has a significant impact on performance, having better granularity would be good.

Other database client libraries in other languages usually solve this by using a batch mode. That is, they do perform streaming, but fetch 100 rows at once before further processing. (or 128 rows if you prefer round numbers)

This is usually not configurable by enabling/disabling streaming completely, but instead by allowing to specify the batch size on a per-query basis.

In practice I've never found a reason to adjust those defaults. I did perform experiments and measurements, but never found a batch size that performs significantly better than the defaults. However, in theory this might be desirable if you fetch huge rows, e.g. if every single rows has already 100 MiB.

paurkedal · 2019-09-24T07:17:27Z

Do you know how batches are implemented in those client libraries? Using OFFSET and LIMIT? The single-row mode in PostgreSQL seems to be limited to sending single tuples at the time. I agree 4 is an attractive option. 5 is useful for experimenting with the performance impact, I think we want single-row by default in either case.

vog · 2019-09-25T12:36:49Z

Oh, I'm sorry, I just realized that I confused reading and writing. I was describing batch inserts, where the inserts are not executed one-by-one, but also not executed all at once, but instead in batched of around 100 rows. Sorry for the noise!

I'm curious if the streaming reading will actually have a performance impact or not. In interpreted languages, it usually does, but that doesn't tell us anything about OCaml. In the best case, it doesn't, so we can just fully switch to that one in the future.

paurkedal · 2019-09-26T06:59:40Z

@vog Did they do any benchmarking to pin down the bottleneck? I would think, at least for compiled languages, the slowdown would be due to fragmenting the data into individual packets, rather than in the application itself. If so, all solutions based on the same client library would have this issue.

paurkedal · 2019-09-26T07:24:34Z

@vog BTW, we are also implementing streaming in the other direction in #26. So far, only generic support, but the libpq API, at least, supports buffering data independent of row-boundaries.

vog · 2019-09-26T14:23:46Z

@paurkedal Sorry, I did not pin down the bottleneck, I was just tuning parameters. But when was I talked about other programming languages, I had mostly interpreted languages in mind (Python, Ruby, PHP, Perl), so those languages would probably not be good for comparison anyway.

paurkedal · 2019-10-02T19:43:10Z

What I found on the net about libpq is that the single-row mode performance [1, 2] suggests a slowdown of a factor 2, for a presumably efficient C code running against localhost. I am not sure how this compares to the following benchmarks, since we don't have the absolute row rate from the C benchmarks.

This is less relevant to this issues, but I benchmarked the stream implementation against other options:

Throughputs for "List", "Seq", "Stream_blocking", "Stream_lwt" each running for at least 30 CPU seconds:
           List: 30.13 WALL (30.12 usr +  0.01 sys = 30.12 CPU) @ 3838.26/s (n=115626)
            Seq: 31.41 WALL (31.39 usr +  0.00 sys = 31.39 CPU) @ 1878.52/s (n=58965)
Stream_blocking: 31.62 WALL (31.61 usr +  0.00 sys = 31.61 CPU) @ 1190.19/s (n=37617)
     Stream_lwt: 31.60 WALL (31.58 usr +  0.00 sys = 31.58 CPU) @ 571.30/s (n=18042)

                  Rate    Stream_lwt Stream_blocking           Seq          List
     Stream_lwt  571/s            --            -52%          -70%          -85%
Stream_blocking 1190/s          108%              --          -37%          -69%
            Seq 1879/s          229%             58%            --          -51%
           List 3838/s          572%            222%          104%            --

I committed the code; use dune exec tests/bench_stream.exe to reproduce. The benchmark merely sums up 100 000 floats from a list, seq, or stream, so the computation done per element is minimal. As can be seen, the slowdown compared to iteration over a list is a factor 6.7 of which a factor 2 is due to concurrency and the remaining 3.3 is dune to the stream. I expect the performance ratios to be smaller than this in the real case, since we are also encoding/decoding the tuples and often processing rows wider than a single float.

paurkedal · 2022-11-08T16:26:39Z

Commit 60f2e01 implements single-row mode for multi-row requests when use_single_row_mode=true is passed to the endpoint URI. The performance isn't good though:

╭──────────────┬───────────────────────────┬───────────────────────────┬───────────────────────────╮
│name          │  major-allocated          │  minor-allocated          │  monotonic-clock          │
├──────────────┼───────────────────────────┼───────────────────────────┼───────────────────────────┤
│  fetch-many  │          1345.6000 mjw/run│      17324808.6000 mnw/run│       68667869.0727 ns/run│
╰──────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────╯

for normal mode vs

╭──────────────┬───────────────────────────┬───────────────────────────┬───────────────────────────╮
│name          │  major-allocated          │  minor-allocated          │  monotonic-clock          │
├──────────────┼───────────────────────────┼───────────────────────────┼───────────────────────────┤
│  fetch-many  │          5347.0000 mjw/run│      36301415.0000 mnw/run│     6200377141.0000 ns/run│
╰──────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────╯

for single-row mode, when using a local socket connection. The test bench_fetch_many.ml.txt can be compiled with

(executable
 (name bench_fetch_many)
 (modules bench_fetch_many)
 (libraries
    bechamel bechamel-notty
    caqti caqti.blocking caqti-driver-postgresql
    notty.unix))

paurkedal · 2022-11-10T19:36:37Z

It looks like the inefficiency is due to polling, with the following incorrect replacement, the single-row mode is within a factor 2 slowdown compared to the normal mode:

diff --git a/caqti-driver-postgresql/lib/caqti_driver_postgresql.ml b/caqti-driver-postgresql/lib/caqti_driver_postgresql.ml
index a2712b6..d2ece8b 100644
--- a/caqti-driver-postgresql/lib/caqti_driver_postgresql.ml
+++ b/caqti-driver-postgresql/lib/caqti_driver_postgresql.ml
@@ -381,18 +381,8 @@ module Connect_functor (System : Caqti_platform_unix.System_sig.S) = struct
        | exception Pg.Error msg -> return (Error msg)
        | socket -> Unix.wrap_fd aux (Obj.magic socket))
 
-    let get_next_result ~uri ~query db =
-      let rec retry fd =
-        db#consume_input;
-        if db#is_busy then
-          Unix.poll ~read:true fd >>= (fun _ -> retry fd)
-        else
-          return (Ok db#get_result)
-      in
-      try Unix.wrap_fd retry (Obj.magic db#socket)
-      with Pg.Error err ->
-        let msg = extract_communication_error db err in
-        return (Error (Caqti_error.request_failed ~uri ~query msg))
+    let get_next_result ~uri:_ ~query:_ db =
+      return (Ok db#get_result)
 
     let get_one_result ~uri ~query db =
       get_next_result ~uri ~query db >>=? function

paurkedal self-assigned this May 14, 2019

leamingrad mentioned this issue May 24, 2019

Postgresql Copy Mode Support #26

Closed

paurkedal added a commit that referenced this issue Nov 8, 2022

Implement single-row-mode for postgresql (#24).

60f2e01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retrieving large amounts of data from a database #24

Retrieving large amounts of data from a database #24

leamingrad commented May 14, 2019

paurkedal commented May 14, 2019

leamingrad commented May 16, 2019

paurkedal commented May 17, 2019

leamingrad commented May 24, 2019

paurkedal commented May 24, 2019

vog commented Sep 23, 2019 •

edited

Loading

paurkedal commented Sep 24, 2019 •

edited

Loading

vog commented Sep 25, 2019

paurkedal commented Sep 26, 2019

paurkedal commented Sep 26, 2019

vog commented Sep 26, 2019

paurkedal commented Oct 2, 2019 •

edited

Loading

paurkedal commented Nov 8, 2022 •

edited

Loading

paurkedal commented Nov 10, 2022

Retrieving large amounts of data from a database #24

Retrieving large amounts of data from a database #24

Comments

leamingrad commented May 14, 2019

paurkedal commented May 14, 2019

leamingrad commented May 16, 2019

paurkedal commented May 17, 2019

leamingrad commented May 24, 2019

paurkedal commented May 24, 2019

vog commented Sep 23, 2019 • edited Loading

paurkedal commented Sep 24, 2019 • edited Loading

vog commented Sep 25, 2019

paurkedal commented Sep 26, 2019

paurkedal commented Sep 26, 2019

vog commented Sep 26, 2019

paurkedal commented Oct 2, 2019 • edited Loading

paurkedal commented Nov 8, 2022 • edited Loading

paurkedal commented Nov 10, 2022

vog commented Sep 23, 2019 •

edited

Loading

paurkedal commented Sep 24, 2019 •

edited

Loading

paurkedal commented Oct 2, 2019 •

edited

Loading

paurkedal commented Nov 8, 2022 •

edited

Loading