Concurrent queries being limited #407

Closed
gabrielmcg44 opened this issue Oct 7, 2024 · 4 comments · Fixed by #408
Labels: enhancement (New feature or request), question (Further information is requested)

gabrielmcg44 commented Oct 7, 2024

Hey folks, I'm not sure if this is a bug or if there's a configuration option I should be using.

I was testing concurrent queries with clickhouse-connect by running many queries that each just sleep for 3 seconds.

I run 100 of these in parallel, so I was expecting a total time close to 3 seconds. Instead, the queries all start together but are executed in blocks of 12 for some reason, and the whole run takes around 30 seconds.

When I ran the same test with another client, all of the queries appeared to run in parallel.

Here is the test I am running:

import asyncio
import textwrap
import time

from clickhouse_connect import get_async_client


async def run_sleep_query(ch_client, sleep_time, id):
    print(f"starting query {id} at {time.time() % 1000}")

    statement = textwrap.dedent(
        f"""
        SELECT sleep({sleep_time}) as sleep
        """
    ).strip()

    result = await ch_client.query(statement)
    print(f"Query {id} finished at {time.time() % 1000}")
    return result


async def main():

    sleep_times = []
    for i in range(100):
        sleep_times.append((i, 3))

    # ClickHouseConnectConnection is a helper class in my own code that
    # loads connection settings from environment variables
    connection = ClickHouseConnectConnection.from_environment()

    ch_client = await get_async_client(
        host=connection.host,
        user=connection.username,
        password=connection.password,
        port=connection.port_http,
        database=connection.database,
        session_id=connection.session_id,
        connect_timeout=5,
    )
    start = time.time()
    # Run queries in parallel
    results = await asyncio.gather(*[run_sleep_query(ch_client, sleep_time, id) for id, sleep_time in sleep_times])

    # Print results
    for i, sleep_result in enumerate(results):
        print(f"Query {i+1} finished with sleep result")

    end = time.time()
    print(f"Total time: {end - start:.2f}s")


if __name__ == "__main__":
    asyncio.run(main())

I also tried to work around this by creating a class that holds multiple clients and randomly picks one of them for each query, but it made no difference at all.

@genzgd added the "question" label (Further information is requested) on Oct 7, 2024
genzgd (Collaborator) commented Oct 7, 2024

Under the covers, the AsyncClient uses a ThreadPoolExecutor to actually run the async requests, which means each client query in fact runs in a separate thread. In cases where the thread is mostly waiting around for I/O (like your example, where each thread sleeps for 3 seconds), this is just fine.
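
In other words, each awaited query is handed off to a worker thread. Here is a rough sketch of that general pattern using only the standard library; it illustrates the mechanism, not the library's actual internals:

import asyncio
import time

def blocking_query(statement):
    # Stand-in for the synchronous client call; it blocks its thread on I/O
    time.sleep(3)
    return statement

async def query(statement):
    loop = asyncio.get_running_loop()
    # None means "use the event loop's default ThreadPoolExecutor"
    return await loop.run_in_executor(None, blocking_query, statement)

async def main():
    # Many coroutines are created, but only as many run at once as the
    # default pool has worker threads
    await asyncio.gather(*(query(f"query {i}") for i in range(100)))

asyncio.run(main())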

However, in cases where the thread is doing real CPU work, such as preparing data for insert, converting incoming data from Native format to Python objects, or application processing of the data after it is retrieved from ClickHouse, having multiple threads doing heavy CPU operations could in theory bring the whole machine/container to a crawl. (In practice this is unlikely unless you are running multiple processes, since Python is essentially restricted to one running thread at a time by the famous GIL, the global interpreter lock.)

Accordingly, there is always a maximum number of threads in the default ThreadPoolExecutor. That number is set to the number of detected cores on your machine plus 4. So if you have an 8-core machine, Python will only assign async work to 12 threads at a time, and one of those threads must free up before Python will assign another thread to one of your queries.
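
For reference, you can check that default directly; since Python 3.8 it is min(32, os.cpu_count() + 4). Note that _max_workers is a CPython implementation detail, used here only for a quick sanity check:

import os
from concurrent.futures import ThreadPoolExecutor

# With no max_workers argument, the default is min(32, os.cpu_count() + 4)
executor = ThreadPoolExecutor()
print(f"Detected cores: {os.cpu_count()}")
print(f"Default max workers: {executor._max_workers}")  # e.g. 12 on an 8-core machine
executor.shutdown()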

In the next patch release we'll add an option to make the number of worker threads configurable. In general it's not a great idea to raise that to a large number, so that the clickhouse-connect application doesn't starve the machine. However, in a use case where clickhouse-connect is doing very little real work (just orchestrating long-running ClickHouse queries, for example), it will be safe to increase that number once the option is available.
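
Until that option ships, one possible workaround, assuming the AsyncClient really does submit its work to the event loop's default executor as described above, is to swap in a larger default executor. A minimal sketch:

import asyncio
from concurrent.futures import ThreadPoolExecutor

async def main():
    # Assumption: the AsyncClient hands its blocking work to the event
    # loop's *default* executor, so replacing that executor with a wider
    # pool lifts the cores + 4 cap.
    asyncio.get_running_loop().set_default_executor(
        ThreadPoolExecutor(max_workers=100)
    )
    # ... create the client and asyncio.gather() the queries as above ...

asyncio.run(main())

This only helps if the client uses the loop's default executor internally, so treat it as an experiment rather than a supported configuration.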

@genzgd added the "enhancement" label (New feature or request) on Oct 7, 2024
gabrielmcg44 (Author) commented
This makes perfect sense. Is this thread pool shared between multiple clients? I'm trying to understand why creating multiple clients and randomly assigning tasks to them made no difference.

genzgd (Collaborator) commented Oct 7, 2024

Yes, the current code uses the shared default executor for all AsyncClients. In the next release each client will have its own executor. That's slightly risky, since multiple clients together could then run into the same thread starvation issue, but I think it's reasonable to let the user tune each client separately.
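
For illustration, per-client tuning would presumably look something like the sketch below. The executor_threads parameter name is an assumption here, not something confirmed in this thread, so check the release notes for the actual option:

from clickhouse_connect import get_async_client

async def make_client():
    # executor_threads is an assumed parameter name for the per-client
    # thread pool size; with 100 worker threads, 100 in-flight sleep
    # queries could all run concurrently
    return await get_async_client(
        host="localhost",
        executor_threads=100,
    )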

gabrielmcg44 (Author) commented

Cool, thanks for the clarifications!
