Concurrent queries being limited #407

Closed
gabrielmcg44 opened this issue Oct 7, 2024 · 4 comments · Fixed by #408
Labels: enhancement (New feature or request), question (Further information is requested)

gabrielmcg44 commented Oct 7, 2024

Hey folks, I'm not sure if this is a bug or if there's a configuration option I should be using.

I was testing concurrent queries with clickhouse-connect by running many queries that each just sleep for 3 seconds.

I run 100 of these in parallel, so I was expecting a total time close to 3 seconds. Instead, the queries all start together but are executed in blocks of 12 for some reason, and the whole run takes around 30 seconds.

When I ran the same test with another client, all of the queries appeared to run in parallel.

Here is the test I am running:

import asyncio
import textwrap
import time

from clickhouse_connect import get_async_client


async def run_sleep_query(ch_client, sleep_time, id):
    print(f"starting query {id} at {time.time() % 1000}")

    statement = textwrap.dedent(
        f"""
        SELECT sleep({sleep_time}) as sleep
        """
    ).strip()

    result = await ch_client.query(statement)
    print(f"Query {id} finished at {time.time() % 1000}")
    return result


async def main():

    sleep_times = []
    for i in range(100):
        sleep_times.append((i, 3))

    # ClickHouseConnectConnection is a helper class in my own code that
    # loads connection settings from environment variables
    connection = ClickHouseConnectConnection.from_environment()

    ch_client = await get_async_client(
        host=connection.host,
        user=connection.username,
        password=connection.password,
        port=connection.port_http,
        database=connection.database,
        session_id=connection.session_id,
        connect_timeout=5,
    )
    start = time.time()
    # Run queries in parallel
    results = await asyncio.gather(*[run_sleep_query(ch_client, sleep_time, id) for id, sleep_time in sleep_times])

    # Print results
    for i, sleep_result in enumerate(results):
        print(f"Query {i+1} finished with sleep result")

    end = time.time()
    print(f"Total time: {end - start:.2f}s")


if __name__ == "__main__":
    asyncio.run(main())

I also tried to work around this by creating a class that holds multiple clients and randomly picks one of them for each query, but it made no difference at all.

@genzgd added the "question" label (Further information is requested) on Oct 7, 2024
genzgd (Collaborator) commented Oct 7, 2024

Under the covers, the AsyncClient uses a ThreadPoolExecutor to actually run the async requests, which means each client query in fact runs in a separate thread. In cases where the thread is mostly waiting around for I/O (like your example, where each thread sleeps for 3 seconds), this is just fine.
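
In other words, each awaited query is handed off to a worker thread. Here is a rough sketch of that general pattern using only the standard library; it illustrates the mechanism, not the library's actual internals:

import asyncio
import time

def blocking_query(statement):
    # Stand-in for the synchronous client call; it blocks its thread on I/O
    time.sleep(3)
    return statement

async def query(statement):
    loop = asyncio.get_running_loop()
    # None means "use the event loop's default ThreadPoolExecutor"
    return await loop.run_in_executor(None, blocking_query, statement)

async def main():
    # Many coroutines are created, but only as many run at once as the
    # default pool has worker threads
    await asyncio.gather(*(query(f"query {i}") for i in range(100)))

asyncio.run(main())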

However, in cases where the thread is doing real CPU work, such as preparing data for insert, converting incoming data from Native format to Python objects, or application processing of the data after it is retrieved from ClickHouse, having multiple threads doing heavy CPU operations could in theory bring the whole machine/container to a crawl. (In practice this is unlikely unless you are running multiple processes, since Python is essentially restricted to one running thread at a time by the famous GIL, the global interpreter lock.)

Accordingly, there is always a maximum number of threads in the default ThreadPoolExecutor. That number is set to the number of detected cores on your machine plus 4. So if you have an 8-core machine, Python will only assign async work to 12 threads at a time, and one of those threads must free up before Python will assign another thread to one of your queries.
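
For reference, you can check that default directly; since Python 3.8 it is min(32, os.cpu_count() + 4). Note that _max_workers is a CPython implementation detail, used here only for a quick sanity check:

import os
from concurrent.futures import ThreadPoolExecutor

# With no max_workers argument, the default is min(32, os.cpu_count() + 4)
executor = ThreadPoolExecutor()
print(f"Detected cores: {os.cpu_count()}")
print(f"Default max workers: {executor._max_workers}")  # e.g. 12 on an 8-core machine
executor.shutdown()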

In the next patch release we'll add an option to make the number of worker threads configurable. In general it's not a great idea to raise that to a large number, so that the clickhouse-connect application doesn't starve the machine. However, in a use case where clickhouse-connect is doing very little real work (just orchestrating long-running ClickHouse queries, for example), it will be safe to increase that number once the option is available.
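
Until that option ships, one possible workaround, assuming the AsyncClient really does submit its work to the event loop's default executor as described above, is to swap in a larger default executor. A minimal sketch:

import asyncio
from concurrent.futures import ThreadPoolExecutor

async def main():
    # Assumption: the AsyncClient hands its blocking work to the event
    # loop's *default* executor, so replacing that executor with a wider
    # pool lifts the cores + 4 cap.
    asyncio.get_running_loop().set_default_executor(
        ThreadPoolExecutor(max_workers=100)
    )
    # ... create the client and asyncio.gather() the queries as above ...

asyncio.run(main())

This only helps if the client uses the loop's default executor internally, so treat it as an experiment rather than a supported configuration.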

@genzgd added the "enhancement" label (New feature or request) on Oct 7, 2024
gabrielmcg44 (Author) commented
This makes perfect sense. Is this thread pool shared between multiple clients? I'm trying to understand why creating multiple clients and randomly assigning tasks to them made no difference.

genzgd (Collaborator) commented Oct 7, 2024

Yes, the current code uses the shared default executor for all AsyncClients. In the next release each client will have its own executor. That's slightly risky, since multiple clients together could then run into the same thread starvation issue, but I think it's reasonable to let the user tune each client separately.
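
For illustration, per-client tuning would presumably look something like the sketch below. The executor_threads parameter name is an assumption here, not something confirmed in this thread, so check the release notes for the actual option:

from clickhouse_connect import get_async_client

async def make_client():
    # executor_threads is an assumed parameter name for the per-client
    # thread pool size; with 100 worker threads, 100 in-flight sleep
    # queries could all run concurrently
    return await get_async_client(
        host="localhost",
        executor_threads=100,
    )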

gabrielmcg44 (Author) commented

Cool, thanks for the clarifications!
