Added multithreading support for additional connections (+fixes) #645
Conversation
I think the only change that would be good is changing MsticpyInstace to DataProviderInstance.
Looks great though!!!
@d3vzer0 - I'm going to make some additions to this to add the same capability for splitting queries by time.
Update branch with msticpy main
# Conflicts:
# msticpy/data/core/data_providers.py
# msticpy/data/drivers/driver_base.py
…es for drivers that support multi-threading in query_provider_connections_mixin.py
Added unit tests for threading code in test_async_queries.py
Added driver properties to azure_kusto_driver.py, azure_monitor_driver.py and odata.py (mdatp_driver and security_graph_driver)
Fixed test in test_azure_kusto_driver.py
Some doc fixes to docstrings in DataProv-Kusto-New.rst, DataProv-MSSentinel-New.rst, DataProviders.rst
Unrelated doc fixes in polling_detection.py, Installing.rst, SentinelIncidents.rst
Hey @d3vzer0,
…nto pr/d3vzer0/645
…for pivot tests. Changing test_nbinit.py to avoid using config locking and just use monkeypatch.setenv
…able if nested threading is happening.
Converting pd.Timestamps to datetimes to allow serialization in azure_monitor_driver (in the Azure Monitor SDK)
Fixing some logger.info outputs in nbinit.py that normally have no output.
…nto pr/d3vzer0/645
…_mixin. Add more logging to data_providers.QueryProvider and azure_monitor_driver.py
Format of cluster name has changed in new KustoClient. Fixing test cases to allow for old and new format.
/azpipelines run
Azure Pipelines successfully started running 1 pipeline(s).
Reverted change to pop start and end parameters
Fixed failing test in test_dataqueries.py::test_split_query_err
…azure_kusto_driver.py
/azpipelines run
Azure Pipelines successfully started running 1 pipeline(s).
Support for running a query across multiple connections (with optional async operation)
It is common for data services to be spread across multiple tenants or workloads, for example multiple Sentinel workspaces, Microsoft Defender subscriptions, or Splunk instances. You can use the QueryProvider to run a query across multiple connections and return the results in a single DataFrame.

To create a multi-instance provider, create an instance of a QueryProvider for your data source and execute the connect() method to connect to the first instance of your data service. Then use the add_connection() method. This takes the same parameters as the connect() method (the parameters for this method vary by data provider). add_connection() also supports an alias parameter, which lets you refer to the connection by a friendly name.

When you now run a query for this provider, the query will be run on all of the connections and the results will be returned as a single DataFrame.
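The multi-connection pattern can be sketched as follows. Note that this is a simplified, hypothetical stand-in, not the real msticpy API: the MultiProvider and Connection classes and their constructor arguments are invented here for illustration only.

```python
# Minimal sketch of the multi-connection pattern. These classes are
# hypothetical stand-ins, not the real msticpy QueryProvider API.
class Connection:
    """One connection to a data service instance (e.g. a workspace)."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows  # canned "query results" for this instance

    def query(self, qry):
        # A real driver would send `qry` to the service; here we just
        # tag each canned row with the connection name.
        return [(self.name, row) for row in self.rows]


class MultiProvider:
    """Fans a query out to every registered connection."""

    def __init__(self):
        self._connections = {}

    def connect(self, name, rows):
        # First/primary connection, analogous to QueryProvider.connect().
        self._connections["default"] = Connection(name, rows)

    def add_connection(self, name, rows, alias=None):
        # Additional connection, optionally referenced by a friendly
        # alias, analogous to QueryProvider.add_connection(..., alias=...).
        self._connections[alias or name] = Connection(name, rows)

    def exec_query(self, qry):
        # Run the query on every connection and combine the results
        # (msticpy returns a single concatenated DataFrame; we use a list).
        combined = []
        for conn in self._connections.values():
            combined.extend(conn.query(qry))
        return combined


provider = MultiProvider()
provider.connect("workspace1", rows=[1, 2])
provider.add_connection("workspace2", rows=[3], alias="eu")
results = provider.exec_query("SecurityAlert | take 10")
# results now holds rows from both workspaces in a single collection
```

The key design point mirrored here is that callers issue one query and the provider handles fan-out and result aggregation.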
Some of the MSTICPy drivers support asynchronous execution of queries against multiple instances, so that the time taken to run the query is much reduced compared to running the queries sequentially. Drivers that support asynchronous queries will use this automatically. The initial set of multi-threaded drivers are:
By default, the queries will use at most 4 concurrent threads. You can override this by initializing the QueryProvider with the max_threads parameter set to the number of threads you want.

Multi-threaded support for split/sharded queries
MSTICPy has supported splitting large queries into time slices for a while; this is documented in Splitting a Query into time chunks. With this release, we've added asynchronous support for this (if the driver supports threaded/async operation), so that multiple chunks of the query run in parallel.
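A rough sketch of what "run chunks in parallel" means, using generic Python rather than msticpy internals (run_chunk is a hypothetical stand-in for executing one time-sliced sub-query; the 4-worker cap mirrors the default thread limit described above):

```python
from concurrent.futures import ThreadPoolExecutor


def run_chunk(time_range):
    """Stand-in for running one time-sliced sub-query; returns fake rows."""
    start, end = time_range
    return [f"row-{start}-{end}"]


# Pretend the query window was split into four chunks.
chunks = [(0, 6), (6, 12), (12, 18), (18, 24)]

# Bounded pool, analogous to the provider's max_threads setting.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(run_chunk, chunks))

# Flatten per-chunk results into a single result set. pool.map yields
# results in submission order, so chunk order is preserved.
results = [row for part in partials for row in part]
```

Because each chunk is independent, total wall-clock time approaches that of the slowest chunk rather than the sum of all chunks.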
Use the split_query_by or split_by parameter to specify the time interval for each chunk. The time unit uses the same syntax as pandas time intervals, e.g. "1D", "4h", etc.; see the pandas documentation for more details.

In this release, sharding is also supported for ad hoc queries, as long as you add "start" and "end" parameters to the query. This is still experimental, so let us know if you run into any issues with it.
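The time-chunking itself can be illustrated with plain datetime arithmetic (a simplified sketch; split_time_range is a hypothetical helper, and the real feature parses pandas interval strings such as "1D" rather than taking a timedelta):

```python
from datetime import datetime, timedelta


def split_time_range(start, end, delta):
    """Yield (chunk_start, chunk_end) pairs covering [start, end)."""
    chunk_start = start
    while chunk_start < end:
        # Clamp the final chunk so it never extends past the query end.
        chunk_end = min(chunk_start + delta, end)
        yield chunk_start, chunk_end
        chunk_start = chunk_end


# A 3-day query window split with the equivalent of split_query_by="1D".
chunks = list(
    split_time_range(
        datetime(2023, 1, 1), datetime(2023, 1, 4), timedelta(days=1)
    )
)
# Each (chunk_start, chunk_end) pair would become the start/end
# parameters of one sub-query, run in parallel where supported.
```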