Added multithreading support for additional connections (+fixes) #645

d3vzer0 · 2023-03-17T11:48:37Z (edited by ianhelle)

Support for running a query across multiple connections (with optional async operation)

It is common for data services to be spread across multiple tenants or workloads, for example multiple Sentinel workspaces, Microsoft Defender subscriptions or Splunk instances. You can use the QueryProvider to run a query across multiple connections and return the results in a single DataFrame.

To create a multi-instance provider, create an instance of a QueryProvider for your data source and execute the connect() method to connect to the first instance of your data service.

Then use the add_connection() method. This takes the same parameters as the connect() method (the parameters for this method vary by data provider).

add_connection() also supports an alias parameter to allow you to refer to the connection by a friendly name.

    qry_prov = QueryProvider("MSSentinel")
    qry_prov.connect(workspace="Workspace1")
    qry_prov.add_connection(workspace="Workspace2", alias="Workspace2")
    qry_prov.list_connections()

Now, when you run a query with this provider, it runs on all of the connections and the results are returned as a single DataFrame.

    test_query = '''
        SecurityAlert
        | take 5
        '''

    query_test = qry_prov.exec_query(query=test_query)
    query_test.head()

Some of the MSTICPy drivers support asynchronous execution of queries against multiple instances, so the total run time is much lower than when running the queries sequentially. Drivers that support asynchronous queries use this automatically. The initial set of multi-threaded drivers is:

MSSentinel_New (the new version of the MSSentinel driver)
Kusto_New (the new version of the Kusto/Azure Data Explorer driver)

By default, the queries use at most 4 concurrent threads. You can override this by passing the max_threads parameter when initializing the QueryProvider.

    qry_prov = QueryProvider("MSSentinel", max_threads=10)
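
The behaviour described above can be illustrated with a minimal, standard-library sketch. This is not the MSTICPy implementation: `run_on_all_connections`, `make_fake_connection` and the `"connection"` key are hypothetical names used only for illustration, and each connection is assumed to be a plain callable that takes a query string and returns a list of rows.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List

Row = Dict[str, object]


def run_on_all_connections(
    connections: Dict[str, Callable[[str], List[Row]]],
    query: str,
    max_threads: int = 4,
) -> List[Row]:
    """Run `query` on every connection in parallel and merge the rows."""
    combined: List[Row] = []
    with ThreadPoolExecutor(max_workers=max_threads) as executor:
        # Submit one task per connection; at most max_threads run at once.
        futures = {
            executor.submit(run_query, query): alias
            for alias, run_query in connections.items()
        }
        for future in as_completed(futures):
            alias = futures[future]
            for row in future.result():
                # Tag each row with the connection it came from
                # ("connection" is a hypothetical key name).
                combined.append({**row, "connection": alias})
    return combined


def make_fake_connection(name: str) -> Callable[[str], List[Row]]:
    """Hypothetical stand-in for a connected driver instance."""

    def run_query(query: str) -> List[Row]:
        return [{"source": name, "query": query.strip()}]

    return run_query


connections = {
    "Workspace1": make_fake_connection("Workspace1"),
    "Workspace2": make_fake_connection("Workspace2"),
}
results = run_on_all_connections(
    connections, "SecurityAlert | take 5", max_threads=2
)
```

Because rows are collected as each task completes, their order across connections is not deterministic in this sketch; tagging each row with its connection alias keeps the origin recoverable.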

Multi-threaded support for split/sharded queries

MSTICPy has supported splitting large queries by time-slice for a while. This is documented in "Splitting a Query into time chunks". With this release, we've added asynchronous support for this (if the driver supports threaded/async operation) so that multiple chunks of the query run in parallel.

    qry_prov.SecurityAlert.list_alerts(start=start, end=end, split_by="1d")

Use the split_query_by or split_by parameter to specify the time range of each chunk. The time unit uses the same syntax as pandas time intervals (e.g. "1D", "4h"); see the pandas documentation for more details.
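
To make the idea of time-slicing concrete, here is a minimal sketch of splitting a start/end range into consecutive chunks. It is an illustration only, not MSTICPy's code: `split_time_range` is a hypothetical helper, it takes a `timedelta` rather than a pandas-style string, and how MSTICPy treats chunk boundaries is not stated in this PR.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def split_time_range(
    start: datetime, end: datetime, delta: timedelta
) -> List[Tuple[datetime, datetime]]:
    """Split the range start..end into consecutive chunks of at most `delta`."""
    if delta <= timedelta(0):
        raise ValueError("delta must be positive")
    chunks = []
    chunk_start = start
    while chunk_start < end:
        # The final chunk is shortened so it never runs past `end`.
        chunk_end = min(chunk_start + delta, end)
        chunks.append((chunk_start, chunk_end))
        chunk_start = chunk_end
    return chunks


# A 2.5-day range split by one day gives two full chunks and one half-day chunk.
ranges = split_time_range(
    datetime(2023, 5, 1), datetime(2023, 5, 3, 12), timedelta(days=1)
)
```

Each (start, end) pair can then be run as an independent query, which is what makes the chunks suitable for parallel execution.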

In this release sharding is also supported for ad hoc queries as long as you add "start" and "end" parameters to the query (this is still experimental, so let us know if you have issues with this).

msticpy/data/core/data_providers.py

ianhelle

I think the only change that would be good is changing MsticpyInstace to DataProviderInstance.
Looks great though!!!

msticpy/data/core/data_providers.py

…ithreading

Needs some further review

ianhelle · 2023-04-21T16:49:54Z

@d3vzer0 - I'm going to make some additions to this to add the same capability for splitting queries by time.
I've made some refactoring changes in PR #656 - I want to put all of the multi-connection functionality in a separate mixin class. So I'll get back to this when this PR is merged and synced with this one.
Awesome stuff though - thanks for the work.

Update branch with msticpy main

# Conflicts: # msticpy/data/core/data_providers.py # msticpy/data/drivers/driver_base.py

…es for drivers that support multi-threading in query_provider_connections_mixin.py Added unit tests for threading code in test_async_queries.py Added driver properties to azure_kusto_driver.py, azure_monitor_driver.py and odata.py (mdatp_driver and security_graph_driver) Fixed test in test_azure_kusto_driver.py Some doc fixes to docstring in DataProv-Kusto-New.rst, DataProv-MSSentinel-New.rst, DataProviders.rst Unrelated doc fixes in polling_detection.py, Installing.rst, SentinelIncidents.rst

ianhelle · 2023-05-16T02:51:44Z

Hey @d3vzer0,
Inspired by the PR, I implemented this for both multiple connections and for split (by time interval) queries.
Since the QueryProvider class was getting a bit out of hand, I moved all of the connections/threading code into its own mixin class. I haven't fully tested this yet but it seems to work for both scenarios.
I've also added this all to the docs so that it's visible to more than just you and me. :-D
Let me know what you think.

…ings changes

…nto pr/d3vzer0/645

…for pivot tests. Changing test_nbinit.py to avoid using config locking and just use monkeypatch.setenv

…able if nested threading is happening. Converting pd.Timestamps to datetimes to allow serialization in Azure-azure_monitor_driver (in AZmon SDK) Fixing some logger info outputs in nbinit.py - that normally have no output.

…nto pr/d3vzer0/645

…_mixin. Add more logging to data_providers.QueryProvider and azure_monitor_driver.py

Format of cluster name has changed in new KustoClient. Fixing test cases to allow for old and new format.

ianhelle · 2023-05-24T22:53:25Z

/azpipelines run

azure-pipelines · 2023-05-24T22:53:37Z

Azure Pipelines successfully started running 1 pipeline(s).

msticpy/data/core/query_provider_connections_mixin.py

msticpy/data/drivers/mdatp_driver.py

Reverted change to pop start and end parameters Fixed failing test in test_dataqueries.py::test_split_query_err

…azure_kusto_driver.py

ianhelle · 2023-07-05T18:44:53Z

/azpipelines run

azure-pipelines · 2023-07-05T18:45:03Z

Azure Pipelines successfully started running 1 pipeline(s).

d3vzer0 and others added 2 commits March 14, 2023 20:46

Multithreading support when using multiple connections

b6b5a12

Merge branch 'microsoft:main' into multithreading

b25d812

d3vzer0 commented Mar 17, 2023

msticpy/data/core/data_providers.py (outdated)

d3vzer0 commented Mar 17, 2023

msticpy/data/core/data_providers.py (outdated)

d3vzer0 commented Mar 17, 2023

msticpy/data/core/data_providers.py (outdated)

d3vzer0 changed the title ~~DRAFT: Added multithreading support for additional connections (+fixes)~~ Added multithreading support for additional connections (+fixes) Mar 20, 2023

ianhelle previously approved these changes Mar 20, 2023

msticpy/data/core/data_providers.py (outdated)

msticpy/data/core/data_providers.py (outdated)

msticpy/data/core/data_providers.py (outdated)

d3vzer0 and others added 4 commits March 21, 2023 13:08

Renamed additional connection column in results df

05bab7d

Merge branch 'multithreading' of github.com:d3vzer0/msticpy into mult…

051f55e

…ithreading

Fix flake warning

3406c9d

Merge branch 'main' into multithreading

41f8589

d3vzer0 and others added 4 commits May 5, 2023 13:43

Merge pull request #2 from microsoft/main

ead4cb5

Update branch with msticpy main

Merge branch 'main' into pr/d3vzer0/645

3f75c5b

# Conflicts: # msticpy/data/core/data_providers.py # msticpy/data/drivers/driver_base.py

Merge branch 'main' into multithreading

08f2236

ianhelle added 11 commits May 16, 2023 11:51

Fixing issue with unit_test_lib not properly isolating temporary sett…

d4f56b6

…ings changes

Merge branch 'multithreading' of https://github.com/d3vzer0/msticpy i…

4ce1f48

…nto pr/d3vzer0/645

Adding locking around pivot data providers loader to fix config file …

ceab46e

…for pivot tests. Changing test_nbinit.py to avoid using config locking and just use monkeypatch.setenv

Merge branch 'main' into multithreading

77aaddd

Merge branch 'main' into multithreading

b9362ae

Merge branch 'multithreading' of https://github.com/d3vzer0/msticpy i…

a1f9921

…nto pr/d3vzer0/645

Merge branch 'main' into pr/d3vzer0/645

1033330

Fxing handling of datetime/pd.Timestamp in query_provider_connections…

eb5c92a

…_mixin. Add more logging to data_providers.QueryProvider and azure_monitor_driver.py

Typo in data_providers (self.logger instead of logger)

85598d4

Typo calling logger.info in data_providers.py

2d1263f

Format of cluster name has changed in new KustoClient. Fixing test cases to allow for old and new format.

ianhelle self-assigned this May 24, 2023

ianhelle added this to the Release 2.6.0 milestone May 24, 2023

rcobb-scwx reviewed May 25, 2023

msticpy/data/core/query_provider_connections_mixin.py (outdated)

msticpy/data/drivers/mdatp_driver.py (outdated)

ianhelle added 6 commits May 25, 2023 15:08

Cleaned up and refactored code in query_provider_connections_mixin.py

99035d1

Typo in type annotation in query_provider_connections_mixin

3d6deff

Reverted change to pop start and end parameters Fixed failing test in test_dataqueries.py::test_split_query_err

Removing redundant line in mdatp_driver

19a9cc1

Merge branch 'main' into pr/d3vzer0/645

a4dc918

Merge branch 'main' into multithreading

bd57b04

Bug in commit from merge - missing self._connection_str attribute in …

45fb85b

…azure_kusto_driver.py

ianhelle approved these changes Jul 5, 2023

ianhelle merged commit 7504862 into microsoft:main Jul 5, 2023
15 checks passed
