Added multithreading support for additional connections (+fixes) #645

Merged
ianhelle merged 27 commits into microsoft:main on Jul 5, 2023

Conversation

d3vzer0
Contributor

@d3vzer0 d3vzer0 commented Mar 17, 2023

Support for running a query across multiple connections (with optional async operation)

It is common for data services to be spread across multiple tenants or workloads, e.g., multiple Sentinel workspaces, Microsoft Defender subscriptions or Splunk instances. You can use the QueryProvider to run a query across multiple connections and return the results in a single DataFrame.

To create a multi-instance provider, create an instance of a QueryProvider for your data source and execute the connect() method to connect to the first instance of your data service.

Then use the add_connection() method. This takes the same parameters as the connect() method (the parameters for this method vary by data provider).

add_connection() also supports an alias parameter to allow you to refer to the connection by a friendly name.

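    from msticpy.data import QueryProvider

    qry_prov = QueryProvider("MSSentinel")
    # Connect to the first instance
    qry_prov.connect(workspace="Workspace1")
    # Add a second connection, with a friendly alias
    qry_prov.add_connection(workspace="Workspace2", alias="Workspace2")
    qry_prov.list_connections()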

Now, when you run a query with this provider, it runs on all of the connections and the results are returned as a single DataFrame.

    test_query = '''
        SecurityAlert
        | take 5
        '''

    query_test = qry_prov.exec_query(query=test_query)
    query_test.head()

Some of the MSTICPy drivers support asynchronous execution of queries against multiple instances, so that the time taken to run the query is much reduced compared to running the queries sequentially. Drivers that support asynchronous queries will use this automatically. The initial set of multi-threaded drivers are:

  • MSSentinel_New (the new version of the MSSentinel driver)
  • Kusto_New (the new version of the Kusto/Azure Data Explorer driver)

By default, queries use at most 4 concurrent threads. To override this, pass the max_threads parameter when initializing the QueryProvider.

    qry_prov = QueryProvider("MSSentinel", max_threads=10)
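To illustrate the idea of a bounded thread pool fanning one query out over several connections, here is a minimal sketch using only the Python standard library. It is not the MSTICPy implementation: `run_query` and `query_all` are hypothetical names, the rows are fake, and a plain list of dicts stands in for the combined DataFrame.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def run_query(connection: str, query: str) -> List[Dict[str, object]]:
    """Hypothetical stand-in for a driver call; returns two fake rows."""
    return [{"connection": connection, "row": i} for i in range(2)]


def query_all(
    connections: List[str], query: str, max_threads: int = 4
) -> List[Dict[str, object]]:
    """Run the same query on every connection, using at most max_threads workers."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map yields results in the same order as the input connections
        per_connection = pool.map(lambda conn: run_query(conn, query), connections)
        # Flatten the per-connection results into one combined result set
        return [row for rows in per_connection for row in rows]


results = query_all(["Workspace1", "Workspace2"], "SecurityAlert | take 5")
```

Tagging each row with its source connection, as the fake rows do here, is one way to keep results distinguishable after they are merged.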

Multi-threaded support for split/sharded queries

MSTICPy has supported splitting large queries by time-slice for a while. This is documented in "Splitting a Query into time chunks". With this release, we've added asynchronous support for this (if the driver supports threaded/async operation) so that multiple chunks of the query will run in parallel.

    qry_prov.SecurityAlert.list_alerts(start=start, end=end, split_by="1d")

Use the split_query_by or split_by parameter to specify the time interval for each chunk. The interval uses the same syntax as pandas time intervals (e.g. "1D", "4h"); see the pandas documentation for more details.
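As a rough illustration of what splitting by time means, the sketch below cuts a start/end range into consecutive chunks of a fixed length. It uses only the standard library; `split_time_range` is a hypothetical helper, and MSTICPy's own splitting logic may treat chunk boundaries differently.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def split_time_range(
    start: datetime, end: datetime, delta: timedelta
) -> List[Tuple[datetime, datetime]]:
    """Return consecutive (chunk_start, chunk_end) pairs covering start..end."""
    chunks = []
    chunk_start = start
    while chunk_start < end:
        # The final chunk is truncated so that it never runs past `end`
        chunk_end = min(chunk_start + delta, end)
        chunks.append((chunk_start, chunk_end))
        chunk_start = chunk_end
    return chunks


# A 3.5-day range split by one day gives three full chunks and one half-day chunk
chunks = split_time_range(
    datetime(2023, 3, 1), datetime(2023, 3, 4, 12), timedelta(days=1)
)
```

Each chunk can then be submitted as a separate query, which is what allows the chunks to run in parallel on drivers that support threading.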

In this release sharding is also supported for ad hoc queries as long as you add "start" and "end" parameters to the query (this is still experimental, so let us know if you have issues with this).

@d3vzer0 d3vzer0 changed the title DRAFT: Added multithreading support for additional connections (+fixes) Added multithreading support for additional connections (+fixes) Mar 20, 2023
ianhelle
ianhelle previously approved these changes Mar 20, 2023
Contributor

@ianhelle ianhelle left a comment

I think the only change that would be good is changing MsticpyInstace to DataProviderInstance.
Looks great though!!!

msticpy/data/core/data_providers.py (outdated, resolved)
msticpy/data/core/data_providers.py (outdated, resolved)
msticpy/data/core/data_providers.py (outdated, resolved)
@ianhelle ianhelle dismissed their stale review March 23, 2023 20:37

Needs some further review

@ianhelle
Contributor

@d3vzer0 - I'm going to make some additions to this to add the same capability for splitting queries by time.
I've made some refactoring changes in PR #656 - I want to put all of the multi-connection functionality in a separate mixin class. So I'll get back to this when this PR is merged and synced with this one.
Awesome stuff though - thanks for the work.

d3vzer0 and others added 4 commits May 5, 2023 13:43
Update branch with msticpy main
# Conflicts:
#	msticpy/data/core/data_providers.py
#	msticpy/data/drivers/driver_base.py
…es for drivers that support multi-threading in query_provider_connections_mixin.py

Added unit tests for threading code in test_async_queries.py
Added driver properties to azure_kusto_driver.py, azure_monitor_driver.py and odata.py (mdatp_driver and security_graph_driver)
Fixed test in test_azure_kusto_driver.py
Some doc fixes to docstring in DataProv-Kusto-New.rst, DataProv-MSSentinel-New.rst, DataProviders.rst
Unrelated doc fixes in polling_detection.py, Installing.rst, SentinelIncidents.rst
@ianhelle
Contributor

Hey @d3vzer0,
Inspired by the PR, I implemented this for both multiple connections and for split (by time interval) queries.
Since the QueryProvider class was getting a bit out of hand, I moved all of the connections/threading code into its own mixin class. I haven't fully tested this yet but it seems to work for both scenarios.
I've also added this all to the docs so that it's visible to more than just you and me. :-D
Let me know what you think.

…for pivot tests.

Changing test_nbinit.py to avoid using config locking and just use monkeypatch.setenv
…able if nested threading is happening.

Converting pd.Timestamps to datetimes to allow serialization in Azure-azure_monitor_driver (in AZmon SDK)
Fixing some logger info outputs in nbinit.py - that normally have no output.
…_mixin.

Add more logging to data_providers.QueryProvider and azure_monitor_driver.py
Format of cluster name has changed in new KustoClient. Fixing test cases to allow for old and new format.
@ianhelle
Contributor

/azpipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@ianhelle ianhelle self-assigned this May 24, 2023
@ianhelle ianhelle added this to the Release 2.6.0 milestone May 24, 2023
@ianhelle
Contributor

ianhelle commented Jul 5, 2023

/azpipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@ianhelle ianhelle merged commit 7504862 into microsoft:main Jul 5, 2023
15 checks passed

3 participants