
SaaS Crawler Module #5095

Open · wants to merge 29 commits into main
Conversation

san81 (Contributor) commented Oct 21, 2024

Description

Introducing SaaS Source Plugins module and a base Jira Source plugin class

Issues Resolved

Resolves #4754

Check List

  • New functionality includes testing.
  • New functionality has a documentation issue. Please link to it in this PR.
    • New functionality has javadoc added
  • Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…odule for all of the gradle sources

Signed-off-by: Santhosh Gandhe <1909520+san81@users.noreply.github.com>
san81 and others added 8 commits October 21, 2024 12:32
Signed-off-by: Maxwell Brown <mxwelwbr@amazon.com>
full test coverage for base folder, spotless fixes
@san81 san81 changed the title Saas sources module Saas Crawler Module Oct 22, 2024
@san81 san81 changed the title Saas Crawler Module SaaS Crawler Module Oct 22, 2024
.gitignore Outdated
implementation 'com.fasterxml.jackson.core:jackson-core'
implementation 'com.fasterxml.jackson.core:jackson-databind'
implementation 'com.mashape.unirest:unirest-java:1.4.9'
implementation 'com.google.code.gson:gson:2.8.9'
Member:
Do we need this dependency? Please remove if possible.

Contributor Author:
Removed them

implementation 'org.projectlombok:lombok:1.18.30'
annotationProcessor 'org.projectlombok:lombok:1.18.30'

testImplementation platform('org.junit:junit-bom:5.10.0')
Member:
You don't need either of these two lines.

Contributor Author:
removed them

enabled = false
}

repositories {
Member:
Please remove this block. You don't need it.

Contributor Author:
Removed.

@@ -0,0 +1,7 @@
package org.opensearch.dataprepper.plugins.source.saas.crawler;
Member:
Suggested change
package org.opensearch.dataprepper.plugins.source.saas.crawler;
package org.opensearch.dataprepper.plugins.source.saas_crawler;

Let's use this package name in the other files as well.

Contributor Author:
Ok. Modified the code to use this new package name.

Contributor Author:
Based on the discussion, changed this name to source_crawler


@Named
public class SaasPluginExecutorServiceProvider {
Logger log = LoggerFactory.getLogger(SaasPluginExecutorServiceProvider.class);
Member:
This should be private static final

Contributor Author:
Addressed it

executorService = Executors.newFixedThreadPool(DEFAULT_THREAD_COUNT);
}

//Constructor for testing
Member:
Use Javadoc comments here, not //.

/**
 * Constructor for testing
 */

Contributor Author:
converted

*/
public interface SaasSourceConfig {

int DEFAULT_NUMBER_OF_WORKERS = 1;
Member:
What is the motivation for having this here? Why would we have a default worker count of 1 for all SaaS connectors?

Contributor Author:
The initial thinking was that we would take this input from the pipeline YAML configuration itself and let users define it based on the enterprise package they have with their service provider, because this setting decides the concurrency (pressure) we create on their service. A default value of 1 means we start with the least pressure possible, but it also means the total data extraction takes a lot longer.

Contributor Author:
For the Jira case, we are not planning to take this input from the customer. We will start with a reasonable default value suitable for Jira.
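The default-versus-override scheme described above could be sketched as follows. This is a minimal illustration, assuming a default method on the shared config interface; the Jira-specific value of 4 is purely hypothetical and not the actual plugin's choice.

```java
// Sketch only: a shared SaaS source config with the least-pressure default of 1,
// which a concrete source config may override with a service-appropriate value.
interface SaasSourceConfig {
    int DEFAULT_NUMBER_OF_WORKERS = 1;

    // Default: minimum concurrency (pressure) against the remote service.
    default int getNumberOfWorkers() {
        return DEFAULT_NUMBER_OF_WORKERS;
    }
}

class JiraSourceConfig implements SaasSourceConfig {
    // Hypothetical Jira default; not taken from user input.
    @Override
    public int getNumberOfWorkers() {
        return 4;
    }
}
```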

settings.gradle Outdated
@@ -186,3 +186,7 @@ include 'data-prepper-plugins:aws-lambda'
include 'data-prepper-plugin-schema'
include 'data-prepper-plugins:kinesis-source'
include 'data-prepper-plugins:opensearch-api-source'
include 'data-prepper-plugins:saas-source-plugins'
include 'data-prepper-plugins:saas-source-plugins:saas-crawler'
Member:
I wonder if saas-crawler is the right name here. What guides the use of SaaS here? It seems more like this is a crawler capability and not necessarily related to a SaaS product per se.

Contributor Author:
It is under saas-source-plugins, and I'm also not sure whether any other source would make use of something like this.

Member:
I think my point is that this is not directly tied to SaaS, which can vary and is not even an entirely clear term. I was suggesting that we rename this to something like crawler-source-plugins.

Or are these API crawlers? Or REST crawlers?

  • crawler-source-plugins
  • rest-crawler-source-plugins
  • api-crawler-source-plugins

Contributor Author:
based on the discussion, renamed it to source_crawler

… the review input

Iterator<ItemInfo> itemInfoIterator = client.listItems();
log.info("Starting to crawl the source");
long updatedPollTime = 0;
log.info("Creating Partitions");
Member:
There are a lot of logs throughout the PR that are info level but look like they should be debug.

Contributor Author:
Adjusted the log level for some and removed a few log statements.

updatedPollTime = Math.max(updatedPollTime, niUpdated);
log.info("updated poll time {}", updatedPollTime);
}
createPartition(itemInfoList, coordinator);
Member:
Why are we passing a full list to this method? Can we just create each partition inline or do they all need to be stored first?

Contributor Author:
We are passing maxItemsPerPage items to this method, i.e. the page size in our paginated crawling. All the items in a page go into one partition (a work item): effectively one partition per page.
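The partition-per-page batching described above can be sketched as follows. The method and parameter names (batchIntoPages, maxItemsPerPage) are illustrative, not the PR's actual API; the real code hands each page to createPartition rather than collecting them.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch: drain up to maxItemsPerPage items from the source iterator at a time;
// each resulting page becomes one partition (work item).
class PageBatcher {
    static <T> List<List<T>> batchIntoPages(Iterator<T> items, int maxItemsPerPage) {
        List<List<T>> pages = new ArrayList<>();
        while (items.hasNext()) {
            List<T> page = new ArrayList<>();
            while (items.hasNext() && page.size() < maxItemsPerPage) {
                page.add(items.next());
            }
            pages.add(page); // one partition per page
        }
        return pages;
    }
}
```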

} else {
// Unable to acquire other partitions.
// Probably we will introduce Global state in the future but for now, we don't expect to reach here.
throw new RuntimeException("Unable to acquire other partitions. " +
Member:
Maybe print out the partitionType here if we get to this point.

Contributor Author:
Added to the exception message.

if(leaderPartition != null) {
// Extend the timeout
// will always be a leader until shutdown
coordinator.saveProgressStateForPartition(leaderPartition, Duration.ofMinutes(DEFAULT_EXTEND_LEASE_MINUTES));
Member:
Should catch exceptions that can come from this call so your thread doesn't shut down.

Contributor Author:
Wrapped this statement in a try/catch now. I don't see that this method throws any exception, though!

Member:
It won't most of the time, but it can. This was a bug in Dynamo at one time (#4850).

Member:
The Dynamo store hit a 5xx.

Contributor Author:
Thank you for clarifying 👍
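The guard discussed in this thread might look like the following sketch. The Coordinator interface here is reduced to just the method quoted above, and the boolean return value is an assumption added for testability; the real leader loop simply logs and continues.

```java
import java.time.Duration;

// Sketch: catch runtime failures from the coordination store (e.g. a transient
// 5xx) so the leader thread keeps running instead of dying mid-loop.
class LeaseExtender {
    interface Coordinator {
        void saveProgressStateForPartition(Object partition, Duration extendBy);
    }

    static boolean tryExtendLease(Coordinator coordinator, Object leaderPartition, Duration extendBy) {
        try {
            coordinator.saveProgressStateForPartition(leaderPartition, extendBy);
            return true;
        } catch (RuntimeException e) {
            // Log and carry on; the next loop iteration will retry.
            return false;
        }
    }
}
```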

@san81 san81 requested a review from sb2k16 as a code owner October 22, 2024 22:40
@san81 san81 requested a review from dlvenable October 23, 2024 00:18
graytaylor0 previously approved these changes Oct 23, 2024
this.buffer = buffer;

boolean isPartitionCreated = coordinator.createPartition(new LeaderPartition());
log.info("Leader partition creation status: {}", isPartitionCreated);
Member:
This will result in one of these logs whenever a new Data Prepper instance starts:

Leader partition creation status: false

It isn't really helpful and can be debug level.

processPartition(partition.get(), buffer, sourceConfig);

} else {
log.info("No partition available. Going to Sleep for a while ");
Member:
This may also be a little noisy. Maybe a metric tracking this would be better?

Iterator<ItemInfo> listItems();


void setLastPollTime(long lastPollTime);
Collaborator:
Please add javadoc comments for all API

Contributor Author:
added 👍

@@ -0,0 +1,81 @@
package org.opensearch.dataprepper.plugins.source.saas_crawler.base;
Collaborator:
Do you not see a need for a common interface for all crawlers? I was thinking there would be an interface with some default implementation and so on.

Contributor Author:
The Crawler relies on a source-plugin-specific iterator implementation and dispatches the work to a source-plugin-specific client implementation. The Crawler itself holds the generic pagination logic.
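A minimal sketch of that split follows. The interface shapes are assumptions distilled from the snippets quoted in this PR; the real Crawler also tracks poll times and creates partitions.

```java
import java.util.Iterator;

// Sketch: the client (and its iterator) is source-plugin specific, while the
// crawl loop below stands in for the Crawler's generic pagination logic.
class CrawlerSketch {
    interface ItemInfo {
        String getItemId();
    }

    interface CrawlerClient {
        Iterator<ItemInfo> listItems(); // plugin-specific item listing
    }

    static int crawl(CrawlerClient client) {
        Iterator<ItemInfo> it = client.listItems();
        int count = 0;
        while (it.hasNext()) { // generic draining/pagination loop
            it.next();
            count++;
        }
        return count;
    }
}
```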

}
itemInfoList.add(nextItem);
Map<String, String> metadata = nextItem.getMetadata();
long niCreated = Long.parseLong(metadata.get(CREATED)!=null? metadata.get(CREATED):"0");
Collaborator:
Should the fallback value be 0 or the current time?

createPartition(itemInfoList, coordinator);
}while (itemInfoIterator.hasNext());
log.debug("Crawling completed in {} ms", System.currentTimeMillis() - startTime);
return updatedPollTime != 0 ? updatedPollTime : startTime;
Collaborator:
Oh, looks like it is falling back to current time here.

Contributor Author:
yes

private final Crawler crawler;


@DataPrepperPluginConstructor
Collaborator:
I do not think you should use @DataPrepperPluginConstructor on abstract source classes. See ./data-prepper-plugins/http-source-common/src/main/java/org/opensearch/dataprepper/http/BaseHttpSource.java

Contributor Author:
Agree. No use of this annotation here. Removed it.


@Override
public boolean areAcknowledgementsEnabled() {
return Source.super.areAcknowledgementsEnabled();
Collaborator:
Not a good idea. This should be left to the derived classes.

Contributor Author:
Agree. Removed this method. Each source plugin will implement their own version.

* contents itself which can be used to apply regex filtering, change data capture etc. general
* assumption here is that fetching metadata should be faster than fetching entire Item
*/
Map<String, String> metadata;
Collaborator:
Probably better to make it Map<String, Object>; it is unrealistic to expect all of the metadata values to be Strings.

Contributor Author:
Considering that each source may have different needs, I agree it should be more generic. Converted the map as suggested.

private boolean initialized = false;

@JsonProperty("last_poll_time")
private Long lastPollTime;
Member:
Make this a Java Instant.

Contributor Author:
I had to introduce a custom deserializer and additional Jackson dependencies, but yes, I changed the type to Instant now.
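Assuming the previous Long held epoch milliseconds (consistent with the System.currentTimeMillis() usage elsewhere in the PR), the conversion that such a deserializer performs boils down to the following; the class and method names are illustrative, not the actual Jackson deserializer.

```java
import java.time.Instant;

// Sketch: the core Long <-> Instant mapping behind the custom deserializer.
class PollTimeConversion {
    static Instant fromEpochMillis(long lastPollTimeMillis) {
        return Instant.ofEpochMilli(lastPollTimeMillis);
    }

    static long toEpochMillis(Instant lastPollTime) {
        return lastPollTime.toEpochMilli();
    }
}
```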

}
createPartition(itemInfoList, coordinator);
}while (itemInfoIterator.hasNext());
log.debug("Crawling completed in {} ms", System.currentTimeMillis() - startTime);
Member:
Should we be reporting a metric for these times? You can use Micrometer's Timer for this.

Contributor Author (san81) commented Oct 25, 2024:

We are already capturing this at the Jira source level, but I added a Timer here as well now.

* log events to keep and which ones to discard.
*/
@NonNull
Long eventTime;
Member:
Use an Instant here.

Contributor Author:
Converted



@Getter
public abstract class ItemInfo {
Member:
This should be a Java interface. It is more flexible and allows for adapting rather than having to convert.

public interface ItemInfo {
  String getItemId();
  Map<String, Object> getMetadata();
  ...
}

Contributor Author:
Converted

@Named
public class SaasPluginExecutorServiceProvider {
private static final Logger log = LoggerFactory.getLogger(SaasPluginExecutorServiceProvider.class);
public static final int DEFAULT_THREAD_COUNT = 50;
Member:
Suggested change
public static final int DEFAULT_THREAD_COUNT = 50;
private static final int DEFAULT_THREAD_COUNT = 50;

Contributor Author:
changed to private
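Putting the pieces from this thread together, the provider might look like the following sketch; the getter name and the test-only constructor are assumptions, and the @Named annotation from the quoted code is omitted to keep the sketch dependency-free.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: shared fixed-size executor for SaaS plugin work, with the thread
// count kept private per the review comment.
class SaasPluginExecutorServiceProvider {
    private static final int DEFAULT_THREAD_COUNT = 50;
    private final ExecutorService executorService;

    SaasPluginExecutorServiceProvider() {
        this.executorService = Executors.newFixedThreadPool(DEFAULT_THREAD_COUNT);
    }

    // Constructor for testing: inject a caller-controlled executor.
    SaasPluginExecutorServiceProvider(ExecutorService executorService) {
        this.executorService = executorService;
    }

    ExecutorService get() {
        return executorService;
    }
}
```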

annotationProcessor 'org.projectlombok:lombok:1.18.30'
}

test {
Member:
You don't need these lines.

Contributor Author:
Removed

@@ -0,0 +1,30 @@
plugins {
Member:
You don't need these lines.

Contributor Author:
I thought we needed these lines when it is a library. I don't see any issue after removing them, so removed 👍

pluginType = Source.class,
packagesToScan = {SaasCrawlerApplicationContextMarker.class, JiraSource.class}
)
public class JiraSource implements Source<Record<Event>> {
Member:
Is this supposed to inherit from SaasSourcePlugin?

Contributor Author:
As we discussed, yes, it will. But I didn't do that here, to limit the size of this PR.

implementation 'com.fasterxml.jackson.core:jackson-databind'
implementation 'javax.inject:javax.inject:1'
implementation 'org.springframework:spring-web:5.3.39'
implementation 'org.springframework.retry:spring-retry:1.3.4'
Member:
Do we need this dependency? I don't see it in use.

Contributor Author:
As we discussed, I am relying on this library in the code. I removed it from this PR and will add it back when I add the Jira source code 👍

Development

Successfully merging this pull request may close these issues.

Jira Connector - to seamlessly sync all the ticket details to OpenSearch
5 participants