Add draft of S3 avatars implemented as JiraHomeSource wrapper #168

pczuj · 2023-05-26T12:25:37Z

This doesn't really work without FilestoreConfigAppender and s3StorageProvider implementations. It also wasn't really tested, however it's a demonstration of how the feature could be implemented using polymorphism on top of JiraHomeSource.

FilestoreConfigAppender interface is rather poor. Should probably be refined into something better modeled.

I used aws s3 sync, as it should better fit our use case where we actually want to ensure all files are present in S3 storage.
I used StorageLocation as it knows how the URI should look like and stores related AWS region.
Not sure if aws configure set default.s3.max_concurrent_requests 300 can replace the usage of xargs to split the processes. I think it's worth testing, as this approach may be simpler.

Feel free to rebuild this code and reuse it's elements elsewhere.

pczuj · 2023-05-26T12:35:46Z

Context: In Hadrons team during today's standup @dagguh suggested to show the suggestion as a code.

I didn't manage to create complete code as I had some gaps in my knowledge about the feature:

What is "target" in the <association> of filestore-config.xml? Is it always the same as the directory name under <jirahome>/data?
Can there be reasonably more than one <filestores> in filestore-config.xml?
Is the filestore-config.xml used for other features?
Could we just upload everything in <jirahome>/data directory to S3?
What are the differences between avatars and attachments that are affecting the upload+config logic for them? From the code analysis I wasn't able to spot anything but the directory name.

MatthewCochrane · 2023-05-29T04:50:39Z

src/main/kotlin/com/atlassian/performance/tools/awsinfrastructure/api/jira/AvatarsInS3.kt

+        ssh: SshConnection
+    ) = delegate
+        .download(ssh)
+        .also { jiraHomeLocation ->


How would we ensure that uploadToS3 runs only on the shared home (NFS) and that updateFilestoreConfig runs only on application nodes?

Could we not ensure that? I was thinking if aws s3 sync wouldn't resolve that problem for us. sync executed twice shouldn't duplicate any work.

MatthewCochrane · 2023-05-29T05:40:11Z

Thanks for taking a look and writing this example!
Just so I understand, in the DataCenterFormula interface that I proposed, we pass an EnumSet of options into DataCenterFormula when it's constructed. In this polymorphic approach does that mean we'd define combinations of JiraHomeSource's? For example AvatarsInS3, AttachmentsInS3 and AvatarsAndAttachmentsInS3 and pass that into the DataCenterFormula constructor?

In this approach, how would DataCenterFormula would know whether or not to create the new stack for the bucket? At the moment it checks to see if the EnumSet is empty.

To answer your questions:

What is "target" in the of filestore-config.xml? Is it always the same as the directory name under /data?

Target is the the target of that association. It can be either avatars or attachments and indicates to jira that it should look for that target on the file-store specified. https://hello.atlassian.net/wiki/spaces/SYNERGY/pages/2480715686/Simplified+Configuration+Format

Can there be reasonably more than one in filestore-config.xml?

Yep. Though I don't see any need to configure a second one for the JPT tests at this point.

Is the filestore-config.xml used for other features?

It's only used to tell Jira where to look for avatars/attachments at this point. Could be extended if we store other data there in the future, though there are no projects lined up to do that at this point (nor have I heard discussion to do it).

Could we just upload everything in /data directory to S3?

That is possible. Not sure of the advantage of that though.

What are the differences between avatars and attachments that are affecting the upload+config logic for them? From the code analysis I wasn't able to spot anything but the directory name.

They get configured differently in filestores-config.xml (they have different targets). Also the uploading logic is different. To get reasonable performance I had to split up avatars into different directories so I can run parallel jobs with XARGS.

Also note that I did try using aws s3 sync and max_concurrent_requests but it was still too slow. I was getting about ~1.6MiB/s and the attachments in ENTERPRISE_UNIMODAL are about 5.9GB so that would take about an hour to copy whereas the xargs copy took about 3.5 mins. I didn't set the max_concurrent_requests quite as high though.

If the issue with the Enum is that it's in API and adding new items can break backward compatibility, just brainstorming, some other approaches could be:

use two booleans (a bit ugly but adding new ones won't break API compatibility)
limit the scope of this enum to just avatars/attachments (this could mean renaming it to be more specific). If we did move other items to S3 we'd use a different parameter (note: although it's possible to see an extension to what is stored in S3, it is unlikely to happen for quite a long time (ie years) as there's be no discussion of it so far).

pczuj · 2023-05-29T10:12:36Z

Not sure of the advantage of (uploading everything in /data directory to S3) though.

The desired advantage would be simplicity of the implementation which would also affect maintainability of the code around the feature. In the future we'd like to refactor DataCenterFormula and having every component of it as simple as possible will definitely help.

I didn't set the max_concurrent_requests quite as high though.

In the original bash script we are using 30 xarg processes and the default max_concurrent_requests is 10. This is why I used 30 * 10 = 300.

Not sure if 300 is allowed, as I can't find any documented boundry for that property. I think it's worth checking if setting it to some high value wouldn't be good enough, because the complexity of the code is drastically reduced and we are relying on dedicated feature for the purpose, so we can easily rely on automated tests done by AWS cli devs.

(...) other approaches could be

use two booleans (a bit ugly but adding new ones won't break API compatibility)

I think that doesn't need to be ugly and it may be the simple solution that we need. We could have a single object which would hold the configuration of the feature including 2 booleans. Not very OOP approach, but it should be good enough for me.

pczuj · 2023-05-29T10:30:06Z

Just so I understand, in the DataCenterFormula interface that I proposed, we pass an EnumSet of options into DataCenterFormula when it's constructed. In this polymorphic approach does that mean we'd define combinations of JiraHomeSource's? For example AvatarsInS3, AttachmentsInS3 and AvatarsAndAttachmentsInS3 and pass that into the DataCenterFormula constructor?

Yes, the usage would be to wrap JiraHomeSource passed to DataCenterFormula in our "S3 storing" wrapper. We could have different wrappers for each artifact (AvatarsInS3, AttachmentsInS3) as I suggested in example code I made, however we could also just have a single class for the whole feature, e.g. S3JiraHome. This single class could have the 2 booleans enabled in constructor/builder.

You can check JiraUserPasswordOverridingDatabase for similar usage pattern. We are using it like this, e.g. here

The problem with this approach would be that the API discoverability is not the best. When we model DataCenterFormula as facade of features it's easy to just browse though the available fields of it's Builder, while for the wrapper approach you need to browse though the whole API package. This is why I think the simple "two booleans" approach you suggested may be better (at least until we'd refactor DataCenterFormula to not be an ever growing facade)

MatthewCochrane · 2023-05-30T06:25:46Z

Not sure if 300 is allowed, as I can't find any documented boundry for that property. I think it's worth checking if setting it to some high value wouldn't be good enough, because the complexity of the code is drastically reduced and we are relying on dedicated feature for the purpose, so we can easily rely on automated tests done by AWS cli devs.

I'll give it a try using aws sync and aws cp with max_concurrent_requests set to 300 and see how it goes!

I think that doesn't need to be ugly and it may be the simple solution that we need. We could have a single object which would hold the configuration of the feature including 2 booleans. Not very OOP approach, but it should be good enough for me.

Ok, sounds good!

MatthewCochrane · 2023-05-30T07:30:10Z

Yes, the usage would be to wrap JiraHomeSource passed to DataCenterFormula in our "S3 storing" wrapper. We could have different wrappers for each artifact (AvatarsInS3, AttachmentsInS3) as I suggested in example code I made, however we could also just have a single class for the whole feature, e.g. S3JiraHome. This single class could have the 2 booleans enabled in constructor/builder.
You can check JiraUserPasswordOverridingDatabase for similar usage pattern. We are using it like this, e.g. here

Ok thanks for the explanation! That makes sense. I see most of the pieces to make this work with the approach you outlined. There are still a couple missing though. The main one is probably determining which node we're on. Since you mentioned the two-boolean approach looks good-enough I'll update to that for now.

Add draft of S3 avatars implemented as JiraHomeSource wrapper

d013872

pczuj requested a review from MatthewCochrane May 26, 2023 12:25

pczuj mentioned this pull request May 26, 2023

Allow storing Avatars/Attachments in S3 #166

Open

MatthewCochrane reviewed May 29, 2023

View reviewed changes

pczuj closed this May 30, 2023

pczuj deleted the fsm-s3-polymorphic-suggestion branch May 30, 2023 09:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add draft of S3 avatars implemented as JiraHomeSource wrapper #168

Add draft of S3 avatars implemented as JiraHomeSource wrapper #168

pczuj commented May 26, 2023 •

edited

Loading

pczuj commented May 26, 2023

MatthewCochrane May 29, 2023 •

edited

Loading

pczuj May 30, 2023

MatthewCochrane commented May 29, 2023

pczuj commented May 29, 2023

pczuj commented May 29, 2023 •

edited

Loading

MatthewCochrane commented May 30, 2023

MatthewCochrane commented May 30, 2023

Add draft of S3 avatars implemented as JiraHomeSource wrapper #168

Add draft of S3 avatars implemented as JiraHomeSource wrapper #168

Conversation

pczuj commented May 26, 2023 • edited Loading

pczuj commented May 26, 2023

MatthewCochrane May 29, 2023 • edited Loading

Choose a reason for hiding this comment

pczuj May 30, 2023

Choose a reason for hiding this comment

MatthewCochrane commented May 29, 2023

pczuj commented May 29, 2023

pczuj commented May 29, 2023 • edited Loading

MatthewCochrane commented May 30, 2023

MatthewCochrane commented May 30, 2023

pczuj commented May 26, 2023 •

edited

Loading

MatthewCochrane May 29, 2023 •

edited

Loading

pczuj commented May 29, 2023 •

edited

Loading