Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add draft of S3 avatars implemented as JiraHomeSource wrapper #168

Closed
wants to merge 1 commit into from

Conversation

pczuj
Copy link
Contributor

@pczuj pczuj commented May 26, 2023

This doesn't really work without FilestoreConfigAppender and s3StorageProvider implementations. It also wasn't really tested, however it's a demonstration of how the feature could be implemented using polymorphism on top of JiraHomeSource.

FilestoreConfigAppender interface is rather poor. Should probably be refined into something better modeled.

I used aws s3 sync, as it should better fit our use case where we actually want to ensure all files are present in S3 storage.
I used StorageLocation as it knows how the URI should look like and stores related AWS region.
Not sure if aws configure set default.s3.max_concurrent_requests 300 can replace the usage of xargs to split the processes. I think it's worth testing, as this approach may be simpler.

Feel free to rebuild this code and reuse it's elements elsewhere.

@pczuj pczuj requested a review from MatthewCochrane May 26, 2023 12:25
@pczuj
Copy link
Contributor Author

pczuj commented May 26, 2023

Context: In Hadrons team during today's standup @dagguh suggested to show the suggestion as a code.

I didn't manage to create complete code as I had some gaps in my knowledge about the feature:

  • What is "target" in the <association> of filestore-config.xml? Is it always the same as the directory name under <jirahome>/data?
  • Can there be reasonably more than one <filestores> in filestore-config.xml?
  • Is the filestore-config.xml used for other features?
  • Could we just upload everything in <jirahome>/data directory to S3?
  • What are the differences between avatars and attachments that are affecting the upload+config logic for them? From the code analysis I wasn't able to spot anything but the directory name.

ssh: SshConnection
) = delegate
.download(ssh)
.also { jiraHomeLocation ->
Copy link

@MatthewCochrane MatthewCochrane May 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would we ensure that uploadToS3 runs only on the shared home (NFS) and that updateFilestoreConfig runs only on application nodes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we not ensure that? I was thinking if aws s3 sync wouldn't resolve that problem for us. sync executed twice shouldn't duplicate any work.

@MatthewCochrane
Copy link

Thanks for taking a look and writing this example!
Just so I understand, in the DataCenterFormula interface that I proposed, we pass an EnumSet of options into DataCenterFormula when it's constructed. In this polymorphic approach does that mean we'd define combinations of JiraHomeSource's? For example AvatarsInS3, AttachmentsInS3 and AvatarsAndAttachmentsInS3 and pass that into the DataCenterFormula constructor?

In this approach, how would DataCenterFormula would know whether or not to create the new stack for the bucket? At the moment it checks to see if the EnumSet is empty.

To answer your questions:

What is "target" in the of filestore-config.xml? Is it always the same as the directory name under /data?

Target is the the target of that association. It can be either avatars or attachments and indicates to jira that it should look for that target on the file-store specified. https://hello.atlassian.net/wiki/spaces/SYNERGY/pages/2480715686/Simplified+Configuration+Format

Can there be reasonably more than one in filestore-config.xml?

Yep. Though I don't see any need to configure a second one for the JPT tests at this point.

Is the filestore-config.xml used for other features?

It's only used to tell Jira where to look for avatars/attachments at this point. Could be extended if we store other data there in the future, though there are no projects lined up to do that at this point (nor have I heard discussion to do it).

Could we just upload everything in /data directory to S3?

That is possible. Not sure of the advantage of that though.

What are the differences between avatars and attachments that are affecting the upload+config logic for them? From the code analysis I wasn't able to spot anything but the directory name.

They get configured differently in filestores-config.xml (they have different targets). Also the uploading logic is different. To get reasonable performance I had to split up avatars into different directories so I can run parallel jobs with XARGS.

Also note that I did try using aws s3 sync and max_concurrent_requests but it was still too slow. I was getting about ~1.6MiB/s and the attachments in ENTERPRISE_UNIMODAL are about 5.9GB so that would take about an hour to copy whereas the xargs copy took about 3.5 mins. I didn't set the max_concurrent_requests quite as high though.

If the issue with the Enum is that it's in API and adding new items can break backward compatibility, just brainstorming, some other approaches could be:

  • use two booleans (a bit ugly but adding new ones won't break API compatibility)
  • limit the scope of this enum to just avatars/attachments (this could mean renaming it to be more specific). If we did move other items to S3 we'd use a different parameter (note: although it's possible to see an extension to what is stored in S3, it is unlikely to happen for quite a long time (ie years) as there's be no discussion of it so far).

@pczuj
Copy link
Contributor Author

pczuj commented May 29, 2023

Not sure of the advantage of (uploading everything in /data directory to S3) though.

The desired advantage would be simplicity of the implementation which would also affect maintainability of the code around the feature. In the future we'd like to refactor DataCenterFormula and having every component of it as simple as possible will definitely help.

I didn't set the max_concurrent_requests quite as high though.

In the original bash script we are using 30 xarg processes and the default max_concurrent_requests is 10. This is why I used 30 * 10 = 300.

Not sure if 300 is allowed, as I can't find any documented boundry for that property. I think it's worth checking if setting it to some high value wouldn't be good enough, because the complexity of the code is drastically reduced and we are relying on dedicated feature for the purpose, so we can easily rely on automated tests done by AWS cli devs.

(...) other approaches could be

  • use two booleans (a bit ugly but adding new ones won't break API compatibility)

I think that doesn't need to be ugly and it may be the simple solution that we need. We could have a single object which would hold the configuration of the feature including 2 booleans. Not very OOP approach, but it should be good enough for me.

@pczuj
Copy link
Contributor Author

pczuj commented May 29, 2023

Just so I understand, in the DataCenterFormula interface that I proposed, we pass an EnumSet of options into DataCenterFormula when it's constructed. In this polymorphic approach does that mean we'd define combinations of JiraHomeSource's? For example AvatarsInS3, AttachmentsInS3 and AvatarsAndAttachmentsInS3 and pass that into the DataCenterFormula constructor?

Yes, the usage would be to wrap JiraHomeSource passed to DataCenterFormula in our "S3 storing" wrapper. We could have different wrappers for each artifact (AvatarsInS3, AttachmentsInS3) as I suggested in example code I made, however we could also just have a single class for the whole feature, e.g. S3JiraHome. This single class could have the 2 booleans enabled in constructor/builder.

You can check JiraUserPasswordOverridingDatabase for similar usage pattern. We are using it like this, e.g. here

The problem with this approach would be that the API discoverability is not the best. When we model DataCenterFormula as facade of features it's easy to just browse though the available fields of it's Builder, while for the wrapper approach you need to browse though the whole API package. This is why I think the simple "two booleans" approach you suggested may be better (at least until we'd refactor DataCenterFormula to not be an ever growing facade)

@MatthewCochrane
Copy link

Not sure if 300 is allowed, as I can't find any documented boundry for that property. I think it's worth checking if setting it to some high value wouldn't be good enough, because the complexity of the code is drastically reduced and we are relying on dedicated feature for the purpose, so we can easily rely on automated tests done by AWS cli devs.

I'll give it a try using aws sync and aws cp with max_concurrent_requests set to 300 and see how it goes!

I think that doesn't need to be ugly and it may be the simple solution that we need. We could have a single object which would hold the configuration of the feature including 2 booleans. Not very OOP approach, but it should be good enough for me.

Ok, sounds good!

@MatthewCochrane
Copy link

Yes, the usage would be to wrap JiraHomeSource passed to DataCenterFormula in our "S3 storing" wrapper. We could have different wrappers for each artifact (AvatarsInS3, AttachmentsInS3) as I suggested in example code I made, however we could also just have a single class for the whole feature, e.g. S3JiraHome. This single class could have the 2 booleans enabled in constructor/builder.
You can check JiraUserPasswordOverridingDatabase for similar usage pattern. We are using it like this, e.g. here

Ok thanks for the explanation! That makes sense. I see most of the pieces to make this work with the approach you outlined. There are still a couple missing though. The main one is probably determining which node we're on. Since you mentioned the two-boolean approach looks good-enough I'll update to that for now.

@pczuj pczuj closed this May 30, 2023
@pczuj pczuj deleted the fsm-s3-polymorphic-suggestion branch May 30, 2023 09:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants