Allow using event fields in s3 sink object_key #3310
Comments
@cameronattard, thank you for this suggestion. I think this could be a useful feature and could allow for Hive-style partitioning, which is useful with use cases such as Amazon Athena. https://docs.aws.amazon.com/athena/latest/ug/partitions.html One difficulty with this solution is that we would also need to route events to the desired object and have multiple objects "in-flight". This could work quite nicely with the new multipart buffer. Would you be interested in taking this up?
@dlvenable thanks for the feedback. Unfortunately I have neither the expertise nor the bandwidth to implement this.
@dlvenable, it looks like the ask here is that we make the |
@kkondaka, that is the basic ask, yes. However, it is somewhat more complicated because the S3 sink will need to keep multiple S3 objects open and group events into those objects. For example, if the pattern includes the timestamp's year, month, and day, then we must group the events into different objects corresponding to each event's timestamp, not the current timestamp. Also, we should consider how this intersects with the thresholds. Should the thresholds be applied per group, or for the entire sink? The per-group approach is natural, but could lead to memory issues as the sink could have dozens of groups.
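To make the grouping question concrete, here is a minimal Python sketch (not Data Prepper code) of routing events into per-key groups with a per-group threshold. The key pattern, the `hostname` and `timestamp` field names, and the `GroupedBuffer` class are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone


def object_key_for(event, pattern="logs/{hostname}/{year}/{month}/{day}"):
    """Build a key from the event's own fields (hypothetical pattern syntax)."""
    # Use the event's timestamp (epoch seconds), not the current time.
    ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
    return pattern.format(
        hostname=event["hostname"],
        year=f"{ts:%Y}",
        month=f"{ts:%m}",
        day=f"{ts:%d}",
    )


class GroupedBuffer:
    """Keeps one in-flight group per object key, with a per-group threshold."""

    def __init__(self, max_events_per_group):
        self.max_events_per_group = max_events_per_group
        self.groups = defaultdict(list)  # key -> events still in flight
        self.flushed = []                # (key, events) pairs ready to write

    def add(self, event):
        key = object_key_for(event)
        self.groups[key].append(event)
        # Per-group threshold: only the group that filled up is flushed.
        if len(self.groups[key]) >= self.max_events_per_group:
            self.flushed.append((key, self.groups.pop(key)))
```

In this sketch, memory use grows with the number of distinct keys in flight, which is the concern raised above about applying thresholds per group rather than across the whole sink.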
Also, Data Prepper should support Hadoop file system partitioning. For example, you can partition by a timestamp:
The example above will partition by the current time. But, we really want to partition by the timestamp. We will need some additional capability in Data Prepper to get part of a timestamp. Perhaps a date-time format method?
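As an illustration of the distinction between the two timestamps, the following Python sketch (not Data Prepper syntax) derives a Hive-style partition prefix from a timestamp. The function name and the `year=/month=/day=` layout are assumptions made for the example.

```python
from datetime import datetime, timezone


def hive_partition_prefix(when: datetime) -> str:
    """Format a Hive-style partition prefix from the given timestamp."""
    return f"year={when:%Y}/month={when:%m}/day={when:%d}/"


# Partitioning by the event's own timestamp (the behavior we want).
event_time = datetime.fromisoformat("2023-09-30T23:59:00+00:00")
event_prefix = hive_partition_prefix(event_time)

# Partitioning by the current time (what a sink-level date pattern does).
now_prefix = hive_partition_prefix(datetime.now(timezone.utc))
```

An event that arrives late would land in the partition for its original day under `event_prefix`, while `now_prefix` would place it in the partition for the day it was written.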
I created #3434 for the timestamp formatting. @cameronattard, if you are looking to use time formatting, please take a look and provide any feedback on that proposal. Thanks!
I should clarify that hostname is just a generic example. Ideally we should be able to inject any arbitrary event field into the object key.
@cameronattard of course. That's why I was suggesting adding support for expressions, so that any field or function can be part of the object name.
@dlvenable using expressions in the s3 sink config is a feature our project really needs. Can it also be applied to the s3 bucket name to support dynamic buckets extracted or constructed from the event?
@graytaylor0, is this resolved by #4346 and #4385?
Yes those add dynamic path_prefix and dynamic bucket support. They do not add support to configure the |
Hello, can you please add additional documentation and pipeline examples showing how to use this functionality? It is very useful, but I cannot work out the correct syntax, and paid AWS support does not know how to write one either.
I'm using it at the moment, here is an example:
Is your feature request related to a problem? Please describe.
Currently it seems like all objects from the s3 sink are sent using the same prefix, with only date-time being configurable. This means in order to retrieve a subset of events, e.g. logs from a specific hostname, you need to query all events for the time period.
Describe the solution you'd like
We would like to send events to different s3 object prefixes based on specific event fields, for example, hostname. This makes searching events in s3 simpler and cheaper as you can directly query the relevant subset of events.
Describe alternatives you've considered (Optional)
We could potentially use separate sinks for each subset of logs but this is not really dynamic or scalable.
Additional context
N/A