[Feature Request] Mapping Array to string #3963

Closed
Tracked by #3967
adplotzk opened this issue Jan 16, 2024 · 10 comments
Labels
plugin - processor A plugin to manipulate data in the data prepper pipeline.

Comments

@adplotzk

adplotzk commented Jan 16, 2024

This is a feature request for a processor that implements the following ETL functionality: mapping an array to a string.

#SOURCE

        "some_source": [
            {
                "src_key_1": "val_1"
            },
            {
                "src_key_1": "val_2"
            }
        ]

#TO

        "some_dest": {
            "dest_key_1": "val_1, val_2"
        }

#MAPPING

    - from_key: "some_source[]/src_key_1"
      to_key: "some_dest/dest_key_1"
      dest_type: "string" #Can also be "Array"
      delim: ","
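
To make the requested behavior concrete, here is a minimal Python sketch of the mapping. It is not Data Prepper code: the function `map_array_to_string` and its parameters are hypothetical and only mirror the `from_key`, `to_key`, `dest_type`, and `delim` options above. It assumes the list sits at a top-level key and that `[]/` separates the list path from the field name. The example passes `", "` as the delimiter to reproduce the spacing shown under #TO.

```python
def map_array_to_string(event, from_key, to_key, dest_type="string", delim=","):
    """Hypothetical helper: collect one field from each object in a list
    and write the result to a nested destination key."""
    # "some_source[]/src_key_1" -> list path "some_source", field "src_key_1"
    list_path, _, field = from_key.partition("[]/")
    values = [item[field] for item in event[list_path] if field in item]

    # dest_type "string" joins the values; anything else keeps the list.
    result = delim.join(str(v) for v in values) if dest_type == "string" else values

    # Walk (and create) the destination path, e.g. "some_dest/dest_key_1".
    *parents, leaf = to_key.split("/")
    node = event
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = result
    return event


event = {"some_source": [{"src_key_1": "val_1"}, {"src_key_1": "val_2"}]}
map_array_to_string(event, "some_source[]/src_key_1", "some_dest/dest_key_1", delim=", ")
print(event["some_dest"])  # {'dest_key_1': 'val_1, val_2'}
```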
@dlvenable dlvenable added plugin - processor A plugin to manipulate data in the data prepper pipeline. and removed untriaged labels Jan 16, 2024
@oeyh
Collaborator

oeyh commented Jan 19, 2024

I think this can be done in steps:

  1. Use a processor to access fields from a list of objects and put them under the original key. See "Combine values under the same key from a list of key-value pairs" (#3867) for the proposed solution.
     From:

         "some_source": [
           {
             "src_key_1": "val_1"
           },
           {
             "src_key_1": "val_2"
           }
         ]

     to:

         "some_dest": {
           "src_key_1": ["val_1", "val_2"]
         }

  2. Rename src_key_1 to dest_key_1 with the existing rename_keys processor, and we get:

         "some_dest": {
           "dest_key_1": ["val_1", "val_2"]
         }

  3. If necessary, join the array of strings with the specified delimiter to form a new string. This requires a new processor. And we get:

         "some_dest": {
           "dest_key_1": "val_1, val_2"
         }
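
The three steps above can be sketched in plain Python to show the intermediate shapes. This is only an illustration of the data flow, not how the processors are implemented; the function names `combine_values`, `rename_key`, and `join_values` are hypothetical.

```python
def combine_values(items):
    """Step 1: merge a list of objects into one map of key -> list of values."""
    combined = {}
    for item in items:
        for key, value in item.items():
            combined.setdefault(key, []).append(value)
    return combined


def rename_key(mapping, old, new):
    """Step 2: return a copy of the map with one key renamed."""
    return {(new if k == old else k): v for k, v in mapping.items()}


def join_values(mapping, key, delimiter):
    """Step 3: return a copy where the list under `key` is a delimited string."""
    return {**mapping, key: delimiter.join(mapping[key])}


source = [{"src_key_1": "val_1"}, {"src_key_1": "val_2"}]
step1 = combine_values(source)                        # {'src_key_1': ['val_1', 'val_2']}
step2 = rename_key(step1, "src_key_1", "dest_key_1")  # {'dest_key_1': ['val_1', 'val_2']}
step3 = join_values(step2, "dest_key_1", ", ")        # {'dest_key_1': 'val_1, val_2'}
print(step3)
```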

@oeyh
Collaborator

oeyh commented Jan 25, 2024

Regarding the new processor mentioned above to convert a list to a string, a proposed solution is a new list processor with a join action. Given a source key that points to a list, or to a map of key-value pairs whose values are all lists, it converts each list to a string.

The configuration would look like this:

  processor:
    - list:
        join:
          source: "mylist"
          delimiter: ","
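
As a rough sketch of the join semantics described here, the following Python function handles both accepted inputs: a list, or a map whose values are all lists. The name `join_source` is hypothetical, and rejecting other inputs with `TypeError` is an assumption, since the proposal does not say how invalid sources should be handled.

```python
def join_source(value, delimiter=","):
    """Hypothetical join action: list -> string, or map of lists -> map of strings."""
    if isinstance(value, list):
        return delimiter.join(str(v) for v in value)
    if isinstance(value, dict):
        # Assumption: a map is only valid when every value is a list.
        if not all(isinstance(v, list) for v in value.values()):
            raise TypeError("all values in the source map must be lists")
        return {k: delimiter.join(str(i) for i in v) for k, v in value.items()}
    raise TypeError("source must be a list or a map of lists")


print(join_source(["a", "b", "c"]))                      # a,b,c
print(join_source({"k1": ["x", "y"], "k2": ["z"]}, "|"))  # {'k1': 'x|y', 'k2': 'z'}
```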

@dlvenable
Member

Maybe for the new processor, we should give it a name that correctly indicates what it does when used as a verb. So perhaps make join the processor. Or call it list_to_string.

Do we need this extra nesting? Can it just be either of these?

processor:
- join:
     source: mylist
     delimiter: ','

or

processor:
- list_to_string:
     source: mylist
     delimiter: ','

@oeyh
Collaborator

oeyh commented Jan 29, 2024

That's my original thought as well. Had some discussion offline with @kkondaka and we're thinking about putting different list operations under the hood of a list processor. In this issue, it's a join operation for lists. In #3962, there's a copy operation for lists.

@dlvenable
Member

@oeyh , @kkondaka

What is the value of having different list operations if there is no significant common configuration shared? It seems to result in extra layering for users in the YAML.

I'd prefer to start with a cleaner pipeline YAML. Then we can work from there to a technical design. There might be some shared code, but we could handle that by sharing the same Gradle project.

@oeyh
Collaborator

oeyh commented Jan 31, 2024

@dlvenable I think the main value is to make it easier for users to locate a processor to use. If one wants to operate on lists/arrays, go to list processor to find it. It also makes processor naming cleaner, e.g. list_to_string, copy_list_items, append_to_list, delete_list_items can be join, copy, append, delete operations under list processor.

Now is also a good time to do it because we lack operations on lists and will likely add a couple to it soon.

@oeyh
Collaborator

oeyh commented Jan 31, 2024

Looking at some existing processor configs (e.g., aggregate), the list processor configuration would look like this:

processor:
    - list:
        source: "source-list"
        target: "target"  
        action:
          join:
            delimiter: ","
        process_when: ...

Does this make more sense? It would have some common options such as source, target, and process_when, plus one of the actions such as join (this issue) or copy (#3962), similar to the aggregate processor.
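
One way to picture this layout is a small dispatcher that reads the shared options at the root and hands the list to a single named action. The Python below is a sketch under assumptions, not Data Prepper's design: `run_list_processor`, `ACTIONS`, and both action functions are hypothetical, and the condition is modeled as a plain callable under a `when` key rather than a pipeline expression.

```python
def join_action(values, options):
    """Hypothetical join action: list -> delimited string."""
    return options.get("delimiter", ",").join(str(v) for v in values)


def copy_action(values, options):
    """Hypothetical copy action: shallow copy of the list."""
    return list(values)


ACTIONS = {"join": join_action, "copy": copy_action}


def run_list_processor(event, config):
    # Shared option: skip the event when the condition does not hold.
    when = config.get("when")
    if when is not None and not when(event):
        return event
    # Exactly one action is expected under "action".
    (name, options), = config["action"].items()
    # Shared options: read from "source", write to "target".
    event[config["target"]] = ACTIONS[name](event[config["source"]], options or {})
    return event


config = {
    "source": "source-list",
    "target": "target",
    "action": {"join": {"delimiter": ","}},
}
event = run_list_processor({"source-list": ["a", "b", "c"]}, config)
print(event["target"])  # a,b,c
```

In this shape only `delimiter` is specific to join; whether every list operation fits the same source/target pattern is the open question in this thread.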

@dlvenable
Member

@oeyh , I do think we can improve the experience for discovering processors, but that seems best accomplished by changes to the documentation page. Perhaps we can have a page to discover operations by specific types.

Right now, we already support list mutations in other processors. Adding and deleting items are already available for lists, right? Other processors are also often action-oriented: they are named for what they do. I think this could cause user confusion.

The updated proposal for list, with the source and target in the root, is an improvement. But I don't know if all list operations will operate this way.

It may be useful to outline all the expected operations of this "list" processor to see where they have commonality.

@dlvenable
Member

Also, the goal of the aggregate processor was somewhat unique. We very intentionally wanted pluggable actions because of requests to create custom aggregations. So users can define their own action for the aggregate processor without having to know a great deal about the internal implementation. I don't think that necessarily applies here because each action is doing something very different with lists.

@oeyh
Collaborator

oeyh commented Feb 12, 2024

Closed via #3867 and #4075

@oeyh oeyh closed this as completed Feb 12, 2024
@github-project-automation github-project-automation bot moved this from Unplanned to Done in Data Prepper Tracking Board Feb 12, 2024

3 participants