feat: parquet support #334

Merged
merged 22 commits into from
Nov 20, 2024
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -16,6 +16,7 @@ Types of changes

## [1.28.0]

- `Added` new action for parquet files (experimental feature)
- `Added` mock command to intercept HTTP requests/responses to a web service and apply maskings
- `Added` time in JSON logs generated by `--log-json` flag

71 changes: 71 additions & 0 deletions README.md
@@ -1439,6 +1439,77 @@ After executing the command with the correct configuration, here is the expected

[Return to list of masks](#possible-masks)

### Parsing Parquet files

Warning: Parquet support is still experimental. We are considering migrating this feature to a new dataconnector type in LINO, so it might be dropped from PIMO in a future release.

To mask data in a Parquet file, run the `parquet` command with the input file, the output file, and a masking configuration:

```bash
pimo parquet data.parquet maskedData.parquet --config masking.yml
```

#### Example

Assume the Parquet file `data.parquet` has the following table structure:

| agency       | agency_number | name  | account_type | account_number | annual_income |
|--------------|---------------|-------|--------------|----------------|---------------|
| NewYork      | 0032          | Doe   | classic      | 12345          | 50000         |
| SanFrancisco | 7894          | Smith | saving       | 67890          | 60000         |

#### Masking Configuration (`masking.yml`)

```yaml
version: "1"
seed: 42

masking:
  - selector:
      jsonpath: "agency_number" # mask agency_number column
    mask:
      template: '{{MaskRegex "[0-9]{4}$"}}'

  - selector:
      jsonpath: "name" # mask name column
    mask:
      randomChoiceInUri: "pimo://nameFR"

  - selector:
      jsonpath: "account_type" # mask account_type column
    mask:
      randomChoice:
        - "classic"
        - "saving"
        - "securitie"

  - selector:
      jsonpath: "account_number" # mask account_number column
    masks:
      - incremental:
          start: 1
          increment: 1
      - template: "{{.account_number}}"
```

#### Resulting Masked Parquet File

After executing the command:

```bash
pimo parquet data.parquet maskedData.parquet --config masking.yml
```

The `maskedData.parquet` file will contain the following masked data:

| agency       | agency_number | name    | account_type | account_number | annual_income |
|--------------|---------------|---------|--------------|----------------|---------------|
| NewYork      | 2308          | Rolande | saving       | 1              | 50000         |
| SanFrancisco | 9724          | Matéo   | securitie    | 2              | 60000         |

This example demonstrates how to mask specific columns using PIMO, applying random choices, regular expressions, and incremental masking.
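
To make the effect of each mask in this configuration more concrete, here is a small Go sketch that mimics the four rules on in-memory rows. It is only an illustration under assumptions, not PIMO's implementation: the `row` and `masker` types, the `randomDigits`, `choice` and `maskRow` helpers, and the two-entry name list standing in for `pimo://nameFR` are all hypothetical, and the random values it produces will not match the table above.

```go
package main

import (
	"fmt"
	"math/rand"
)

// row is a hypothetical in-memory stand-in for one Parquet record.
type row map[string]any

// masker holds the state shared across rows: a seeded random source
// and the counter backing the incremental mask.
type masker struct {
	rng  *rand.Rand
	next int
}

// newMasker seeds the random source and starts the counter at 1 (start: 1).
func newMasker(seed int64) *masker {
	return &masker{rng: rand.New(rand.NewSource(seed)), next: 1}
}

// randomDigits returns n random decimal digits, similar in spirit to "[0-9]{n}".
func (m *masker) randomDigits(n int) string {
	b := make([]byte, n)
	for i := range b {
		b[i] = byte('0' + m.rng.Intn(10))
	}
	return string(b)
}

// choice picks one element of values at random.
func (m *masker) choice(values []string) string {
	return values[m.rng.Intn(len(values))]
}

// maskRow returns a masked copy of r; columns without a rule are left untouched.
func (m *masker) maskRow(r row) row {
	out := make(row, len(r))
	for k, v := range r {
		out[k] = v
	}
	out["agency_number"] = m.randomDigits(4)
	// Placeholder list (assumption): the real names come from pimo://nameFR.
	out["name"] = m.choice([]string{"Rolande", "Matéo"})
	out["account_type"] = m.choice([]string{"classic", "saving", "securitie"})
	// Incremental mask (increment: 1), then rendered as text like the template step.
	out["account_number"] = fmt.Sprint(m.next)
	m.next++
	return out
}

func main() {
	m := newMasker(42)
	rows := []row{
		{"agency": "NewYork", "agency_number": "0032", "name": "Doe",
			"account_type": "classic", "account_number": "12345", "annual_income": 50000},
		{"agency": "SanFrancisco", "agency_number": "7894", "name": "Smith",
			"account_type": "saving", "account_number": "67890", "annual_income": 60000},
	}
	for _, r := range rows {
		fmt.Println(m.maskRow(r))
	}
}
```

In this sketch, `agency` and `annual_income` pass through unchanged because no rule selects them, and creating two maskers with the same seed yields the same masked values, which is the role the `seed` setting is assumed to play here.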

[Return to list of masks](#possible-masks)

## `pimo://` scheme

24 changes: 24 additions & 0 deletions cmd/pimo/main.go
@@ -74,6 +74,8 @@ var (
	serve             string
	maxBufferCapacity int
	profiling         string
	parquetInput      string
	parquetOutput     string
)

func main() {
@@ -187,6 +189,26 @@ There is NO WARRANTY, to the extent permitted by law.`, version, commit, buildDa
	xmlCmd.Flags().Int64VarP(&seedValue, "seed", "s", 0, "set seed")
	rootCmd.AddCommand(xmlCmd)

	// Add command for parquet transformer
	parquetCmd := &cobra.Command{
		Use:   "parquet input_parquet_file output_parquet_file",
		Short: "Parsing and masking a parquet file",
		Args:  cobra.ExactArgs(2),
		Run: func(cmd *cobra.Command, args []string) {
			initLog()
			if len(catchErrors) > 0 {
				skipLineOnError = true
				skipLogFile = catchErrors
			}
			parquetInput = args[0]
			parquetOutput = args[1]

			run(cmd)
		},
	}
	parquetCmd.Flags().Int64VarP(&seedValue, "seed", "s", 0, "set seed")
	rootCmd.AddCommand(parquetCmd)

	rootCmd.AddCommand(&cobra.Command{
		Use: "flow",
		Run: func(cmd *cobra.Command, args []string) {
@@ -254,6 +276,8 @@ func run(cmd *cobra.Command) {
		CachesToDump:  cachesToDump,
		CachesToLoad:  cachesToLoad,
		XMLCallback:   len(serve) > 0,
		ParquetInput:  parquetInput,
		ParquetOutput: parquetOutput,
	}

var pdef model.Definition
25 changes: 22 additions & 3 deletions go.mod
@@ -6,6 +6,7 @@ require (
github.com/CGI-FR/xixo v0.1.8
github.com/Masterminds/sprig/v3 v3.3.0
github.com/adrienaury/zeromdc v0.1.1
github.com/apache/arrow/go/v12 v12.0.1
github.com/capitalone/fpe v1.2.1
github.com/goccy/go-json v0.10.3
github.com/goccy/go-yaml v1.12.0
@@ -28,35 +29,53 @@ require (

require (
dario.cat/mergo v1.0.1 // indirect
github.com/JohnCGriffin/overflow v0.0.0-20211019200055-46fa312c352c // indirect
github.com/Masterminds/goutils v1.1.1 // indirect
github.com/Masterminds/semver/v3 v3.3.0 // indirect
github.com/andybalholm/brotli v1.1.0 // indirect
github.com/apache/thrift v0.16.0 // indirect
github.com/bahlo/generic-list-go v0.2.0 // indirect
github.com/buger/jsonparser v1.1.1 // indirect
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/fatih/color v1.13.0 // indirect
github.com/felixge/fgprof v0.9.3 // indirect
github.com/golang-jwt/jwt v3.2.2+incompatible // indirect
github.com/golang/protobuf v1.5.2 // indirect
github.com/golang/snappy v0.0.4 // indirect
github.com/google/flatbuffers v2.0.8+incompatible // indirect
github.com/google/gxui v0.0.0-20151028112939-f85e0a97b3a4 // indirect
github.com/google/pprof v0.0.0-20211214055906-6f57359322fd // indirect
github.com/google/uuid v1.6.0 // indirect
github.com/huandu/xstrings v1.5.0 // indirect
github.com/inconshreveable/mousetrap v1.1.0 // indirect
github.com/klauspost/asmfmt v1.3.2 // indirect
github.com/klauspost/compress v1.17.9 // indirect
github.com/klauspost/cpuid/v2 v2.0.9 // indirect
github.com/labstack/gommon v0.4.2 // indirect
github.com/mailru/easyjson v0.7.7 // indirect
github.com/mattn/go-colorable v0.1.13 // indirect
github.com/minio/asm2plan9s v0.0.0-20200509001527-cdd76441f9d8 // indirect
github.com/minio/c2goasm v0.0.0-20190812172519-36a3d3bbc4f3 // indirect
github.com/mitchellh/copystructure v1.2.0 // indirect
github.com/mitchellh/reflectwalk v1.0.2 // indirect
github.com/pierrec/lz4/v4 v4.1.21 // indirect
github.com/pmezard/go-difflib v1.0.0 // indirect
github.com/shopspring/decimal v1.4.0 // indirect
github.com/smartystreets/goconvey v1.6.4 // indirect
github.com/spf13/pflag v1.0.5 // indirect
github.com/valyala/bytebufferpool v1.0.0 // indirect
github.com/valyala/fasttemplate v1.2.2 // indirect
github.com/wk8/go-ordered-map/v2 v2.1.8 // indirect
golang.org/x/net v0.24.0 // indirect
github.com/zeebo/xxh3 v1.0.2 // indirect
golang.org/x/mod v0.19.0 // indirect
golang.org/x/net v0.27.0 // indirect
golang.org/x/sync v0.8.0 // indirect
golang.org/x/sys v0.25.0 // indirect
golang.org/x/time v0.5.0 // indirect
golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1 // indirect
gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127 // indirect
golang.org/x/tools v0.23.0 // indirect
golang.org/x/xerrors v0.0.0-20220609144429-65e65417b02f // indirect
google.golang.org/genproto v0.0.0-20200526211855-cb27e3aa2013 // indirect
google.golang.org/grpc v1.49.0 // indirect
google.golang.org/protobuf v1.34.2 // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect
)