feat: add mask segments #385

Merged
merged 3 commits into from
Jan 15, 2025
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -16,7 +16,8 @@ Types of changes

## [1.30.0]

- `Added` mask `partition` to handle fields containing different types of values by applying distinct transformations
- `Added` mask `partitions` to handle fields containing different types of values by applying distinct transformations
- `Added` mask `segments` to allow transformations on specific parts of a field's value using regular expressions to capture subgroups

## [1.29.1]

26 changes: 26 additions & 0 deletions README.md
@@ -166,6 +166,7 @@ The following types of masks can be used :
* [`pipe`](#pipe) is a mask to handle complex nested array structures, it can read an array as an object stream and process it with a sub-pipeline.
* [`apply`](#apply) process selected data with a sub-pipeline.
* [`partitions`](#partitions) will rely on conditions to identify specific cases.
* [`segments`](#segments) allows transformations on specific parts of a field's value, using the named capture groups of a regular expression.
* [`luhn`](#luhn) can generate valid numbers using the Luhn algorithm (e.g. french SIRET or SIREN).
* [`markov`](#markov) can generate pseudo text based on a sample text.
* [`findInCSV`](#findincsv) get one or multiple csv lines which matched with Json entry value from CSV files.
@@ -1099,6 +1100,31 @@ The partition mask will rely on conditions to identify specific cases and apply

[Return to list of masks](#possible-masks)

### Segments

[![Try it](https://img.shields.io/badge/-Try%20it%20in%20PIMO%20Play-brightgreen)](https://cgi-fr.github.io/pimo-play/#c=G4UwTgzglg9gdgLgAQCICMKBQBbAhhAayjgHMFNMkkBaJCEAGxAGMAXGMcq7pAKwngAHXKwAWyFFAAmWHnkJcedECWwg4rCIqVIwKkAA8JAPQAKACgD8pgDxNWrcBAB8AbQCC1AFoBdAN4AzAC+AJRWtlJQJFCabgAM1ACc-sEhACSyOrogggy4zCDaWfaOkEVZNEgAZlVo5RXcBCAAngBiYDDYAKJwwBKtrWgA+l0AcgDCAEoAmqYAKgCSAPKjQwDSXdOZDUpSnbjEEu4AQuMAIl2tAOIAEgsAUmsAMgCyo0umAIqTAMpzAKoANQA6gANaZebZZSLRTT1HS0Gp1Sg7JRNNodbq9fqDEYTGbzZarDZbFGo7h7PCHVBxNAAJgCABYAKwANgA7AAORJYIA&i=N4KABGBECWAmkC4oAUCCAhAwgRgEwGZIQBfIA)

The segments mask allows transformations on specific parts of a field's value. It uses a regular expression with named capture groups to split the value into segments, then applies a dedicated list of masks to each segment individually. Example configuration:

```yaml
- selector:
    jsonpath: "id"
  mask:
    segments:
      regex: "^P(?P<letters>[A-Z]{3})(?P<digits>[0-9]{3})$"
      replace:
        letters:
          - ff1:
              keyFromEnv: "FF1_ENCRYPTION_KEY"
              domain: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        digits:
          - ff1:
              keyFromEnv: "FF1_ENCRYPTION_KEY"
              domain: "0123456789"
```

[Return to list of masks](#possible-masks)

### FindInCSV

[![Try it](https://img.shields.io/badge/-Try%20it%20in%20PIMO%20Play-brightgreen)](https://cgi-fr.github.io/pimo-play/#c=G4UwTgzglg9gdgLgAQCICMKBQBbAhhAayjgHMFMkkBaJCEAGxAGMAXGMcyrpAKwngAOuFgAtkKYgDMYWbnkIRO3aklwATNUnGzlNScTUBJOAGEAygDUlyrgFcwUcSJYsBigPTuSUCCwB03qK2AEa2dGBM8CwgcP6R2O64YNje9IwQ7mgAnAAswUySkgDMAKwADGVoIADsIMElRbgATLgAHME5OVXtTUwAbO5guADu7llNTRX5Zbh91UVqJUwgTWhoM7jqOSWdIGp9TDmVZZJ9rdXuAjAEINjwfkwQwDo2lCAAHrisALLCTGIUV7cR7AZAAcgA3hCABQGD5IPyoAAqAE8BCAkBgAJRIAA+SHoMGG4CQAF9SWDAUC3rEwCjxFC-Cw0SAAPpockvV48L5MJJqazUkHgxkAOVw2Ax+MJxLAZIpVOpMRYdIZEL8cAlUpl4E51K4ipsH3RrD24mEVEY+BYVHgIC5NhEIHU4GQKtsIENyhVUGwbrAHswQA&i=N4KABGBEAuCeAOBTA+gRkgLigMwJYCdFIAacKAOwEMBbIrSAY0v1vIBNF9IQBfIA)
2 changes: 2 additions & 0 deletions internal/app/pimo/pimo.go
@@ -58,6 +58,7 @@ import (
"github.com/cgi-fr/pimo/pkg/regex"
"github.com/cgi-fr/pimo/pkg/remove"
"github.com/cgi-fr/pimo/pkg/replacement"
"github.com/cgi-fr/pimo/pkg/segment"
"github.com/cgi-fr/pimo/pkg/sequence"
"github.com/cgi-fr/pimo/pkg/sha3"
"github.com/cgi-fr/pimo/pkg/statistics"
@@ -345,6 +346,7 @@ func injectMaskFactories() []model.MaskFactory {
sha3.Factory,
apply.Factory,
partition.Factory,
segment.Factory,
}
}

8 changes: 7 additions & 1 deletion pkg/model/model.go
@@ -247,6 +247,11 @@ type PartitionType struct {
Then []MaskType `yaml:"then" json:"then" jsonschema_description:"list of masks to execute if the condition is active"`
}

type SegmentType struct {
	Regex   string                `yaml:"regex" json:"regex" jsonschema_description:"regex used to create segments using group captures, groups must be named"`
	Replace map[string][]MaskType `yaml:"replace" json:"replace" jsonschema_description:"list of masks to execute for each group"`
}
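
Since the schema description states that groups must be named, a configuration check along the following lines could catch `replace` keys that do not correspond to any named group. This is a sketch under assumptions: `segmentConfig` is a simplified stand-in for `model.SegmentType` (mask lists reduced to mask names), and `unknownGroups` is a hypothetical helper that does not exist in this PR.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// segmentConfig is a simplified stand-in for model.SegmentType:
// each mask list is reduced to a list of mask names.
type segmentConfig struct {
	Regex   string
	Replace map[string][]string
}

// unknownGroups (hypothetical) returns the keys of Replace that do not match
// any named capture group of Regex, sorted for stable output.
func unknownGroups(conf segmentConfig) ([]string, error) {
	re, err := regexp.Compile(conf.Regex)
	if err != nil {
		return nil, fmt.Errorf("invalid segments regex: %w", err)
	}

	named := map[string]bool{}
	for _, name := range re.SubexpNames() {
		if name != "" { // unnamed groups report an empty name
			named[name] = true
		}
	}

	var unknown []string
	for group := range conf.Replace {
		if !named[group] {
			unknown = append(unknown, group)
		}
	}
	sort.Strings(unknown)
	return unknown, nil
}

func main() {
	conf := segmentConfig{
		Regex: `^P(?P<letters>[A-Z]{3})(?P<digits>[0-9]{3})$`,
		Replace: map[string][]string{
			"letters": {"ff1"},
			"digit":   {"ff1"}, // deliberate typo: the group is named "digits"
		},
	}
	unknown, err := unknownGroups(conf)
	fmt.Println(unknown, err) // [digit] <nil>
}
```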

type MaskType struct {
Add Entry `yaml:"add,omitempty" json:"add,omitempty" jsonschema:"oneof_required=Add,title=Add Mask,description=Add a new field in the JSON stream"`
AddTransient Entry `yaml:"add-transient,omitempty" json:"add-transient,omitempty" jsonschema:"oneof_required=AddTransient,title=Add Transient Mask" jsonschema_description:"Add a new temporary field, that will not show in the JSON output"`
@@ -286,7 +291,8 @@ type MaskType struct {
Sequence SequenceType `yaml:"sequence,omitempty" json:"sequence,omitempty" jsonschema:"oneof_required=Sequence,title=Sequence Mask" jsonschema_description:"Generate a sequenced ID that follows specified format"`
Sha3 Sha3Type `yaml:"sha3,omitempty" json:"sha3,omitempty" jsonschema:"oneof_required=Sha3,title=Sha3 Mask" jsonschema_description:"Generate a variable-length crytographic hash (collision resistant)"`
Apply ApplyType `yaml:"apply,omitempty" json:"apply,omitempty" jsonschema:"oneof_required=Apply,title=Apply Mask" jsonschema_description:"Call external masking file"`
Partition []PartitionType `yaml:"partitions,omitempty" json:"partitions,omitempty" jsonschema:"oneof_required=Partition,title=Partition Mask" jsonschema_description:"Identify specific cases and apply a defined list of masks for each case"`
Partition []PartitionType `yaml:"partitions,omitempty" json:"partitions,omitempty" jsonschema:"oneof_required=Partition,title=Partitions Mask" jsonschema_description:"Identify specific cases and apply a defined list of masks for each case"`
Segment SegmentType `yaml:"segments,omitempty" json:"segments,omitempty" jsonschema:"oneof_required=Segment,title=Segments Mask" jsonschema_description:"Allow transformations on specific parts of a field's value"`
}

type Masking struct {
2 changes: 1 addition & 1 deletion pkg/partition/partition.go
@@ -94,7 +94,7 @@ func execPipeline(pipeline model.Pipeline, e model.Entry) (model.Entry, error) {
}

func (me MaskEngine) Mask(e model.Entry, context ...model.Dictionary) (model.Entry, error) {
log.Info().Msg("Mask partition")
log.Info().Msg("Mask partitions")

// exec all partitions
for _, partition := range me.partitions {
138 changes: 138 additions & 0 deletions pkg/segment/segment.go
@@ -0,0 +1,138 @@
package segment

import (
	"hash/fnv"
	"regexp"
	"strings"
	tmpl "text/template"

	"github.com/cgi-fr/pimo/pkg/model"
	"github.com/rs/zerolog/log"
)

// MaskEngine applies a dedicated pipeline to each named capture group of a regular expression
type MaskEngine struct {
	re        *regexp.Regexp
	pipelines map[string]model.Pipeline
	seed      int64
	seeder    model.Seeder
}

// buildDefinition wraps the masks in a masking definition whose selectors all target the root value (".")
func buildDefinition(masks []model.MaskType, globalSeed int64) model.Definition {
	definition := model.Definition{
		Version:   "1",
		Seed:      globalSeed,
		Functions: nil,
		Masking:   []model.Masking{},
		Caches:    nil,
	}

	for _, mask := range masks {
		definition.Masking = append(definition.Masking, model.Masking{
			Selector: model.SelectorType{Jsonpath: "."},
			Mask:     mask,
		})
	}

	return definition
}

// NewMask returns a MaskEngine that runs one pipeline per named capture group
func NewMask(segment model.SegmentType, caches map[string]model.Cache, fns tmpl.FuncMap, seed int64, seeder model.Seeder, seedField string) (MaskEngine, error) {
	// Compile (not MustCompile) so that an invalid regex in the configuration
	// surfaces as an error instead of a panic
	re, err := regexp.Compile(segment.Regex)
	if err != nil {
		return MaskEngine{}, err
	}

	pipelines := map[string]model.Pipeline{}

	for groupname, masks := range segment.Replace {
		definition := buildDefinition(masks, seed)
		pipeline := model.NewPipeline(nil)
		pipeline, _, err = model.BuildPipeline(pipeline, definition, caches, fns, "", "")
		if err != nil {
			return MaskEngine{}, err
		}

		pipelines[groupname] = pipeline
	}

	return MaskEngine{
		re:        re,
		pipelines: pipelines,
		seed:      seed,
		seeder:    seeder,
	}, nil
}

// replace rewrites each named capture group of `re` found in `value` with the
// result of the matching function in the `replacements` map; text captured by
// a group without an entry in `replacements` is dropped from the output
func replace(value string, re *regexp.Regexp, replacements map[string]func(string) (string, error)) (string, error) {
	result := &strings.Builder{}

	matchIndexes := re.FindStringSubmatchIndex(value)
	groupNames := re.SubexpNames()

	writeCount := 0
	for i := 2; i < len(matchIndexes); i += 2 {
		groupNumber := i / 2
		groupName := groupNames[groupNumber]
		startIndex := matchIndexes[i]
		endIndex := matchIndexes[i+1]

		// skip groups that did not take part in the match (index -1) or that
		// overlap text already written (nested groups), both would make the
		// slice expressions below panic
		if startIndex < writeCount {
			continue
		}

		capturedValue := value[startIndex:endIndex]

		result.WriteString(value[writeCount:startIndex])
		writeCount = endIndex

		if replacement, exists := replacements[groupName]; exists {
			masked, err := replacement(capturedValue)
			if err != nil {
				return value, err
			}
			result.WriteString(masked)
		}
	}
	result.WriteString(value[writeCount:])

	return result.String(), nil
}

// Mask applies the configured pipelines to the named groups captured in the entry
func (me MaskEngine) Mask(e model.Entry, context ...model.Dictionary) (model.Entry, error) {
	log.Info().Msg("Mask segments")

	replacements := map[string]func(string) (string, error){}

	for groupname, pipeline := range me.pipelines {
		pipeline := pipeline // per-iteration copy, required by the closure before Go 1.22

		replacements[groupname] = func(match string) (string, error) {
			var result []model.Entry
			err := pipeline.
				WithSource(model.NewSourceFromSlice([]model.Dictionary{model.NewDictionary().With(".", match)})).
				AddSink(model.NewSinkToSlice(&result)).
				Run()
			if err != nil {
				return match, err
			}
			return result[0].(string), nil
		}
	}

	result, err := replace(e.(string), me.re, replacements)
	if err != nil {
		return e, err
	}

	return result, nil
}

// Factory creates a segments mask from a configuration
func Factory(conf model.MaskFactoryConfiguration) (model.MaskEngine, bool, error) {
	if len(conf.Masking.Mask.Segment.Regex) > 0 {
		seeder := model.NewSeeder(conf.Masking.Seed.Field, conf.Seed)

		// derive a different seed for each jsonpath
		h := fnv.New64a()
		h.Write([]byte(conf.Masking.Selector.Jsonpath))
		conf.Seed += int64(h.Sum64()) //nolint:gosec
		mask, err := NewMask(conf.Masking.Mask.Segment, conf.Cache, conf.Functions, conf.Seed, seeder, conf.Masking.Seed.Field)
		if err != nil {
			return mask, true, err
		}
		return mask, true, nil
	}
	return nil, false, nil
}
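
`Factory` offsets the configured seed by the FNV-1a hash of the selector's jsonpath, so two masks configured on different fields do not share the same seed. The idea can be isolated in a few lines; `deriveSeed` is a hypothetical name used only for this sketch and does not exist in the PR.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// deriveSeed (hypothetical helper) mirrors the seed derivation in Factory:
// the global seed is offset by the 64-bit FNV-1a hash of the jsonpath.
func deriveSeed(globalSeed int64, jsonpath string) int64 {
	h := fnv.New64a()
	h.Write([]byte(jsonpath)) // Write on a hash.Hash never returns an error
	// the uint64 to int64 conversion and the addition may wrap around,
	// which is harmless for a seed
	return globalSeed + int64(h.Sum64())
}

func main() {
	// the same path always yields the same seed, so runs stay reproducible
	fmt.Println(deriveSeed(42, "id") == deriveSeed(42, "id")) // true

	// different paths are expected to yield different seeds
	fmt.Println(deriveSeed(42, "id"), deriveSeed(42, "name"))
}
```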
37 changes: 36 additions & 1 deletion schema/v1/pimo.schema.json
@@ -590,6 +590,12 @@
"partitions"
],
"title": "Partition"
},
{
"required": [
"segments"
],
"title": "Segment"
}
],
"properties": {
@@ -790,8 +796,13 @@
"$ref": "#/$defs/PartitionType"
},
"type": "array",
"title": "Partition Mask",
"title": "Partitions Mask",
"description": "Identify specific cases and apply a defined list of masks for each case"
},
"segments": {
"$ref": "#/$defs/SegmentType",
"title": "Segments Mask",
"description": "Allow transformations on specific parts of a field's value"
}
},
"additionalProperties": false,
@@ -1030,6 +1041,30 @@
"additionalProperties": false,
"type": "object"
},
"SegmentType": {
"properties": {
"regex": {
"type": "string",
"description": "regex used to create segments using group captures, groups must be named"
},
"replace": {
"additionalProperties": {
"items": {
"$ref": "#/$defs/MaskType"
},
"type": "array"
},
"type": "object",
"description": "list of masks to execute for each group"
}
},
"additionalProperties": false,
"type": "object",
"required": [
"regex",
"replace"
]
},
"SelectorType": {
"properties": {
"jsonpath": {
30 changes: 30 additions & 0 deletions test/suites/masking_segment.yml
@@ -0,0 +1,30 @@
name: segment mask
testcases:
  - name: simple segmentation test
    steps:
      - script: |-
          cat > masking.yml <<EOF
          version: "1"
          seed: 42
          masking:
            - selector:
                jsonpath: "id"
              mask:
                segments:
                  regex: "^P(?P<letters>[A-Z]{3})(?P<digits>[0-9]{3})$"
                  replace:
                    letters:
                      - ff1:
                          keyFromEnv: "FF1_ENCRYPTION_KEY"
                          domain: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                    digits:
                      - ff1:
                          keyFromEnv: "FF1_ENCRYPTION_KEY"
                          domain: "0123456789"
          EOF
      - script: |-
          echo '{"id": "PABC123"}' | FF1_ENCRYPTION_KEY="70NZ2NWAqk9/A21vBPxqlA==" pimo
        assertions:
          - result.code ShouldEqual 0
          - result.systemoutjson.id ShouldEqual PVBR675
          - result.systemerr ShouldBeEmpty