Fix data anonymizer failures by updating legacy parser. #82

Yury-Fridlyand · 2022-06-28T19:50:31Z

Signed-off-by: Yury Fridlyand <yuryf@bitquilltech.com>

Description

Don't print the query if an error occurred during anonymization:

opensearch-project-sql/legacy/src/main/java/org/opensearch/sql/legacy/utils/QueryDataAnonymizer.java

Lines 40 to 44 in d9d25ad

    } catch (Exception e) {
        LOG.warn("Caught an exception when anonymizing sensitive data");
        resultQuery = "";
    }
    return resultQuery;

Don't validate token types inside []; this fixes processing of functions with multiple fields:

opensearch-project-sql/legacy/src/main/java/org/opensearch/sql/legacy/parser/ElasticSqlExprParser.java

Lines 149 to 155 in d9d25ad

    /*
    if (lexer.token() != Token.IDENTIFIER && lexer.token() != Token.INDEX
            && lexer.token() != Token.LITERAL_CHARS && lexer.token() != Token.LITERAL_ALIAS) {
        throw new ParserException("All items between Brackets should be identifiers , got:"
                + lexer.token());
    }
    */

Don't process the MATCH ... AGAINST clause, because it is shadowed by the MATCH function. This fixes the failure on the MATCH function:

opensearch-project-sql/legacy/src/main/java/org/opensearch/sql/legacy/parser/ElasticSqlExprParser.java

Lines 501 to 550 in d9d25ad

    }/* else if ("MATCH".equalsIgnoreCase(ident)) {
        lexer.nextToken();
        MySqlMatchAgainstExpr matchAgainstExpr = new MySqlMatchAgainstExpr();

        if (lexer.token() == Token.RPAREN) {
            lexer.nextToken();
        } else {
            exprList(matchAgainstExpr.getColumns(), matchAgainstExpr);
            accept(Token.RPAREN);
        }

        acceptIdentifier("AGAINST");

        accept(Token.LPAREN);
        SQLExpr against = primary();
        matchAgainstExpr.setAgainst(against);

        if (lexer.token() == Token.IN) {
            lexer.nextToken();
            if (identifierEquals("NATURAL")) {
                lexer.nextToken();
                acceptIdentifier("LANGUAGE");
                acceptIdentifier("MODE");
                if (lexer.token() == Token.WITH) {
                    lexer.nextToken();
                    acceptIdentifier("QUERY");
                    acceptIdentifier("EXPANSION");
                    matchAgainstExpr.setSearchModifier(
                            MySqlMatchAgainstExpr.SearchModifier.IN_NATURAL_LANGUAGE_MODE_WITH_QUERY_EXPANSION);
                } else {
                    matchAgainstExpr.setSearchModifier(
                            MySqlMatchAgainstExpr.SearchModifier.IN_NATURAL_LANGUAGE_MODE);
                }
            } else if (identifierEquals("BOOLEAN")) {
                lexer.nextToken();
                acceptIdentifier("MODE");
                matchAgainstExpr.setSearchModifier(MySqlMatchAgainstExpr.SearchModifier.IN_BOOLEAN_MODE);
            } else {
                throw new ParserException("Syntax error: " + lexer.token());
            }
        } else if (lexer.token() == Token.WITH) {
            throw new ParserException("Syntax error: " + lexer.token());
        }

        accept(Token.RPAREN);

        expr = matchAgainstExpr;

        return primaryRest(expr);
    }*/ else if ("CONVERT".equalsIgnoreCase(ident)) {

Issues Resolved

[WARN ][o.o.s.l.u.QueryDataAnonymizer] [...] Caught an exception when anonymizing sensitive data

TODO

Discuss
Confirm that we can remove support for MATCH ... AGAINST (is it even supported, BTW?)
Test it - WIP

Check List

New functionality includes testing.
- All tests pass, including unit test, integration test and doctest
New functionality has been documented.
- New functionality has javadoc added
- New functionality has user manual doc added
Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Yury Fridlyand <yuryf@bitquilltech.com>

Signed-off-by: Sean Kao <seankao@amazon.com>

Yury-Fridlyand · 2022-06-28T22:30:20Z

Don't process MATCH ... AGAINST clause...

This fix breaks support for this clause, but I doubt it ever worked: I could not get it to run on OpenSearch 1.1 with the plugin as it was before MATCH support was added (commit 012cc03).
According to the article, MATCH ... AGAINST provides full-text search for MySQL. Meanwhile, OpenSearch provides a set of full-text search (relevance-based search) functions that are much more flexible; see opensearch-project#182.
I don't think we need to keep support for MATCH ... AGAINST, especially in the legacy parser.

Signed-off-by: Yury Fridlyand <yuryf@bitquilltech.com>

codecov · 2022-06-28T22:49:07Z

Codecov Report

❗ No coverage uploaded for pull request base (integ-fix-anonymizer@9e2a9ff).
The diff coverage is n/a.

@@                   Coverage Diff                   @@
##             integ-fix-anonymizer      #82   +/-   ##
=======================================================
  Coverage                        ?   97.72%           
  Complexity                      ?     2816           
=======================================================
  Files                           ?      271           
  Lines                           ?     6934           
  Branches                        ?      439           
=======================================================
  Hits                            ?     6776           
  Misses                          ?      157           
  Partials                        ?        1

Flag	Coverage Δ
sql-engine	`97.72% <0.00%> (?)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9e2a9ff...ebe978b. Read the comment docs.

MaxKsyunz · 2022-06-29T07:14:47Z

We need an anonymizer that works with the new engine. At some point the legacy engine will go away.

Until that happens, I think it's best if QueryDataAnonymizer.anonymizeData works similarly to how queries are processed -- try new engine first, if it fails -- try legacy engine, if it fails -- error out.

This way we gain an anonymizer for the new SQL engine, avoid changing the legacy parser, and avoid regressions in the anonymizer when the legacy engine is removed.

An anonymizer for the new SQL engine would have a lot in common with the PPL anonymizer. It will probably make sense to have a common base class.

What do you think?
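The fallback order proposed above (new engine first, then legacy, then error out) could be sketched roughly as follows. This is only an illustration: `FallbackAnonymizer` and its nested `Anonymizer` interface are hypothetical names, not classes from the plugin, and the real engines would have to be wrapped behind such an interface.

```java
import java.util.List;

/** Hypothetical sketch: try each anonymizer in order and fail only if all of them fail. */
public class FallbackAnonymizer {

    /** Hypothetical abstraction over "new engine" and "legacy engine" anonymizers. */
    public interface Anonymizer {
        String anonymize(String query) throws Exception;
    }

    private final List<Anonymizer> delegates;

    public FallbackAnonymizer(List<Anonymizer> delegates) {
        this.delegates = List.copyOf(delegates);
    }

    /** Returns the result of the first delegate that succeeds. */
    public String anonymize(String query) {
        IllegalStateException failure =
                new IllegalStateException("No anonymizer could process the query");
        for (Anonymizer delegate : delegates) {
            try {
                return delegate.anonymize(query);
            } catch (Exception e) {
                // Remember why this delegate failed, then fall through to the next one.
                failure.addSuppressed(e);
            }
        }
        throw failure;
    }
}
```

With the delegates ordered as `List.of(newEngineAnonymizer, legacyAnonymizer)`, the legacy parser is consulted only when the new engine rejects the query, and dropping the legacy engine later means removing one list entry.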

Yury-Fridlyand · 2022-06-29T16:01:00Z

Great idea, thanks. I will try it out.

acarbonetto · 2022-07-13T22:40:36Z

legacy/src/main/java/org/opensearch/sql/legacy/utils/QueryDataAnonymizer.java

@@ -39,7 +39,7 @@ public static String anonymizeData(String query) {
                    .replaceAll("[\\n][\\t]+", " ");
        } catch (Exception e) {
            LOG.warn("Caught an exception when anonymizing sensitive data");
-            resultQuery = query;
+            resultQuery = "";


The problem with returning an empty string is that it doesn't address the root cause. The anonymizer suppresses ALL exceptions it encounters and returns what looks like a valid result.
Instead, we should return a failure state in the response or throw an exception. As written, the function calling the anonymizer is completely unaware that an error occurred.
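One way to make the failure visible to the caller, as suggested in this review comment, is to return an explicit "no result" value instead of an empty string. The sketch below is an assumption about how that might look; `SafeAnonymizer` and the `rewriter` parameter are hypothetical and stand in for the real parsing and rewriting logic.

```java
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch: report anonymization failure to the caller instead of hiding it. */
public class SafeAnonymizer {

    /**
     * Applies the given rewriter to the query.
     * Returns Optional.empty() when the rewriter throws or yields null,
     * so the caller can tell a failure apart from a valid result.
     */
    public static Optional<String> anonymize(String query, Function<String, String> rewriter) {
        try {
            return Optional.ofNullable(rewriter.apply(query));
        } catch (RuntimeException e) {
            // Never fall back to the raw query here: it may contain sensitive data.
            return Optional.empty();
        }
    }
}
```

A caller could then write something like `SafeAnonymizer.anonymize(query, rewriter).orElse("<query could not be anonymized>")`, which keeps the raw query out of the log while still making the failure explicit.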

Yury-Fridlyand and others added 2 commits June 28, 2022 12:42

Fix data anonymizer failures by updating legacy parser.

d9d25ad

Signed-off-by: Yury Fridlyand <yuryf@bitquilltech.com>

Fixup for opensearch-project#646 (opensearch-project#664)

058159d

Signed-off-by: Sean Kao <seankao@amazon.com>

Merge remote-tracking branch 'upstream/main' into dev-fix-anonymizer

ebe978b

Signed-off-by: Yury Fridlyand <yuryf@bitquilltech.com>

acarbonetto reviewed Jul 13, 2022

Yury-Fridlyand closed this Sep 12, 2022

Yury-Fridlyand deleted the dev-fix-anonymizer branch September 12, 2022 23:49
