Draft PR for Pysa Fuzzer #886

esohel30 · 2024-06-19T07:46:39Z

Draft PR for the pysa fuzzer project 🚀

arthaud · 2024-06-19T09:33:05Z

Cool, this is a good start.

A few comments:

Could we use type annotations in the script? See https://docs.python.org/3/library/typing.html. We try to annotate most of our Python code at Meta. It should be fairly easy here: most arguments are str, and the return value of most functions is also str.
I think this is not generating valid code, since a variable might be used before it is defined. Could you find a way to prevent that problem?
There are many patterns that cannot be generated because generate_complex_expression is not truly recursive. For instance, generate_if_statement uses generate_assignment, which itself uses generate_variable_expression. It will never generate an if condition that contains a function call, for instance. Do you have ideas on how to improve that? I can help if needed.
Instead of implementing your own indent, just use textwrap.indent from the standard library.
Finally, and most importantly: in the end, this always generates sink(source()) (where source and sink are one of the 6 functions). Pysa will likely find the flow 99.99% of the time, since the code before that doesn't really matter. What we need is for the source to actually flow through the randomly generated code. Then, the hard part is to either only generate code where the flow is valid, or somehow execute the code to verify whether the flow is valid (to see if Pysa agrees with it).
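To illustrate the first and fourth points above, here is a minimal sketch of annotated generator helpers that rely on textwrap.indent. The function names come from the review, but their signatures and bodies are assumptions made for illustration, not the code in the PR.

```python
import textwrap


def generate_assignment(target: str, expression: str) -> str:
    """Return a single assignment statement as source text."""
    return f"{target} = {expression}"


def generate_if_statement(condition: str, body: str) -> str:
    """Wrap `body` in an if block, indenting it with textwrap.indent."""
    return f"if {condition}:\n" + textwrap.indent(body, "    ")


# Hypothetical usage: `flag` and `source` are placeholder names.
snippet = generate_if_statement("flag", generate_assignment("x", "source()"))
print(snippet)
```

textwrap.indent prefixes every non-empty line of a multi-line body, so nested blocks stay correctly indented without a hand-written helper.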

esohel30 · 2024-06-24T08:25:08Z

Enhanced the Pysa fuzzer by adding type annotations, ensuring variables are defined before use, making expression generation truly recursive, and using textwrap.indent for better code formatting. Added a defined_variables set to track declared variables, improving code validity. Despite these improvements, the fuzzer is still far from perfect and requires further refinement to enhance code generation diversity and flow validation. Will continue to work on it extensively! Might even approach it in a different manner now that I have been playing around with it for a bit.

arthaud · 2024-06-24T11:48:51Z

Cool, another round of feedback:

I believe this never generates a valid flow, since sink_var is always the result of a get_next_variable call made after generating the expression, so that variable is never assigned. To fix that, the whole code generation needs to know source_var and sink_var and emit an assignment at some point.
There is a bug where this can generate x = x, since to_var is created before the right-hand side expression is generated.
What if get_next_variable runs out of variables? We should probably use variable names such as a, .., z, aa, .., az, etc.
Can we find a way to avoid using mutable global variables (e.g., defined_variables and current_var_index)? This is a problem if we want to generate several tests by calling generate_source_to_sink multiple times. A solution would be to use a class with attributes. All the functions that generate expressions could be methods of that class.
It is still not fully recursive: generate_arithmetic_expression is never called, and generate_assignment only uses generate_variable_expression. This will never generate x = foo() or x = y + z + 1. Here is an idea: you should first differentiate between expressions (which return a value) and statements. The current generate_complex_expression generates statements, not expressions. Then, the actual generate_expression function should be recursive and accept a maximum depth. Pick a number: if it's 0, generate a constant; if it's 1, generate an addition by calling generate_expression recursively; if it's 2, generate a subtraction; if it's some number N, generate a function call; etc.
It would be interesting to also generate expressions with lists (creating lists, adding an element to a list, accessing a list item) and dictionaries (dict).
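The points above about unbounded variable names, avoiding global state, and depth-limited recursion could be combined roughly as follows. This is only a sketch under assumptions: the class name ExpressionGenerator, the seed parameter, the integer-only expressions, and the use of max as the function call are all illustrative choices, not what the PR implements.

```python
import itertools
import keyword
import random
import string
from typing import Iterator, List


def variable_names() -> Iterator[str]:
    """Yield a, ..., z, aa, ab, ..., skipping Python keywords such as `as` or `if`."""
    for length in itertools.count(1):
        for letters in itertools.product(string.ascii_lowercase, repeat=length):
            name = "".join(letters)
            if not keyword.iskeyword(name):
                yield name


class ExpressionGenerator:
    """Keeps all generator state in attributes instead of module-level globals."""

    def __init__(self, seed: int) -> None:
        self.rng = random.Random(seed)
        self.names = variable_names()
        self.defined: List[str] = []

    def generate_expression(self, depth: int) -> str:
        """Return an integer-valued expression nested at most `depth` levels."""
        choice = self.rng.randrange(4) if depth > 0 else 0
        if choice == 0:
            # Leaf: reuse an already defined variable or emit a constant.
            if self.defined and self.rng.random() < 0.5:
                return self.rng.choice(self.defined)
            return str(self.rng.randint(0, 9))
        left = self.generate_expression(depth - 1)
        right = self.generate_expression(depth - 1)
        if choice == 1:
            return f"({left} + {right})"
        if choice == 2:
            return f"({left} - {right})"
        return f"max({left}, {right})"  # stands in for a function call

    def generate_assignment(self, depth: int) -> str:
        # Build the right-hand side first, so the new target can never
        # appear in its own defining expression (avoids `x = x`).
        expression = self.generate_expression(depth)
        target = next(self.names)
        self.defined.append(target)
        return f"{target} = {expression}"


generator = ExpressionGenerator(seed=0)
for _ in range(3):
    print(generator.generate_assignment(depth=2))
```

Because each instance owns its random source and name iterator, creating a new ExpressionGenerator per test case gives independent, reproducible programs.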

I believe @alexkassil also has a different idea to generate code that always has a valid flow, feel free to ask him if you are interested.

arthaud · 2024-06-26T11:12:04Z

My feedback:

This is definitely going in the right direction. fuzzer2 always generates code with a valid flow, which is great.
The generated code is still not valid, unfortunately. Variables are sometimes integers and sometimes strings, used interchangeably. This would lead to a runtime error, since 'a' + 1 is not valid Python code.
While we do generate interesting code, the code we generate would always have all variables tainted. It would be interesting to have more flows with variables that are not tainted. We need to inject variables that are not constructed from "prev_var".
The main downside of the approach in fuzzer2 is that we generate a limited set of patterns, and those are only combined by being concatenated. If Pysa handles those patterns individually, it is likely that it will correctly handle all generated code. To fix that, I think we need to combine fuzzer.py and fuzzer2.py: each generate_xxx (e.g., generate_while_loop, generate_for_loop, etc.) should be recursive so it can have nested statements. Right now we only generate while loops that have {curr_var} += {prev_var}\ncounter += 1. It would be more interesting to have nested while loops, while loops with if conditions, etc.
Have you tried running Pysa on the generated code? Does it find the issue? The next step is to do that automatically in the script, and run until we find an example where Pysa doesn't find the flow.
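One way to picture the nesting suggested above is a recursive statement generator that keeps every value a string, introduces variables not built from prev_var, and checks the flow dynamically by executing the result. This is a hedged sketch: StatementGenerator, generate_flow, flow_reaches_sink, and the "TAINT" marker are hypothetical names, and the dynamic check only approximates what Pysa would report.

```python
import random
import textwrap
from typing import List


class StatementGenerator:
    """Generates string-only statements that carry prev_var into curr_var."""

    def __init__(self, seed: int) -> None:
        self.rng = random.Random(seed)
        self.counter = 0

    def fresh(self, prefix: str) -> str:
        self.counter += 1
        return f"{prefix}{self.counter}"

    def generate_flow(self, prev_var: str, curr_var: str, depth: int) -> str:
        """Return statements after which curr_var contains the value of prev_var."""
        if depth == 0:
            # `clean` is a variable that is not constructed from prev_var.
            clean = self.fresh("clean")
            return f"{clean} = 'constant'\n{curr_var} = {prev_var} + {clean}"
        mid = self.fresh("v")
        inner = self.generate_flow(prev_var, mid, depth - 1)
        kind = self.rng.choice(["while", "for", "if"])
        if kind == "while":
            counter = self.fresh("counter")
            body = f"{inner}\n{curr_var} += {mid}\n{counter} += 1"
            return (
                f"{curr_var} = ''\n{counter} = 0\n"
                f"while {counter} < 2:\n" + textwrap.indent(body, "    ")
            )
        if kind == "for":
            body = f"{inner}\n{curr_var} += {mid}"
            return f"{curr_var} = ''\nfor _ in range(2):\n" + textwrap.indent(body, "    ")
        # Both branches of the if statement preserve the flow.
        body = f"{inner}\n{curr_var} = {mid}"
        return (
            f"if len({prev_var}) > 0:\n" + textwrap.indent(body, "    ")
            + f"\nelse:\n    {curr_var} = {prev_var}"
        )


def generate_program(seed: int, depth: int) -> str:
    body = StatementGenerator(seed).generate_flow("start", "end", depth)
    return f"start = source()\n{body}\nsink(end)"


def flow_reaches_sink(program: str) -> bool:
    """Execute the program with stub source/sink and report whether taint arrived."""
    received: List[str] = []
    env = {"source": lambda: "TAINT", "sink": received.append}
    exec(program, env)  # only run code produced by the generator itself
    return any("TAINT" in value for value in received)


print(generate_program(seed=0, depth=2))
```

A program for which flow_reaches_sink returns True but Pysa reports nothing would be a candidate false negative.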

alexkassil · 2024-06-27T19:40:03Z

Hey @esohel30, so the idea for this project is to automatically find false negatives, i.e. flows that should be security issues, but that Pysa doesn't find for whatever reason.

The way to do this is to generate increasingly complex valid flows for pysa to find. Everything generated should be a valid security issue.

https://github.com/facebook/pyre-check/tree/main/source/interprocedural_analyses/taint/test/integration take a look at the tests here (in the .py files).

Here's an example: https://github.com/facebook/pyre-check/blob/main/source/interprocedural_analyses/taint/test/integration/source_sink_flow.py

from builtins import _test_sink, _test_source


def bar():
    return _test_source()


def qux(arg):
    _test_sink(arg)


def bad(ok, arg):
    qux(arg)


def some_source():
    return bar()


def match_flows():
    x = some_source()
    bad(5, x)

One issue in this file is the flow in match_flows() -> bad()

One way to always generate valid issues is to start with _test_sink(_test_source()), and then pick operations that make it so the flow is one more hop away.

For example, let's say you add 3 kinds of mutations to the fuzzer:

1. Extra variable
2. Function call
3. if statement else clause

And now you randomly pick from those 3 elements 4 times to get [1, 2, 3, 2].

Applying those mutations step by step gets you:

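# After mutation 1 (extra variable):
def f1():
  x = _test_source()
  _test_sink(x)

# After mutation 2 (function call):
def f2():
  x = _test_source()
  f1(x)

def f1(x):
  _test_sink(x)

# After mutation 3 (if statement else clause):
def f2(cond):
  x = _test_source()
  if cond:
    pass
  else:
    f1(x)

def f1(x):
  _test_sink(x)

# After mutation 2 again (function call):
def f3(cond):
  x = _test_source()
  f2(x, cond)

def f2(x, cond):
  if cond:
    pass
  else:
    f1(x)

def f1(x):
  _test_sink(x)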

Continuing to add more single-hop transformations that preserve the flow will let the fuzzer generate all the valid flows present in https://github.com/facebook/pyre-check/tree/main/source/interprocedural_analyses/taint/test/integration. For simplicity, I think you should not worry about any modelling other than _test_source and _test_sink.
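A rough sketch of this mutation approach, under stated assumptions: FlowBuilder, build, and entry are hypothetical names, and the decomposition (tracking the expression that holds the taint and the callable that receives it) is one possible design, not necessarily the one used in the PR. The rendered code omits the `from builtins import _test_sink, _test_source` line that the integration tests use.

```python
import textwrap
from typing import List, Sequence


class FlowBuilder:
    """Starts from _test_sink(_test_source()) and adds one hop per mutation."""

    def __init__(self) -> None:
        self.helpers: List[str] = []  # helper function definitions
        self.body: List[str] = []     # statements in entry() before the final call
        self.expr = "_test_source()"  # expression currently holding the taint
        self.sink = "_test_sink"      # callable that currently receives it
        self.count = 0

    def _fresh(self, prefix: str) -> str:
        self.count += 1
        return f"{prefix}{self.count}"

    def extra_variable(self) -> None:
        name = self._fresh("x")
        self.body.append(f"{name} = {self.expr}")
        self.expr = name

    def function_call(self) -> None:
        name = self._fresh("f")
        self.helpers.append(f"def {name}(arg):\n    {self.sink}(arg)\n")
        self.sink = name

    def if_else(self) -> None:
        name = self._fresh("f")
        self.helpers.append(
            f"def {name}(arg, cond=False):\n"
            f"    if cond:\n        pass\n"
            f"    else:\n        {self.sink}(arg)\n"
        )
        self.sink = name

    def render(self) -> str:
        lines = self.body + [f"{self.sink}({self.expr})"]
        entry = "def entry():\n" + textwrap.indent("\n".join(lines), "    ")
        return "\n".join(self.helpers + [entry]) + "\n"


def build(sequence: Sequence[int]) -> str:
    """Apply mutations by their 1-based number, as in the example above."""
    builder = FlowBuilder()
    mutations = [builder.extra_variable, builder.function_call, builder.if_else]
    for number in sequence:
        mutations[number - 1]()
    return builder.render()


print(build([1, 2, 3, 2]))
```

Each mutation updates only where the taint currently lives or who receives it, so every rendered program still contains a path from _test_source to _test_sink.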

alexkassil

Looking at the code, here is how I recommend you change it at a high level going forward:

Look through https://github.com/facebook/pyre-check/tree/main/source/interprocedural_analyses/taint/test/integration and pick a few simpler cases to try to model with the fuzzer, say around 5 cases.
If you exhaustively generate all length-4 mutation sequences, that's 5*5*5*5, or 625, valid but different Pysa flows.
Run Pysa on all of them and validate that it catches those issues.
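The count above can be checked by enumerating the sequences directly. In this small sketch the five mutation names are placeholders, since the thread only fixes their approximate number.

```python
import itertools

# Hypothetical mutation names; only the count (5) comes from the discussion.
MUTATIONS = ["extra_variable", "function_call", "if_else", "for_loop", "list_wrap"]

# Every ordered choice of 4 mutations, with repetition allowed.
sequences = list(itertools.product(MUTATIONS, repeat=4))
print(len(sequences))  # 625, i.e. 5 ** 4
```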

facebook-github-bot · 2024-09-09T09:37:58Z

@arthaud has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-09-09T17:24:42Z

@arthaud merged this pull request in 99a07a2.

alexkassil · 2024-09-09T18:35:27Z

Congrats and well done @esohel30 !

arthaud · 2024-09-09T19:02:23Z

Thanks for the hard work! We have finally merged this and found a few false negatives (see upcoming commits).

elementary draft pysa fuzzer

96d3dea

facebook-github-bot added the CLA Signed label Jun 19, 2024

esohel30 changed the title ~~elementary draft pysa fuzzer~~ Draft PR for Pysa fuzzer Jun 19, 2024

updated pysa fuzzer

e4c772b

worked on pysa fuzzer which generatres proper flow now

013c3ed

arthaud requested changes Jun 26, 2024

arthaud reviewed Jun 26, 2024

esohel30 added 12 commits June 27, 2024 00:19

removed global variables. Created methods to reduce code redundacy

c449555

fixed issue with flow and sink function getting prev prev variable in…

f7a3aba

…stead of just prev

fixed issues with slicing method. Before it would return an empty str…

f13e4ca

…ing after too many slices. we do not want that

fixed errors with tuple generator function

8042257

added small change to concat method to decrease the string output

d2ad2e2

fixed bug with nested loops where the flow was being interupted.

7a6f598

fixed randomized data structure fuction and reduced unnessesary strin…

a4af7c3

…g outputs in places

fixed naming error

8c92265

updateed example generation py

0925170

added another example generation

377258d

added example inputs and outputs

66d6459

got rid of useless import

ea8fd03

arthaud requested changes Jun 27, 2024

esohel30 added 2 commits June 27, 2024 14:53

fixed bug with generate previous variable using -2 instead of -1

945f0fb

made addition, for loop and while loop recursive

e3311de

made generate list recursive

d638a0e

alexkassil requested changes Jun 27, 2024

made generate dictionary recursive

fca42db

esohel30 added 5 commits July 28, 2024 21:07

added the exmaple

ae3acc5

added improvements to script

2e09236

finished working on the script

3194012

worked on the backus normus form generation for code

25b29e3

added 22 new mutations

61030fa

arthaud requested changes Aug 14, 2024

esohel30 added 17 commits September 7, 2024 00:12

testing commit

b8cf531

Merge branch 'fuzzer' of github.com:esohel30/pyre-check into fuzzer

a849f95

got rid of example generation

d1de53a

make output statements and code formatting more readable so that fucn…

f39ae47

…tion genration functions can be more easily understood

updated if else elif functionality

d8b7c6a

s

updated if else elif

120fbee

removed unnecesary demostration files

090540f

returned directly instead of using a wrapper varaible

80c9cf9

added functionality to exclude known flase negatives from code genera…

f5af568

…tion

created new code generator instead of just reseting the genrator

410ffec

applied feedback to use white space instead of indents

85c76b0

added white space improvbements instead of using indent space variable

dd62df4

combined the python scirpts into one

eb1fe1e

updated file names

b59ded6

updated seed mechanics

e8433ad

updated run.py to reflect changes in the file names

f1a8fc1

removed unneeded logic:

daffc56

facebook-github-bot closed this in 99a07a2 Sep 9, 2024

facebook-github-bot added the Merged label Sep 9, 2024
