Welcome to the ESEC/FSE'19 artifact for the ESEC/FSE paper "Why Aren’t Regular Expressions a Lingua Franca? An Empirical Study on the Re-use and Portability of Regular Expressions", by J.C. Davis, L.G. Michael IV, C.A Coghlan, F. Servant, and D. Lee, all of Virginia Tech.
This paper describes our study into regex portability practices and problems. In this empirical work, we:
- surveyed 158 professional software developers about their regex beliefs and re-use practices
- extracted regular expression-like entities from Stack Overflow and RegExLib to understand re-use practices
- extracted regular expressions from about 200,000 software projects written in 8 programming languages
- analyzed these production regular expressions for portability problems: syntactic, semantic, and performance
Our artifact includes the following:
Item | Description | Corresponding content in the paper | Scientific interest | Relation to prior work |
---|---|---|---|---|
Survey instrument | Survey of developers' regex practices | Section 4.1 | Second survey of developer regex practices, cf. Chapman & Stolee ISSTA'16 | |
Internet Sources collectors | Tools to extract regexes from Internet Sources | Section 6.2.1 | ||
Internet Sources corpus | Entities that look like regexes across Stack Overflow and RegExLib | Section 6.2 | First snapshot of regexes in Internet forums | No prior work has examined the regexes from these Internet sources. Our analysis was in the spirit of work on more general code re-use from Stack Overflow to GitHub. |
Regex extractors | Tools to statically extract regexes from software written in 8 programming languages | Section 5 | Adds 6 programming languages to the tools from our FSE'18 paper | |
Regex corpus | A polyglot regex corpus of 537,806 unique regexes extracted from 193,524 projects written in 8 programming languages | Collection: Section 5, esp. Table 1. Experiments: Section 7 | This is the largest and most diverse regex corpus ever collected. It should be useful for future regex analysis purposes, e.g. in testing a visualization or input generation tool. | Our FSE'18 paper included a corpus of about 400,000 regexes extracted from about 670,000 npm and pypi modules (See Table 1 in that paper, and that artifact). This new corpus covers 6 more programming languages. |
Regex analyses: Semantic | Drivers for 5 input generators | Section 7.1 | Collects, improves, and unifies existing input generators | |
Regex analyses: Performance | Drivers for 3 super-linear regex detectors | Section 7.2 | Extends existing super-linear regex detectors to partial-match semantics | Builds on the tooling from our FSE'18 paper |
In addition to this directory's README.md
, each sub-tree comes with one or more READMEs describing its contents.
Export the following environment variables to ensure the tools know how to find each other.
ECOSYSTEM_REGEXP_PROJECT_ROOT
VULN_REGEX_DETECTOR_ROOT
(dependency, set it toECOSYSTEM_REGEXP_PROJECT_ROOT/analysis/performance/vuln-regex-detector
)
See .env
for examples.
To install, execute the script ./configure.sh
on an Ubuntu 16.04 machine with root privileges.
This will obtain and install the various dependencies (e.g. OS packages, REDOS detectors) and compile all analysis tools.
The final line of this script is echo "Configuration complete. I hope everything works!"
.
If you see this printed to the console, great!
Otherwise...alas.
!!!!!!!!!!!!!! !Not yet done! !!!!!!!!!!!!!!
(However, see containerized/Dockerfile -- this may work?)
To facilitate replication, we have published a containerized version of this project on hub.docker.com. The container is based on an Ubuntu 16.04 image so it is fairly large.
For example, you might run:
docker pull jamiedavis/davismichaelcoghlanservantlee-fse19-regexartifact
docker run -ti jamiedavis/davismichaelcoghlanservantlee-fse19-regexartifact
> vim .env
# Set ECOSYSTEM_REGEXP_PROJECT_ROOT=/davis-fse19-artifact/LinguaFranca-FSE19
> . .env
> # Proceed to use our tools, see some examples below
We have prepared a simple script to drive the analysis on a single node on a subset of the regexes. The actual analyses were performed on a compute cluster, as detailed in the paper.
To use this script, run the following command and wait about 10 minutes for all of the phases to complete. (The performance analysis is the expensive part and takes more than half the time).
$ECOSYSTEM_REGEXP_PROJECT_ROOT/analysis/run-analyses.pl 50
This command will produce output like this:
...
Analysis complete. Performed syntax, semantic, and performance analyses.
/tmp/LF-826/results: data files
/tmp/LF-826/logs: reports and logs
/tmp/LF-826/vis: visualizations
Clean up with:
rm -rf /tmp/LF-826
The directory name will vary in each run, but you can find:
- The various data files in the
results
subdir - The program logs and tabular reports in the
logs
subdir - Sample visualizations akin to those in the paper in the
vis
subdir
Of course, none of the results will match those in the paper when performed on a subset of the data, but hopefully this is enough to give you a sense of how you could replicate the results in the paper.
See the following sections for sample commands for each step of the analysis.
Our analyses work on a set of regexes. You can use the tail of the full corpus to see how things go.
tail -10 $ECOSYSTEM_REGEXP_PROJECT_ROOT/data/production-regexes/uniq-regexes-8.json > 10-regexes.json
$ECOSYSTEM_REGEXP_PROJECT_ROOT/bin/test-for-syntax-portability.py --regex-file 10-regexes.json --out-file 10-syntax.json 2>10-syntax.log
This should run quickly.
If you examine the tail of 10-syntax.log
, you'll see output like this:
11/06/2019 02:05:13 woody/22034: Generating a quick summary of regex syntax support
11/06/2019 02:05:13 woody/22034: Number of supporting languages Number of regexes
11/06/2019 02:05:13 woody/22034: 7 1
11/06/2019 02:05:13 woody/22034: 8 9
11/06/2019 02:05:13 woody/22034:
11/06/2019 02:05:13 woody/22034: Language Number of supported regexes
11/06/2019 02:05:13 woody/22034: javascript 10
11/06/2019 02:05:13 woody/22034: rust 9
11/06/2019 02:05:13 woody/22034: php 10
11/06/2019 02:05:13 woody/22034: python 10
11/06/2019 02:05:13 woody/22034: ruby 10
11/06/2019 02:05:13 woody/22034: perl 10
11/06/2019 02:05:13 woody/22034: java 10
11/06/2019 02:05:13 woody/22034: go 10
11/06/2019 02:05:13 woody/22034:
Apparently one regex was unsupported in Rust, while the other 9 regexes were supported in all 8 languages.
If you examine 10-syntax.json
, you'll see enhanced libLF.Regex objects -- they now have the supportedLangs
member populated.
The row with the pattern containing the string "avatar" lists 7 of the languages but not Rust, because Rust does not support the escaped forward slash notation as a valid construct.
The semantics test should be run on the result of the syntax test, since it needs to know the supportedLangs
of the libLF.Regex objects.
$ECOSYSTEM_REGEXP_PROJECT_ROOT/bin/test-for-semantic-portability.py --regex-file 10-syntax.json --out-file 10-semantic.json 2>10-semantic.log
This may take a few minutes.
Once it's done, you can look at the tail of 10-semantic.log
.
These 10 regexes are fairly dull from a semantic perspective:
0 (0.00%) of the 10 completed regexes had at least one witness for different behavior
If you examine 10-semantic.json
, you'll see that the libLF.Regex objects have been enhanced in a different way:
- They have the
nUniqueInputsTested
member set to the number of inputs that were attempted for each regex - They have the
semanticDifferenceWitnesses
member set, though since none were found all of those lists are empty
For demonstration purposes, we have prepared a regex file that has semantic difference witnesses. (cf. the final row of Table 4).
$ECOSYSTEM_REGEXP_PROJECT_ROOT/bin/test-for-semantic-portability.py --regex-file demo/semantic-difference-witness-regex.json --out-file demo-semantic.json 2>demo-semantic.log
The log file now ends more enticingly:
1 (100.00%) of the 1 completed regexes had at least one witness for different behavior
If you examine demo-semantic.json
, you'll see the inputs that triggered semantic differences, with a breakdown of the distinct behaviors observed and the languages that evinced each behavior.
Run a performance analysis on the 10-regexes.json
file like this:
$ECOSYSTEM_REGEXP_PROJECT_ROOT/bin/test-for-SL-behavior.py --regex-file 10-regexes.json --out-file 10-performance.json --sl-timeout 10 --power-pumps 100000 2>10-performance.log
This takes a few minutes parallelized across my 8-core desktop. If you're desperate, you can just run it on a 1-regex file instead of the 10-regex file we've been using.
Once complete, take a look at the end of 10-performance.log
. It says:
11/06/2019 01:56:43 woody/20094: Successfully performed SLRegexAnalysis on 10 regexes, 0 exceptions
11/06/2019 01:56:43 woody/20094: 1 of 10 successful analyses timed out in some language
11/06/2019 01:56:43 woody/20094: 1 of the regexes had different performance in different languages
11/06/2019 01:56:43 woody/20094: 0 of the regexes had different performance in the languages they actually appeared in
If you examine 10-performance.json
, you should see enhanced libLF.Regex objects.
According to the log, one of these exhibited super-linear behavior
If you search the 10-performance.json
file for the string '100000": true', you will see that a regex pattern beginning proxy.*fooo
timed out on an input of 100000 pumps in the following programming languages:
- javascript
- php
- python
- ruby
- java
File or Directory/ | Description |
---|---|
README.md | You're in it |
PAPER.pdf | Non-anonymized manuscript we submitted for review (not camera-ready) |
LICENSE | Terms of software release |
STATUS | Claims of artifact quality |
INSTALL | "Install instructions" |
----------------------- | ----------------------------------------------------------------- |
containerized/ | Dockerfile for building container |
configure.sh | One-stop-shop for configuration |
----------------------- | ----------------------------------------------------------------- |
survey/ | Survey instrument |
data/ | Corpuses (internet and production) and tools to reproduce them |
analysis/ | Experimental analyses (syntax, semantic, performance) |
full-analysis/ | Run each analysis step on a regex |
lib/ | Python libraries -- utility routines, serializers and parsers for types expressed in JSON |
bin/ | Symlinks to the tools scattered throughout the tree, easing access from analysis scripts |
regex-engine-bugs/ | README with links to the bugs we opened against regex engines or documentation |
Each directory contains its own README with additional details.
Most of the scripts in this repository are written in Python. They tend to write status updates to STDERR, and write their output to an NDJSON-formatted --out-file of serialized libLF objects.
If you have dependencies on other scripts in the repo, require the invoker to define ECOSYSTEM_REGEXP_PROJECT_ROOT
.
This environment variable should name the location of your clone of this repository.
This project uses JSON to describe research data. Files named *.json
are generally NDJSON-formatted files that contain one JSON object per line.
Why giant flat files?
- Makes it easy to do a line-by-line streaming analysis on the objects in the file, even if the file is large.
- Makes it easy to divide work amongst the nodes in a compute cluster.
- Makes it easy to share data with other researchers.
Contact J.C. Davis at davisjam@vt.edu with any questions.