rofl0r/perplex
==========
 Overview
==========
Perplex is a small tool that simplifies the creation of scanners using
re2c. It generates an input for the re2c tool from a perplex input file.

This fork converted the sources from C++ to C, removed proprietary getopt
code, replaced CMake with a Makefile, and added an example.

The code derives from a U.S. government project called BRL-CAD, which can
be found on SourceForge. A standalone but outdated version of the code is
available at https://github.com/stepcode/baffledCitrus.

Main Sources
-------------
perplex.c
    The main program for perplex.

scanner.re and parser.y
    Inputs for the re2c scanner generator and the lemon parser generator,
    respectively. These files implement the perplex input file parser.

perplex_template.c
    Template file with a basic re2c scanner implementation. Used as the
    basis for generated scanner sources.

Licensing and Copyrights
-------------------------
mbo_getopt.cpp and mbo_getopt.h are in the Public Domain, written by
Marcus Boerger.

scanner.re and perplex_template.c include code taken from the flex
project, and are released under a BSD License with joint copyright held
by the U.S. Government and The Regents of the University of California.

All other source files are released under a BSD License with U.S.
Government copyright.

=====
 API
=====
All scanner data is stored in a perplex_t object. The template implements
the following public functions:

perplex_t perplexFileScanner(FILE *input)
    Creates a perplex_t object initialized to scan input from the
    specified file stream. The scanner stops scanning when EOF is
    encountered.

perplex_t perplexStringScanner(char *firstChar, size_t numChars)
    Creates a perplex_t object initialized to scan input from the
    specified string. The scanner stops scanning when numChars characters
    have been scanned.

void perplexFree(perplex_t scanner)
    Frees all memory associated with the given perplex_t object, except
    for any memory referenced by scanner->extra, which must be freed by
    the user.

int yylex(perplex_t scanner)
    Returns the value of the next recognized token, or YYEOF when the end
    of the input has been reached. The name of this routine can be changed
    by defining the preprocessor symbol PERPLEX_LEXER in section 1 of the
    perplex input file.

void perplexSetExtra(perplex_t scanner, void *extra)
    Sets the scanner's application data.

void *perplexGetExtra(perplex_t scanner)
    Gets the scanner's application data.

void perplexUnput(perplex_t scanner, char c)
    Inserts a character into the specified scanner's input buffer so that
    it is the next character scanned.

Configuration Macros
---------------------
These macros can be defined in section 1 of the perplex input file:

PERPLEX_LEXER
    Set this symbol to specify an alternate name for the generated lexer.
    The default is yylex.

PERPLEX_ON_ENTER
    Use this symbol to specify code to run at the beginning of each call
    to the lexer. A common use is to simplify access to application data:

    #define PERPLEX_ON_ENTER appData_t *appData = (appData_t*)yyextra;

Macros Available Inside Rule Code Blocks
-----------------------------------------
These macros can only be used inside rule code blocks:

YYGETCONDITION and YYSETCONDITION(condition)
    Get or set the current start condition (requires running perplex and
    re2c with the '-c' option flag).

yytext
    A dynamically allocated null-terminated string holding the input that
    was matched. yytext is guaranteed to exist until the end of the code
    block it is used in, but is automatically freed afterwards. If you
    need to store the token text, you must make a copy.

yyextra
    Application data (void*).

===============
 Using Perplex
===============
1) Write a perplex input file (see Perplex Input Format).

2) Run perplex on the input to generate an re2c input file.

    perplex -t /path/to/perplex_template.c -h header.h -o output.re input.l

   * Input defaults to stdin and output defaults to stdout.
   * The generated header contains the perplex_t definition and public
     function prototypes.
     This header should be manually included in input.l. If no output
     header path is specified, the definitions/declarations will appear
     at the top of the output source.

3) Run re2c on the re2c input to generate the final scanner source.

    re2c -o /path/to/output.c /path/to/output.re

4) Scan input.

    int tokenID;
    perplex_t scanner = perplexFileScanner(inFile);

    perplexSetExtra(scanner, (void*)appData);

    while ((tokenID = yylex(scanner)) != YYEOF) {
        ...
    }

    /* do something with appData */

    perplexFree(scanner);

======================
 Perplex Input Format
======================
Perplex takes a three-section file as input:

    /* section 1 - code copied to output */
    %%
    /* section 2 - scanner rules (see Perplex Rule Section Syntax) */
    <regular-expression> { <code> }
    <regular-expression> { <code> }
    %%
    /* section 3 (optional) - code copied to output */

Sections are separated by "%%" appearing on a line by itself, with no
leading or trailing whitespace. The second "%%" separator and the code
section that follows it are optional.

=============================
 Perplex Rule Section Syntax
=============================
The basic form of the rule section is a series of regular expressions,
each followed by C/C++ code to be executed when the current input string
matches the expression:

    <regular-expression> { <code> }
    <regular-expression> { <code> }

The code block spans between { and }, not including C-braces or braces
appearing within strings or comments.

The code block is optional. If it is missing, the default behavior is to
ignore the matched input string and continue scanning.

* See the re2c documentation for a list of valid regular expressions.

Named Definitions
------------------
Regular expressions can be given aliases using re2c's named definition
syntax. All named definitions should appear before the first rule.

    %%
    /* named definitions */
    alpha = [A-Za-z];
    num = [0-9];

    /* rules */
    alpha { /* code */ }
    num { /* code */ }

    /* to match something like "A1" */
    alpha(num) { /* code */ }

Exiting From Code Blocks
-------------------------
When the end of a code block is reached, the default behavior is to
advance past the matched input and continue scanning:

    [ \t]+ { /* ignore and continue scanning */ }

If you have matched a token, you will probably return a token identifier
rather than immediately resume scanning:

    [A-Za-z_]+ { return TOKEN_NAME; }

It is also possible that you will match part of a token and need to
continue scanning to find the end of it. Use the continue keyword to avoid
throwing out the text matched so far:

    /* start of list */
    '{' { continue; }

    /* intermediate list item */
    [^,}]',' { continue; }

    /* end of list */
    [^,}]'}' { return TOKEN_LIST; }

==================
 Start Conditions
==================
Start conditions allow you to specify that certain rules only be applied
in certain states. Using start conditions requires that perplex and re2c
are run with the '-c' option flag.

Conditions are simply ints, defined in whatever manner is convenient
(e.g. in an enum).

When using conditions, every rule must either be prefixed with a
comma-separated list of conditions appearing between < and >, or be
specified inside a condition scope (a condition list followed by rules in
between { and }):

    enum {conditionOne, conditionTwo, conditionThree};
    %%
    /* condition scope */
    <conditionOne> {
        "center" { /* code */ }
        "view" { /* code */ }
    }

    /* simple condition-prefixed rules */
    <conditionOne,conditionTwo>[a-zA-Z]+ { /* code */ }
    <conditionThree>'v' { /* code */ }

0 is the initial condition, and can be referenced as the empty list "<>"
in the rule section. "<*>" is the shorthand for specifying all conditions.

The current condition can be determined by calling YYGETCONDITION().
A new condition can be set by calling YYSETCONDITION(<condition>), or by
using one of the re2c transition operators: "=>" or ":=>".

Take the example of changing from a "code" condition to a "comment"
condition:

    enum {INITIAL, code, comment};
    %%
    /* scanner's first order of business is to change from "INITIAL"
     * to "code" condition
     */
    <> => code

    /* this... */
    <code>"/*" => comment { /* code */ }

    /* ...is equivalent to this */
    <code>"/*" { YYSETCONDITION(comment); /* code */ }

    /* and this... */
    <code>"/*" :=> comment

    /* ...is equivalent to this */
    <code>"/*" { YYSETCONDITION(comment); continue; }
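
To make the effect of condition switching concrete, here is a minimal
hand-written sketch in plain C. It is not perplex or re2c output and uses
none of the generated scanner API. The function stripComments is
hypothetical, invented only for this illustration. It tracks the same
INITIAL, code and comment conditions as the example above in an ordinary
int, and it additionally assumes a rule that switches from "comment" back
to "code" when the comment closes, which the example above does not show.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Conditions are plain ints; 0 (INITIAL) is the initial condition. */
enum { INITIAL, code, comment };

/* Hypothetical helper (not part of perplex): copies `in` to `out` with
 * C-style comments removed, switching conditions by hand.
 * Returns the number of characters written, excluding the terminator.
 * Output that does not fit in `out` is silently dropped.
 */
size_t stripComments(const char *in, char *out, size_t outSize)
{
    int condition = INITIAL;
    size_t len, i = 0, n = 0;

    assert(in != NULL && out != NULL && outSize > 0);
    len = strlen(in);

    /* Same role as the "<> => code" rule: leave the initial condition. */
    condition = code;

    while (i < len) {
        if (condition == code) {
            if (in[i] == '/' && in[i + 1] == '*') {
                /* Comment opener: same effect as YYSETCONDITION(comment). */
                condition = comment;
                i += 2;
            } else {
                /* Ordinary character: keep it if there is room. */
                if (n + 1 < outSize) {
                    out[n++] = in[i];
                }
                i++;
            }
        } else {
            if (in[i] == '*' && in[i + 1] == '/') {
                /* Assumed rule: the comment closer returns to "code". */
                condition = code;
                i += 2;
            } else {
                /* Inside a comment: discard the character. */
                i++;
            }
        }
    }

    out[n] = '\0';
    return n;
}
```

Calling stripComments("a/*x*/b", buf, sizeof buf) leaves "ab" in buf. In a
real perplex scanner the same switching would be expressed with
condition-prefixed rules and YYSETCONDITION rather than an explicit if/else
chain.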