Adding StreamTest. Modified how newline and tab matching is handled
lukewatts committed May 29, 2022
1 parent f4ea8e6 commit 629f38c
Showing 6 changed files with 454 additions and 88 deletions.
42 changes: 25 additions & 17 deletions README.md
@@ -114,37 +114,32 @@ There are constants defined for all of these to help you avoid using the switche
## preg_match_all(): Compilation failed: missing closing parenthesis at offset x
Attempting to match backslashes, or newline chars (e.g. \r|\n|\r\n) is most likely the cause of your troubles here
Attempting to match backslashes, or newline chars (e.g. \r|\n|\r\n) is most likely the cause of your troubles.
You will need to double escape backslashes. To help you avoid needing to figure this out I have provided 2 constants which contain the correct regex patterns for T_ESCAPE_CHAR and T_NEWLINE_ALL
You will need to double escape backslashes. To help you avoid needing to figure this out I have provided the correct regex patterns for T_ESCAPE_CHAR.
```php
$lexicon = [
// ...
Tokenize::T_ESCAPE_CHAR => 'T_ESCAPE', // '\\\\'
Tokenize::T_NEWLINE_ALL => 'T_NEWLINE', // '\n|\r\n|\r'
Tokenize::T_NEWLINE => 'T_NEWLINE', // ';T_NEWLINE;'
// ...
]
```
If you need to match individual newline characters for a specific environment you can use the following constants
### Newlines
Newlines will need to be replaced with a token before they can be matched. By default the T_NEWLINE constant will match `;T_NEWLINE;`
```php
$lexicon = [
// ...
T_NEWLINE_UNIX => 'T_NEWLINE_UNIX', // '\n'
T_NEWLINE_WIN => 'T_NEWLINE_WIN', // '\r\n'
/**
* Not only Mac, but everyone thinks it is.
* \r is actually a valid newline character on all systems
*/
T_NEWLINE_MAC => 'T_NEWLINE_MAC', // '\r'
Tokenize::T_NEWLINE => 'T_NEWLINE', // ';T_NEWLINE;'
// ...
];
]
```
**TIP:**
You should clean newlines before running input through the Tokenizer `$input = str_replace(["\r\n", "\r"], "\n", $input)`
If you need to match individual newline characters for a specific environment you can use the following constants
See the following section on Matching Backslashes and Special Characters if you want more info
@@ -154,8 +149,21 @@ As mentioned above, backslashes must be double escaped.
So to match a single backslash you must use the regex `'\\\\'` (I know, it sucks, but you have to)
To match special characters (tabs, newlines, carriage returns etc) you MUST NOT USE DOUBLE QUOTES. This is because a double quoted "\n" will just become an actual newline character in PHP.
To match special characters (tabs, newlines, carriage returns etc) you will need to replace them with another token first, and then add a token for the replacement string.
```php
$input = str_replace(["\r\n", "\r", "\n"], ";T_NEWLINE;", $input);
$input = str_replace("\t", ";T_TAB;", $input);
So always use single quotes for special characters. In fact, things will work better for you if all your Lexicon patterns are surrounded with single quotes whenever possible. E.g `'\r|\n|\r\\n|\t'`
$lexicon = [
// ...
Token::T_NEWLINE => 'T_NEWLINE',
Token::T_TAB => 'T_TAB'
// ...
];
$Tokenizer = new Tokenizer($lexicon);
$Stream = $Tokenizer->tokenize($input);
```
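The replacement step above can be sketched in plain PHP. This is a minimal illustration, not the library's implementation: `replaceSpecialChars()` is a hypothetical helper, and `preg_match_all()` stands in for the Tokenizer so the example runs without the package installed. It assumes the input never contains the literal strings `;T_NEWLINE;` or `;T_TAB;`.

```php
<?php
// Hypothetical helper (not part of the library): swaps control characters
// for the placeholder strings held by the T_NEWLINE and T_TAB constants.
function replaceSpecialChars(string $input): string
{
    // "\r\n" is listed first so a Windows newline becomes one placeholder, not two.
    $input = str_replace(["\r\n", "\r", "\n"], ';T_NEWLINE;', $input);

    return str_replace("\t", ';T_TAB;', $input);
}

$prepared = replaceSpecialChars("a\tb\r\nc\n");

// The placeholders are ordinary text, so a single quoted pattern matches them
// without any backslash escaping. Placeholders are tried before \w+.
preg_match_all('/;T_NEWLINE;|;T_TAB;|\w+/', $prepared, $matches);

print_r($matches[0]);
```

Because the placeholders contain no backslashes, the pattern no longer depends on how PHP interprets `\n` or `\t` in single versus double quoted strings.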
I will work on some better detection internally for these patterns and attempt to provide better error messages when these errors are encountered *(I'll go real meta and regex the regex before it's ran or something)*
I am working on better internal detection for these patterns and will attempt to provide better error messages when these errors are encountered *(I'll go real meta and regex the regex before it's run or something)*
4 changes: 2 additions & 2 deletions src/Stream.php
@@ -163,13 +163,13 @@ public function hasNext(): bool
}

/**
* Reset
* Rewind
*
* @return self
*/
public function rewind(): self
{
$this->position = -1;
$this->position = 0;

return $this;
}
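The effect of rewinding to position 0 can be illustrated with a small stand-in class. This is only a sketch under assumptions: `CursorSketch`, `current()` and `next()` are hypothetical names chosen for the example, and the real `Stream` class may track its position differently.

```php
<?php
// Hypothetical stand-in for a token stream; not the library's Stream class.
final class CursorSketch
{
    /** @var string[] */
    private array $items;

    private int $position = 0;

    /** @param string[] $items */
    public function __construct(array $items)
    {
        $this->items = array_values($items);
    }

    // Returns the item at the cursor, or null when the cursor is out of range.
    public function current(): ?string
    {
        return $this->items[$this->position] ?? null;
    }

    // Advances the cursor by one and returns the item it lands on.
    public function next(): ?string
    {
        $this->position++;

        return $this->current();
    }

    // Rewinding to 0 puts the cursor back on the first item.
    // In this sketch, rewinding to -1 would make current() return null.
    public function rewind(): self
    {
        $this->position = 0;

        return $this;
    }
}

$cursor = new CursorSketch(['T_STRING', 'T_WHITESPACE', 'T_NUMBER']);
$cursor->next();
$cursor->next();

$first = $cursor->rewind()->current();
```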
101 changes: 46 additions & 55 deletions src/Token.php
@@ -9,73 +9,64 @@
class Token
{
// Special Characters
const T_ESCAPE_CHAR = '\\\\';
const T_NEWLINE_UNIX = '\n';
const T_NEWLINE_WIN = '\r\n';
/**
* Not only Mac, but everyone thinks it is.
* \r is actually a valid newline character on all systems
*/
const T_NEWLINE_MAC = '\r';
const T_CARRIAGE_RETURN = '\r';
const T_RETURN = '\r';
const T_NEWLINE_ALL = '\n|\r\n|\r';
const T_TAB = '\t';
const T_WHITESPACE = '\s+';
const T_ESCAPE_CHAR = '\\\\';
const T_NEWLINE = ';T_NEWLINE;';
const T_TAB = ";T_TAB;";
const T_WHITESPACE = '\s+';

// Miscellaneous Symbols
const T_STAR = '\*';
const T_SLASH = '\/';
const T_PERCENT_SIGN = '%';
const T_HYPHEN = '-';
const T_DOT = '\.';
const T_HASH = '#';
const T_AT = '@';
const T_TILDE = '~';
const T_COMMA = ',';
const T_BACKTICK = '`';
const T_STAR = '\*';
const T_SLASH = '\/';
const T_PERCENT_SIGN = '%';
const T_HYPHEN = '-';
const T_DOT = '\.';
const T_HASH = '#';
const T_AT = '@';
const T_TILDE = '~';
const T_COMMA = ',';
const T_BACKTICK = '`';

// Currency Symbols
const T_DOLLAR = '\$';
const T_EURO = '€';
const T_POUND = '£';
const T_DOLLAR = '\$';
const T_EURO = '€';
const T_POUND = '£';

// Common Arithmetic Symbols
const T_DECIMAL_POINT = '\.';
const T_EQUALS = '=';
const T_MULTIPLY = '\*';
const T_DIVIDE = '\/';
const T_PLUS = '\+';
const T_MINUS = '-';
const T_MODULOUS = '%';
const T_MOD = '%';
const T_EQUALS = '=';
const T_MULTIPLY = '\*';
const T_DIVIDE = '\/';
const T_PLUS = '\+';
const T_MINUS = '-';
const T_MODULOUS = '%';
const T_MOD = '%';

// Common Logical Operators
const T_OR = '\|\|';
const T_AND = '&&';
const T_NOT = '!';
const T_OR = '\|\|';
const T_AND = '&&';
const T_NOT = '!';

// Common programing symbols
const T_VAR = '\$';
const T_UNDERSCORE = '_';
const T_COLON = ':';
const T_SEMICOLON = ';';
const T_PIPE = '\|';
const T_AMPERSAND = '&';
const T_CARET = '\^';
const T_EXCLAIMATION_MARK = '!';
const T_QUESTION_MARK = '\?';
const T_OPEN_PARENTHESIS = '\(';
const T_CLOSE_PARENTHESIS = '\)';
const T_OPEN_CURLY = '\{';
const T_CLOSE_CURLY = '\}';
const T_OPEN_SQUARE = '\[';
const T_CLOSE_SQUARE = '\]';
const T_DOUBLE_QOUTE = '"';
const T_SINGLE_QUOTE = "'";
const T_VAR = '\$';
const T_UNDERSCORE = '_';
const T_COLON = ':';
const T_SEMICOLON = ';';
const T_PIPE = '\|';
const T_AMPERSAND = '&';
const T_CARET = '\^';
const T_EXCLAIMATION_MARK = '!';
const T_QUESTION_MARK = '\?';
const T_OPEN_PARENTHESIS = '\(';
const T_CLOSE_PARENTHESIS = '\)';
const T_OPEN_CURLY = '\{';
const T_CLOSE_CURLY = '\}';
const T_OPEN_SQUARE = '\[';
const T_CLOSE_SQUARE = '\]';
const T_DOUBLE_QUOTE = '"';
const T_SINGLE_QUOTE = "'";

const T_STRING = '\w+';
const T_NUMBER = '\d+';
const T_STRING = '\w+';
const T_NUMBER = '\d+';

/**
* Value