Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added Span.Until to Superpower.Parsers #129

Merged
merged 4 commits into from
Apr 13, 2021
Merged

Added Span.Until to Superpower.Parsers #129

merged 4 commits into from
Apr 13, 2021

Conversation

Ellested
Copy link
Contributor

I rely on the compiler to inline functions as necessary, so I have not done any optimizations.

@nblumhardt
Copy link
Member

Thanks! We'll take a look as soon as we have some time free to think it through 👍

@nblumhardt
Copy link
Member

Sorry about the slow turnaround on this. I have a few thoughts but haven't been through in enough detail to be able to offer a useful review just yet - I'll add some comments anyway in case they're useful to move things forwards.

{
TextSpan x = input;

while (!x.IsAtEnd)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be nice to look at using string.IndexOf here to speed up the search, since it'll be cleverly implemented at the low level? There's a similar algorithm in the regex matcher at: https://github.com/datalust/superpower/blob/dev/src/Superpower/Parsers/Span.cs#L219, I think.

Copy link
Contributor Author

@Ellested Ellested Mar 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understood the indexOf reference you link to - is it the right place?
Anyway, I made a new version that I believe should be faster, as it only scans the input once.
Maybe it's better that you do the implementation, I think you are far more into all the details :-)

x = x.ConsumeChar().Remainder;
}

return Result.Empty<TextSpan>(input, $"Until expected {text}");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would need to pull out the formatting here and put it in a variable outside the returned closure, so that we don't allocate strings for every non-matching result.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it :-)

@@ -83,5 +84,33 @@ public void RegexMatches()
var r = parser(input);
Assert.Equal("Foo", r.Value.ToStringValue());
}

[Theory]
[InlineData("", "")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't Until("STOP") fail if `"STOP" is missing from the input string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I moved it to the fail expecting test set, and removed the OptionalOrDefault()

[InlineData("123STOP", "123")]
[InlineData("STOP123", "")]
[InlineData("123STOP123STOP123", "123")]
public void UntilMatches(string text, string expected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could rename to be more descriptive - UntilMatchesWhenStopwordIsPresent() perhaps? (Similar for the tests below?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

[InlineData("12345")]
public void UntilParseFails(string text)
{
Assert.Throws<ParseException>(() => Span.Until("STOP").Parse(text));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to check that the error message and position are what we expect, if possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@Ellested
Copy link
Contributor Author

Is it that bad :-)

The only speed improvement I currently see is to avoid the remainder pattern. I'm actually a little puzzled on why you can only consume one char on the TextSpan. Why not just advance the position and keep consuming?
Of course the struct is allocated on the stack, but there's still some copying that needs to through, and the only difference is the position. Well, I guess that's just my limited knowledge of not working with the details.

@nblumhardt
Copy link
Member

Hehe 😄 no it's not bad at all! Just my week that got out of control 😅

I think this is good, I just need to give it another look over and hit the button, but getting back into the right head-space takes a bit of time. Sorry it's a a slow process right now!

I haven't had a chance to check out what you mean regarding the character-by-character consumption, but I suspect it's because of the line number tracking Superpower does. There are definitely still opportunities to improve all over the place, though.

@Ellested
Copy link
Contributor Author

Ellested commented Mar 20, 2021

I'm glad that you didn't press that button. There was a major issue with the calculation of the found position, as it didn't include the input offset. I found out, when I used the new version today in my other project :-)

Regarding my question consider the ConsumeChar:

public Result<char> ConsumeChar()
{
	EnsureHasValue();
	if (IsAtEnd)
	{
		return Result.Empty<char>(this);
	}
	char ch = Source[Position.Absolute];
	return Result.Value(ch, this, new TextSpan(Source, Position.Advance(ch), Length - 1));
}

My question is, why it's returning a new TextSpan everytime we consume a char.
Why not just keep the current TextSpan, and advance the position until something interesting happens?
I'm only interested in the character, but I get a complete result struct where only the position and length is changed.
So even though it's a struct, it is still memory initialization/copying of bytes, that weighs in tight loops.

I'm only asking because I'm guessing and haven't spend any time on stuff like this before. But I'm super happy about starting to get deeper into it, as it opens up a wealth of new opportunities.

PS. How about a PeekChar method :-)

@nblumhardt
Copy link
Member

Glad you were able to spot the issue!

TextSpan is immutable so that parsers can treat them like bookmarks to try different options while moving backwards and forwards within the string. HTH!

Copy link
Member

@nblumhardt nblumhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just looked over this in a bit more detail, HTH!

{
if (text == null) throw new ArgumentNullException(nameof(text));

var expectation = $"Until expected '{text}'";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be "(backtick){text}(backtick)" for consistency with other messages; see e.g. https://github.com/datalust/superpower/blob/dev/test/Superpower.Tests/ErrorMessageScenarioTests.cs#L115

We won't need this if we use the Except() option.

/// <param name="text">The text to match until. The text is not included in the result.</param>
/// <returns>A parser that will match everything until the text argument is matched.</returns>
/// <exception cref="ArgumentNullException">The text is null.</exception>
public static TextParser<TextSpan> Until(string text)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just realized that we call this Except when working with characters:

https://github.com/datalust/superpower/blob/dev/src/Superpower/Parsers/Character.cs#L95

Character.Except('c') and Span.Except("stop") would be similar in design; the advantage of Except() over Until() is that the user can choose whether or not the stopword must be present.

from content in Span.Except("stop")
from _ in Span.EqualTo("stop")

would be functionally identical to Until(), while the second part of that construction could be omitted if the stopword is optional (say, where an end-of-file might have the same meaning).

There would be some reprocessing to match the stopword, but the position of it would be passed through via the remainder so minimal searching would need to be done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Until() should leave the stopword in the reminder, so I think in it's current state it is actually identical to Span.Except()
I also like that naming better, more concise. So I think it's just a matter of renaming it.

@@ -258,5 +258,39 @@ public static TextParser<TextSpan> MatchedBy<T>(TextParser<T> parser)
result.Remainder);
};
}

/// <summary>
/// Parse as much of the input as matches <paramref name="text" />.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matches text from the source string until is seen, or the end of input is reached. ? (See below for the variation regarding EOF.)

TextSpan remainder = input;
var matchIndex = 0;

while (remainder.Length>0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still planning to use IndexOf() here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By IndexOf(), you mean the string method right?
If so, I was wondering if allocating new strings every time we visit, would be a better choice than just using the TextSpan consume/reminder pattern. As a general thought and related to my last question, I think I would have chosen a design with a mutable TextSpan for performance reasons. I'm not sure that I understand the need for immutability without also knowing a lot more of the details behind your design choices - but it's just what comes to mind when you start thinking about it as an outsider. If you create a new TextSpan you could still leave it untouched without advancing it (for use as a marker), so it still puzzles me a little that it has to be immutable. I would love to hear why, but don't write a long story - I'm just curious :-)

@SuperJMN
Copy link

When is this going to be merged? 🤔

@Ellested
Copy link
Contributor Author

I'm not sure whether it's me or Nicolas we are waiting for :-)

I would like to know if I should go for indexOf or keep the consume/remainder pattern before continuing. Maybe it could also be determined later. Might need a benchmark test to see which is better, as I think it will vary with the size of the scan.

Well, I will update according to the last comments, and then we can decide on the indexOf after that :-)

@SuperJMN
Copy link

@nblumhardt this looks OK to me!

@nblumhardt nblumhardt merged commit 21f53b2 into datalust:dev Apr 13, 2021
@nblumhardt
Copy link
Member

https://www.nuget.org/packages/Superpower/3.0.0-dev-00201 is out :-)

Thanks for sticking with this, @Ellested!

I made some tweaks on the way through - current version is at https://github.com/datalust/superpower/blob/dev/src/Superpower/Parsers/Span.cs#L275

  • Uses string.IndexOf() to simplify things a little
  • Disallows zero-length matches; this is important because Span.Except("STOP").Many() will otherwise loop forever if the match is zero-length (Many() is the only Superpower parser allowed to match a zero-length input, IIRC)
  • Adds ExceptIgnoreCase()

@Ellested
Copy link
Contributor Author

Looks great. I didn't realize the source string was available all this time, so I thought you meant I should use intput.ToStringValue() and then IndexOf() on the result. Now I get it :-)

@nblumhardt
Copy link
Member

Ah - great! 👍 Thanks again!

@nblumhardt nblumhardt mentioned this pull request Sep 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants