Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(Overly?) Slow parsing #184

Closed
marcelstoer opened this issue Sep 19, 2018 · 17 comments · Fixed by #190
Closed

(Overly?) Slow parsing #184

marcelstoer opened this issue Sep 19, 2018 · 17 comments · Fixed by #190

Comments

@marcelstoer
Copy link
Contributor

marcelstoer commented Sep 19, 2018

  • PHP Version: 7.1.16
  • PHP date.timezone: UTC
  • ICS Parser Version: 2.1.5
  • Windows/Mac/Linux

Description of the Issue:

Parsing the attached exchange.ics.txt takes some 3s. For my use case this is way too slow but I have no idea if this is considered "normal" with this library. Before I download & configure Xdebug (PHP is not my home turf) to see where those milliseconds are lost I would like to know how you assess this case.

File size: 160KB
Number of events: 242
When expanded with span 1: 512

Steps to Reproduce:

<?php
date_default_timezone_set("UTC");
require_once "./vendor/autoload.php";

use ICal\ICal;

$calendarString = "";

try {
    $ical = new ICal(array(
        "defaultSpan"                 => 1,     // Default value: 2
        "defaultWeekStart"            => "MO",  // Default value
        "disableCharacterReplacement" => false, // Default value
        "skipRecurrence"              => false, // Default value
        "useTimeZoneWithRRules"       => false, // Default value
    ));
    $millis = getMillis();
    $ical->initString($calendarString);
    echo "ical->initString() took " . (getMillis() - $millis) . "ms\n";
    $events = $ical->events();
    echo "calendar contains " . sizeof($events) . " events (recurring span: 1y)\n";
} catch (\Exception $e) {
    die($e);
}
function getMillis() {
    return round(microtime(true) * 1000);
}

I pasted the content of the file exchange.ics.txt into $calendarString.

@u01jmg3 u01jmg3 self-assigned this Sep 19, 2018
@u01jmg3
Copy link
Owner

u01jmg3 commented Sep 19, 2018

Without doing the same I have no idea what the bottleneck might be.

Let me know how you get on with Xdebug.

PRs always welcome

@marcelstoer
Copy link
Contributor Author

Without doing the same I have no idea what the bottleneck might be.

I guess if you did, I wouldn't be experiencing this (i.e. you would have fixed it) 😉

Would you say that 3s is nothing unusual for this library?

@u01jmg3
Copy link
Owner

u01jmg3 commented Sep 20, 2018

Would you say that 3s is nothing unusual for this library?

I wouldn't have said 3s was a bad processing time especially when your feed has 242 events including 22 recurrence events that converts to a total of 512 events.

If you can suggest any changes that would improve processing times I would be more than happy to add them.

Sorry I cannot be more help.

@marcelstoer
Copy link
Contributor Author

marcelstoer commented Sep 21, 2018

I managed to install & configure Xdebug on macOS. Then I ran the above test again with profiling enabled (took 15s instead of 3s due to profiling overhead).

There's no CacheGrind binary for Mac that would allow me to parse and visualize the 45MB profiling output file. So, I quickly ran it through WebGrind. You would have to be familiar with the code a bit to properly analyze all the data. You would be the man for that.

I am a bit irritated by the high invocation count for many of the functions, though.

Call list
screen shot 2018-09-21 at 11 39 28

Call graph
call-graph.svg.pdf

CC @squix78

@u01jmg3
Copy link
Owner

u01jmg3 commented Sep 21, 2018

Have you tried setting disableCharacterReplacement=true?

Appreciate the write up

@marcelstoer
Copy link
Contributor Author

I don't know what the flag actually does. The documentation says

Toggles whether to disable all character replacement

That much I deduced from the flag name already 😉Is this for unescaping iCalendar control chars like ; and :?

Either way, I played with this a bit (w/o profiling) but I don't see much difference in overall performance.

@u01jmg3
Copy link
Owner

u01jmg3 commented Sep 21, 2018

That much I deduced from the flag name already 😉Is this for unescaping iCalendar control chars like ; and :?

Nope, see the function the option toggles: https://github.com/u01jmg3/ics-parser/blob/master/src/ICal/ICal.php#L2241-L2242. This option was provided as a PR so I didn't choose its name.

As with anything, if there is an improvement with code or documentation I am more than happy to review a PR.

I am surprised it had little affect as mb_str_replace() had the highest cost in your report and won't run if disableCharacterReplacement=true. It also removes calls to mb_chr() which has the high invocation count.

@marcelstoer
Copy link
Contributor Author

marcelstoer commented Sep 21, 2018

I haven't dug into your code so far but will do so later today (PRs might come out of this). As said earlier, PHP isn't my home turf.
I was also surprised disableCharacterReplacement=true didn't make much difference mainly because cleanData() is featured so prominently in the call graph.

@u01jmg3
Copy link
Owner

u01jmg3 commented Sep 21, 2018

PRs might come out of this

:)

I was also surprised disableCharacterReplacement=true didn't make much difference mainly because cleanData() is featured so prominently in the call graph.

Me too

@marcelstoer
Copy link
Contributor Author

I am surprised it had little affect as mb_str_replace() had the highest cost in your report and won't run if disableCharacterReplacement=true.

First finding: cleanData() is always executed regardless of whether I set true or false in the constructor, at https://github.com/u01jmg3/ics-parser/blob/master/src/ICal/ICal.php#L328 it's always true.
Why? Because the constructor has defaults for both parameters and because I missed to understand that.

I took your example at https://github.com/u01jmg3/ics-parser/blob/master/examples/index.php#L9 and removed the first parameter (see my code above). It runs just fine but the effect is that $options at https://github.com/u01jmg3/ics-parser/blob/master/src/ICal/ICal.php#L234 is not my options but the default empty array 😞

More to come...

@marcelstoer
Copy link
Contributor Author

marcelstoer commented Sep 21, 2018

Before I continue allow me to explain my use case.

I run a script that is fed with remote iCalendar URLs (Google, iCloud, Office360, etc.) through a query string parameter. It will load and parse that calendar and return a new calendar with only the events from yesterday, today, and tomorrow. It's kind of a calendar filter or reducer.

I like your library a lot, it does so many things right. However, looking at a "clean" call graph it's obvious that its design doesn't go well with my requirements.

screen shot 2018-09-21 at 22 20 02

I can understand that for many cases it may make sense to first parse the entire calendar and build an internal representation. It is unfortunate that in eventsFromRange the library first instantiates a new Event for all events and only then reduces the collection to the desired range. For my case it would be best to reduce the list of events during parsing e.g. before processDateConversions as that's also costly.

The challenge is that before you do all the tricky date, time, timezone computations you don't really have "good" data to discard events based on a range predicate. If you allow the optimization to be fuzzy though you can achieve a significant performance boost. Idea:

  • while parsing keep events from -2 to +5 days ignoring time and timezone (still playing with the exact numbers), make the window large enough
  • use eventsFromRange to later reduce precisely to -1d to +1d

I have a working prototype that does

In the constructor

$now = time();
$this->min = $now - 172800; // -2d
$this->max = $now + 432000; // +5d

At the end of initLines

...
        $this->trim();
        $this->processDateConversions();
    }
}

private function trim(){
    $events = (isset($this->cal['VEVENT'])) ? $this->cal['VEVENT'] : array();
    if (empty($events)) {
        return false;
    }
    foreach ($events as $key => $anEvent) {
        if (!$this->isValidDate($anEvent['DTSTART']) || $this->outOfRange($anEvent['DTSTART'])) {
            unset($events[$key]);
            $this->eventCount--;
            continue;
        }
    }
    $this->cal['VEVENT'] = $events;
}

private function outOfRange($dtstart)
{
    // TODO handle date with time zone id e.g. DTSTART;TZID=US-Eastern:19980119T020000
    $start = strtotime(explode("T", $dtstart)[0]);
    return $start < $this->min || $start > $this->max;
}

WDYT? How would such a trimming-parser mode have to be implemented to stand a chance at being added to the library?

@u01jmg3
Copy link
Owner

u01jmg3 commented Sep 24, 2018

First finding: cleanData() is always executed regardless of whether I set true or false in the constructor, at https://github.com/u01jmg3/ics-parser/blob/master/src/ICal/ICal.php#L328 it's always true.
Why? Because the constructor has defaults for both parameters and because I missed to understand that.

I took your example at https://github.com/u01jmg3/ics-parser/blob/master/examples/index.php#L9 and removed the first parameter (see my code above). It runs just fine but the effect is that $options at https://github.com/u01jmg3/ics-parser/blob/master/src/ICal/ICal.php#L234 is not my options but the default empty array

Are you still having issues passing in your own options?


I like your library a lot, it does so many things right.

😃


Thanks for the detailed report. I think your idea sounds great and I know others have requested this before: #149, #182. Happy for you to proceed 👍

@marcelstoer
Copy link
Contributor Author

marcelstoer commented Sep 24, 2018

Are you still having issues passing in your own options?

No, once I set the first param as well (making it new ICal(false, array(...) it works as expected. This just goes to show that default values for constructor/method parameters can lead to undesired side effects - at least in weakly typed languages.

Happy for you to proceed

Sure, just tell me into which direction. My final fix added two additional constructor parameters like below. I didn't want to "pollute" your $options array.

    public function __construct($files = false, array $options = array(), $minEpoch = -1, $maxExpoch = -1)
    {
        ini_set('auto_detect_line_endings', '1');

        $this->minEpoch  = $minEpoch;
        $this->maxExpoch = $maxExpoch;

...
        // in initLines()
        if ($this->minEpoch > -1 && $this->maxExpoch > -1) {
            $this->reduceEventsToMinMaxRange();
        }

If you'd be happy to accept this as an upstream PR I would of course integrate the params into $options. Qs:

  • What should I call these two params? What they are effectively are "lower and upper dtstart date parse ranges respectively"...
  • What data type should they have? Epoch may not be practical for others.
  • Does it make sense for outOfRange (see above) to only consider the date and ignore the time? My reasoning is that as all the tricky date, time, timezone computations haven't run at this point yet it makes little sense to be too precise about the range filter.

Btw, for my use case I reduced total processing time (parsing and eventsFromRange) down to 1/3.

@u01jmg3
Copy link
Owner

u01jmg3 commented Sep 24, 2018

  • What should I call these two params? What they are effectively are "lower and upper dtstart date parse ranges respectively"...

I think we should go for $daysBefore and $daysAfter as two new options which can be optionally and independently changed from null.

  • What data type should they have? Epoch may not be practical for others.

I would go for days as a base unit and use Carbon for manipulation.

  • Does it make sense for outOfRange (see above) to only consider the date and ignore the time? My reasoning is that as all the tricky date, time, timezone computations haven't run at this point yet it makes little sense to be too precise about the range filter.

For a first pass I think it is fine to ignore time because as you say it makes little sense to be precise.


Btw, for my use case I reduced total processing time (parsing and eventsFromRange) down to 1/3.

Excellent

@u01jmg3
Copy link
Owner

u01jmg3 commented Oct 14, 2018

How is the PR going to limit the number of events processed?

@marcelstoer
Copy link
Contributor Author

marcelstoer commented Oct 14, 2018

It doesn't - or only partially.

As laid out above I kept initLines as-is almost until the very end when all "non-relevant" events are removed from $events before any further processing is done (in trim()). However, since the post processing and further data transformations are so expensive this rather small tweak yields great results.

Side note, it might take another 2+ weeks before I get around to doing a proper PR.

@u01jmg3
Copy link
Owner

u01jmg3 commented Oct 14, 2018

Side note, it might take another 2+ weeks before I get around to doing a proper PR.

No worries, there is no rush :)

@u01jmg3 u01jmg3 added the change label Nov 4, 2018
@u01jmg3 u01jmg3 added this to the v2.x.x milestone Jan 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants