Add tests for big feeds #30

Open
barbeau opened this issue Jan 3, 2022 · 8 comments

Comments

@barbeau
Member

barbeau commented Jan 3, 2022

Issue by barbeau
Wednesday Apr 12, 2017 at 15:21 GMT
Originally opened as CUTR-at-USF#123


Summary:

We need to make sure that as we add new rules, the validator can continue to run in real-time on production-sized feeds for major cities.

I posted a question on the GTFS-rt list asking for examples of very large feeds:
https://groups.google.com/forum/#!topic/gtfs-realtime/mM8cQIIV_-Y

These have been suggested to me so far, with largest coming first:

We should add some unit tests that do basic benchmarking to ensure we're not exceeding a given duration when processing feeds. I think 2 seconds may be reasonable, but we'll need to test. We'll also need to figure out how this works for CI, as Travis is significantly underpowered when compared to a typical desktop.
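A minimal sketch of what such a timing check might look like, assuming a hypothetical `FeedBenchmark` helper (not part of the validator) and a stand-in task in place of the real validation call. It takes the median of several runs rather than a single sample, since a slow CI machine can produce noisy timings; the 2-second budget is only the figure floated above and would need tuning.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of the validator: times a task and checks it against a budget. */
final class FeedBenchmark {

    /** Runs the task `runs` times and returns the median wall-clock time in milliseconds. */
    static long medianMillis(Runnable task, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be at least 1");
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = (System.nanoTime() - start) / 1_000_000;
        }
        Arrays.sort(samples);
        // For an even number of runs this picks the upper of the two middle samples.
        return samples[runs / 2];
    }

    /** Returns true if the median time stays within the given budget. */
    static boolean withinBudget(Runnable task, int runs, long budgetMillis) {
        return medianMillis(task, runs) <= budgetMillis;
    }

    public static void main(String[] args) {
        // Stand-in for one GTFS-rt validation iteration; replace with the real call.
        Runnable iteration = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unexpected overflow");
            }
        };
        long budgetMillis = 2_000; // assumed budget, still to be tuned
        System.out.println("median ms: " + medianMillis(iteration, 5));
        System.out.println("within budget: " + withinBudget(iteration, 5, budgetMillis));
    }
}
```

A unit test could then assert `withinBudget(...)` for each large feed, with a more generous budget on CI than on a desktop.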

@barbeau
Member Author

barbeau commented Jan 3, 2022

Comment by barbeau
Wednesday Apr 12, 2017 at 19:15 GMT


If I try to run the Dutch feed with -Xmx8g parameter on my machine (dual Xeon @ 2.5 GHz w/ 16GB RAM), I get this exception after it runs for a very long time (I left and came back an hour later):

javax.servlet.ServletException: org.glassfish.jersey.server.ContainerException: java.lang.OutOfMemoryError: GC overhead limit exceeded
	at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:423)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:386)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:334)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:221)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:800)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
	at org.eclipse.jetty.server.Server.handle(Server.java:497)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:313)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
	at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:626)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:546)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.glassfish.jersey.server.ContainerException: java.lang.OutOfMemoryError: GC overhead limit exceeded
	at org.glassfish.jersey.servlet.internal.ResponseWriter.rethrow(ResponseWriter.java:256)
	at org.glassfish.jersey.servlet.internal.ResponseWriter.failure(ResponseWriter.java:238)
	at org.glassfish.jersey.server.ServerRuntime$Responder.process(ServerRuntime.java:486)
	at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:316)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
	at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317)
	at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:291)
	at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1140)
	at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:403)
	... 17 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:68)
	at java.lang.StringBuilder.<init>(StringBuilder.java:89)
	at org.onebusaway.csv_entities.DelimitedTextParser.parse(DelimitedTextParser.java:65)
	at org.onebusaway.csv_entities.CSVLibrary.parse(CSVLibrary.java:131)
	at org.onebusaway.csv_entities.CsvTokenizerStrategy.parse(CsvTokenizerStrategy.java:34)
	at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:154)
	at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:120)
	at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:115)
	at org.onebusaway.gtfs.serialization.GtfsReader.run(GtfsReader.java:172)
	at org.onebusaway.gtfs.serialization.GtfsReader.run(GtfsReader.java:160)
	at com.conveyal.gtfs.validator.json.FeedProcessor.load(FeedProcessor.java:73)
	at com.conveyal.gtfs.validator.json.FeedProcessor.run(FeedProcessor.java:44)
	at edu.usf.cutr.gtfsrtvalidator.api.resource.GtfsFeed.postGtfsFeed(GtfsFeed.java:180)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161)
	at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:160)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)
	at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:308)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
	at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317)

So it looks like it's getting hung up in the static GTFS validation using the Conveyal gtfs-validator.

If I run the Dutch GTFS-rt feed against unrelated GTFS data (I used HART in Tampa), it processes each GTFS-rt iteration in about 1.1 seconds.

Using MBTA data, it processes each GTFS-rt iteration in about 1.1 seconds as well.

@barbeau
Member Author

barbeau commented Jan 3, 2022

Comment by barbeau
Wednesday May 03, 2017 at 21:16 GMT


Here's a good list of GTFS-rt feeds from Transitfeeds.com:
http://transitfeeds.com/search?q=gtfsrt

@barbeau
Member Author

barbeau commented Jan 3, 2022

Comment by barbeau
Monday May 08, 2017 at 19:40 GMT


Transitland issue for adding support for GTFS-rt feeds - https://github.com/transitland/transitland/issues/77.

@barbeau
Member Author

barbeau commented Jan 3, 2022

Comment by barbeau
Tuesday Sep 19, 2017 at 17:55 GMT


We could use the batch processor for benchmarking feed processing times - see README "Configuration options -> Batch processing":
https://github.com/CUTR-at-USF/gtfs-realtime-validator#configuration-options

@barbeau
Member Author

barbeau commented Jan 3, 2022

Comment by skjolber
Sunday Mar 11, 2018 at 10:01 GMT


@barbeau did you try running the out-of-memory dataset using a profiler?

@barbeau
Member Author

barbeau commented Jan 3, 2022

Comment by barbeau
Sunday Mar 11, 2018 at 21:17 GMT


No, not yet.

@barbeau
Member Author

barbeau commented Jan 3, 2022

Comment by barbeau
Tuesday Dec 14, 2021 at 15:24 GMT


A good approach for this might be to graph performance on each PR instead of imposing hard limits via a unit test - that's what OpenTripPlanner is doing here:
opentripplanner/OpenTripPlanner#3783
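
As a rough sketch of the recording side of that idea, each CI run could append one timing row per feed to a CSV file that a separate plotting step reads. The `BenchmarkLog` class, the column names, and the file layout below are all assumptions for illustration; this is not how OpenTripPlanner does it and not existing validator code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Collections;

/** Hypothetical helper: appends one benchmark result per line to a CSV file. */
final class BenchmarkLog {

    static final String HEADER = "commit,feed,millis";

    /** Formats a single result as a CSV row (assumes the values contain no commas). */
    static String toCsvRow(String commit, String feed, long millis) {
        return commit + "," + feed + "," + millis;
    }

    /** Creates the file with a header if it is missing, then appends the row. */
    static void append(Path csv, String commit, String feed, long millis) throws IOException {
        if (Files.notExists(csv)) {
            Files.write(csv, Collections.singletonList(HEADER), StandardCharsets.UTF_8);
        }
        Files.write(csv,
                Collections.singletonList(toCsvRow(commit, feed, millis)),
                StandardCharsets.UTF_8,
                StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder values; a CI job would pass the real commit hash and measured time.
        append(Paths.get("benchmark-timings.csv"), "abc1234", "example-feed", 1100);
    }
}
```

Keeping the raw timings per commit makes a regression visible as a trend, without a build failing because one CI run happened to be slow.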

@barbeau
Member Author

barbeau commented Jan 3, 2022

Comment by derhuerst
Tuesday Dec 14, 2021 at 16:41 GMT


DELFI e.V. is a non-profit that aggregates the transit datasets of all the local transit authorities/providers to create a unified feed for Germany. Its official role is to publish NeTEx, as mandated by EU regulation.

But it also publishes a GTFS feed generated from the merged data, which is currently 333 MB in size. Its official site doesn't provide a direct & script-friendly URL for it (🙄), but @juliuste kindly mirrors it to https://de.data.public-transport.earth/gtfs-germany.zip.

Currently, it is not much larger than the Dutch feed, but it will likely grow over the coming months & years as missing regions and lots of stop/station & pathways.txt topologies are added.

Edit: Unfortunately, to my knowledge, there are no realtime feeds available right now.

@isabelle-dr isabelle-dr added this to the v1.1 milestone Jan 9, 2022