This repository has been archived by the owner on May 27, 2024. It is now read-only.

Add tests for big feeds #123

Open
barbeau opened this issue Apr 12, 2017 · 8 comments

@barbeau
Member

barbeau commented Apr 12, 2017

Summary:

We need to make sure that as we add new rules, the validator can continue to run in real-time on production-sized feeds for major cities.

I posted a question on the GTFS-rt list asking for examples of very large feeds:
https://groups.google.com/forum/#!topic/gtfs-realtime/mM8cQIIV_-Y

These have been suggested to me so far, with largest coming first:

We should add some unit tests that do basic benchmarking to ensure we're not exceeding a given duration when processing feeds. I think 2 seconds may be reasonable, but we'll need to test. We'll also need to figure out how this works for CI, as Travis is significantly underpowered when compared to a typical desktop.
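A minimal sketch of the kind of timing check described above. The class and method names are illustrative and not part of the validator, the `validateIteration` task is a hypothetical stand-in for one GTFS-rt validation pass, and the 2-second budget is only the tentative figure mentioned above:

```java
import java.time.Duration;

public class FeedBenchmarkSketch {

    /** Runs the task once and returns the elapsed wall-clock time. */
    static Duration time(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return Duration.ofNanos(System.nanoTime() - start);
    }

    /** Returns true if the task finished within the given budget. */
    static boolean withinBudget(Runnable task, Duration budget) {
        return time(task).compareTo(budget) <= 0;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for validating one GTFS-rt iteration.
        Runnable validateIteration = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unexpected overflow");
            }
        };

        Duration budget = Duration.ofSeconds(2); // tentative budget, to be tuned
        Duration elapsed = time(validateIteration);
        System.out.println("Elapsed ms: " + elapsed.toMillis());
        if (elapsed.compareTo(budget) > 0) {
            throw new AssertionError("Exceeded budget of " + budget);
        }
    }
}
```

Because CI machines may be much slower than a desktop, the budget would probably need to be configurable per environment rather than hard-coded.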

@barbeau
Member Author

barbeau commented Apr 12, 2017

If I try to run the Dutch feed with the -Xmx8g parameter on my machine (dual Xeon @ 2.5 GHz w/ 16 GB RAM), I get this exception after it runs for a very long time (I left and came back an hour later):

javax.servlet.ServletException: org.glassfish.jersey.server.ContainerException: java.lang.OutOfMemoryError: GC overhead limit exceeded
	at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:423)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:386)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:334)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:221)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:800)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
	at org.eclipse.jetty.server.Server.handle(Server.java:497)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:313)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
	at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:626)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:546)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.glassfish.jersey.server.ContainerException: java.lang.OutOfMemoryError: GC overhead limit exceeded
	at org.glassfish.jersey.servlet.internal.ResponseWriter.rethrow(ResponseWriter.java:256)
	at org.glassfish.jersey.servlet.internal.ResponseWriter.failure(ResponseWriter.java:238)
	at org.glassfish.jersey.server.ServerRuntime$Responder.process(ServerRuntime.java:486)
	at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:316)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
	at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317)
	at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:291)
	at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1140)
	at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:403)
	... 17 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:68)
	at java.lang.StringBuilder.<init>(StringBuilder.java:89)
	at org.onebusaway.csv_entities.DelimitedTextParser.parse(DelimitedTextParser.java:65)
	at org.onebusaway.csv_entities.CSVLibrary.parse(CSVLibrary.java:131)
	at org.onebusaway.csv_entities.CsvTokenizerStrategy.parse(CsvTokenizerStrategy.java:34)
	at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:154)
	at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:120)
	at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:115)
	at org.onebusaway.gtfs.serialization.GtfsReader.run(GtfsReader.java:172)
	at org.onebusaway.gtfs.serialization.GtfsReader.run(GtfsReader.java:160)
	at com.conveyal.gtfs.validator.json.FeedProcessor.load(FeedProcessor.java:73)
	at com.conveyal.gtfs.validator.json.FeedProcessor.run(FeedProcessor.java:44)
	at edu.usf.cutr.gtfsrtvalidator.api.resource.GtfsFeed.postGtfsFeed(GtfsFeed.java:180)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161)
	at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:160)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)
	at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:308)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
	at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317)

So it looks like it's getting hung up in the static GTFS validation using the Conveyal gtfs-validator.
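One way to check whether the static GTFS load is what exhausts the heap would be to log heap usage around that step. The following is a rough sketch using only `java.lang.Runtime`; the class name is illustrative, and the allocation in `main` is a hypothetical placeholder that does not call the Conveyal or OneBusAway libraries:

```java
public class HeapUsageSketch {

    /** Heap currently in use, in bytes (allocated heap minus free heap). */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Maximum heap the JVM will attempt to use, e.g. as limited by -Xmx. */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Formats a one-line report of used and maximum heap in megabytes. */
    static String report(String label) {
        long mb = 1024L * 1024L;
        return String.format("%s: used %d MB of max %d MB",
                label, usedHeapBytes() / mb, maxHeapBytes() / mb);
    }

    public static void main(String[] args) {
        System.out.println(report("before load"));

        // Hypothetical placeholder for the static GTFS load step:
        // allocate 16 blocks of 1 MiB each.
        byte[][] blocks = new byte[16][1024 * 1024];

        System.out.println(report("after load"));
        // Use the array afterwards so the reference is still live above.
        System.out.println("Held blocks: " + blocks.length);
    }
}
```

These numbers are only approximate, since the garbage collector can run at any point between the two readings; a profiler would give a more reliable picture.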

If I run the Dutch GTFS-rt feed with random GTFS data (I used HART in Tampa), then it processes each GTFS-rt iteration in about 1.1 seconds.

Using MBTA data, it processes each GTFS-rt iteration in about 1.1 seconds as well.

@barbeau barbeau added this to the v1.0 milestone Apr 12, 2017
@barbeau barbeau self-assigned this Apr 14, 2017
@barbeau
Member Author

barbeau commented May 3, 2017

Here's a good list of GTFS-rt feeds from Transitfeeds.com:
http://transitfeeds.com/search?q=gtfsrt

@barbeau
Member Author

barbeau commented May 8, 2017

Transitland issue for adding support for GTFS-rt feeds - https://github.com/transitland/transitland/issues/77.

@barbeau
Member Author

barbeau commented Sep 19, 2017

We could use the batch processor for benchmarking feed processing times - see README "Configuration options -> Batch processing":
https://github.com/CUTR-at-USF/gtfs-realtime-validator#configuration-options

@skjolber

@barbeau did you try running the out-of-memory dataset using a profiler?

@barbeau
Member Author

barbeau commented Mar 11, 2018

No, not yet.

@barbeau
Member Author

barbeau commented Dec 14, 2021

A good approach for this might be to graph performance on each PR instead of imposing hard limits via a unit test - that's what OpenTripPlanner is doing here:
opentripplanner/OpenTripPlanner#3783
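A minimal sketch of recording timings so they can be graphed later, rather than failing a test on a hard limit. This is not how OpenTripPlanner implements it; the class name, the CSV layout, the `timings.csv` file name, and the commit and feed labels are all assumptions made for illustration:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Collections;

public class TimingLogSketch {

    /** Formats one CSV row: commit id, feed name, elapsed milliseconds. */
    static String row(String commit, String feed, long elapsedMillis) {
        return commit + "," + feed + "," + elapsedMillis;
    }

    /** Appends a row to the CSV file, writing a header first if the file is new. */
    static void append(Path csv, String commit, String feed, long elapsedMillis)
            throws IOException {
        if (Files.notExists(csv)) {
            Files.write(csv, Collections.singletonList("commit,feed,elapsed_ms"),
                    StandardCharsets.UTF_8);
        }
        Files.write(csv, Collections.singletonList(row(commit, feed, elapsedMillis)),
                StandardCharsets.UTF_8, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path csv = Paths.get("timings.csv"); // hypothetical output file

        long start = System.nanoTime();
        // Placeholder for processing one feed iteration.
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;

        // "abc1234" and "example-feed" are made-up labels.
        append(csv, "abc1234", "example-feed", elapsedMillis);
    }
}
```

A CI job could run something like this on each PR and plot the resulting file, which makes regressions visible without tying the build result to the speed of the CI machine.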

@derhuerst
Contributor

derhuerst commented Dec 14, 2021

DELFI e.V. is a non-profit that aggregates the transit datasets of all the local transit authorities/providers to create a unified feed for Germany. Its official role is to publish NeTEx, as mandated by the EU regulation.

But it also publishes a GTFS feed generated from the merged data, which is currently 333 MB in size. Its official site doesn't provide a direct & script-friendly URL for it (🙄), but @juliuste kindly mirrors it to https://de.data.public-transport.earth/gtfs-germany.zip.

Currently, it is not much larger than the Dutch feed, but it will likely grow, since over the coming months & years missing regions as well as lots of stop/station & pathways.txt topologies will likely be added.

Edit: Unfortunately, to my knowledge, there are no realtime feeds available right now.
