Add tests for big feeds #123

barbeau · 2017-04-12T15:21:11Z

Summary:

We need to make sure that as we add new rules, the validator can continue to run in real-time on production-sized feeds for major cities.

I posted a question on the GTFS-rt list asking for examples of very large feeds:
https://groups.google.com/forum/#!topic/gtfs-realtime/mM8cQIIV_-Y

These have been suggested to me so far, with largest coming first:

Dutch feed (http://gtfs.openov.nl/ - apparently OpenTripPlanner instances with 24-32GB of memory are used for this)
- GTFS - http://gtfs.openov.nl/gtfs-rt/gtfs-openov-nl.zip (~261MB)
- TripUpdates - http://gtfs.openov.nl/gtfs-rt/tripUpdates.pb (~8.4MB)
- VehiclePositions - http://gtfs.openov.nl/gtfs-rt/vehiclePositions.pb (~617K)
MBTA
- GTFS - https://cdn.mbta.com/MBTA_GTFS.zip (~13.4MB)
- TripUpdates - https://cdn.mbta.com/realtime/TripUpdates.pb (~8.6MB)
- VehiclePositions - https://cdn.mbta.com/realtime/VehiclePositions.pb (~44KB)
SEQ (Translink)
- GTFS - https://gtfsrt.api.translink.com.au/GTFS/SEQ_GTFS.zip (~28MB)
- Combined (TripUpdates + VehiclePositions) feed - https://gtfsrt.api.translink.com.au/Feed/SEQ (~2.2MB)
BART (http://www.bart.gov/schedules/developers)
- GTFS - http://www.bart.gov/sites/default/files/docs/google_transit_20170325_v3.zip (427KB)
- TripUpdates - http://api.bart.gov/gtfsrt/tripupdate.aspx (3.1KB - it's small because only 1 stop_time_update per trip)
- VehiclePositions - BART doesn't have this
NYC (but they are likely split by borough)
LA Metro (not publicly shared)
MTC for SF Bay Area (http://511.org/developers/list/apis/) (According to http://assets.511.org/pdf/nextgen/developers/Open_511_Data_Exchange_Specification_v1.0_Transit.pdf, it doesn't seem that you can pull out more than one agency at a time, so no feed that includes all bay area transit agencies exists)
CTA (Doesn't seem to be public? http://www.transitchicago.com/developers/)
HART
- GTFS - http://gohart.org/google/google_transit.zip (~2KB)
- TripUpdates - http://api.tampa.onebusaway.org:8088/trip-updates (~9KB)
- VehiclePositions - http://api.tampa.onebusaway.org:8088/vehicle-positions (~9KB)

We should add some unit tests that do basic benchmarking to ensure we're not exceeding a given duration when processing feeds. I think 2 seconds may be reasonable, but we'll need to test. We'll also need to figure out how this works for CI, as Travis is significantly underpowered when compared to a typical desktop.
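A rough sketch of what such a timing check could look like, assuming the 2-second figure above as a starting budget. The class and method names (`FeedTimingSketch`, `timeMillis`, `withinBudget`) are hypothetical and not part of the validator, and the workload in `main` is a placeholder for a real validation pass:

```java
import java.util.concurrent.TimeUnit;

public class FeedTimingSketch {

    /** Assumed budget taken from the discussion; it would need tuning per machine and CI. */
    public static final long MAX_MILLIS = 2000;

    /** Runs the task once and returns the elapsed wall-clock time in milliseconds. */
    public static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    /** Returns true if a single run of the task finished within the given budget. */
    public static boolean withinBudget(Runnable task, long budgetMillis) {
        return timeMillis(task) <= budgetMillis;
    }

    public static void main(String[] args) {
        // Placeholder workload standing in for one validation pass over a feed
        Runnable fakeValidation = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unexpected overflow");
            }
        };
        long elapsed = timeMillis(fakeValidation);
        System.out.println("Elapsed ms: " + elapsed
                + ", within budget: " + (elapsed <= MAX_MILLIS));
    }
}
```

A single timed run is noisy, so a real test would probably want a warm-up pass and a looser budget on CI than on a desktop.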

barbeau · 2017-04-12T19:15:35Z

If I try to run the Dutch feed with -Xmx8g parameter on my machine (dual Xeon @ 2.5 GHz w/ 16GB RAM), I get this exception after it runs for a very long time (I left and came back an hour later):

javax.servlet.ServletException: org.glassfish.jersey.server.ContainerException: java.lang.OutOfMemoryError: GC overhead limit exceeded
	at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:423)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:386)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:334)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:221)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:800)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
	at org.eclipse.jetty.server.Server.handle(Server.java:497)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:313)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
	at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:626)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:546)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.glassfish.jersey.server.ContainerException: java.lang.OutOfMemoryError: GC overhead limit exceeded
	at org.glassfish.jersey.servlet.internal.ResponseWriter.rethrow(ResponseWriter.java:256)
	at org.glassfish.jersey.servlet.internal.ResponseWriter.failure(ResponseWriter.java:238)
	at org.glassfish.jersey.server.ServerRuntime$Responder.process(ServerRuntime.java:486)
	at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:316)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
	at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317)
	at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:291)
	at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1140)
	at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:403)
	... 17 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:68)
	at java.lang.StringBuilder.<init>(StringBuilder.java:89)
	at org.onebusaway.csv_entities.DelimitedTextParser.parse(DelimitedTextParser.java:65)
	at org.onebusaway.csv_entities.CSVLibrary.parse(CSVLibrary.java:131)
	at org.onebusaway.csv_entities.CsvTokenizerStrategy.parse(CsvTokenizerStrategy.java:34)
	at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:154)
	at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:120)
	at org.onebusaway.csv_entities.CsvEntityReader.readEntities(CsvEntityReader.java:115)
	at org.onebusaway.gtfs.serialization.GtfsReader.run(GtfsReader.java:172)
	at org.onebusaway.gtfs.serialization.GtfsReader.run(GtfsReader.java:160)
	at com.conveyal.gtfs.validator.json.FeedProcessor.load(FeedProcessor.java:73)
	at com.conveyal.gtfs.validator.json.FeedProcessor.run(FeedProcessor.java:44)
	at edu.usf.cutr.gtfsrtvalidator.api.resource.GtfsFeed.postGtfsFeed(GtfsFeed.java:180)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161)
	at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:160)
	at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
	at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)
	at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:308)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
	at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317)

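So it looks like it's getting hung up in the static GTFS validation using the Conveyal gtfs-validator: the OutOfMemoryError is thrown while FeedProcessor.load is reading the GTFS data through the OneBusAway GtfsReader.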

If I run the Dutch GTFS-rt feed against unrelated GTFS data (I used HART in Tampa), then it processes each GTFS-rt iteration in about 1.1 seconds.

Using MBTA data, it processes each GTFS-rt iteration in about 1.1 seconds as well.
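When reproducing the out-of-memory run, it may help to log the heap limit the JVM actually received from -Xmx before loading the GTFS data. A minimal sketch using only the standard `Runtime` API; the class name `HeapReportSketch` and its helpers are hypothetical:

```java
public class HeapReportSketch {

    /** Converts a byte count to whole mebibytes. */
    public static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Describes the maximum heap and the heap currently in use. */
    public static String describeHeap() {
        Runtime rt = Runtime.getRuntime();
        // totalMemory() is the heap currently reserved; freeMemory() is the unused part of it
        long used = rt.totalMemory() - rt.freeMemory();
        return "max=" + toMiB(rt.maxMemory()) + "MiB, used=" + toMiB(used) + "MiB";
    }

    public static void main(String[] args) {
        // Run with e.g. -Xmx8g to check that the limit was applied
        System.out.println(describeHeap());
    }
}
```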

barbeau · 2017-05-03T21:16:53Z

Here's a good list of GTFS-rt feeds from Transitfeeds.com:
http://transitfeeds.com/search?q=gtfsrt

barbeau · 2017-05-08T19:40:32Z

Transitland issue for adding support for GTFS-rt feeds - https://github.com/transitland/transitland/issues/77.

barbeau · 2017-09-19T17:55:22Z

We could use the batch processor for benchmarking feed processing times - see README "Configuration options -> Batch processing":
https://github.com/CUTR-at-USF/gtfs-realtime-validator#configuration-options

skjolber · 2018-03-11T10:01:19Z

@barbeau did you try running the out-of-memory dataset using a profiler?

barbeau · 2018-03-11T21:17:11Z

No, not yet.

barbeau · 2021-12-14T15:24:17Z

A good approach for this might be to graph performance on each PR instead of imposing hard limits via a unit test - that's what OpenTripPlanner is doing here:
opentripplanner/OpenTripPlanner#3783
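As a rough illustration of the "record and graph" idea (not how OpenTripPlanner does it), each run could append one timing row to a CSV file that a plotting step reads later. Everything here is an assumption for the sketch: the class name `TimingLogSketch`, the column layout (run label, feed name, milliseconds), and the example values in `main`:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;

public class TimingLogSketch {

    /** Builds one CSV row: run label, feed name, elapsed milliseconds. */
    public static String csvRow(String runLabel, String feedName, long millis) {
        return runLabel + "," + feedName + "," + millis;
    }

    /** Appends the row as a new line, creating the file if it does not exist. */
    public static void append(Path file, String row) throws IOException {
        Files.write(file, Collections.singletonList(row), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        // Temporary file for the demo; a CI job would write to a persisted artifact instead
        Path file = Files.createTempFile("timings", ".csv");
        append(file, csvRow("example-run-1", "example-feed", 1100));
        append(file, csvRow("example-run-2", "example-feed", 1250));
        System.out.println(Files.readAllLines(file, StandardCharsets.UTF_8));
    }
}
```

Tracking a trend this way avoids failing builds on a noisy CI machine while still making regressions visible.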

derhuerst · 2021-12-14T16:41:47Z

DELFI e.V. is a non-profit that aggregates the transit datasets of all the local transit authorities/providers to create a unified feed for Germany. Its official role is to publish NeTEx, as mandated by EU regulation.

But it also publishes a GTFS feed generated from the merged data, which is currently 333 MB in size. Its official site doesn't provide a direct & script-friendly URL for it (🙄), but @juliuste kindly mirrors it to https://de.data.public-transport.earth/gtfs-germany.zip.

Currently, it is not much larger than the Dutch feed, but it will likely grow over the coming months & years as missing regions and lots of stop/station & pathways.txt topologies are added.

Edit: Unfortunately, to my knowledge, there are no realtime feeds available right now.

barbeau added the enhancement label Apr 12, 2017

barbeau added this to the v1.0 milestone Apr 12, 2017

barbeau self-assigned this Apr 14, 2017

barbeau mentioned this issue May 11, 2017

Change static GTFS validator #193

Open

This was referenced Sep 27, 2017

Analyze process currently hangs on Netherlands huge GTFS feed CUTR-at-USF/transit-feed-quality-calculator#1

Closed

Add batch command line option to disable shapes.txt-based metadata #284

Closed

barbeau modified the milestones: v1.0, v1.1 Mar 9, 2018

barbeau mentioned this issue Mar 11, 2018

Microoptimalization: Avoid creating new calendars in ServiceDate OneBusAway/onebusaway-gtfs-modules#98

Open

AntoineGrapperon mentioned this issue Sep 26, 2018

Dead end hyperlink for Dutch GTFS #339

Closed

barbeau mentioned this issue Jan 3, 2022

Add tests for big feeds MobilityData/gtfs-realtime-validator#30

Open

isabelle-dr mentioned this issue Jan 7, 2022

Change static GTFS validator MobilityData/gtfs-realtime-validator#43

Open
