Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tests for the new KML validator #71

Open
GretaCB opened this issue Dec 12, 2016 · 10 comments
Open

Add tests for the new KML validator #71

GretaCB opened this issue Dec 12, 2016 · 10 comments

Comments

@GretaCB
Copy link
Contributor

GretaCB commented Dec 12, 2016

@who8mycakes Awesome job with debugging here, and a successful #FridayDeploy :) Here are some of my thoughts going forward.

Tests

Now that we have a new kml validation script, we'll need tests for the new script.
At bare minimum, some the tests I'm seeing needed would recreate the following error cases:

  • x layers found. Maximum of y layers allowed.
  • No duplicate layers allowed.
  • A success case
  • An invalid/corrupt file that fails gdal.open()

Remove redundant code

Also, looks like some of the same logic is within two other places:

We should remove the excess layer-validation logic to avoid future confusion about where exactly this takes place in tileset-unpacker. Let's be kind to our future selves :)

Optimize use of dependencies

Also, curious if gdal is needed in order to find layer count. Is it possible to use mapnik-omnivore since it's already a dependency? This would help avoid adding a pretty substantial dependency (gdal). Not sure if this was part of your convo with @mapsam last week.

cc @mapbox/unpacker

@mapsam
Copy link
Contributor

mapsam commented Dec 12, 2016

Hey @GretaCB thanks for the thoughts! Lots of this came up during our push on Friday with @springmeyer as well.

I believe the usage of Mapnik omnivore is where we were hitting the KML drivers - would be interesting to see if there's a light way to use it. Worth noting @BergWerkGIS has a PR open tha introduces GDAlL as well.

Let's remove redundant code!

@springmeyer
Copy link

Also, looks like some of the same logic is within two other places:
preprocessorcer
lib/validators/omnivore

@GretaCB I agree. That is check in lib/validators/omnivore is now duplicated. I created #69 as another approach. The idea there is to avoid a direct dependency on gdal and keep using mapnik-omnivore by having the existing check in lib/validators/omnivore run first. /cc @mapsam @who8mycakes

@who8mycakes
Copy link
Contributor

who8mycakes commented Dec 14, 2016

@springmeyer Thanks for digging into this. I'm not as deeply familiar with the dependency trees. Does avoiding direct dependency on gdal, preclude/nullify an update allowing patch updates to gdal?

@who8mycakes
Copy link
Contributor

Pausing here while we consider a different approach to KML validation: #69.

@springmeyer
Copy link

springmeyer commented Dec 15, 2016

Below I summarize my thoughts. The current solution in #68 feels not ideal to be. Below I've rehashed the problem and put forth a new solution as a straw dog that I hope @who8mycakes @GretaCB and @mapsam can help tune.

  • The top priority KML failures on large, > 15 layer files, and not "failing fast" has been solved 🎉 (thank you @mapsam and @who8mycakes)
  • Now we need to think about test coverage and maintainability (avoiding code duplication) as @GretaCB rightly points out (because Refine KML validation #68 landed without a regression test for the kml and introduced a new validator that is not specifically covered by tests).
  • But in looking at the code I see deeper issues (that have existing long before Refine KML validation #68) that feel to me worth investigating deeper. So, let's review and dive deeper....
  • The key idea behind the fix in Refine KML validation #68 is to open and read the # of layers with GDAL once and up-front before trying to open the file with the tilelive* modules.
  • The tilelive* modules are invoked currently by mapbox-upload-validate when validate.info and validate.source are called during validation
  • Currently validate.info and validate.source are called here before the validation at lib/validators/omnivore.js
  • Why can GDAL can open problematic KML fine while the tilelive* modules in question cannot cope and become very slow/meltdown?
    • In this case tilelive-omnivore is called internally here
    • The design of tilelive-omnivore is to loop over every layer and generate a Mapnik XML layer for it
    • Then tilelive-omnivore dishes this off to tilelive-bridge.
    • Then tilelive-bridge tries to open the Mapnik XML, which contains a Mapnik layer per KML layer.
    • Then Mapnik opens each layer by asking GDAL to open it (this invokes opening a file handle for the KML file on disk).
    • There is no caching so the same KML file is opened N number of times per N layers.
    • For large KML the entire KML needs to be parsed when opened based on the way the KML format and GDAL parsing works (no way to lazily decode just a single layer)
    • So the end result is that tilelive.info will just hang.... as the same file is opened N times.

Give the above my sense is that we should revisit #68 and apply a fix that addresses the core problem more directly. I feel like the core problem is that the inefficiency of KML compounds given the way that tilelive-omnivore works. So a way to attack the problem more deeply would be to restrict the number of layers tilelive-omnivore generates. Having KML validation in yet another place might feel odd but adding it to tilelive-omnivore would prevent the core problem that caused the hang. But it feels odd to impose limits on tilelive-omnivore that are specific to mapbox.com limits. So to avoid code duplication and keep concerns separate perhaps we could find a way to pass an option down to tilelive-omnivore? Here is one potential way:

Change mapbox-upload-validate like:

diff --git a/lib/validate.js b/lib/validate.js
index 4e55d55..01c6341 100644
--- a/lib/validate.js
+++ b/lib/validate.js
@@ -9,6 +9,9 @@ var TileJSON = require('tilejson');
 var Serialtiles = require('../lib/tilelive-serialtiles');
 var Mapbox = require('../lib/tilelive-mapbox');
 var invalid = require('./invalid');
+var uploadLimits = require('mapbox-upload-limits');
+var qs = require('querystring');
+var url = require('url');
 
 Vector.registerProtocols(tilelive);
 MBTiles.registerProtocols(tilelive);
@@ -36,7 +39,11 @@ module.exports.info = function validateInfo(uri, limits, callback) {
     limits = { max_metadata: 60 * 1024 };
   }
 
-  tilelive.info(uri, function(err, info) {
+  var opts = url.parse(uri, true);
+  opts.pathname = qs.unescape(opts.pathname);
+  opts.query.layerLimit = uploadLimits.kml.layers;
+
+  tilelive.info(opts, function(err, info) {
     if (err) return callback(invalid(err));
     if (info.prepare) return callback(invalid('Source cannot contain prepare key'));
 

Change tilelive-omnivore like:

diff --git a/index.js b/index.js
index a0b5c8c..0bf74f5 100644
--- a/index.js
+++ b/index.js
@@ -14,6 +14,7 @@ function Omnivore(uri, callback) {
   uri = url.parse(uri, true);
   var files = uri.pathname.split(',');
   var layerName = uri.query.layerName ? decodeURI(uri.query.layerName) : null;
+  var layerLimit = uri.query.layerLimit ? +decodeURI(uri.query.layerLimit) : null;
   var omnivore = this;
 
   var q = queue();
@@ -26,6 +27,9 @@ function Omnivore(uri, callback) {
         if (metadata.dstype === 'gdal' && metadata.raster.bands[0].rasterDatatype !== 'Byte') {
           return next('Only 8 bit TIFFs are supported');
         }
+        if (metadata.layers.length > layerLimit) {
+          return next(metadata.layers.length+ ' layers found. Maximum of ' + layerLimit + ' layers allowed.');
+        }
         metadata.filepath = filepath;
         next(null, metadata);
       });

@mapsam
Copy link
Contributor

mapsam commented Dec 15, 2016

@springmeyer thanks for clearly writing down the root cause here. Super helpful explanation and I like the direction of the solution. Happy to help update tilelive-omnivore if need be.

@GretaCB
Copy link
Contributor Author

GretaCB commented Dec 19, 2016

Awesome @springmeyer , thanks for this runthrough. Finally able to read it after a crazy week. I like the way you're thinking here:

This PR starts refactoring the code so that validation does not try to be generic for all filetypes. Rather we call into the filetype specific validator and inside each we can optimize the order and necessary calls.

Typically, I aim to genericize/abstract code out as much as possible, but in this case, there's a reason for implementing the opposite. Love the emphasis on optimize 👍 and definitely agree that having source-type-specific validation all the way at the end of the process (and after calling tilelive*) is not the optimal approach. Good eye.

Per chat:

The place I'd love your gut check @GretaCB is whether passing an object (rather than bare ⁠⁠⁠⁠uri⁠⁠⁠⁠ string) to tilelive.info and an option in the ⁠⁠⁠⁠query⁠⁠⁠⁠ to tilelive-omnivore is okay, or too non-standard and oddball.

Option passed to tilelive-omnivore

Even though this layer limit is mapbox-specific, it is still based on what we've seen Mapnik can handle. So seems relevant to include this option for other devs to use and know about.

Object passed to tilelive.info

Correct me if I'm wrong, but this could potentially be a breaking change in tilelive.js cc @rclark
Feels like we should refactor our validation wrapper around tilelive.info instead of changing tilelive (which might have some ripple effect outside of unpacker).

cc @who8mycakes @mapsam

@springmeyer
Copy link

Thanks for the review @GretaCB.

Correct me if I'm wrong, but this could potentially be a breaking change in tilelive.js cc @rclark

I don't think it is breaking: tilelive.info is https://github.com/mapbox/tilelive/blob/master/lib/tilelive.js#L130 and I see it calls tilelive.load and load() is smart to parse if a string and if not it accepts uri as if it has already been parsed into an object: https://github.com/mapbox/tilelive/blob/master/lib/tilelive.js#L77-L83.

Feels like we should refactor our validation wrapper around tilelive.info instead of changing tilelive (which might have some ripple effect outside of unpacker).

👍 Correct, that is what I'm proposing I think in the second diff at #71 (comment).

@who8mycakes
Copy link
Contributor

In the meantime, until we implement @springmeyer's approach to kml validation, tests to cover the kml validation currently in production have been added: #72.

A suggested above, the tests cover:

  • x layers found. Maximum of y layers allowed.
  • No duplicate layers allowed.
  • A success case
  • An invalid/corrupt file that fails gdal.open()

@GretaCB
Copy link
Contributor Author

GretaCB commented Jan 1, 2017

Awesome, thanks @who8mycakes . Next actions will be to close this ticket out since it relates to KML tests, and open a new ticket with a plan forward for optimizing mapbox-upload-validate and removing redundant code (per comments above).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants