
Random Forest: Training/Regression, Classifier/Predicting... #295

Closed
m-mohr opened this issue Oct 26, 2021 · 8 comments · Fixed by #306


m-mohr commented Oct 26, 2021

We need two (or one?) new processes for Random Forest that support classification and regression.

Would training happen outside of openEO for now?

Implementations:

PS: That's a lot of parameters, wow!

-> Related: save_model / load_model with GLMLC metadata: #300

@m-mohr m-mohr added this to the 1.3.0 milestone Oct 26, 2021
@m-mohr m-mohr self-assigned this Oct 26, 2021

jdries commented Oct 27, 2021

We'll need training as well, as the saved model formats may be specific to the implementation used?
I'm also not sure if the process should be limited to random forest; we're also thinking about using catboost, which supports nodata.

@m-mohr m-mohr changed the title Random Forest: Classifier and Regression Random Forest: Training, Classifier and Regression Oct 27, 2021

m-mohr commented Oct 27, 2021

> We'll need training as well, as the saved model formats may be specific to the implementation used?

Ok good. I wasn't sure whether this would be provided through file upload but that's actually not yet a thing in Platform.

> I'm also not sure if the process should be limited to random forest, we're also thinking about using catboost, which supports nodata.

I guess that depends a lot on how the individual processes for training, classification and regression would look afterwards. If you have a lot of parameters, they should probably be separate processes; otherwise you end up in a mess with schemas. If they are just "choose a method and a file" or so, we might be able to merge them into a generic one. Let's see, I still need to do more research, as I don't have a lot of experience with all this, unfortunately...

@m-mohr m-mohr changed the title Random Forest: Training, Classifier and Regression Random Forest: Training, Classifier, Regression, Predict... Oct 27, 2021
@m-mohr m-mohr changed the title Random Forest: Training, Classifier, Regression, Predict... Random Forest: Training/Regression, Classifier/Predicting... Nov 12, 2021
@m-mohr m-mohr modified the milestones: 1.3.0, 1.2.0 Nov 12, 2021
m-mohr added a commit that referenced this issue Nov 18, 2021
@m-mohr m-mohr linked a pull request Nov 18, 2021 that will close this issue
@m-mohr m-mohr modified the milestones: 1.2.0, 1.3.0 Nov 29, 2021
mattia6690 commented:

Recap of today's meeting on the randomForest process:

  • need to flatten the data from a vector cube to a 2-D (table-like) object as RF input
  • specify the dimension(s) of a vector cube that act as predictors for the model
  • two separate processes for training and prediction
  • a new process for sampling might be useful in the future (new Issue already @m-mohr?)
  • processes will be based on vector cubes instead of raster cubes to give the user more flexibility (e.g. import of polygons and lines possible)

For more information, I put the presentation here. This is a kickstarter for the UC8 implementation.


jdries commented Dec 13, 2021

Some feedback based on internal discussion at VITO:

  • the landcover use case will require prediction on raster cubes; training can happen on vector cubes. (We need to produce a map at the end.)
  • for training, we can convert our polygons into a set of points (offline), where we basically sample each polygon with a number of points. That would allow us to use a process like 'aggregate_spatial' for the raster-to-vector conversion, because using points has the effect that the original pixel values are maintained.
  • in our case, the flattening has been taken care of by apply_dimension, but it's fine if another process is defined for that (doing the same thing)
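The offline polygon-to-points conversion described above can be sketched in plain Python. This is a hypothetical helper, not an openEO process: rejection sampling over each polygon's bounding box, with a ray-casting point-in-polygon test.

```python
import random

def point_in_polygon(x, y, poly):
    """Ray-casting test: is (x, y) inside the polygon given as [(x, y), ...]?"""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Count edge crossings of a horizontal ray extending to the right.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def sample_points(poly, n, seed=0):
    """Rejection-sample n random points inside the polygon."""
    rng = random.Random(seed)
    xs = [p[0] for p in poly]
    ys = [p[1] for p in poly]
    minx, maxx, miny, maxy = min(xs), max(xs), min(ys), max(ys)
    pts = []
    while len(pts) < n:
        x, y = rng.uniform(minx, maxx), rng.uniform(miny, maxy)
        if point_in_polygon(x, y, poly):
            pts.append((x, y))
    return pts

# Toy training polygon; every sampled point inherits the polygon's label.
triangle = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
samples = sample_points(triangle, 20)
```

The resulting POINT geometries can then be fed to aggregate_spatial, which for points returns the raw pixel values rather than an aggregate.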

This was referenced Dec 13, 2021
@m-mohr m-mohr added the ML label Dec 13, 2021

m-mohr commented Dec 13, 2021

> New process for sampling might be useful in the future (new Issue already @m-mohr?)

Yes, quickly opened one here: #313


edzer commented Dec 13, 2021

Thanks, helpful! Here is a sketch of the process(es), as I see them, high-level (for pixel-wise ML methods, such as RF). Following ML terminology, I use labels for the response (e.g. crop type; either a class variable or a continuous variable) and features for the predictors (e.g. the bands, or bands x time, based on which an RF predicts a class given a model).

As @mattia6690 notes, there are two separate steps: A train model, B predict on new features

A train model

  • input: "locations" (points, pixels) with:
    • labels
    • features
  • input: hyper-parameters
  • output: "model"

See below for how we get to these input data, e.g. from polygons

B Predict (classify, regress)

  • input: data: raster data cube with features as a dimension
  • input: dimension: feature dimension name
  • input: context: "model"
  • two options:
    • B1: we only predict a class, or a scalar
      • input: reducer: needs to be defined: takes the model, returns the class
      • output: data cube with labels (class, or cont. variable)
      • this is (a special case for) reduce_dimension
    • B2: we want probabilities for each class (a standard option for any classifier)
      • input: process: needs to be defined: takes the model, returns the class probabilities
      • output data cube with probabilities per class (class is dimension, probability the attribute)
      • this is (a special case for) apply_dimension with target_dimension = "class"
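The B1/B2 split mirrors how common ML libraries expose prediction. A sketch with scikit-learn, used purely for illustration (the actual openEO processes were still to be defined): predict returns one label per location (B1), while predict_proba returns one probability per class (B2).

```python
from sklearn.ensemble import RandomForestClassifier

# Toy feature table: rows are locations (points/pixels),
# columns are the features (e.g. two bands).
X = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
y = ["water", "water", "forest", "forest"]  # labels (the response)

model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# B1: one label per location -> the feature dimension collapses,
# analogous to reduce_dimension.
labels = model.predict([[0.15, 0.85]])

# B2: one probability per class -> a new "class" dimension appears,
# analogous to apply_dimension with target_dimension = "class".
probs = model.predict_proba([[0.15, 0.85]])  # shape (1, n_classes)
```

In both cases the trained model is opaque state passed alongside the data, which is why it fits naturally into the context parameter of the existing cube processes.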

data for A: train model

Typical steps needed before we can train the model (A3) are:

  • case A1: training data consist of polygons and their class values, where polygons are uniform in their class value. This needs a method to either:

    • A1.1 sample points within the polygons, given some sampling strategy (random? regular?) and sample size
    • A1.2 given a raster data cube, find all the pixel centers within the polygons: may require a new process ("extract?" - could be combined with A2):
      • input: polygon geometries
      • input: raster data cube
      • output: POINT geometries of pixel centers inside the polygon, with associated polygon ID
  • output of A1: Point locations + labels -> go to case A2

  • case A2: extract features at the training point locations: we think this should happen with aggregate_spatial when called with POINT geometries (although no aggregation takes place):

    • input: data: raster data cube with features
    • input: geometries: POINT locations + labels
    • input: reducer: array_element with index 0
    • output: point locations + features at these points -> go to case A3
  • case A3: train model:

    • input: point locations + features (output of A2)
    • input: hyper-parameters
    • output: model

Note that the combination of steps A1.2 + A2 (for a set of polygons and a raster (cube), return the raster pixel centers and all the associated pixel values) is a very common operation; in R it is usually called extract.
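The extract-style operation (A1.2 + A2) can be sketched in plain Python/NumPy; the function name and signature here are hypothetical, chosen only to show the pixel-center logic.

```python
import numpy as np

def extract(values, x0, y0, dx, dy, contains):
    """Return (x, y, value) for every pixel whose center lies inside a
    polygon, where contains(x, y) is the polygon membership test.

    values is a 2-D array; pixel (row, col) has its center at
    (x0 + (col + 0.5) * dx, y0 + (row + 0.5) * dy).
    """
    out = []
    rows, cols = values.shape
    for r in range(rows):
        for c in range(cols):
            x = x0 + (c + 0.5) * dx
            y = y0 + (r + 0.5) * dy
            if contains(x, y):
                out.append((x, y, values[r, c]))
    return out

band = np.arange(16.0).reshape(4, 4)   # toy 4x4 single-band raster
inside = lambda x, y: x < 2 and y < 2  # stand-in "polygon": lower-left square
points = extract(band, x0=0, y0=0, dx=1, dy=1, contains=inside)
```

In a real implementation the containment test would come from the polygon geometries and each returned point would carry the polygon ID, so the output feeds directly into the A2/A3 training steps.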


jdries commented Dec 14, 2021

Nice overview!
For A1.1 we will first write a script that does this client-side, where we have all the flexibility to do it in whatever way we like, but I'm not opposed to also defining it as a process; the same goes for A1.2, which seems even simpler.

For prediction (B1/B2), instead of having special cases of apply/reduce dimension, could a prediction process also simply be a callback? Wouldn't that integrate better in the whole processes framework?


m-mohr commented Dec 14, 2021

> For prediction (B1/B2), instead of having special cases of apply/reduce dimension, could a prediction process also simply be a callback? Wouldn't that integrate better in the whole processes framework?

Yes, that's actually what we discussed yesterday, but Edzer did not mention it explicitly. So to visualize it with a bit of JS-like pseudo-code for B1:

p = new ProcessBuilder()
cube = p.load_collection('S2')
model = p.load_ml_model('my_model_job')
reducer = function(data, context) {
  return this.predict_rf(data = data, model = context)
}
x = p.reduce_dimension(data = cube, reducer = reducer, dimension = 'bands', context = model)
...

Not fully fleshed out yet, but to give an idea...
