
Processes for Random Forest #306

Merged
merged 7 commits into from Mar 9, 2022

Conversation

m-mohr
Member

@m-mohr m-mohr commented Nov 18, 2021

The new RF processes, in a draft state. Descriptions still need to be improved a bit and we may want to consider merging them with #304

@m-mohr m-mohr added this to the 1.2.0 milestone Nov 18, 2021
@m-mohr m-mohr self-assigned this Nov 18, 2021
@m-mohr m-mohr linked an issue Nov 18, 2021 that may be closed by this pull request
@clausmichele
Member

Could we create a copy of fit_regr_random_forest.json for classification? Like fit_class_random_forest.json?

@m-mohr
Member Author

m-mohr commented Nov 19, 2021

@clausmichele Sure, go ahead :-)

@clausmichele
Member

Ok, I'll add them later.

@m-mohr
Member Author

m-mohr commented Nov 19, 2021

The regression and classification processes look very similar. Would it make sense to merge them into a fit_random_forest process where you can have a parameter to decide whether you want a regression or classification model? Or does that not make so much sense?

@mattia6690

The regression and classification processes look very similar. Would it make sense to merge them into a fit_random_forest process where you can have a parameter to decide whether you want a regression or classification model? Or does that not make so much sense?

Yeah, they are quite similar but they work differently, especially regarding the split criterion. I think we could make both, but when programming them, e.g. in Python, you'd always need two separate functions since they are based on different input parameters. Here is some info about Classification and Regression.

If you think we have time and/or need a classifier in the future, it could be worthwhile having two functions for the Random Forest. I'd give the regression implementation a higher priority though.

@m-mohr
Member Author

m-mohr commented Nov 22, 2021

In the sklearn example the only difference seems to be the criterion, right?

By the way, what are the "Attributes" in sklearn? Are those some kind of different parameters, or is that what is returned?

@mattia6690

In the sklearn example the only difference seems to be the criterion, right?

The algorithm behind the classifier and the regression is the same afaik. The split criterion differentiation is very important though. While in a classification you work on discrete classes (e.g. forest / non-forest / impervious / agricultural field), the regression works with numerical values. The classifier therefore ultimately leads to one of the input classes being chosen and assigned to a pixel. In UC8 the input and output will be numerical, and therefore the regression is needed.

Nevertheless, we could include a criterion that does a classification, but bear in mind that we would need two separate functions to be called (and I don't know whether this is widely implemented outside of R and Python), and I think also some additional errors to be thrown if the input data does not correspond to what is needed.
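For illustration, a minimal scikit-learn sketch (hypothetical data, not the openEO process definition): the two estimators share the same fit/predict API and differ mainly in the split criterion.

import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

X = np.random.rand(100, 4)               # predictors: samples x features
y_regr = np.random.rand(100)             # continuous target, e.g. fractional canopy cover
y_class = np.random.randint(0, 3, 100)   # discrete classes, e.g. forest / non-forest / impervious

# The regressor splits on variance reduction (MSE), the classifier on Gini/entropy.
regr = RandomForestRegressor(n_estimators=100).fit(X, y_regr)
clf = RandomForestClassifier(n_estimators=100, criterion="gini").fit(X, y_class)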

By the way, what are the "Attributes" in sklearn? Are those some kind of different parameters, or is that what is returned?

@clausmichele correct me if I am wrong, but I think the attributes are the final output of the random forest regression/classification. These attributes are usually the results attached to the "model object" returned in Python.

@clausmichele
Member

Yes, the attributes are all the information about the model that we have created. Maybe they could be exposed as metadata with the stored model or in the logs. However, I don't think this has high priority now.
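For reference, a small sketch (assuming scikit-learn) of what these "attributes" look like: they are the trailing-underscore members set on the estimator during fitting, as opposed to the constructor parameters.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=50).fit(np.random.rand(50, 3), np.random.rand(50))

print(model.feature_importances_)  # attribute: importance of each predictor
print(model.n_features_in_)        # attribute: number of predictors seen during fit
print(len(model.estimators_))      # attribute: the individual fitted trees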

@clausmichele
Member

I'm trying to figure out which data to pass to fit_regr_random_forest:

            "name": "data",
            "description": "The input data for the regression model. The raster images that will be used as predictors for the Random Forest. Aggregated to the features (vectors) of the target input variable.",
            "schema": {
                "type": "object",
                "subtype": "raster-cube"
            }
        },

This could be the result of aggregate_spatial using a GeoJSON with the training areas as geometries in a FeatureCollection, right?
For example, we could compute the mean over each band for each training polygon.


        {
            "name": "target",
            "description": "The input data for the regression model. This will be vector cubes for each training site. This is associated with the target variable for the Random Forest Model. The Geometry has to associated with a value to predict (e.g. fractional forest canopy cover).",
            "schema": {
                "type": "object",
                "subtype": "vector-cube"
            }
        },

This is also the result of aggregate_spatial right?
But instead, we can now apply a reducer like the NDVI. This means that the regressor will predict the NDVI based on the input bands.

Maybe I misunderstood some stuff, please help me :)

@mattia6690

            "name": "data",
            "description": "The input data for the regression model. The raster images that will be used as predictors for the Random Forest. Aggregated to the features (vectors) of the target input variable.",
            "schema": {
                "type": "object",
                "subtype": "raster-cube"
            }
        },

This could be the result of aggregate_spatial using a GeoJSON with the training areas as geometries in a FeatureCollection, right? For example, we could compute the mean over each band for each training polygon.

Here the information about the target variable (in UC8 the fractional canopy cover) is needed for a point that could represent the center of the pixel resulting from upscaling the VHR data to medium (20 m) resolution. I think a GeoJSON with coordinates and FCC values is enough to later extract the predictors (see below).


        {
            "name": "target",
            "description": "The input data for the regression model. This will be vector cubes for each training site. This is associated with the target variable for the Random Forest Model. The Geometry has to associated with a value to predict (e.g. fractional forest canopy cover).",
            "schema": {
                "type": "object",
                "subtype": "vector-cube"
            }
        },

This is also the result of aggregate_spatial right? But instead, we can now apply a reducer like the NDVI. This means that the regressor will predict the NDVI based on the input bands.

I am not sure what you mean by reducer in this case. The important part is that the target points are associated with a value from each of the predictor rasters. We also thought that aggregate_spatial should be the best openEO function to do so.

In summary, this means that for ONE pixel/point X associated with ONE target value t, ONE value for EACH of the predictors predictor_1...predictor_x will be extracted. This feature space (t, predictor_1, predictor_2, ..., predictor_x) is the input for the Random Forest model.

I hope this makes it somewhat clearer? Otherwise let me know and I will try to explain it in a bit more detail.
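A minimal sketch of that feature space in scikit-learn (hypothetical values; the openEO process would receive the equivalent vector cubes):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# One row per point X: the extracted predictor values, plus the target t for that point.
t = np.array([0.72, 0.15, 0.40])             # e.g. fractional canopy cover per point
predictors = np.array([[1007.2, 1344.9],     # predictor_1, predictor_2 for point 1
                       [1056.1, 1405.0],     # ... for point 2
                       [ 458.9, 1487.7]])    # ... for point 3

model = RandomForestRegressor().fit(predictors, t)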

@jdries
Contributor

jdries commented Nov 24, 2021

For our Spark-based implementation,
num_trees would be: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/ml/regression/RandomForestRegressor.html#numTrees:org.apache.spark.ml.param.IntParam
mtry is less clear, perhaps: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/ml/regression/RandomForestRegressor.html#maxBins:org.apache.spark.ml.param.IntParam

Looking at CatBoost: https://catboost.ai/en/docs/concepts/python-reference_catboostregressor_fit
API-wise, this seems similar to random forest and doesn't require lots of custom parameters, maybe even fewer.

@mattia6690

mtry is less clear, perhaps: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/ml/regression/RandomForestRegressor.html#maxBins:org.apache.spark.ml.param.IntParam

I think that it corresponds to featureSubsetStrategy. In this link it is described as follows:

featureSubsetStrategy: Number of features to use as candidates for splitting at each tree node. The number is specified as a fraction or function of the total number of features. Decreasing this number will speed up training, but can sometimes impact performance if too low.
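As a hedged sketch, this is how the proposed parameters could map onto Spark ML via PySpark (assuming a Spark session and a DataFrame df with "features" and "label" columns; not a definitive implementation):

from pyspark.sql import SparkSession
from pyspark.ml.regression import RandomForestRegressor

spark = SparkSession.builder.getOrCreate()

rf = RandomForestRegressor(
    featuresCol="features",
    labelCol="label",
    numTrees=100,                  # openEO num_trees
    featureSubsetStrategy="sqrt",  # closest analogue of mtry, as noted above
)
# model = rf.fit(df)  # df is a hypothetical prepared DataFrame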

@clausmichele
Member

clausmichele commented Nov 25, 2021

@m-mohr why can't I connect the processes in this way?
[image: screenshot of the process graph]

Ok, I see: we defined "data" as a raster-cube but, in my opinion, after yesterday's discussion, it should also be a vector-cube.

@m-mohr
Member Author

m-mohr commented Nov 25, 2021

Yes, that's probably the case. We have not updated it since yesterday.
The main author of this PR is @mattia6690, by the way.

@clausmichele
Member

So, I'm going on with the development/testing.
I've implemented aggregate_spatial and created a sample process graph with random forest regression:
[image: sample process graph with aggregate_spatial and random forest regression]

However, now I would like to understand whether we allow the time dimension in the input vector cube of the random forest. In my opinion we should, since some use cases use pixel time series for classification purposes.

For example, the outputs of the two aggregate_spatial calls (108 polygons) in the previous graph look like this:

data

<xarray.DataArray 'stack-6abbaf77691bc61829d32510ea803f85' (result: 108, variable: 2, time: 5)>
array([[[1007.2267 , 1167.0667 , ..., 1281.4266 , 1962.5333 ],
        [1344.9734 , 1564.9333 , ..., 1660.52   , 3741.8533 ]],

       [[1056.1111 , 1088.9584 , ..., 1234.9305 , 1950.0139 ],
        [1405.0695 , 1409.3334 , ..., 1555.6945 , 2555.8472 ]],

       ...,

       [[ 458.89847,  563.08124, ...,  455.1066 ,  807.8528 ],
        [1487.7208 , 1728.0863 , ..., 2088.7158 , 3601.0964 ]],

       [[ 453.85074,  523.58704, ...,  435.3781 ,  737.34326],
        [1280.7164 , 1561.0547 , ..., 2105.6519 , 3810.005  ]]], dtype=float32)
Coordinates:
  * time         (time) datetime64[ns] 2018-04-02T10:24:35.002000 ... 2018-04...
  * variable     (variable) object 'B04_10m' 'B08_10m'
    spatial_ref  int64 ...
  * result       (result) int64 0 1 2 3 4 5 6 7 ... 101 102 103 104 105 106 107

target

<xarray.DataArray 'stack-065fdb173caa5bbd5deb47558637f172' (result: 108, variable: 1, time: 5)>
array([[[0.14536 , 0.148576, ..., 0.130495, 0.312038]],

       [[0.138939, 0.129487, ..., 0.115318, 0.141402]],

       ...,

       [[0.531848, 0.512823, ..., 0.643144, 0.630455]],

       [[0.483672, 0.502303, ..., 0.649596, 0.673399]]], dtype=float32)
Coordinates:
  * time         (time) datetime64[ns] 2018-04-02T10:24:35.002000 ... 2018-04...
  * variable     (variable) object 'NDVI'
    spatial_ref  int64 ...
  * result       (result) int64 0 1 2 3 4 5 6 7 ... 101 102 103 104 105 106 107

In this case I'm using the NDVI values as target and the B04 and B08 as source data for a simple regression.
The time dimension is still there, but it could be removed using a reduce_dimension (but we would lose the temporal information).

@clausmichele
Member

I think this could be solved directly in the process, transforming the data into feature vectors depending on the input data, like this (variables are bands):

Input dimensions: (result, variable, time)
Transformed into feature vector:

[variable1[time0],variable1[time1],...,variable1[timeN]]
[variable2[time0],variable2[time1],...,variable2[timeN]]
                               .
                               .
                               .
[variableN[time0],variableN[time1],...,variableN[timeN]]

where the number of rows equals the length of the result dimension

Input dimensions: (result, variable)
Transformed into feature vector:

[variable1,variable2,...,variableN]
[variable1,variable2,...,variableN]
                               .
                               .
                               .
[variable1,variable2,...,variableN]
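A small xarray sketch of that transformation (dummy data, using the same dimension names as the aggregate_spatial output above):

import numpy as np
import xarray as xr

da = xr.DataArray(np.random.rand(108, 2, 5), dims=("result", "variable", "time"))

# Flatten (variable, time) into a single feature dimension:
# one feature vector (row) per entry of the result dimension.
features = da.stack(feature=("variable", "time")).transpose("result", "feature")
print(features.shape)  # (108, 10)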

@mattia6690

So, I'm going on with the development/testing. I've implemented aggregate_spatial and created a sample process graph with random forest regression:

Thank you Michele, that looks great already. This is a very interesting first structure for the process graph!

However, now I would like to understand whether we allow the time dimension in the input vector cube of the random forest. In my opinion we should, since some use cases use pixel time series for classification purposes.

The temporal dimension is difficult to deal with, both in regression and classification. I could think of having the "time" or "time interval" as a predictor in the model. E.g., in the case of the forest, time itself is not that important, but an aggregation to 2 (summer, winter) or 4 (plus spring and autumn) seasons is extremely useful for a precise prediction. Therefore we'd need to associate a target value with the season. This season would then be an additional predictor in the model.
Unfortunately this might be a bit difficult, I assume?

@clausmichele
Member

Aggregating over the seasons would already be possible, and we could just rename the output bands with the period attached, something like:
B04_DJF, B04_MAM, B04_JJA, B04_SON
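For illustration, xarray's season grouping already yields exactly those DJF/MAM/JJA/SON groups (dummy data; not the openEO implementation):

import numpy as np
import pandas as pd
import xarray as xr

da = xr.DataArray(
    np.random.rand(365, 2),
    dims=("time", "variable"),
    coords={"time": pd.date_range("2018-01-01", periods=365),
            "variable": ["B04", "B08"]},
)

seasonal = da.groupby("time.season").mean("time")
print(seasonal.season.values)  # the four season groups, e.g. ['DJF' 'JJA' 'MAM' 'SON']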

@clausmichele
Member

  • the structure of fit_class_random_forest and fit_regr_random_forest is currently fine.

But we need to switch the "raster-cubes" in the first parameter to vector cubes, right?

Yes indeed

@m-mohr m-mohr linked an issue Dec 21, 2021 that may be closed by this pull request
@JeroenVerstraelen

For the target vector cube of fit_class_random_forest, should the user be responsible for integer-encoding the labels ([0, 1, 2, ...]), or can they provide labels of any type, e.g. ["Wheat", "Barley", ...]? For the latter, the back-end would be responsible for keeping track of the encoding in the model to provide the correct results when the predict_random_forest process is called.

@clausmichele
Member

Good question. In my opinion, handling integers would be enough; we can't generate a raster containing strings as output anyway. So the mapping can easily be done on the user side.
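For example, the user-side mapping could be as simple as this sketch (hypothetical labels, assuming scikit-learn is available):

from sklearn.preprocessing import LabelEncoder

labels = ["Wheat", "Barley", "Wheat", "Maize"]  # hypothetical class labels
encoder = LabelEncoder()
y = encoder.fit_transform(labels)               # array([2, 0, 2, 1])
# encoder.inverse_transform(...) restores the original strings after prediction.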

@m-mohr
Member Author

m-mohr commented Feb 9, 2022

From a user POV, I think the back-end should try to handle this internally without user intervention. Not sure how feasible this is though.

@mattia6690

mattia6690 commented Feb 10, 2022

For the target vector cube of fit_class_random_forest, should the user be responsible for integer-encoding the labels ([0, 1, 2, ...]), or can they provide labels of any type, e.g. ["Wheat", "Barley", ...]? For the latter, the back-end would be responsible for keeping track of the encoding in the model to provide the correct results when the predict_random_forest process is called.

This is actually a very good point, @JeroenVerstraelen. I think it would be very beneficial to use just a numeric representation, since the RF regression is based on continuous numerical variables only; it cannot deal with characters/strings. Using just numbers, we would meet the requirements of both approaches.
However, I think an integer representation alone is not ideal, since the RF regression often has to deal with floating-point values and decimals. Otherwise, predictor-dependent scaling factors would need to be introduced, degrading the user experience and increasing the need for preprocessing.

EDIT: I missed that this comment only relates to the classification. In this case I agree with @m-mohr that the user would benefit from having both options, if it is feasible in the implementation.

@jdries
Contributor

jdries commented Feb 23, 2022

What's still a bit unclear perhaps: how is the prediction vector cube to be joined with the target vector cube, by spatial join, ordering along a given dimension, or some property?
Is it also an option to work with one vector cube, and specify the columns along a dimension that are to be used as prediction and target variables?

Note, in line with our discussion earlier today, this is a case where you don't really need a vector-cube but could use a more generic 'datacube' instead. I believe the most important requirement is that the cubes are 2-dimensional: samples x predictors and samples x target?

@clausmichele
Member

What's still a bit unclear perhaps: how is the prediction vector cube to be joined with the target vector cube, by spatial join, ordering along a given dimension, or some property?

If the vector-cube contains the geometry property we could use that one for matching and ordering the values. Currently I suppose they are already ordered in the same way.

Is it also an option to work with one vector cube, and specify the columns along a dimension that are to be used as prediction and target variables?

We have seen that it's difficult to define the behavior with vector-cubes without an API definition; maybe we can keep it like this until we define them.

Note, in line with our discussion earlier today, this is a case where you don't really need a vector-cube but could use a more generic 'datacube' instead. I believe the most important requirement is that the cubes are 2-dimensional: samples x predictors and samples x target?

I agree about the input definition: if we don't let the user select particular dimensions or properties of the vector-cube and we keep data and target separated, this could indeed be a more general datacube (like a raster-cube reduced to 2 dimensions using reduce_spatial and/or other reducers).

@m-mohr m-mohr added the "help wanted" label Feb 24, 2022
@clausmichele
Member

I report a comment from Lukas:

training here is a percentage, I think it should be required to be between 0-1 to stick to the conventions of the ML community (see https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). Allowing this to be an integer could suggest that this is the absolute number of samples in the test set. Not really in the scope of this PR, because that is defined in the function definition for UC8, but I think this wants to be changed, let's discuss whom to ping for this!

Should we change from percentage to float values between 0 and 1 for the train/test values?
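For reference, the scikit-learn convention Lukas refers to (a sketch with dummy data):

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(100, 4), np.random.rand(100)
# train_size/test_size are fractions between 0 and 1; an integer would instead
# mean an absolute number of samples, which is exactly the ambiguity noted above.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)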

@m-mohr
Member Author

m-mohr commented Mar 8, 2022

Is it just scikit-learn that does this, or do we have more examples? If someone is talking about the "community", I'd expect at least two distinct examples (e.g. in R or so). But in principle I think we can go with 0-1, too.

@clausmichele
Member

@LukeWeidenwalker

@clausmichele
Member

clausmichele commented Mar 9, 2022

@soxofaan @m-mohr @mattia6690 @jdries
should we add a random seed parameter to the random forest fitting processes? It would be good for reproducibility.
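For comparison, both common implementations already expose such a parameter (a sketch, not the openEO process definition):

from sklearn.ensemble import RandomForestRegressor

# scikit-learn: random_state fixes the bootstrap sampling and feature subsetting
rf = RandomForestRegressor(n_estimators=100, random_state=42)
# Spark ML exposes the equivalent as `seed`, e.g. RandomForestRegressor(seed=42).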

@LukeWeidenwalker
Contributor

Ah, didn't get the notification for being tagged here, sorry for the late response!
@m-mohr fair point, I've done a quick Google search to back this claim up:

A counterexample I've come across is TensorFlow, which uses strings to specify this: https://www.tensorflow.org/datasets/splits
ds = tfds.load('my_dataset', split='train[:75%]'), but even there the percentage sign is given explicitly and it's clear that it's not about the number of data points.

I'm less familiar with the ecosystem in R, but afaics caret's createDataPartition is widely used to produce train/test splits, and also uses 0-1 to represent the percentage: https://topepo.github.io/caret/data-splitting.html
trainIndex <- createDataPartition(iris$Species, p = .8, list = FALSE, times = 1)

@m-mohr
Member Author

m-mohr commented Mar 9, 2022

@clausmichele If this is commonly supported in libraries, I'd say yes. (Edit: Added, please review)

@LukeWeidenwalker Thank you. That is far more evidence than I had hoped for. So I'll update this to be 0-1. (Edit: Changed, please review)

What other open points do we have for these processes? It's hard to follow all the discussion here, so I'd like to collect the open issues. As far as I can see, there are still some uncertainties regarding the definition of the vector cubes and how that influences certain aspects of the implementation: #306 (comment). I couldn't find any other open points, but I could have missed some.

}
},
{
"name": "mtry",
Member Author

@m-mohr m-mohr Mar 9, 2022

GeoPySpark -> max_bins
scikit-learn -> max_features
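For comparison, a brief scikit-learn sketch: max_features is the parameter that plays the role of mtry there, limiting how many predictors are considered at each split.

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(max_features="sqrt")  # consider sqrt(n_features) per split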

@m-mohr m-mohr merged commit e9bbfa1 into draft Mar 9, 2022
@m-mohr m-mohr deleted the issue-295 branch March 9, 2022 17:35
@clausmichele
Member

clausmichele commented Mar 14, 2022

I don't know if this is the right place to post this, @m-mohr.

Anyway, after a discussion with @ValentinaHutter @LukeWeidenwalker and @mattia6690 we concluded that:

  1. The fit_*_random_forest processes have vector-cubes as input data types.
  2. The process should allow the user to select which property (or column, if we think of a general database) of the vector-cube to use to get the required data (@jdries already mentioned this in the last dev meeting; see the sketch after this list).
  3. This is required if the target vector-cube is not the output of aggregate_spatial but comes from a file (which is the case for UC8). Sorry if this didn't come up earlier, but I wasn't fully aware of how the target vector-cube should look.

We will proceed with implementing a draft version of the process supporting this via an additional property in the parameters.
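A hypothetical sketch of points 2 and 3 (file name and property names are placeholders):

import geopandas as gpd

gdf = gpd.read_file("training_sites.geojson")  # hypothetical target vector file
y = gdf["canopy_cover"]                        # property selected via the new parameter
X = gdf[["B04_mean", "B08_mean"]]              # predictor properties, if present in the file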

@m-mohr
Member Author

m-mohr commented Mar 14, 2022

@clausmichele Please open a new issue or PR :-)
