Different dataset for each client #1156

FabioNotaro2001 · 2022-11-17T10:54:52Z

Hi everybody and thanks for the support.

I ran a simple real-world NVFlare example. In particular, I ran hello-pt-tb on a simple setup consisting of a server, an overseer, an admin, and two clients.

The example went well, but I was wondering whether it is possible (not necessarily in hello-pt-tb, maybe in other examples) to run a training job in which the two clients use different datasets (for example, client1 uses CIFAR10 and client2 uses CIFAR100).

Is that possible? If so, how can I do it? Is there a configuration file in which I can specify the dataset associated with each client? While studying the file pt_learner.py, I noticed a few instructions that specify the dataset to use, so can I just change the dataset named in those lines for one of the two clients?

Thank you very much for your support and availability. Sorry for any errors or imprecision; I'm still learning. I'm happy to provide more details if needed.

Answered by chesterxgchen

Nov 19, 2022


chesterxgchen · 2022-11-19T03:37:55Z

Maintainer

Yes, it's entirely possible from NVFLARE's perspective. From a federated learning perspective, though, it might be slightly more complicated.

The complication relates to how features are distributed across the datasets. If both datasets have the same set of features and differ only in the data, this falls into the category of horizontal FL. In that case, using two different datasets is no different from taking one dataset, splitting it in two, and copying the first part to site-1 and the second part to site-2.

If the features differ between sites, for example site-1 has features A, B, C and site-2 has E, F, G, X, Y, Z, then the problem becomes vertical FL, which requires different FL algorithms. We will provide vertical FL examples in the next release.
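
To make the horizontal/vertical distinction concrete, here is a minimal sketch on a toy table. It uses no NVFlare API; the function names, feature names, and site names are all made up for illustration.

```python
# Toy table: each row is one sample, each key is one feature.
FEATURES = ["A", "B", "C", "D"]
ROWS = [
    {"A": 1, "B": 2, "C": 3, "D": 4},
    {"A": 5, "B": 6, "C": 7, "D": 8},
    {"A": 9, "B": 10, "C": 11, "D": 12},
    {"A": 13, "B": 14, "C": 15, "D": 16},
]


def horizontal_split(rows, n_sites):
    """Each site gets different rows but every feature (horizontal FL)."""
    return {f"site-{i + 1}": rows[i::n_sites] for i in range(n_sites)}


def vertical_split(rows, feature_groups):
    """Each site gets every row but only its own features (vertical FL)."""
    return {
        site: [{f: row[f] for f in feats} for row in rows]
        for site, feats in feature_groups.items()
    }


horizontal = horizontal_split(ROWS, n_sites=2)
vertical = vertical_split(ROWS, {"site-1": ["A", "B"], "site-2": ["C", "D"]})
```

In the horizontal case both sites can train the same model on their own rows; in the vertical case no single site holds a complete sample, which is why a different algorithm is needed.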

4 replies

FabioNotaro2001 Nov 19, 2022
Author

Hi, first of all thanks for your reply.
What I need is closer to the first case, horizontal FL. As I said, I'd like to set up two clients with two different datasets that share the same set of features. How can I do this? Where can I specify the dataset for each client?
Thanks

chesterxgchen Nov 20, 2022
Maintainer

From your question, I assume you are working on a prototype or experiment rather than a production deployment, since in production the datasets are naturally distributed among the clients.

For a prototype or a local experiment, you would typically split one dataset into several parts, one for each client. How to split the dataset across clients depends on the case.

For example, look at the Federated Statistics examples. One uses a UCI dataset that comes with train and test datasets: the training dataset goes to one client and the test dataset to the second client. The image stats example uses a COVID-19 dataset split by category, one category per client.

You can also look at the CIFAR10 and xgboost examples; each shows how to split a dataset for different clients.
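
As a rough illustration of the "one category per client" idea, the sketch below groups samples by label and writes one JSON data list per site. It is not the code from the image stats example: the function name, the sample list, and the `{"data": [...]}` file layout are assumptions made for this sketch.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def split_by_category(samples, out_dir):
    """Write one JSON data list per category, one category per site."""
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append(path)

    written = {}
    # Categories are assigned to site-1, site-2, ... in alphabetical order.
    for i, label in enumerate(sorted(by_label), start=1):
        site = f"site-{i}"
        out_file = Path(out_dir) / f"{site}_{label}.json"
        out_file.write_text(json.dumps({"data": by_label[label]}, indent=2))
        written[site] = out_file
    return written


# Hypothetical samples: (file path, category label).
samples = [
    ("img_0.png", "COVID"),
    ("img_1.png", "Normal"),
    ("img_2.png", "COVID"),
    ("img_3.png", "Normal"),
]
out_dir = tempfile.mkdtemp()
datalists = split_by_category(samples, out_dir)
```

Each site then only needs its own data list file, so the same training code can run unchanged on every client.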

FabioNotaro2001 Nov 21, 2022
Author

I'm trying to understand the Cifar10 example and I have few questions.
The first concerns the dataset split we were talking about. I noticed that the file pt/utils/cifar10_data_splitter.py contains a function named Split (see the attached screenshot) that defines how the dataset is split among the clients. Is that correct? From what I can see, after downloading the Cifar10 dataset I get many files named data_batch_1 etc. in a tmp directory, but I cannot understand how this function splits them. I cannot find which file, if any, specifies that client1 takes one portion of the dataset, client2 another, and so on.
Thank you for any suggestions.

chesterxgchen Nov 22, 2022
Maintainer

You might want to start with some simple examples that include both the configuration and the client-side executor (or learner), so you can trace how the code is used. Once you understand that usage, the rest will be easier to follow.

For example, for fed stats,
https://github.com/NVIDIA/NVFlare/blob/dev/examples/federated_statistics/df_stats/df_stats_job/custom/df_statistics.py#L29
specifies that the data is loaded from self.data_root_dir = "/tmp/nvflare/data",
and the data is split at: https://github.com/NVIDIA/NVFlare/blob/dev/examples/federated_statistics/df_stats/data_utils.py#L34

For Image stats:
The data is split with (https://github.com/NVIDIA/NVFlare/tree/dev/examples/federated_statistics/image_stats)
python3 data/prepare_data.py --input_dir /tmp/nvflare/data

Created 4 data lists for ['COVID', 'Lung_Opacity', 'Normal', 'Viral Pneumonia'].
Saved 3616 entries at /tmp/nvflare/data/site-1_COVID.json
Saved 6012 entries at /tmp/nvflare/data/site-2_Lung_Opacity.json
Saved 10192 entries at /tmp/nvflare/data/site-3_Normal.json
Saved 1345 entries at /tmp/nvflare/data/site-4_Viral Pneumonia.json

For each dataset, the site-N_xxx.json file lists the paths of the actual data files,
and the configuration tells the executor where the data is located:
https://github.com/NVIDIA/NVFlare/blob/dev/examples/federated_statistics/image_stats/image_stats_job/config/config_fed_client.json#L38
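
A minimal sketch of the client side of this pattern follows. The class below is a hypothetical stand-in, not an NVFlare executor: it assumes the data root arrives as a constructor argument (in a real job, from the component's args in config_fed_client.json) and that each data list is a JSON file with a "data" key, which is an assumption of this sketch.

```python
import json
import tempfile
from pathlib import Path


class LocalDataLoader:
    """Hypothetical stand-in for a client-side executor or learner."""

    def __init__(self, data_root):
        # The same data_root value can be sent to every client; each
        # client picks its own file based on its client name.
        self.data_root = Path(data_root)

    def load_datalist(self, client_name):
        matches = sorted(self.data_root.glob(f"{client_name}_*.json"))
        if not matches:
            raise FileNotFoundError(
                f"no data list for {client_name} in {self.data_root}"
            )
        return json.loads(matches[0].read_text())["data"]


# Create two small data lists so the sketch runs on its own.
data_root = tempfile.mkdtemp()
(Path(data_root) / "site-1_COVID.json").write_text(
    json.dumps({"data": ["a.png", "b.png"]})
)
(Path(data_root) / "site-2_Normal.json").write_text(
    json.dumps({"data": ["c.png"]})
)

loader = LocalDataLoader(data_root)
site1_files = loader.load_datalist("site-1")
```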

@holgerroth can help you with the CIFAR10 question

holgerroth · 2022-11-22T16:49:46Z

Maintainer

For CIFAR-10, each client takes a different set of indices of the original cifar-10 dataset (indices are non-overlapping across clients). The selection of the indices happens based on client_id here.

In examples where we use actually different datasets from different sources, like the prostate example, again, we assume there's a datalist with the client name that can be used to load the corresponding data indices. See here.
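
The idea of non-overlapping indices selected by client_id can be sketched as below. This is a simplified illustration, not the splitter used in the CIFAR-10 example, which may use a different sampling strategy; the function name and the "site-N" naming convention are assumptions.

```python
import random


def site_indices(client_id, n_clients, n_samples, seed=0):
    """Return the sample indices owned by `client_id` (e.g. "site-2")."""
    site_num = int(client_id.rsplit("-", 1)[1])  # "site-2" -> 2
    if not 1 <= site_num <= n_clients:
        raise ValueError(f"{client_id} is outside 1..{n_clients}")
    indices = list(range(n_samples))
    # A fixed seed yields the same permutation wherever this runs,
    # so the slices taken by different sites never overlap.
    random.Random(seed).shuffle(indices)
    return sorted(indices[site_num - 1 :: n_clients])


parts = [site_indices(f"site-{i}", n_clients=2, n_samples=10) for i in (1, 2)]
```

Every client loads the same full dataset and then keeps only the samples at its own indices.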

2 replies

FabioNotaro2001 Nov 24, 2022
Author

I was studying the Cifar10 example and I'm trying to understand the function _create_datasets() you pointed me to.

I wanted to run the cifar10 example in order to better understand how the dataset is split among the clients, but I read the documentation (https://github.com/NVIDIA/NVFlare/blob/main/examples/cifar10/cifar10-real-world/README.md, section Multi-tasking resource management) and it seems that I need two local GPUs with at least 8 GB of memory if I want to test cifar10 in a real-world scenario. Is that correct?

I tried to run it anyway using the provision command with an overseer, a server, two clients, and an admin, and everything seemed to work. The admin CLI said that the jobs were executed successfully, although the client and server processes didn't log anything or report any epoch updates in the CLI. When I downloaded the job result, it was emptier than those of previous successful jobs: the workspace folder is empty, without the usual files and directories such as cross_site_val, tb_events... So I suppose something went wrong, but why didn't I get any error (about the missing GPU)?

If that was an error, is there a way to run this example in real-world mode with several local clients but without GPUs? And, first of all, would running it help me better understand your last hint on dataset splitting?

I noticed you suggested that I study the prostate example, but for now I'd like to stay with the image classification examples, which is why I'm focusing on cifar10 and hello-pt-tb.

Thank you for your support and sorry for any errors. I'm happy to provide more details if necessary.

holgerroth Nov 28, 2022
Maintainer

No, you can run each cifar-10 experiment on one GPU with at least 8GB memory. If you have more resources and want to run experiments in parallel, that's when the resource management becomes useful.

The cifar-10 data splitting will not currently work across different machines. You would have to run the data splitting separately and then move the data indices to each local machine. One option is to run a simulation and then copy the temporarily saved splits over to the real clients. And yes, it should work on CPU only. We are selecting the CPU here if no GPU is available.
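
A silent CPU fallback of this kind typically looks like the sketch below, which would also explain why a missing GPU produces no error. The function name is hypothetical and this is not copied from the example; it only assumes the standard torch.cuda.is_available() check.

```python
def pick_device():
    """Return "cuda:0" when a GPU is usable, otherwise fall back to "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch installed there is nothing to query.
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


device = pick_device()
```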

holgerroth · 2022-11-22T16:53:39Z

Maintainer

FYI, it's also possible to use the deploy_map in meta.json to send different configurations to different clients. Hence, you can just pass a different folder name as an argument to your client Executor/Learner.
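
As a sketch of what such a deploy_map could look like, the snippet below builds a meta.json body in Python. The job name and the app folder names (app_server, app_site-1, app_site-2) are hypothetical; the assumption is that each app folder carries its own client config pointing at a different dataset folder.

```python
import json

# Hypothetical job layout: one app folder per client, so each client
# receives its own config/config_fed_client.json.
meta = {
    "name": "different_datasets_job",
    "deploy_map": {
        "app_server": ["server"],
        "app_site-1": ["site-1"],
        "app_site-2": ["site-2"],
    },
    "min_clients": 2,
}


def app_for(site, deploy_map):
    """Return the app folder that the deploy_map assigns to `site`."""
    for app, sites in deploy_map.items():
        if site in sites:
            return app
    raise KeyError(site)


meta_json = json.dumps(meta, indent=2)  # contents to save as meta.json
```

With this layout, app_site-1 and app_site-2 can pass different dataset folder names to the same Executor/Learner class.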

0 replies
