Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

should we update has_gpu field to be num_gpu? #756

Closed
xmonader opened this issue Jun 20, 2023 · 15 comments
Closed

should we update has_gpu field to be num_gpu? #756

xmonader opened this issue Jun 20, 2023 · 15 comments
Assignees
Milestone

Comments

@xmonader
Copy link
Contributor

instead of boolean we can have it as a number, which can save unneeded queries against nodes to find the total number of gpus available?

related to threefoldtech/tfgrid-sdk-go#210

@renauter
Copy link
Collaborator

I see the point
Since storage size will be the same (bool <=> u8) why not taking advantage to store more significant information than just if there is gpu or not?
From TFChain side it would require some rework and probably a migration (since storage is already on devnet) or a new storage creation (killing the deprecated one), if we chose this way.

@robvanmieghem
Copy link
Contributor

GPU farms ( at least the ones for mining) all had multiple gpu's per physical machine. When you mainly use GPU computing power, the CPU is not very much under load normally so it's much more efficient to put multiple GPU's in a machine.

@muhamadazmy
Copy link
Member

I think this is a good idea, but this issue then need to be moved to tfchain

@muhamadazmy muhamadazmy transferred this issue from threefoldtech/zos Jun 21, 2023
@xmonader
Copy link
Contributor Author

With such positive feedback on the idea, @renauter how long would it take to implement ? :)

@robvanmieghem
Copy link
Contributor

I see the point
Since storage size will be the same (bool <=> u8) why not taking advantage to store more significant information than just if there is gpu or not?
From TFChain side it would require some rework and probably a migration (since storage is already on devnet) or a new storage creation (killing the deprecated one), if we chose this way.

I would not bother writing a migration, it's devnet

@DylanVerstraete
Copy link
Contributor

The reasoning behind adding a number of GPU's bothers me. Why should this information be stored on tfchain in the first place? To make development easy? I rather suggest we implement the offchain indexer idea so that clients can query hardware details from any node NOT using tfchain.

@xmonader
Copy link
Contributor Author

The reasoning behind adding a number of GPU's bothers me. Why should this information be stored on tfchain in the first place? To make development easy? I rather suggest we implement the offchain indexer idea so that clients can query hardware details from any node NOT using tfchain.

Because of the following imo
1- It's equal in data size
2- It communicates more information
3- It saves calls needed to to calculate simple information
4- It keeps that info updated

We are also proceeding with the indexer indeed as agreed before, to track the devices for future filtration

@muhamadazmy
Copy link
Member

I think making development easier in itself is an enough reason! I don't get why you make this sound bad if it's exactly the same data size. this can be a u8 in rust.

Even if it takes more size on the chain but it makes performance and development easier, no more reasons need to be given.

@xmonader xmonader changed the title should we update has_gpu field to be num_gpus? should we update has_gpu field to be num_gpu? Jun 21, 2023
@renauter
Copy link
Collaborator

With such positive feedback on the idea, @renauter how long would it take to implement ? :)

Today and tomorrow probably
There is also graphql to update

@DylanVerstraete
Copy link
Contributor

I understand that it makes development easier for clients, but now 2 days of development (runtime/graphql) has to be spent to change the initially requested features. It sounds to me like we did not correctly spec out the story/feature enough and now development is slowed down. Imo we should have expressed gpu as a resource and not like this. I don't really think it's beneficial for a user "how" many gpu's there are, rather the user would like to know how powerful they are.

Also not having a migration for this on devnet can lead to corrupted state, which is another reason for me to not blindly incorporate this change.

@DylanVerstraete
Copy link
Contributor

Actually the combined information of how many gpu's + resources of each gpu would be the most optimal information

@xmonader xmonader added this to the 2.5.0 milestone Jun 21, 2023
@brandonpille
Copy link
Contributor

I understand that it makes development easier for clients, but now 2 days of development (runtime/graphql) has to be spent to change the initially requested features. It sounds to me like we did not correctly spec out the story/feature enough and now development is slowed down. Imo we should have expressed gpu as a resource and not like this. I don't really think it's beneficial for a user "how" many gpu's there are, rather the user would like to know how powerful they are.

Also not having a migration for this on devnet can lead to corrupted state, which is another reason for me to not blindly incorporate this change.

I agree here. Plus I don't think there is a use case where a user needs X amount of GPUs but doesn't care about which GPUs. So in that case he will need to ask the zos nodes. So in the end we will always have to ask the ZOS nodes. Why have that on chain if we will need to contact the zos nodes?

@LeeSmet
Copy link
Contributor

LeeSmet commented Jun 22, 2023

Alright so lets be very clear: we decided that there will be no tracking of GPU's on chain in the final release, and the current implementation was only done because the off chain indexing would "take a long time". Therefore as a compromise, on devnet we would temporarily maintain a map of nodes which have gpus, and this will be removed.

In that light, it absolutely makes zero sense to me that we are now adding feature dependencies on top of this to be "more efficient/performant".

@muhamadazmy
Copy link
Member

@LeeSmet we never agreed that we have any temporary map of node gpus. We only agreed that the node can have a flag (on/off) if it has gpu or not.

We now think (which is fine since the entire feature is still in development sage) that this same flag can be instead be the number of gpus available on the node (can be a u8). This will make finding a node that matches the user requirements is easier and more efficient.

This has 0 cost compared to having just a boolean.

The proxy will index the full gpu information for further filtering

@LeeSmet
Copy link
Contributor

LeeSmet commented Jul 19, 2023

Closing since we are removing gpu related info from chain as previously discussed: #761

@LeeSmet LeeSmet closed this as not planned Won't fix, can't repro, duplicate, stale Jul 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Status: Done
Development

No branches or pull requests

7 participants