New Hub: Geospatial workshop in Ghana #473

Closed
6 tasks done
choldgraf opened this issue Jun 17, 2021 · 63 comments

@choldgraf
Member

choldgraf commented Jun 17, 2021

Background

@paigem works with @rabernat, and is helping to lead/organize a workshop around geospatial analytics (e.g., the "Pangeo stack") in Ghana. In previous years, they have asked attendees to install things on their local machines, but she would love to have access to cloud infrastructure via 2i2c that supports this workshop.

The team behind this workshop currently does not have funding for infrastructure/services, so this would be a pro-bono case. In my opinion, it is well worth the time investment because it is a great cause, and a way to see how our infrastructure could serve countries outside North America and Europe.

@paigem could you help us answer some of the questions in the section below?

Setup Information

  • Hub auth type: Google
  • Hub administrators:
  • Hub url: coessing.pangeo.2i2c.cloud
  • Hub logo:
  • Hub logo URL:
  • Hub type: daskhub
  • Hub cluster: pangeo-hubs
  • Hub image:

Important Information

  • Hub config name: coessing
  • Community champion: @paigem
  • Hub start date: by 19 July 2021
  • Hub end date: 31 August 2021
  • Hub important dates: anticipate highest number of users during the 2-week school: July 26 - August 6, 2021

Deploy To Do

  • Fill out hub deployment information
  • Find a funding source for the Ghana hub #491 (in order of priority below)
    1. Pangeo cloud credits
    2. Ask GCP if we can get credits for 2i2c using this as an example
    3. Use JROST credits
    4. Eat the cost
  • Initial Hub deployment
  • Administrators able to log on
  • Community Champion satisfied with hub environment
  • Hub now in steady-state
@paigem

paigem commented Jun 18, 2021

Thanks @choldgraf for getting this underway! Very much appreciated!

I am not sure I understand a lot of what the "Setup Information" section is asking for (e.g. do I need to provide the hub type, url, etc.?), but I can fill in a bit of the "Important Information":

Important Information

  • Hub config name:
  • Community champion:
  • Hub start date: by 19 July 2021 (to allow for at least 1 week of testing before the school starts)
  • Hub end date: 31 August 2021 (to allow school attendees time to go back through their notebooks after the school)
  • Hub important dates: anticipate highest number of users during the 2-week school: July 26 - August 6, 2021

For two reasons, we may want to extend the hub end date to next year or beyond:

  • Hopefully a few school attendees will really engage with the Hub and want to continue using it for research or other scientific purposes.
  • We would prefer not to provide a wonderful resource only during the 2-week school, but rather hope to start to set up a more permanent cloud-based platform for ocean and climate research in West Africa. However, we recognize that this is a trial run and so if we are not able to extend the Hub this year, then we can learn from our mistakes and ideally bring it back at next year's school!

Happy to provide more of the above information with a bit more guidance! Thank you!!

@yuvipanda
Member

This is great, we should definitely support this! Do you know which funding source we can use for this, @choldgraf?

@choldgraf
Member Author

@yuvipanda that's a good question, here are a few options I can think of:

  • We could use the JROST funds (that is $5,000).
  • We could ask @rabernat if the Pangeo funds could be used for this
  • We could ask Kharan if GCP has credits for this kinda thing
  • We could eat the cost

@choldgraf
Member Author

I think we should start off using the JROST funds, and then try to find credits elsewhere

@choldgraf
Member Author

I wonder if @scottyhq, @consideRatio, @jhamman, or @rabernat could comment on what kind of cost we might expect for this workshop. If we have ~30-100 users doing "pangeo-style" environment analysis for 2 weeks, what kind of cost could we expect to incur in cloud infrastructure? This feels like it may be similar to the GeoHackWeeks.

@choldgraf
Member Author

I spoke with @rabernat who mentioned that we could use the Columbia Pangeo credits for this one. I believe that those are on GCP as well. @sgibson91 @yuvipanda is there any technical challenge to using these credits for this hub? (assuming that it will be a different hub from the "main" Pangeo hubs)

@consideRatio
Member

Hmmm hmmm @choldgraf I'm not feeling confident about cost estimation, as it depends so heavily on how much work users generate through their ability to request compute via Dask clusters, but I'll try to estimate things anyhow.

The base cost could be like any other hub for 2 weeks I guess, but then the Dask worker nodes add to that. They will be configured as spot/preemptible instances that cost ~30% of the on-demand price, so a 32-core instance, for example, is about 300 USD / month (150 USD / 2 weeks). I'll go ahead and guesstimate the cost won't go over 1000 USD for Dask worker nodes if ~50 users play around with Dask workers and we force machines to be limited to 32 CPU cores and limit autoscaling to ~10 nodes (320 cores).
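
The arithmetic behind this estimate can be sketched as follows. This is only an illustration built from the rough figures quoted in this comment (about 300 USD per month per 32-core preemptible node, a 2-week workshop, a 10-node autoscaling cap); the variable names are made up for the example and the numbers are not actual GCP prices.

```python
# Rough figures taken from the estimate above; none of these are real price quotes.
SPOT_USD_PER_NODE_MONTH = 300.0   # ~32-core preemptible node, per month
SPOT_PRICE_FRACTION = 0.30        # preemptible cost as a share of on-demand
WORKSHOP_MONTHS = 0.5             # the 2-week school
CORES_PER_NODE = 32
MAX_NODES = 10                    # proposed autoscaling limit
BUDGET_USD = 1000.0               # the guesstimated ceiling for Dask workers

# Cost of one node running non-stop for the whole workshop.
usd_per_node = SPOT_USD_PER_NODE_MONTH * WORKSHOP_MONTHS

# Worst case: every allowed node runs non-stop for two weeks.
worst_case_usd = MAX_NODES * usd_per_node

# Share of the worst case that fits inside the guessed budget.
budget_utilization = BUDGET_USD / worst_case_usd

# On-demand price implied by the ~30% figure.
implied_on_demand_usd_per_month = SPOT_USD_PER_NODE_MONTH / SPOT_PRICE_FRACTION

print(f"per node, 2 weeks: {usd_per_node:.0f} USD")
print(f"worst case ({MAX_NODES * CORES_PER_NODE} cores): {worst_case_usd:.0f} USD")
print(f"budget covers {budget_utilization:.0%} of worst-case usage")
```

Read this way, the 1000 USD guess corresponds to the worker pool sitting at roughly two thirds of its cap on average, which leaves a lot of headroom if most attendees only start small clusters occasionally.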

@choldgraf
Member Author

that's a really helpful analysis @consideRatio , thanks very much :-)

@paigem

paigem commented Jul 2, 2021

Thanks @choldgraf @consideRatio for your efforts here! With my limited understanding of all of this, I think what @consideRatio lays out here sounds very reasonable. I don't anticipate having too many high Dask-usage workloads during the school, since for many participants of the school this will be their first time using Dask or accessing large climate datasets. And especially with so many new Dask users, those CPU and scaling limits mentioned by @consideRatio will be very important.

@sgibson91
Member

@consideRatio gave a really nice costing estimate above! 🙌🏻 From a technical standpoint, I think we run into the same issue as 2i2c-org/team-compass#136: we don't have billing control of that project.

@paigem

paigem commented Jul 9, 2021

Just checking on an update here! This year's workshop is coming up very soon, and I just want to know if it's likely a Hub can be set up and fully functional by July 19th at the latest, or if I should make alternate plans instead (which would be doable, as long as I know soon). Thanks!

@choldgraf
Member Author

@sgibson91 I believe that we can deploy this hub on Pangeo infrastructure as well, so could the temporary fix for 2i2c-org/team-compass#136 also be applied to this hub?

@sgibson91
Member

@choldgraf we now have a bigger blocker on that project and I've resorted to testing on the GCP project that is currently hosting Pangeo infrastructure (that I can access with my 2i2c account, not Columbia)

@choldgraf
Member Author

@sgibson91 I think it's fine if we use whatever GCP account we have access to in order to serve the Ghana Hub. If worst comes to worst, we'll use our $5,000 JROST grant to pay for the cloud infrastructure.

@sgibson91
Member

Ok, well there's a fresh cluster on the pangeo-181919 project as of today that I believe myself, @yuvipanda and @damianavila have access to. I can put my focus on this from Monday unless either of them get there before me?

@choldgraf
Member Author

that'd be super awesome :-)

@sgibson91
Member

I am not sure I understand a lot of what the "Setup Information" section is asking for (e.g. do I need to provide the hub type, url, etc.?)

Hi @paigem! I think the most important questions here to get going are:

  1. What method would you like workshop attendees to log into the hub with? Such as GitHub, or Google accounts?
  2. Do you need parallel-processing capabilities provided by Dask, or will a more "vanilla" setup (such as 1 CPU) suffice? (This helps us answer the hub type question)

We can generate a URL that will be something like foo.bar.2i2c.cloud, but if the workshop has a URL you might like to have the hub be a subdomain of that. We could add a CNAME that is something like hub.workshop-url to our records.

@sgibson91
Member

There's a WIP PR open to deploy a Hub in #508 :)

@sgibson91
Member

sgibson91 commented Jul 21, 2021

Thank you for your help @rabernat! If a restart of the server doesn't help, I will poke around those hubs a little more (this hub is running in the same project for now so hopefully it should transfer over pretty easily)

@paigem

paigem commented Jul 21, 2021

I have tried explicitly shutting down the notebook kernels and restarting, and creating a new notebook, all with no luck.

@sgibson91
Member

@paigem can you go to https://coessing.pangeo.2i2c.cloud/hub/admin, click "stop server" next to your name and try again please? (This is different to killing the kernels, it's more like rebooting your machine)

@paigem

paigem commented Jul 21, 2021

Thanks for specifying how to shut down my server @sgibson91. I have shut down my server and logged in again, and I am still getting the same error.

@sgibson91
Member

Ok, thanks for bearing with me there. I will see what I can learn from the Pangeo hub deployments regarding this.

@paigem

paigem commented Jul 21, 2021

No problem at all! Thanks for helping figure this out!

@TomAugspurger

This sounds a bit like pangeo-data/pangeo-cloud-federation#615. I have a few comments in that thread with various commands I ran to grant GCP permissions to Google service accounts and link those Google service accounts with Kubernetes Service Accounts (which are used by the hub).

I never did confirm this, but I think there is a potential risk that a user makes requester-pays calls to non-pangeo buckets, which would end up costing money. I never found out if there's a finer-grained way to grant this permission on just certain buckets.

@sgibson91
Member

Ok, some good news! I have bootstrapped the pangeo-notebook image, so ecco_v4_py is now available! 🎉 In the future, if you need any more packages adding, you can self-serve these in the coessing-image repository by following the instructions in the README.

@sgibson91
Member

sgibson91 commented Jul 21, 2021

  • It appears that I cannot access datasets stored on Pangeo Cloud. For instance, I tried to load the ECCO dataset as I do in Pangeo Cloud:
import intake
cat = intake.open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/ocean.yaml")
ecco_monthly_ds = cat.ECCOv4r3.to_dask()

But I get the following error: OSError: Forbidden: https://storage.googleapis.com/download/storage/v1/b/pangeo-ecco-eccov4r3/o/eccov4r3%2F.zmetadata?alt=media Caller does not have serviceusage.services.use access to the Google Cloud project. Is there any way we can have access to the Pangeo Cloud datasets found here?

Omg, I think I have solved this too! At least now when I run that snippet in a notebook, I don't get any error. @paigem can you confirm?

Thank you @rabernat and @TomAugspurger for your helpful input! 🙏🏻
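
For context on the error quoted above: the `serviceusage.services.use` message is consistent with the requester-pays setup raised earlier in the thread, where each request to the bucket names a project to bill and the caller needs that permission on the named project. The sketch below only builds the kind of request URL involved. It assumes the Cloud Storage JSON API `userProject` query parameter and assumes `pangeo-181919` is the project being billed; the helper name is hypothetical and nothing here is taken from the hub's actual configuration.

```python
from urllib.parse import quote, urlencode


def requester_pays_download_url(bucket, object_name, billing_project):
    """Build a JSON API media-download URL that bills `billing_project`.

    Hypothetical helper for illustration; it builds a string and makes no request.
    """
    # Object names are a single path segment, so "/" must be percent-encoded.
    encoded_object = quote(object_name, safe="")
    query = urlencode({"alt": "media", "userProject": billing_project})
    return (
        "https://storage.googleapis.com/download/storage/v1/"
        f"b/{bucket}/o/{encoded_object}?{query}"
    )


url = requester_pays_download_url(
    bucket="pangeo-ecco-eccov4r3",
    object_name="eccov4r3/.zmetadata",
    billing_project="pangeo-181919",  # assumption: the project that gets billed
)
print(url)
```

If the identity making the request lacks `serviceusage.services.use` on the billed project, a request like this is rejected, which matches the `Forbidden` response in the report.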

@sgibson91
Member

  • Am I understanding the documentation correctly that, if I want content from a public GitHub repo to populate in every user's Hub, then I should distribute an nbgitpuller link for the Hub, instead of the base Hub link: https://coessing.pangeo.2i2c.cloud?

Yes, this is correct

@paigem

paigem commented Jul 21, 2021

Omg, I think I have solved this too! At least now when I run that snippet in a notebook, I don't get any error. @paigem can you confirm?

It works!! 😄 Amazing - thank you @sgibson91!

@rabernat
Contributor

  • then I should distribute an nbgitpuller link for the Hub

In case you don't know about it, you can use this great website to generate an nbgitpuller link. I generated this one for example

https://coessing.pangeo.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fpangeo-gallery%2Fphysical-oceanography&urlpath=lab%2Ftree%2Fphysical-oceanography%2F&branch=master

And you can even put your link behind a fancy-looking badge

[![Open with Jupyter](https://img.shields.io/badge/Open%20with-Jupyter-orange?style=for-the-badge&logo=Jupyter)](https://coessing.pangeo.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fpangeo-gallery%2Fphysical-oceanography&urlpath=lab%2Ftree%2Fphysical-oceanography%2F&branch=master)
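
For reference, the link and the markdown badge above can also be assembled programmatically. This is a minimal sketch using only the Python standard library, assuming the hub's `git-pull` endpoint takes the `repo`, `urlpath`, and `branch` query parameters exactly as they appear in that link. The function names are illustrative and are not part of nbgitpuller.

```python
from urllib.parse import urlencode

HUB_URL = "https://coessing.pangeo.2i2c.cloud"
BADGE_URL = (
    "https://img.shields.io/badge/Open%20with-Jupyter-orange"
    "?style=for-the-badge&logo=Jupyter"
)


def make_nbgitpuller_link(hub_url, repo, urlpath, branch):
    """Return a hub link that pulls `repo` and then opens `urlpath`."""
    # urlencode percent-encodes ":" and "/" so the repo URL survives as one value.
    query = urlencode({"repo": repo, "urlpath": urlpath, "branch": branch})
    return f"{hub_url}/hub/user-redirect/git-pull?{query}"


def make_markdown_badge(link, badge_url=BADGE_URL, alt="Open with Jupyter"):
    """Wrap `link` in a markdown image badge."""
    return f"[![{alt}]({badge_url})]({link})"


link = make_nbgitpuller_link(
    HUB_URL,
    repo="https://github.com/pangeo-gallery/physical-oceanography",
    urlpath="lab/tree/physical-oceanography/",
    branch="master",
)
markdown_badge = make_markdown_badge(link)
print(link)
print(markdown_badge)
```

Swapping in another repository, branch, or starting path only requires changing the three arguments, which is handy when a workshop has one link per lesson.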

@paigem

paigem commented Jul 21, 2021

Thanks @rabernat! Yes, I have already made a link using nbgitpuller (thanks to 2i2c docs!) to sync files from a GitHub repo, but I like the fancy badge! I assume this badge is something you include in your GitHub repo?

@rabernat
Contributor

rabernat commented Jul 21, 2021

You can certainly put the badge in a repo README. The version I shared was a markdown version, so it works well on GitHub. But you could put such a badge on any website anywhere, such as the workshop website. The HTML version would look like

<a href="https://coessing.pangeo.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fpangeo-gallery%2Fphysical-oceanography&urlpath=lab%2Ftree%2Fphysical-oceanography%2F&branch=master"><img alt="Open with Jupyter" src="https://img.shields.io/badge/Open%20with-Jupyter-orange?style=for-the-badge&logo=Jupyter" /></a>
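
If the workshop website is generated by a script, the HTML version above could be produced as in the sketch below. It differs from the hand-written snippet in one respect: `html.escape` turns each `&` in the URLs into `&amp;`, which is the strictly valid form inside an HTML attribute, although browsers generally accept the unescaped version too. The function name is hypothetical.

```python
from html import escape

LINK = (
    "https://coessing.pangeo.2i2c.cloud/hub/user-redirect/git-pull"
    "?repo=https%3A%2F%2Fgithub.com%2Fpangeo-gallery%2Fphysical-oceanography"
    "&urlpath=lab%2Ftree%2Fphysical-oceanography%2F&branch=master"
)
BADGE_SRC = (
    "https://img.shields.io/badge/Open%20with-Jupyter-orange"
    "?style=for-the-badge&logo=Jupyter"
)


def make_html_badge(link, badge_src, alt="Open with Jupyter"):
    """Return an <a><img></a> snippet with attribute values escaped."""
    return '<a href="{}"><img alt="{}" src="{}" /></a>'.format(
        escape(link, quote=True),
        escape(alt, quote=True),
        escape(badge_src, quote=True),
    )


html_badge = make_html_badge(LINK, BADGE_SRC)
print(html_badge)
```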

@paigem

paigem commented Jul 21, 2021

Thanks @rabernat - the html version will be very helpful for the website!

@damianavila
Contributor

Omg, I think I have solved this too! At least now when I run that snippet in a notebook, I don't get any error. @paigem can you confirm?

@sgibson91 just to confirm the fix was running this snippet or something else?

gcloud projects add-iam-policy-binding pangeo-181919 \
  --member serviceAccount:pangeo@pangeo-181919.iam.gserviceaccount.com \
  --role roles/serviceusage.serviceUsageConsumer

@sgibson91
Member

sgibson91 commented Jul 22, 2021

@damianavila I think the code in this comment did it. The key was the k8s annotation so it knows to use it.

(For ref because it took me some time to figure this out: the part of the gcloud command inside [] is [k8s_namespace/helm_namespace])

@damianavila
Contributor

Thanks for the info, @sgibson91.
It would be nice to consolidate this info in our docs somehow/somewhere.

@sgibson91
Member

It would be nice to consolidate this info in our docs somehow/somewhere.

Yeah, I'd also like a review of it to make sure we understand what's going on and that we're not unnecessarily granting elevated privileges

sgibson91 self-assigned this Jul 23, 2021
@sgibson91
Member

@paigem if you're happy with the state the hub is in now, I'm going to close this ticket.

Should I continue asking questions specific to my Hub in this thread, or should I start a new issue in 2i2c/pilot as mentioned in the documentation?

We are actually trialling a new support framework using FreshDesk and tickets can be submitted by emailing support@2i2c.org. Are you happy to be a guinea pig and send any issues through this system?

@paigem

paigem commented Jul 26, 2021

Yes, the Hub is working great! Thank you!! And I'm happy to trial the FreshDesk support framework! :)
