
Pangeo framework for LIDAR and Hyperspectral Forestry #170

Closed
bw4sz opened this issue Mar 20, 2018 · 4 comments

Comments

@bw4sz

bw4sz commented Mar 20, 2018

Hi all,

Following #144, I'm introducing myself and my interest in this project. I am working on tree delineation and segmentation using airborne LIDAR and hyperspectral data for the NEON sites. Some project info is here. I am working on the UF HiPerGator HPC environment, and I appreciate the wiki doc on getting dask started on HPC. If I'm successful, I'll try to contribute additional information that might help users on other clusters (SLURM instead of PBS).

If I understand correctly, a lot of the speedup and memory management comes from xarray and dask distributed processing? I'm inheriting a lot of code, so I'll need to decide how much to refactor to match these workflows. Our data is split into tiles, and I'd like to subset those tiles, distribute them to workers, perform our supervised classification algorithms, and recombine the results. This will be my first experience with dask; I was using Apache Beam on Google Cloud Dataflow before moving to the university cluster.
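The tile workflow described above (subset tiles, distribute to workers, classify, recombine) maps naturally onto `dask.delayed`. A minimal sketch, where `classify_tile` and the tile data are hypothetical placeholders standing in for a real per-tile supervised classifier:

```python
import dask
from dask import delayed

def classify_tile(tile):
    # Hypothetical stand-in for a per-tile classifier; a real version
    # would load a LIDAR/hyperspectral tile and return predictions.
    return [x * 2 for x in tile]  # placeholder computation

tiles = [[1, 2], [3, 4], [5, 6]]  # placeholder tile data

# Build one lazy task per tile; nothing executes yet.
tasks = [delayed(classify_tile)(t) for t in tiles]

# Run all tiles in parallel, then recombine the per-tile results.
results = dask.compute(*tasks, scheduler="threads")
combined = [item for tile_result in results for item in tile_result]
print(combined)  # [2, 4, 6, 8, 10, 12]
```

The same task graph runs unchanged on a `dask.distributed` cluster by connecting a `Client` before calling `compute`.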

Ben Weinstein
Postdoctoral Fellow
University of Florida

@mrocklin
Member

Welcome @bw4sz ! We're glad to see you. I'd like to recommend a couple links to you:

  • the dask-jobqueue project, and the SLURM integration there in particular: https://github.com/dask/dask-jobqueue/blob/master/dask_jobqueue/slurm.py

    I suspect that this will mostly work, but I would not be surprised to learn that some tweaks are needed to generalize it. As we encounter more and more clusters, we routinely find assumptions we made based on the clusters at hand that do not generalize.

  • The documentation on making dask arrays from different data sources: http://dask.pydata.org/en/latest/array-creation.html

    I suspect that if you include more information here about the kind of file format you're using and how you currently access that data from within Python, people here will have more suggestions on how to get started.
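For reference, the SLURM integration linked above now lives in the dask-jobqueue `SLURMCluster` class. A configuration sketch (the queue name, resources, and worker count here are placeholder values for a hypothetical cluster, so this won't run outside a SLURM environment):

```python
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# Placeholder resource values; adjust to the target cluster's
# partition names and per-node limits.
cluster = SLURMCluster(
    queue="hpg-default",   # hypothetical SLURM partition name
    cores=4,               # cores per worker job
    memory="16GB",         # memory per worker job
    walltime="02:00:00",   # SLURM walltime per job
)
cluster.scale(10)          # ask SLURM for enough jobs for 10 workers

client = Client(cluster)   # connect a dask client to the cluster
```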

@bw4sz
Author

bw4sz commented Mar 20, 2018

Thanks @mrocklin, I'll report back on my success. I think the overarching question I have is whether this pipeline will also be appropriate for some traditional embarrassingly parallel operations when needed. I can see in the mission statement that the goal is to work interactively. While that is 100% helpful and crucial in the development stage, eventually we hope to scale via a traditional batch submission approach.

In terms of data, we have thousands of .laz files stored locally on the HPC. We load them similarly to this stack overflow question. This is a very new project, so it will be a couple of days before I have much to add.
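One pattern from the array-creation docs linked above that fits per-file loading: wrap the loader in `dask.delayed` and stack the pieces with `da.from_delayed`. The `load_laz` below is a hypothetical stand-in (a real loader might use laspy to read x/y/z); it fabricates points so the sketch is self-contained:

```python
import numpy as np
import dask
import dask.array as da

N_POINTS = 100  # hypothetical: assume every tile holds this many points

def load_laz(path):
    # Hypothetical loader: a real version would read point records
    # from a .laz file (e.g. with laspy); we fabricate points instead.
    rng = np.random.default_rng(abs(hash(path)) % (2**32))
    return rng.random((N_POINTS, 3))

paths = [f"tile_{i}.laz" for i in range(4)]  # placeholder file names

# One lazy array per file, then concatenate into a single dask array.
arrays = [
    da.from_delayed(dask.delayed(load_laz)(p), shape=(N_POINTS, 3), dtype=float)
    for p in paths
]
points = da.concatenate(arrays, axis=0)
print(points.shape)            # (400, 3)
print(points.mean().compute())
```

Because each file becomes its own chunk, per-tile operations stay embarrassingly parallel and only touch the files they need.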

@mrocklin
Member

mrocklin commented Mar 20, 2018 via email

@stale

stale bot commented Jun 25, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 25, 2018
@bw4sz bw4sz closed this as completed Jun 25, 2018