finish aws s3 support #120

Closed
rom1504 opened this issue Feb 3, 2022 · 5 comments

Comments

@rom1504
Owner

rom1504 commented Feb 3, 2022

pip install s3fs makes it work almost completely

the only thing that doesn't work is the logger process: somehow the glob always returns the same files and doesn't see the new ones
solutions:

  • change the logging pattern and send the info directly over the network from workers to driver (but it's not obvious how to do that in spark)
  • figure out what's up with the s3 filesystem in that case
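
For the second option, one possible explanation (an assumption, not something confirmed in this thread) is that the filesystem object is serving a cached directory listing, so repeated globs never reach S3 again. fsspec-style filesystems expose `invalidate_cache()` for this case. A minimal sketch, where `glob_fresh` is a hypothetical helper name and `fs` is any object with the fsspec `glob` / `invalidate_cache` methods:

```python
def glob_fresh(fs, pattern):
    """Glob on an fsspec-style filesystem after dropping cached listings.

    Assumption: the stale results come from the listing cache held by `fs`.
    `invalidate_cache()` and `glob()` are part of the fsspec filesystem
    interface; `glob_fresh` itself is only an illustrative helper.
    """
    # Forget any directory listings cached by this filesystem instance.
    fs.invalidate_cache()
    # This glob now has to list the remote store again.
    return sorted(fs.glob(pattern))
```

With s3fs installed this could be called as `glob_fresh(fsspec.filesystem("s3"), "bucket/output/*.json")`, where the bucket and path are placeholders.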
@rom1504
Owner Author

rom1504 commented Feb 4, 2022

mostly done now
can probably be made faster by tuning the caching options / by making writes async / by doing fewer writes
(in the current state it's 30% slower than local write)
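
As a rough illustration of the "fewer writes" idea, small records can be collected in memory and sent in one call per batch. This is only a sketch under assumptions: `BatchedWriter`, `write_fn` and `batch_size` are hypothetical names, and `write_fn` stands in for whatever performs one remote write.

```python
class BatchedWriter:
    """Buffer small records and emit them as one write per batch."""

    def __init__(self, write_fn, batch_size=1000):
        # write_fn: hypothetical callable taking bytes, e.g. one remote write.
        self.write_fn = write_fn
        self.batch_size = batch_size
        self._buffer = []

    def add(self, record: bytes):
        """Queue a record; flush automatically once the batch is full."""
        self._buffer.append(record)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write all queued records in a single call, if there are any."""
        if self._buffer:
            self.write_fn(b"".join(self._buffer))
            self._buffer = []
```

The caller has to call `flush()` once at the end so the last partial batch is not lost. Larger batches mean fewer round trips but more memory held per worker.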

@rom1504
Owner Author

rom1504 commented Feb 4, 2022

@rom1504
Owner Author

rom1504 commented Feb 4, 2022

same slowness issue on hdfs, probably also same solution

@rom1504
Owner Author

rom1504 commented Feb 5, 2022

I'm not observing any big slowness after #85 (that should be improved)
may still use the cache-based solution of fsspec, but it doesn't seem strictly needed (s3fs and the hdfs fs already seem to have decent built-in strategies)

@rom1504
Owner Author

rom1504 commented Feb 6, 2022

I actually did more tests and this is working as expected; using a local cache doesn't seem required
may revisit local caching in the future if needed

@rom1504 rom1504 closed this as completed Feb 6, 2022