Downloader is not producing full set of expected outputs #159
Comments
Do you have enough RAM in the workers?
I believe they are all succeeding. This is the config I'm using:
This is my worker node:
Ok, do you have any way to check the memory usage during the job? Can you also check the executor logs to see if you get this log? https://github.com/rom1504/img2dataset/blob/main/img2dataset/downloader.py#L93
I see from CloudWatch that it looks fine:
Oh interesting, I looked through some of the executors that were running shockingly fast (<1s) and found a bunch of these errors:
Wonder if that's related, I'll dig into it. Any reason this is just a print statement and not a thrown exception that the tool can retry on?
Yeah, the problem is Spark doesn't have a "try, then give up at some point" feature, so if there were an exception here instead of a print, your whole job would likely have failed after the tasks had retried a few times. I think something that could be improved here is doing a loop in that piece of code to retry a few times instead of just failing.
Your credentials error is likely the problem. There are two ways to solve it:
Is there any concept of surfacing these logs without having to go into the individual executors? Or somehow tracking these failures in stats.json?
Yeah, I noticed the error eventually goes away. I'm wondering if it's some kind of Spark job spin-up time and I should just add a 5-minute wait to the calling script or something.
There are several options to surface them, but I'm not sure I can think of something clean; feel free to try things.
Edit - I spoke too soon, I reduced
Could this #137 be related?
The retry option is at the sample level; that will not help in your case. What you need is a retry at the shard level, and that needs new code at the line I was suggesting above.
Reducing subjob_size to 5 will drastically decrease the performance of the tool; at this point it will be slower than running on one node.
By retry at the shard level I mean doing a for loop here: https://github.com/rom1504/img2dataset/blob/main/img2dataset/downloader.py#L88
Yeah, that's what I did, I ran retries at the shard level (i.e. I edited the code so that it 1. retries on shard failure, 2. eventually throws after 3 retries; this did not resolve the error). The error I'm getting is not retry-able. Do you know the minimum parallelization you'd recommend before it starts to be more efficient to run on a single node? Reducing parallelism seems to be the only fix.
Yeah, I just ran a test over a single parquet, and running at low parallelism fixed it (but yes, now the tool isn't performant). I'll spend today working on finding whether there are better params I can use. SSL errors also (expectedly) went way down.
What do you mean exactly? Does it fail again if you retry (after some wait time)?
I do not advise reducing the parallelism for this permission problem; the whole point of this tool is to be as parallel as possible. My advice at this point is either to try to implement retrying properly (can you share your implementation?) or to move off EMR. In my case I chose not to use AWS EMR and instead to use EC2 instances directly with a Spark standalone cluster on them, because AWS EMR was not working well; you can see how in the distributed guide in the docs. Btw, running this in distributed mode only makes sense if your dataset is larger than a few billion samples. For LAION-400M, a single node is enough to run it in 3 days.
I tried something super basic like this:
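(roughly - the snippet below is just a sketch of the shape of it, with hypothetical names, rather than the exact code):

```python
# Hypothetical sketch: wrap the per-shard download in a bounded retry loop
# instead of printing the error once and moving on.
import time

def download_shard_with_retries(download_shard, shard, max_retries=3, wait_seconds=10):
    """Call download_shard(shard), retrying a few times before giving up."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return download_shard(shard)
        except Exception as err:  # broad on purpose; this is only a sketch
            last_error = err
            print(f"shard failed (attempt {attempt}/{max_retries}): {err}")
            time.sleep(wait_seconds)
    # after max_retries failures, surface the error so the job notices
    raise last_error
```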
Is there a better way to attempt retries? For a single node - do you have recommended configs for the LAION400M download (EC2 instance type, multiprocessing params, etc.)?
Your loop seems ok. So it prints failure 10 times in a row and never succeeds? If that's the case, I'm afraid the only thing to do is really to fix the S3 auth. Maybe your AWS configs are not quite right? For a single node you can use these commands: https://github.com/rom1504/img2dataset/blob/main/dataset_examples/laion400m.md
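Through the Python API that boils down to roughly this (a sketch only; paths and tuning values are illustrative, the linked doc has the authoritative command):

```python
# Rough single-node sketch; paths and tuning values are illustrative, not prescriptive.
from img2dataset import download

download(
    url_list="laion400m-meta/",        # hypothetical folder of metadata parquet files
    input_format="parquet",
    url_col="URL",
    caption_col="TEXT",
    output_format="webdataset",
    output_folder="laion400m-data/",   # hypothetical output location
    image_size=256,
    processes_count=16,                # roughly one per core
    thread_count=128,                  # downloads are IO-bound, so many threads per process
    number_sample_per_shard=10000,
)
```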
I don't think it's an AWS config thing, because eventually the download does start working, and it also works at very low parallelism. I'll also give the Knot resolver a go. Thank you so, so much for all the support you've been providing!
Ok, I just decided to opt for a slower download, as it's currently blocking some other work; not sure what the issue is. If you want to keep this open for resolution, I'll update the thread when I do get around to digging into the root cause.
Huge thanks for all the ideas and support though!
Let's keep it open for now.
So the set of things I'm going to investigate in parallel while the real download goes on in the background is:
While I definitely believe using EC2 directly may help resolve the issues, I want to try my best to see if we can get EMR working so that we can leverage auto-scaling and remove the parallel-ssh steps for future use (doing that for 100 machines, if there's a 100B dataset, doesn't sound fun).
Ok, I figured it out eventually. I don't have a smoking gun with data, but when I applied the fix suggested below, it resolved the issue. Basically my Spark defaults had
Observations: I noticed that the executors would keep dying and coming back up. After a bit of digging I saw that the executors being shut down were handling ~40-50% of the tasks in less than 50 ms.
Digging: I realized that it takes a bit of time between executors spinning up and tasks actually getting allocated to them, and in the Spark logs there were a bunch of "killing executor due to timeout" messages.
Solution:
I tried 1, it worked, then I ran out of time to go back and try 2. Hopefully this helps, but your tool isn't the issue here.
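For reference, the kind of setting this points at looks roughly like the snippet below; this is only a sketch assuming the culprit was Spark's dynamic-allocation idle timeout killing executors before tasks reached them, and the values are illustrative, not the exact ones from my cluster.

```python
# Illustrative only: either keep a fixed set of executors, or give idle executors
# a much longer grace period so they are not killed before tasks are assigned.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("img2dataset-download")
    # option 1: disable dynamic allocation entirely
    .config("spark.dynamicAllocation.enabled", "false")
    # option 2 (if dynamic allocation stays on): raise the idle timeout instead
    .config("spark.dynamicAllocation.executorIdleTimeout", "600s")
    .getOrCreate()
)
```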
Interesting, thanks for the info! Btw, just to give some idea of speeds, I talked with someone who used a bunch of freely provided TPU VMs and was able to download laion2B-multi in 7 hours at around 100k samples/s (using img2dataset).
Did you manage to download what you wanted in a reasonable time?
Yup, we used 10 c4.8xlarge instances and got it done in 30 hours for LAION-400M, with a success rate of ~93% and no DNS/Knot resolvers. I used these settings (probably overly conservative, but I followed the guidance available here):
I also had this weird issue where the Spark job would just show "successfully completed" after the first 5-6 parquets, so I had to wrap it in a loop, and that helped a bunch too.
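The wrapper was basically one download call per input parquet; a minimal sketch of that shape (paths and parameters are illustrative, not my exact settings):

```python
# Illustrative sketch: run one img2dataset call per metadata parquet, so an early
# "successful" Spark exit only affects the current file rather than the whole run.
from pathlib import Path
from img2dataset import download

for parquet_path in sorted(Path("laion400m-meta/").glob("*.parquet")):  # hypothetical dir
    download(
        url_list=str(parquet_path),
        input_format="parquet",
        url_col="URL",
        caption_col="TEXT",
        output_format="webdataset",
        output_folder=f"laion400m-data/{parquet_path.stem}",  # one output dir per parquet
    )
```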
Also - since I resolved this issue on my side and posted the steps to fix it for others going forward, I won't monitor this anymore; please do ping me at my email address if you have any further questions! I'll let you close this out once you're ready to.
You may now rerun the job to get the missing shards, see https://github.com/rom1504/img2dataset#incremental-mode; however, I will also implement a shard retrying feature in a future PR.
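In practice that means rerunning the same call with the same output folder; a minimal sketch, with illustrative paths, assuming the default incremental behaviour:

```python
# Rerunning with the same url_list and output_folder only downloads the shards
# that are still missing; paths here are illustrative.
from img2dataset import download

download(
    url_list="laion400m-meta/",
    input_format="parquet",
    url_col="URL",
    caption_col="TEXT",
    output_format="webdataset",
    output_folder="laion400m-data/",
    incremental_mode="incremental",  # default; "overwrite" would redo everything
)
```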
Shard retrying is implemented as well.
Heya, I was trying to download the LAION400M dataset and noticed that I am not getting the full set of data for some reason.
Any tips on debugging further?
TL;DR - I was expecting ~12M files to be downloaded, but the successes in the *_stats.json files indicate only ~2M files were actually downloaded.
For example - I recently tried to download this dataset in a distributed manner on EMR:
https://deploy.laion.ai/8f83b608504d46bb81708ec86e912220/dataset/part-00000-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet
I applied some light NSFW filtering on it to produce a new parquet
Verified its row count is ~12M samples:
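(the exact check isn't shown here; something along these lines, with an illustrative file name, reads the row count from the parquet metadata without loading the data):

```python
# Illustrative row-count check on the filtered parquet (file name is hypothetical).
import pyarrow.parquet as pq

print(pq.ParquetFile("laion400m_part0_nsfw_filtered.parquet").metadata.num_rows)
```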
Ran the download, and scanned over the output s3 bucket:
Ran this script to get the total count of images downloaded:
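(the original script isn't reproduced above; a sketch of the same idea, summing the "successes" field across the *_stats.json files, assuming they have been synced to a local directory first):

```python
# Illustrative sketch: sum the "successes" field over all *_stats.json files.
import json
from pathlib import Path

total_successes = 0
stats_files = sorted(Path("stats/").glob("*_stats.json"))  # hypothetical local copy
for stats_path in stats_files:
    with open(stats_path) as f:
        stats = json.load(f)
    total_successes += stats.get("successes", 0)

print(f"{len(stats_files)} stats files, {total_successes} images downloaded")
```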
which gave me the following output:
The high error rate here is not of major concern; I was running at a low worker node count for experimentation, so we have a lot of DNS issues (I'll use a Knot resolver later).
I also noticed there were only 270 JSON files produced, but given that each shard should contain 10,000 images, I expected ~1,200 JSON files to be produced. Not sure where this discrepancy is coming from.