Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes needed to be able to use doctr on AWS Lambda #1017

Merged
merged 14 commits into from
Sep 1, 2022

Conversation

mtvch
Copy link
Contributor

@mtvch mtvch commented Aug 10, 2022

I would like to use doctr on AWS Lambda, but there is a restriction: it gives a function write access only under /tmp folder.
I have found two places where doctr claims write access outside this folder:

  1. While caching downloaded models.
  2. By using ThreadPool from python's multiprocessing package: it uses /dev/shm folder for shared memory which is not present on AWS Lambda.

That's why I think it would be nice to add an option for user to decide where models should be cached and whether to use multiprocessing or not.

@felixdittrich92
Copy link
Contributor

Hi @mtvch 👋,
that's a really interesting use case thanks for opening 👍

Some points:

  • we should add a short test that it works as expected:

https://github.com/mindee/doctr/blob/main/tests/common/test_utils_multithreading.py
create test_utils_data.py in tests/common folder and add a short test

  • we need to document this for other users i would suggest to add a page to the documentation in using_doctr maybe something like 'Run on AWS' with a subpoint AWS Lambda

About the casting to list in multithread_exec : #931

@felixdittrich92 felixdittrich92 self-assigned this Aug 11, 2022
@felixdittrich92 felixdittrich92 added this to the 0.6.0 milestone Aug 11, 2022
@felixdittrich92 felixdittrich92 added type: enhancement Improvement module: utils Related to doctr.utils ext: tests Related to tests folder ext: docs Related to docs folder labels Aug 11, 2022
Copy link
Collaborator

@odulcy-mindee odulcy-mindee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @mtvch, thank you very much for opening this PR and to bring us this use case !

I would suggest some changes in the code and also I agree with @felixdittrich92 on the parts that he highlighted

doctr/utils/data.py Outdated Show resolved Hide resolved
doctr/utils/multithreading.py Outdated Show resolved Hide resolved
@frgfm frgfm removed the ext: docs Related to docs folder label Aug 11, 2022
frgfm
frgfm previously requested changes Aug 11, 2022
Copy link
Collaborator

@frgfm frgfm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

I agree with most of what has already been said! A few points:

  • we need some unittests for those
  • the docstring of each function should reflect any behaviour change (since now env var will modify/override the behaviour)
  • the casting of the multiprocessing output is being handled by another PR, it's a dangerous change that should be taken care of outside of here :)

Cheers ✌️

doctr/utils/data.py Outdated Show resolved Hide resolved
doctr/utils/multithreading.py Outdated Show resolved Hide resolved
doctr/utils/multithreading.py Outdated Show resolved Hide resolved
@mtvch
Copy link
Contributor Author

mtvch commented Aug 11, 2022

Thank you all for responses!
Can I open another PR about changing return type of multithread_exec function to generator?
As far as I have checked every place where it's called expects a list as a result, so this change looks easy to me

@felixdittrich92
Copy link
Contributor

Thank you all for responses! Can I open another PR about changing return type of multithread_exec function to generator? As far as I have checked every place where it's called expects a list as a result, so this change looks easy to me

Hi @mtvch 👋 , there is currently an open PR to do this #931
I have asked if he want to continue otherwise yes of course feel free to open another PR to solve this 👍

@mtvch
Copy link
Contributor Author

mtvch commented Aug 15, 2022

Hello, @felixdittrich92!
I have updated this pr and opened another pr #1019 about multithread_exec return type. Could you, please, review them?

@codecov
Copy link

codecov bot commented Aug 15, 2022

Codecov Report

Merging #1017 (2d236ac) into main (61d0f1c) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1017   +/-   ##
=======================================
  Coverage   94.93%   94.94%           
=======================================
  Files         135      135           
  Lines        5590     5599    +9     
=======================================
+ Hits         5307     5316    +9     
  Misses        283      283           
Flag Coverage Δ
unittests 94.94% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
doctr/datasets/datasets/base.py 98.00% <100.00%> (ø)
doctr/utils/data.py 90.00% <100.00%> (+1.32%) ⬆️
doctr/utils/multithreading.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Copy link
Contributor

@felixdittrich92 felixdittrich92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for applying the changes 👍
I have added some minor comments :)

for mypy : make quality :)

@fharper can you review the minor docs change please ?

doctr/utils/data.py Outdated Show resolved Hide resolved
doctr/utils/data.py Outdated Show resolved Hide resolved
doctr/utils/data.py Show resolved Hide resolved
doctr/utils/data.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@frgfm frgfm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the last commits! A few comments left :)

docs/source/using_doctr/running_on_aws.rst Outdated Show resolved Hide resolved
docs/source/using_doctr/running_on_aws.rst Show resolved Hide resolved
doctr/utils/data.py Outdated Show resolved Hide resolved
doctr/utils/data.py Outdated Show resolved Hide resolved
doctr/utils/data.py Show resolved Hide resolved
doctr/utils/multithreading.py Outdated Show resolved Hide resolved
doctr/utils/multithreading.py Outdated Show resolved Hide resolved
@felixdittrich92 felixdittrich92 marked this pull request as draft August 24, 2022 07:19
@mtvch mtvch marked this pull request as ready for review August 29, 2022 09:15
@felixdittrich92
Copy link
Contributor

Hi @mtvch 👋 ,
I think you missed some suggestions from above would be great if you can add the missing ones :) 👍

@mtvch
Copy link
Contributor Author

mtvch commented Aug 30, 2022

Hello, @felixdittrich92!
Excuse me, but, honestly, I don't know what I missed(

There are two things I didn't do as you suggested:

  1. I didn't use os.access function, because it doesn't work if directory does not exist (returning False even if you can create directory).
  2. I didn't write test for exactly downloading files, instead I used mocks and covered every line of this pr with tests.

Did you mean these two? If not, please, could you tell me what suggestions I missed and I will implement them

@mtvch
Copy link
Contributor Author

mtvch commented Aug 31, 2022

I have found one more thing I couldn't implement: using os.environ.get with default value in a second argument.
I decided to stick with is None check, because mypy was failing.
It seems that os.environ.get has Optional[str] specification even if I provide default value in a second argument.
image

Copy link
Contributor

@felixdittrich92 felixdittrich92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mtvch thanks for applying all the requested changes i have added some last minor points then we are good to go 🤗

doctr/datasets/datasets/base.py Outdated Show resolved Hide resolved
doctr/utils/data.py Outdated Show resolved Hide resolved
doctr/utils/data.py Outdated Show resolved Hide resolved
@mtvch
Copy link
Contributor Author

mtvch commented Sep 1, 2022

@mtvch thanks for applying all the requested changes i have added some last minor points then we are good to go hugs

@felixdittrich92
Thank you for your patience and another review!

@felixdittrich92 felixdittrich92 dismissed stale reviews from odulcy-mindee and frgfm September 1, 2022 09:30

merge

Copy link
Contributor

@felixdittrich92 felixdittrich92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the updates. LGTM 👍

@felixdittrich92 felixdittrich92 merged commit ea19161 into mindee:main Sep 1, 2022
@felixdittrich92 felixdittrich92 mentioned this pull request Sep 26, 2022
85 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ext: tests Related to tests folder module: utils Related to doctr.utils type: enhancement Improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants