parameter setup in v3 API #14
Conversation
- `Processor` init: do not validate, only set `parameter` if present (needed to avoid exception from `ParameterValidator` for processors with mandatory params in non-processing usage)
- `Processor.parameter`: allow `None`, but validate when set
- `Processor._setup`: validate parameters, then call `Processor.setup`
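A minimal sketch of that flow (the attribute names and the exact `ParameterValidator` call here are inferred from the description above, not copied from core):

```python
from ocrd_validators import ParameterValidator

class Processor:
    def __init__(self, ocrd_tool, parameter=None):
        self.ocrd_tool = ocrd_tool
        self._parameter = None
        # do not validate here - only set if present, so that processors
        # with mandatory parameters can still be instantiated e.g. for --help
        if parameter is not None:
            self.parameter = parameter  # goes through the validating setter

    @property
    def parameter(self):
        return self._parameter  # may be None in non-processing usage

    @parameter.setter
    def parameter(self, parameter):
        # validate on every (re-)assignment
        report = ParameterValidator(self.ocrd_tool).validate(parameter)
        if not report.is_valid:
            raise ValueError(report.errors)
        self._parameter = parameter

    def _setup(self):
        # before processing: ensure parameters are valid, then let the
        # implementation (re-)load its models etc. in setup()
        if self.parameter is None:
            self.parameter = {}
        self.setup()

    def setup(self):
        pass  # overridden by concrete processors
```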
Unfortunately, by always defining a … We had the same with … I could still change this PR to not provide …
The parameter validation happening on assignment is much better and further reduces the complexity of the constructor. And with the `_setup` → validation → `setup` mechanism we're certain that everything is properly (re-)loaded when the parameters change – which I still don't think should happen, but if it is being done, we're safe.

Also just tested it by making `model` of ocrd-kraken-recognize required, and `--help` works again.
So, LGTM!
Co-authored-by: Konstantin Baierer <kba@users.noreply.github.com>
…to plain dict in getter for serialization etc
I have not tested whether it works, just peeked over the code. Seems okay, but I am unsure how this affects the processor caching. How about caching only the models used by the processors, and not the entire processor class?
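For illustration, that alternative could look roughly like this (a hypothetical sketch, nothing from the actual codebase):

```python
from functools import lru_cache

# cache only the expensive artefacts (models), keyed by path,
# instead of whole processor instances:
@lru_cache(maxsize=4)
def load_model(model_path: str) -> bytes:
    # stand-in for e.g. a kraken model load
    with open(model_path, 'rb') as f:
        return f.read()

class MyProcessor:
    def __init__(self, parameter: dict):
        self.parameter = parameter  # processors stay cheap to re-instantiate

    def setup(self):
        # the heavy model object is shared across instances via the cache
        self.model = load_model(self.parameter['model'])
```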
I myself am just trying to understand how …
The problem is that we don't know in core where implementors store anything. We can only observe what goes in (i.e. parameters) and out (i.e. instances with bindings to large memory objects, possibly even on GPU).
We do not have separate tests for that. I never tested that part to see what performance we gain from the processor caching.
…rameter re-assignment)
I was wrong: via … BTW, I don't like the fact that …

```python
...
from ocrd_utils import config
config.OCRD_MAX_PROCESSOR_CACHE = 2
...
from ocrd import Processor
...
class MyProcessor(Processor):
    ...
    # instances of this class will have an lru_cache size of 2 in get_processor
```
In 6cd4a34 I now added a simple test for … This also contains the changes I deemed necessary for … Unfortunately, I cannot properly test the combination of both: doing something in `tests/processor/test_processor.py` line 123 (in 6cd4a34) …
The problem is: I can see those exact events on the console (i.e. I know that the instance was …)
It cannot trivially be done in a unit test, because there is not going to be any benefit with dummy processors. We need some concrete processor with large (memory-wise) models that take some time to load/setup. And we'd have to run a bunch of them in different parameter/model settings...
Right, but that cache was supposed to be used only by the … The two main issues I see are: …
What do you think about the following suggestion: instead of keeping a single cache in core for all processors, we could keep a list of caches. Each unique processor would have its separate cache. Then for each cache, core would invoke a method to decide the desired maximum number of instances. I am aware that this still does not fully solve all the issues, but at least we get more flexibility, and implementers could override the default method to return the best number for each specific processor. EDIT: The …
Exactly. I rather meant … I have not tested it manually with the …
I have no idea other than suggesting to write to a test file instead of stdout/stderr, and then read from the test file to assert.
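A self-contained sketch of that file-based assertion (logger name, event message, and the `get_instance` stand-in are all hypothetical):

```python
import logging

log = logging.getLogger('myproc')  # hypothetical logger name

def get_instance(params_key, _cache={}):
    # stand-in for get_processor: logs a 'setup' event only on cache misses
    if params_key not in _cache:
        log.info('setup %s', params_key)
        _cache[params_key] = object()
    return _cache[params_key]

def test_instance_caching_events(tmp_path):
    logfile = tmp_path / 'events.log'
    handler = logging.FileHandler(logfile)
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    try:
        get_instance('model=a')
        get_instance('model=a')  # second call should hit the cache
    finally:
        log.removeHandler(handler)
        handler.close()
    # exactly one 'setup' line in the file proves the instance was reused
    assert logfile.read_text().count('setup') == 1
```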
Good for now to move forward.
Yes, I know it's a separate cache for each worker. But the module-level setting makes it impossible to set this programmatically. Of course, once I have a docker-compose.yml I can configure such environment variables. But if we generate each docker-compose.yml in ocrd_all, then we would need to devise some customisation/inclusion mechanism first. Couldn't we use an inline function for …
Just tried that. Unfortunately, this does not work either. The …
Oh, that cannot work, of course – it would be in a new cache with every … Perhaps my above example,

```python
...
from ocrd_utils import config
config.OCRD_MAX_PROCESSOR_CACHE = min(config.OCRD_MAX_PROCESSOR_CACHE, 4)
...
# stuff for my processor
```

is not all wrong: it would just have to be done before importing the processor decorator, right?
My idea was in fact to be able to set that programmatically. Maybe I should have gone into more detail. What I am suggesting is getting rid of the … Example:

```python
{
    "ocrd-cis-ocropy-recognize": CacheInstance,
    "ocrd-cis-ocropy-binarize": CacheInstance,
    "ocrd-kraken-recognize": CacheInstance,
    ...
}
```

Each `CacheInstance`'s max size will depend on the …
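A rough sketch of that registry idea (all names are hypothetical, including `CacheInstance` and the per-class size hook):

```python
from collections import OrderedDict

class CacheInstance:
    """Tiny LRU cache for the instances of one processor executable."""
    def __init__(self, maxsize):
        self.maxsize = maxsize
        self._store = OrderedDict()  # parameter hash -> processor instance

    def get(self, key, factory):
        if key in self._store:
            self._store.move_to_end(key)  # mark as recently used
            return self._store[key]
        instance = factory()
        self._store[key] = instance
        if len(self._store) > self.maxsize:
            self._store.popitem(last=False)  # evict least recently used
        return instance

_caches = {}  # executable name -> CacheInstance

def get_cache(processor_class):
    name = processor_class.__name__  # or the executable name
    if name not in _caches:
        # implementers could override this hook per processor class
        maxsize = getattr(processor_class, 'max_cached_instances', lambda: 2)()
        _caches[name] = CacheInstance(maxsize)
    return _caches[name]
```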
Yes.
I think as long as we use the …
Okay, but I still don't understand where we should wrap then. The storage can only be on a module, in a module global, or in a class. I fail to see any class we could use for that (and wrapping …)
I also do not know. I took over the cache implementation as it is from OCR-D#884. If the current cache implementation is painful to adapt to v3 and requires globals and various hacks, we could also consider completely removing it.
@MehmedGIT well, it is a valuable functionality as it is, and for me it's not so much a matter of adapting to v3 but taking the chance to improve it a little. See the last 3 commits for said attempt via hacking global module storage. Sorry – I don't know any better.
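For context, a module-global variant along those lines might look like this (a sketch under assumed names; the actual commits may differ):

```python
import json
from functools import lru_cache

_factories = {}  # module-global storage: processor class -> cached factory

def get_processor(processor_class, parameter):
    if processor_class not in _factories:
        # pick the per-class cache size, e.g. from OCRD_MAX_PROCESSOR_CACHE
        maxsize = getattr(processor_class, 'max_instances', 2)
        @lru_cache(maxsize=maxsize)
        def factory(parameter_json, cls=processor_class):
            return cls(parameter=json.loads(parameter_json))
        _factories[processor_class] = factory
    # serialize to a hashable key so lru_cache can use it
    return _factories[processor_class](json.dumps(parameter, sort_keys=True))
```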
Thanks. I will have a look and test with the …
…ocrd-tool.json is found
…ts from ocrd-tool.json)
- `metadata`, `executable`, `ocrd_tool`, `version`: use cached_property instead of internal ad-hoc attributes
- rename `metadata_location` → `metadata_filename`, add cached_property chain `metadata_location`, `metadata_rawdict` used by `metadata` to make it easy to override (sketched below):
  - `metadata_filename` if just the path of `ocrd-tool.json` in the package deviates
  - `metadata_location` if the `ocrd-tool.json` is not distributed via Python pkg
  - `metadata_rawdict` if the `ocrd-tool.json` is not in a file
  - `metadata` if the validated, expanded `ocrd-tool.json` is somewhere else
- `DummyProcessor`: just override `Processor.metadata_filename`
- processor tests: adapt to new properties and `verify` enforcing cardinality
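To illustrate the override chain (a simplified sketch; only the property names come from the PR, the bodies are assumptions):

```python
import json
import sys
from functools import cached_property
from pathlib import Path

class Processor:
    @cached_property
    def metadata_filename(self):
        # override if only the path of ocrd-tool.json inside the package deviates
        return 'ocrd-tool.json'

    @cached_property
    def metadata_location(self):
        # override if ocrd-tool.json is not distributed via the Python package;
        # uses self.__module__ (not __package__) to find the containing module
        return Path(sys.modules[self.__module__].__file__).parent / self.metadata_filename

    @cached_property
    def metadata_rawdict(self):
        # override if ocrd-tool.json does not live in a file at all
        return json.loads(self.metadata_location.read_text())

    @cached_property
    def metadata(self):
        # override if the validated, expanded ocrd-tool.json comes from elsewhere
        # (validation and default-expansion omitted in this sketch)
        return self.metadata_rawdict
```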
I did some more refactoring and cleanup. Notably, because we always default-expand the schema, we can now enforce the fileGrp cardinalities already – see 5819c81.

Also, I cherry-picked from #16 and exported the full chain of … I also made use of the fact that in bashlib's … Since no one commented on the problem with breaking …

This is ready AFAICS. Does anyone have an opinion on when this can/should be merged, so we can make another prerelease?
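For illustration, enforcing the fileGrp cardinality in `verify` could boil down to something like this (the schema property name and the negative-means-unbounded convention are assumptions):

```python
class Processor:
    # ... (other members elided)
    def verify(self):
        # input_file_grp is comma-separated; the default-expanded ocrd-tool.json
        # provides the allowed cardinality as an int or a [min, max] pair
        grps = self.input_file_grp.split(',')
        card = self.ocrd_tool.get('input_file_grp_cardinality', 1)
        lo, hi = (card, card) if isinstance(card, int) else card
        assert len(grps) >= lo and (hi < 0 or len(grps) <= hi), \
            f'expected {card} input file group(s), got {len(grps)}'
        return True
```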
```
@@ -129,4 +125,31 @@ def process_page_file(self, *input_files):
        print(f"[{field}]='{value}'", end=' ')
    output_file_id = make_file_id(input_files[0], kwargs['output_file_grp'])
    print(f"[outputFileId]='{output_file_id}'")
    ocrd_cli_wrap_processor(BashlibProcessor, **kwargs)
    if ocrd_tool and executable:
        class FullBashlibProcessor(BashlibProcessor):
```
clever
👍
Very thorough, but I needed #17 to get eynollah to work.
👍
Yes, if there is breakage we can fix it quickly.
I could make a …
processor.metadata_location: use self.__module__ not __package__
addresses the problem with required parameters
@kba @MehmedGIT could you please review?