
parameter setup in v3 API #14

Merged
bertsky merged 33 commits into new-processor-api on Aug 25, 2024

Conversation

@bertsky (Owner) commented Aug 22, 2024

Addresses the problem with required parameters.

@kba @MehmedGIT could you please review?

- `Processor` init: do not validate, only set `parameter` if present
  (needed to avoid exception from `ParameterValidator` for processors
   with mandatory params in non-processing usage)
- `Processor.parameter`: allow `None`, but validate when set
- `Processor._setup`: validate parameters, then call `Processor.setup` (see the sketch below)
@bertsky (Owner, Author) commented Aug 22, 2024

Unfortunately, by always defining a Processor.parameter regardless of processing or non-processing use, we break implementations like ocrd_tesserocr:
https://github.com/OCR-D/ocrd_tesserocr/blob/dcbd5227f834c219e36d5dc41606a08e85e67a15/ocrd_tesserocr/segment_region.py#L16

We had the same with Processor.output_file_grp checks, which led me to change v3 so this would not immediately break such processors. In this case, I am sure it's only ocrd_tesserocr, which we will migrate immediately, anyway. (But will all installations which update core to v3 then also update ocrd_tesserocr?)

I could still change this PR so that, instead of providing `Processor.parameter == None`, the attribute does not exist at all (`not hasattr(processor, 'parameter')`) until parameters are set explicitly (in get_processor or elsewhere)...

@kba (Collaborator) left a comment

The parameter validation happening on assignment is much better and further reduces the complexity of the constructor. And with the _setup -> validation -> setup mechanism we're certain that everything is properly (re-)loaded when the parameters change - which I still don't think should happen, but if it is being done, we're safe.

Also, I just tested it by making the model parameter of ocrd-kraken-recognize required, and --help works again.

So, LGTM!

Review threads (outdated, resolved) on src/ocrd/processor/base.py (×2), src/ocrd/processor/helpers.py, tests/processor/test_processor.py
bertsky and others added 2 commits August 22, 2024 13:46
Co-authored-by: Konstantin Baierer <kba@users.noreply.github.com>
Co-authored-by: Konstantin Baierer <kba@users.noreply.github.com>
Review thread (outdated, resolved) on src/ocrd/processor/base.py
@MehmedGIT (Collaborator) left a comment

I have not tested whether it works, just peeked over the code. Seems okay, but I am unsure how this affects the processor caching. How about caching only the models used by the processors, and not the entire processor class?

@bertsky (Owner, Author) commented Aug 22, 2024

I am unsure how that affects the processor caching.

I myself am just trying to understand how lru_cache actually works in this case. It is known to require hashable types for the function arguments, but get_cached_processor takes a dict for the parameters, not a frozendict. Where do we test this?

How about caching only the models used by the processors and not the entire processor class?

The problem is that we don't know in core where implementors store anything. We can only observe what goes in (i.e. parameters) and out (i.e. instances with bindings to large memory objects possibly even on GPU).

@MehmedGIT (Collaborator)

Where do we test this?

We do not have separate tests for that. I never tested that part to see what performance we gain from the processor caching.

@bertsky (Owner, Author) commented Aug 22, 2024

but get_cached_processor takes a dict for the parameters, not a frozendict

I was wrong: via freeze_args, it already converts dict to frozendict

BTW, I don't like the fact that config.OCRD_MAX_PROCESSOR_CACHE gets read on module initialization level already. That means I have no chance as a processor implementor to customize this based on what I know about my processor (like models are too large for more than 2 instances). What I would like to have is:

...
from ocrd_utils import config
config.OCRD_MAX_PROCESSOR_CACHE = 2
...
from ocrd import Processor
...

class MyProcessor(Processor):
    ...
    # instances of this class will have an lru_cache size of 2 in get_processor

We do not have separate tests for that.

In 6cd4a34 I now added a simple test for get_processor(... instance_caching=True).

This also contains the changes I deemed necessary for Processor.shutdown – please review!

Unfortunately, I cannot properly test the combination of both: doing something in shutdown and asserting that it was in fact triggered upon uncaching. You can see that I commented that assertion out:

#assert '0' in out.split('\n')

The problem is: I can see those exact events on the console (i.e. I know that the instance was del'ed and its shutdown called), but somehow, because of the weakref.finalize mechanism, I cannot capture that with pytest.

I never tested that part to see what performance we gain from the processor caching.

It cannot trivially be done in a unit test, because there is not going to be any benefit with dummy processors. We need some concrete processor with large (memory-wise) models that take some time to load/setup. And we'd have to run a bunch of them in different parameter/model settings...

@MehmedGIT (Collaborator) commented Aug 23, 2024

BTW, I don't like the fact that config.OCRD_MAX_PROCESSOR_CACHE gets read on module initialization level already. That means I have no chance as a processor implementor to customize this based on what I know about my processor

Right, but that cache was supposed to be used only by the ocrd_network module, and as a processor implementor, you were not supposed to care about it. By default, the maximum amount supported by Python should be used; the environment variable was introduced to provide flexibility. On the other hand, I see why keeping 128 instances is problematic: if that many instances are cached with the same big model but only small differences in the other parameters, that leads to memory peaks, very bad in certain cases with big models! However, with fewer cached instances, there will be more model loads/unloads happening.

The 2 main issues I see are:

  • we cannot cache only the model parameter but need to cache the entire parameter dictionary.
  • there is no good way to generalize how many instances are to be cached from the core side for each specific processor.

What do you think about the following suggestion: Instead of keeping a single cache in the core for all processors, we could keep a list of caches. Each unique processor will have its separate cache. Then for each cache, the core will invoke a method to decide what is the desired max instances number. I am aware, that this still does not fully solve all the issues, but at least we get more flexibility and the implementers could override the default method to return the best number for each specific processor.

EDIT: The get_cached_processor method in the ocrd_network context gets invoked only by the Processing Worker. Since each worker is tied up to a specific processor, each worker has a separate cache. So, it is still perfectly valid to set OCRD_MAX_PROCESSOR_CACHE separately for each worker in their environment. However, the limitation here is that the workers should run in different environments or on different hosts. When running on the same host, the memory consumption will potentially explode.

It cannot trivially be done in a unit test, because there is not going to be any benefit with dummy processors.

Exactly. I rather meant that I have not tested it manually with the ocrd_network module, utilizing different processors/models and comparing the performance with and without processor caching at different numbers of processing worker instances.

The problem is: I can see those exact events on the console (i.e. I know that the instance was deled and its shutdown called), but somehow, because of the weakref.finalize mechanism, I cannot capture that with pytest.

I have no idea other than suggesting to write to a test file instead of stdout/stderr and read from the test file to assert.

@MehmedGIT (Collaborator) left a comment

Good for now to move forward.

@bertsky (Owner, Author) commented Aug 23, 2024

What do you think about the following suggestion: Instead of keeping a single cache in the core for all processors, we could keep a list of caches. Each unique processor will have its separate cache. Then for each cache, the core will invoke a method to decide what is the desired max instances number. I am aware, that this still does not fully solve all the issues, but at least we get more flexibility and the implementers could override the default method to return the best number for each specific processor.

EDIT: The get_cached_processor method in the ocrd_network context gets invoked only by the Processing Worker. Since each worker is tied up to a specific processor, each worker has a separate cache. So, it is still perfectly valid to set OCRD_MAX_PROCESSOR_CACHE separately for each worker in their environment. However, the limitation here is that the workers should run in different environments or on different hosts. When running on the same host, the memory consumption will potentially explode.

Yes, I know it's a separate cache for each worker. But the module-level setting makes it impossible to set this programmatically. Of course, once I have a docker-compose.yml, I can configure such environment variables. But if we are going to generate each docker-compose.yml in ocrd_all, then we would need to devise some customisation/inclusion mechanism first.

Couldn't we use an inline function for get_cached_processor, so we can read the state of config.OCRD_MAX_PROCESSOR_CACHE when instantiating the server (so after module initialisation)?

I have no idea other than suggesting to write to a test file instead of stdout/stderr and read from the test file to assert.

Just tried that. Unfortunately, this does not work either. The shutdown (triggered by the finalizer) gets called no earlier than when pytest leaves the test function, so the file only gets filled after the point where we could read and assert on it.

@bertsky (Owner, Author) commented Aug 23, 2024

Couldn't we use an inline function for get_cached_processor, so we can read the state of config.OCRD_MAX_PROCESSOR_CACHE when instantiating the server (so after module initialisation)?

Oh, that cannot work, of course – it would be in a new cache with every get_processor call.

Perhaps my above example,

...
from ocrd_utils import config
config.OCRD_MAX_PROCESSOR_CACHE = min(config.OCRD_MAX_PROCESSOR_CACHE, 4)
...
# stuff for my processor

is not all wrong: it would just have to be done before importing the processor decorator, right?

@MehmedGIT (Collaborator)

But the module level settings makes it impossible to set this programmatically.

My idea was in fact to be able to set that programmatically. Maybe I should have gone into more detail. What I am suggesting is getting rid of the lru_cache decorator on get_cached_processor and writing our own wrapper around functools.lru_cache(maxsize=128, typed=False). The max size to be passed will then be decided at run-time based on ProcessorClass.get_max_cache_size(). By default, there will be a get_max_cache_size() method in core returning 128 or config.OCRD_MAX_PROCESSOR_CACHE. That method can then be overridden on the Processor level to return the "ideal" value. However, to prevent potential issues, we would probably need a dictionary/list of caches in core, not a single cache: one cache for each unique processor.

Example:

{
    "ocrd-cis-ocropy-recognize": CacheInstance,
    "ocrd-cis-ocropy-binarize": CacheInstance,
    "ocrd-kraken-recognize": CacheInstance,
    ...
}

Each CacheInstance's max size will depend on ProcessorClass.get_max_cache_size(). Keep in mind that my suggestion is only an idea and I have not implemented something like that before. Not sure how much flexibility we can have with functools.lru_cache.

@MehmedGIT (Collaborator)

Oh, that cannot work, of course – it would be in a new cache with every get_processor call.

Yes.

is not all wrong: it would just have to be done before importing the processor decorator, right?

I think as long as we use the functools.lru_cache as a decorator for get_cached_processor, we will face all kinds of limitations. A different strategy is needed.

@bertsky (Owner, Author) commented Aug 23, 2024

getting rid of the lru_cache decorator on get_cached_processor and writing our wrapper for functools.lru_cache(maxsize=128, typed=False). The max size to be passed will then be decided at run-time based on the ProcessorClass.get_max_cache_size().

okay, but I still don't understand where we should wrap then. The storage can only be on a module, in a module global, or in a class. I fail to see any class we could use for that (and wrapping lru_cache in instances takes extra steps due to GC interference). I managed to get it working with a module-level global, but that's ugly...

@MehmedGIT (Collaborator)

okay, but I still don't understand where we should wrap then. The storage can only be on a module, in a module global, or in a class. I fail to see any class we could use for that (and wrapping lru_cache in instances takes extra steps due to GC interference). I managed to get it working with a module-level global, but that's ugly...

I also do not know. I took over the cache implementation as it is from OCR-D#884. If the current cache implementation is painful to adapt to v3 and requires globals and various hacks, we could also consider completely removing it.

@bertsky (Owner, Author) commented Aug 23, 2024

@MehmedGIT well, it is a valuable functionality as it is, and for me it's not so much a matter of adapting to v3 but taking the chance to improve it a little. See last 3 commits for said attempt via hacking global module storage. Sorry – I don't know any better.

@MehmedGIT (Collaborator) commented Aug 23, 2024

@MehmedGIT well, it is a valuable functionality as it is, and for me it's not so much a matter of adapting to v3 but taking the chance to improve it a little. See last 3 commits for said attempt via hacking global module storage. Sorry – I don't know any better.

Thanks. I will have a look and test with the ocrd_network module. Please do not hold up development by waiting for me; merge this PR if required. I can adapt afterwards if I find a better solution.

bertsky and others added 16 commits August 23, 2024 17:54
- `metadata`, `executable`, `ocrd_tool`, `version`: use cached_property
  instead of internal ad-hoc attributes
- rename `metadata_location` → `metadata_filename`, add cached_property
  chain `metadata_location`, `metadata_rawdict` used by `metadata` to
  make it easy to override
  - `metadata_filename` if just the path of `ocrd-tool.json` in the package deviates
  - `metadata_location` if the `ocrd-tool.json` is not distributed via Python pkg
  - `metadata_rawdict` if the `ocrd-tool.json` is not in a file
  - `metadata` if the validated, expanded `ocrd-tool.json` is somewhere else (see the sketch after this list)
- `DummyProcessor`: just override `Processor.metadata_filename`
- processor tests: adapt to new properties and `verify` enforcing cardinality
@bertsky (Owner, Author) commented Aug 24, 2024

I did some more refactoring and cleanup.

Notably, because we always default-expand the schema, we can now enforce the fileGrp cardinalities already – see 5819c81

Also, I cherry-picked from #16 and exported the full chain of metadata loading, so it can be overridden at exactly where implementors like – see 7966057

I also made use of the fact that in bashlib's ocrd__wrap we already know where the ocrd-tool.json file is and what the executable is called: by passing this info into ocrd bashlib input-files, we can even better use the standard processor class and decorator.

Since no-one commented on the problem with breaking Processor.parameter in v2, I consider this irrelevant and will keep it. (So just ocrd_tesserocr must be updated+installed immediately.)

This is ready AFAICS. Does anyone have an opinion when this can/should be merged, so we can make another prerelease?

@@ -129,4 +125,31 @@ def process_page_file(self, *input_files):
print(f"[{field}]='{value}'", end=' ')
output_file_id = make_file_id(input_files[0], kwargs['output_file_grp'])
print(f"[outputFileId]='{output_file_id}'")
ocrd_cli_wrap_processor(BashlibProcessor, **kwargs)
if ocrd_tool and executable:
class FullBashlibProcessor(BashlibProcessor):
Collaborator comment on this diff:

clever

@kba (Collaborator) commented Aug 24, 2024

I did some more refactoring and cleanup.

Notably, because we always default-expand the schema, we can now enforce the fileGrp cardinalities already – see 5819c81

👍

Also, I cherry-picked from #16 and exported the full chain of metadata loading, so it can be overridden at exactly where implementors like – see 7966057

Very thorough but I needed #17 to get eynollah to work

I also made use of the fact that in bashlib's ocrd__wrap we already know where the ocrd-tool.json file is and what the executable is called: by passing this info into ocrd bashlib input-files, we can even better use the standard processor class and decorator.

👍

Since no-one commented on the problem with breaking Processor.parameter in v2, I consider this irrelevant and will keep it. (So just ocrd_tesserocr must be updated+installed immediately.)

Yes, if there is breakage we can fix it quickly.

This is ready AFAICS. Does anyone have an opinion when this can/should be merged, so we can make another prerelease?

I could make a b1 right away but would like to have #17 in there, since I'm working on eynollah now which needs it.

kba added a commit to qurator-spk/eynollah that referenced this pull request Aug 24, 2024
processor.metadata_location: use self.__module__ not __package__
@bertsky bertsky merged commit b53724e into new-processor-api Aug 25, 2024
1 check passed
@bertsky bertsky deleted the new-processor-api-parameter-setup branch September 23, 2024 15:47
@bertsky bertsky restored the new-processor-api-parameter-setup branch September 25, 2024 16:01