
parameter setup in v3 API #14

Merged
bertsky merged 33 commits into new-processor-api on Aug 25, 2024

Conversation

@bertsky (Owner) commented Aug 22, 2024

Addresses the problem with required parameters.

@kba @MehmedGIT could you please review?

- `Processor` init: do not validate, only set `parameter` if present
  (needed to avoid exception from `ParameterValidator` for processors
   with mandatory params in non-processing usage)
- `Processor.parameter`: allow `None`, but validate when set
- `Processor._setup`: validate parameters, then call `Processor.setup` (see the sketch below)
@bertsky (Owner, Author) commented Aug 22, 2024

Unfortunately, by always defining a Processor.parameter regardless of processing or non-processing use, we break implementations like ocrd_tesserocr:
https://github.com/OCR-D/ocrd_tesserocr/blob/dcbd5227f834c219e36d5dc41606a08e85e67a15/ocrd_tesserocr/segment_region.py#L16

We had the same with Processor.output_file_grp checks, which led me to change v3 so this would not immediately break such processors. In this case, I am sure it's only ocrd_tesserocr, which we will migrate immediately, anyway. (But will all installations which update core to v3 then also update ocrd_tesserocr?)

I could still change this PR so that, instead of providing `Processor.parameter == None`, the attribute does not exist at all (`not hasattr(processor, 'parameter')`) until parameters are set explicitly (in get_processor or elsewhere)...

@kba (Collaborator) left a comment

The parameter validation happening on assignment is much better and further reduces the complexity of the constructor. And with the _setup -> validation -> setup mechanism we're certain that everything is properly (re-)loaded when the parameters change - which I still don't think should happen, but if it is being done, we're safe.

Also, I just tested it by making the model parameter of ocrd-kraken-recognize required, and --help works again.

So, LGTM!

Review threads (outdated, resolved) on src/ocrd/processor/base.py (×2), src/ocrd/processor/helpers.py, tests/processor/test_processor.py
bertsky and others added 2 commits August 22, 2024 13:46
Co-authored-by: Konstantin Baierer <kba@users.noreply.github.com>
Co-authored-by: Konstantin Baierer <kba@users.noreply.github.com>
Review thread (outdated, resolved) on src/ocrd/processor/base.py
@MehmedGIT (Collaborator) left a comment

I have not tested whether it works, just peeked over the code. Seems okay, but I am unsure how this affects the processor caching. How about caching only the models used by the processors, and not the entire processor class?

@bertsky (Owner, Author) commented Aug 22, 2024

I am unsure how that affects the processor caching.

I myself am just trying to understand how lru_cache actually works in this case. It is known to require hashable types for the function arguments, but get_cached_processor takes a dict for the parameters, not a frozendict. Where do we test this?

How about caching only the models used by the processors and not the entire processor class?

The problem is that we don't know in core where implementors store anything. We can only observe what goes in (i.e. parameters) and out (i.e. instances with bindings to large memory objects possibly even on GPU).

@MehmedGIT (Collaborator)

Where do we test this?

We do not have separate tests for that. I never tested that part to see what performance we gain from the processor caching.

@bertsky (Owner, Author) commented Aug 22, 2024

but get_cached_processor takes a dict for the parameters, not a frozendict

I was wrong: via freeze_args, it already converts dict to frozendict

BTW, I don't like the fact that config.OCRD_MAX_PROCESSOR_CACHE gets read on module initialization level already. That means I have no chance as a processor implementor to customize this based on what I know about my processor (like models are too large for more than 2 instances). What I would like to have is:

...
from ocrd_utils import config
config.OCRD_MAX_PROCESSOR_CACHE = 2
...
from ocrd import Processor
...

class MyProcessor(Processor):
    ...
    # instances of this class will have an lru_cache size of 2 in get_processor

We do not have separate tests for that.

In 6cd4a34 I now added a simple test for get_processor(... instance_caching=True).

This also contains the changes I deemed necessary for Processor.shutdown – please review!

Unfortunately, I cannot properly test the combination of both: doing something in shutdown and asserting that it was in fact triggered upon uncaching. You can see that I commented that assertion out:

#assert '0' in out.split('\n')

The problem is: I can see those exact events on the console (i.e. I know that the instance was del'ed and its shutdown called), but somehow, because of the weakref.finalize mechanism, I cannot capture that with pytest.

I never tested that part to see what performance we gain from the processor caching.

It cannot trivially be done in a unit test, because there is not going to be any benefit with dummy processors. We need some concrete processor with large (memory-wise) models that take some time to load/setup. And we'd have to run a bunch of them in different parameter/model settings...

@MehmedGIT (Collaborator) commented Aug 23, 2024

BTW, I don't like the fact that config.OCRD_MAX_PROCESSOR_CACHE gets read on module initialization level already. That means I have no chance as a processor implementor to customize this based on what I know about my processor

Right, but that cache was supposed to be used only by the ocrd_network module, and as a processor implementor, you were not supposed to care about it. By default, the maximum amount supported by Python should be used; the environment variable was introduced to provide flexibility. On the other hand, I see why keeping 128 instances is problematic: if that many instances are cached with the same big model but only small differences in the other parameters, that leads to memory peaks, very bad in certain cases with big models! However, with fewer cached instances, there will be more model loads/unloads happening.

The 2 main issues I see are:

  • we cannot cache only the model parameter but need to cache the entire parameter dictionary.
  • there is no good way to generalize how many instances are to be cached from the core side for each specific processor.

What do you think about the following suggestion: Instead of keeping a single cache in the core for all processors, we could keep a list of caches. Each unique processor will have its separate cache. Then for each cache, the core will invoke a method to decide what is the desired max instances number. I am aware, that this still does not fully solve all the issues, but at least we get more flexibility and the implementers could override the default method to return the best number for each specific processor.

EDIT: The get_cached_processor method in the ocrd_network context gets invoked only by the Processing Worker. Since each worker is tied up to a specific processor, each worker has a separate cache. So, it is still perfectly valid to set OCRD_MAX_PROCESSOR_CACHE separately for each worker in their environment. However, the limitation here is that the workers should run in different environments or on different hosts. When running on the same host, the memory consumption will potentially explode.

It cannot trivially be done in a unit test, because there is not going to be any benefit with dummy processors.

Exactly. I rather meant that I have not tested it manually with the ocrd_network module, utilizing different processors/models and comparing the performance with and without processor caching at different numbers of processing worker instances.

The problem is: I can see those exact events on the console (i.e. I know that the instance was deled and its shutdown called), but somehow, because of the weakref.finalize mechanism, I cannot capture that with pytest.

I have no idea other than suggesting to write to a test file instead of stdout/stderr and read from the test file to assert.

@MehmedGIT (Collaborator) left a comment

Good for now to move forward.

@bertsky (Owner, Author) commented Aug 23, 2024

What do you think about the following suggestion: Instead of keeping a single cache in the core for all processors, we could keep a list of caches. Each unique processor will have its separate cache. Then for each cache, the core will invoke a method to decide what is the desired max instances number. I am aware, that this still does not fully solve all the issues, but at least we get more flexibility and the implementers could override the default method to return the best number for each specific processor.

EDIT: The get_cached_processor method in the ocrd_network context gets invoked only by the Processing Worker. Since each worker is tied up to a specific processor, each worker has a separate cache. So, it is still perfectly valid to set OCRD_MAX_PROCESSOR_CACHE separately for each worker in their environment. However, the limitation here is that the workers should run in different environments or on different hosts. When running on the same host, the memory consumption will potentially explode.

Yes, I know it's a separate cache for each worker. But the module-level setting makes it impossible to set this programmatically. Of course, once I have a docker-compose.yml, I can configure such environment variables. But if we are going to generate each docker-compose.yml in ocrd_all, then we would need to devise some customisation/inclusion mechanism first.

Couldn't we use an inline function for get_cached_processor, so we can read the state of config.OCRD_MAX_PROCESSOR_CACHE when instantiating the server (so after module initialisation)?

I have no idea other than suggesting to write to a test file instead of stdout/stderr and read from the test file to assert.

Just tried that. Unfortunately, this does not work either. The shutdown (triggered by the finalizer) gets called no earlier than when pytest leaves the test function, so the file only gets filled after the point where we could read and assert on it.

@bertsky (Owner, Author) commented Aug 23, 2024

Couldn't we use an inline function for get_cached_processor, so we can read the state of config.OCRD_MAX_PROCESSOR_CACHE when instantiating the server (so after module initialisation)?

Oh, that cannot work, of course – it would be in a new cache with every get_processor call.

Perhaps my above example,

...
from ocrd_utils import config
config.OCRD_MAX_PROCESSOR_CACHE = min(config.OCRD_MAX_PROCESSOR_CACHE, 4)
...
# stuff for my processor

is not all wrong: it would just have to be done before importing the processor decorator, right?

@MehmedGIT (Collaborator)

But the module level settings makes it impossible to set this programmatically.

My idea was in fact to be able to set that programmatically. Maybe I should have gone into more detail. What I am suggesting is getting rid of the lru_cache decorator on get_cached_processor and writing our own wrapper around functools.lru_cache(maxsize=128, typed=False). The max size to be passed will then be decided at run-time based on ProcessorClass.get_max_cache_size(). By default, there will be a get_max_cache_size() method in core returning 128 or config.OCRD_MAX_PROCESSOR_CACHE. That method can then be overridden on the Processor level to return the "ideal" value. However, to prevent potential issues, we would probably need a dictionary/list of caches in core, not a single cache: one cache for each unique processor.

Example:

{
    "ocrd-cis-ocropy-recognize": CacheInstance,
    "ocrd-cis-ocropy-binarize": CacheInstance,
    "ocrd-kraken-recognize": CacheInstance,
    ...
}

Each CacheInstance's max size will depend on ProcessorClass.get_max_cache_size(). Keep in mind that my suggestion is only an idea and I have not implemented something like that before. Not sure how much flexibility we can have with functools.lru_cache.

@MehmedGIT (Collaborator)

Oh, that cannot work, of course – it would be in a new cache with every get_processor call.

Yes.

is not all wrong: it would just have to be done before importing the processor decorator, right?

I think as long as we use the functools.lru_cache as a decorator for get_cached_processor, we will face all kinds of limitations. A different strategy is needed.

@bertsky (Owner, Author) commented Aug 23, 2024

getting rid of the lru_cache decorator on get_cached_processor and writing our wrapper for functools.lru_cache(maxsize=128, typed=False). The max size to be passed will then be decided at run-time based on the ProcessorClass.get_max_cache_size().

okay, but I still don't understand where we should wrap then. The storage can only be on a module, in a module global, or in a class. I fail to see any class we could use for that (and wrapping lru_cache in instances takes extra steps due to GC interference). I managed to get it working with a module-level global, but that's ugly...

@MehmedGIT (Collaborator)

okay, but I still don't understand where we should wrap then. The storage can only be on a module, in a module global, or in a class. I fail to see any class we could use for that (and wrapping lru_cache in instances takes extra steps due to GC interference). I managed to get it working with a module-level global, but that's ugly...

I also do not know. I took over the cache implementation as it is from OCR-D#884. If the current cache implementation is painful to adapt to v3 and requires globals and various hacks, we could also consider completely removing it.

@bertsky (Owner, Author) commented Aug 23, 2024

@MehmedGIT well, it is a valuable functionality as it is, and for me it's not so much a matter of adapting to v3 but taking the chance to improve it a little. See last 3 commits for said attempt via hacking global module storage. Sorry – I don't know any better.

@MehmedGIT (Collaborator) commented Aug 23, 2024

@MehmedGIT well, it is a valuable functionality as it is, and for me it's not so much a matter of adapting to v3 but taking the chance to improve it a little. See last 3 commits for said attempt via hacking global module storage. Sorry – I don't know any better.

Thanks. I will have a look and test with the ocrd_network module. Please do not hold up development by waiting for me; merge this PR if required. I can adapt afterwards if I find a better solution.

bertsky and others added 16 commits August 23, 2024 17:54
- `metadata`, `executable`, `ocrd_tool`, `version`: use cached_property
  instead of internal ad-hoc attributes
- rename `metadata_location` → `metadata_filename`, add cached_property
  chain `metadata_location`, `metadata_rawdict` used by `metadata` to
  make it easy to override
  - `metadata_filename` if just the path of `ocrd-tool.json` in the package deviates
  - `metadata_location` if the `ocrd-tool.json` is not distributed via Python pkg
  - `metadata_rawdict` if the `ocrd-tool.json` is not in a file
  - `metadata` if the validated, expanded `ocrd-tool.json` is somewhere else (see the sketch after this list)
- `DummyProcessor`: just override `Processor.metadata_filename`
- processor tests: adapt to new properties and `verify` enforcing cardinality
@bertsky (Owner, Author) commented Aug 24, 2024

I did some more refactoring and cleanup.

Notably, because we always default-expand the schema, we can now enforce the fileGrp cardinalities already – see 5819c81

Also, I cherry-picked from #16 and exported the full chain of metadata loading, so it can be overridden at exactly where implementors like – see 7966057

I also made use of the fact that in bashlib's ocrd__wrap we already know where the ocrd-tool.json file is and what the executable is called: by passing this info into ocrd bashlib input-files, we can even better use the standard processor class and decorator.

Since no-one commented on the problem with breaking Processor.parameter in v2, I consider this irrelevant and will keep it. (So just ocrd_tesserocr must be updated+installed immediately.)

This is ready AFAICS. Does anyone have an opinion when this can/should be merged, so we can make another prerelease?

@@ -129,4 +125,31 @@ def process_page_file(self, *input_files):
print(f"[{field}]='{value}'", end=' ')
output_file_id = make_file_id(input_files[0], kwargs['output_file_grp'])
print(f"[outputFileId]='{output_file_id}'")
ocrd_cli_wrap_processor(BashlibProcessor, **kwargs)
if ocrd_tool and executable:
class FullBashlibProcessor(BashlibProcessor):
Collaborator comment on this diff:

clever

@kba (Collaborator) commented Aug 24, 2024

I did some more refactoring and cleanup.

Notably, because we always default-expand the schema, we can now enforce the fileGrp cardinalities already – see 5819c81

👍

Also, I cherry-picked from #16 and exported the full chain of metadata loading, so it can be overridden at exactly where implementors like – see 7966057

Very thorough but I needed #17 to get eynollah to work

I also made use of the fact that in bashlib's ocrd__wrap we already know where the ocrd-tool.json file is and what the executable is called: by passing this info into ocrd bashlib input-files, we can even better use the standard processor class and decorator.

👍

Since no-one commented on the problem with breaking Processor.parameter in v2, I consider this irrelevant and will keep it. (So just ocrd_tesserocr must be updated+installed immediately.)

Yes, if there is breakage we can fix it quickly.

This is ready AFAICS. Does anyone have an opinion when this can/should be merged, so we can make another prerelease?

I could make a b1 right away but would like to have #17 in there, since I'm working on eynollah now which needs it.

kba added a commit to qurator-spk/eynollah that referenced this pull request Aug 24, 2024
processor.metadata_location: use self.__module__ not __package__
@bertsky bertsky merged commit b53724e into new-processor-api Aug 25, 2024
1 check passed
@bertsky bertsky deleted the new-processor-api-parameter-setup branch September 23, 2024 15:47
@bertsky bertsky restored the new-processor-api-parameter-setup branch September 25, 2024 16:01