Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should Effect even be a subclass of DataContainer? #479

Open
teutoburg opened this issue Oct 19, 2024 · 9 comments
Open

Should Effect even be a subclass of DataContainer? #479

teutoburg opened this issue Oct 19, 2024 · 9 comments
Assignees
Labels
API How users interact with the software refactor Implementation improvement

Comments

@teutoburg
Copy link
Contributor

teutoburg commented Oct 19, 2024

While working on improving some basic ideas about how we implement the fundamental Effect class (in the context of #387), I started wondering about the rationale behind inheriting from DataContainer. I understand the requirements for loading and storing data (i.e. FITS files or Tables, mostly) in an Effect, but I'm thinking this might conceptually be better handled by an attribute, rather than a parent class.

What?

This gets somewhat into the whole "is-a" vs "has-a" argument, but is arguably a good example thereof, so hear me out. IS an Effect a piece of data that's handled in some way? Or does an Effect HAVE some data associated with it? The more I think about it the more I tend towards the latter interpretation. Especially given the fact that many effects don't have any data, but just some config parameters.

Why?

Now why does this matter at all? Well, the Effect class is already a rather low base class, so having some more (rather complex) functionality "below" it not only feels a bit weird, but also makes it harder to change things or even get a good picture of how things work. Secondly, a composition-over-inheritance (which is the fancy wording for this whole argument) approach would also make the tests somewhat simpler and cleaner (think dependency injection). In general, the whole interface of Effect and also its state would be a fair bit cleaner this way.

How?

"Simply" (famous last words) make DataContainer an attribute of the Effect class (perhaps .data or even .data_container, since DataContainer itself already has .data (which is rarely used and somewhat duplicate of .get_data() anyway? But I wouldn't wanna end up with .data.data...), rather than it's base class!

Various effects that need this data attribute could then define a constructor in the __post_init__() method, for example.

@teutoburg teutoburg added API How users interact with the software refactor Implementation improvement labels Oct 19, 2024
@teutoburg teutoburg self-assigned this Oct 19, 2024
@teutoburg
Copy link
Contributor Author

teutoburg commented Oct 19, 2024

[Continuation, got interrupted by overstappen]

So, in the context of dataclasses, which is what the effects will become, I would propose to add this as a non-init field. That is, one that's not part of the constructor, but rather gets populated "later" (as mentioned in __post_init__()). This would also mean keeping the current signatures of the effects, e.g. supplying a file name! The initial value could be None, instead of an empty instance of DataContainer, which might also make it somewhat simpler to check if any data is present (which I think is currently done via checking the _file attribute, but what about data added from arrays (can't check now on mobile)). For any effects that work without external data (e.g. ShotNoise or analytical PSFs), the fiels would simply stay None.

@hugobuddel and also @astronomyk : good idea or horsesh*te? 🙃 Did I miss any reason why we really need this in an inheritance form that also makes sense conceptually?

@teutoburg
Copy link
Contributor Author

In terms of what to call it, one option that would make sense to me is .data_container for the actual field, but then have a .data property on the Effect class that is just a shortcut to .data_container.data (or .data_container.get_data(), I still need to figure out the reason both exist...). This would also actually maintain the current API of the Effect class (unless I'm missing anything), which would make this change have much less consequences to the rest of the codebase 🙂

@teutoburg
Copy link
Contributor Author

I decided to try it, how hard can it be 🙃

Turns out, not particularly hard. See the fh/datacont branch, which now passes locally (excluding notebook tests). Now that is not how it will finally look like, but I think if we manage to make this change in several steps, keeping things functional in between, that would be preferable (and easier to review).

@hugobuddel
Copy link
Collaborator

hugobuddel commented Oct 19, 2024

I'd say that indeed composition > inheritance.

From a practical perspective: there could be an effect that requires two or more pieces of data, that is not really possible in the current setup.

But we should only change the structure if we actually gain some tangible benefit, and I don't think you gave convincing arguments. You are going to do it anyway though :-). [edit: indeed!]

The biggest can of worms that I can currently think of, is that many of those files also contain information that currently ends up in .meta and that I think should become proper class attributes. In that sense, an Effect indeed actually is the data.

@teutoburg
Copy link
Contributor Author

From a practical perspective: there could be an effect that requires two or more pieces of data, that is not really possible in the current setup.

That it one more convincing argument I'd say 😜

The biggest can of worms that I can currently think of, is that many of those files also contain information that currently ends up in .meta

In fh/datacont, I "solved" this by adding self.meta.update(self.data_container.meta) in Effect.__init__(). Yes it's hacky, no I don't intend to keep that. Other than that, the only thing that crept up was in MetisLMSEfficiency.__int__(), which is problematic anyway because it calls super().__init__ twice 👀

@hugobuddel
Copy link
Collaborator

Other than that, the only thing that crept up was in MetisLMSEfficiency.__int__(), which is problematic anyway because it calls super().__init__ twice 👀

Yeah that was another thing I thought off. Well I did not recall that specific class, but I recalled that some __init__()s do weird things that might break if we make this change. But I think that is actually a good thing, because that would mean there was a problem there anyway.

The getters and setters you added in fh/datacont should make the transition rather painless.

I'm still a bit reluctant w.r.t. to the meta attributes. E.g. say the Effects would have .save() function that would produce a file that can be passed as a filename to the constructor to recreate the Effect instance, what should that function do?

In the old inheritance design, such a save() function could just save .meta to the headers of the file, because it is the DataContainer. In the new composition design it should copy .meta to the DataContainer first and then have the DataContainer save the file. It doesn't 'feel right', but I can't articulate why.

The inheritance design is clear: An Effect is something that can be loaded (and in theory saved) to a file. (But not in two files; which conceptually makes some sense.) The composition design is less clear: an Effect is something that has data that can be stored in a file. But some Effects don't have external data associated with it, so those can still 'have' a file that can be used to store their attributes.

Maybe we should still make this change though. But I think it is good to at least have a thought about where we want to go with this.

For example, we could also say that an Effect can store it's meta attributes (which IMO should become proper class attributes) read from some file. A simpler file than a full blown DataContainer; maybe just a yaml dump. That metadata file does not actually contain 'large' data, like pixels or transmission curves, but it would contain filenames of files with that 'bulk' data. The Effect then has 0, 1, or potentially more data files associated with it, which are DataContainer instances, each corresponding to one of the filenames in the Effect. (But note that it is possible to create a DataContainer without having an actual file, so the filename can be None.)

That way we could remove all the Effect meta content from the data file. The data file would/could still have metadata of its own though, although some of those could actually be part of the data directly (like units and such).

In practice, we might still need this line, at least for a while:

        self.meta.update(self.data_container.meta)

but I think that one way I would be okay with this proposed composition-instead-of-inheritance change is if we would have a (concept of a) plan to get rid of that line.

But don't get too hung up on my reluctance though. Maybe the proposed change is fine as-is.

@teutoburg
Copy link
Contributor Author

I'm still a bit reluctant w.r.t. to the meta attributes. E.g. say the Effects would have .save() function that would produce a file that can be passed as a filename to the constructor to recreate the Effect instance, what should that function do?

In the old inheritance design, such a save() function could just save .meta to the headers of the file, because it is the DataContainer. In the new composition design it should copy .meta to the DataContainer first and then have the DataContainer save the file. It doesn't 'feel right', but I can't articulate why.

The inheritance design is clear: An Effect is something that can be loaded (and in theory saved) to a file. (But not in two files; which conceptually makes some sense.) The composition design is less clear: an Effect is something that has data that can be stored in a file. But some Effects don't have external data associated with it, so those can still 'have' a file that can be used to store their attributes.

For example, we could also say that an Effect can store it's meta attributes (which IMO should become proper class attributes) read from some file. A simpler file than a full blown DataContainer; maybe just a yaml dump. That metadata file does not actually contain 'large' data, like pixels or transmission curves, but it would contain filenames of files with that 'bulk' data. The Effect then has 0, 1, or potentially more data files associated with it, which are DataContainer instances, each corresponding to one of the filenames in the Effect. (But note that it is possible to create a DataContainer without having an actual file, so the filename can be None.)

In my thinking, an Effect is separate from its data. The representation of the Effect is a YAML file (or indeed a section within one), that CAN point to a (FITS or ASCII) file containing some data. But that data should be static as far as the effect is concerned (is that ever not the case anywhere currently?), so saving an Effect would be equivalent to dumping all the attributes of the effect into YAML, along with 0/1/n path(s) to a static data file.

That way we could remove all the Effect meta content from the data file. The data file would/could still have metadata of its own though, although some of those could actually be part of the data directly (like units and such).

In practice, we might still need this line, at least for a while:

        self.meta.update(self.data_container.meta)

but I think that one way I would be okay with this proposed composition-instead-of-inheritance change is if we would have a (concept of a) plan to get rid of that line.

Perhaps this actually reflects that we should differentiate between (meta) data of the effect itself (i.e. soon-to-be class attributes) and meta data of the data, e.g. units (like you mentioned) in a table. For table, a lot of this might be handled by the data structure itself in the future, e.g. using QTable when dealing with tables that have units. For (image) HDUs (where the meta data would be the header), this might be a bit more tricky, but in the end it would probably be good to define what information should go into the header and what should be an attribute of the effect itself...

@hugobuddel
Copy link
Collaborator

I vote that we just go for it.

Maybe we could then replace

        self.meta.update(self.data_container.meta)

with exactly those attributes we actually need to propagate, that is, a bunch of

        if 'xyz' in self.data_container.meta:
            self.meta['xyz'] = self.data_container.meta['xyz']

(Or something smarter, with some class parameter attributes_to_copy_from_data = {'xyz'}, whatever.)

And then later move most (all?) of those attributes to the yaml files defining the Effect.

And then we can make the Effects dataclasses!

@teutoburg
Copy link
Contributor Author

I'll have to look at some of the effects that use this kind of data (I'd wildly guess about half of our Effect subclasses?) to get an idea about the extent of this occurring. That is, I'm wondering if it's just a few table column units and PSF FITS file WCS information, or more.

I started today to experiment with dataclass effects in a local branch (superset of fh/datacont), after ironing out some details of the structure you pioneered in #387 (which I should probably share somewhere as the prototype to this) and this is not trivial. Mostly because of cases like this that access parameters that need to be resolved through the meta before super().__init__ is called. I'd kinda need a __pre_init__ for that, but dataclasses doesn't offer this. I'm sure I'll find a solution though.

But since this issue is about whether to make this change in the first place and not to actually implement it, I suggest we can close it now, since we mostly agree on the way forward I think, and move any further discussion to the PR that will result from the fh/datacont branch. In hindsight, after spending 5 days in 🇳‍🇱 , I rather should have called the branch datakont 😂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API How users interact with the software refactor Implementation improvement
Projects
Status: 📋 Backlog
Development

No branches or pull requests

2 participants