
DL2 to DL3 step implementation #81

Merged
merged 26 commits into from
Jan 19, 2022

Conversation

morcuended
Member

@morcuended morcuended commented Jan 17, 2022

Usage: dl3_stage -d 2021_08_08 -c cfg/sequencer.cfg LST1

It makes use of the metadata extracted from the TCU database (source name and RADec coordinates). (WARNING: a run catalog is still to be implemented. It would ease access to the metadata needed by the DL3 tool.)

It pipes the lstchain scripts:

  • lstchain_create_irf_files (once per set of selection cuts)
  • lstchain_create_dl3_file (run-wise)
  • lstchain_create_dl3_index_files (source-wise)
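The run-wise and source-wise steps above can be sketched as command construction. This is a minimal sketch: the flag names below are illustrative assumptions, not a copy of the actual CLI options of the lstchain tools, and the helper function is ours.

```python
from pathlib import Path

def build_dl3_commands(dl2_files, irf_file, out_dir, source_name, ra_deg, dec_deg):
    """Return the command lines for the DL3 creation and indexing steps.

    Flag names are hypothetical placeholders for the real lstchain options.
    """
    cmds = []
    # One lstchain_create_dl3_file call per run (run-wise step).
    for dl2 in dl2_files:
        cmds.append([
            "lstchain_create_dl3_file",
            "--input-dl2", str(dl2),
            "--input-irf", str(irf_file),
            "--output-dl3-path", str(out_dir),
            "--source-name", source_name,
            "--source-ra", f"{ra_deg}deg",
            "--source-dec", f"{dec_deg}deg",
        ])
    # One indexing call per source directory (source-wise step).
    cmds.append([
        "lstchain_create_dl3_index_files",
        "--input-dl3-dir", str(out_dir),
    ])
    return cmds

cmds = build_dl3_commands(
    [Path("dl2_LST-1.Run00001.h5")],
    Path("irf_std_cuts.fits"),
    Path("std_cuts/source_name1"),
    "Crab", 83.633, 22.014,
)
```

Each inner list could then be passed to subprocess.run or submitted as a SLURM job by the workflow.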

The idea is to end up with the following structure:

/fefs/aswg/data/real
├── monitoring
│   ├── RunSummary
│   ├── DrivePositioning
│   └── PixelCalibration
├── DL1
│   └── YYYYMMDD
│       └── vX.Y.Z
│           ├── muons.fits
│           └── tailcut84
│               ├── dl1.h5
│               └── datacheck_dl1.h5
├── DL2
│   └── YYYYMMDD
│       └── vX.Y.Z
│           └── tailcut84
│               └── dl2.h5
└── DL3
    └── YYYYMMDD
        └── vX.Y.Z
            └── tailcut84
                ├── std_cuts
                │   ├── irf_std_cuts.fits
                │   ├── source_name1
                │   │   ├── dl3_LST-1_Run00001.fits
                │   │   ├── dl3_LST-1_Run00002.fits
                │   │   ├── hdu-index.fits.gz
                │   │   └── obs-index.fits.gz
                │   └── source_name2
                │       ├── dl3_LST-1_Run00003.fits
                │       ├── dl3_LST-1_Run00004.fits
                │       ├── hdu-index.fits.gz
                │       └── obs-index.fits.gz
                └── other_cuts

Right now this script is intended to be run separately, once closer has been launched and files have been moved to their final destinations.

TODO in further PRs:

  • Refactoring.
  • DL3 stage should run right after the merging of DL2 files without having to depend on the closer. Sequencer should take care of this analysis step as well through the datasequence.
  • Implement unit tests.
  • Next-day high-level analysis.

@codecov

codecov bot commented Jan 17, 2022

Codecov Report

Merging #81 (797e31e) into main (cdc88ca) will increase coverage by 0.00%.
The diff coverage is 85.71%.


@@           Coverage Diff           @@
##             main      #81   +/-   ##
=======================================
  Coverage   81.39%   81.40%           
=======================================
  Files          41       41           
  Lines        4294     4296    +2     
=======================================
+ Hits         3495     3497    +2     
  Misses        799      799           
Impacted Files | Coverage Δ
osa/scripts/sequencer.py | 86.80% <ø> (ø)
osa/nightsummary/extract.py | 78.19% <66.66%> (ø)
osa/configs/datamodel.py | 86.79% <100.00%> (+0.25%) ⬆️
osa/utils/utils.py | 67.13% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@morcuended
Member Author

morcuended commented Jan 19, 2022

Currently, we may have problems with IERS-A data from astropy (see astropy/astropy#10494). We should download the IERS data separately (keep them up to date on a weekly basis?) and cache that data without trying to download anything when running in the cluster.

Just for reference, I leave here a link on how to proceed with the astropy cache when working in clusters (https://docs.astropy.org/en/stable/utils/data.html#astropy-data-and-clusters)
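Following the linked astropy documentation, disabling runtime IERS downloads on cluster nodes (and relying on a cache that is refreshed separately, e.g. weekly) is a small configuration fragment:

```python
# Configuration sketch for cluster jobs: never fetch IERS-A at runtime,
# rely on a pre-populated astropy cache instead (refreshed out of band).
from astropy.utils import iers

iers.conf.auto_download = False  # do not contact IERS servers from jobs
iers.conf.auto_max_age = None    # do not error out on stale cached tables
```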

@morcuended
Member Author

morcuended commented Jan 19, 2022

@moralejo @rlopezcoto @chaimain how does this directory structure scheme for post-DL2 analysis steps sound to you?

@morcuended morcuended merged commit d13d192 into main Jan 19, 2022
@morcuended morcuended deleted the dl3 branch January 19, 2022 19:19
@rlopezcoto

Sorry @morcuended, we have been very busy this week with the school. The structure looks overall good, thanks for taking care of proposing it and for this PR. I just have a few questions:

  • std_cuts -> have we already defined any set of std_cuts for processing? It may be a good idea to define a few different ones.
  • In this structure you are introducing for the first time source_name, where are you expecting to get it from?
  • do we really need to include irf_std_cuts.fits per source? or will they be produced by lstmcpipe and you will only link them there (how are you planning to know which one?)
  • I guess all this structure will be source-independent analysis, right?

@chaimain

chaimain commented Jan 21, 2022

Sorry @morcuended, we have been very busy this week with the school. The structure looks overall good, thanks for taking care of proposing it and for this PR. I just have a few questions:

  • std_cuts -> have we already defined any set of std_cuts for processing? It may be a good idea to define a few different ones.

We indeed have to have a discussion on the definition of std_cuts, and probably store this in lstchain first.

  • In this structure you are introducing for the first time source_name, where are you expecting to get it from?

It seems to come from a database query on DriveControl_SourceName. We should also discuss this, along with our discussion on creating and maintaining a standard catalog of the sources observed by LST-1.
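Such a run catalog could be as simple as a per-run record of source name and target coordinates, so that the DL3 steps no longer query TCU directly. A hypothetical sketch (the field names are assumptions, not an agreed schema):

```python
# Hypothetical run catalog: run_id -> source metadata.
# Field names and values are placeholders, not an agreed schema.
run_catalog = {
    1234: {"source_name": "Crab", "ra_deg": 83.633, "dec_deg": 22.014},
    1235: {"source_name": "Crab", "ra_deg": 83.633, "dec_deg": 22.014},
}

def sources_observed(catalog):
    """Distinct source names in the catalog, for source-wise DL3 indexing."""
    return sorted({rec["source_name"] for rec in catalog.values()})
```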

  • do we really need to include irf_std_cuts.fits per source? or will they be produced by lstmcpipe and you will only link them there (how are you planning to know which one?)

For IRFs, we are still awaiting the completion of various tasks: producing the 'all-sky' MC list, creating the new RF model and DL2 files, merging the IRF interpolation, and using energy-dependent cuts in the IRF/DL3 Tools. I think it would only make sense to talk about 'standard' IRF types once these tasks are done.

@morcuended
Member Author

Hi @rlopezcoto

After talking with @chaimain I realized that there are still some open issues that are not reflected in this very simplistic scheme. The main one is the selection of proper IRFs for each observation. I'll try to answer your questions below:

  • First, all this stuff refers to source-independent analysis. This is the only stream lstosa currently does. We may want to indicate this somewhere in the data tree.
  • No standard cuts defined yet. Just wanted to sketch how it could be in the future. We indeed could go for several sets of cuts (e.g. tight, standard, soft cuts). This is to be discussed and agreed upon within lstchain first as @chaimain says.
  • The source name associated with a given run_id (as well as the source coordinates) is fetched from the TCU database in this scheme (see https://github.com/cta-observatory/lstosa/blob/main/osa/nightsummary/database.py). This is a preliminary version and needs to be discussed too. I'd say that querying this database works for runs taken from about Nov 2020 (the information is not consistently there for earlier runs). Also, the source RADec information seems to reflect not the actual target coordinates but the coordinates plus the wobble offset, so we might want to get this information from the drive logs instead. I think all this information should be centralized, though: from now on, TCU will be writing a Run Catalog containing it. For previously taken runs, however, we'll have to figure this out some other way (e.g. using Create script to merge run summaries with drive logs into a single file cta-lstchain#880). This will need discussion among the analysis, TCU and drive teams.
  • do we really need to include irf_std_cuts.fits per source? or will they be produced by lstmcpipe and you will only link them there (how are you planning to know which one?)

Here I was foreseeing an IRF file per set of cuts, not per source. This could rather be done by lstmcpipe, and then we would just look for the files there. As @chaimain says, there are several open points to be considered before we go further in lstosa.

For the moment we could produce IRF-less DL3 files and store them in the same night/date directory (without sorting them by source). Nor would we run the observation-indexing script for the moment. Or we could produce no DL3 files at all until the previous issues are worked out.

I just wanted to move this forward so we could have automatic & fast next-day high-level results. But I guess that, for now, it only makes sense to produce theta2 & significance results from DL2 files or from these DL3 files without IRFs incorporated. Do you think this makes sense @rlopezcoto?

@rlopezcoto

thanks @morcuended and @chaimain, this sounds good for the time being. It would, however, be great if DL3 files could at least be produced for a few sets of cuts (which can be discussed as @chaimain was suggesting).

For the moment we could produce IRF-less DL3 files and store them in the same night/date directory (without sorting them by source). We would not run either the observation indexing script for the moment. Or we could not produce DL3 files at all for the time being until previous issues are worked out.

what is the problem with running the observation indexing script?

@chaimain

chaimain commented Jan 24, 2022

what is the problem with running the observation indexing script?

Sorry, I just checked again, and there should be no problem with running the indexing Tool for IRF-less DL3 files.

Also, we will try and merge the PR in lstchain for using energy-dependent cuts, so we can have a better definition on the types of cuts we apply, based on the gamma efficiency for each energy bin we define.

@morcuended
Member Author

Sorry, I just checked again, and there should be no problem with running the indexing Tool for IRF-less DL3 files.

Then we will do it like this. No IRFs but we do index the files.

For the time being, I will test the production of DL3 with fixed cuts then we will move to energy-dependent ones.
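The difference between the two cut schemes under discussion could be captured in a configuration like the sketch below. The values and key names are placeholders, not agreed "standard" cuts: a fixed scheme applies one global cut, while the energy-dependent scheme derives a cut per energy bin from a target gamma efficiency.

```python
# Placeholder cut configurations (illustrative values, not agreed standards).

# Fixed cuts: one global gammaness and theta cut for all energies.
fixed_cuts = {"gh_cut": 0.7, "theta_cut_deg": 0.2}

# Energy-dependent cuts: keep a fixed fraction of gammas in each energy bin,
# letting the gammaness cut vary from bin to bin.
energy_dependent_cuts = {
    "gh_efficiency": 0.9,  # keep 90% of gammas per bin
    "energy_bins_tev": [0.02, 0.2, 2.0, 20.0, 200.0],
}
```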

@rlopezcoto

Great, thanks guys!

@chaimain

Hi @morcuended, after discussing with @maxnoe regarding IRF-less DL3 production: we should not create DL3 files without IRFs, as they provide no additional information over the DL2 files.
It would be better to just create DL3 files with the existing IRFs (maybe point-like IRFs for now) daily with lstosa.

The lstchain DL3 Tool will be fixed in #709 to require IRFs and to take the gammaness cut information from the provided IRFs only, be it a global cut or energy-dependent cuts. It will be available in the upcoming lstchain release.

So, maybe you should close #94

@morcuended
Member Author

Hi @chaimain. Thanks for letting me know. I will close #94. Have you discussed the "standard" creation of DL3 files? Shall we go for point-like IRFs and several sets of cuts (fixed or energy-dependent)?

@chaimain

No, I have not yet discussed the "standard" cuts and the type of IRFs to be produced. I will do it on Slack now and, if need be, open a GitHub issue.
