-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Crash in bParking processing #26154
Comments
A new Issue was created by @fabiocos Fabio Cossutti. @davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
@prebello could you please point us to a failing workflow/initial settings? |
sure for instance details for each step can be found at does it help? any further detail? |
previously reported errors were due to special characters in the file (it seems) The file read here is is that correct? |
@davidlange6 yes, the file is the right one.
|
On Mar 12, 2019, at 5:20 PM, Mauro Verzetti ***@***.***> wrote:
@davidlange6 yes, the file is the right one.
previously reported errors were due to special characters in the file (it seems)
• Out of curiosity, how can you tell?
• Why it works locally but not remotely?
• How do we clean the xml from these characters?
sorry I mean previous times when this same error was reported..
I don't actually see special characters in this file (yet), so past experience may have nothing to do with the current problem.
… —
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
Note that we had to execute the following command:
xmllint --format BDT.xml > BDT_formatted.xml
before we were able to successfully read the files with the TXMLEngine.
Also we used gzip, as you can see from the file name suffixes.
So perhaps there are some issues / inconsistencies related to these two executables?
Rob
… On 12 Mar 2019, at 17:05, David Lange ***@***.***> wrote:
> On Mar 12, 2019, at 5:20 PM, Mauro Verzetti ***@***.***> wrote:
>
> @davidlange6 yes, the file is the right one.
>
> previously reported errors were due to special characters in the file (it seems)
>
> • Out of curiosity, how can you tell?
> • Why it works locally but not remotely?
> • How do we clean the xml from these characters?
sorry I mean previous times when this same error was reported..
I don't actually see special characters in this file (yet), so past experience may have nothing to do with the current problem.
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub, or mute the thread.
>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
probably good to see if there is better success using an unzipped file. No reason why there should be but..
otherwise, I'm lost about which releases have or have not worked in the various crab tests. Can someone be specific (with three digit release names).
… On Mar 12, 2019, at 6:28 PM, bainbrid ***@***.***> wrote:
Note that we had to execute the following command:
xmllint --format BDT.xml > BDT_formatted.xml
before we were able to successfully read the files with the TXMLEngine.
Also we used gzip, as you can see from the file name suffixes.
So perhaps there are some issues / inconsistencies related to these two executables?
Rob
> On 12 Mar 2019, at 17:05, David Lange ***@***.***> wrote:
>
>
>
> > On Mar 12, 2019, at 5:20 PM, Mauro Verzetti ***@***.***> wrote:
> >
> > @davidlange6 yes, the file is the right one.
> >
> > previously reported errors were due to special characters in the file (it seems)
> >
> > • Out of curiosity, how can you tell?
> > • Why it works locally but not remotely?
> > • How do we clean the xml from these characters?
>
> sorry I mean previous times when this same error was reported..
>
> I don't actually see special characters in this file (yet), so past experience may have nothing to do with the current problem.
>
>
> > —
> > You are receiving this because you were mentioned.
> > Reply to this email directly, view it on GitHub, or mute the thread.
> >
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub, or mute the thread.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
@guitargeek thanks for the explanation! Indeed it makes sense as the weight files are quite large. |
That would of course be quite a big backport, other ideas would be really convenient! In case we want to keep the backport option open however, there is also a cmsdist backport which would have to be made for tinyxml2, analogue to cms-sw/cmsdist#4312. To make this list of references related to #24432 complete, I should also mention the external PR that it depended on, translating some old txt weight files to the XML format: cms-sw/cmsdist#4332 |
Given the review comments posted for #24432 (see #24432 (comment)) I see no counterindication in principle in backporting that PR. It must be noted however, as also pointed out by @guitargeek right above. that it would be quite a big backport, it would act on parts of the code which have been touched and modified independently in the meanwhile, and therefore both the backport into a production release and its review should be done and validated carefully. As such, I may expect some non negligible further delay in the start of the bParking reco production. Moreover (just to keep here some record of it) one has to backport and integrate also the later fix #25085. If I read correctly the issue #24395, there was a suggestion there of acting on the |
I personally never explored that option. But I think it must be somehow more complicated than just changing the limit, because there is also this other strange effect of it working locally but not on CRAB. What I can do is to explore how easy a backport of #24432 + #25085 would be (i.e. how many conflicts would arise). |
My feeling is that the back port, while a non-negligible effort, is presumably the most sensible way to proceed. So if @guitargeek <https://github.com/guitargeek> can create the PR, then once all the code-checks, etc are done, myself and @mverzett can use the PR branch to test locally and on CRAB to ensure it does the job. Presumably if the problem with CRAB goes away, then one can only hope the newly discovered issues with wmagent also go away ...?
… On 13 Mar 2019, at 09:35, Jonas Rembser ***@***.***> wrote:
I personally never explored that option. But I think it must be somehow more complicated than just changing the limit, because there is also this other strange effect of it working locally but not on CRAB.
What I can do is to explore how easy a backport of #24432 <#24432> + #25085 <#25085> would be (i.e. how many conflicts would arise).
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub <#26154 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABEfkgKXA_pWNZgwQlMhjwHtAZdcitM1ks5vWMZhgaJpZM4bq-WM>.
|
This is the backport for cmsdist, cmssw will follow. |
@bainbrid at the same time we could test the increase of |
Yes, sounds good - thanks!
… On 13 Mar 2019, at 10:58, Mauro Verzetti ***@***.***> wrote:
@bainbrid <https://github.com/bainbrid> at the same time we could test the increase of xmlenginebuffersize with the same assumption that if it works on crab works for pdmV as well.
@guitargeek <https://github.com/guitargeek> is it an environmental variable? a global variable?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub <#26154 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABEfklXBVkZ6WxRDLuN68edGOI--pvqXks5vWNnygaJpZM4bq-WM>.
|
Yes, that's good we try things in parallel. I have no idea how to change the |
Apparently the buffer size was hardcoded in the TMVA code until this august. I guess that the backport is the only solution |
Still, is a mystery how locally it works though |
I'm theoretically done with the backport, but I have real trouble compiling and testing today! Both on lxplus and the machines of by lab I always get Edit: Never mind, I realized my mistake. SL6 vs. SL7 problem, because 10_2_X defaulted to SL6. |
Probably naive question: Could the problem lie here? Given we have two BDT models being instantiated sequentially in the same module and the comment on security of the line above the quoted ones? |
What do you mean by "have two BDT models being instantiated sequentially in the same module"? There are modules in production for quite some time now which instantiate dozens of GBRForests, read from gzipped files right now. What does the low pt reconstruction do different with respect to what already exists? But honestly I don't quite understand the comment behind your link anyway :) |
There is a potential problem here
@bbockelm could that be causing the problems we see? |
Nope, that’s not the cause.
|
ok, @prebello , all, just to confirm, I tested line by line this #26154 (comment) and it works fine for me. @guitargeek , to note, FYI:
|
Thank you Rob, I edited my PR description accordingly. Seems I accidentally pushed the change with the |
I tested the recipe on crab, one file, 400 events. It works! @prebello can you confirm the same? |
Great!!
Yes @prebello could you please rest too with WMagent?
If this also works, I’d suggest we are good to proceed with the integration of this PR ...
… On 14 Mar 2019, at 13:05, Mauro Verzetti ***@***.***> wrote:
I tested the recipe on crab, one file, 400 events. It works! @prebello can you confirm the same?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
@bainbrid sorry. busy with lectures marathon since yesterday. I will take care of it today only after 5pm CERN. |
@bainbrid I have injected two relvals in the system to test. |
On 3/15/19 7:15 PM, Patricia Rebello Teles wrote:
@bainbrid
again the error in 10 events local test (lxplus scl6)
%MSG-e TkDetLayers:
***@***.***
16-Mar-2019 03:01:26 CET Run: 1 Stream: 7
ForwardDiskSectorBuilderFromDet: Trying to build Petal Wedge from Dets
at different z positions !! Delta_z = -0.950417
%MSG
If I'm not mistaken, the above error "Trying to build ..." is showing up
in all data processing since 2017 (or was it 2016).
This is not related to bParking
… I have injected two relvals in the system to test.
—
|
I confirm that you can safely ignore this message |
Yes, we’ve seen this since the beginning and we ignore.
… On 16 Mar 2019, at 09:06, boudoul ***@***.***> wrote:
I confirm that you can safely ignore this message
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
Colleagues, unfortunately |
how was this test performed?
… On Mar 17, 2019, at 1:39 AM, Patricia Rebello Teles ***@***.***> wrote:
Colleagues, unfortunately
---- Begin Fatal Exception 16-Mar-2019 17:15:57 UTC-----------------------
An exception of category 'FatalRootError' occurred while
[0] Constructing the EventProcessor
[1] Constructing module: class=LowPtGsfElectronIDProducer label='lowPtGsfElectronID'
Additional Info:
[a] Fatal Root Error: @sub=TXMLEngine::ParseFile
Unexpected end of xml file
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
@prebello did you manage to successfully build the patch detailed here #26154 (comment) ? And you still see the Exception with this recipe above? Is anyone able to submit jobs using WMagent? Are there instructions we can follow? We’ve had the recipe available for 5-6 days, so I’m keen to draw a firm conclusion quickly. |
Hi @bainbrid, Slava gave me some feedback on the PR during the weekend, and I took the opportunity to also fix the line with |
its unlikely that any wmagent test is possible with a custom recipe (eg, outside of a release) - thats why I asked how this test was done..
… On Mar 18, 2019, at 4:32 AM, Mauro Verzetti ***@***.***> wrote:
@prebello I am surprised that the error still appears, as the with the PR the XML parsing for the low pt electrons in overhauled and TXMLEngine is replaced. I am wondering as @bainbrid if you included the recipe he suggested.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
Ah ok, I'm not aware of the technical details / limitations involved in such a test. Perhaps this explains why @prebello still sees the error. @fabiocos, all: given Mauro's test with CRAB appears to mitigate the issue, which is very encouraging, is it at all possible to proceed with reviewing/merging #26176, create a new CMSSW_10_2 release as soon as possible, and try again? We are on an very tight timeline for the bParking processing and I am worried we will miss our slot. |
exaclty.. I did the test in patch1 release only changing the era accordingly. Indeed if the PR with dedicated changes was not merged, it will not work. I have understood that patch1 was the fixed release. |
@bainbrid one needs the fixes in a dedicated release to make the test properly. |
@guitargeek I have done it (changed from <tinyxml2.h> to "tinyxml2.h") to fix the compilation error. Indeed it is not the reason of the issue. |
Sorry - I'm still confused:
If yes to both, then the question still remains why you see the exception from the TXMLEngine constuctor when the patch no longer uses this code ... |
Then given the successful CRAB test, perhaps we can proceed straight to a new release (i.e. merge and include this PR) and test again?
… @bainbrid <https://github.com/bainbrid> one needs the fixes in a dedicated release to make the test properly.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub <#26154 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABEfkpWYBN5m0qWVSjxuZw5FE3LkO_L5ks5vX4gugaJpZM4bq-WM>.
|
I am not saying that CMSSW_10_2_12_patch1 contains the fix. I have said that "I have guessed that CMSSW_10_2_12_patch1 had the fix" . Yes I have used it. |
@bainbrid do you think we will have problems in rereco 10-2-12? maybe, as relvals are failing too? the samples are still in assignment-approved, although we have requested the block transfer, and I would like to confirm with you to avoid contacting computing for potential problematic runs. FYI @davidlange6 |
CMSSW_10_2_12_patch1 has nothing to do with this issue, see the release notes https://cmssdt.cern.ch/SDT/ReleaseNotes/CMSSW_10/CMSSW_10_2_12_patch1.html Yesterday I have privately tested the recipe locally, and it works (i.e. the workflow 136.898 arrives at the end without apparent problems). Having a new release with the various additions for B parking inside looks the best way to move forward, provided the backport is approved |
We need a new CMSSW release if we are to use WMagent successfully.
… On 18 Mar 2019, at 13:14, Patricia Rebello Teles ***@***.***> wrote:
@bainbrid <https://github.com/bainbrid> do you think we will have problems in rereco 10-2-12? maybe, as relvals are failing too?
(both are injected using wmagent framework)
the samples are still in assignment-approved, although we have requested the block transfer, and I would like to confirm with you to avoid contacting computing for potential problematic runs.
https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?campaign=Commissioning2018 <https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?campaign=Commissioning2018>
FYI @davidlange6 <https://github.com/davidlange6>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub <#26154 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABEfkhwZr2FsNhIc61X-kppYHjZD_7oPks5vX5E4gaJpZM4bq-WM>.
|
PdmV gladly informs that relvals were announced with 100% success. @fabiocos |
A problem with bParking code has been reported by PdmV. This issue has been initially observed during CRAB-based private tests, and were reproducible when running in CRAB but not when running locally. The problem is:
initially reported in https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/4440.html and https://indico.cern.ch/event/790963/contributions/3306140 .
It looks present in 10_2_X, not in cycles after 10_3_X, starting with merges on top of 10_2_9.
The text was updated successfully, but these errors were encountered: