Thermal Model protection #3552
This is already reimplemented in the newer fsensor implementation
Setting pullups on the ADC should trigger the model-based check, making this redundant and wasteful. Keep the DEBUG_PULLUP_CRASH menu so that we can verify this behavior in the future.
Disable the interrupt source instead, which avoids the added latency of reentering the isr in the first place.
The current code assumes that values are directly comparable
Read from ADC as fast as possible using the ADC interrupt to get more accurate instantaneous readings. Decouple the temperature_isr from the adc reading interval, so that the two can run independently for future use.
Isolate the PWM management into soft_pwm_core
Use a new low-priority "temp_mgr_isr" running at constant rate for temperature management. This is done so that the temperatures are sampled at a constant independent interval *and* with reduced jitter. Likewise for actual PID management. This will require further adjustment for the min/max/runaway display, which cannot be done directly into this function anymore (the code will need to disable heaters but flag for display to be handled in manage_heaters).
*_temperature_raw: buffer for the ADC ISR (read by temp ISR)
*_temperature_isr: latest temperatures for PID regulation (copied from _raw values)
*_temperature: latest temperature for user code

The flow:
- ADC ISR (async)
  - perform oversampling
  - call ADC callback: copy to _raw (async)
- temp ISR (timer)
  - convert to C (_isr values)
- user code (async)
  - check temp_meas_ready
  - call updateTemperature()
    - copy from _isr to current
    - synchronize target temperatures

This removes PINDA value averaging (if needed, it should be re-implemented by averaging in user code).
Split off setIsrTargetTemperatures and temp_mgr_pid() so that we can propagate the target temperatures instantaneously down the pid/pwm chain during emergencies. This reduces the amount of code in disable_heater() itself, making it a bit more maintainable. The bed still isn't disabled on the spot, due to the heatbed_pwm automaton. To be improved later.
- Flag the error condition from the temp_mgr_isr
- Handle the error state from the user code

Currently only handles min/maxtemp and relays the error to the original handler (which is a poor fit for the current design).
As for min/maxtemp, flag the error in the isr, then handle it in the user code calling the original handler.
Avoid running the user-level error handlers too fast.
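The three-stage hand-off described in the commit notes above (_raw, then _isr, then the user-visible value) can be sketched as follows. This is a minimal illustration in Python, not the firmware code: the class name, the method names and the conversion function are assumptions made for the example, and the real firmware converts raw ADC values through thermistor tables inside interrupt handlers.

```python
class TemperaturePipeline:
    """Three-stage hand-off: ADC ISR -> temp ISR -> user code (illustrative only)."""

    def __init__(self):
        self.temperature_raw = 0       # written by the ADC callback
        self.temperature_isr = 0.0     # written by the timer-driven temp ISR
        self.temperature = 0.0         # value seen by user code
        self.temp_meas_ready = False   # set by the temp ISR, cleared by user code

    def adc_callback(self, oversampled_value):
        # Stage 1 (async): store the latest oversampled ADC reading.
        self.temperature_raw = oversampled_value

    def temp_isr(self, raw_to_celsius):
        # Stage 2 (fixed rate): convert the raw value to degrees C and flag it.
        # raw_to_celsius is a hypothetical conversion function.
        self.temperature_isr = raw_to_celsius(self.temperature_raw)
        self.temp_meas_ready = True

    def update_temperature(self):
        # Stage 3 (user code): copy the ISR value only when a new one is ready.
        if not self.temp_meas_ready:
            return False
        self.temperature = self.temperature_isr
        self.temp_meas_ready = False
        return True
```

The point of the split is that each stage can run at its own pace: user code never sees a half-updated value, and the regulation loop always works on the most recent conversion.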
That didn't work for me: Note: using revo 6 instead of stock hotend. M310 A, M310 S1, M500 were done.
Can you post the output of M310? After this, can you do the following: let the hotend cool down below 40C, then issue the following (saving all the output):
This will generate a lot of output so we can have a look at why it fails.
Ok, flashed again, used M310 P40 this time (on the previous attempt I left the default, which was 38 I think), did M310 A, M310 S1, M500, then tried to set 210 for the nozzle. The screen flashed, the beeper beeped but... the heater was not disabled; it reached 210 and kept it there (until I chose cooldown from the menu). Then I tried to reproduce the heater not being disabled by setting 210 a few times but wasn't able to: lcd flash, beep, but the heater was always disabled in my few tests (maybe it only happened the first time after flashing and training?). Then I did what you asked. Unfortunately octoprint disconnects from the printer on error, so only the initial training and then the debug training were logged. My multiple "210 tests" were not recorded. Btw, what's the meaning of these new icons in the bottom left corner?
It's nice to actually test this on the Revo, I didn't have a chance to do so yet. So far we checked on the classic v6, with the copper heatblock and some other ones, various heatbreaks, with socks, on the dragon, and I do expect this to work almost everywhere with minor tweaks. D70 outputs the raw thermistor/heater values so it's possible to have a look at the entire temperature regulation during regular usage. This is what is shown during calibration after you enabled D70:

The autocalibration has 2 stages of heating from 50C to 230C. If you look at the first one, the heater is kept at full power, however it's struggling to reach a steady 230C. At some point the temperature shoots to 250C, with the PID regulator immediately killing power. This lasts for about 6 seconds, then it oscillates somewhat randomly before the cooling phase starts. The thermistor might be underreporting the temperature at that point, although the only way to be sure would be to use a second measurement. The sudden bumps of ~40C just before the cooldown are also unrealistic. I doubt the revo can change temperature that fast (in fact, the revo cools down at a normal pace just after), so it couldn't have jumped by 40C in either direction so fast. Did you notice if the temperature on the LCD was oscillating?

On the second heating run, everything seems to behave correctly. Any chance the connector is loose or there are cabling issues? The autocalibration is designed to be robust, so it should still produce consistent calibration values in this case. However a single jump of 40C during a print will definitely be sufficient to trigger an error condition. If you can re-run the same procedure 3 times (same g-code sequence for the full log) we can see if this happens consistently.
Could you share a script that parses the log and produces these charts? (Generally the revo is not behaving as a typical heatblock - https://www.youtube.com/watch?v=DdZlBiFajWE&t=771s but I guess that doesn't matter here) I'll do more testing in a few hours.
I'm polishing a set of scripts to debug thermal issues to be placed in tools/ so that these can be checked automatically, but it's going to take a few days. I'm aware of the revo heater configuration, however this shouldn't generate this sort of issue. The temperature gradient in the heatblock is mostly irrelevant for this. What could matter is the PTC element, but at higher temperatures (we can actually test this later by running the autocalibration at higher temps). The effect we see here, inconsistent readings while heating, seems more in line with bad temperature readings.
Just saying, the temperature oscillations during heatup look oddly similar to this: https://youtu.be/DdZlBiFajWE?t=1157
Repeated three times. Second and third also on video (youtube still processing) Second: Third: After that M310 S1, nozzle to 210 and unfortunately beep and failure. Thermistor board connector on revo doesn't have a latch like the one on stock prusa but it wasn't loose when I mounted it. I'll try to replace connector with the same as used on prusa and see if that makes a difference. Wasn't looking at previous time at LCD, so can't tell. This time there are videos.
I'll try to make a quick dumber parser for the log so you can play with this locally while I finish the proper tool. Meanwhile, here's the output: Run 1 and 2 look perfect. The third run has a weird spike. Zoomed in: The thermistor unexpectedly dips in temperature this time, although with a "smaller" 15C jump.
Aside from the temperature problem, the PTC heating element does change things up significantly enough for the simulation as it is. A regular heater exhibits almost-perfect linear heating. You can see on the Revo that the heat-up step is starting to look more bell-shaped just by eye, and this matters enough to generate a false trigger at around 200C so the PTC needs to be taken into account here. This is neatly documented here: https://e3d-online.zendesk.com/hc/en-us/articles/5926520938013-Revo-Six-Datasheet but I suspect the actual coefficient will have tolerances that would further need to be calibrated.
@arekm could you do the following starting from a cooled down heater (<40C) and record the entire session:
Untested. The idea is to have a full recording of the heating profile from 40C to 285C. We wait at 285C for 60 seconds, then cooldown. This is all done without the print fan. Ideally we would set the highest temperature we plan on ever using. Keep an eye on the printer if you never used 285C before. Although the revo is self-limiting, temperature spikes in the read-out can still ramp up the temperature above 285C and smoke things up. I will be able to use this to check if the PTC is in line with the datasheet. |
That Revo Six heater core/thermistor is dying easily it seems. Now (thermistor) is crazy: https://www.youtube.com/watch?v=os93d-cdMQg
Judging by the previous logs, this one might have had some defects already.
More about my revo heaters. When that one died I ordered another heater. My mistake was to order from the same local company and I got new heater with very similar serial number to old one. That new one also died after month of usage (even without playing with thermal model protection or ~300 degree temperatures). By died I mean temperature wasn't stable at all and fluctuating. Sometimes old thermal protection was kicking in because reported temperature was suddenly like 290 degree. (image shows temperatures for short print on that second revo) Now I got replacements from E3D directly. These new heaters have much later numbers and have blue thermistor wires (previous ones had white thermistor wire), so are clearly a very different batch. Here is a log for a new heater
Obviously later M310 S1 and hotend to 200 degree -> error.
Thanks a lot. Nice to know.
(done) If you can generate graphs for it I would like to see these. Thanks.
This PR introduces a new safety feature in the firmware called “thermal model protection”.
This new safety feature is similar in principle to “thermal runaway”: the aim is to catch unexpected heating issues of any sort (cabling issues, under/overperforming heater, thermistor faults, and others) and stop heating to avoid potential damage. The key difference with thermal runaway is that this feature is designed to respond much quicker. While a “thermal runaway error” can take a minute or more to trigger, the thermal model can report and stop heating in order of mere seconds, with a properly tuned model being able to respond within 10 seconds.
Thermal model protection as implemented in this PR works alongside all existing thermal safety features: mintemp, maxtemp and thermal runaway still take priority over the thermal model and still stop the printer. Thermal model protection will currently first warn, and then pause the print with heaters off. It can be turned on/off/tuned through the new M310 service g-code. It is not automatically enabled, and requires calibration before it can be used (read further to see how this can be done - hopefully everything is completely automatic).

Due to how the model needs to run internally, and for efficiency reasons, this PR includes significant changes to the entire temperature management loop. As such, it’s a substantial change even if the model is kept disabled. Please report any issues! Also for efficiency reasons the thermal model protection is only applied to the hotend; regular “thermal runaway” works well enough for the bed already.
How to start using the thermal model protection
As the name suggests, “thermal model protection” is based around a simulation (aka “model”) of the hotend of the printer. We keep an internal simulation of how the hotend should behave and continuously compare the simulation with the real hotend. When simulation and reality disagree, an error condition is triggered. The thermal model parameters (Power of the heater, Resistance and Capacitance of the heatblock) can be set manually using the new M310 instruction, however this PR also includes a temperature model autocalibration function which only requires an initial power estimate and can set these values for you with no assistance!

In short, if you have MK3/S, you can just:
- Run M310 A to start the autocalibration (requires about 15 minutes). You should watch the printer during calibration.
- Enable the model with M310 S1.
- Save the settings with M500.

At this point the thermal protection is continuously running, and you can print as usual!
You should check if it’s working as intended though. There are two quick ways to do this.
Method 1: Use a wet q-tip (works on all hotends):
[Video attachment: VID_20220723_135758.mp4]
Method 2: Blow hard on the heatblock (might be tricky with a silicone sock):
What to expect when the model is enabled
When something goes wrong the printer can start to beep. On the LCD the message “THERMAL ANOMALY” will be shown. This can happen, for example, if there’s a sudden cold strong draft on the printer. If the draft stops, the beeping will resolve itself. The “THERMAL ANOMALY” message will disappear after 30 seconds.
If the problem is more severe, the printer will warn with a strong, continuous beep and pause the print. All heaters are disabled in this case (both bed and nozzle). “PAUSED THERMAL ERROR” is shown.
The thermal model itself cannot know what is going wrong! You shouldn’t ignore this!
It could mean the heater is not working or that the thermistor is not working: check the temperatures on the LCD! A cable could be partially damaged or a connector is making an unreliable contact. Try to wiggle the extruder cable bundle and see if the temperatures show any artifacts on the LCD. It can also trigger if the fan is not spinning as it used to! After checking the printer, you can attempt to resume the print, but you should watch the printer while doing so and look for potential issues as it heats up again!
If the printer appears to be fine and it is printing fine, but the model is still triggering for no reason, you might need to recalibrate the thermal model. In general the thermal model will need recalibration for any of the following reasons:
You might also need recalibration if the fan has aged and it’s not spinning as it used to (although it’s probably wiser to change the fan at that point).
Additional notes and gotchas of the thermal model
Aside from the above, it’s also important to know that the thermal model needs to calculate the power losses from the heatblock to the environment. To do so, we’re using the temperature sensor currently located on the Einsy board as a proxy. This means that if you place the electronics of the MK3 outside of an enclosure (so that the Einsy cannot see the chamber temperature) the thermal model might not work as expected. You can adjust the temperature difference between einsy/chamber temperature using the M310 T g-code: a difference of <10C compared to the real temperature shouldn’t cause issues. However for larger differences, you might need to have some sort of additional chamber temperature sensor and modify the sources to use that sensor if you still want to enable the thermal model protection.

There’s no way for the autocalibration to know the effective power of the heater since it’s the result of PSU+heater cartridge tolerances. The value of 38W is a good average of the expected performance on the MK3. Because of this, the calculated C/R values are dependent on the current printer and are only loosely comparable between printers. It’s important to note that it doesn’t matter for the thermal protection itself: the power only acts as a linear scale for the model constants and autocalibration will automatically scale these values to match your hotend. These values should only be compared exactly if you can measure (and set) the effective power of your heater using external hardware.
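To see why the power estimate only acts as a scale, here is a small sketch using a textbook first-order R/C heating model, dT/dt = (P*duty - (T - Tamb)/R) / C. This equation and all the numbers below are assumptions chosen for illustration, not values or code taken from the firmware. If the assumed power changes by a factor k, scaling C by k and R by 1/k gives exactly the same predicted temperatures, which is what the autocalibration effectively compensates for.

```python
def simulate(P, R, C, ambient=25.0, duty=1.0, dt=0.27, steps=200):
    """Integrate a first-order R/C heating model with explicit Euler steps."""
    temps = [ambient]
    for _ in range(steps):
        t = temps[-1]
        power_in = P * duty             # heater power in W
        power_out = (t - ambient) / R   # losses to the environment in W
        temps.append(t + (power_in - power_out) / C * dt)
    return temps


def rescale(P, R, C, new_P):
    """Return (P, R, C) adjusted for a different assumed heater power."""
    k = new_P / P
    return new_P, R / k, C * k


# Illustrative constants only: 38 W assumed, then the same hotend "seen" as 40 W.
base = simulate(38.0, 20.0, 10.0)
scaled = simulate(*rescale(38.0, 20.0, 10.0, 40.0))
largest_gap = max(abs(a - b) for a, b in zip(base, scaled))
```

Here largest_gap stays at floating-point noise level, so both parameter sets describe the same hotend behavior even though their C/R values differ.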
Model configuration / custom hotends
The M310 instruction is quite complex due to the number of parameters that can be set. For reference:

At this moment additional details are in the source. There are three main ways to use M310:
- Report the current settings: M310 without parameters (also shown by M501).
- Run the autocalibration: M310 A
- Disable or enable the model: M310 S[0 or 1]
If you know the power of your heater (for example if you’re using a more powerful heater) you must set the correct value with M310 P<power in watts> and only then recalibrate the model. The C/R values will be automatically computed for you.

You might want to change the value of the warning and error thresholds using “M310 W” and “M310 E”. This is needed, for example, when using a silicone sock. The most intuitive way to set the threshold without using external hardware is to use a wet q-tip. After the initial calibration, enable the model with the default values and do the following:
- Set a very low warning threshold with M310 W0.01. This will make it so the printer will start beeping and print the current threshold value on the serial output.
- Set the warning threshold to the value obtained with the wet q-tip: M310 W<value>
- Set the error threshold accordingly: M310 E<value*1.7>
You might need some fine adjustment of the warning value in 0.05 increments if the warning beeps seem to trigger during regular printing. You shouldn’t need to adjust the error threshold, and in general do not set the error value to something which is greater than 2x the value obtained by the wet q-tip method.
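The arithmetic of the q-tip procedure can be sketched as a small helper that turns the observed value into the two g-code lines. The function name and the output formatting are assumptions for illustration; only the 1.7 factor and the "no more than 2x" guideline come from the text above.

```python
def threshold_gcode(qtip_value, error_factor=1.7):
    """Build M310 W/E commands from the value observed with the wet q-tip.

    Hypothetical helper: it only reproduces the arithmetic described in the text.
    """
    if qtip_value <= 0:
        raise ValueError("the observed threshold value must be positive")
    if error_factor > 2.0:
        # The text advises against an error value above 2x the observed value.
        raise ValueError("error threshold should not exceed 2x the observed value")
    warn = qtip_value
    err = qtip_value * error_factor
    return [f"M310 W{warn:.2f}", f"M310 E{err:.2f}"]


commands = threshold_gcode(0.6)
```

For an observed value of 0.6 this yields a warning threshold of 0.60 and an error threshold of 1.02.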
Silicone socks on the hotend
A sock not only reduces temperature losses to the environment, but also shields the heatblock from unexpected drafts. This is why, when using one, we also recommend setting the thresholds to be more sensitive. This provides increased protection at higher temperatures: one of the main reasons to use a sock in the first place.
On average you can use a threshold which is close to half of the value normally used on a hotend without the sock. On a MK3S+sock, the following values are a good match:
M310 W0.6 E1.0
Remember that when adding or removing a sock a model recalibration is needed.
How does it work internally
To keep the simulation numerically stable, as well as fast enough to run in realtime on the AVR, the simulation only keeps track of thermal differences across the hotend, effectively only measuring power gain/losses over very small intervals using a simple dampened R/C circuit. The speed at which the temperature is diverging from the model becomes the threshold value to trigger the problem condition.
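A rough sketch of this idea, assuming a single-node R/C model and a simple low-pass filter on the divergence rate. All names, constants and the filter itself are illustrative assumptions, and the lag buffer and resistance vector mentioned elsewhere in this PR are left out; the firmware implementation differs in detail.

```python
from dataclasses import dataclass


@dataclass
class ThermalModelSketch:
    P: float = 38.0      # heater power in W (default estimate quoted in this PR)
    C: float = 12.0      # heat capacity in J/K (made-up value)
    R: float = 21.0      # thermal resistance to ambient in K/W (made-up value)
    warn: float = 1.2    # warning threshold in K/s (made-up value)
    err: float = 1.74    # error threshold in K/s (made-up value)
    alpha: float = 0.2   # low-pass filter weight (made-up value)
    filtered: float = 0.0

    def expected_delta(self, temp, ambient, duty, dt):
        # Temperature change the model expects over one short interval.
        power_in = self.P * duty
        power_out = (temp - ambient) / self.R
        return (power_in - power_out) / self.C * dt

    def step(self, prev_temp, cur_temp, ambient, duty, dt):
        # Compare the measured change with the expected one, as a rate in K/s.
        expected = self.expected_delta(prev_temp, ambient, duty, dt)
        measured = cur_temp - prev_temp
        rate = abs(measured - expected) / dt
        self.filtered += self.alpha * (rate - self.filtered)
        if self.filtered >= self.err:
            return "error"
        if self.filtered >= self.warn:
            return "warning"
        return "ok"


def run_stuck_thermistor(steps=20, dt=1 / 3.7):
    """Full heater power while the reading never moves from 25C."""
    model = ThermalModelSketch()
    return [model.step(25.0, 25.0, 25.0, 1.0, dt) for _ in range(steps)]
```

With a reading stuck at 25C under full power, the filtered divergence climbs through the warning threshold and then the error threshold within a handful of regulation intervals, while a hotend that follows the model keeps the divergence near zero.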
We can see how this is working internally by the following example: In the next chart a faulty thermistor is reporting wrong values (blue line on the bottom chart) while we try to heat the hotend to 270C. The real temperature is instead reported on the red line. The calculated hotend temperature differentials are shown in blue at the top, with the red line (our threshold condition) showing the difference between the simulated model and the hotend. The black vertical lines show what temperatures the model is expecting to “see” at various times during the simulation so we can compare them to what the faulty thermistor is reporting.
There are many more little details which are required to make the model robust. To reduce the sampling and cumulative error, the entire ADC loop has been changed to sample as fast as the ADC allows and run the entire thermal regulation loop at a fixed rate of ~3.7Hz (which matches the existing MK3S regulation interval). To account for the temperature transport delay from the heater cartridge to the heater there’s an internal lag buffer of ~2s. The Resistance component is not a single value as one would expect, but a vector of 16 elements where each value is sampled during calibration. This complication was required to support custom fans and shrouds that can result in heavily non-linear loss behavior.
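Two of these details can be sketched in a few lines. The lag buffer below delays a signal by roughly 2s at the ~3.7Hz regulation rate (about 7 samples). The resistance lookup assumes the 16-element vector is indexed by fan PWM level; that indexing is an assumption made for this example and is not stated in the text above. All names are hypothetical.

```python
from collections import deque


class LagBuffer:
    """Fixed-length delay line: push() returns the value from `size` samples ago."""

    def __init__(self, delay_s=2.0, rate_hz=3.7, initial=0.0):
        self.size = max(1, round(delay_s * rate_hz))  # about 7 samples
        self.buf = deque([initial] * self.size, maxlen=self.size)

    def push(self, value):
        oldest = self.buf[0]
        self.buf.append(value)  # maxlen drops the oldest entry automatically
        return oldest


def resistance_for_fan(r_table, fan_pwm):
    """Pick one of 16 calibrated R values from an 8-bit fan PWM (assumed indexing)."""
    if len(r_table) != 16:
        raise ValueError("expected a 16-element resistance table")
    if not 0 <= fan_pwm <= 255:
        raise ValueError("fan_pwm must be in 0..255")
    return r_table[fan_pwm * 16 // 256]


lag = LagBuffer()
delayed = [lag.push(float(i)) for i in range(1, 11)]
```

The first seven outputs are the initial value, after which the input reappears delayed by seven samples.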
Power, Resistance, Capacitance and threshold values are the only values which should be required to be changed even for heavily modded printers, with C/R being automatically set for you by running the autocalibration. The temperature transport delay, regulation interval and filtering constants are designed to improve the resolution of the model (so that response time is lower) and shouldn’t require customization unless the hotend design is truly completely different in design from what a V6 hotend design looks like.
What is missing and RFC
This PR currently disables IR_SENSOR_ANALOG which removes checks for a faulty IR filament sensor and board version detection on the MK3S. The ADC behavior was fundamentally incompatible with the new code and will be rewritten.
The thermal model checking, when enabled, makes it impossible to perform a nozzle change. You currently need to disable model checking with M310 S0, change the nozzle the usual way, then switch it back on again with M310 S1. This is the perfect excuse to address #2972 with a decent workflow.

The thermal error pause disables both extruder and bed, but it’s debatable whether disabling the bed does actually improve safety. A resulting potential problem is that an incorrect, unattended thermal error pause will result in a failed print due to detachment (there’s no question that for a real fault disabling both is the correct thing to do!). Comments would be highly appreciated!