Thermal Model protection #3552

Merged
merged 95 commits into from
Aug 24, 2022

wavexx
Collaborator

@wavexx wavexx commented Aug 1, 2022

This PR introduces a new safety feature in the firmware called “thermal model protection”.

This new safety feature is similar in principle to “thermal runaway”: the aim is to catch unexpected heating issues of any sort (cabling issues, under/overperforming heater, thermistor faults, and others) and stop heating to avoid potential damage. The key difference from thermal runaway is that this feature is designed to respond much more quickly. While a “thermal runaway error” can take a minute or more to trigger, the thermal model can report and stop heating in a matter of seconds, with a properly tuned model being able to respond within 10 seconds.

Thermal model protection as implemented in this PR works alongside all existing thermal safety features: mintemp, maxtemp and thermal runaway still take priority over the thermal model and still stop the printer. Thermal model protection will currently first warn, and then pause the print with heaters off. It can be turned on/off/tuned through the new M310 service g-code. It is not automatically enabled, and requires calibration before it can be used (read further to see how this can be done - hopefully everything is completely automatic).

Because of how the model needs to run internally, and for efficiency reasons, this PR includes significant changes to the entire temperature management loop. As such, it’s a substantial change even if the model is kept disabled. Please report any issues! Also for efficiency reasons, the thermal model protection is only applied to the hotend; regular “thermal runaway” already works well enough for the bed.

How to start using the thermal model protection

As the name suggests, “thermal model protection” is based around a simulation (aka “model”) of the hotend of the printer. We keep an internal simulation of how the hotend should behave and continuously compare the simulation with the real hotend. When simulation and reality disagree, an error condition is triggered. The thermal model parameters (Power of the heater, Resistance and Capacitance of the heatblock) can be set manually using the new M310 instruction. However, this PR also includes a temperature model autocalibration function which only requires an initial power estimate and can set these values for you with no assistance!
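To make the roles of Power, Resistance and Capacitance concrete, here is a minimal sketch of a single-node R/C thermal model integrated with forward Euler steps. It is only an illustration of the general idea, not the firmware code: the function name, the R and C values, the duty cycle and the 0.27s step are assumptions chosen for the example (only the 38W figure comes from this PR).

```python
def simulate_hotend(power_w, r_k_per_w, c_j_per_k, ambient_c, duty, dt_s, steps):
    """Integrate dT/dt = (P*duty - (T - T_ambient)/R) / C with forward Euler.

    Hypothetical helper for illustration; all parameter values are assumptions.
    """
    temp = ambient_c
    trace = []
    for _ in range(steps):
        heat_in = power_w * duty                    # W delivered by the heater
        heat_out = (temp - ambient_c) / r_k_per_w   # W lost to the environment
        temp += (heat_in - heat_out) / c_j_per_k * dt_s
        trace.append(temp)
    return trace


# Assumed values: 38 W heater, R = 20 K/W, C = 12 J/K, 25 C ambient, 25% duty.
trace = simulate_hotend(38.0, 20.0, 12.0, 25.0, 0.25, 0.27, 20000)
steady_state_c = 25.0 + 38.0 * 0.25 * 20.0  # ambient + P * duty * R
```

In this simplified model the steady-state temperature is set by P and R, while C only controls how quickly the hotend gets there. A real hotend that heats noticeably faster or slower than the simulated trace is what the protection is meant to flag.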

In short, if you have MK3/S, you can just:

  1. Build/flash this PR
  2. Issue M310 A to start the autocalibration (requires about 15 minutes). You should watch the printer during calibration.
  3. Enable the model checking with M310 S1
  4. Save your calibration settings for the future using M500.

At this point the thermal protection is continuously running, and you can print as usual!
You should check if it’s working as intended though. There are two quick ways to do this.

Method 1: Use a wet q-tip (works on all hotends):

  • Set nozzle temperature to 210C
  • Wait until temperature is stable
  • Put a wet q-tip directly on the nozzle
  • You should hear a beep and an LCD message within 10 seconds
[Video: VID_20220723_135758.mp4]

Method 2: Blow hard on the heatblock (might be tricky with a silicone sock):

  • Set nozzle temperature to 210C
  • Wait until temperature is stable
  • Blow hard on the heatblock with all your strength for about 10-20 seconds
  • You should hear a beep and an LCD message as above

What to expect when the model is enabled

When something goes wrong the printer can start to beep. On the LCD the message “THERMAL ANOMALY” will be shown. This can happen, for example, if there’s a sudden cold strong draft on the printer. If the draft stops, the beeping will resolve itself. The “THERMAL ANOMALY” message will disappear after 30 seconds.

If the problem is more severe, the printer will warn with a strong, continuous beep and pause the print. All heaters are disabled in this case (both bed and nozzle). “PAUSED THERMAL ERROR” is shown.

The thermal model itself cannot know what is going wrong! You shouldn’t ignore this!

It could mean the heater is not working or that the thermistor is not working: check the temperatures on the LCD! A cable could be partially damaged or a connector is making an unreliable contact. Try to wiggle the extruder cable bundle and see if the temperatures show any artifacts on the LCD. It can also trigger if the fan is not spinning as it used to! After checking the printer, you can attempt to resume the print, but you should watch the printer while doing so and look for potential issues as it heats up again!

If the printer appears to be fine and it is printing fine, but the model is still triggering for no reason, you might need to recalibrate the thermal model. In general the thermal model will need recalibration for any of the following reasons:

  • You changed the hotend in any way (heatblock, thermistor or heater cartridge).
  • You added or removed a silicone sock.
  • You changed the print fan or fan shroud.
  • You changed the whole extruder with a different assembly.

You might also need recalibration if the fan has aged and it’s not spinning as it used to (although it’s probably wiser to change the fan at that point).

Additional notes and gotchas of the thermal model

Aside from the above, it’s also important to know that the thermal model needs to calculate the power losses from the heatblock to the environment. To do so, we’re using the temperature sensor currently located on the Einsy board as a proxy. This means that if the printer is in an enclosure but the MK3 electronics are placed outside of it (so that the Einsy cannot see the chamber temperature), the thermal model might not work as expected. You can adjust the temperature difference between the Einsy and the chamber using the M310 T g-code: a difference of <10C compared to the real temperature shouldn’t cause issues. For larger differences, however, you might need some sort of additional chamber temperature sensor, and to modify the sources to use that sensor, if you still want to enable the thermal model protection.

There’s no way for the autocalibration to know the effective power of the heater, since it’s the result of PSU and heater cartridge tolerances. The value of 38W is a good average of the expected performance on the MK3. Because of this, the calculated C/R values depend on the individual printer and are only loosely comparable between printers. Note that this doesn’t matter for the thermal protection itself: the power only acts as a linear scale for the model constants, and autocalibration will automatically scale these values to match your hotend. These values should only be compared exactly if you can measure (and set) the effective power of your heater using external hardware.
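The "linear scale" remark can be checked numerically on a simplified single-R model: if the assumed power is multiplied by k, then multiplying C by k and dividing R by k yields exactly the same temperature curve. The sketch below assumes that simplified model and hypothetical R/C values; the firmware's actual model is more elaborate, so treat this only as an illustration of why the exact power estimate does not affect the protection.

```python
def heat_curve(power_w, r_k_per_w, c_j_per_k, ambient_c=25.0, dt_s=0.27, steps=200):
    """Full-power heating curve of a single-node R/C model (illustrative only)."""
    temp = ambient_c
    curve = []
    for _ in range(steps):
        temp += (power_w - (temp - ambient_c) / r_k_per_w) / c_j_per_k * dt_s
        curve.append(temp)
    return curve


def rescale(power_w, r_k_per_w, c_j_per_k, new_power_w):
    """Rescale R and C so the model behaves identically under a new power estimate."""
    k = new_power_w / power_w
    return new_power_w, r_k_per_w / k, c_j_per_k * k


# Hypothetical constants calibrated with P=38 W, re-expressed for P=40 W.
base = heat_curve(38.0, 20.0, 12.0)
scaled = heat_curve(*rescale(38.0, 20.0, 12.0, 40.0))
max_diff = max(abs(a - b) for a, b in zip(base, scaled))
```

Here `max_diff` stays at floating-point noise level, which is why two printers with different assumed power can have different C/R values and still be protected equally well.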

Model configuration / custom hotends

The M310 instruction is quite complex due to the number of parameters that can be set. For reference:

   /*!
    ### M310 - Temperature model settings
    #### Usage

        M310                                           ; report values
        M310 [ I ] [ R ]                               ; set resistance at index
        M310 [ P ] [ C ] [ S ] [ B ] [ E ] [ W ] [ T ] ; other parameters
        M310 [ A ]                                     ; autotune

    #### Parameters

    - `I` - resistance index position (0-15)
    - `R` - resistance value at index (K/W; requires `I`)
    - `P` - power (W)
    - `C` - capacitance (J/K)
    - `S` - set 0=disable 1=enable
    - `B` - beep and warn when reaching warning threshold 0=disable 1=enable (default: 1)
    - `E` - error threshold (K/s; default in variant)
    - `W` - warning threshold (K/s; default in variant)
    - `T` - ambient temperature correction (K; default in variant)
    - `A` - autotune C+R values
    */

At this moment additional details are in the source. There are three main ways to use M310:

  • Check current values: use M310 without parameters (also shown by M501).
  • Run model autocalibration: M310 A
  • Enable/disable the model checking: M310 S[0 or 1]

If you know the power of your heater (for example if you’re using a more powerful heater) you must set the correct value with M310 P<power in watts> and only then recalibrate the model. The C/R values will be automatically computed for you.

You might want to change the value of the warning and error thresholds using “M310 W” and “M310 E”. This is needed, for example, when using a silicone sock. The most intuitive way to set the thresholds without using external hardware is to use a wet q-tip. After the initial calibration, enable the model with the default values and do the following:

  1. Prepare some water and a q-tip
  2. Set the warning value to a very low value using M310 W0.01. This will make it so the printer will start beeping and print the current threshold value on the serial output.
  3. Wet the q-tip and immediately place it on the nozzle.
  4. Wait 10 seconds (5 if you’re using a sock), watching as the threshold value increases on the serial output
  5. Note down the threshold as shown on the printer on the 10 seconds mark (5 with a sock or other strong insulator).
  6. Set the warning value with M310 W<value>
  7. Set the error threshold to 1.7 times the warning value: M310 E<value*1.7>

You might need some fine adjustment of the warning value in 0.05 increments if the warning beeps seem to trigger during regular printing. You shouldn’t need to adjust the error threshold, and in general do not set the error value to something which is greater than 2x the value obtained by the wet q-tip method.
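The arithmetic of the last two steps can be sketched as a small helper that turns the threshold noted during the q-tip test into the two M310 commands. The helper name and the two-decimal formatting are assumptions made for this example; the 1.7 factor and the 2x upper limit are the ones described above.

```python
def threshold_commands(measured_warning, error_factor=1.7, max_error_factor=2.0):
    """Build M310 W/E commands from the value noted during the wet q-tip test.

    Hypothetical helper: name and number formatting are assumptions.
    """
    if measured_warning <= 0:
        raise ValueError("the measured threshold must be positive")
    if error_factor > max_error_factor:
        raise ValueError("error threshold should not exceed 2x the measured value")
    error = measured_warning * error_factor
    return [f"M310 W{measured_warning:.2f}", f"M310 E{error:.2f}"]


# Example with an assumed reading of 1.2 K/s at the 10 second mark.
commands = threshold_commands(1.2)
```

The resulting lines can be sent over serial as-is, followed by M500 to keep them across restarts.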

Silicone socks on the hotend

A sock not only reduces temperature losses to the environment, but also shields the heatblock from unexpected drafts. This is why, when using one, we also recommend setting the thresholds to be more sensitive. This provides increased protection at higher temperatures: one of the main reasons to use a sock in the first place.

On average you can use a threshold which is close to half of the value normally used on a hotend without the sock. On a MK3S+sock, the following values are a good match:

    M310 W0.6 E1.0

Remember that when adding or removing a sock a model recalibration is needed.

How does it work internally

To keep the simulation numerically stable, as well as fast enough to run in realtime on the AVR, the simulation only keeps track of thermal differences across the hotend, effectively only measuring power gain/losses over very small intervals using a simple damped R/C circuit. The speed at which the measured temperature diverges from the model is the value compared against the thresholds to trigger the problem condition.
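As a rough illustration of thresholding on the divergence speed, the sketch below compares per-step temperature changes of the measurement and of the model, low-pass filters the difference, and classifies each step against warning and error thresholds in K/s. This is not the firmware algorithm: the function name, the exponential smoothing and all numeric values are assumptions for the example.

```python
def classify_divergence(measured_c, expected_c, dt_s,
                        warn_k_per_s, err_k_per_s, smoothing=0.1):
    """Label each step 'ok', 'warning' or 'error' from the filtered divergence rate."""
    states = []
    filtered = 0.0
    for i in range(1, len(measured_c)):
        measured_delta = measured_c[i] - measured_c[i - 1]
        expected_delta = expected_c[i] - expected_c[i - 1]
        # Divergence speed in K/s between reality and the model for this step.
        rate = abs(measured_delta - expected_delta) / dt_s
        # Simple exponential smoothing so a single noisy sample does not trigger.
        filtered += smoothing * (rate - filtered)
        if filtered > err_k_per_s:
            states.append("error")
        elif filtered > warn_k_per_s:
            states.append("warning")
        else:
            states.append("ok")
    return states


# Model expects a steady 210 C; the measurement drops 1 C per step (assumed data).
expected = [210.0] * 21
measured = [210.0 - i for i in range(21)]
states = classify_divergence(measured, expected, 0.27, 1.2, 2.04)
```

With these assumed numbers the sequence goes from "ok" to "warning" after four steps and to "error" after eight, mirroring the warn-first, pause-later behaviour described earlier.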

The following example shows how this works internally. In the chart below, a faulty thermistor is reporting wrong values (blue line on the bottom chart) while we try to heat the hotend to 270C. The real temperature is shown by the red line. The calculated hotend temperature differentials are shown in blue at the top, with the red line (our threshold condition) showing the difference between the simulated model and the hotend. The black vertical lines show what temperatures the model is expecting to “see” at various times during the simulation, so we can compare them to what the faulty thermistor is reporting.

[Image: thermal-model-sim]

There are many more little details which are required to make the model robust. To reduce the sampling and cumulative error, the entire ADC loop has been changed to sample as fast as the ADC allows and run the entire thermal regulation loop at a fixed rate of ~3.7Hz (which matches the existing MK3S regulation interval). To account for the temperature transport delay from the heater cartridge to the thermistor there’s an internal lag buffer of ~2s. The Resistance component is not a single value as one would expect, but a vector of 16 elements where each value is sampled during calibration. This complication was required to support custom fans and shrouds that can result in heavily non-linear loss behavior.
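For a sense of scale, a ~2s lag at a ~3.7Hz regulation rate corresponds to about 7 samples. The sketch below shows one way such a delay line could look, using a fixed-length deque; the class and function names are made up for this example and which quantity the firmware actually delays is not shown here.

```python
from collections import deque


def lag_samples(lag_s, rate_hz):
    """Number of regulation steps that cover the given transport delay."""
    return round(lag_s * rate_hz)


class LagBuffer:
    """Fixed-length delay line: push() returns the value pushed `length` calls ago."""

    def __init__(self, length, fill=0.0):
        if length < 1:
            raise ValueError("length must be at least 1")
        self._buf = deque([fill] * length, maxlen=length)

    def push(self, value):
        oldest = self._buf[0]
        self._buf.append(value)  # maxlen drops the oldest entry automatically
        return oldest


n = lag_samples(2.0, 3.7)
buf = LagBuffer(n)
delayed = [buf.push(float(i)) for i in range(1, 11)]
```

The first `n` outputs are the fill value, after which each input reappears `n` steps later, so the model only "sees" a heater change once it could plausibly have reached the sensor.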

Power, Resistance, Capacitance and threshold values are the only values which should need to be changed even for heavily modded printers, with C/R being automatically set for you by running the autocalibration. The temperature transport delay, regulation interval and filtering constants are designed to improve the resolution of the model (so that response time is lower) and shouldn’t require customization unless the hotend is fundamentally different in design from a V6 hotend.

What is missing and RFC

This PR currently disables IR_SENSOR_ANALOG which removes checks for a faulty IR filament sensor and board version detection on the MK3S. The ADC behavior was fundamentally incompatible with the new code and will be rewritten.

The thermal model checking, when enabled, makes it impossible to perform a nozzle change. You currently need to disable model checking with M310 S0, change the nozzle the usual way, then switch it back on again with M310 S1. This is the perfect excuse to address #2972 with a decent workflow.

The thermal error pause disables both extruder and bed, but it’s debatable whether disabling the bed does actually improve safety. A resulting potential problem is that an incorrect, unattended thermal error pause will result in a failed print due to detachment (there’s no question that for a real fault disabling both is the correct thing to do!). Comments would be highly appreciated!

wavexx added 30 commits July 18, 2022 17:53
This is already reimplemented in the newer fsensor implementation
Setting pullups on the ADC should trigger the model-based check, making
this redundant and wasteful.

Keep the DEBUG_PULLUP_CRASH menu so that we can verify this behavior in
the future.
Disable the interrupt source instead, which avoids the added latency of
reentering the isr in the first place.
The current code assumes that values are directly comparable
Read from ADC as fast as possible using the ADC interrupt to get
more accurate instantaneous readings.

Decouple the temperature_isr from the adc reading interval, so that
the two can run independently for future use.
Isolate the PWM management into soft_pwm_core
Use a new low-priority "temp_mgr_isr" running at constant rate for
temperature management.

This is done so that the temperatures are sampled at a constant
independent interval *and* with reduced jitter. Likewise for actual
PID management.

This will require further adjustment for the min/max/runaway display,
which cannot be done directly into this function anymore (the code will
need to disable heaters but flag for display to be handled in
manage_heaters).
*_temperature_raw: buffer for the ADC ISR (read by temp ISR)
*_temperature_isr: latest temperatures for PID regulation (copied from
  _raw values)
*_temperature: latest temperature for user code

The flow:
  - ADC ISR (async)
    - perform oversampling
    - call ADC callback: copy to _raw (async)
  - temp ISR (timer)
    - convert to C (_isr values)
  - user code (async)
      - check temp_meas_ready
      - call updateTemperature()
        - copy from _isr to current
        - synchronize target temperatures

This removes PINDA value averaging (if needed, should be re-implemented
by averaging in user code where needed)
Split off setIsrTargetTemperatures and temp_mgr_pid() so that we can
propagate the target temperatures instantaneously down the pid/pwm chain
during emergencies.

This reduces the amount of code in disable_heater() itself, making it
a bit more maintainable.

The bed still isn't disabled on-the-spot yet, due to the heatbed_pwm
automaton. To be improved later.
- Flag the error condition from the temp_mgr_isr
- Handle the error state from the user code

Currently only handles min/maxtemp and relays the error to the original
handler (which is a poor fit for the current design).
As for min/maxtemp, flag the error in the isr, then handle it in the
user code calling the original handler.
Avoid running the user-level error handlers too fast.
@arekm
Contributor

arekm commented Aug 26, 2022

That didn't work for me:
https://www.youtube.com/watch?v=zvcG6T8V45w (HD still processing)

Note: using revo 6 instead of stock hotend.

M310 A, M310 S1, M500 were done.

@wavexx
Collaborator Author

wavexx commented Aug 27, 2022

Can you post the output of M310 ?

After this, can you do the following: let the hotend cool down below 40C, then issue the following (saving all the output):

M155 S1 C3
D70 I1
M310 A

This will generate a lot of output so we can have a look at why it fails.
If you can have timestamps, please turn them on too.

@arekm
Contributor

arekm commented Aug 27, 2022

Ok, flashed again, used M310 P40 this time (on the previous attempt I left the default, which was 38 I think), did M310 A, M310 S1, M500, then tried to set 210 for the nozzle: the screen flashed, the beeper beeped, but... the heater was not disabled and it reached 210 and stayed there (until I chose cooldown from the menu).

Then I've tried to get the same result of heater not being disabled by setting 210 few times but wasn't able to reproduce - lcd flash, beep but heater was always disabled in my few tests (maybe it only happened for first time after flashing and training?).

Then did what you asked.
serial.log.txt.gz

Unfortunately octoprint disconnects from printer on error so only initial training and then debug training was logged. My multiple "210 tests" were not recorded.

btw. what's meaning of these new icons in left bottom corner?

@wavexx
Collaborator Author

wavexx commented Aug 27, 2022

It's nice to actually test this on the Revo, I didn't have a chance to do so yet. So far we checked on the classic v6, with the copper heatblock and some other ones, various heatbreaks, with socks, on the dragon, and I do expect this to work almost everywhere with minor tweaks.

D70 outputs the raw thermistor/heater values so it's possible to have a look at the entire temperature regulation during regular usage. This is what is shown during calibration after you enabled D70:

[Image: 2022-08-28T000732]

The autocalibration has 2 stages of heating from 50C to 230C. If you look at the first one, the heater is kept at full power, however it's struggling to reach a steady 230C. However at some point the temperature shoots to 250C, with the PID regulator immediately killing power. This lasts for about 6 seconds, before it oscillates somewhat randomly before the cooling phase starts.

The thermistor might be potentially underreporting the temperature at that point, although the only way to be sure there would be to use a second measurement. The sudden bumps of ~40C just before the cooldown are also unrealistic. I doubt the revo can change temperature that fast (in fact, the revo cools down at a normal pace just after), so it couldn't have jumped by 40C in either direction so fast.

Did you notice if the temperature on the LCD was oscillating?

On the second heating run, everything seems to behave correctly.

Any chance the connector is loose or there are cabling issues?
That seems exactly the kind of subtle issue the new checker is designed to catch.

The autocalibration is designed to be robust, so it should still produce consistent calibration values in this case. However a single jump of 40C during print will definitely be sufficient to trigger an error condition.

If you can re-run the same procedure 3 times (same g-code sequence for the full log) we can see if this happens consistently.
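For a quick look at a saved serial log without any charting, a small script along these lines can list implausible jumps between consecutive hotend readings. It is a sketch under stated assumptions: it only looks at Marlin-style temperature report lines containing `T:<value>` (as produced by the M155 auto-report), it does not parse the D70 debug output, and the sample lines and the 10C limit are invented for the example.

```python
import re

# Matches the hotend reading in a report such as "T:230.1 /230.0 B:24.9 /0.0".
# Assumed line format; D70 debug lines are deliberately not handled here.
_HOTEND = re.compile(r"(?:^|\s)T:(-?\d+(?:\.\d+)?)")


def hotend_temps(lines):
    """Extract hotend temperatures from lines that contain a T: report."""
    temps = []
    for line in lines:
        match = _HOTEND.search(line)
        if match:
            temps.append(float(match.group(1)))
    return temps


def find_jumps(temps, max_step_c):
    """Return (index, previous, current) for steps larger than max_step_c."""
    return [
        (i, temps[i - 1], temps[i])
        for i in range(1, len(temps))
        if abs(temps[i] - temps[i - 1]) > max_step_c
    ]


# Invented sample data, one report per second.
log = [
    "T:229.8 /230.0 B:24.9 /0.0",
    "ok T:230.1 /230.0 B:24.9 /0.0",
    "echo: busy processing",
    "T:250.4 /230.0 B:24.9 /0.0",
    "T:230.2 /230.0 B:24.9 /0.0",
]
jumps = find_jumps(hotend_temps(log), max_step_c=10.0)
```

With one report per second, any entry in `jumps` means the reading moved by more than 10C in a second, which a heatblock cannot realistically do.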

@arekm
Contributor

arekm commented Aug 27, 2022

Could you share a script that parses log and produces these charts?

(Generally revo is not behaving as typical heatblock - https://www.youtube.com/watch?v=DdZlBiFajWE&t=771s but I guess that doesn't matter here)

I'll do more testing in few hours.

@wavexx
Collaborator Author

wavexx commented Aug 28, 2022

I'm polishing a set of scripts to debug thermal issues to be placed in tools/ so that these can be checked automatically, but it's going to take a few days.

I'm aware of the revo heater configuration, however this shouldn't generate this sort of issues. The temperature gradient in the heatblock is mostly irrelevant for this.

What could matter is the PTC element, but only at higher temperatures (we can actually test this later by running the autocalibration at higher temps). The effect we see here, inconsistent readings while heating, seems more in line with bad temperature readings.

@leptun
Collaborator

leptun commented Aug 28, 2022

Just saying, the temperature oscillations during heatup look oddly similar to this: https://youtu.be/DdZlBiFajWE?t=1157

@arekm
Contributor

arekm commented Aug 28, 2022

Repeated
M310
M155 S1 C3
D70 I1
M310 A

three times.

serial.log.txt.gz

Second and third also on video (youtube still processing)

Second:
https://www.youtube.com/watch?v=aZvm62f8cgY

Third:
https://www.youtube.com/watch?v=EOwKdyAcqGI

After that M310 S1, nozzle to 210 and unfortunately beep and failure.

Thermistor board connector on revo doesn't have a latch like the one on stock prusa but it wasn't loose when I mounted it. I'll try to replace connector with the same as used on prusa and see if that makes a difference.

Wasn't looking at previous time at LCD, so can't tell. This time there are videos.

@wavexx
Collaborator Author

wavexx commented Aug 28, 2022

I'll try to make a quick dumber parser for the log so you can play with this locally while I finish the proper tool. Meanwhile, here's the output:

[Image: 2022-08-28T125334]

Run 1 and 2 look perfect. The third run has a weird spike. Zoomed in:

[Image: 2022-08-28T125355]

[Image: 2022-08-28T125808]

The thermistor unexpectedly dips in temperature this time, although with a "smaller" 15C jump.

@wavexx
Collaborator Author

wavexx commented Aug 28, 2022

Aside from the temperature problem, the PTC heating element does change things up significantly enough for the simulation as it is.

A regular heater exhibits almost-perfect linear heating. You can see on the Revo that the heat-up step is starting to look more bell-shaped just by eye, and this matters enough to generate a false trigger at around 200C so the PTC needs to be taken into account here. This is neatly documented here:

https://e3d-online.zendesk.com/hc/en-us/articles/5926520938013-Revo-Six-Datasheet

but I suspect the actual coefficient will have tolerances that would further need to be calibrated.
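To illustrate why a PTC element breaks the constant-power assumption, here is a minimal sketch using a linear resistance-vs-temperature model at a fixed supply voltage. Every number in it (24V supply, 14.4 ohm at 25C, 0.004/K coefficient) is a hypothetical placeholder and not taken from the Revo datasheet; a real PTC curve is also unlikely to be this linear.

```python
def ptc_resistance(temp_c, r_ref_ohm=14.4, alpha_per_k=0.004, t_ref_c=25.0):
    """Linear PTC approximation: resistance grows with temperature (assumed values)."""
    return r_ref_ohm * (1.0 + alpha_per_k * (temp_c - t_ref_c))


def ptc_power(temp_c, supply_v=24.0):
    """Heater power at full duty for the assumed PTC element: P = V^2 / R(T)."""
    return supply_v ** 2 / ptc_resistance(temp_c)


cold_power_w = ptc_power(25.0)
hot_power_w = ptc_power(200.0)
```

With these placeholder values the available power drops from 40W when cold to under 24W at 200C, so a model calibrated with a single fixed P would expect faster heating at high temperatures than the hotend can deliver.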

@wavexx
Collaborator Author

wavexx commented Aug 28, 2022

@arekm could you do the following starting from a cooled down heater (<40C) and record the entire session:

M104 S0
M109 R40
M310 S0
M155 S1 C3
D70 I1
M104 S285
M109 R285
G4 S60
M104 S0
M109 R40
G4 S60

Untested. The idea is to have a full recording of the heating profile from 40C to 285C. We wait at 285C for 60 seconds, then cooldown. This is all done without the print fan. Ideally we would set the highest temperature we plan on ever using.

Keep an eye on the printer if you never used 285C before. Although the revo is self-limiting, temperature spikes in the read-out can still ramp up the temperature above 285C and smoke things up.

I will be able to use this to check if the PTC is in line with the datasheet.

@arekm
Contributor

arekm commented Aug 28, 2022

@arekm
Contributor

arekm commented Aug 30, 2022

That Revo Six heater core/thermistor is dying easily it seems. Now (thermistor) is crazy:

https://www.youtube.com/watch?v=os93d-cdMQg
https://www.youtube.com/watch?v=T38_olPrVZs

@wavexx
Collaborator Author

wavexx commented Aug 30, 2022

Judging by the previous logs, this one might have had some defects already.

@arekm
Contributor

arekm commented Sep 30, 2022

More about my revo heaters. When that one died I ordered another heater.

My mistake was to order from the same local company, and I got a new heater with a serial number very similar to the old one. That new one also died after a month of usage (even without playing with thermal model protection or ~300 degree temperatures). By died I mean the temperature wasn't stable at all and kept fluctuating. Sometimes the old thermal protection was kicking in because the reported temperature was suddenly something like 290 degrees.

[Image: revo24-print]

(image shows temperatures for short print on that second revo)

Now I got replacements from E3D directly. These new heaters have much later numbers and have blue thermistor wires (previous ones had white thermistor wire), so are clearly a very different batch.

Here is a log for a new heater

M310
M155 S1 C3
D70 I1
M310 A

Obviously later M310 S1 and hotend to 200 degree -> error.

newrevo.log.txt.gz

@wavexx
Collaborator Author

wavexx commented Sep 30, 2022

Thanks a lot. Nice to know.
Can you post this along with the attachment to #3636 so we have all in one place?

@arekm
Contributor

arekm commented Sep 30, 2022

(done)

If you can generate graphs for it I would like to see these. Thanks.


5 participants