Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

digitalWrite is slow #58

Closed
WestfW opened this issue Dec 10, 2019 · 24 comments
Closed

digitalWrite is slow #58

WestfW opened this issue Dec 10, 2019 · 24 comments

Comments

@WestfW
Copy link

WestfW commented Dec 10, 2019

I am disappointed that digitalWrite() is slower on the 4809 than it is on the 328 series.
This is a feeler to see how willing "we" are to let this repository get out-of-sync with the official Arduino repository. I already have digitalWriteFast() for the case where the pin number is constant, but I can also envision an intermediate-speed version even for the non-constant case (that doesn't do the input_pullup check, for instance.)

Or, there might be value to re-doing the whole digitalPinToXXX macros and logic...

(but it depends on how close we want to stay to the source...)

@MCUdude
Copy link
Owner

MCUdude commented Dec 10, 2019

I haven't done any benchmarks. How much faster is the 328 compared to 4809?

I'm not sure how I feel about this really. On one side it's nice to keep compatibility with the official megaAVR core, but on the other side, it sucks that the speed and efficiency aren't what it could be. Maybe some improved non-constant code could be made as a PR to the official repo? What would non-constant code look like, and how much faster would it be?

@WestfW
Copy link
Author

WestfW commented Dec 11, 2019

Hmm. Actually, digitalWrite is NOT so much slower than 328p, with the current cores. They both run a while(1) loop at pretty close to 150kHz. (I could swear it was worse the last time I checked. Maybe that was a PWM pin, or before the removal of the unnecessary pgmspace stuff.)
To make it faster, I had in mind removing the PWM and "input mode" checks, and perhaps refactoring the pin validity check (which isn't very good anyway!)

could be made as a PR to the official repo?

Alas, the official repository has been very resistant to "performance improvements." I guess they prefer stability to speed-ups that no one needs. I can understand that, but I was wondering whether a new 3rd party core for new chips would be a good demo/proving-ground/example.

@MCUdude
Copy link
Owner

MCUdude commented Dec 11, 2019

To make it faster, I had in mind removing the PWM and "input mode" checks, and perhaps refactoring the pin validity check (which isn't very good anyway!)'

Wouldn't this break the "Arduino compatibility" where you pretty much can't screw things up?

@WestfW
Copy link
Author

WestfW commented Dec 11, 2019

I'd propose to add "digitalWriteFast()" with the modifications (and the possibility of single instructions for the constant-based case), while leaving the original digitalWrite() function alone.
The danger would be that Arduino could change digitalWrite() in ways that would make the new function incompatible, so that if you synced the repositories, it would suddenly "not compile"...

@MCUdude
Copy link
Owner

MCUdude commented Dec 11, 2019

I'm definitely up for adding a separate "extended" digitalWriteFast() function while leaving the original alone. If you post your code here I can format it the way I like it and so some testing on various hardware before I add it to the main repo. I'm sure megaTinyCore would benefit from this as well.

@MCUdude
Copy link
Owner

MCUdude commented Feb 28, 2020

Any updates on the digitalWriteFast/digitalReadFast implementation? I'd love to have "fast" digitalRead/digitalWrite implemented!

@WestfW
Copy link
Author

WestfW commented Mar 8, 2020

Hmm; some progress. I made a test shield so I can see if things "seem" to be working.
Currently there are two hangups:

  1. with the current state of compiler optimization, it's hard to create test cases that do "good" comparisons. A normal digitalWrite() loop will get "some" optimization.

  2. Should we do multiple cases of constant/non-constant arguments? digitalWriteFast() could ALWAYS be faster than the equivalent digitalWrite(), just from not doing the PWM or input-pullup checks. But it could get to the point where it's annoyingly inconsistent.

  3. Should it be part of the core, or a library? Currently I have one .h file that works (maybe) for all types of AVR, but it'd be more obvious if megaAVR and normal AVR were separate.

  4. Only works with -flto

@MCUdude
Copy link
Owner

MCUdude commented Mar 8, 2020

It would be great if it were as fast as possible, without any "safety features" like digitalWrite() has. A little documentation is all that's needed.

I'm fine with adding this as a core feature, and not a library. On the other side, will it take PinStatus or a boolean value as argument? Personally, I think PinStatus was and is a dumb idea that breaks lots of existing code.

This core has always LTO enabled, so 4. won't be a problem.

Another, slightly related thing. I'm not sure how It can be achieved, but it would be awesome if there was a way to support the "old" pinMode()/digitalRead()/digitalWrite() functions with the old arguments. However, the existing functions are written in C (and not C++), so the can't be overridden. I've tried to move them around and rename some files, but without luck.

@nerdralph
Copy link

The fancy (and cool) LTO detection code isn't necessary if you use macros like Wiring does. I suggested this while back for MiniCore (but didn't create a separate issue for it).
MCUdude/MiniCore#58

I suggested the same thing on the Arduino forums years ago, to no avail.

Here's the board defs from Wiring.
https://github.com/WiringProject/Wiring/blob/da0b6b9282ae6a3f047235b5a91b6df567fdf22f/framework/hardware/MLAB/ATmegaXX8P/BoardDefs.h

MicroCore already uses the same technique (macros instead of pin mapping arrays), so you shouldn't have any concerns over compatibility, Hans.

If type safety and readability of compiler errors from macros is a concern, extern inline functions may be worth trying.
https://gcc.gnu.org/onlinedocs/gcc/Inline.html

@MCUdude
Copy link
Owner

MCUdude commented Mar 20, 2020

@nerdralph the reason why I've been hesitant about replacing the digitalRead/digitalWrite functions is because I'm afraid people's projects rely on these being slow. However, as you pointed out, there are many things that affect the speed of these functions anyways. Maybe I should try to improve these functions in the next release, and see if I get any issues related to it?

@nerdralph
Copy link

nerdralph commented Mar 20, 2020

@nerdralph the reason why I've been hesitant about replacing the digitalRead/digitalWrite functions is because I'm afraid people's projects rely on these being slow. However, as you pointed out, there are many things that affect the speed of these functions anyways. Maybe I should try to improve these functions in the next release, and see if I get any issues related to it?

Arduino is actually a fork of Wiring, so the fact that the "upstream" repo made the optimizations should give you some confidence. Wiring optimized the digitalRead/Write functions after the Arduino fork, but Arduino did not merge any upstream changes. There seems to be no technical reasons for this; I suspect it was really about Massimo's ego.
https://arduinohistory.github.io/

Secondly, if commonly used Arduino/Wiring libraries relied on digitalWrite/digitalRead being slow, then they would already have issues on the non-AVR Arduino cores like esp8266, STM32, or even the "official" Arduino cores like ARM (Arduino SAM & SAMD boards).

About 5 years ago I got so annoyed with the bloat and lousy coding in the official AVR core that I started working on a small/optimized core for the mX8 series, based on Wiring (but never finished it).

@WestfW
Copy link
Author

WestfW commented Mar 21, 2020

The macros used in wiring are really ugly, hard to maintain, and presumably make the non-fast versions slower, at least eventually (ie they don't scale, especially to the bigger chips where the mappings are increasingly ungrouped.)
A significant goal of my current code is to use whatever pin-mapping features are already in use...

I wish I could talk people into ALWAYS using macros or functions for accessing pin-info arrays instead of occasionally directly accessing g_aPinDescription[] or whatever. You should be able to replace the mapping mechanism without having to change the actual driver code. :-(

@MCUdude
Copy link
Owner

MCUdude commented Mar 23, 2020

I tried to implement the same type of macros as the Wiring core file does, but I did not see a significant bump in speed or reduced compiled size. I've taken a look at the digitalWriteFast library, but this library expects constants, which I can't do because I'm trying to keep this core as compatible as possible.

@WestfW have you been able to create a faster digitalWrite that's fully compatible (no compiler warnings or errors) with the "stock" one?

@nerdralph
Copy link

nerdralph commented Apr 9, 2020

The macros used in wiring are really ugly, hard to maintain, and presumably make the non-fast versions slower, at least eventually (ie they don't scale, especially to the bigger chips where the mappings are increasingly ungrouped.)
A significant goal of my current code is to use whatever pin-mapping features are already in use...

Actually I find the Wiring macros reasonably clean and easy to follow. Here's an example:

#define digitalPinToPort(PIN) \
        ( ((PIN) >= 0  && (PIN) <= 7)  ? 0 : \
        ( ((PIN) >= 8  && (PIN) <= 13) ? 1 : \
        ( ((PIN) >= 14 && (PIN) <= 19) ? 2 : NOT_A_PORT)))

I actually find the pin mapping arrays uglier and harder to maintain. Look at how tricky it is to catch the missing line in the code below, even with the comments.

const uint8_t PROGMEM digital_pin_to_port_PGM[] = {
	PD, // PD0 - D0
	PD, // PD1 - D1
	PD, // PD2 - D2
	PD, // PD3 - D3
	PD, // PD4 - D4
	PD, // PD5 - D5
	PD, // PD6 - D6
	PD, // PD7 - D7
	PB, // PB1 - D9
	PB, // PB2 - D10
	PB, // PB3 - D11
	PB, // PB4 - D12
	PB, // PB5 - D13
	PC, // PC0 - D14 / A0
	PC, // PC1 - D15 / A1
	PC, // PC2 - D16 / A2
	PC, // PC3 - D17 / A3
	PC, // PC4 - D18 / A4
	PC, // PC5 - D19 / A5
	PB, // PB6 - D20 / XTAL1
	PB, // PB7 - D21 / XTAL2
	PC, // PC6 - D22 / RESET
};

Macros can be trickier to debug, so an inline function would probably be even better.

@nerdralph
Copy link

I tried to implement the same type of macros as the Wiring core file does, but I did not see a significant bump in speed or reduced compiled size.

If you tried it with MiniCore, you'd certainly see a difference. With Wiring, a digtitalWrite(13, HIGH) compiles to a single sbi instruction which takes two cycles. For Minicore, there's a few table lookups (a couple ldi instructions followed by lpm), then several more instructions to save SREG, then read/modify/write the port register. Overall it's about 10x slower and 10x the code.

@MCUdude
Copy link
Owner

MCUdude commented Apr 9, 2020

@nerdralph could you upload a code snippet that I can test on? I tried to implement the Wiring macros to MegaCore, but I did not gain any flash savings or speed.

@WestfW
Copy link
Author

WestfW commented Apr 9, 2020

I think the wiring-style macros really fall apart compared to the table-based approach as the mapping between Arduino pin numbers and ports becomes more random and scattered. For example the (beginning of the) Uno WiFi 2 table looks like:

  PC, // 0 PC5/USART1_Rx
  PC, // 1 PC4/USART1_Tx
  PA, // 2 PA0
  PF, // 3 PF5
  PC, // 4 PC6
  PB, // 5 PB2
  PF, // 6 PF4
  PA, // 7 PA1
  PE, // 8 PE3
  PB, // 9 PB0
  PB, // 10 PB1
  PE, // 11 PE0
  PE, // 12 PE1
  PE, // 13 PE2

(and it goes on for 40 "pins.")

Which is not going to make for a pretty "if" statement.
My current "fast" code is here: https://github.com/WestfW/Duino-hacks/blob/master/fastdigitalIO/fastdigitalIO.h
and the mega0 version looks like:

#define digitalWriteFast(pin, val)                                      \
    if (__builtin_constant_p(pin) /*&& __builtin_constant_p(val)*/) {_dwfast(pin, val);} else {NonConstantUsed();}

#if defined(VPORTB)
static inline  __attribute__((always_inline)) void _dwfast(int pin, int val) {
/*
 * Mega-0, Tiny-1 style IOPORTs
 * (assumes VPORTs exist starting at 0 for each PORT structure)
 */
    uint8_t mask = 1<<digital_pin_to_bit_position[pin];
    uint8_t port = digital_pin_to_port[pin];
    VPORT_t *vport;
    if (digital_pin_to_bit_position[0] == 0x99) {  // never true.
        LTO_Not_Enabled();
    }
    if (port > 0x1F) {
        /*
         * "FASTER" port logic is low bits of PORTB struct address
         * spaced every 32 bytes.  PORTA is still 0.
         */
        vport = (VPORT_t *) ((port / 8));
    } else {
        /*
         * Old style port logic is a small integer 0 for PORTA, 1 for PORTB
         * etc. 
         */
        vport = (VPORT_t *)(port * 4);
    }
    if (val) {
        vport->OUT |= mask;
    } else {
        vport->OUT &= ~mask;
    }
}

It's a little over-complicated because of the two styles of the array that have existed, and I've been dragging my feet on the "faster but not a single sbi/cbi" versions.

@MCUdude
Copy link
Owner

MCUdude commented Apr 10, 2020

@WestfW thanks for sharing! I did a test on an ATmega4809 running at 16MHz. I got close to 4 MHz toggle speed!

But why include this code?

if (digital_pin_to_bit_position[0] == 0x99) {  // never true.
        LTO_Not_Enabled();
    }
    if (port > 0x1F) {
        /*
         * "FASTER" port logic is low bits of PORTB struct address
         * spaced every 32 bytes.  PORTA is still 0.
         */
        vport = (VPORT_t *) ((port / 8));
    }

It's never true for any of the megaAVR-0 series, and afaik they all have vports.

@MCUdude
Copy link
Owner

MCUdude commented Apr 10, 2020

I tried to use the same implementation for digitalReadFast:

#define digitalReadFast(pin) \
    if (__builtin_constant_p(pin) /*&& __builtin_constant_p(val)*/) {_digitalReadFast(pin);} else {digitalRead(pin);}

static inline  __attribute__((always_inline)) int _digitalReadFast(uint8_t pin)
{
  // Mega-0, Tiny-1 style IOPORTs
  // Assumes VPORTs exist starting at 0 for each PORT structure
  uint8_t mask = 1 << digital_pin_to_bit_position[pin];
  uint8_t port = digital_pin_to_port[pin];
  VPORT_t *vport;

  // Old style port logic is a small integer 0 for PORTA, 1 for PORTB etc.
  vport = (VPORT_t *)(port * 4);

  // Read pin value from VPORTx.IN register
  return !!(vport->IN & mask);
}

However, I'm not able to do this (compiler is complaining):

digitalWriteFast(LED_BUILTIN, !digitalReadFast(LED_BUILTIN));

But this works:

digitalWriteFast(LED_BUILTIN, !_digitalReadFast(LED_BUILTIN));

Any idea why using the digitalReadFast macro doesn't work?

@WestfW
Copy link
Author

WestfW commented Apr 12, 2020

But why include this code?

if (port > 0x1F)
    /*
     * "FASTER" port logic is low bits of PORTB struct address
     * spaced every 32 bytes.  PORTA is still 0.
     */
    vport = (VPORT_t *) ((port / 8));
  1. Various parts of the code still need access to the non-VPORT PORT structures (input_pullup, for instance.)
  2. Currently, the digitalPinToPort() logic returns a small integer (PA, PB, PC, etc) that is converted to the address of the actual PORT data structure as a separate step (portToPortStruct(port)) I think I was considering a global optimization that replaced the current implementation with something more direct. As you say, it doesn't currently come into play.

@WestfW
Copy link
Author

WestfW commented Apr 12, 2020

#define digitalReadFast(pin) \
    if (__builtin_constant_p(pin) /*&& __builtin_constant_p(val)*/) {_digitalReadFast(pin);} else {digitalRead(pin);}

Ah; probably needs to be a ternary expression in order to return a value. Good job noticing!

(untested...)

#define digitalReadFast(pin) \
    (__builtin_constant_p(pin) ? /*&& __builtin_constant_p(val)*/) _digitalReadFast(pin) : digitalRead(pin))

@MCUdude
Copy link
Owner

MCUdude commented Apr 12, 2020

Ah; probably needs to be a ternary expression in order to return a value. Good job noticing!
(untested...)

That didn't work at all...

/var/folders/6l/ypg6qbw172v1s4vtt6g990tw0000gn/T/arduino_modified_sketch_726559/Blink.ino: In function 'void loop()':
Blink:3:67: error: expected primary-expression before ')' token
     ((__builtin_constant_p(pin) ? /*&& __builtin_constant_p(val)*/) _digitalReadFast(pin); : digitalRead(pin);)
                                                                   ^
/var/folders/6l/ypg6qbw172v1s4vtt6g990tw0000gn/T/arduino_modified_sketch_726559/Blink.ino:52:3: note: in expansion of macro 'digitalReadFast'
   digitalReadFast(LED_BUILTIN);
   ^~~~~~~~~~~~~~~
Blink:3:67: error: expected ':' before ')' token
     ((__builtin_constant_p(pin) ? /*&& __builtin_constant_p(val)*/) _digitalReadFast(pin); : digitalRead(pin);)
                                                                   ^
/var/folders/6l/ypg6qbw172v1s4vtt6g990tw0000gn/T/arduino_modified_sketch_726559/Blink.ino:52:3: note: in expansion of macro 'digitalReadFast'
   digitalReadFast(LED_BUILTIN);
   ^~~~~~~~~~~~~~~
Blink:3:67: error: expected primary-expression before ')' token
     ((__builtin_constant_p(pin) ? /*&& __builtin_constant_p(val)*/) _digitalReadFast(pin); : digitalRead(pin);)
                                                                   ^
/var/folders/6l/ypg6qbw172v1s4vtt6g990tw0000gn/T/arduino_modified_sketch_726559/Blink.ino:52:3: note: in expansion of macro 'digitalReadFast'
   digitalReadFast(LED_BUILTIN);
   ^~~~~~~~~~~~~~~
Blink:3:69: error: expected ')' before '_digitalReadFast'
     ((__builtin_constant_p(pin) ? /*&& __builtin_constant_p(val)*/) _digitalReadFast(pin); : digitalRead(pin);)
                                                                     ^
/var/folders/6l/ypg6qbw172v1s4vtt6g990tw0000gn/T/arduino_modified_sketch_726559/Blink.ino:52:3: note: in expansion of macro 'digitalReadFast'
   digitalReadFast(LED_BUILTIN);
   ^~~~~~~~~~~~~~~
Blink:3:92: error: expected primary-expression before ':' token
     ((__builtin_constant_p(pin) ? /*&& __builtin_constant_p(val)*/) _digitalReadFast(pin); : digitalRead(pin);)
                                                                                            ^
/var/folders/6l/ypg6qbw172v1s4vtt6g990tw0000gn/T/arduino_modified_sketch_726559/Blink.ino:52:3: note: in expansion of macro 'digitalReadFast'
   digitalReadFast(LED_BUILTIN);
   ^~~~~~~~~~~~~~~
Blink:3:111: error: expected primary-expression before ')' token
     ((__builtin_constant_p(pin) ? /*&& __builtin_constant_p(val)*/) _digitalReadFast(pin); : digitalRead(pin);)
                                                                                                               ^
/var/folders/6l/ypg6qbw172v1s4vtt6g990tw0000gn/T/arduino_modified_sketch_726559/Blink.ino:52:3: note: in expansion of macro 'digitalReadFast'
   digitalReadFast(LED_BUILTIN);
   ^~~~~~~~~~~~~~~
exit status 1
expected primary-expression before ')' token

@MCUdude
Copy link
Owner

MCUdude commented Apr 12, 2020

Nevermind, I figured it out:

#define digitalReadFast(pin) \
	(__builtin_constant_p(pin) ? /*&& __builtin_constant_p(val)*/ _drfast(pin) : digitalRead(pin))

@nerdralph
Copy link

2. Currently, the digitalPinToPort() logic returns a small integer (PA, PB, PC, etc) that is converted to the address of the actual PORT data structure as a separate step (portToPortStruct(port))  I think I was considering a global optimization that replaced the current implementation with something more direct.  As you say, it doesn't currently come into play.

I've written some prototype code for the traditional AVRs in which digitalPinToPort returns the address of the PORT. portOutputRegister returns this address cast as volatile uint8_t*, portModeRegister returns the address - 1, and portInputRegister returns the address -2.
This eliminates 3 of the lookup tables.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants