Compaction crash loops and data loss on Raspberry Pi 3 B+ under minimal load #11339
Comments
I don't know if it is supposed to work but I can definitely reproduce this issue on my Raspi 3B+. |
I've given up trying to keep it running. My planned next steps, whenever I have time, are:
|
If you're willing to build InfluxDB yourself, you could try the branch in PR #12362 |
My RPI 3B+ with InfluxDB has just started being hit by this same issue, ironically also while collecting environmental data. I haven't started losing data yet, but I'm getting the endless crash loops filling up syslog with the same memory allocation and compaction issue(s). #12362 also mentions this issue in relation to mmap on 32-bit platforms, due to the limited allocatable memory. I too am debating my options... it runs on an RPI because I don't want a hot and power-hungry server for what should be a simple function! |
Thanks for the corroboration! FWIW, I no longer think this is due to the 32-bit platform / max memory problem. I don't remember the details, but I think I saw a DB get past that size on my RPi, and also see this crash start well before anything should have been hitting that limit. I've seen evidence of disk write failures or slowness possibly causing the initial problem (files quarantined with It seems like we're stuck until one of us captures the full state of one of these failing deployments, and someone who knows how to parse that state has a look… |
I do find it seems to go through fits and starts... it runs fine for several days, then suddenly services start dropping offline. I'm actually thinking to modify the service file to limit scheduler priority (via chrt) to ensure influx cannot consume all the resources of the system, as I get the feeling the load average spiking as high as it does cannot be helping the situation, since the CPU on the Pi tends to become near-unusable under high load. I have also been finding that under high I/O caused by Influx, that I start getting issues with corruption in some places on the SD card... My influx db sits on an external USB3-based SSD (an old Intel X25-M 74GB SSD, so by no means fast, but definitely a million times faster than the SD card!) - so I don't think the disk I/O is an issue in my case. Perhaps if you are finding corruption and influx is sitting on the SD, it could be the same issue as me with the high CPU... but perhaps also give a different SD a try? SD cards aren't really designed for heavy write activity, as they don't have any smarts to clean up deleted files with trim etc... I have also seen many cases of failed SD cards on thin clients due to antivirus definitions, so I know SD card failures are definitely not unheard of... |
Good to know you are seeing this, or something similar, on an SSD. I switched influx to a USB thumb drive in my Pi when I suspected an SD card IO issue. It should be an order of magnitude faster than the SD card, but I saw the issue similarly on both. |
Hi, just to share my experiences for anybody affected by the issue. I had a database that was about 1.5 GB in size, and influxdb would keep crashing on me.
The swap space can be disabled or reduced after that. After reaching about 900 MB and starting to swap, the memory usage actually dropped back down and InfluxDB is using only 200MB. All my data looks like it has been retained, at least from before InfluxDB started crapping out two weeks ago. |
@alexpaxton I see you are one of the programmers who have contributed the most to the InfluxDB project. Sorry to drag you into the conversation here, but I would love to see some more attention given to this post. This is a similar issue: #6975 It would be good to know whether we can stick with influx on our RPIs or not. Moving to another engine would be the worst outcome for a lot of us, and I feel a good number of people use influx on RPIs for IoT projects. Please let us know, and thank you very much in advance. |
@ryan-williams , how large is each of your uncompressed shard groups under the default retention policy of 168 hours? Using mmapped TSM files, compaction jobs can grab a large chunk of the process address space because they are writing out new TSM files (presumably mmapped) while reading other TSM files (also mmapped). Just ballparking, but you end up needing 2x the mmapped address space: one for TSM inputs and one for TSM outputs. So if your shard group duration is large, resulting in a large file size, you can hit your mmap limit during a compaction job when otherwise you'd have enough headroom in the process address space. If this is the issue you are encountering, I think #12362 will definitely help you. You might also, or alternatively, need to change the default retention policy so that the shard group duration is much smaller, so your compaction jobs handle a much smaller amount of data at a time. |
@fluffynukeit, what do you think is the uncompressed shard group size limit at a 168-hour RP? |
@pinkynrg I think it will depend on how many TSM files you have and how much data you collect in that 168 hours. By default, TSMs all get mmapped to the process address space. So you could have a situation where a compaction job for 168 hours RP works fine for a nearly empty database, but eventually the size of all the TSM files could be large enough that the compaction job fails because there is not enough address space to do it. Avoiding so much mmapping was one of the motivating reasons for #12362. My use case is that I wanted to keep my data forever on a device with a projected 10 year lifespan. Even if I made my shard group duration a tiny 1 hour (making compaction jobs very small), I would still eventually hit the address space limit as my database filled up with more and more TSM files. |
I would like to collect ~5k tags every minute for 90 weeks. I would then route my queries to best bucket (minute, hour, day), depending on the time delta of the query. I was waiting to size my shards in the best possible way. Right now they are all 7d long. |
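As a rough sizing aid for the workload described above, the sketch below counts raw points for 5,000 series written once per minute. It only counts points. The `bytes_per_point` argument is a placeholder you would have to measure on your own data, since TSM compression makes the on-disk size hard to predict. The function names are made up for illustration.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes

def points_written(series: int, weeks: int, writes_per_minute: int = 1) -> int:
    """Raw points written by `series` series over `weeks` weeks."""
    return series * writes_per_minute * MINUTES_PER_WEEK * weeks

def estimated_mb(points: int, bytes_per_point: float) -> float:
    """Size estimate in MB. bytes_per_point is an assumption to be measured."""
    return points * bytes_per_point / 1_000_000

per_shard_group = points_written(series=5_000, weeks=1)  # one 7d shard group
total = points_written(series=5_000, weeks=90)

print(per_shard_group)  # 50400000
print(total)            # 4536000000
```

What matters for compaction is the per-shard-group figure, not the 90-week total, because each compaction job works on one shard group at a time.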
What matters is the MB size of the TSM files for each shard group. I'd guess that 2x this size is the upper bound of address space needed for a compaction job. TSM size is not easy to predict because the data get compressed, so you just have to test it out and measure it. In my case, if you look at the logs on #12362, the uncompressed shard group is about ~400-450 MB. So let's assume a compaction job requires 900 MB of address space. With an empty DB, there are no mmapped TSM files, so your process address space is close to empty, and there is much more than 900 MB free. The compaction job runs. Over time, the older shards will stop getting compacted, but they will still take up address space. Let's say you have 15 shard groups each with 200 MB in them, plus an uncompacted hot shard of 450 MB. That's 3.45GB of address space taken up by database data. If your user-space address limit is 3.6 GB, the next compaction job will likely fail because there's not enough free address space to run it. It would need an additional 450 MB to mmap the compaction job output file. Don't take my size figures as gospel. I'm just making up numbers to be illustrative. You'll have to test it out for your own data and tune it appropriately. Or use #12362. |
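The arithmetic in the comment above can be written out as a small check. This is only a model of the "2x address space" rule of thumb using the same made-up numbers (15 cold shard groups of 200 MB, a 450 MB hot shard, a 3.6 GB user-space limit). It is not how InfluxDB itself decides anything, and the function names are hypothetical.

```python
def free_address_space_mb(limit_mb: int, cold_groups: int,
                          cold_group_mb: int, hot_shard_mb: int) -> int:
    """Address space left once every TSM file is mmapped."""
    used = cold_groups * cold_group_mb + hot_shard_mb
    return limit_mb - used

def compaction_fits(limit_mb: int, cold_groups: int,
                    cold_group_mb: int, hot_shard_mb: int) -> bool:
    """The hot shard's inputs are already mapped; the compaction output
    needs roughly the same amount of address space again."""
    free = free_address_space_mb(limit_mb, cold_groups, cold_group_mb, hot_shard_mb)
    return free >= hot_shard_mb

# Illustrative figures from the comment, not measurements.
print(free_address_space_mb(3600, 15, 200, 450))  # 150
print(compaction_fits(3600, 15, 200, 450))        # False
print(compaction_fits(3600, 0, 200, 450))         # True (empty database)
```

With only 150 MB free and roughly 450 MB needed for the output file, the job fails even though the database is smaller than the address space limit.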
Ok will test #12362. You confirm that it has been working fine for you so far correct? No errors at all so far? |
I have not encountered any problems, but I also have not tested it exhaustively. Our device is still in development. |
In the meantime I think I will also try an unofficial image for RPI to go fully 64-bit. https://wiki.debian.org/RaspberryPi3 @fluffynukeit, that should technically resolve the issue too, correct? UPDATE: I wasn't able to try a 64-bit OS because, as predicted, it ends up using almost double the memory for other processes (such as the Gunicorn web server, for example), so even if it solved the InfluxDB problem it wouldn't be a good final solution anyway. |
tl;dr Server memory resources were low enough that newly compacted TSM files were unable to be The log files were analyzed and it was determined that a low memory condition (
This in turn caused temporary TSM files to be orphaned. Subsequent compactions for this group failed due to the orphaned
This issue is filed as #14058.
Low memory issues
Fixing #14058 will not address the problems that occur when additional TSM data cannot be Due to the low memory condition, snapshots eventually began to fail, resulting in the same
Notes collected during analysis
influxdb/tsdb/engine/tsm1/file_store.go Line 738 in e9bada0
Creates a new influxdb/tsdb/engine/tsm1/file_store.go Line 763 in e9bada0
which attempts to influxdb/tsdb/engine/tsm1/reader.go Line 1334 in 05e7def
influxdb/tsdb/engine/tsm1/file_store.go Lines 765 to 770 in e9bada0
and renames the file back to influxdb/tsdb/engine/tsm1/engine.go Line 2210 in aa3dfc0
The influxdb/tsdb/engine/tsm1/compact.go Lines 1052 to 1060 in 2dd913d
|
I had the same issue with InfluxDB on a Raspberry Pi: it was crashing at startup, even before starting the compaction. Setting swap via dphys-swapfile to 2GB had no effect. I had reservations about converting from TSM to TSI, as there are some other open issues reporting that TSI uses more memory. The fix was to copy /var/lib/influxdb to a 64-bit Debian Buster based system and run InfluxDB there. This loaded the files and started the compaction immediately, which took about 5 minutes to complete as there were a ton of uncompacted files. Memory usage spiked to about 3.8G resident during the initial startup. Subsequent startups after compaction used about 217M resident. Copying the database back to the Pi resulted in a successful startup of InfluxDB with it using only 163M resident. 64-bit systems will use considerably more RAM during normal operation (217M vs 163M), so a 64-bit build of Raspbian may not be the best choice. It definitely wouldn't have helped in my case, as the initial startup took 3.8G and the Pi only has 1G RAM; even a 2G swap file may not have been enough. A long-term solution would be to start the compaction way earlier so we don't end up with so many uncompacted files. Perhaps this can be tuned via compaction settings. |
Sorry, getting long. Can someone give a summary what the problem/status is?
Moving those .tmp files away and restarting doesn't help. |
@vogler I'd first make sure you have as much free memory as possible - stop all other services, reboot if possible (will clear possible memory fragmentation) My errors were |
@vogler I second @jjakob on the move to another server. It's the only way to recover the data. Until the InfluxDB team addresses the way compaction operates on address-space limited devices (e.g. 32-bit OS and restrictive RAM of the RPi (even with the 4GB Pi 4, I have the same issue!)) - you have no alternative short of using another time series DB platform. I tried max-concurrent-compactions = 1, but in my case at least it still fails. I just gave up on the compaction process entirely on the Pi and just rely on occasionally shipping everything to a VM on my main PC. I have since recovered my influx DB multiple times now using the method of transferring to a PC VM. I simply stop the services, tar the files, scp them over, start the services, within about 2 minutes the files are compacted... so I stop the services, ship back, done. The lack of solution for this issue suggests it may be easier to simply script the above solution of log shipping between hosts. |
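The manual routine described above (stop the service, tar the files, scp them over) could be scripted along these lines. This is a minimal sketch of the outbound half only, and it defaults to a dry run that just prints the commands. The systemd unit name `influxdb`, the archive path, and the `user@compaction-vm` host are assumptions for illustration. `/var/lib/influxdb` is the data directory mentioned elsewhere in this thread.

```python
import shlex
import subprocess

def outbound_commands(data_dir: str, archive: str, remote: str,
                      service: str = "influxdb") -> list[list[str]]:
    """Commands to stop the service, archive the data, and copy it out."""
    return [
        ["systemctl", "stop", service],
        ["tar", "-czf", archive, "-C", data_dir, "."],
        ["scp", archive, f"{remote}:{archive}"],
    ]

def run(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # abort on the first failure

plan = outbound_commands("/var/lib/influxdb", "/tmp/influxdb.tar.gz",
                         "user@compaction-vm")
run(plan)  # dry run: prints the three commands without executing them
```

The return trip mirrors this: stop the service on the 64-bit host once compaction has finished, archive, copy back, unpack, and start the service on the Pi.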
Question for the people who moved away from InfluxDB as a result of this issue: which database do you use instead? By the way, my current "workaround" is to keep my database size very very small. The main offender for DB size was collectd. I cleaned out the store, created retention policies and continuous queries for data downsampling and now the collectd DB currently sits at around 60 megabytes. This will probably work just fine for me, but is obviously not a solution if you need high-volume, high-resolution data. |
@ITguyDave, do you really have the same issue with RPI4? Is it with a 32 or 64 bit OS? |
My above mentioned "fix" only lasted 3.5 days until influxdb started crashing again. Then it was offline for 4 days until somehow coming back on its own, I have no idea how. I didn't check on it until now so have no logs older than when the DB came back again, it may have OOMed the Pi so hard it rebooted or something.
That's interesting. I think my culprit is telegraf's system metrics, which log a similar amount of data to collectd (every 10s: cpu, load avg, memory, processes, ctx switches, forks, swap usage, disk i/o). My main use case for influx is to log metrics from ebusd via telegraf, which is ~2 measurements/sec max (20/10s), a lot less than telegraf's system metrics. Can you give details on the continuous queries you created for data downsampling? I wouldn't want to downsample my main data, but it's fine for system metrics, which aren't so important. Maybe there is a way to shorten the compaction intervals, so each compaction has less uncompacted data to load, but I don't know how and don't have time to research it. It would be highly appreciated if someone did and shared their findings. |
Yes - I am still using Raspbian on it (a 32-bit OS), so the same upper memory limit issue occurs after some time. I've since retasked the RPI4 for other duties, so haven't played around with it much more since, but the RPI3 is still running the InfluxDB. I however am absolutely certain that if I was running a 64-bit OS, this compaction issue would not occur. It would however over time suffer severe performance degradation during compaction if the memory usage exceeded the 4GB physical and started paging, but it would still succeed (eventually). I have yet to hear of a stable and supported 64-bit RPi OS in any case. There are several out there, but many lose key functionality of the Raspberry Pi, such as GPIO support and require a lot of customisation to get going properly.
Interestingly... my use case for InfluxDB is logging both the telegraf system metrics and received messages via Mosquitto MQTT. All in all, I'm peaking something like 23 metrics/sec when all my MPUs are in full swing - although it does jump around a bit, since some of the sensors can only poll every ~3 seconds, whereas others are polling >4 times per second. The nature of my logging is that there are tens of different metrics, but I am also tagging them by device and sensor. Maybe that partitioning has something to do with it? I'm not sure how behind the scenes InfluxDB treats data that is tagged like this, if it does anything different at all... Currently my data directory is 1.9GB. The wal directory is at 405MB across 11000 files (and growing rapidly). I'm already suffering the dreaded compaction issues the same day after running the last compaction, so it's just a matter of time before it dies again... |
Note that this will create "mean_value" and "mean_mean_value" fields in the one_month and six_months retention policy respectively, due to issue #7332 . |
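The field naming mentioned above follows from chaining two `mean(*)` continuous queries: each stage prefixes the field name with `mean_`. The sketch below builds two such InfluxQL statements and shows the resulting names. The database name `collectd`, the CQ names, and the `5m`/`1h` intervals are assumptions for illustration, not the poster's actual queries. The `one_month` and `six_months` retention policy names come from the comment.

```python
def downsample_cq(db: str, source_rp: str, target_rp: str, interval: str) -> str:
    """Build a CQ that averages every field of every measurement
    from source_rp into target_rp."""
    return (
        f'CREATE CONTINUOUS QUERY "cq_{target_rp}" ON "{db}" BEGIN '
        f'SELECT mean(*) INTO "{db}"."{target_rp}".:MEASUREMENT '
        f'FROM "{db}"."{source_rp}"./.*/ GROUP BY time({interval}), * END'
    )

def downsampled_field(field: str, stages: int) -> str:
    """Each mean(*) stage prefixes the output field with 'mean_'."""
    return "mean_" * stages + field

print(downsample_cq("collectd", "autogen", "one_month", "5m"))
print(downsample_cq("collectd", "one_month", "six_months", "1h"))
print(downsampled_field("value", 1))  # mean_value
print(downsampled_field("value", 2))  # mean_mean_value
```

Dashboards that read from the downsampled policies need to query the prefixed field names.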
For anyone still battling with this issue... Raspbian now has an experimental 64-bit kernel available. I have seen successful compaction on my RPI 4 (4 GB RAM) since switching to that kernel. Technically the 64-bit kernel works on the 3 series too, but I would probably suggest upgrading to a RPI 4 for the extra memory, as it's more likely to sustain larger databases in the long run. Info on the 64-bit kernel is here: https://www.raspberrypi.org/forums/viewtopic.php?f=29&t=250730 |
@ITguyDave I suspect that the higher amount of RAM (4 vs 1) is the key factor, not the 64-bit kernel, as I've detailed the memory usage in my previous post. During the compaction influxd used ~3.8G RAM on an amd64 OS, so while this isn't directly comparable to ARM64, it's indicative. If someone wants to do testing with a 64-bit kernel and OS build on RPi3 we'll know for sure, but I doubt it'll improve anything, I suspect it'll make it worse. |
After trying the unofficial Ubuntu release... every issue I had with InfluxDB seems to have gone. Despite not actually changing any settings, the memory usage is minimal, the CPU load is nil under standard load, and the upper limit of pps being sent to the DB is extremely high (I'm seeing about 4300 pps with an ancient Intel 74 GB MLC SSD). I have a 53GB database running currently, and no issues to speak of with compacting any more. I hadn't intended on doing this test today, but shortly after I replied to the earlier comment, InfluxDB went into the dreaded endless service restart loop due to the failing compactions. For anyone else who comes across this thread and wants a miniature InfluxDB server that can actually handle a moderately sized DB... you won't get a reliable InfluxDB without a fully 64-bit environment, as InfluxDB does not support 32-bit very well. Just do away with Raspbian and Balena and go to the Ubuntu server image as mentioned by @CJKohler. It seems much more responsive, is running significantly faster and the RPi4 is running almost cool to the touch for the first time, with InfluxDB running faster than it ever has! |
64-bit Raspberry Pi OS is coming: https://www.raspberrypi.org/forums/viewtopic.php?f=117&t=275370 I ordered an 8GB Pi and will test the setup.... |
Hi @ITguyDave ! I'm seeing the exact same problem here with my pi4 4GB + having some others as well (Xorg crashing randomly and needed to restart networking after each boot because the pi will lose connectivity, more info here: https://www.raspberrypi.org/forums/viewtopic.php?f=28&t=277231 ). Can you confirm that the GPIO ports are working with Ubuntu? Greetings. |
@unreal4u I personally don't use the RPi for GPIO, as I predominantly use mine for acting as mini low-power servers and have Arduino-based MCUs sending data to them over the network, but according to the maintainer of the Ubuntu image (see https://jamesachambers.com/raspberry-pi-4-ubuntu-server-desktop-18-04-3-image-unofficial/), the standard Raspbian kernel and utilities are available so I see no reason they shouldn't work. For InfluxDB also... be sure to set the index to file-based rather than in-memory, since a growing DB will inevitably have issues with memory caching at some point once there is enough data. You might also need to play around a bit with the retention period and shard groups to better tune how the database manages the underlying shards. I played around a lot with mine over countless hours (I'm still not happy with the performance, but it's 90% better immediately simply by going Ubuntu). My server is logging about 47 parameters every 5 seconds into InfluxDB with no issues now, via Mosquitto MQTT into Telegraf and the CPU is almost idle, with very low memory usage now. Just be sure you have a decent MLC-based SSD attached to get the most out of it - avoid the SD card wherever you can, and be sure to move all frequently-accessed log files to the SSD rather than SD to avoid the card dying from excessive writes. I run an ancient Intel X25-M 74GB SSD for my InfluxDB via USB3 and it runs brilliantly in that setup, considering the very low power needs of that setup. Let us know your experience with GPIO? |
Thanks @ITguyDave ! I installed Ubuntu Server during the weekend and finally got around last night to playing with the GPIO ports. And yes, I can confirm they do work without problems! I haven't played a lot yet with InfluxDB, but the compaction process did work without issues and the avg. load has come down from a permanent 2.x to <1.0 (not bad considering I run 16+ docker images AND use the GUI as well to display a magic mirror, all while collecting data through USB ports + GPIO). I was already using an SSD, the only quirk is that I had to go back to using a microsd card for All in all, I'm quite happy so far, the only thing I miss is Thanks! |
I don't know about Ubuntu, but on Raspbian this is no longer needed after some update. My RPi4 is running solely from SSD. |
It does not seem possible yet. I had that same setup with Raspberry Pi OS, however, but my USB ports were not being recognized at boot, so it has to go through the SD card first. Not a big issue, I had the same setup before it was possible to boot directly from USB. |
I'm facing this very same issue (on an Odroid XU4). I did try to copy the files to a desktop, run influx there, let it do its thing and copy it back. That solved it for about a day or so. Influxdb size:
Any advice to solve this? |
Switching to a 64-bit OS works for me...
Sunday, 10 January 2021, 15:03 +0100, from notifications@github.com <notifications@github.com>:
…I'm facing this very same issue (on an Odroid XU4).
I'm a bit amazed that this bug (i got here via one from 2016!) is open and unresolved for this long. Is influxdb not meant to be run on single board computers?
I did try to copy the files to a desktop, run influx there, let it do its thing and copy it back. That solved it for about a day or so.
It's not like my influx instance is logging a whole cluster of machines. It has been running fine for a couple years.
But it seems like once you trigger this issue you just cannot solve it unless you either start removing data (which i don't want to) or upgrade to a beefier machine (which i don't want to either).
Influxdb size:
12K /var/lib/influxdb/meta
3.2G /var/lib/influxdb/data
33M /var/lib/influxdb/wal
3.3G /var/lib/influxdb
Any advice to solve this?
|
You can change the RETENTION POLICY. Example: ALTER RETENTION POLICY autogen ON xxxx DURATION 8w REPLICATION 1 SHARD DURATION 7d DEFAULT |
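To see what a statement like the one above implies, the sketch below converts the two durations to hours and estimates how many shard groups the policy keeps around. The count is approximate, since InfluxDB drops whole shard groups only once they have fully expired. The helper names and the `mydb` database name are made up. Only the `h`, `d` and `w` duration units are handled here.

```python
import math

UNIT_HOURS = {"h": 1, "d": 24, "w": 24 * 7}

def duration_hours(text: str) -> int:
    """Convert a duration such as '8w' or '7d' to hours."""
    return int(text[:-1]) * UNIT_HOURS[text[-1]]

def shard_groups_retained(duration: str, shard_duration: str) -> int:
    """Approximate number of shard groups kept under the policy."""
    return math.ceil(duration_hours(duration) / duration_hours(shard_duration))

def alter_retention_policy(db: str, policy: str, duration: str,
                           shard_duration: str) -> str:
    """Build the InfluxQL statement for the given settings."""
    return (f'ALTER RETENTION POLICY "{policy}" ON "{db}" DURATION {duration} '
            f"REPLICATION 1 SHARD DURATION {shard_duration} DEFAULT")

print(alter_retention_policy("mydb", "autogen", "8w", "7d"))
print(shard_groups_retained("8w", "7d"))  # 8
```

Note that shortening `DURATION` on an existing policy expires data older than the new duration, so this only suits data you are prepared to lose.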
I might not entirely understand retention policies.. I have no lack of space. Only of memory. I don't want to delete my data. |
Then you should switch to a 64-bit system with more RAM (Pi 4 4GB/8GB). There is no other solution. But if you switch to a Pi 4 you will get the same problem as now, just later. ;) At some point you have to delete very old data, or use another (better) system than influxdb. Is there one without those problems? |
@markg85 the DURATION depends on your data. For me the example works, because I'm collecting a lot of data: every 10 seconds, sometimes from 10 computers. Perhaps with your data it will work with 10w or 14w. |
Is that seriously how influxdb is developed? Stuffing all its data in memory, so that eventually you need to upgrade... I'm not going to remove my data. Switching to another SBC is also difficult (not impossible) because the one this is running on is hosting a bit more services than just influxdb. I kinda hate to migrate that. Also, the XU4 still has quite a powerful CPU. I'd argue that the XU4 is more powerful still, just not in terms of memory. If there is a better alternative out there for 32 bit ARM setups, i'm all ears :) |
This looks very promising! https://www.youtube.com/watch?v=C4YV-9CrawA The prometheus "tsdb" that was new with their 2.0 release (3 years ago). Why isn't influxdb using something like it? |
@somera Not entirely true... after going to Ubuntu on my RPi 4 with 64-bit, I now have a 64GB database running on an SSD that has been running flawlessly for many months 24x7 now. Prior to that, the database maximum size was 3GB before compaction issues would prevent the influx service even starting. The key comes down to disabling the in-memory support by switching the sharding method in influxdb.conf: By default it is "inmem", meaning the shards sit in memory. This will cause issues with compaction, as in order to compact it needs to compact multiple shards at the same time in memory, thus depending on the size of the shard it could be too big to fit in memory (especially on a 32-bit OS). This is why your mileage may vary with adjustment of retention periods, a VERY complex topic on InfluxDB in itself - as the number of data points will impact on the amount of data requiring compaction. You might see some improvement by adjusting to shorter retention, but it heavily depends on the data being stored. I myself actually have indefinite retention running, and have no issues at all since moving away from inmem. Performance is slightly lower, but as it's on an SSD and on a USB3 thanks to the RPi4 - it's not all that noticeable. As it's a RPi anyway, the memory isn't blazing fast regardless, so I personally would not be concerned about the performance decrease unless you're running a production workload. IMPORTANT: do NOT change to tsi1 unless you have a decent external SSD, as the increased disk activity will destroy any standard microSD cards since most non-industrial cards don't have any wear levelling capability, thus will hit the limit of writes the card can handle and cause the memory chips on the card itself to fail. I can speak to this from experience...
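The snippet the comment above referred to did not survive. In InfluxDB 1.x the setting that takes the values "inmem" and "tsi1" is `index-version` in the `[data]` section of influxdb.conf, and it selects the index type. The sketch below checks which value a config file uses. It is a deliberately small line-based parser and not a full TOML reader, and the sample config text is invented for illustration.

```python
import re

def index_version(conf_text: str, default: str = "inmem") -> str:
    """Return the index-version set under [data], or the default."""
    in_data = False
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if line.startswith("["):
            in_data = line == "[data]"
            continue
        match = re.match(r'index-version\s*=\s*"([^"]*)"', line)
        if in_data and match:
            return match.group(1)
    return default

sample = """
[meta]
  dir = "/var/lib/influxdb/meta"

[data]
  dir = "/var/lib/influxdb/data"
  # index-version = "inmem"
  index-version = "tsi1"
"""

print(index_version(sample))              # tsi1
print(index_version("[data]\n  dir = 1"))  # inmem
```

As far as I know, changing the setting only affects newly created shards. Existing shards keep their old index until it is rebuilt (the 1.x tooling has `influx_inspect buildtsi` for that), so check the documentation for your version before relying on it.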
@markg85 I've said it before, InfluxDB cares not for our plight. To them, it's all about Facebook-scale usage. 32-bit and IoT are definitely not their focus, we don't even show up on the radar and that much is clear by the lack of any dev response to this thread. Maybe someone who has the skills could create a fork of InfluxDB some day, to actually handle small hardware? I see a massive market for this type of thing, since IoT sensors out in the wild are all too common, and more often than not don't have the luxury of high-end hardware or network bandwidth. A distributed architecture with Influx running on RPIs scattered in remote farm sheds surrounded by LoRA sensors in a field, is just one such example where a solution like this could thrive. FWIW - the real irony with influx is that after I switched to tsi1 rather than inmem for sharding, performance actually IMPROVED due to influx no longer logging thousands of errors every second to the syslog. My log files were rotating on average every hour through influx error messages. It's definitely not what you would expect in moving away from memory cache...
@markg85 Ubuntu 64-bit (I run this, RPi3 technically can run it too). 32-bit you should avoid at all costs. If the CPU doesn't have support for 64-bit, try my comments above to switch to tsi1 and you might be able to workaround the compaction issue. Compaction for me has been flawless ever since. It actually compacted the data that had been failing to compact for 6 months after switching it, even though it took 15 minutes to complete... much more reliable than copying back and forward to desktop hardware. |
@aemondis thx for the info! |
@aemondis, thank you for that detailed reply! That's much appreciated! My appreciation for influxdb went straight through the floor. There are lots and lots of IoT/sensor projects out there where single board computers are involved, influxdb often is too. Then to figure out that you're basically installing a timebomb is disappointing to say the least. In these environments it can be expected to run a sbc for some fancy functionality. If you need to run a desktop pc or a more higher end sbc you quickly just don't use it. In my specific case i'm running the odroid XU4 with the home server package. That home server adds a daughter board giving you access to two SATA connections (and a bunch of other stuff). I can't just throw that away as there isn't a real alternative for it. I get that it's ill advised to use 32bit platforms. I myself am a developer and i too also just discard it and say "use 64 bit". Truth be told, that's for the desktop and the x86-64 architecture, not ARM. I'm already running a second SBC (rk3399 based) for media player purposes. I don't know how i'm going to solve this issue. Very definitely not a fourth SBC.. I might add it to my rk3399. I might search for alternatives.. i just don't know. Yet. |
I wrote the #12362 patch to prevent the compaction issue on 32 bit systems, and I believe it does work (at the very least it did). However, my former company, the one for whom I was doing this work, ran into subsequent problems when running influxdb even with that patch, or perhaps because of it. The problem was that as the DB content got larger and larger, influxdb took longer and longer to boot up on a SBC running a microSD card. At one point it was over 5 minutes, and that kind of bootup time is just not acceptable for our application. I tried to mess with the DB configuration, trying different indexing methods and such, but I was unable to find a solution. We had had enough headaches with influx at that point to pull the plug. We eliminated it from our device entirely, which was a real headache because it was the keystone of our software architecture. And why wouldn't it be the keystone? It's a great convenience to use a web service to stick your data into an easily searchable database with great compression that tells the entire story of your system. Sensor data, configuration, syslogd, etc, all together in a neat package. But it just failed for us in enough ways that we had to move on. We don't have a replacement solution, either. If you want to record data on this system, you have to plug it via crossover into a PC that is hosting an influx instance. I think it's true that influxdb is just targeting a different use case than we want. They want to be the DB in the cloud that boots up and runs always, never shutting down, collecting metrics from net-capable devices, and running nearly constant queries for analytics. In many embedded cases, mine included, we just want something that boots up quickly, records data efficiently, is robust to power loss, and might have no or only sporadic internet access. I don't even need to run queries very often; I really only look at the data if something went wrong. Notice that these are all implementation gripes.
I think the influxdb web interface is pretty good, and one reason I chose it originally was that it was one of only a few options that allowed me to specify my own timestamps. An embedded-focused alternative could keep the same interface but make different tradeoff decisions in the implementation behind the scenes. |
Thank you for your insight!
I don't think you need a lot (or any) trade-offs. As a user you just need to be made aware that none of your queries should go over your memory boundary. So for instance, say you have 10GB of collected data spread over years of collecting. If the queries you run never reach a point now till as far back as what 3.2GB is then there would be no issue at all. And i'd be willing to bet that in most of the influxdb usecases this is enough of a range to go back at least a year. And if you go further back, say to some data that sits at the 4GB point, that you'll simply suffer the performance penalty that comes with it. But... I'm also guessing that the storage engine isn't as efficient in data storage as that one in prometheus 2.0. And depending on the type of data you collect you can use other efficient ways of compression for that specific data type. I don't know what influx does and does not do here, but i just have a hunch that it can be optimized a lot. |
I recall looking at the suggested fix code, and I reckon that might have simply been due to the sheer volume of parsing required at the file-system layer of the database files. Depending on the retention and compaction configuration and volume, the number of files can blow out dramatically. In my InfluxDB RPi4, before compaction I ran a count and returned 15,000,000 files. After I found the tsi1 approach, that dropped down to about 150,000. SBC hardware is pretty woefully underspecced for handling such volumes.
I think a lot of it comes down to the architecture and distribution. Influx could be made to work, but there will be no "single" solution. Whilst there is technically no reason why Influx shouldn't be able to run on SBC, the volume of data and the potential overheads involved just make it too heavy for big workloads centrally. As with any IoT-like solution, it makes sense to scale-out solutions, such as I mentioned earlier having various LoRA-enabled sensors in a field, talking back to a localised SBC to act as an aggregator. You would then have an upstream "central" console that consolidates the aggregated information or even pulls the raw data periodically. Through this architecture, you would then focus on having very short retention on the regional collectors, and this would workaround the underlying limitations of Influx on small systems. I however still firmly believe that Influx has failed the SBC community in simply refusing to acknowledge that SBC is a viable deployment platform for it, so there really does need to be consideration in the solution on how to make it work properly on such hardware - and this wouldn't be hard for someone who is intimately familiar with the code architecture underlying Influx.
I tried several alternatives, and came to a realisation that most viable alternatives were either not as efficient to query, or had poor/non-existent compression capabilities (e.g. the Postgres-enabled TSDB solutions). Influx is a fantastic product that has simply failed to embrace one of the most potent use cases for it: data logging in the field. This is essentially what underpins SBC and where such solutions are most prominent.
If one is querying such a large volume of data, I would suggest aggregation should be more prominently used to avoid such a penalty. Querying such a long range of raw data would be a big no-no and would probably even bring high-end server hardware to its knees. To facilitate this, it would be better to batch-query in loops and hope the memory model of Influx itself can ensure you don't end up utilising all available memory with stale data (i.e. it should age out queries that are no longer being used). I haven't tested it, but I wonder if using tsi1 would actually change the memory caching behaviour, since the shard should be remaining on disk in this configuration, rather than being loaded into memory?
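To illustrate the batch-query idea, here is a rough sketch in Python that splits a long range into smaller windows and builds one aggregated InfluxQL query per window. It only builds query strings; sending them is left to whatever client you use. The measurement name `temperature`, the field name `value`, and the helper names are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

def time_windows(start, end, step):
    """Yield consecutive (window_start, window_end) pairs covering [start, end)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    cursor = start
    while cursor < end:
        window_end = min(cursor + step, end)
        yield cursor, window_end
        cursor = window_end

def build_query(measurement, field, window_start, window_end, bucket="1h"):
    """Build an InfluxQL query averaging one field per bucket within one window.

    The datetimes are assumed to already be in UTC.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return (
        f'SELECT MEAN("{field}") FROM "{measurement}" '
        f"WHERE time >= '{window_start.strftime(fmt)}' "
        f"AND time < '{window_end.strftime(fmt)}' "
        f"GROUP BY time({bucket})"
    )

if __name__ == "__main__":
    start = datetime(2019, 1, 1, tzinfo=timezone.utc)
    end = datetime(2019, 3, 1, tzinfo=timezone.utc)
    # Hypothetical measurement and field names, 30-day windows.
    for w_start, w_end in time_windows(start, end, timedelta(days=30)):
        print(build_query("temperature", "value", w_start, w_end))
```

Each query touches a bounded slice of time and returns hourly averages instead of raw points. Whether that keeps memory under control on a given box is something you would have to measure.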
I took a look at Prometheus a long time back; it would be interesting to hear about your experience with the latest version vs. InfluxDB. I can't recall why I decided against it at the time, but there was a specific reason for it (possibly Grafana-related?). I did try it in the early days though, and for some reason it didn't do what I needed it to. I might play around with it again if I get the time. |
Interesting re: Odroid XU4, it's not a platform I've looked into (there are so many SBC solutions out there these days...), and time is rarely on my side of late. Totally agree on 32-bit vs. 64-bit: it's not always necessary, and although most modern ARM IP is 64-bit capable, 64-bit is more often than not simply unavailable due to OS or HW limitations. 64-bit is generally better optimised on modern hardware though, even on ARM, so you can often extract slightly more performance out of it; plus it has the advantage of a seemingly endless amount of addressable memory (rather than the restrictive 4GB upper limit). Even if you only have 4GB RAM in the system, the optimisations in memory management can still be beneficial, as I discovered on my RPi 3 (which saw a healthy boost switching from 32-bit Ubuntu to 64-bit Ubuntu).
Give the tsi1 approach a try - you might find it just makes Influx workable on the 32-bit platform. It will still utilise a lot of memory during compaction, but if you are splitting out the shards into smaller sizes, you might just find it will fit in a 32-bit footprint. You will definitely need to play around a bit, as it's highly dependent on your data volumes. It took me almost a month of daily tuning to get mine running the way I like it, but it paid off and has been flawless ever since. Memory usage during compaction peaks at 2.4GB on mine, but no compaction has failed since.
Completely agree on the "timebomb". It reminds me of early-day IBM/Dell servers (if I recall) that had a time bomb in the BIOS that after a certain date would simply "disable" the RAID controller. Or even the old Y2K thing. The fact that compaction simply fails at a certain point due to data volumes is bad software design, and there are absolutely ways to work around it. In truth, with a large enough volume of data on high-end hardware, in theory you could actually encounter the same issue.
This suggests it is a bug that if resolved would benefit both SBC and enterprise markets, and also potentially better optimise the code to be more "graceful" in handling such events (instead of spamming syslog with errors, retrying, failing on the same issue, spamming, etc. until eventually the service can't load any more). Believe me, whilst I was trying to find a solution to this I was frustrated and would even to this day not recommend InfluxDB until such an issue is fixed. It's too much of a liability, as it is really a time-bomb triggered by data volume. |
FWIW - here's some useful reading material on tuning Influx that helped me get mine working: https://www.influxdata.com/blog/influxdb-shards-retention-policies/
I am still using the 1.x variant of Influx and have yet to try out the 2.x version - but it looks to have a very different architecture for downsampling data that is a bit more fine-grained. I might need to get a PhD to get my head around it first though...
This is what I have currently (I have just altered the standard autogen retention policy on the telegraf DB):
The above is simply 400 days of history. If you query a long range of data, you will get memory issues - but I tend to query specific limited durations from within that window, and when I aggregate results they are grouped into larger averages, thus reducing the load. |
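As a rough aid when playing with these settings, here is a small Python sketch of the arithmetic relating retention to shard group duration. The shard durations used below are hypothetical examples, not the values from the configuration above, and the result is only an approximation of how many shard groups span the retention window.

```python
import math
from datetime import timedelta

def shard_group_count(retention, shard_duration):
    """Approximate number of shard groups needed to span the retention window."""
    if shard_duration <= timedelta(0):
        raise ValueError("shard_duration must be positive")
    return math.ceil(retention / shard_duration)

if __name__ == "__main__":
    retention = timedelta(days=400)  # the 400 days of history mentioned above
    # Hypothetical shard group durations to compare.
    for days in (1, 7, 30):
        groups = shard_group_count(retention, timedelta(days=days))
        print(f"{days:>2}-day shard groups: about {groups}")
```

Shorter shard groups mean more, smaller shards, which is the trade-off to experiment with when trying to keep each compaction within the available memory.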
@aemondis Just a note regarding data aggregation and more powerful hardware. I both agree and disagree :) But in the IoT world, especially a home environment like mine, where I'm aggregating data from my net meter and - say - about 15 IoT devices + weather information, that should be very possible on an SBC for years! Just goes to show that Influx isn't designed for home IoT usage. Even though one of their use cases is IoT (https://www.influxdata.com/customers/iot-data-platform/), there too it seems very much tailored to commercial needs. Funny side note, that page of theirs mentions IoT examples. Why the **** do they mention planets??? |
We no longer support 32 bit systems, closing. |
Following up on this post with a fresh issue to highlight worse symptoms that don't seem explainable by a db-size cutoff (as was speculated on #6975 and elsewhere):
In the month since that post, I've had to forcibly `mv` the `data/collectd` directory twice to unstick influx from 1-2min crash loops that lasted days, seemingly due to compaction errors.
Today I'm noticing that my `temps` database (which I've not messed with during these `collectd` db problems, and gets about 5 points per second written to it) is missing large swaths of data from the 2 months I've been writing to it:
The last gap, between 1/14 and 1/17, didn't exist this morning (when influx was still crash-looping, before the most recent time I ran `mv /var/lib/influxdb/data/collectd ~/collectd.bak`). That data was just recently discarded, it seems, possibly around the time I performed my "work-around" for the crash loop:
The default retention policy should not be discarding data, afaict:
Here's the last ~7d of syslogs from the RPi server, 99.9% of which is logs from Influx crash-looping.
There seem to be messages about:
Is running InfluxDB on an RPi supposed to generally work, or am I in uncharted territory just by attempting it?