Segfault when using postgres plugin #5829

Closed
davidmehren opened this issue Apr 10, 2019 · 14 comments · Fixed by #5882
Labels: bug, priority/high (Super important issue)

@davidmehren

Bug report summary

When the postgres plugin is enabled, netdata segfaults. When running the postgres plugin in standalone mode using sudo -u netdata /usr/libexec/netdata/plugins.d/python.d.plugin postgres it works fine.

I have captured a stacktrace:

> valgrind netdata -D
Memcheck, a memory error detector
Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.                                                                                                                           
Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info                                                                                                                        
Command: netdata -D                                                                                                                                                                                                                                                                                                                                                                
Thread 7:                                                                                                                                                                                 
Invalid read of size 1                                                                                                                                                                    
  at 0x4C33DA3: strcmp (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
  by 0x1944A0: rrdpush_send_chart_definition_nolock (rrdpush.c:154)
  by 0x1944A0: rrdset_done_push (rrdpush.c:281)                                                                                                                                          
  by 0x19B33E: rrdset_done_push_exclusive (rrdset.c:933)                                                                                                                                 
  by 0x19B33E: rrdset_done (rrdset.c:1173)                                                                                                                                               
  by 0x136A44: pluginsd_process (plugins_d.c:245)                                                                                                                                        
  by 0x123F34: pluginsd_worker_thread (plugins_d.c:540)                                                                                                                                  
  by 0x156084: thread_start (threads.c:126)                                                                                                                                              
  by 0x59F66DA: start_thread (pthread_create.c:463)                                                                                                                                      
  by 0x4F5D88E: clone (clone.S:95)                                                                                                                                                       
Address 0x0 is not stack'd, malloc'd or (recently) free'd                                                                                                                                
                                                                                                                                                                                         
                                                                                                                                                                                         
Process terminating with default action of signal 11 (SIGSEGV)                                                                                                                            
Access not within mapped region at address 0x0                                                                                                                                           
  at 0x4C33DA3: strcmp (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)                                                                                                          
  by 0x1944A0: rrdpush_send_chart_definition_nolock (rrdpush.c:154)                                                                                                                      
  by 0x1944A0: rrdset_done_push (rrdpush.c:281)                                                                                                                                          
  by 0x19B33E: rrdset_done_push_exclusive (rrdset.c:933)                                                                                                                                 
  by 0x19B33E: rrdset_done (rrdset.c:1173)                                                                                                                                               
  by 0x136A44: pluginsd_process (plugins_d.c:245)                                                                                                                                        
  by 0x123F34: pluginsd_worker_thread (plugins_d.c:540)                                                                                                                                  
  by 0x156084: thread_start (threads.c:126)                                                                                                                                              
  by 0x59F66DA: start_thread (pthread_create.c:463)                                                                                                                                      
  by 0x4F5D88E: clone (clone.S:95)                                                                                                                                                       
If you believe this happened as a result of a stack                                                                                                                                      
overflow in your program's main thread (unlikely but                                                                                                                                     
possible), you can try to increase the size of the                                                                                                                                       
main thread stack using the --main-stacksize= flag.
The main thread stack size used in this run was 8388608.

HEAP SUMMARY:
    in use at exit: 8,113,683 bytes in 13,580 blocks
  total heap usage: 13,948 allocs, 368 frees, 10,483,463 bytes allocated

LEAK SUMMARY:
   definitely lost: 0 bytes in 0 blocks
   indirectly lost: 0 bytes in 0 blocks
     possibly lost: 1,600 bytes in 5 blocks
   still reachable: 8,112,083 bytes in 13,575 blocks
        suppressed: 0 bytes in 0 blocks
Rerun with --leak-check=full to see details of leaked memory

For counts of detected and suppressed errors, rerun with: -v
ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
could not unlink /tmp/vgdb-pipe-from-vgdb-to-31349-by-root-on-???
could not unlink /tmp/vgdb-pipe-to-vgdb-from-31349-by-root-on-???
could not unlink /tmp/vgdb-pipe-shared-mem-vgdb-31349-by-root-on-???
OS / Environment

Ubuntu 18.04.2 in an LXC container on a Proxmox host

Netdata version (output of netdata -V)

netdata bee48fd3

Component Name

Not sure, collectors maybe?

Steps To Reproduce

Configure postgres credentials.
Enable the postgres plugin.

Expected behavior

Postgres metrics get collected.

@netdatabot netdatabot added bug needs triage Issues which need to be manually labelled labels Apr 11, 2019
@vlvkobal
Contributor

Please send the output of the postgres plugin with chart definitions and the first two steps of data collection.

@ilyam8
Member

ilyam8 commented Apr 12, 2019

hi @davidmehren

When the postgres plugin is enabled, netdata segfaults.

so it crashes immediately after the start?

@vlvkobal
Contributor

Enable debugging and set debug flags = 0x0000000000020000. What do you have in /var/log/netdata/debug.log when netdata crashes?

@davidmehren
Author

Please send the output of the postgres plugin with chart definitions and the first two steps of data collection.

Please have a look at https://gist.github.com/davidmehren/05196dcbb0df2a45ecc47648bee1b0c2
pg_plugin.out was generated using sudo -u netdata /usr/libexec/netdata/plugins.d/python.d.plugin postgres debug trace |& tee pg_plugin.out

What do you have in /var/log/netdata/debug.log when netdata crashes?

When I use netdata -D -W debug_flags=0x0000000000020000, the debug log is empty. I have included the error.log in the gist above.

so it crashes immediately after the start?

I have tried to run netdata a few times:

root@pg:~# time netdata -D -W debug_flags=0x0000000000020000
Segmentation fault

real    0m5.526s
user    0m0.030s
sys     0m0.026s
root@pg:~# time netdata -D -W debug_flags=0x0000000000020000
Segmentation fault

real    0m6.046s
user    0m0.039s
sys     0m0.022s
root@pg:~# time netdata -D -W debug_flags=0x0000000000020000
Segmentation fault

real    0m5.709s
user    0m0.042s
sys     0m0.018s
root@pg:~# time netdata -D -W debug_flags=0x0000000000020000
Segmentation fault

real    0m5.777s
user    0m0.033s
sys     0m0.028s
root@pg:~# time netdata -D -W debug_flags=0x0000000000020000
Segmentation fault

real    0m5.812s
user    0m0.025s
sys     0m0.028s

As you can see, it runs for a bit and seems to collect metrics once or twice (including postgres; I can see them in the instance they are streamed to), then crashes.

@vlvkobal
Contributor

Recompile netdata with CFLAGS="-O1 -ggdb -DNETDATA_INTERNAL_CHECKS=1" ./netdata-installer.sh in order to get debug messages in the debug.log.

@davidmehren
Author

Sorry, I did that yesterday and didn't realize the updater ran this morning and broke it.
See here for the log: https://gist.github.com/davidmehren/fd9a70211259e33437260a3516b2b926

@ilyam8 ilyam8 added priority/high Super important issue and removed needs triage Issues which need to be manually labelled labels Apr 13, 2019
@vlvkobal
Contributor

There is a problem with chart naming. If names are not provided explicitly, charts are named automatically from their chart IDs. Because special characters are troublesome to handle in JavaScript, names are 'normalised': any non-alphanumeric character except "." is converted to "_". In this particular case, the problematic charts include kif-mastodon and kif_mastodon in their IDs, so both normalise to the same name.

When a chart is created with a converted name that already exists in the netdata database, the pointer to the name remains null, which leads to a segmentation fault when streaming to a master.
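
The collision can be reproduced in isolation. The following is a minimal sketch, not the netdata source: `normalize_chart_name` is a hypothetical stand-in for the normalisation rule described above, and the chart IDs mentioned in the note below are made-up examples built around the kif-mastodon and kif_mastodon names.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for the rule described in this comment: copy src into
 * dst (at most dst_size - 1 bytes) and replace every character that is
 * neither '.' nor alphanumeric with '_'. This is an illustration only. */
static void normalize_chart_name(char *dst, const char *src, size_t dst_size) {
    size_t i = 0;

    if (dst_size == 0)
        return;

    for (; src[i] != '\0' && i + 1 < dst_size; i++) {
        unsigned char c = (unsigned char)src[i];
        dst[i] = (c != '.' && !isalnum(c)) ? '_' : (char)c;
    }
    dst[i] = '\0';
}
```

Passing the assumed IDs `postgres_local.kif-mastodon_db_stat` and `postgres_local.kif_mastodon_db_stat` through this function gives the same string for both, so the second chart finds its name already taken.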

@cakrit
Contributor

cakrit commented Apr 15, 2019

The replacement isn't done in javascript, but here.

Duplicate names are supposed to be handled in https://github.com/netdata/netdata/blob/master/database/rrdset.c#L152

@vlvkobal can you test your theory of duplicate names and see where and how we get the SEGFAULT?

@vlvkobal
Contributor

The replacement isn't done in javascript, but here.

I didn't say that it is done in JavaScript; I said that it is done because of JavaScript. Yes, exactly, it is done in rrdset_strncpyz_name().

Duplicate names are supposed to be handled in /database/rrdset.c@master#L152

Yes, but the returned value from rrdset_set_name() is not taken into account in rrdset.c L713.
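
To illustrate what taking the return value into account could look like, here is a small sketch under stated assumptions. `name_index`, `name_index_claim` and `chart_create` are hypothetical names invented for this example and do not correspond to netdata functions. Reporting the failure to the caller is only one possible policy, not the fix adopted by the project.

```c
#include <stddef.h>
#include <string.h>

#define MAX_NAMES 16

/* Hypothetical in-memory set of chart names that are already in use. */
struct name_index {
    const char *names[MAX_NAMES];
    size_t count;
};

/* Returns 1 and records the name if it is still free.
 * Returns 0 if the name is already taken or the index is full. */
static int name_index_claim(struct name_index *idx, const char *name) {
    for (size_t i = 0; i < idx->count; i++)
        if (strcmp(idx->names[i], name) == 0)
            return 0;

    if (idx->count == MAX_NAMES)
        return 0;

    idx->names[idx->count++] = name;
    return 1;
}

/* A caller that does not ignore the result: when the name cannot be claimed,
 * it returns -1 so the failure is visible instead of silently continuing
 * with an unset name. On success it returns 0 and sets *out_name. */
static int chart_create(struct name_index *idx, const char *normalized_name,
                        const char **out_name) {
    if (!name_index_claim(idx, normalized_name)) {
        *out_name = NULL;
        return -1;
    }
    *out_name = normalized_name;
    return 0;
}
```

With this shape, creating a second chart whose normalised name is already present returns -1 rather than leaving a chart with a null name in circulation.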

@vlvkobal can you test your theory of duplicate names and see where and how we get the SEGFAULT?

I did it a few days ago, and my previous comment was the result of that investigation. As I already wrote, the segmentation fault happens when a slave tries to push an unnamed chart to a master.

My previous message was written mostly as a memo. What we need to decide is what to do about the problem. The simplest workaround for the segmentation fault is to add a check for an empty name in streaming and don't push unnamed charts, but it doesn't solve the main problem: name duplication.
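
The workaround could be sketched roughly as follows. This is an assumption-laden illustration rather than the merged patch: `struct chart`, `chart_is_pushable` and `send_chart_definition` are hypothetical, and the `strcmp` call only mirrors the fact that the stack trace shows a `strcmp` reading address 0x0.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily simplified chart record. */
struct chart {
    const char *id;   /* assumed to be set for every chart */
    const char *name; /* may be NULL when naming failed */
};

/* Returns 1 if the chart has a usable id and a non-empty name, else 0. */
static int chart_is_pushable(const struct chart *st) {
    return st != NULL && st->id != NULL &&
           st->name != NULL && st->name[0] != '\0';
}

/* Returns -1 if the chart is skipped because it has no name,
 * 0 if the name equals the id, and 1 if the name differs from the id.
 * The guard runs before strcmp, so a NULL name is never dereferenced. */
static int send_chart_definition(const struct chart *st) {
    if (!chart_is_pushable(st))
        return -1;

    return strcmp(st->id, st->name) == 0 ? 0 : 1;
}
```

A chart whose `name` is NULL is skipped with -1 instead of crashing the sender. As noted above, this hides the affected chart but does nothing about the duplicated name itself.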

@ilyam8
Member

ilyam8 commented Apr 16, 2019

and don't push unnamed charts

kif-mastodon and kif_mastodon are two unique names; it would be a bug not to push one of them.

rrdset_strncpyz_name()

netdata/database/rrdset.c

Lines 127 to 128 in 2192f26

if(c != '.' && !isalnum(c))
    c = '_';

Why only dots and alphanumeric?

@vlvkobal vlvkobal added this to the v1.15 milestone Apr 17, 2019
@cakrit cakrit added the size:1 label Apr 17, 2019
@cakrit cakrit modified the milestones: v1.15, v1.14 Apr 17, 2019
@vlvkobal
Contributor

#5882 was merged. @davidmehren please test it. It should resolve the issue, though badly named charts won't be shown.

@vlvkobal
Contributor

The discussion on the main problem will continue in #5883.

@davidmehren
Author

I can confirm that netdata does not crash anymore.
In the meantime I have found out that one of the databases was left over from a manual test and can be deleted.

Thanks everyone for the quick help!

@vlvkobal
Contributor

@davidmehren thank you for your report and assistance!
