Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[0.9.6.1] Panic on writes #5262

Closed
joshliberty opened this issue Jan 4, 2016 · 28 comments
Closed

[0.9.6.1] Panic on writes #5262

joshliberty opened this issue Jan 4, 2016 · 28 comments
Labels
Milestone

Comments

@joshliberty
Copy link

I'm getting a lot of these messages in my influx log:

[http] 2016/01/04 09:17:09 10.60.195.206 - root [04/Jan/2016:09:17:09 -0600] POST /write?db=MYDB&rp=MY_RETENSION_POLICY HTTP/1.1 200 23 - python-requests/2.4.3 CPython/2.7.3 Linux/3.2.0-4-amd64 38e409cc-b2f6-11e5-8053-000000000000 7.080412ms [panic:runtime error: index out of range]

The way I work is by writing batches from many nodes in the system. Each batch can contain anything between 1-40 points.

There's nothing special about the batches that fail - I turned on tracing to figure out what points are failing, and when I write the same data using the admin UI or cURL - everything works as expected.

It happens on tsm1 and bz1 both.

@otoolep
Copy link
Contributor

otoolep commented Jan 4, 2016

@caligula1989 - we really need to see the data.

Can you enable write-tracing? https://github.com/influxdb/influxdb/blob/master/etc/config.sample.toml#L220

@joshliberty
Copy link
Author

This is an example:

[http] 2016/01/04 09:07:32 10.60.195.198 - root [04/Jan/2016:09:07:32 -0600] POST /write?db=mydb&rp=keep_7_days HTTP/1.1 200 23 - python-requests/2.4.1 CPython/2.7.3 Linux/3.2.0-4-amd64 e137956b-b2f4-11e5-8069-000000000000 755.753µs [panic :runtime error: index out of range]
[httpd] 2016/01/04 09:07:32 write body received by handler: requests,APIKey=samplekey,APIVersion=4,AccountDirector=Adam\ Cohen,AccountDirectorID=526996,AccountManager=Adam\ Cohen,AccountManagerID=526996,DeveloperClass=PROFESSIONAL,ErrorTypeID=undefined,OrganizationID=3532,OrganizationName=Example\ Org\ Inc.,StatusCode=200 PerfectMatch=1i,Price=0.05,RunTime=2273i,Warnings=17i,counter=1i 1451920042530366976

@otoolep
Copy link
Contributor

otoolep commented Jan 4, 2016

What version of InfluxDB are you running?

@joshliberty
Copy link
Author

0.9.6.1

@beckettsean
Copy link
Contributor

@caligula1989 let me see if I can summarize the issue:

  • you write batches of 1-40 points using the python library and sometimes they fail
  • those same points work when written individually using curl or the admin UI
  • you have tried both bz1 and tsm engines with the same result

What percentage of writes are failing with the error? Is it 1%, 5%, 25%, 95%?

Did you delete all data when switching between engines or did you just update the config and restart the process?

Have you tried writing one of the failed batches using curl? If so, did the batch succeed?

Have you tried resubmitting the same batch via the python client?

Can you share more than one example of a failed batch? There's nothing obviously syntactically wrong with the single point example you provided, but perhaps in aggregate something will become clear.

@joshliberty
Copy link
Author

Yeah, you summarized it correctly.

  1. In terms of percentage, about 60% of writes fail - however most points end up being written (I correlate this using a different source). Understanding how this happens will require some more analyzing on my end - but my initial guess would be that the nodes automatically retry failed batches.
  2. I did not delete all the data, only updated the config & restarted.
  3. I did write using curl, it worked.
  4. I did not re-try using the python client, I will.
  5. Here is another example:

[http] 2016/01/04 09:09:12 10.60.195.198 - root [04/Jan/2016:09:09:12 -0600] POST /write?db=mydb&rp=keep_7_days HTTP/1.1 200 23 - python-requests/2.4.1 CPython/2.7.3 Linux/3.2.0-4-amd64 1cc18025-b2f5-11e5-815d-000000000000 4.351949ms [panic:runtime error: index out of range]
[httpd] 2016/01/04 09:09:12 write body received by handler: requests,APIKey=samplekey,APIVersion=4,AccountDirector=Ron\ Leon,AccountDirectorID=770193,AccountManager=Summer\ Lowan,AccountManagerID=527000,DeveloperClass=BASIC,ErrorTypeID=undefined,OrganizationID=4421,OrganizationName=Nokur,StatusCode=200 PossibleMatch=1i,Price=0.01,RunTime=118i,Warnings=0i,counter=1i 1451920095924807936
requests,APIKey=samplekey,APIVersion=4,AccountDirector=Ron\ Leon,AccountDirectorID=770193,AccountManager=Summer\ Lowan,AccountManagerID=527000,DeveloperClass=SOCIAL,ErrorTypeID=undefined,OrganizationID=6443,OrganizationName=https://corg.co,StatusCode=200 PossibleMatch=1i,Price=0.005,RunTime=122i,Warnings=0i,counter=1i 1451920096353157888
requests,APIKey=samplekey,APIVersion=4,AccountDirector=Ron\ Leon,AccountDirectorID=770193,AccountManager=Summer\ Lowan,AccountManagerID=527000,DeveloperClass=BASIC,ErrorTypeID=undefined,OrganizationID=4421,OrganizationName=Nokur,StatusCode=200 PerfectMatch=1i,Price=0.01,RunTime=445i,Warnings=0i,counter=1i 1451920096732312064
requests,APIKey=samplekey,APIVersion=4,AccountDirector=Ron\ Leon,AccountDirectorID=770193,AccountManager=Summer\ Lowan,AccountManagerID=527000,DeveloperClass=BASIC,ErrorTypeID=undefined,OrganizationID=4421,OrganizationName=Nokur,StatusCode=200 PossibleMatch=1i,Price=0.01,RunTime=264i,Warnings=0i,counter=1i 1451920098559104000
requests,APIKey=samplekey,APIVersion=3,AccountDirector=Adam\ Cohen,AccountDirectorID=526996,AccountManager=Adam\ Cohen,AccountManagerID=526996,DeveloperClass=ADVANCED,ErrorTypeID=undefined,OrganizationID=5753,OrganizationName=Example\ Org\ Inc.,StatusCode=200 PerfectMatch=1i,Price=0.02,RunTime=2290i,Warnings=0i,counter=1i 1451920101426490112
requests,APIKey=samplekey,APIVersion=4,AccountDirector=Adam\ Cohen,AccountDirectorID=526996,AccountManager=Adam\ Cohen,AccountManagerID=526996,DeveloperClass=BASIC,ErrorTypeID=undefined,OrganizationID=6967,OrganizationName=LYmi,StatusCode=200 PerfectMatch=1i,Price=0.01,RunTime=340i,Warnings=1i,counter=1i 1451920105786765056
requests,APIKey=samplekey,APIVersion=3,AccountDirector=Adam\ Cohen,AccountDirectorID=526996,AccountManager=Adam\ Cohen,AccountManagerID=526996,DeveloperClass=ADVANCED,ErrorTypeID=undefined,OrganizationID=5753,OrganizationName=Example\ Org\ Inc.,StatusCode=200 NoMatch=1i,Price=0.02,RunTime=1827i,Warnings=0i,counter=1i 1451920107794617088
requests,APIKey=samplekey,APIVersion=undefined,AccountDirector=undefined,AccountDirectorID=undefined,AccountManager=undefined,AccountManagerID=undefined,DeveloperClass=undefined,ErrorTypeID=undefined,OrganizationID=undefined,OrganizationName=undefined,StatusCode=undefined RunTime=1944i,counter=1i 1451920108552012032
requests,APIKey=samplekey,APIVersion=undefined,AccountDirector=undefined,AccountDirectorID=undefined,AccountManager=undefined,AccountManagerID=undefined,DeveloperClass=undefined,ErrorTypeID=undefined,OrganizationID=undefined,OrganizationName=undefined,StatusCode=undefined RunTime=1552i,counter=1i 1451920109220024064
requests,APIKey=samplekey,APIVersion=4,AccountDirector=Ron\ Leon,AccountDirectorID=770193,AccountManager=Summer\ Lowan,AccountManagerID=527000,DeveloperClass=BASIC,ErrorTypeID=undefined,OrganizationID=4421,OrganizationName=Nokur,StatusCode=200 PossibleMatch=1i,Price=0.01,RunTime=216i,Warnings=0i,counter=1i 1451920116098835968
requests,APIKey=samplekey,APIVersion=4,AccountDirector=Adam\ Cohen,AccountDirectorID=526996,AccountManager=Adam\ Cohen,AccountManagerID=526996,DeveloperClass=BASIC,ErrorTypeID=undefined,OrganizationID=6967,OrganizationName=LYmi,StatusCode=200 PerfectMatch=1i,Price=0.01,RunTime=136i,Warnings=1i,counter=1i 1451920133530590976
requests,APIKey=samplekey,APIVersion=4,AccountDirector=Summer\ Pathway,AccountDirectorID=770251,AccountManager=Josh\ Liberty,AccountManagerID=526992,DeveloperClass=PROFESSIONAL,ErrorTypeID=BadRequestError,OrganizationID=309,OrganizationName=NoDeal,StatusCode=400 Errors=1i,Price=0.0,RunTime=8i,Warnings=1i,counter=1i 1451920135939287040
requests,APIKey=samplekey,APIVersion=undefined,AccountDirector=undefined,AccountDirectorID=undefined,AccountManager=undefined,AccountManagerID=undefined,DeveloperClass=undefined,ErrorTypeID=undefined,OrganizationID=undefined,OrganizationName=undefined,StatusCode=undefined RunTime=2269i,counter=1i 1451920139440521984
requests,APIKey=samplekey,APIVersion=4,AccountDirector=Adam\ Cohen,AccountDirectorID=526996,AccountManager=Adam\ Cohen,AccountManagerID=526996,DeveloperClass=BASIC,ErrorTypeID=undefined,OrganizationID=6967,OrganizationName=LYmi,StatusCode=200 PerfectMatch=1i,Price=0.01,RunTime=181i,Warnings=1i,counter=1i 1451920147871250944

@rossmcdonald
Copy link
Contributor

@caligula1989 In addition to trying out the Python client, can you give us more information on how you are writing data? So far I've been unable to reproduce the issue with this simple script (also using the requests library):

import requests

with open('batch.txt', 'rb') as fd:
    d = fd.read()
    res = requests.post('http://localhost:8086/write?db=requests', data=d)
    print res

Where batch.txt is the example data you provided. This returns successfully:

python test.py; tail -n1 /var/log/influxdb/influxd.log
<Response [204]>
[http] 2016/01/04 21:49:29 127.0.0.1 - - [04/Jan/2016:21:49:29 +0000] POST /write?db=requests HTTP/1.1 204 0 - python-requests/2.2.1 CPython/2.7.6 Linux/3.13.0-65-generic 07e17d0b-b32d-11e5-800b-000000000000 1.385469ms

Though you may be writing data in another way. Also, can you tell us your:

  • Operating system and version
  • Output from a uname -a

@rossmcdonald rossmcdonald changed the title Panic on writes [0.9.6.1] Panic on writes Jan 4, 2016
@joshliberty
Copy link
Author

@rossmcdonald
I'm writing my data using the influxdb-python library: https://github.com/influxdata/influxdb-python, version 2.9.2. The code I'm using is as follows:

def worker(self):
        batch = []
        last_flush = time.time()
        while True:
            try:
                item = self.queue.get(timeout=5)
                batch.append(item)
                self.queue.task_done()
            except:
                pass  # Queue empty
            batch_ready = len(batch) >= self.batch_size
            manual_flush = self.flush
            time_flush = time.time() - last_flush > self.max_flush_waiting_time

            if batch_ready or len(batch) > 0 and (manual_flush or time_flush):
                try:
                    if batch_ready:
                        reason = "batch being ready"
                    elif manual_flush:
                        reason = "manual flush"
                    else:
                        reason = "flush waiting time being reached"

                    logger.debug("Flushing {} points to influx db due to {}".format(len(batch), reason))
                    self.client.write_points(batch, retention_policy=settings.INFLUX_RETENTION_POLICY)
                    batch = []
                    last_flush = time.time()
                except Exception as e:
                    logger.exception("Error flushing to Influx DB")

Each "point" is a dictionary of this format:

{
    'fields': {
        'Errors': None or integer,
        'PossibleMatch': None or integer,
        'Warnings': None or integer,
        'Price': float,
        'counter': integer,
        'PerfectMatch': None or integer,
        'NoMatch': None,
        'RunTime': integer
    },
    'measurement': 'requests',
    'tags': {
        'ErrorTypeID': string,
        'AccountDirectorID': string,
        'OrganizationName': string,
        'AccountDirector': string,
        'OrganizationID': string,
        'AccountManagerID': string,
        'APIVersion': string,
        'APIKey': string,
        'DeveloperClass': string,
        'AccountManager': string,
        'StatusCode': string
    },
    'time': 1451920147871250944 L
}

My system is debian wheezy, 64bit.

$ uname -a: 
Linux webutils 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u2 x86_64 GNU/Linux

@rossmcdonald
Copy link
Contributor

@caligula1989 Thanks for sending. I don't see any issues with your methodology above, so I don't think the way you are writing points is the issue. So far I haven't been able to reproduce the issue, even though I'm using the exact same InfluxDB python client version and InfluxDB version (though a different data set).

Do you have any estimates on how often the panic occurs? Does it happen every time, or is it more of an intermittent issue? I'd be curious to see if you have the same issue using just the requests library, and not the InfluxDB client library to rule that out as an issue.

Also, how are you launching the worker() function above? Are there multiple threads consuming from the same queue?

@joshliberty
Copy link
Author

@rossmcdonald
I've narrowed it down to "something on the nodes" - when a node is restarted, all queries from it function as expected for a while, and then some of the writes start failing, until most of them are failing.

Unfortunately I'm not able to pinpoint what exactly happens on the node when it starts happening - I was not able to reproduce it in a development environment. I'll keep trying.

In any case - if I can get some background as to what the error message actually means on the server side, it might be helpful in order to debug it on my end & find a way to reproduce it consistently.

@rossmcdonald
Copy link
Contributor

@caligula1989 Regarding background on the error message, unfortunately I can't say for sure without a larger stack trace that pinpoints what routine is throwing the panic. Typically a:

[panic:runtime error: index out of range]

Means that the code is trying to access an invalid index from an array, which isn't super helpful on its own. Are you receiving any extra error messages returned by the Python client?

The fact that a restart seems to resolve the issue leads me to believe this may be caused by a lock. Are you executing any queries when you start to see issues? And does it get to a point where all of the writes are failing?

@joshliberty
Copy link
Author

No, I'm not getting anything else, except the fact that the status code is 200.

Is there a way I can turn on more detailed logs that will show full stack traces?

@joshliberty
Copy link
Author

Closed by mistake

@joshliberty joshliberty reopened this Jan 6, 2016
@beckettsean
Copy link
Contributor

Is there a way I can turn on more detailed logs that will show full stack traces?

The system logs all stack traces by default. I've not seen a case where only the panic message appears without the rest of the trace. Perhaps this is a new SystemD interaction issue, @rossmcdonald would know better if that's a reasonable guess.

It is possible to log all writes, but that's very expensive on the log directory. Set write-tracing = true under the [http] section of the config file. That might point to what happens with the failed writes.

all queries from it function as expected for a while, and then some of the writes start failing, until most of them are failing.

So do queries eventually start failing? If so, how? Do they log the start but never finish? Do they not even show up in the logs? Do they timeout?

Does the system ever OOM?

When the writes fail, you mentioned that some points still get through. So over time, more batches fail with a panic, but some of those points do persist? Over time, do more points persist from the write batches showing panics, or do fewer points persist, or is about the same throughout?

@joshliberty
Copy link
Author

@beckettsean
The logs do not include stack traces, they're logs by the HTTPD handler, in the following format:

[http] 2016/01/07 00:39:05 10.60.195.208 - root [07/Jan/2016:00:39:05 -0600] POST /write?db=mydb&rp=keep_7_days HTTP/1.1 200 23 - python-requests/2.4.3 CPython/2.7.3 Linux/3.2.0-4-amd64 58d69ce1-b509-11e5-b13e-000000000000 689.542µs [panic:runtime error: index out of range]
[http] 2016/01/07 00:39:05 10.4.25.170 - root [07/Jan/2016:00:39:05 -0600] POST /write?db=mydb&rp=keep_7_days HTTP/1.1 200 23 - python-requests/2.4.1 CPython/2.7.3 Linux/3.2.0-4-amd64 58dc233e-b509-11e5-b13f-000000000000 5.531122ms [panic:runtime error: index out of range]
[http] 2016/01/07 00:39:05 10.60.195.210 - root [07/Jan/2016:00:39:05 -0600] POST /write?db=mydb&rp=keep_7_days HTTP/1.1 200 23 - python-requests/2.4.3 CPython/2.7.3 Linux/3.2.0-4-amd64 58f77961-b509-11e5-b140-000000000000 2.37363ms [panic:runtime error: index out of range]
[http] 2016/01/07 00:39:05 10.60.195.208 - root [07/Jan/2016:00:39:05 -0600] POST /write?db=mydb&rp=keep_7_days HTTP/1.1 200 23 - python-requests/2.4.3 CPython/2.7.3 Linux/3.2.0-4-amd64 58ff4862-b509-11e5-b141-000000000000 4.800856ms [panic:runtime error: index out of range]
[http] 2016/01/07 00:39:05 10.60.195.208 - root [07/Jan/2016:00:39:05 -0600] POST /write?db=mydb&rp=keep_7_days HTTP/1.1 200 23 - python-requests/2.4.3 CPython/2.7.3 Linux/3.2.0-4-amd64 59004c30-b509-11e5-b142-000000000000 3.322324ms [panic:runtime error: index out of range]
[http] 2016/01/07 00:39:05 10.60.195.206 - root [07/Jan/2016:00:39:05 -0600] POST /write?db=mydb&rp=keep_7_days HTTP/1.1 200 23 - python-requests/2.4.3 CPython/2.7.3 Linux/3.2.0-4-amd64 59017d5b-b509-11e5-b143-000000000000 1.152101ms [panic:runtime error: index out of range]
[http] 2016/01/07 00:39:05 10.60.195.208 - root [07/Jan/2016:00:39:05 -0600] POST /write?db=mydb&rp=keep_7_days HTTP/1.1 200 23 - python-requests/2.4.3 CPython/2.7.3 Linux/3.2.0-4-amd64 59072fe7-b509-11e5-b144-000000000000 8.480476ms [panic:runtime error: index out of range]
[http] 2016/01/07 00:39:05 10.60.195.210 - root [07/Jan/2016:00:39:05 -0600] POST /write?db=mydb&rp=keep_7_days HTTP/1.1 204 0 - python-requests/2.4.3 CPython/2.7.3 Linux/3.2.0-4-amd64 5924aaf1-b509-11e5-b145-000000000000 34.053037ms
[http] 2016/01/07 00:39:06 10.60.195.208 - root [07/Jan/2016:00:39:05 -0600] POST /write?db=mydb&rp=keep_7_days HTTP/1.1 200 23 - python-requests/2.4.3 CPython/2.7.3 Linux/3.2.0-4-amd64 592a8e90-b509-11e5-b146-000000000000 5.950387ms [panic:runtime error: index out of range]

Enabling write-tracing simply logs the http payload, but doesn't provide any additional information such as where the exception occurred. I have copied some failing points in my previous comments - nothing special about them.

The queries definitely finish - in fact, they finish MUCH quicker than normal queries. Depending on payload size, it can be anything between microseconds (see above, 689.542µs, for example), and up to around 5ms - much less than succesfull queries.

The system does not OOM.

Some points persist, but I believe this is because, as you can see from my code snippet above - when a write results in an error (and getting status code 200 is an error), the "batch" list is not emptied, so the node will try the points again - until a write suceeds and the list is flushed.

@beckettsean
Copy link
Contributor

@pauldix @jwilder any thoughts as to what could panic in the write path but still return a 200 HTTP code and leave the process intact?

@jwilder
Copy link
Contributor

jwilder commented Jan 8, 2016

That looks like the recovery handler in the HTTP service: https://github.com/influxdata/influxdb/blob/master/services/httpd/handler.go#L842

We need to add some better logging there.

@joshliberty
Copy link
Author

This is the full stack trace:

2016/01/15 15:47:52 http: panic serving 10.60.195.208:34209: runtime error: index out of range
goroutine 384 [running]:
net/http.func·011()
        /usr/local/go/src/net/http/server.go:1130 +0xbb
github.com/influxdb/influxdb/models.scanTagsValue(0xc20de60834, 0x10, 0xf5cc, 0x10, 0x10, 0x0, 0x0, 0x0)
        /tmp/tmp.gie0pepR9B/src/github.com/influxdb/influxdb/models/points.go:437 +0x2b6
github.com/influxdb/influxdb/models.scanTags(0xc20de60834, 0x10, 0xf5cc, 0x10, 0xc20a7c2aa8, 0x64, 0x64, 0x0, 0xf682, 0x0, ...)
        /tmp/tmp.gie0pepR9B/src/github.com/influxdb/influxdb/models/points.go:393 +0x29b
github.com/influxdb/influxdb/models.scanKey(0xc20de60834, 0x10, 0xf5cc, 0x9, 0x101b560, 0x0, 0x0, 0x0, 0x0, 0x0)
        /tmp/tmp.gie0pepR9B/src/github.com/influxdb/influxdb/models/points.go:270 +0x23d
github.com/influxdb/influxdb/models.parsePoint(0xc20de60834, 0x10, 0xf5cc, 0xece2b6008, 0x31de6525, 0x101b560, 0xbd6ea0, 0x1, 0x0, 0x0, ...)
        /tmp/tmp.gie0pepR9B/src/github.com/influxdb/influxdb/models/points.go:193 +0x86
github.com/influxdb/influxdb/models.ParsePointsWithPrecision(0xc20de60000, 0xe865, 0xfe00, 0xece2b6008, 0x31de6525, 0x101b560, 0xbd6ea0, 0x1, 0x0, 0x0, ...)
        /tmp/tmp.gie0pepR9B/src/github.com/influxdb/influxdb/models/points.go:172 +0x499
github.com/influxdb/influxdb/services/httpd.(*Handler).serveWriteLine(0xc20808cb80, 0x7fd1c2f3b070, 0xc208e12540, 0xc208d544e0, 0xc20de60000, 0xe865, 0xfe00, 0x0)
        /tmp/tmp.gie0pepR9B/src/github.com/influxdb/influxdb/services/httpd/handler.go:489 +0x1e4
github.com/influxdb/influxdb/services/httpd.(*Handler).serveWrite(0xc20808cb80, 0x7fd1c2f3b070, 0xc208e12540, 0xc208d544e0, 0x0)
        /tmp/tmp.gie0pepR9B/src/github.com/influxdb/influxdb/services/httpd/handler.go:395 +0x6e8
github.com/influxdb/influxdb/services/httpd.*Handler.(github.com/influxdb/influxdb/services/httpd.serveWrite)·fm(0x7fd1c2f3b070, 0xc208e12540, 0xc208d544e0, 0x0)
        /tmp/tmp.gie0pepR9B/src/github.com/influxdb/influxdb/services/httpd/handler.go:112 +0x4f
github.com/influxdb/influxdb/services/httpd.func·003(0x7fd1c2f3b070, 0xc208e12540, 0xc208d544e0)
        /tmp/tmp.gie0pepR9B/src/github.com/influxdb/influxdb/services/httpd/handler.go:712 +0x78
net/http.HandlerFunc.ServeHTTP(0xc20805c9a0, 0x7fd1c2f3b070, 0xc208e12540, 0xc208d544e0)
        /usr/local/go/src/net/http/server.go:1265 +0x41
github.com/influxdb/influxdb/services/httpd.func·004(0x7fd1c2f3af98, 0xc208e124e0, 0xc208d544e0)
        /tmp/tmp.gie0pepR9B/src/github.com/influxdb/influxdb/services/httpd/handler.go:773 +0x288
net/http.HandlerFunc.ServeHTTP(0xc20802b310, 0x7fd1c2f3af98, 0xc208e124e0, 0xc208d544e0)
        /usr/local/go/src/net/http/server.go:1265 +0x41
github.com/influxdb/influxdb/services/httpd.func·005(0x7fd1c2f3af98, 0xc208e124e0, 0xc208d544e0)
        /tmp/tmp.gie0pepR9B/src/github.com/influxdb/influxdb/services/httpd/handler.go:782 +0xba
net/http.HandlerFunc.ServeHTTP(0xc20805c9c0, 0x7fd1c2f3af98, 0xc208e124e0, 0xc208d544e0)
        /usr/local/go/src/net/http/server.go:1265 +0x41
github.com/influxdb/influxdb/services/httpd.func·006(0x7fd1c2f3af98, 0xc208e124e0, 0xc208d544e0)
        /tmp/tmp.gie0pepR9B/src/github.com/influxdb/influxdb/services/httpd/handler.go:820 +0x3f4
net/http.HandlerFunc.ServeHTTP(0xc20802b6a0, 0x7fd1c2f3af98, 0xc208e124e0, 0xc208d544e0)
        /usr/local/go/src/net/http/server.go:1265 +0x41
github.com/influxdb/influxdb/services/httpd.func·007(0x7fd1c2f3af98, 0xc208e124e0, 0xc208d544e0)
        /tmp/tmp.gie0pepR9B/src/github.com/influxdb/influxdb/services/httpd/handler.go:830 +0x179
net/http.HandlerFunc.ServeHTTP(0xc20802b6c0, 0x7fd1c2f3af98, 0xc208e124e0, 0xc208d544e0)
        /usr/local/go/src/net/http/server.go:1265 +0x41
github.com/influxdb/influxdb/services/httpd.func·008(0x7fd1c2f3af60, 0xc20e3e2140, 0xc208d544e0)
        /tmp/tmp.gie0pepR9B/src/github.com/influxdb/influxdb/services/httpd/handler.go:838 +0x123
net/http.HandlerFunc.ServeHTTP(0xc20805c9e0, 0x7fd1c2f3af60, 0xc20e3e2140, 0xc208d544e0)
        /usr/local/go/src/net/http/server.go:1265 +0x41
github.com/bmizerany/pat.(*PatternServeMux).ServeHTTP(0xc208038020, 0x7fd1c2f3af60, 0xc20e3e2140, 0xc208d544e0)
        /tmp/tmp.gie0pepR9B/src/github.com/bmizerany/pat/mux.go:109 +0x21c
github.com/influxdb/influxdb/services/httpd.(*Handler).ServeHTTP(0xc20808cb80, 0x7fd1c2f3af60, 0xc20e3e2140, 0xc208d544e0)
        /tmp/tmp.gie0pepR9B/src/github.com/influxdb/influxdb/services/httpd/handler.go:181 +0x36e
net/http.serverHandler.ServeHTTP(0xc208a9e2a0, 0x7fd1c2f3af60, 0xc20e3e2140, 0xc208d544e0)
        /usr/local/go/src/net/http/server.go:1703 +0x19a
net/http.(*conn).serve(0xc20e3e20a0)
        /usr/local/go/src/net/http/server.go:1204 +0xb57
created by net/http.(*Server).Serve
        /usr/local/go/src/net/http/server.go:1751 +0x35

@jsternberg
Copy link
Contributor

@caligula1989 are any of the tags you are writing empty? It looks like the line you are panicking on doesn't have a bounds check when it starts reading the tag value, so I think something like this may cause a panic: cpu,host=

jsternberg added a commit that referenced this issue Jan 15, 2016
A panic would happen if you wrote the following:
    cpu,host=

There was a missing bounds check when scanning the tag value.

Fixes #5262.
@beckettsean
Copy link
Contributor

@jsternberg in an earlier comment, @caligula1989 shared a write that caused the issue. It doesn't have any empty tag values, but there are some escaped spaces in tag values. Perhaps one of those is not evaluating properly?

@jwilder
Copy link
Contributor

jwilder commented Jan 15, 2016

@beckettsean It could be a new line in the tag value as well right after the equals. Using the cli w/ 0.9.6 or later and writing insert cpu,host= produces the exact panic seen. This is a regression introduced in 7ce385f. This does not happen in 0.9.5 and returns ERR: unable to parse 'cpu,host=': missing fields.

@jsternberg
Copy link
Contributor

@beckettsean unfortunately I can't reproduce the issue with the write used above, which is really strange. The stacktrace also points to the issue I mentioned being the problem, but I don't see any evidence that there's an errant newline in the input.

@joshliberty
Copy link
Author

Wouldn't we see a newline in the payload dump?

I've reviewed the code of the http handler, and while I have limited knowledge of go in general and the influxdb codebase in particular - it SEEMS to me that the write of the dump doesn't happen at the same time as the request log, so if there are many requests, they wouldn't be subsequent in the log file, and the I wouldn't have pasted the right data being saved.

Please correct me if I'm wrong (I could very well be).

@jsternberg
Copy link
Contributor

@caligula1989 you're right that we would expect a newline in the input, but from my (very) limited knowledge of looking at these, I've seen a few other bug reports where the logging output seemed to have nothing to do with the backtrace.

So it's possible the request log written right before the panic has nothing to do with it.

jsternberg added a commit that referenced this issue Jan 16, 2016
A panic would happen if you wrote the following:
    cpu,host=

There was a missing bounds check when scanning the tag value.

Fixes #5262.
@joshliberty
Copy link
Author

@jsternberg Interesting. That's also what I would expect when looking at this:
https://github.com/influxdata/influxdb/blob/master/services/httpd/handler.go#L385

Only when the request actually ends, is the request line printed. If two requests come in simultaneously, what's to ensure that the printed payload actually belongs to the subsequent request line?

Seems like the best course of action for me is to log the request data when I get these errors on the client side. I'll report my findings.

@joshliberty
Copy link
Author

So, I have enabled detailed logging on the nodes, and indeed - a tag contained a newline. I've addressed the issue on my end, and will check whether the issue is eliminated.

In any case, better handling is essential here. There's absolutely no reason for such a bug to not have a straightforward error message.

@jwilder
Copy link
Contributor

jwilder commented Jan 17, 2016

@caligula1989 #5370 will cause writes like that to return a parse error and not panic.

@joshliberty
Copy link
Author

Indeed it solved the issue.

@jsternberg jsternberg added this to the 0.10.0 milestone Jan 18, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants