Mongodb constantly at 100%+ of CPU #22

jasongrishkoff · 2018-02-17T19:38:44Z

I've got a 2-core system running on Digital Ocean, and after ~3 days of tracking data, everything totally freezes up with Mongodb stuck at 100% on both cores. I'm using redis-oplog for oplog tailing, but that doesn't seem to help :(

Anyone else experiencing similar?

jasongrishkoff · 2018-02-17T19:42:33Z

If I run meteor mongo and then db.currentOp({secs_running: {$gte: 300}}) I get two queries. Strangely both of them have query: {} so it's hard to say what they're running.

{
        "inprog" : [
                {
                        "desc" : "rsSync",
                        "threadId" : "140578758936320",
                        "active" : true,
                        "opid" : 5,
                        "secs_running" : 468,
                        "microsecs_running" : NumberLong(468308691),
                        "op" : "none",
                        "ns" : "local.oplog.rs",
                        "query" : {

                        },
                        "numYields" : 0,
                        "locks" : {

                        },
                        "waitingForLock" : false,
                        "lockStats" : {
                                "Global" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(19),
                                                "w" : NumberLong(3),
                                                "R" : NumberLong(1),
                                                "W" : NumberLong(1)
                                        }
                                },
                                "Database" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(7),
                                                "w" : NumberLong(2),
                                                "W" : NumberLong(1)
                                        }
                                },
                                "Collection" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(6)
                                        }
                                },
                                "Metadata" : {
                                        "acquireCount" : {
                                                "w" : NumberLong(1)
                                        }
                                },
                                "oplog" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(1),
                                                "w" : NumberLong(2)
                                        }
                                }
                        }
                },
                {
                        "desc" : "ReplBatcher",
                        "threadId" : "140578607867648",
                        "active" : true,
                        "opid" : 9,
                        "secs_running" : 468,
                        "microsecs_running" : NumberLong(468305992),
                        "op" : "none",
                        "ns" : "local.oplog.rs",
                        "query" : {

                        },
                        "numYields" : 0,
                        "locks" : {

                        },
                        "waitingForLock" : false,
                        "lockStats" : {
                                "Global" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(2)
                                        }
                                },
                                "Database" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(1)
                                        }
                                },
                                "oplog" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(1)
                                        }
                                }
                        }
                }
        ],
        "ok" : 1
}

jehartzog · 2018-02-18T02:39:37Z

Are you using the modifications included with PR #16 & #17? I saw the same insane mongo container CPU usage, I didn't quite fix it but reduced it to manageable levels by reducing processing intervals.

Before those changes it would fail due to running out of CPU resources after a couple days. Now I can keep 7 days worth of data on a t2.micro unlimited container, and CPU is usage sits around ~90%, so my web UI stays responsive enough for me. This is still kind of expensive ($13/month for excess CPU credits in addition to $8/month for the instance) but works well enough for now.

lmachens · 2018-02-18T08:15:25Z

I have high CPU usage too but the UI response fine, so I didn't spend to much time on investigation.
Maybe some mongodb indexes are missing and the queries are taking to long?

jasongrishkoff · 2018-02-18T14:08:06Z

Hi @jehartzog and @lmachens thanks for getting back. After reporting this I did a reboot, and for the last 24 hours it's been hanging out around 100% (of a possible 200%), so things are running fairly smoothly. Perhaps some indexes just needed some time to work out; or, similarly, redis-oplog had to kick into gear. The good news is that it seems to be working fairly stable on a $15/month Digital Ocean droplet (2gb ram and 2 CPU). Will close this issue for now, and re-open if it crashes again.

jasongrishkoff · 2018-02-19T19:30:13Z

Okay so about 5 hours ago it looks like mongodb might have crashed because of excessive CPU usage? Not sure... this is all I see in the logs:

(node:17383) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 1): MongoError: connection 0 to 127.0.0.1:3001 timed out

Grr.

jehartzog · 2018-02-19T20:36:54Z

Double check that you are using code from the two PR's I added above, they helped manage (but not completely fix) this exact issue for me.

jasongrishkoff · 2018-02-20T08:09:33Z

@jehartzog Yep, I most certainly am running those -- I only set this server up last week. Here's what I tried:

A mup deploy, which we've discussed in another thread. This worked for a bit, but then stopped
So, I decided to set up an Ubuntu server and run Meteor on there directly rather than via mup. This worked swell for ~3 days, but stopped because the mongodb instance was using 100% CPU.
I upgraded the server to have 2 cores and made sure I had oplog working well. This seemed to peg mongodb at 100%, and again it worked for a few days.
Then yesterday it stopped, and the logs showed that MongoError above. So, I rebooted everything 13 hours ago when I made that post, and it seems to have worked since then

Stumped for now.

lmachens · 2018-02-20T08:38:33Z

You can adjust metricsLifetime. If you limit it to like 3 days, the cpu usage should be better.

jasongrishkoff · 2018-02-20T08:45:49Z

I believe my settings.json already has it set to 3 days:

{
        "metricsLifetime": 259200000
}

My Digital Ocean droplet has 60gb of space -- of which it says 6gb has been used so far. The Meteor node process is at ~1% CPU and the mongodb one consistently at 100%. I'm guessing that mongodb is the culprit -- especially because the period where Kadira went down syncs perfectly with an error message related to mongodb:
(node:17383) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 1): MongoError: connection 0 to 127.0.0.1:3001 timed out

lmachens · 2018-02-20T08:54:05Z

My CPU usage is always at 100% too, but it is fine for me because of 2 cores. No issues in the UI. The aggregations are the reason for the high CPU.
Maybe you can try out what happens when you deactivate/comment out some aggregations in rma server?

jasongrishkoff · 2018-02-20T09:29:20Z

I went ahead and changed the first contsant from 60 * 1000 to 5 * 60 * 1000 and when my server restarted mongodb went from 100% to 200% (so using max of both cores). I'm guessing that's because of this:

Meteor.startup(() => {
  runShort();
  runMedium();
  runLong();
});

Which gets it to run when Meteor first reboots. So I gave it a few minutes to settle down, but... 30 minutes later and it hasn't died down yet.

Edit -- completely shut down and restarted Meteor so that mongodb would reset. Launched with just runShort() at 5 minute intervals. CPU usage on mongodb spikes to 100% when the server first starts; 3 minutes later it's still pegged at 100%.

jasongrishkoff · 2018-02-21T17:40:52Z

Looks like my mongodb is now at 200% again -- both processors completely hogged up and the UI pretty much unresponsive. I'm going to expand to 3 CPUs and see what happens.

Edit -- that didn't work, as their package was 3 CPUs and 1gb of ram, which isn't enough ram to boot Meteor properly.

jasongrishkoff · 2018-02-23T08:02:09Z

Been running it with 4 cores and 8gb of ram and she's finally happy. But I'm not super happy about the $40/month that's setting me back! Going to shop around for another host at some point.

jehartzog · 2018-02-23T12:51:32Z

Unless you're experiencing some massively different customer usage than I'm used to (~200 peak active simultaneous users), you may want to dig more into fixing the issue causing the mongo container CPU usage. I don't think you'll be able to find a significantly cheaper hosting service that can do the CPU processing that is currently required.

jasongrishkoff · 2018-02-24T16:17:16Z

Okay so here's an observation -- in the server/rma/server.js file, each of runShort, runMedium and runLong get fired at the same time when Meteor starts. I put a little timer on each. At startup, this is how long each one took to run:

runMedium = 196 seconds
runShort = 216 seconds
runLong = 250 seconds

Once they were done, runShort was able to run on its own again and took a total of 348 seconds! As soon as it was done, it started instantly again -- this time taking 480 seconds. Sure enough, the instant it was done another runShort fired, this time taking 579 seconds. So... getting longer each time, heading in the direction of 10+ minutes per runShort(). Not so short, hey!

Each one of them takes up at least 100% of a CPU -- and in my case, with 4 cores, the CPU was at 400% when all 3 functions were running at the same time. It also looks like whenever runMedium fires (30 minutes), runShort will also fire. And whenever runLong fires (3 hours), both runShort and runMedium will fire. So I've constantly got my CPU maxed out at 100% with runShort(), then every 30 minutes it bumps up to 200%+, and then every 3 hours it gets hammered at a full 400%.

Some steps to take to figure this out and mitigate the issue:

Make sure none of them run at exactly the same time
Figure out why the runShort(), which is set to run every 60 seconds, takes more like 600 seconds

I'll report back if I make any progress -- this is my first time digging into what actually happens with each aggregation, so if you've got some thoughts in the meanwhile that'd be great :)

jasongrishkoff · 2018-02-24T19:31:53Z

Even with 4 cores, my mongodb keeps crashing and the whole site becomes unusable :(

This could be because I've got so many methods / publications? For example, there'll often be 25k to 35k methods per hour that it has to aggregate. Are all these queries indexed?

Edit -- I see in lib/collections.js that there's MethodTraces.rawCollection().createIndex({ appId: 1, host: 1, startTime: 1 }), but if I run db.methodTraces.getIndexes() the only index that exists is for _id. So... these queries might not be indexed.

Edit 2 -- apparently db.methodTraces and db.methodsTraces are different (one with "s", one without. Not sure what's going on there).

Edit 3 -- indeed, there's a lack of indexing going on here. Check that COLLSCAN below.

db.methodsMetrics.find({"id" : {"appId" : "5MiwLbzwmiauYGecP","host" : "b8ae8d37d76e","name" : "getBlogNames","time" : ISODate("2018-02-24T20:02:00Z"), "res" : "1min"}}).explain()
{
        "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "meteor.methodsMetrics",
                "indexFilterSet" : false,
                "parsedQuery" : {
                        "id" : {
                                "$eq" : {
                                        "appId" : "5MiwLbzwmiauYGecP",
                                        "host" : "b8ae8d37d76e",
                                        "name" : "getBlogNames",
                                        "time" : ISODate("2018-02-24T20:02:00Z"),
                                        "res" : "1min"
                                }
                        }
                },
                "winningPlan" : {
                        "stage" : "COLLSCAN",
                        "filter" : {
                                "id" : {
                                        "$eq" : {
                                                "appId" : "5MiwLbzwmiauYGecP",
                                                "host" : "b8ae8d37d76e",
                                                "name" : "getBlogNames",
                                                "time" : ISODate("2018-02-24T20:02:00Z"),
                                                "res" : "1min"
                                        }
                                }
                        },
                        "direction" : "forward"
                },
                "rejectedPlans" : [ ]
        },
        "ok" : 1
}

jasongrishkoff · 2018-02-25T09:28:06Z

WOW. Okay so I added some indexes and it's like night and day. Instead of runShort() taking 2+ minutes, it now takes ~500 milliseconds. Meanwhile my mongodb CPU is at almost 0%. I'll put in a pull request for the changes.

This helps solve lmachens#22

jehartzog · 2018-02-25T16:29:13Z

@jasongrishkoff This looks great, thanks for keeping up the hunt to figure this out!

Are you 100% sure those indexes are the only things you changed to fix the CPU usage? For one thing, number of those indexes already exist. I also added the indexes you recommend and I'm not seeing a change in CPU usage.

jasongrishkoff · 2018-02-25T18:44:15Z

@jehartzog you added the indexes in this pull request, right? #23

I'm 99.99999999% positive that was the only change, and it was like night and day.

When you fire db.methodsTraces.getIndexes() do you get one that looks like this?

        {
                "v" : 1,
                "key" : {
                        "id" : 1
                },
                "name" : "id_1",
                "ns" : "meteor.methodsTraces"
        }

jehartzog · 2018-02-25T21:50:52Z

I did use the indexes you added in #23, the only thing I did different was add them in the background.

These are the only 4 collections that you edited in the PR, but methodsTraces is not one of those changed. I'm not sure why I would check indexes on that collection to ensure I applied the PR correctly :D.

  ErrorMetrics.rawCollection().createIndex({id:1});                                                                    
  SystemMetrics.rawCollection().createIndex({id:1});                                                                   
  MethodsMetrics.rawCollection().createIndex({id:1});                                                                  
  PubMetrics.rawCollection().createIndex({id:1});

jasongrishkoff · 2018-02-26T06:55:22Z

@jehartzog my apologies, I definitely should have said db.methodsMetrics.getIndexes(), which should give you a response as such:

        {
                "v" : 1,
                "key" : {
                        "id" : 1
                },
                "name" : "id_1",
                "ns" : "meteor.methodsMetrics"
        }

I tested it by doing db.methodsMetrics.dropIndex('id_1') and the negative impact was instant:

I20180226-06:49:30.941(0)? finished runShort 444ms <-- with the index
I20180226-06:52:21.866(0)? finished runShort 111369ms <-- without the index

jehartzog · 2018-02-26T13:02:59Z

I went back and reverified everything, don't know what I was smoking before but your solution is 100% effective. I had somehow missed the adding the index for pubMetrics, even though I had a history of the correct mongo command and had verified it. I probably had switched db at some point and not realized it. Thanks for working with me on it!

jasongrishkoff · 2018-02-26T13:07:58Z

@jehartzog Nice! Hopefully it saves you some money on that extra CPU you were paying for :)

lmachens · 2018-02-27T07:42:00Z

Good job guys! I created a new version with your changes. https://github.com/lmachens/meteor-apm-server/blob/master/CHANGELOG.md#108---2018-02-27

jasongrishkoff closed this as completed Feb 18, 2018

jasongrishkoff reopened this Feb 19, 2018

jasongrishkoff closed this as completed Feb 25, 2018

jasongrishkoff added a commit to jasongrishkoff/meteor-apm-server that referenced this issue Feb 25, 2018

Update collections.js

86de506

This helps solve lmachens#22

jasongrishkoff mentioned this issue Feb 25, 2018

Adding indexes to the collections #23

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mongodb constantly at 100%+ of CPU #22

Mongodb constantly at 100%+ of CPU #22

jasongrishkoff commented Feb 17, 2018

jasongrishkoff commented Feb 17, 2018

jehartzog commented Feb 18, 2018

lmachens commented Feb 18, 2018

jasongrishkoff commented Feb 18, 2018

jasongrishkoff commented Feb 19, 2018 •

edited

Loading

jehartzog commented Feb 19, 2018

jasongrishkoff commented Feb 20, 2018

lmachens commented Feb 20, 2018

jasongrishkoff commented Feb 20, 2018 •

edited

Loading

lmachens commented Feb 20, 2018

jasongrishkoff commented Feb 20, 2018 •

edited

Loading

jasongrishkoff commented Feb 21, 2018 •

edited

Loading

jasongrishkoff commented Feb 23, 2018 •

edited

Loading

jehartzog commented Feb 23, 2018

jasongrishkoff commented Feb 24, 2018 •

edited

Loading

jasongrishkoff commented Feb 24, 2018 •

edited

Loading

jasongrishkoff commented Feb 25, 2018

jehartzog commented Feb 25, 2018

jasongrishkoff commented Feb 25, 2018 •

edited

Loading

jehartzog commented Feb 25, 2018

jasongrishkoff commented Feb 26, 2018 •

edited

Loading

jehartzog commented Feb 26, 2018

jasongrishkoff commented Feb 26, 2018

lmachens commented Feb 27, 2018

Mongodb constantly at 100%+ of CPU #22

Mongodb constantly at 100%+ of CPU #22

Comments

jasongrishkoff commented Feb 17, 2018

jasongrishkoff commented Feb 17, 2018

jehartzog commented Feb 18, 2018

lmachens commented Feb 18, 2018

jasongrishkoff commented Feb 18, 2018

jasongrishkoff commented Feb 19, 2018 • edited Loading

jehartzog commented Feb 19, 2018

jasongrishkoff commented Feb 20, 2018

lmachens commented Feb 20, 2018

jasongrishkoff commented Feb 20, 2018 • edited Loading

lmachens commented Feb 20, 2018

jasongrishkoff commented Feb 20, 2018 • edited Loading

jasongrishkoff commented Feb 21, 2018 • edited Loading

jasongrishkoff commented Feb 23, 2018 • edited Loading

jehartzog commented Feb 23, 2018

jasongrishkoff commented Feb 24, 2018 • edited Loading

jasongrishkoff commented Feb 24, 2018 • edited Loading

jasongrishkoff commented Feb 25, 2018

jehartzog commented Feb 25, 2018

jasongrishkoff commented Feb 25, 2018 • edited Loading

jehartzog commented Feb 25, 2018

jasongrishkoff commented Feb 26, 2018 • edited Loading

jehartzog commented Feb 26, 2018

jasongrishkoff commented Feb 26, 2018

lmachens commented Feb 27, 2018

jasongrishkoff commented Feb 19, 2018 •

edited

Loading

jasongrishkoff commented Feb 20, 2018 •

edited

Loading

jasongrishkoff commented Feb 20, 2018 •

edited

Loading

jasongrishkoff commented Feb 21, 2018 •

edited

Loading

jasongrishkoff commented Feb 23, 2018 •

edited

Loading

jasongrishkoff commented Feb 24, 2018 •

edited

Loading

jasongrishkoff commented Feb 24, 2018 •

edited

Loading

jasongrishkoff commented Feb 25, 2018 •

edited

Loading

jasongrishkoff commented Feb 26, 2018 •

edited

Loading