
Mongodb constantly at 100%+ of CPU #22

Closed
jasongrishkoff opened this issue Feb 17, 2018 · 24 comments

Comments

@jasongrishkoff
Contributor

I've got a 2-core system running on Digital Ocean, and after ~3 days of tracking data, everything totally freezes up with Mongodb stuck at 100% on both cores. I'm using redis-oplog for oplog tailing, but that doesn't seem to help :(

Anyone else experiencing similar?

@jasongrishkoff
Contributor Author

If I run meteor mongo and then db.currentOp({secs_running: {$gte: 300}}) I get two operations. Strangely, both of them have query: {}, so it's hard to say what they're running.

{
        "inprog" : [
                {
                        "desc" : "rsSync",
                        "threadId" : "140578758936320",
                        "active" : true,
                        "opid" : 5,
                        "secs_running" : 468,
                        "microsecs_running" : NumberLong(468308691),
                        "op" : "none",
                        "ns" : "local.oplog.rs",
                        "query" : {

                        },
                        "numYields" : 0,
                        "locks" : {

                        },
                        "waitingForLock" : false,
                        "lockStats" : {
                                "Global" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(19),
                                                "w" : NumberLong(3),
                                                "R" : NumberLong(1),
                                                "W" : NumberLong(1)
                                        }
                                },
                                "Database" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(7),
                                                "w" : NumberLong(2),
                                                "W" : NumberLong(1)
                                        }
                                },
                                "Collection" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(6)
                                        }
                                },
                                "Metadata" : {
                                        "acquireCount" : {
                                                "w" : NumberLong(1)
                                        }
                                },
                                "oplog" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(1),
                                                "w" : NumberLong(2)
                                        }
                                }
                        }
                },
                {
                        "desc" : "ReplBatcher",
                        "threadId" : "140578607867648",
                        "active" : true,
                        "opid" : 9,
                        "secs_running" : 468,
                        "microsecs_running" : NumberLong(468305992),
                        "op" : "none",
                        "ns" : "local.oplog.rs",
                        "query" : {

                        },
                        "numYields" : 0,
                        "locks" : {

                        },
                        "waitingForLock" : false,
                        "lockStats" : {
                                "Global" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(2)
                                        }
                                },
                                "Database" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(1)
                                        }
                                },
                                "oplog" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(1)
                                        }
                                }
                        }
                }
        ],
        "ok" : 1
}

@jehartzog
Collaborator

Are you using the modifications included with PR #16 & #17? I saw the same insane mongo container CPU usage; I didn't quite fix it, but I reduced it to manageable levels by reducing the processing intervals.

Before those changes it would fail after a couple of days due to running out of CPU resources. Now I can keep 7 days' worth of data on a t2.micro unlimited container, and CPU usage sits around ~90%, so my web UI stays responsive enough for me. This is still kind of expensive ($13/month for excess CPU credits in addition to $8/month for the instance) but works well enough for now.

@lmachens
Owner

I have high CPU usage too, but the UI responds fine, so I didn't spend too much time investigating.
Maybe some mongodb indexes are missing and the queries are taking too long?

@jasongrishkoff
Contributor Author

Hi @jehartzog and @lmachens, thanks for getting back to me. After reporting this I did a reboot, and for the last 24 hours it's been hanging out around 100% (of a possible 200%), so things are running fairly smoothly. Perhaps some indexes just needed time to sort themselves out; or, similarly, redis-oplog had to kick into gear. The good news is that it seems to be running fairly stably on a $15/month Digital Ocean droplet (2GB RAM and 2 CPUs). I'll close this issue for now and re-open it if it crashes again.

@jasongrishkoff
Contributor Author

jasongrishkoff commented Feb 19, 2018

Okay so about 5 hours ago it looks like mongodb might have crashed because of excessive CPU usage? Not sure... this is all I see in the logs:

(node:17383) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 1): MongoError: connection 0 to 127.0.0.1:3001 timed out

Grr.

@jehartzog
Collaborator

Double-check that you're using the code from the two PRs I mentioned above; they helped me manage (but not completely fix) this exact issue.

@jasongrishkoff
Contributor Author

@jehartzog Yep, I most certainly am running those -- I only set this server up last week. Here's what I've tried:

  1. A mup deploy, which we've discussed in another thread. This worked for a bit, but then stopped.
  2. So I set up an Ubuntu server and ran Meteor on it directly rather than via mup. This worked swell for ~3 days, but stopped because the mongodb instance was using 100% CPU.
  3. I upgraded the server to 2 cores and made sure oplog tailing was working properly. This seemed to peg mongodb at 100%, and again it worked for a few days.
  4. Then yesterday it stopped, and the logs showed that MongoError above. So I rebooted everything 13 hours ago when I made that post, and it seems to have worked since then.

[screenshot: CPU usage graph, 2018-02-20 10:07 AM]

Stumped for now.

@lmachens
Owner

You can adjust metricsLifetime. If you limit it to something like 3 days, the CPU usage should improve.

@jasongrishkoff
Contributor Author

jasongrishkoff commented Feb 20, 2018

I believe my settings.json already has it set to 3 days:

{
        "metricsLifetime": 259200000
}
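
For reference, that value is just 3 days expressed in milliseconds:

// 3 days in milliseconds
3 * 24 * 60 * 60 * 1000 // = 259200000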

My Digital Ocean droplet has 60GB of disk space, of which 6GB has been used so far. The Meteor node process is at ~1% CPU and the mongodb one is consistently at 100%. I'm guessing mongodb is the culprit -- especially because the period where Kadira went down lines up perfectly with an error message related to mongodb:
(node:17383) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 1): MongoError: connection 0 to 127.0.0.1:3001 timed out

@lmachens
Owner

My CPU usage is always at 100% too, but that's fine for me because I have 2 cores; no issues in the UI. The aggregations are the reason for the high CPU.
Maybe you can try seeing what happens when you deactivate/comment out some of the aggregations in the rma server?

@jasongrishkoff
Contributor Author

jasongrishkoff commented Feb 20, 2018

I went ahead and changed the first constant from 60 * 1000 to 5 * 60 * 1000, and when my server restarted mongodb went from 100% to 200% (maxing out both cores). I'm guessing that's because of this:

Meteor.startup(() => {
  runShort();
  runMedium();
  runLong();
});

Which gets them all to run when Meteor first boots. So I gave it a few minutes to settle down, but... 30 minutes later it still hasn't died down.

Edit -- completely shut down and restarted Meteor so that mongodb would reset. Launched with just runShort() at 5-minute intervals. CPU usage on mongodb spikes to 100% when the server first starts; 3 minutes later it's still pegged at 100%.
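
For anyone following along, the interval setup looks roughly like this (the constant names here are illustrative, not necessarily the exact ones in server/rma/server.js):

// Illustrative sketch -- the real constants in server/rma/server.js may be named differently.
const shortInterval = 5 * 60 * 1000;      // was 60 * 1000 (1 minute), bumped to 5 minutes
const mediumInterval = 30 * 60 * 1000;    // 30 minutes
const longInterval = 3 * 60 * 60 * 1000;  // 3 hours

Meteor.setInterval(runShort, shortInterval);
Meteor.setInterval(runMedium, mediumInterval);
Meteor.setInterval(runLong, longInterval);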

@jasongrishkoff
Contributor Author

jasongrishkoff commented Feb 21, 2018

Looks like my mongodb is now at 200% again -- both processors completely hogged and the UI pretty much unresponsive. I'm going to expand to 3 CPUs and see what happens.

Edit -- that didn't work, as their package was 3 CPUs and 1GB of RAM, which isn't enough memory to boot Meteor properly.

@jasongrishkoff
Contributor Author

jasongrishkoff commented Feb 23, 2018

I've been running it with 4 cores and 8GB of RAM and she's finally happy. But I'm not thrilled about the $40/month it's setting me back! Going to shop around for another host at some point.

@jehartzog
Collaborator

Unless you're experiencing massively different usage from what I'm used to (~200 peak simultaneous users), you may want to dig deeper into fixing the root cause of the mongo container's CPU usage. I don't think you'll find a significantly cheaper hosting service that can handle the CPU processing currently required.

@jasongrishkoff
Contributor Author

jasongrishkoff commented Feb 24, 2018

Okay, so here's an observation -- in the server/rma/server.js file, runShort, runMedium and runLong all get fired at the same time when Meteor starts. I put a little timer on each. At startup, this is how long each one took to run:

  • runMedium = 196 seconds
  • runShort = 216 seconds
  • runLong = 250 seconds

Once they were done, runShort was able to run on its own again and took a total of 348 seconds! As soon as it was done, it started instantly again -- this time taking 480 seconds. Sure enough, the instant it was done another runShort fired, this time taking 579 seconds. So... getting longer each time, heading in the direction of 10+ minutes per runShort(). Not so short, hey!

Each one of them takes up at least 100% of a CPU -- and in my case, with 4 cores, the CPU was at 400% when all 3 functions were running at the same time. It also looks like whenever runMedium fires (30 minutes), runShort will also fire. And whenever runLong fires (3 hours), both runShort and runMedium will fire. So I've constantly got my CPU maxed out at 100% with runShort(), then every 30 minutes it bumps up to 200%+, and then every 3 hours it gets hammered at a full 400%.

Some steps to take to figure this out and mitigate the issue:

  1. Make sure none of them run at exactly the same time
  2. Figure out why runShort(), which is set to run every 60 seconds, takes more like 600 seconds

I'll report back if I make any progress -- this is my first time digging into what actually happens with each aggregation, so if you've got some thoughts in the meantime, that'd be great :)
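
For reference, here's roughly how I'm timing each run, plus one way to stagger the startup calls so they don't all fire at once (a sketch -- the timed helper is mine, not something from the repo):

// Hypothetical timing wrapper around each aggregation run.
function timed(name, fn) {
  return function () {
    const start = Date.now();
    fn();
    console.log('finished ' + name + ' ' + (Date.now() - start) + 'ms');
  };
}

// Staggered startup: runShort right away, the heavier runs a few minutes later.
Meteor.startup(() => {
  timed('runShort', runShort)();
  Meteor.setTimeout(timed('runMedium', runMedium), 10 * 60 * 1000); // +10 minutes
  Meteor.setTimeout(timed('runLong', runLong), 20 * 60 * 1000);     // +20 minutes
});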

@jasongrishkoff
Contributor Author

jasongrishkoff commented Feb 24, 2018

Even with 4 cores, my mongodb keeps crashing and the whole site becomes unusable :(

This could be because I've got so many methods / publications? For example, there are often 25k to 35k method calls per hour that it has to aggregate. Are all these queries indexed?

Edit -- I see in lib/collections.js that there's MethodTraces.rawCollection().createIndex({ appId: 1, host: 1, startTime: 1 }), but if I run db.methodTraces.getIndexes() the only index that exists is for _id. So... these queries might not be indexed.

Edit 2 -- apparently db.methodTraces and db.methodsTraces are different collections (one with an "s", one without); not sure what's going on there.

Edit 3 -- indeed, there's a lack of indexing going on here. Check that COLLSCAN below.

db.methodsMetrics.find({"id" : {"appId" : "5MiwLbzwmiauYGecP","host" : "b8ae8d37d76e","name" : "getBlogNames","time" : ISODate("2018-02-24T20:02:00Z"), "res" : "1min"}}).explain()
{
        "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "meteor.methodsMetrics",
                "indexFilterSet" : false,
                "parsedQuery" : {
                        "id" : {
                                "$eq" : {
                                        "appId" : "5MiwLbzwmiauYGecP",
                                        "host" : "b8ae8d37d76e",
                                        "name" : "getBlogNames",
                                        "time" : ISODate("2018-02-24T20:02:00Z"),
                                        "res" : "1min"
                                }
                        }
                },
                "winningPlan" : {
                        "stage" : "COLLSCAN",
                        "filter" : {
                                "id" : {
                                        "$eq" : {
                                                "appId" : "5MiwLbzwmiauYGecP",
                                                "host" : "b8ae8d37d76e",
                                                "name" : "getBlogNames",
                                                "time" : ISODate("2018-02-24T20:02:00Z"),
                                                "res" : "1min"
                                        }
                                }
                        },
                        "direction" : "forward"
                },
                "rejectedPlans" : [ ]
        },
        "ok" : 1
}
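
For what it's worth, the minimal shell-level fix should look something like this (a sketch -- the actual pull request may do it differently):

// mongo shell: build an index on the embedded "id" document so the planner can use it.
db.methodsMetrics.createIndex({ id: 1 })

// Re-running the same .explain() should then show an IXSCAN-based winning plan
// instead of the COLLSCAN above.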

@jasongrishkoff
Contributor Author

WOW. Okay so I added some indexes and it's like night and day. Instead of runShort() taking 2+ minutes, it now takes ~500 milliseconds. Meanwhile my mongodb CPU is at almost 0%. I'll put in a pull request for the changes.

jasongrishkoff added a commit to jasongrishkoff/meteor-apm-server that referenced this issue Feb 25, 2018
@jehartzog
Collaborator

@jasongrishkoff This looks great, thanks for keeping up the hunt to figure this out!

Are you 100% sure those indexes are the only thing you changed to fix the CPU usage? For one thing, a number of those indexes already exist. I also added the indexes you recommended and I'm not seeing a change in CPU usage.

@jasongrishkoff
Contributor Author

jasongrishkoff commented Feb 25, 2018

@jehartzog you added the indexes in this pull request, right? #23

I'm 99.99999999% positive that was the only change, and it was like night and day.

When you run db.methodsTraces.getIndexes(), do you get one that looks like this?

        {
                "v" : 1,
                "key" : {
                        "id" : 1
                },
                "name" : "id_1",
                "ns" : "meteor.methodsTraces"
        }

@jehartzog
Collaborator

I did use the indexes you added in #23; the only thing I did differently was add them in the background.

These are the only 4 collections you edited in the PR, and methodsTraces isn't one of them, so I'm not sure why I'd check the indexes on that collection to verify I applied the PR correctly :D.

  ErrorMetrics.rawCollection().createIndex({id:1});                                                                    
  SystemMetrics.rawCollection().createIndex({id:1});                                                                   
  MethodsMetrics.rawCollection().createIndex({id:1});                                                                  
  PubMetrics.rawCollection().createIndex({id:1}); 
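
With the background option, that looks roughly like this (a sketch of the variant I ran):

  ErrorMetrics.rawCollection().createIndex({ id: 1 }, { background: true });
  SystemMetrics.rawCollection().createIndex({ id: 1 }, { background: true });
  MethodsMetrics.rawCollection().createIndex({ id: 1 }, { background: true });
  PubMetrics.rawCollection().createIndex({ id: 1 }, { background: true });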

@jasongrishkoff
Contributor Author

jasongrishkoff commented Feb 26, 2018

@jehartzog my apologies, I definitely should have said db.methodsMetrics.getIndexes(), which should give you a response like this:

        {
                "v" : 1,
                "key" : {
                        "id" : 1
                },
                "name" : "id_1",
                "ns" : "meteor.methodsMetrics"
        }

I tested it by doing db.methodsMetrics.dropIndex('id_1') and the negative impact was instant:

I20180226-06:49:30.941(0)? finished runShort 444ms <-- with the index
I20180226-06:52:21.866(0)? finished runShort 111369ms <-- without the index

@jehartzog
Collaborator

I went back and reverified everything; I don't know what I was smoking before, but your solution is 100% effective. I had somehow missed adding the index for pubMetrics, even though I had a history of the correct mongo command and had verified it. I probably switched dbs at some point and didn't realize it. Thanks for working with me on it!

@jasongrishkoff
Contributor Author

@jehartzog Nice! Hopefully it saves you some money on that extra CPU you were paying for :)

@lmachens
Owner

Good job guys! I created a new version with your changes. https://github.com/lmachens/meteor-apm-server/blob/master/CHANGELOG.md#108---2018-02-27
