
Operation and Maintenance Responsibilities

Fuhu Xia edited this page Feb 19, 2021 · 74 revisions

As part of day-to-day operation of Data.gov, there are many Operation and Maintenance (O&M) responsibilities. This page documents the cadence of these tasks, as well as how to perform them. Where possible, we should make these tasks notification driven, meaning that the system generates a notification when an action is necessary, and instructs the operator on how to perform the action.

Compliance responsibilities

As part of Data.gov's Authority to Operate (ATO), there are a number of additional responsibilities that we include in Operation & Maintenance. Not only is it useful to have a single place where our compliance processes are documented, but it also helps us document and provide evidence for self-assessments and audits.

Compliance responsibilities should mark the control that is being addressed and include any information on how to provide evidence that the control is being met, e.g. how to generate a screenshot or a link to evidence we've used in the past.

O&M Triage rotation

Instead of having the entire team watching notifications and risking some notifications slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role which rotates each week.

The Triage person is responsible for watching notifications and triaging any required actions based on severity. If the issue can be resolved quickly (under 2 hours), the Triage person may do the work immediately. Otherwise, the Triage person should record a GitHub issue in GSA/datagov-deploy. If the issue is urgent, the Triage person should ask for help in Slack and get the issue assigned immediately. Non-urgent issues are scheduled into a sprint just like any other work.
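
Recording the issue can also be done from the command line. The sketch below assumes the GitHub CLI (gh) is installed and authenticated, which this page does not state; the function name open_triage_issue is hypothetical.

```shell
# open_triage_issue TITLE BODY
# Hypothetical helper: files a tracking issue in GSA/datagov-deploy
# using the GitHub CLI. Assumes `gh` is installed and authenticated.
open_triage_issue() {
  gh issue create --repo GSA/datagov-deploy --title "$1" --body "$2"
}
```

For example, open_triage_issue 'Cron errors on a catalog host' 'Seen in a host notification email.' would file the issue. Keep sensitive details out of the title and body.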

The Triage role is not about doing the work of O&M but making sure the work is tracked and scheduled appropriately.

The Triage person is expected to be dedicated, full-time, to the Triage role over the course of the sprint. The majority of the work involves watching notification channels (email and Slack). If there are no notifications to respond to, the Triage person should look at tech debt stories that make it easier for themselves and others to perform the Triage role. The Triage person may also pick up feature work as long as they have bandwidth to do so.

Ad-hoc/daily

Responding to technical support issues (ad-hoc email)

Inbound requests from the public are sent to datagov@. The customer support folks manage and respond to these requests. When a technical fix or response is needed, the request is escalated to datagovhelp@, where the Operations and Maintenance (O&M) point of contact should triage it.

  • If urgent, raise the issue in #datagov-devsecops.
  • Create a GH issue for any support issue requiring more than 30 minutes of work.

Vulnerable dependency notifications (daily email reports)

Snyk and GitHub send daily reports of outdated dependencies with known vulnerabilities. In some cases, a PR is automatically generated. In other cases, we must manually update the dependencies. Because these may represent a security issue to Data.gov, these should be triaged and addressed quickly. Follow these steps on how to triage. If the vulnerability is valid and critical, or if the dependencies cannot be addressed immediately, raise this in #datagov-devsecops.

Automated dependency updates (ad-hoc GitHub PRs)

GitHub offers a service (dependabot) that will scan the dependencies of our GitHub repos and publish Pull Requests to update dependencies. In some cases, these address security vulnerabilities and should be reviewed and merged as soon as possible.

Snyk also publishes PRs for dependency updates that address security vulnerabilities. Follow this guide for triaging Snyk alerts.

Harvest job report (daily email report)

If items in the queue are older than a few days, the queue should be addressed. See Stuck Harvest Fix.

Bug bounty report (ad-hoc email)

Security researchers may find security vulnerabilities in Data.gov and report them through our vulnerability disclosure policy. These are reported to HackerOne, our bug bounty service. Many reports are false positives, out of scope, or mis-assigned to data.gov, so it is important to triage before taking action.

When a report comes in, evaluate its severity.

  • If it is non-critical, let the Bug Bounty team triage and confirm it for us. Once confirmed, a tracking issue should be created in GSA/datagov-deploy and linked to the bug bounty report. Please keep any sensitive information out of GitHub.
  • If the report is critical, confirm it is legitimate and raise it in #datagov-devsecops.

Reboot notifications (ad-hoc email)

These notifications are generated by our individual hosts. Usually an automated update that requires a reboot will generate these notifications. Run the reboot playbook.

$ ansible-playbook actions/reboot.yml

The jumpboxes are excluded from this playbook, so be sure to also restart all jumpboxes. You can find the details in the datagov-deploy README. Once the playbook completes, reboot the jumpbox and verify connectivity afterwards.

$ sudo shutdown -r now
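
To verify connectivity after the reboot, one option is to poll until the host answers again. This is a minimal sketch: retry_until is a hypothetical helper and is not part of datagov-deploy.

```shell
# retry_until ATTEMPTS DELAY_SECONDS COMMAND...
# Hypothetical helper: runs COMMAND until it succeeds, trying at most
# ATTEMPTS times (at least once) and sleeping DELAY_SECONDS between tries.
retry_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while :; do
    # Stop as soon as the command succeeds.
    "$@" && return 0
    # Give up once the attempt budget is used.
    [ "$i" -ge "$attempts" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
}
```

From a workstation, something like retry_until 30 10 ssh -o ConnectTimeout=5 JUMPBOX_HOST true would wait up to roughly five minutes for SSH to come back, where JUMPBOX_HOST is a placeholder for the actual jumpbox address.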

Update notifications (slack #datagov-notifications)

In order to keep up-to-date with third-party open source libraries, we subscribe to updates. GitHub dependabot handles some of these, but not all. The goal is not to be on the bleeding edge of these technologies. The goal is to avoid lagging so far behind that, when we must update, the work is insurmountable.

  • Assess if the library/dependency/role should be updated. Emoji ✔️ the notification so we know it's been triaged.
  • Create a GH issue if the work involved is non-trivial.

Host notifications (ad-hoc email)

Each host in the BSP environment generates email notifications for various OS processes. The most common of these are errors in cron jobs.

  • Create a GH issue to track.
  • If urgent, raise the issue in #datagov-deploy.

Triaging system alerts (ad-hoc email)

You will get system alerts when any condition that warns of an upcoming outage is met. Acknowledge the alarm in the #datagov-alerts Slack channel. If there is an outage, triage it and take action where possible. Not all alerts come through the Slack channel, so review the New Relic alerts daily. Here are some common action items:

For catalog memory, follow the steps in this guide.

You may need to create a ticket with BSP. These are the steps for submitting a BSP request.

If it seems to be an FCS issue, you can follow up about the ticket and get other details via the Google Chat channel.

WordPress broken links report (daily email report)

Dedupe report (daily email report)

Weekly

AU-3 and AU-6 Log auditing

Review recent activity in GitHub and CircleCI for any signs of suspicious activity.

Review the system and application logs for the Data.gov platform for any evidence of suspicious activity.

From the jumpbox:

$ dsh -g all -M sudo shuf -n 100 /var/log/auth.log | sort | less
$ dsh -g all -M sudo aureport --summary --start week-ago --end yesterday --input-logs | less
$ dsh -g dashboard-web -g crm-web -g wordpress-web -M sudo shuf -n 100 /var/log/nginx/error.log | sort | less
$ dsh -g inventory-web-v1 -M sudo shuf -n 100 /var/log/inventory/ckan.error.log | sort | less
$ dsh -g catalog-next-web -M sudo shuf -n 100 /var/log/ckan/gunicorn.log | sort | less

Record that the review was performed in our audit log. If you find any suspicious activity, follow the Incident Response Plan.
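
When reading auth.log output by eye, a tally of failed SSH password attempts per source address can help spot brute-force activity. This sketch is not part of the documented review procedure; it assumes the usual OpenSSH "Failed password for ... from ADDRESS port ..." message format, and the function name is hypothetical.

```shell
# failed_logins_by_source
# Reads auth.log lines on stdin and prints "COUNT ADDRESS" for each
# source of failed SSH password attempts, most frequent first.
failed_logins_by_source() {
  grep 'Failed password' |
    # Keep only the address that follows the last " from ".
    sed -n 's/.* from \([^ ]*\) port .*/\1/p' |
    sort | uniq -c | sort -rn |
    # Normalize the padded output of uniq -c.
    awk '{ print $1, $2 }'
}
```

It can be fed from the same kind of command used above, for example dsh -g all -M sudo cat /var/log/auth.log | failed_logins_by_source | less.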

Issue Triage Notes

If the aureport job never completes, here are some tips and tricks to try:

  • Use the -c option, as in dsh -c -g all ..., to run on all boxes concurrently and record the boxes that respond. The ones that do not respond are keeping the command from finishing. You can find a complete list of machines at /etc/dsh/group/all.
  • If you can isolate the machines that are blocking, investigate them. Check the size of the /var/log/audit/ folder by running du -h /var/log/audit/. Normal sizes that respond in a reasonable amount of time are 1-4G. If you need to remove data older than x days, use this (replace +x with +30 for removing files more than 30 days old): sudo find /var/log/audit/* -mtime +x -exec rm {} \;
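
The two tips above can be sketched as small helpers. Both function names are hypothetical. The first assumes you have saved the names of the responding hosts to a file, one per line, in the same form used in /etc/dsh/group/all. The second is a dry run that only lists candidate files, so the list can be reviewed before running the rm command.

```shell
# unresponsive_hosts ALL_FILE RESPONDED_FILE
# Prints hosts listed in ALL_FILE (one per line) that do not appear
# in RESPONDED_FILE. Lines are compared as whole, fixed strings.
unresponsive_hosts() {
  grep -vxF -f "$2" "$1"
}

# old_audit_files DIR DAYS
# Dry run: lists regular files under DIR last modified more than DAYS
# days ago. Nothing is deleted.
old_audit_files() {
  find "$1" -type f -mtime +"$2" -print
}
```

For example, unresponsive_hosts /etc/dsh/group/all responded.txt prints the hosts to investigate, and sudo sh -c 'find /var/log/audit -type f -mtime +30 -print' on one of those hosts is the equivalent preview of what a 30-day cleanup would remove.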

Nessus host scan (report from ISSO)

Nessus scans are OS-level scans for common vulnerability classes. The Nessus agent runs on each host and is installed and configured by GSA/datagov-deploy-common. The agents talk to the GSA Scanning team's infrastructure, and that team provides a weekly report.

  • Import the CSV report into the Data.gov team drive.
  • Triage Critical, High, and Moderate items in the report (we ignore Low and Info items).
  • Determine if the issue has already been triaged.
  • If any GH issues are created, link to the report in Google Drive.

Occasionally there is an issue with the agent.

The report may only provide an IP address for the given issue. Reference this Google Sheet for the appropriate host.

Monthly

System inventory check (report from ISSO)

The monthly inventory report contains a spreadsheet of all of our hosts, their IP addresses, what OS version they are running, and what URL they are serving. We maintain our System Inventory as a living document.

Netsparker compliance scan (report from ISSO)

When this report is received, upload the PDF reports to the Data.gov Drive and create an assessment document. For any Medium, High, or Critical issues, create GitHub issues in GSA/datagov-deploy and link to the relevant bookmark within the assessment document. This allows us to keep private details secure but still track the issue on our public Kanban board.

  • Identify if there are any Medium, High, or Critical severity issues.
  • Upload reports to the Data.gov Drive.
  • Create an assessment document for summary and discussion of issues.
  • Create GitHub issues for each item in order to track progress. Keep any sensitive details in the assessment document.
  • Assign each issue to a compliance milestone.
  • Create a

Background

Netsparker scans come in two flavors: authenticated and unauthenticated. Authenticated scans are done with credentials, as a logged-in user, while unauthenticated scans use no credentials and see the site just as the public internet does. GSA policy is to perform unauthenticated scans monthly and authenticated scans annually. For FY20, they are trying to do authenticated scans quarterly.

The scan is done based on seed URLs. Netsparker then crawls these URLs looking for issues and additional pages. The scan is time-boxed to 8 hours, so the report will only include what was found within that timeframe.

More information, including contact information, is available in our meeting notes.

CIS compliance scan (report from ISSO)

This report validates that our hosts meet the 85% compliance rating for the CIS benchmark. The Ubuntu hardening role applies the controls for the CIS Level 2 benchmark and is applied as part of our GSA/datagov-deploy-common role. Normally no action is necessary since this role is run as part of every platform deploy.

If compliance is below 85%, apply the site.yml playbook.

$ ansible-playbook site.yml --skip-tags filebeat
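
The threshold check can be scripted if the compliance percentage is available as a number. This is a small sketch: the function name is hypothetical, and extracting the score from the ISSO report is left out because the report format is not described here.

```shell
# needs_cis_remediation SCORE
# Succeeds (exit status 0) when SCORE, a percentage such as 84.9,
# is below the 85% compliance threshold.
needs_cis_remediation() {
  awk -v score="$1" -v threshold=85 'BEGIN { exit !(score < threshold) }'
}
```

For example, if needs_cis_remediation 82; then ansible-playbook site.yml --skip-tags filebeat; fi would apply the playbook only when the reported score falls short.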

Annually

AU-2 (3) Review Auditable Events

Review the list of Auditable events. Adjust the list based on system changes.

In addition to performing this annually, when changes are made to the SSP, the list of Audited Events should be updated if necessary.

CP-9 Backup testing

Vulnerability pen test

The test is initiated by the GSA Scan team and the ISSO. Data.gov is not notified of the specifics of the test. Once the test is complete, the report of any findings should be recorded as issues in GSA/datagov-incident-response and prioritized against existing work based on the issue severity.

The last pen test was performed in October 2019.

Incident Response Plan test

Our ISSO usually initiates this exercise. This is a table-top exercise where we are given a scenario and run through our Incident Response Plan.

The next test is scheduled for Friday, Nov 22, 2019.
