Operation and Maintenance Responsibilities
TODO:
- interview fuhu about adds
- interview nick about adds
- solr things?
- how do we want to organize this info? (which stuff should be here vs readme in repos)
As part of day-to-day operation of Data.gov, there are many Operation and Maintenance (O&M) responsibilities. This page documents the cadence of these tasks, as well as how to perform them. Where possible, we should make these tasks notification driven, meaning that the system generates a notification when an action is necessary, and instructs the operator on how to perform the action.
This role is about triage and not about resolving everything that comes up. By dedicating one team member to O&M, the rest of the team can focus on the Sprint Backlog without fear of something slipping through the cracks.
As part of Data.gov's Authority to Operate (ATO), there are a number of additional responsibilities that we include in Operation & Maintenance. Not only is it useful to have a single place where our compliance processes are documented, but it also helps us document and provide evidence for self-assessments and audits.
Compliance responsibilities should mark the control that is being addressed and include any information on how to provide evidence that the control is being met, e.g. how to generate a screenshot, or a link to evidence we've used in the past.
Instead of having the entire team watching notifications and risking some notifications slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role which rotates each week.
The Triage person is responsible for watching notifications and triaging any required actions based on severity. If the issue can be resolved quickly (under 2 hours), the Triage person may do the work immediately. Otherwise, the Triage person should record a GitHub issue in GSA/data.gov. If the issue is urgent, the Triage person should ask for help in Slack and get the issue assigned immediately. Otherwise, the issue will be scheduled into a sprint just like any other work.
The Triage role is not about doing the work of O&M but making sure the work is tracked and scheduled appropriately.
The Triage person is expected to be dedicated, full-time, to the Triage role over the course of the sprint. The majority of the work involves watching notification channels (email and Slack). If there are no notifications to respond to, the Triage person should look at tech debt stories that make it easier for themselves and others to perform the Triage role. The Triage person may also pick up feature work as long as they have bandwidth to do so.
Inbound requests from the public are sent to datagov@. The customer support folks manage and respond to these requests. When a technical fix or response is needed, the request is escalated to datagovhelp@, where the Operations and Maintenance (O&M) point of contact should triage it.
- If urgent, raise the issue in #datagov-devsecops.
- Create a GH issue for any support issue requiring more than 30 minutes of work.
Snyk and GitHub send daily reports of outdated dependencies with known vulnerabilities. In some cases, a PR is automatically generated. In other cases, we must manually update the dependencies. Because these may represent a security issue to Data.gov, these should be triaged and addressed quickly. Follow these steps on how to triage. If the vulnerability is valid and critical, or if the dependencies cannot be addressed immediately, raise this in #datagov-devsecops.
GitHub offers a service (dependabot) that will scan the dependencies of our GitHub repos and publish Pull Requests to update dependencies. In some cases, these address security vulnerabilities and should be reviewed and merged as soon as possible.
Snyk also publishes PRs for dependency updates for security vulnerabilities. Follow this guide for triaging snyk alerts.
To be able to track these PRs, it's best practice to watch the repos which will notify on any updates to those repos, like:
Every day you should get these emails:
- Catalog - Harvesting Job Successful - Summary Notification
  - The harvest job needs to complete in under 24 hours, because harvest jobs that get cancelled at 24 hours are not counted as errors.
- Catalog - Harvesting Job - Error Notification
  - Note that a lot of errors are expected, for example when uploading a large GIS dataset.
These should be scanned for large numbers of errors or anything that jumps out as a systemic problem. For example, you may get harvest reports with no changes from many harvest jobs. This probably means that harvesting has stalled and that the job has run for 24 hours without doing any harvesting. See this issue for details.
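As a rough illustration of the scan described above, the sketch below flags two patterns: a large share of jobs reporting no changes, and any single job with an unusually high error count. The `HarvestReport` fields, the `stalled_ratio` and `error_limit` thresholds, and the idea of transcribing the emails into records are all assumptions made for this example; none of them come from the harvester itself.

```python
from dataclasses import dataclass


@dataclass
class HarvestReport:
    """Hypothetical summary of one harvest job, transcribed from an email."""
    source: str
    added: int
    updated: int
    deleted: int
    errors: int


def has_no_changes(report):
    """True when the job neither added, updated, nor deleted anything."""
    return report.added == 0 and report.updated == 0 and report.deleted == 0


def flag_systemic_problems(reports, stalled_ratio=0.5, error_limit=100):
    """Return human-readable warnings that deserve a closer look.

    stalled_ratio and error_limit are illustrative thresholds, not
    values agreed on by the team.
    """
    warnings = []
    if reports:
        idle = [r.source for r in reports if has_no_changes(r)]
        if len(idle) / len(reports) >= stalled_ratio:
            warnings.append(
                f"{len(idle)} of {len(reports)} jobs reported no changes; "
                "harvesting may have stalled"
            )
    for r in reports:
        if r.errors > error_limit:
            warnings.append(f"{r.source}: {r.errors} errors")
    return warnings
```

Some errors are expected, so the thresholds would need tuning against a few weeks of real reports before this is useful as an alert.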
You may create issue tickets for errors or reach out to data providers about their errors as you have time.
Security researchers may find security vulnerabilities in Data.gov and report them through our vulnerability disclosure policy. These are reported to HackerOne, our bug bounty service. Many reports are false positives, out of scope, or mis-assigned to data.gov, so it is important to triage before taking action.
When a report comes in, evaluate its severity.
- If it is non-critical, let the Bug Bounty team triage and confirm it for us. Once confirmed, a tracking issue should be created in GSA/data.gov and linked to the bug bounty report. Please keep any sensitive information out of GitHub.
- If the report is critical, confirm it is legitimate and raise it in Slack #datagov-devsecops.
Update notifications (slack #datagov-notifications)
In order to keep up to date with third-party open source libraries, we subscribe to updates. GitHub dependabot handles some of these, but not all. The goal is not to be on the bleeding edge of these technologies. The goal is to avoid lagging so far behind that, when we must update, the work is insurmountable.
- Assess if the library/dependency/role should be updated. Emoji ✔️ the notification so we know it's been triaged.
- Create a GH issue if the work involved is non-trivial.
Triaging system alerts (slack #datagov-alerts)
Every notification on this channel should be acknowledged and investigated.
Tools include:
- New Relic to check performance
- Basic user testing: checking if site is down
- Requesting help from the team
If a notification is firing too frequently, consider changing alert notification settings or thresholds. You are encouraged to create tickets for issues or pain points; you don't need to do them all yourself.
Check for duplicates on the catalog.data.gov site. The following URLs should report 0 duplicates, or extremely close to 0:
- https://catalog.data.gov/api/action/package_search?facet.field=[%22identifier%22]&facet.limit=-1&facet.mincount=2&rows=0
- https://catalog.data.gov/api/action/package_search?facet.field=[%22guid%22]&facet.limit=-1&facet.mincount=2&rows=0
If either has a large number of duplicates, create a bug ticket to investigate further. If desired, you can use this bash script to programmatically filter out items that exist in multiple organizations (this is allowed by the metadata specs and harvester, although we should push publishers that use "simple" identifiers toward more unique, complex ones).
If a more detailed analysis is needed, try running the de-dupe organization overview job for details per organization.
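The two duplicate-check URLs can also be queried from a script. The sketch below rebuilds the same query parameters and counts how many facet values are shared by two or more datasets. It assumes the response carries the facet counts under `result.facets[<field>]`, which is how CKAN's `package_search` usually reports them; check this against a real response before relying on it. The function names are made up for this example.

```python
import json
import urllib.parse
import urllib.request

CATALOG_API = "https://catalog.data.gov/api/action/package_search"


def build_duplicates_url(field):
    """Build the facet query for values of `field` used by 2+ datasets."""
    params = {
        "facet.field": json.dumps([field]),  # e.g. ["identifier"]
        "facet.limit": -1,                   # no cap on facet values
        "facet.mincount": 2,                 # only values seen at least twice
        "rows": 0,                           # skip the dataset records
    }
    return CATALOG_API + "?" + urllib.parse.urlencode(params)


def count_duplicates(response, field):
    """Count facet values that appear on more than one dataset.

    Assumes the CKAN response shape {"result": {"facets": {field: {...}}}}.
    """
    facets = response.get("result", {}).get("facets", {}).get(field, {})
    return sum(1 for count in facets.values() if count >= 2)


def fetch_duplicate_count(field):
    """Query catalog.data.gov and return the number of duplicated values."""
    with urllib.request.urlopen(build_duplicates_url(field), timeout=60) as resp:
        return count_duplicates(json.load(resp), field)


if __name__ == "__main__":
    for field in ("identifier", "guid"):
        print(field, fetch_duplicate_count(field))
```

Unlike the bash script mentioned above, this sketch does not filter out items that legitimately exist in multiple organizations, so its counts are an upper bound.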
Currently investigating duplicates from CDC: https://github.com/GSA/data.gov/issues/4073
TODO - design process around new relic
- what is mvp?
- what is long term automated solution?
- may require capturing login/logout events (eg run query to check all logins this week)
Utilize New Relic log review notes for current process...
TODO - is this still needed? if so, what is still needed?
The monthly inventory report contains a spreadsheet of all of our hosts, their IP addresses, what OS version they are running, and what URL they are serving. We maintain our System Inventory as a living document.
[Proposed change] The monthly inventory report contains a spreadsheet of all of our external URLs for our apps hosted on cloud.gov. We maintain our System Inventory as a living document.
- Check that the report matches the changes in our living document.
When this report is received, upload the PDF reports to the Data.gov Drive and create an assessment document. For any Medium, High, or Critical issues, create GitHub issues in GSA/data.gov and link to the relevant bookmark within the assessment document. This allows us to keep private details secure but still track the issue on our public Kanban board.
- Identify if there are any Medium, High, or Critical severity issues.
- Upload reports to the Data.gov Drive.
- Create an assessment document for summary and discussion of issues.
- Create GitHub issues for each item in order to track progress. Keep any sensitive details in the assessment document.
- Assign each issue to a compliance milestone
- Create a TODO ??
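The filtering and issue-drafting steps above can be sketched as follows. This is only an illustration: the `Finding` record, its fields, and the shape of the drafted issue are hypothetical, and the findings are assumed to have been transcribed by hand from the PDF reports. The drafted issue deliberately carries only the severity and a link to the private assessment document, so that sensitive details stay out of GitHub.

```python
from dataclasses import dataclass

# Severities that require a tracking issue, per the steps above.
TRACKED_SEVERITIES = {"medium", "high", "critical"}


@dataclass
class Finding:
    """Hypothetical record for one scan finding."""
    title: str          # may be sensitive; kept out of the public issue
    severity: str       # e.g. "Low", "Medium", "High", "Critical"
    bookmark_url: str   # bookmark inside the private assessment document


def findings_to_track(findings):
    """Keep only Medium, High, and Critical findings."""
    return [
        f for f in findings
        if f.severity.strip().lower() in TRACKED_SEVERITIES
    ]


def draft_issue(finding, milestone):
    """Draft a public issue that points at the private assessment document."""
    severity = finding.severity.strip().title()
    return {
        "title": f"[{severity}] scan finding",
        "body": f"Details: {finding.bookmark_url}",
        "milestone": milestone,
    }
```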
Netsparker scans come in two flavors: authenticated and unauthenticated. Authenticated scans are done with credentials, as a logged-in user, while unauthenticated scans involve no credentials and see the site just as the public internet does. GSA policy is to perform unauthenticated scans monthly and authenticated scans annually. For FY20, they are trying to do authenticated scans quarterly.
The scan is done based on seed URLs. Netsparker then crawls these URLs looking for issues and additional pages. The scan is time-boxed to 8 hours, so the report will only include what was found within that timeframe.
More information, including contact information, is available in our meeting notes.
Review the list of Auditable events. Adjust the list based on system changes.
In addition to performing this annually, when changes are made to the SSP, the list of Audited Events should be updated if necessary.
A restore from a backup should be performed to confirm that the documentation is accurate and that the restore works. Ideally this should be a production backup restored into staging. If an emergency restore was done, then that may cover this requirement.
In catalog and inventory, this should be a complete restoration of the DB and Solr. This could also serve to keep staging data comparable to, and up to date with, prod.
The test is initiated by the GSA Scan team and the ISSO. Data.gov is not notified of the specifics of the test. Once the test is complete, the report of any findings should be recorded as issues in GSA/datagov-incident-response and prioritized against existing work based on the issue severity.
Our ISSO usually initiates this exercise. This is a table-top exercise where we are given a scenario and run through our Incident Response Plan.