Operation and Maintenance Responsibilities
As part of day-to-day operation of Data.gov, there are many Operation and Maintenance (O&M) responsibilities. This page documents the cadence of these tasks, as well as how to perform them. Where possible, we should make these tasks notification driven, meaning that the system generates a notification when an action is necessary, and instructs the operator on how to perform the action.
This role is about triage and not about resolving everything that comes up. By dedicating one team member to O&M, the rest of the team can focus on the Sprint Backlog without fear of something slipping through the cracks.
As part of Data.gov's Authority to Operate (ATO), there are a number of additional responsibilities that we include in Operation & Maintenance. Not only is it useful to have a single place where our compliance processes are documented, but it also helps us document and provide evidence for self-assessments and audits.
Compliance responsibilities should name the control being addressed and explain how to provide evidence that the control is being met, e.g. how to generate a screenshot, or a link to evidence we've used in the past.
Instead of having the entire team watching notifications and risking some notifications slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role which rotates each week.
The Triage person is responsible for watching notifications and triaging any required actions based on severity. If the issue is resolved quickly (under 2 hours), the Triage person may do the work immediately. Otherwise, the Triage person should record a GitHub issue in GSA/data.gov. If the issue is urgent, the Triage person should ask for help in Slack and get the issue assigned immediately. Otherwise, the issue will be scheduled into a sprint just like any other work.
The Triage role is not about doing the work of O&M but making sure the work is tracked and scheduled appropriately.
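The triage rules above can be sketched as a small decision function. This is only an illustration: the `Notification` class, its fields, and the returned strings are hypothetical and not part of any Data.gov tooling.

```python
from dataclasses import dataclass

QUICK_FIX_HOURS = 2  # "resolved quickly (under 2 hours)" per the text above


@dataclass
class Notification:
    """Hypothetical record of something the Triage person has to look at."""
    summary: str
    estimated_hours: float  # rough estimate made by the Triage person
    urgent: bool = False


def triage(n: Notification) -> str:
    """Return the next action for a notification, following the rules above."""
    if n.estimated_hours < QUICK_FIX_HOURS:
        # Quick fixes may be done right away by the Triage person.
        return "fix now"
    if n.urgent:
        # Urgent work is tracked and escalated so it gets assigned immediately.
        return "file issue in GSA/data.gov and ask for help in Slack"
    # Everything else is tracked and scheduled like any other work.
    return "file issue in GSA/data.gov and schedule into a sprint"
```

For example, `triage(Notification("Harvest source down", estimated_hours=6, urgent=True))` returns the Slack escalation action.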
The Triage person is expected to be dedicated, full-time, to the Triage role over the course of the sprint. The majority of the work involves watching notification channels (email and Slack). If there are no notifications to respond to, the Triage person should look at tech debt stories that make it easier for themselves and others to perform the triage role. The Triage person may also pick up feature work as long as they have bandwidth to do so.
Inbound requests from the public are sent to datagov@. The customer support folks manage and respond to these requests. When a technical fix or response is needed it is escalated to datagovhelp@ where the Operations and Maintenance (O&M) point of contact should triage the request.
- If urgent, raise the issue in #datagov-devsecops.
- Create a GH issue for any support issue requiring more than 30 minutes of work.
It is not unusual to receive few emails.
GitHub offers a service (dependabot) that will scan the dependencies of our GitHub repos and publish Pull Requests to update dependencies. In some cases, these address security vulnerabilities and should be reviewed and merged as soon as possible.
Snyk also publishes PRs for dependency updates that address security vulnerabilities. Follow this guide for triaging Snyk alerts.
To be able to track these PRs, it's best practice to watch the repos which will notify on any updates to those repos, like:
Weekly emails are sent out, but the primary management is via GitHub PRs. The Snyk console can always be checked: https://app.snyk.io/org/data.gov.
Every day you should get these emails:
- Catalog - Harvesting Job Successful - Summary Notification
  - The harvest job needs to complete in under 24 hours (harvest jobs that get cancelled at 24 hours are not counted as errors).
- Catalog - Harvesting Job - Error Notification
  - Note that a lot of errors are expected; for example, uploading a large GIS dataset.
  - Validation errors can be safely ignored.
Please see Examples of Harvest Job Errors.
Also see a list of known broken harvest sources.
These should be scanned for large numbers of errors or anything that jumps out as a systemic problem. For example, you may get harvest reports with no changes from many/multiple harvest jobs. This probably means that the harvest job has stalled out and that the job has run for 24hrs without doing any harvesting. See this issue for details.
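As a rough sketch of the stall check described above, the function below counts harvest reports that show no changes. The report shape (`source`, `added`, `updated`, `deleted`) and the `min_sources` threshold are assumptions for illustration; the actual notification emails are not structured this way and would need to be transcribed or parsed first.

```python
def no_change_sources(reports):
    """Return names of harvest sources whose report shows no changes at all."""
    return [
        r["source"]
        for r in reports
        if r["added"] == 0 and r["updated"] == 0 and r["deleted"] == 0
    ]


def looks_stalled(reports, min_sources=3):
    """Flag a possible stalled harvester when many sources report no changes.

    min_sources is a hypothetical threshold for "many/multiple" harvest jobs.
    """
    return len(no_change_sources(reports)) >= min_sources


# Example with made-up reports
reports = [
    {"source": "agency-a", "added": 0, "updated": 0, "deleted": 0},
    {"source": "agency-b", "added": 0, "updated": 0, "deleted": 0},
    {"source": "agency-c", "added": 0, "updated": 0, "deleted": 0},
    {"source": "agency-d", "added": 4, "updated": 1, "deleted": 0},
]
stalled = looks_stalled(reports)
```

A positive result is only a hint to investigate, since a source can legitimately have no changes on a given day.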
You may create issue tickets for errors or reach out to data providers about their errors as you have time.
Update notifications (slack #datagov-notifications)
To keep up to date with third-party open source libraries, we subscribe to updates. GitHub dependabot handles some of these, but not all. The goal is not to be on the bleeding edge of these technologies. The goal is to avoid lagging so far behind that, when we must update, the work becomes insurmountable.
- Assess if the library/dependency/role should be updated. Emoji ✔️ the notification so we know it's been triaged.
- Create a GH issue if the work involved is non-trivial.
Triaging system alerts (slack #datagov-alerts)
Every notification on this channel should be acknowledged and investigated.
Tools include:
- New Relic to check performance
- Basic user testing: checking if site is down
- Requesting help from the team
If a notification is firing too frequently, consider changing alert notification settings or thresholds. You are encouraged to create tickets for issues or pain points; you don't need to do them all yourself.
Issues are automatically created in the following cases and need to be triaged:
- Catalog Issues
- Inventory Issues
- Deployment Failures for catalog and inventory
- Restart Failures for catalog and inventory
- Automated CKAN Task Failures for catalog
Pull Requests are automatically created for Snyk Scans for Catalog and Inventory.
Check for duplicates on the catalog.data.gov site. The number of duplicates reported by the following URLs should be 0 or extremely close to 0:
- Catalog DCAT-US API dupe check. Check the result.facets.identifier list. Ignore identifiers that are likely to be common (like low ints, etc.) as they likely belong to different orgs. Investigate things that look like they should be globally unique (guids, URLs, etc.) either via the Catalog UI or API.
  - Via the API: call https://catalog.data.gov/api/action/package_search?q=identifier:%22{ID}%22&facet.field=[%22organization%22], replacing {ID} with the value above
  - Via the UI: search with identifier:{ID}
- Catalog Geospatial API duplicate check. Check the result.facets.guid list. Ignore guids that are likely to be common (like low ints, etc.) as they likely belong to different orgs. Investigate things that look like they should be globally unique (guids, URLs, etc.) either via the Catalog UI or API.
  - Via the API: call https://catalog.data.gov/api/action/package_search?q=guid:%22{ID}%22&facet.field=[%22organization%22], replacing {ID} with the value above
  - Via the UI: search with guid:{ID}
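The manual filtering described above can be approximated with a few lines of Python. This is a minimal sketch, not the team's script: it assumes the facet list has been loaded as a mapping of value to dataset count, treats any count above 1 as a duplicate candidate, and uses a hypothetical cutoff of 1000 for what counts as a "low int".

```python
from urllib.parse import quote

BASE_URL = "https://catalog.data.gov/api/action/package_search"


def is_low_int(value, limit=1000):
    """True for small plain integers, which are likely reused across orgs."""
    return value.isascii() and value.isdigit() and int(value) < limit


def suspicious_values(facet_counts, limit=1000):
    """Return facet values that appear more than once and look globally unique.

    facet_counts maps an identifier or guid to its dataset count
    (assumed shape of result.facets.identifier / result.facets.guid).
    """
    return sorted(
        value
        for value, count in facet_counts.items()
        if count > 1 and not is_low_int(value, limit)
    )


def lookup_url(field, value):
    """Build the follow-up package_search URL for one identifier or guid."""
    query = f"{field}:%22{quote(value, safe='')}%22"
    return f"{BASE_URL}?q={query}&facet.field=[%22organization%22]"
```

Each value returned by `suspicious_values` can be passed to `lookup_url("identifier", value)` or `lookup_url("guid", value)` to see which organizations hold the duplicates.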
If there is a problem, create a bug ticket to investigate further. If desired, you can use this bash script to programmatically filter out items that exist in multiple organizations (this is allowed by the metadata specs and harvester, although we should push those that use "simple" identifiers to use more unique/complex ones).
- Dupes that have different harvest sources (field: harvest_source_title)
  - Maybe that is the only case that is ok?
- Dupes that have the same harvest source (field: harvest_source_title)
- Dupe that does not have a harvest source (harvest_source_title) at all
- Probably more to be added later
If a more detailed analysis is needed, try running the de-dupe organization overview job for details per organization.
TODO - design process around new relic
- what is mvp?
- what is long term automated solution?
- may require capturing login/logout events (eg run query to check all logins this week)
Utilize New Relic log review notes for current process...
- Verify that the Solr leader and each follower are functional
  Use this command to find Solr URLs and credentials in the prod space.
  cf t -s prod && cf env catalog-web | grep solr -C 2 | grep "uri\|solr_follower_individual_urls\|password\|username"
  - Verify their Start time is in sync with Solr Memory Alert history at path /solr/#/
  - Verify each follower stays with the Solr leader at path /solr/#/ckan/core-overview
  - Verify each Solr is responsive by running a few queries at /solr/#/ckan/query
  - Inspect each Solr's logging for abnormal errors at /solr/#/~logging
- Examine the Solr Memory Utilization Graph to catch any abnormal incidents.
  - Log in to the tts-jump AWS account* and switch role to SSBDev, then go to the custom SolrAlarm dashboard to see the graph for the past 72 hours. There should not be any Solr instance whose MemoryUtilization goes above the 90% threshold without getting restarted. Each Solr should not restart too often (more than a few times a week).
  - For help logging into AWS, see troubleshooting below
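The two thresholds in the memory check above can be expressed as simple predicates. This sketch assumes the dashboard values have been copied out by hand into plain Python data; the input shapes, function names, and the limit of 3 restarts per week (standing in for "a few") are all hypothetical.

```python
from datetime import datetime, timedelta

MEMORY_THRESHOLD = 90.0    # percent, from the text above
MAX_RESTARTS_PER_WEEK = 3  # assumption: interpretation of "a few times a week"


def over_threshold(samples, threshold=MEMORY_THRESHOLD):
    """Return instances with at least one memory sample above the threshold.

    samples maps an instance name to a list of MemoryUtilization percentages.
    Flagged instances should then be checked for a matching restart.
    """
    return sorted(
        name for name, values in samples.items()
        if any(v > threshold for v in values)
    )


def restarts_too_often(restart_times, now, max_per_week=MAX_RESTARTS_PER_WEEK):
    """True if an instance restarted more than max_per_week times in 7 days."""
    week_ago = now - timedelta(days=7)
    recent = [t for t in restart_times if week_ago <= t <= now]
    return len(recent) > max_per_week
```

For example, `over_threshold({"solr-leader": [70.0, 93.5], "solr-follower-1": [60.0]})` flags only `solr-leader`.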
TODO - is this still needed? if so, what is still needed?
The monthly inventory report contains a spreadsheet of all of our hosts, their IP addresses, what OS version they are running, and what URL they are serving. We maintain our System Inventory as a living document.
[Proposed change] The monthly inventory report contains a spreadsheet of all of our external URLs for our apps hosted on cloud.gov. We maintain our System Inventory as a living document.
- Check that the report matches the changes in our living document.
When this report is received, upload the PDF reports to the Data.gov Drive and create an assessment document. For any Medium, High, or Critical issues, create GitHub issues in GSA/data.gov and link to relevant bookmark within the assessment document. This allows us to keep private details secure but still track the issue on our public Kanban board.
- Identify if there are any Medium, High, or Critical severity issues.
- Upload reports to the Data.gov Drive.
- Create an assessment document for summary and discussion of issues.
- Create GitHub issues for each item in order to track progress. Keep any sensitive details in the assessment document.
- Assign each issue to a compliance milestone
- Create a TODO ??
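The steps above amount to filtering findings by severity and drafting public issues that point at the private assessment document. The sketch below illustrates that split; the finding fields, the `draft_issue` helper, and the issue wording are hypothetical, and it only builds text rather than calling the GitHub API.

```python
TRACKED_SEVERITIES = {"medium", "high", "critical"}


def findings_to_track(findings):
    """Keep only findings whose severity requires a GitHub issue."""
    return [f for f in findings if f["severity"].lower() in TRACKED_SEVERITIES]


def draft_issue(finding, bookmark_url):
    """Draft a public issue that links to the assessment document bookmark.

    Sensitive details (finding["details"]) are deliberately left out so they
    stay in the private assessment document.
    """
    return {
        "title": f"[{finding['severity'].capitalize()}] {finding['title']}",
        "body": f"See the assessment document for details: {bookmark_url}",
    }


# Example with made-up findings
findings = [
    {"title": "Outdated TLS setting", "severity": "Medium", "details": "private"},
    {"title": "Verbose server banner", "severity": "Low", "details": "private"},
]
tracked = findings_to_track(findings)
issue = draft_issue(tracked[0], "https://example.gov/assessment#bookmark-1")
```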
Netsparker scans come in two flavors, authenticated and unauthenticated. Authenticated scans are done with credentials, as a logged in user, while unauthenticated scans involve no credentials, just as the public internet sees the site. GSA policy is to perform unauthenticated scans monthly and authenticated scans annually. For FY20, they are trying to do authenticated scans quarterly.
The scan is done based on seed URLs. Netsparker then crawls these URLs looking for issues and additional pages. The scan is time-boxed to 8 hours, so the report will only include what was found within that timeframe.
More information, including contact information is available in our meeting notes.
Review the list of Auditable events. Adjust the list based on system changes
In addition to performing this annually, when changes are made to the SSP, the list of Audited Events should be updated if necessary.
A restore from a backup should be performed to confirm that the documentation is accurate and to validate that the restore works. Ideally this should be a production backup restored into staging. If an emergency restore was done, then that may cover this requirement.
In catalog and inventory, this should be a complete restoration of the DB and Solr. This could also serve to keep staging data comparable to, and up to date with, prod.
The test is initiated by the GSA Scan team and the ISSO. Data.gov is not notified of the specifics of the test. Once the test is complete, the report of any findings should be recorded as issues in GSA/datagov-incident-response and prioritized against existing work based on the issue severity.
Our ISSO usually initiates this exercise. This is a table-top exercise where we are given a scenario and run through our Incident Response Plan.
Utilize Log review notes to understand what ERRORS are expected in logs. Add new expected errors to the page above.
When the DB-Solr-Sync job reports: N packages without harvest_object need to be mannually deleted, you will need to run these commands:
Important: Catalog Fetch must be idle. Do not perform this action while the harvester (aka catalog-fetch) is running.
- Run "ckan geodatagov harvest-object-relink". This will fix packages that have a good but not current harvest_object_id.
  - The exact command is:
    cf run-task catalog-admin --command "ckan geodatagov db-solr-sync --dryrun" -k 1500M -m 2G
  - The above command performs a dry run of the operation. Remove the --dryrun flag to perform the actual operation.
- After this, perform a manual re-run of the db-solr-sync GitHub Action.
- The remaining datasets will need to be purged.
  - To do this, run a batch delete using CKANapi.
    - Gather the list of datasets without harvest object IDs using the dryrun command above. Format those into a text file with a single package_id per line.
    - Work with Fuhu to run the batch delete script
      - This is temporary until he can publish a gist to be run self-service
- After the above steps are run, the count reported by running the dryrun task again should be 0.
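Preparing the text file for the batch delete (one package_id per line) can be done with a short helper like the one below. It is a sketch under the assumption that the IDs have already been extracted from the dry run output into a list of strings; the function name and file name are hypothetical, and parsing the dry run output itself is not covered.

```python
from pathlib import Path


def write_package_ids(package_ids, path):
    """Write unique, non-empty package IDs to path, one per line.

    Returns the number of IDs written. Order of first appearance is kept.
    """
    cleaned = (pid.strip() for pid in package_ids)
    unique = list(dict.fromkeys(pid for pid in cleaned if pid))
    Path(path).write_text("".join(f"{pid}\n" for pid in unique), encoding="utf-8")
    return len(unique)
```

For example, `write_package_ids(ids, "packages_to_purge.txt")` produces a file that can be handed to whoever runs the batch delete script.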
- For more info on logging in to AWS, visit this Google Drive document: https://docs.google.com/document/d/1mwASz1SDiGcpbeSTTILrliDsUKzg1mjy2u11JmvFW2k/edit?usp=sharing