Improve Monitoring for Platform System to Support ATN Data Ingestion Pipeline #126
Comments
I added a GitHub ticket summarizing the request I mentioned at the ATN data meeting today, but assigned it a Blocked status since we have not decided to prioritize it at the moment. Let me know if you have additional questions! @MathewBiddle @mmckinzie @conniekot
I do want to note that some work and troubleshooting related to this will be needed to fix the recent observation that some active Wildlife Computers deployments stopped flowing into the portal at the end of October. It's been hard to tell what is going on, so we do need to add better error logging to assist with the troubleshooting. That work is separate from, but related to, the larger effort this ticket references. I'm not really sure where to provide updates on this or on newly observed issues. Maybe a ticket in the "Automate ingestion of different tag types" Milestone that hangs around for tracking the maintenance and troubleshooting required to fix existing data streams? Or a new ticket per issue that we need to pull developers in for? For example, I created #117 to capture the effort of adjusting the data pipeline to accommodate GPE3 situations.
Thank you for adding this, @iwensu0313. To help me further understand this issue, it would be beneficial to know exactly which point in the process we are running into issues.
I'm trying to understand where exactly datasets have been having issues. If it's strictly for visualization purposes, we should discuss how much value that adds and whether it's something we need to support at the moment. I think we should start thinking about new activities in this kind of framework to help us prioritize what should be resourced.
It appears that the issue is happening in the connection between the Wildlife Computers API and our ingestion system. We have checked the WC API and there is data. When we then check our ingestion system, some of it is found and some is not. We are not sure why yet. We've recently been improving the health of our systems/servers/storage, which are getting overloaded, by offloading onto new ones (we do this periodically, but more so lately, and it is an example of the regular maintenance needed as part of overall cyberinfrastructure maintenance for ALL projects we support). But again, we are not exactly sure what the issue is yet, and the amount of human troubleshooting time for ingests seems to be increasing. So I would say Ingest. I like that framing!
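One way to make the "some found, some not" check repeatable is a small reconciliation step that compares the deployment IDs reported by the source against those present after ingestion and logs the difference. The following is only a minimal sketch, assuming both sides can be reduced to collections of ID strings. The function name `find_missing_deployments`, the logger name, and the sample IDs are hypothetical and do not reflect the actual Wildlife Computers API or the ingestion system.

```python
import logging

logger = logging.getLogger("atn.ingest.reconcile")  # hypothetical logger name


def find_missing_deployments(source_ids, ingested_ids):
    """Return deployment IDs present at the source but absent after ingestion."""
    missing = sorted(set(source_ids) - set(ingested_ids))
    for deployment_id in missing:
        # One log line per missing deployment so gaps are searchable later.
        logger.warning(
            "deployment %s found at source but not ingested", deployment_id
        )
    return missing


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Hypothetical IDs for illustration only.
    source = ["dep-001", "dep-002", "dep-003"]
    ingested = ["dep-001", "dep-003"]
    print(find_missing_deployments(source, ingested))
```

Running a check like this on a schedule would turn an ad hoc manual comparison into a log that shows when a deployment first went missing.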
Who is requesting this?
Axiom
What is being requested?
Build out improved monitoring for ATN data types (platforms) and error logging to assist in troubleshooting.
What is the requested deadline and why?
No response
What is the current status quo (i.e., what happens if this does not get done)?
Over the past 4-6 months there have been issues related to the auto-ingestion pipeline (e.g. a small % of deployments not making their way through to the portal, active deployments stopping).
Through troubleshooting, we have found that sometimes the issue is due to missing metadata, and sometimes re-running the ingestion pipeline resolves it (though lately that doesn't always do the trick). Time spent troubleshooting the data ingestion pipeline for the ATN DAC has increased, and it's unclear what the cause of the issue is.
Building transparency through monitoring and better logging can help us become more efficient with ongoing maintenance of the ATN data ingestion pipeline.
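As an illustration of the kind of monitoring meant here, a check for active deployments that have stopped flowing could look roughly like the sketch below. It assumes the pipeline can report, per deployment, the timestamp of the newest ingested record. The function name `find_stale_deployments`, the input mapping, and the 7-day threshold are all hypothetical choices, not existing parts of the ATN pipeline.

```python
from datetime import datetime, timedelta, timezone


def find_stale_deployments(last_ingested, now, max_age=timedelta(days=7)):
    """Return IDs of deployments whose newest ingested record is older than max_age.

    last_ingested maps deployment ID -> timezone-aware datetime of the newest
    ingested record, or None if nothing has been ingested yet.
    The 7-day default is an assumed threshold for illustration.
    """
    stale = []
    for deployment_id, timestamp in last_ingested.items():
        # A deployment with no records at all is treated as stale.
        if timestamp is None or now - timestamp > max_age:
            stale.append(deployment_id)
    return sorted(stale)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical deployments for illustration only.
    sample = {
        "dep-001": now - timedelta(days=1),
        "dep-002": now - timedelta(days=30),
        "dep-003": None,
    }
    print(find_stale_deployments(sample, now))
```

A report like this would flag a stopped data stream within days rather than relying on someone noticing a gap in the portal.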
What indicates this is done (i.e., how do we know this is complete)?
Provide a description or any other important information.
No response