Operations cleanup #3589

Merged: 2 commits, Jan 2, 2024

@dbutenhof dbutenhof commented Dec 25, 2023

This addresses several issues encountered while monitoring the migration of tarballs from the passthrough server backup directories to the new production server.

First, I've seen `PUT /upload` problems more frequently than anticipated, and when transferring thousands of tarballs the error details are easily buried, so I've improved the way they're captured and reported at the end. Also, having observed many of the NGINX HTML-format response messages, I decided to try scraping the `<title>` tag text with BeautifulSoup, since it seems to contain the real error message.
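A minimal sketch of the title-scraping idea, not the code in this PR. The change itself uses BeautifulSoup; to stay dependency-free, this version swaps in the standard library's `html.parser`. The names `TitleParser` and `error_summary` are hypothetical, and the fallback to the raw body is an assumption about what is useful when no `<title>` is present.

```python
from html.parser import HTMLParser
from typing import Optional


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element of a page."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self) -> Optional[str]:
        text = "".join(self._parts).strip()
        return text or None


def error_summary(body: str) -> str:
    """Return the <title> text of an HTML error page, else the raw body."""
    parser = TitleParser()
    parser.feed(body)
    parser.close()
    return parser.title or body.strip()
```

With BeautifulSoup, the same lookup is roughly `BeautifulSoup(body, "html.parser").title`, with a check for `None` when the response has no title.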

Second, I ran into a set of tarballs from 2020 whose `metadata.log` files don't contain `run.controller` values. These, it turns out, fall into a hole in intake processing. Without a `metadata.log` at all, we just ignore the problem and use a default "controller" of `unknown`, but if the file is present and that specific value is missing, we fail the upload entirely with a poorly worded error message. It makes more sense to treat a missing `run.controller` the same way as a missing `metadata.log`.
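The intended behavior can be sketched as follows. This is an illustration rather than the server's intake code: it assumes `metadata.log` is an INI-style file in which `run.controller` means the `controller` option of a `[run]` section, and the function name `controller_name` is hypothetical.

```python
import configparser
from typing import Optional

DEFAULT_CONTROLLER = "unknown"


def controller_name(metadata_text: Optional[str]) -> str:
    """Return the controller from metadata.log text, or the default.

    A missing file (None), a missing [run] section, and a missing or
    empty `controller` option are all treated the same way.
    """
    if metadata_text is None:
        return DEFAULT_CONTROLLER
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(metadata_text)
    # `fallback` covers both a missing section and a missing option.
    value = parser.get("run", "controller", fallback=DEFAULT_CONTROLLER)
    return value.strip() or DEFAULT_CONTROLLER
```

A syntactically malformed file would still raise a `configparser` error here; whether that should also fall back to the default is a separate decision.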

Third, I've seen indexing failures on large "batches" (trying to index thousands of datasets in one run of the indexer) that blow up with memory problems that don't reproduce. Although it's not obvious from glancing through the main indexer loop, it seems likely there's a memory leak somewhere that gradually builds up. Since I can't find it (and I'm on vacation, so I didn't look excessively hard), I took another approach I'd considered earlier anyway and reworked `Sync.update` to allow adding a SQL `LIMIT` to the query for `READY` datasets. This shouldn't have much impact on throughput, as the indexer is serial and restarts every minute if it's not already/still busy, but it may keep the memory buildup below the danger threshold.
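The batching idea, an optional `LIMIT` on the query that selects `READY` datasets, might look something like the sketch below. It uses the standard library's `sqlite3` purely so the example runs on its own; the `datasets` table, its columns, and the function names are hypothetical and do not reflect the server's actual schema or the `Sync` class.

```python
import sqlite3
from typing import List, Optional


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory database with a hypothetical `datasets` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE datasets ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, state TEXT)"
    )
    rows = [(f"ds{i}", "READY") for i in range(5)] + [("old", "DONE")]
    conn.executemany("INSERT INTO datasets (name, state) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def ready_datasets(
    conn: sqlite3.Connection, limit: Optional[int] = None
) -> List[str]:
    """Return names of READY datasets, oldest first, capped at `limit`."""
    sql = "SELECT name FROM datasets WHERE state = 'READY' ORDER BY id"
    params: tuple = ()
    if limit is not None:
        # Cap the batch so a single indexer run handles a bounded workload.
        sql += " LIMIT ?"
        params = (limit,)
    return [row[0] for row in conn.execute(sql, params)]
```

Because each run marks the datasets it finishes as no longer `READY`, the next run picks up the following batch, so a cap mostly trades one long run for several short ones.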

Only the migration utility changes have actually been tested "live", but the tests run.

@dbutenhof dbutenhof added the Server, Contrib, Indexing, API, Database, and Operations labels Dec 25, 2023
@dbutenhof dbutenhof requested a review from webbnh December 25, 2023 02:17
@dbutenhof dbutenhof self-assigned this Dec 25, 2023
webbnh

This comment was marked as resolved.


@webbnh webbnh left a comment

Looks great!

@dbutenhof dbutenhof merged commit 010037d into distributed-system-analysis:main Jan 2, 2024
4 checks passed
@dbutenhof dbutenhof deleted the bigindex branch January 2, 2024 23:35
webbnh pushed a commit that referenced this pull request Jan 4, 2024