[Epic] Prod drill and ready for full prod (Fetch All Individual Stories) #45

Open
1 of 13 tasks
rivernews opened this issue Oct 2, 2022 · 0 comments

rivernews commented Oct 2, 2022

Follow-up from the previous issue, #25.

We can now scrape the stories from all 818 historical landings since August 2021, and the fetch rate is properly throttled. But the cost is now too high. What can we do?

  • Stronger feature
    • Fast track: disable change detection for now and de-dup stories instead. If a story's HTML is already stored, skip it. This will significantly speed up our first-time processing. See "Skip story fetch if S3 already exists" #41
      • But each story's S3 key is prefixed with the landing page timestamp, so we can't use S3 to check for existence. Instead we can check DDB, since we create story items there. s3Key won't work for the same reason; the story URL is a better key.
    • Detect change & censorship: a lot of stories are duplicates (the landing page is fetched every 12h and hasn't changed much in between). Do we still fetch them? Probably yes, but then we want to store every copy rather than let them overwrite each other.
  • Save $$
    • Can we move the "random wait" logic into an Sfn wait? Seems Sfn wait doesn't charge? YES! Looking at the Wait state API, we can use SecondsPath to point at the input field that holds the wait time. We just have to pre-compute the wait time and send it in the Sfn input. This adds one more step and so increases Sfn cost, but it will significantly cut Lambda compute time. Our #1 cost is indeed Lambda at $3.6, compared to Sfn at $0.23 and S3 at $0.1. We can probably make Lambda usage sparse. The Sfn cost will remain, but it's affordable.
    • Can we lower the function memory to save cost? Currently 128 MB is configured, with 4x MB used.
    • Let each story Sfn step fetch several stories, not just a single one. It definitely makes sense for one IP to access more than one.
  • Easier to debug
    • We might get banned by the Slack API, especially when an S3 batch copy triggers S3 events all at once. Can we stop showing its logs once it's stable? Or at least not log to Slack (but still log to CloudWatch)? See "Landing S3 trigger disable slack log & only log to cloudWatch" #40
    • Add the Lambda invocation ID (request ID) to the event description; it will help pinpoint the log.
    • Add an env tag to the log so we can tell, especially in Slack, whether it's a prod or dev resource. See "Add env to lambda logger" #42
    • Can we optimize our log messages?
    • Better way to query "metadata processed" landing items? Or even better, a field lastEventName to query on (but then it could be similar to a scan)? Or just the opposite of isDocTypeWaitingForMetadata, like isDocTypeMetadataDone.
    • Better way to query all story items associated with a landing item?
    • Sfn map improvement: pre-determine the wait time and put it in the Sfn input so it's clearer.
  • Do all of the above for prod (ready!)
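A minimal sketch of two of the ideas above (de-dup by story URL, and pre-computing the wait time for the Sfn Wait state), assuming the Map input is built in Python. The function name `build_story_map_input`, the field names `storyUrl` and `waitSeconds`, the state name `FetchStory`, and the 1 to 30 second range are all hypothetical and not taken from this repo. The set of already-fetched URLs stands in for whatever DDB lookup would be used.

```python
import random

def build_story_map_input(story_urls, already_fetched_urls,
                          min_wait=1, max_wait=30, rng=None):
    """Build the input for the story Map state (hypothetical schema).

    - Skips URLs that were already fetched (stand-in for a DDB check)
      and URLs repeated within the same landing page.
    - Attaches a pre-computed integer wait time to each remaining story,
      so the Lambda no longer has to sleep on its own.
    """
    rng = rng or random.Random()
    seen = set(already_fetched_urls)
    items = []
    for url in story_urls:
        if url in seen:
            continue  # de-dup on story URL, not on the S3 key
        seen.add(url)
        items.append({
            "storyUrl": url,
            # The Wait state needs an integer number of seconds.
            "waitSeconds": rng.randint(min_wait, max_wait),
        })
    return {"stories": items}

# Wait state (Amazon States Language, written as a Python dict) placed
# before the fetch step inside the Map iterator. SecondsPath reads the
# wait time from each item's input. "FetchStory" is an assumed state name.
WAIT_STATE = {
    "Type": "Wait",
    "SecondsPath": "$.waitSeconds",
    "Next": "FetchStory",
}
```

With this shape, the wait happens in the state machine rather than inside the function, which is the saving described in the "Save $$" bullet; the extra Wait step is the added Sfn cost.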
rivernews changed the title from "[Epic] Prod drill and ready for full prod" to "[Epic] Prod drill and ready for full prod (Fetch All Individual Stories)" on Oct 2, 2022