[Epic] Prod drill and ready for full prod (Fetch All Individual Stories) #45

rivernews · 2022-10-02T06:37:49Z

From last issue #25

We're now able to scrape the stories of all 818 historical landings since August 2021. The fetch rate is properly throttled, but the cost is now too high. What can we do?

Stronger feature
- Fast track: disable change detection for now. Can we de-dup stories for now? If a story's HTML is already stored, skip it. This will significantly speed up our first-time processing. Skip story fetch if S3 already exists #41
  - But since each S3 story key is prefixed with the landing page timestamp, we can't use S3 to check; instead we can check in DDB, since we create story items there. s3Key won't work for the same reason. The story URL is a better key.
- Detect change & censorship: a lot of stories are duplicates [because the landing page is fetched every 12h and hasn't changed much in between]... do we still fetch them? I guess we'd better, but we want to store them all, not let them overwrite each other.
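
A minimal sketch of the fast-track de-dup described above, keyed on story URL. `StoryIndex` is a hypothetical in-memory stand-in for the DDB story items; in the real pipeline `has_story` would presumably be a DDB lookup by story URL, which is an assumption and not something this issue specifies.

```python
from dataclasses import dataclass, field


@dataclass
class StoryIndex:
    """Hypothetical stand-in for the DDB table that holds story items."""
    seen_urls: set = field(default_factory=set)

    def has_story(self, story_url: str) -> bool:
        # Assumption: in production this would be a DDB lookup by story URL.
        return story_url in self.seen_urls

    def record_story(self, story_url: str) -> None:
        self.seen_urls.add(story_url)


def stories_to_fetch(story_urls, index):
    """Return URLs that still need fetching, in their original order.

    Skips URLs already recorded in the index and repeats within the batch.
    """
    pending = []
    queued = set()
    for url in story_urls:
        if index.has_story(url) or url in queued:
            continue
        queued.add(url)
        pending.append(url)
    return pending


if __name__ == "__main__":
    index = StoryIndex({"https://example.com/a"})
    batch = ["https://example.com/a", "https://example.com/b", "https://example.com/b"]
    print(stories_to_fetch(batch, index))
```

Keying on the URL rather than `s3Key` sidesteps the timestamp prefix problem noted above. Once change detection is turned back on, this skip would have to be relaxed so repeated fetches can be stored side by side.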
Save $$
- Can we move the "random wait" logic into an Sfn Wait state? Seems Sfn wait doesn't charge? YES! Looking at the Wait API, we can use SecondsPath to specify the input field that holds the wait duration. We now just have to pre-compute the wait time, then send it in the Sfn input! While this will increase Sfn cost by adding one more step, it will significantly save lambda compute time. Indeed our top cost is lambda, $3.6, compared to Sfn $0.23 and S3 $0.1. We can probably make the lambda sparse. The Sfn cost will remain, although it's affordable.
- Can we lower function memory to save cost? Currently 128 MB provisioned, used 4xMB.
- Let the story Sfn step fetch several stories, not just a single one. It definitely makes sense for one IP to access more.
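
A rough sketch of the pre-computed wait idea, assuming a Standard workflow (billed per state transition rather than for time spent waiting). The field names `storyUrl` and `waitSeconds`, the state names, the wait range, and the placeholder Lambda ARN are all hypothetical; only the `Wait` state type and its `SecondsPath` field come from the Amazon States Language.

```python
import json
import random


def build_story_inputs(story_urls, min_wait=1, max_wait=30, seed=None):
    """Attach a pre-computed integer wait (in seconds) to every story item."""
    rng = random.Random(seed)
    return [
        {"storyUrl": url, "waitSeconds": rng.randint(min_wait, max_wait)}
        for url in story_urls
    ]


# Per-story state machine fragment: wait first, then invoke the fetch lambda.
# SecondsPath must point at an integer in the state input.
WAIT_THEN_FETCH = {
    "StartAt": "RandomWait",
    "States": {
        "RandomWait": {
            "Type": "Wait",
            "SecondsPath": "$.waitSeconds",
            "Next": "FetchStory",
        },
        "FetchStory": {
            "Type": "Task",
            # Placeholder ARN, not the project's real function.
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:fetch-story",
            "End": True,
        },
    },
}


if __name__ == "__main__":
    print(json.dumps(build_story_inputs(["https://example.com/a"], seed=0), indent=2))
    print(json.dumps(WAIT_THEN_FETCH, indent=2))
```

With the wait carried in the input, the lambda no longer sleeps, so its billed duration should cover only the fetch itself. The extra Wait state adds one transition per story.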
Easier to debug
- We might get banned by the Slack API, especially when an S3 batch copy triggers S3 events all at once. Can we not show logs from it if it's stable? Or at least not log to Slack (but still log to CloudWatch). Landing S3 trigger disable slack log & only log to cloudWatch #40
- Add the lambda invocation id (request id) to the event description; it will help pin down the log.
- Add an env tag to logs, so we can tell, especially in Slack, whether it's a prod or dev resource. Add env to lambda logger #42
- Can we optimize our log messages?
- Better way to query "metadata processed" landing items? Or even better, a field lastEventName to query (but then it could be similar to a scan)? Or just the opposite of isDocTypeWaitingForMetadata, like isDocTypeMetadataDone.
- Better way to query all story items associated with a landing item?
- Sfn map improvement: pre-determine the wait time and put it in the Sfn input so it's clearer.
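
The logging items above could look roughly like this. It is a sketch only: the helper names, the `[env] [request id]` prefix format, and the `landing-s3-trigger` source label are hypothetical, and how the env value and request id are obtained is left to the caller.

```python
# Sources whose logs should stay in CloudWatch only (hypothetical label).
SLACK_MUTED_SOURCES = {"landing-s3-trigger"}


def tag_message(env: str, request_id: str, message: str) -> str:
    """Prefix a log message with the env tag and the lambda request id."""
    return f"[{env}] [{request_id}] {message}"


def should_post_to_slack(source: str) -> bool:
    """Muted sources are never forwarded to Slack; everything else is."""
    return source not in SLACK_MUTED_SOURCES


def emit(env, request_id, source, message, cloudwatch_sink, slack_sink):
    """Always write to CloudWatch; forward to Slack only for unmuted sources.

    The sinks are plain callables so the routing can be tested without
    touching either service.
    """
    line = tag_message(env, request_id, message)
    cloudwatch_sink(line)
    if should_post_to_slack(source):
        slack_sink(line)
    return line


if __name__ == "__main__":
    emit("dev", "req-123", "story-fetch", "fetched 1 story", print, print)
```

Keeping the mute list in one place means a burst of S3 events from a batch copy still leaves a CloudWatch trail without hitting Slack once per event.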

Do the same as above for prod (ready!)


rivernews changed the title ~~[Epic] Prod drill and ready for full prod~~ [Epic] Prod drill and ready for full prod (Fetch All Individual Stories) Oct 2, 2022

rivernews mentioned this issue Oct 2, 2022

Tip of Progress: Re-sync project status #22

Open

5 tasks
