From last issue #25
We're now able to scrape stories for all 818 historical landings since Aug 2021, and the fetch rate is properly throttled. But now the cost is too high. What can we do?
Stronger feature
Fast track: disable change detection for now. Can we de-dup stories for now? If a story's HTML is already stored, skip it. That will significantly speed up our first-time processing. Skip story fetch if S3 already exists #41
But since each story's S3 key is prefixed with the landing page timestamp, we can't use S3 existence to check; instead we can check in DDB, since we create story items there. s3Key won't work for the same reason; the story URL is a better de-dup key.
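A rough sketch of that existence check, assuming a GSI keyed on the story URL (the table, index, and attribute names here are just placeholders):

```typescript
// Sketch: skip the story fetch when a story item with the same URL already exists in DDB.
// Assumes a GSI keyed on a `storyUrl` attribute (index/attribute names are placeholders).
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

export const storyAlreadyFetched = async (storyUrl: string): Promise<boolean> => {
  const result = await ddb.send(
    new QueryCommand({
      TableName: process.env.TABLE_NAME,
      IndexName: "storyUrl-index",
      KeyConditionExpression: "storyUrl = :url",
      ExpressionAttributeValues: { ":url": storyUrl },
      Limit: 1, // we only care whether at least one item exists
    })
  );
  return (result.Count ?? 0) > 0;
};
```

The story fetch Lambda could then return early when this comes back true, so a first full run only pays for stories it has never seen.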
Detect change & censorship: a lot of stories are duplicates [because the landing page is fetched every 12h and hasn't changed much in between]. Do we still fetch them? I guess we'd better, but then we want to store all versions rather than overwrite each other.
Save $$
Can we move the "random wait" logic into an Sfn Wait state? It seems Sfn waits don't charge. YES! Looking at the Wait state API, we can use SecondsPath to specify the field it should wait on. We just have to pre-compute the wait time and pass it in the Sfn input. While this increases Sfn cost by adding one more state, it will significantly cut Lambda compute time. Indeed our #1 cost is Lambda at $3.6, compared to Sfn at $0.23 and S3 at $0.1. We can probably make the Lambda usage sparse; the Sfn cost will remain, but it's affordable.
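A minimal CDK sketch of that Wait state, assuming the pre-computed delay arrives in the execution input as waitSeconds (the field name and construct names are just placeholders):

```typescript
// Sketch: replace the in-Lambda random sleep with a Step Functions Wait state
// that reads the pre-computed delay from the execution input via SecondsPath,
// so Lambda no longer burns billed milliseconds just sleeping.
import * as sfn from "aws-cdk-lib/aws-stepfunctions";
import { Construct } from "constructs";

export const withRandomWait = (scope: Construct, fetchStoryTask: sfn.IChainable): sfn.Chain => {
  const wait = new sfn.Wait(scope, "RandomWaitBeforeStoryFetch", {
    // Reads e.g. { "waitSeconds": 37 } from the state input
    time: sfn.WaitTime.secondsPath("$.waitSeconds"),
  });
  return wait.next(fetchStoryTask);
};
```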
Can we lower the function memory to save cost? Currently 128 MB is provisioned and only ~40 MB is used.
Let the story Sfn step fetch multiple stories, not just a single one. It definitely makes sense for one IP to access more than one story.
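One way is to feed the Map state small batches instead of single stories; a rough sketch of chunking the URLs when building the Sfn input (batch size and field names are placeholders):

```typescript
// Sketch: group story URLs into small batches so each Map iteration
// (one Lambda invocation, one outbound IP) fetches several stories.
const chunk = <T>(items: T[], size: number): T[][] => {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
};

const storyUrls: string[] = [
  // ...story URLs parsed from the landing page...
];

// e.g. 5 stories per Map iteration instead of 1
const mapInput = chunk(storyUrls, 5).map((urls) => ({ storyUrls: urls }));
```

The iterator Lambda then loops over its storyUrls sequentially, still pausing between requests, so the per-IP request rate stays throttled.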
Easier to debug
We might get rate-limited or banned by the Slack API, especially when an S3 batch copy triggers all the S3 events at once. Can we stop logging from that trigger once it's stable? Or at least not log to Slack (but still log to CloudWatch)? Landing S3 trigger disable slack log & only log to cloudWatch #40
Add the Lambda invocation ID (request ID) to the event description; it will help pin down the log.
Add an env tag in the log, so we can tell, especially in Slack, whether it's a prod or dev resource. Add env to lambda logger #42
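The three logging items above could share one small wrapper; a rough sketch assuming the env name and a Slack webhook URL come from environment variables (all names here are placeholders, not the actual logger):

```typescript
// Sketch: one logger that tags each message with the deploy environment,
// includes the Lambda request ID, and only mirrors to Slack when asked,
// so noisy-but-stable triggers can stay CloudWatch-only.
import type { Context } from "aws-lambda";

interface LogOptions {
  sendToSlack?: boolean; // default false: stdout/CloudWatch only
}

export const log = async (message: string, context: Context, options: LogOptions = {}) => {
  const line = `[${process.env.DEPLOY_ENV ?? "unknown-env"}] [${context.awsRequestId}] ${message}`;

  // console.log always lands in CloudWatch Logs
  console.log(line);

  // Only ping Slack when explicitly enabled (e.g. not for the landing S3 trigger)
  if (options.sendToSlack && process.env.SLACK_WEBHOOK_URL) {
    await fetch(process.env.SLACK_WEBHOOK_URL, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text: line }),
    });
  }
};
```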
Can we optimize our log messages?
Better way to query "metadata processed" landing items? Or even better, a lastEventName field to query on (but then it could end up similar to a scan)? Or just the opposite of isDocTypeWaitingForMetadata, like isDocTypeMetadataDone.
Better way to query all story items associated with a landing item?
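A rough sketch of both query patterns, assuming a sparse GSI on the isDocTypeMetadataDone flag and another GSI keyed by the parent landing item (table, index, and attribute names are placeholders):

```typescript
// Sketch: two query patterns to avoid a full table scan.
// Assumes placeholder GSIs: a sparse index on `isDocTypeMetadataDone`
// (stored as the string "true" so it can be an index key), and an index
// keyed by `landingId` on story items.
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// 1. Landing items whose metadata processing is done; only items that set the
//    flag appear in the sparse index, so the query stays small.
export const queryMetadataDoneLandings = () =>
  ddb.send(
    new QueryCommand({
      TableName: process.env.TABLE_NAME,
      IndexName: "isDocTypeMetadataDone-index",
      KeyConditionExpression: "isDocTypeMetadataDone = :done",
      ExpressionAttributeValues: { ":done": "true" },
    })
  );

// 2. All story items that belong to one landing item.
export const queryStoriesOfLanding = (landingId: string) =>
  ddb.send(
    new QueryCommand({
      TableName: process.env.TABLE_NAME,
      IndexName: "landingId-index",
      KeyConditionExpression: "landingId = :id",
      ExpressionAttributeValues: { ":id": landingId },
    })
  );
```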
Sfn Map improvement: pre-determine the wait time and put it in the Sfn input so it's clearer.
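A rough sketch of pre-computing that wait per Map item when building the input, so the Wait state sketched earlier just reads it (field names and jitter values are placeholders):

```typescript
// Sketch: attach a pre-computed wait to each Map item, so the Wait state
// (SecondsPath) reads it directly and the delay is visible in the execution
// input instead of hidden inside Lambda code.
interface StoryMapItem {
  storyUrl: string;
  waitSeconds: number;
}

const buildMapInput = (storyUrls: string[]): StoryMapItem[] =>
  storyUrls.map((storyUrl, index) => ({
    storyUrl,
    // base stagger plus jitter; tune to the throttling policy
    waitSeconds: index * 10 + Math.floor(Math.random() * 20),
  }));
```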
Do the same as above for prod (ready!).