Fetch individual story pages #15
All of the above are valuable, but could be over-complicated at this point. What we want to do is just download stories, just fetch the HTML. No parsing, so actually no scraping of the story yet. Now, where should the fetch logic be placed?
🍓 It seems that the 1st way is better. We want to be, or at least previously wanted to be, careful here, because later we could be repeating some of the same logic once we decide to scrape the story page. But look, the landing page and story page processes could be quite different:

Yes, the input part is the same: both are URLs. But the scraping goal is very different. We previously were exploring using a single SQS pipeline to handle both processes. That's where we started thinking in detail about the practical side, and started 🍓 feeling the 2nd way may actually be better:
Break it down more?
- In this way, you can reuse the "fetching" logic for SQS.
- In the future, we can create an S3-event bridge for scraping the story.
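To make that split concrete, here is a minimal sketch (not the repo's actual code) of what a fetch-only Lambda could look like in Go, assuming the SQS message body is just the story URL and a hypothetical `STORY_BUCKET` environment variable names the bucket; it only downloads the HTML and writes the raw bytes to S3, with no parsing:

```go
package main

import (
	"bytes"
	"context"
	"crypto/sha1"
	"fmt"
	"io"
	"net/http"
	"os"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// handler consumes SQS messages whose body is a story URL, fetches the raw
// HTML, and writes it to S3 untouched; no parsing or scraping at this stage.
func handler(ctx context.Context, sqsEvent events.SQSEvent) error {
	sess := session.Must(session.NewSession())
	s3Client := s3.New(sess)
	bucket := os.Getenv("STORY_BUCKET") // hypothetical env var

	for _, record := range sqsEvent.Records {
		url := record.Body

		resp, err := http.Get(url)
		if err != nil {
			return fmt.Errorf("fetch %s: %w", url, err)
		}
		html, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil {
			return fmt.Errorf("read %s: %w", url, err)
		}

		// Key the object deterministically by URL so a re-fetch overwrites in place.
		key := fmt.Sprintf("stories/%x.html", sha1.Sum([]byte(url)))
		if _, err := s3Client.PutObject(&s3.PutObjectInput{
			Bucket:      aws.String(bucket),
			Key:         aws.String(key),
			Body:        bytes.NewReader(html),
			ContentType: aws.String("text/html"),
		}); err != nil {
			return fmt.Errorf("put %s: %w", key, err)
		}
	}
	return nil
}

func main() {
	lambda.Start(handler)
}
```

Because the raw HTML lands in S3, a future story-scraping Lambda could then be attached via the bucket's S3 event notifications without touching this fetcher.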
Now that we have come up with a cloud-component plan, we need to implement it. Should we either...
After rethinking it, instead of trying to land on an optimized solution, let's leave some redundancy. The existing Terraform code you want to leave there for the POC as well, so let's not change that. Just start adding on top of the existing stuff.
Root pull request (now actually dev): https://github.com/rivernews/media-literacy/pull/28/files

Next steps
Plan

Trigger
- Synchronous logic as Sfn
- One-time setup: in `main()`, and pass it downstream; `sync.Once` then `once.Do`; or `func init()` (see the first sketch after this list)
- Use `messageGroupId`, here's an aws doc example. What's its pros and cons?
- Randomize interval: use `send_message()`'s accumulative/incremental, randomized `delay` instead of sleeping in the consumer, so we don't get billed the Lambda exec time for sleeping (SQS `send_message()` can delay up to 15 minutes; see the second sketch after this list)
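As a minimal sketch of the `sync.Once` / `once.Do` option (the shared resource here is an S3 client purely for illustration; initializing in `main()` or `func init()` would do the same setup, just earlier and unconditionally):

```go
package main

import (
	"context"
	"sync"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

var (
	setupOnce sync.Once
	s3Client  *s3.S3
)

// setup runs exactly once per Lambda container: cold starts pay the cost on
// the first invocation, warm invocations reuse the same client.
func setup() {
	sess := session.Must(session.NewSession())
	s3Client = s3.New(sess)
}

func handler(ctx context.Context, event events.SQSEvent) error {
	setupOnce.Do(setup)
	// ... use s3Client for the fetch/store work here ...
	return nil
}

func main() {
	lambda.Start(handler)
}
```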
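And a hedged sketch of the randomize-interval idea using the Go SDK equivalent of `send_message()`; the queue URL env var and the 5-14s step are made-up values. One con to weigh against `messageGroupId`: per-message `DelaySeconds` works only on standard queues, while FIFO queues (where `messageGroupId` applies) support delay only at the queue level.

```go
package main

import (
	"fmt"
	"math/rand"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

// enqueueStories spreads fetches out by giving each message an accumulating,
// randomized DelaySeconds, instead of sleeping in the consumer Lambda where
// the wait would be billed as execution time.
func enqueueStories(urls []string) error {
	sess := session.Must(session.NewSession())
	client := sqs.New(sess)
	queueURL := os.Getenv("STORY_QUEUE_URL") // hypothetical env var

	var delay int64
	for _, url := range urls {
		delay += 5 + rand.Int63n(10) // add a randomized 5-14s step per message
		if delay > 900 {
			delay = 900 // SQS caps DelaySeconds at 900s (15 minutes)
		}
		if _, err := client.SendMessage(&sqs.SendMessageInput{
			QueueUrl:     aws.String(queueURL),
			MessageBody:  aws.String(url),
			DelaySeconds: aws.Int64(delay),
		}); err != nil {
			return fmt.Errorf("enqueue %s: %w", url, err)
		}
	}
	return nil
}

func main() {
	urls := []string{"https://example.com/story/1", "https://example.com/story/2"} // placeholder URLs
	if err := enqueueStories(urls); err != nil {
		panic(err)
	}
}
```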
Story Parser lambda
- Fetch page into RAM
- Store into DynamoDB (optional): we understand that every read/write to S3 costs $$, but there's no reason to add data-modeling complexity at this stage, especially when we have very limited time. Let's skip the DynamoDB part. Just do the S3.
- De-dup: it might not be necessary to de-dup by story content; simply doing a story URL de-dup can be enough. Only if we want to do censorship tracing could we do an MD5 checksum, and only after we did data modeling. Raw HTML contains noise like ads that changes frequently, making MD5 useless; sanitized, normalized data will be better for de-dup. But it's more fine-tuned and takes more time to implement. Let's de-prioritize this for now.
  - Phase I: Just do a hard check and stop if already scraped (see the sketch after this list)
  - Phase II: Determine several attributes, then do an MD5 hash to determine duplication
- Additional lambdas, like an aggregation that needs to join all concurrent processes: word cloud, etc.
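For the Phase I hard check, a small sketch assuming stories are keyed deterministically by URL as in the fetch sketch above: a `HeadObject` on the would-be key tells us whether the story was already scraped, so we can stop before fetching again. The bucket and key names here are hypothetical.

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// alreadyScraped is the Phase I hard check: if an object already exists under
// the story's deterministic key, skip re-fetching. This is a pure URL de-dup,
// with no content hashing yet.
func alreadyScraped(client *s3.S3, bucket, key string) (bool, error) {
	_, err := client.HeadObject(&s3.HeadObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err == nil {
		return true, nil // object exists, story was scraped before
	}
	// HeadObject reports a missing key as a "NotFound" error code.
	if aerr, ok := err.(awserr.Error); ok && aerr.Code() == "NotFound" {
		return false, nil
	}
	return false, fmt.Errorf("head %s/%s: %w", bucket, key, err)
}

func main() {
	client := s3.New(session.Must(session.NewSession()))
	exists, err := alreadyScraped(client, "my-story-bucket", "stories/abc123.html") // hypothetical names
	if err != nil {
		panic(err)
	}
	fmt.Println("already scraped:", exists)
}
```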