
Fetch individual story pages #15

Closed · 8 of 19 tasks
rivernews opened this issue Aug 21, 2021 · 6 comments

rivernews commented Aug 21, 2021

Plan

(image attached)

Trigger

Synchronous Logic as Sfn

  • Abandon SQS → Lambda processing because polling is too expensive
  • Use Sfn Map type to batch scrape website #21
  • Story Scraper Kick off Lambda #18
    • What's the Go way to reuse the S3 client, etc.? A singleton? (see the sketch after this list)
      • Opt 1: Instantiate the session in main() and pass it downstream
      • Opt 2: Assign a global variable, but be aware of thread safety; people mention using sync.Once and then once.Do
      • Opt 3: Some mention func init(); there's an AWS doc example. What are its pros and cons?
    • Pull HTML from S3 (S3 utilities)
    • Extract all links from the HTML
    • Multiplex links into different MessageGroupIds
    • Use send_message()'s incremental, randomized delay instead of sleeping in the consumer, so we don't get billed for Lambda execution time spent sleeping (send_message() can delay up to 15 minutes). Randomize the interval; see the delay sketch at the end of this comment.
    • De-dup implementation: for now, just check whether the story already exists
    • Fetch & archive the story!

SQS

  • FIFO, potentially unlimited Lambda concurrency

Story Parser lambda

  • Fetch page into RAM
  • Store into DynamoDB (optional): we understand that every read/write to S3 costs money, but there's no reason to add data-modeling complexity at this stage, especially when we have very limited time. Let's skip the DynamoDB part and just use S3.
  • De-dup: it might be necessary to de-dup by story content, but a simple story-URL de-dup can be enough. Only if we want to do censorship tracing would we need an MD5 checksum, and only after data modeling: raw HTML contains noise like ads that change frequently, making MD5 useless, whereas sanitized, normalized data is better for de-dup. But that's more fine-tuned and takes more time to implement, so let's de-prioritize it for now.
    • Phase I: Just do a hard check and stop if already scraped
    • Phase II: Determine several attributes, then do an MD5 hash to determine duplication
  • Lastly, archive if not duplicated
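A minimal sketch of Opt 2 (package-level client guarded by sync.Once), assuming aws-sdk-go v1 and aws-lambda-go; getS3Client and the handler shown here are hypothetical, not existing repo code:

```go
package main

import (
	"sync"

	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

var (
	s3Client *s3.S3
	initOnce sync.Once
)

// getS3Client builds the shared S3 client exactly once,
// so warm Lambda invocations reuse the same session.
func getS3Client() *s3.S3 {
	initOnce.Do(func() {
		sess := session.Must(session.NewSession())
		s3Client = s3.New(sess)
	})
	return s3Client
}

func handler() error {
	client := getS3Client()
	_ = client // use the client for GetObject / PutObject calls
	return nil
}

func main() {
	lambda.Start(handler)
}
```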

Additional Lambdas, like aggregation, will need to join all the concurrent processing, e.g. for a word cloud.
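A rough sketch of the incremental, randomized delay idea with aws-sdk-go's SendMessage and DelaySeconds (max 900 seconds); the queue URL and spacing constants are made up. One caveat: per-message DelaySeconds only applies to standard queues, while FIFO queues only honor a queue-level delay.

```go
package main

import (
	"fmt"
	"math/rand"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

// enqueueStoryLinks spaces messages out with an incremental, randomized
// delay so the consumer Lambda never has to sleep (and bill) for pacing.
func enqueueStoryLinks(client *sqs.SQS, queueURL string, links []string) error {
	const baseDelaySeconds = 10 // hypothetical spacing between messages
	for i, link := range links {
		// incremental delay plus up to 5 seconds of jitter, capped at SQS's 900s max
		delay := int64(i*baseDelaySeconds) + rand.Int63n(5)
		if delay > 900 {
			delay = 900
		}
		_, err := client.SendMessage(&sqs.SendMessageInput{
			QueueUrl:     aws.String(queueURL),
			MessageBody:  aws.String(link),
			DelaySeconds: aws.Int64(delay),
		})
		if err != nil {
			return fmt.Errorf("send message for %s: %w", link, err)
		}
	}
	return nil
}

func main() {
	sess := session.Must(session.NewSession())
	_ = enqueueStoryLinks(sqs.New(sess), "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue", nil)
}
```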

rivernews commented

Short term advancement

  • Create a reusable scraper module at the Go func level (not the Lambda level).

    • scrape_base.go to start off: archive + parser
    • So we are prepared for scraping stories from landing pages.
  • Clean up and amend issues

Modularizing scraping based on a "scrape pattern"

(image attached)
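A rough sketch of what a "scrape pattern" module at the Go func level could look like; the interface and field names here are hypothetical, not what scrape_base.go actually defines:

```go
package scraper

// Story is one extracted story link; fields are illustrative only.
type Story struct {
	Title string
	URL   string
}

// ScrapePattern describes one kind of page (landing page, story page, ...)
// so the same archive + parse pipeline can be reused across them.
type ScrapePattern interface {
	// ArchiveKey decides where the raw HTML should live in S3.
	ArchiveKey(pageURL string) string
	// Parse extracts whatever this pattern cares about from the raw HTML.
	Parse(html string) ([]Story, error)
}

// Scrape is the shared entry point: fetch, archive, then parse.
// fetch and archive are stand-ins for the real HTTP and S3 helpers.
func Scrape(p ScrapePattern, pageURL string, fetch func(string) (string, error), archive func(key, html string) error) ([]Story, error) {
	html, err := fetch(pageURL)
	if err != nil {
		return nil, err
	}
	if err := archive(p.ArchiveKey(pageURL), html); err != nil {
		return nil, err
	}
	return p.Parse(html)
}
```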

rivernews commented Aug 18, 2022

All of the above is valuable, but it could be over-complicated at this point.

What we want to do is just download stories, i.e. just fetch the HTML. No parsing of anything, so actually no scraping of the story.

Now, where should the fetch logic be placed?

  1. The most optimized way is to integrate with the landing page scraper. You are already scraping links there and posting to Slack; why not fetch the stories there too?
  2. The modularized way is to only read the landing page HTML from S3, almost like working offline. You do have to repeat what's done in the landing page scraper though (the extract-story-links part), which is a lot of duplicate logic indeed.

🍓 It seems that option 1 is better. We want to be careful here (or at least previously we wanted to be), because later we could end up writing some of the same logic again once we decide to scrape story pages. But look, the landing page and story page processes could be quite different:

  • (Fetch landing page)
  • Landing page scraping: we want the story links and titles. Input = URL = fixed news site homepage
  • (Fetch story pages)
  • Story page scraping: we want the story's main text content. Input = URL = story links

Yes, the input part is the same (both are URLs), but the scraping goal is very different.

We were previously exploring a single SQS pipeline to handle both processes. That's where we started thinking through the practical details, and started 🍓 feeling that option 2 may actually be better:

  • We want concurrent processing, with the potential goal of switching IPs.
  • How do we organize story HTMLs in S3? By date? Later we may want to do a daily word cloud, so grouping stories together ahead of time could be useful. But then it'll be hard to de-dup, so the story HTML is better stored in its own independent location. Yet we already spent the time computing the story links associated with each landing page ➡️ maybe storing today's stories in a separate JSON is the better idea; that JSON can also store other landing page metadata (a rough sketch of such a struct follows this list). Problem solved!
  • We want to take care of all historically fetched landing pages. So it makes sense to use SQS and pipe an S3 directory over to let it flow, versus adding logic to the existing landing page scraper, in which case only newer landing pages' stories would start getting fetched.
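One hypothetical shape for that per-landing-page metadata JSON, sketched as a Go struct (field names are invented for illustration, not an existing schema):

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// LandingMetadata is a hypothetical metadata JSON stored next to landing.html.
type LandingMetadata struct {
	LandingPageURL string    `json:"landingPageUrl"`
	FetchedAt      time.Time `json:"fetchedAt"`
	S3Key          string    `json:"s3Key"`
	Stories        []struct {
		Title string `json:"title"`
		URL   string `json:"url"`
	} `json:"stories"`
}

func main() {
	meta := LandingMetadata{
		LandingPageURL: "https://example.com",
		FetchedAt:      time.Now().UTC(),
		S3Key:          "daily-headlines/2022-08-21T12:15:42Z/landing.html",
	}
	out, _ := json.MarshalIndent(meta, "", "  ")
	fmt.Println(string(out))
}
```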

rivernews commented Aug 21, 2022

Break it down more?

  1. Fetch landing page
  2. Parse landing page, store metadata (including story URLs) in JSON
  3. Fetch story pages

In this way, you can reuse the "fetching" logic for SQS.
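A minimal sketch of that reusable fetch step (plain HTTP GET plus an S3 put), assuming aws-sdk-go v1; the bucket, key, and function names are placeholders, not the repo's actual helpers:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// fetchAndArchive downloads a page and stores the raw HTML in S3.
// The same function can serve landing pages and story pages.
func fetchAndArchive(client *s3.S3, bucket, key, pageURL string) error {
	resp, err := http.Get(pageURL)
	if err != nil {
		return fmt.Errorf("fetch %s: %w", pageURL, err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}

	_, err = client.PutObject(&s3.PutObjectInput{
		Bucket:      aws.String(bucket),
		Key:         aws.String(key),
		Body:        bytes.NewReader(body),
		ContentType: aws.String("text/html"),
	})
	return err
}

func main() {
	sess := session.Must(session.NewSession())
	_ = fetchAndArchive(s3.New(sess), "media-literacy-archives", "daily-headlines/example/landing.html", "https://example.com")
}
```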

  1. Cronjob: send landing page URL to SQS
  2. SQS-Lambda: fetch the landing page, store it in S3
    • S3 dir s3://media-literacy-archives/{redacted}/daily-headlines/2022-08-21T12:15:42Z/landing.html
  3. S3 EventBridge: new files (HTML) created in the landing page S3 directory trigger the landing page scraper (see the handler sketch at the end of this comment)
    • How to determine the S3 dir?
  4. Lambda: scrape landing page:
    • extract story titles and links, store them in S3 as JSON
    • the landing page metadata JSON created in the S3 directory (later used for the word cloud, grouping stories by day / landing page) triggers the fetch-story logic
    • (read the JSON) send story URLs to SQS

In the future, we can create an S3 EventBridge trigger for scraping stories.
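A rough sketch of the S3-triggered side of step 3, using aws-lambda-go's events.S3Event; scrapeLandingPage is a placeholder for the real parsing logic:

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
)

// handler fires when new landing page HTML lands in the S3 directory,
// then hands each object off to the landing page scraper.
func handler(ctx context.Context, event events.S3Event) error {
	for _, record := range event.Records {
		bucket := record.S3.Bucket.Name
		key := record.S3.Object.Key
		log.Printf("new landing page: s3://%s/%s", bucket, key)

		// placeholder: extract story titles/links, write the metadata JSON,
		// and send story URLs to SQS
		if err := scrapeLandingPage(ctx, bucket, key); err != nil {
			return err
		}
	}
	return nil
}

// scrapeLandingPage is a stand-in for the real scraping logic.
func scrapeLandingPage(ctx context.Context, bucket, key string) error {
	return nil
}

func main() {
	lambda.Start(handler)
}
```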

rivernews commented Aug 21, 2022

Now that we've come up with a cloud-component plan, we need to implement it. Should we either

  • Modify existing, seeking the minimal change; OR
  • Start from scratch, reuse golang if necessary.

After rethinking it, instead of trying to land on an optimized solution, let's leave some redundancy. The Terraform code you want to leave there as a POC as well, so let's not change that. Just start adding on top of the existing stuff.

  1. Let's kick-start by reading through the S3 landing page dir {then set up an event for when S3 creates a new landing file}
  2. Parse the landing page (yes, even though we did this before, we only posted to Slack and did not preserve the outcome, so we have to do it over again)

rivernews commented Sep 17, 2022

Root pull request (now actually dev): https://github.com/rivernews/media-literacy/pull/28/files
We should probably create a separate PR for each specific issue.

Next steps
