
Fetch individual story pages #15

Closed · 8 of 19 tasks
rivernews opened this issue Aug 21, 2021 · 6 comments

rivernews commented Aug 21, 2021

Plan

(image attached)

Trigger

Synchronous Logic as Sfn

  • Abandon SQS → Lambda processing because polling is too expensive
  • Use Sfn Map type to batch scrape website #21
  • Story Scraper Kick off Lambda #18
    • What's the Go way to reuse the S3 client, etc.? A singleton? (see the sketch after this list)
      • Opt 1: Instantiate the session in main() and pass it downstream
      • Opt 2: Assign a global variable, but be aware of thread safety; people mention using sync.Once and then once.Do
      • Opt 3: Some mention func init(); there's an AWS doc example. What are its pros and cons?
    • Pull HTML from S3 (S3 utilities)
    • Extract all links from the HTML
    • Multiplex links into different MessageGroupIds
    • Use send_message()'s incremental, randomized delay instead of sleeping in the consumer, so we don't get billed for Lambda execution time spent sleeping (send_message() can delay up to 15 minutes). Randomize the interval; see the delay sketch at the end of this comment.
    • De-dup implementation: for now, just check whether the story already exists
    • Fetch & archive the story!

SQS

  • FIFO, potentially unlimited Lambda concurrency

Story Parser lambda

  • Fetch page into RAM
  • Store into DynamoDB (optional): we understand that every read/write to S3 costs money, but there's no reason to add data-modeling complexity at this stage, especially when we have very limited time. Let's skip the DynamoDB part and just use S3.
  • De-dup: it might be necessary to de-dup by story content, but a simple story-URL de-dup can be enough. Only if we want to do censorship tracing would we need an MD5 checksum, and only after data modeling: raw HTML contains noise like ads that change frequently, making MD5 useless, whereas sanitized, normalized data is better for de-dup. But that's more fine-tuned and takes more time to implement, so let's de-prioritize it for now.
    • Phase I: Just do a hard check and stop if already scraped
    • Phase II: Determine several attributes, then do an MD5 hash to determine duplication
  • Lastly, archive if not duplicated
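A minimal sketch of Opt 2 (package-level client guarded by sync.Once), assuming aws-sdk-go v1 and aws-lambda-go; getS3Client and the handler shown here are hypothetical, not existing repo code:

```go
package main

import (
	"sync"

	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

var (
	s3Client *s3.S3
	initOnce sync.Once
)

// getS3Client builds the shared S3 client exactly once,
// so warm Lambda invocations reuse the same session.
func getS3Client() *s3.S3 {
	initOnce.Do(func() {
		sess := session.Must(session.NewSession())
		s3Client = s3.New(sess)
	})
	return s3Client
}

func handler() error {
	client := getS3Client()
	_ = client // use the client for GetObject / PutObject calls
	return nil
}

func main() {
	lambda.Start(handler)
}
```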

Additional Lambdas, like aggregation, will need to join all the concurrent processing, e.g. for a word cloud.
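A rough sketch of the incremental, randomized delay idea with aws-sdk-go's SendMessage and DelaySeconds (max 900 seconds); the queue URL and spacing constants are made up. One caveat: per-message DelaySeconds only applies to standard queues, while FIFO queues only honor a queue-level delay.

```go
package main

import (
	"fmt"
	"math/rand"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

// enqueueStoryLinks spaces messages out with an incremental, randomized
// delay so the consumer Lambda never has to sleep (and bill) for pacing.
func enqueueStoryLinks(client *sqs.SQS, queueURL string, links []string) error {
	const baseDelaySeconds = 10 // hypothetical spacing between messages
	for i, link := range links {
		// incremental delay plus up to 5 seconds of jitter, capped at SQS's 900s max
		delay := int64(i*baseDelaySeconds) + rand.Int63n(5)
		if delay > 900 {
			delay = 900
		}
		_, err := client.SendMessage(&sqs.SendMessageInput{
			QueueUrl:     aws.String(queueURL),
			MessageBody:  aws.String(link),
			DelaySeconds: aws.Int64(delay),
		})
		if err != nil {
			return fmt.Errorf("send message for %s: %w", link, err)
		}
	}
	return nil
}

func main() {
	sess := session.Must(session.NewSession())
	_ = enqueueStoryLinks(sqs.New(sess), "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue", nil)
}
```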

rivernews commented

Short term advancement

  • Create a reusable scraper module at the Go func level (not the Lambda level).

    • scrape_base.go to start off: archive + parser
    • So we are prepared for scraping stories from landing pages.
  • Clean up and amend issues

Modularizing scraping based on a "scrape pattern"

(image attached)
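A rough sketch of what a "scrape pattern" module at the Go func level could look like; the interface and field names here are hypothetical, not what scrape_base.go actually defines:

```go
package scraper

// Story is one extracted story link; fields are illustrative only.
type Story struct {
	Title string
	URL   string
}

// ScrapePattern describes one kind of page (landing page, story page, ...)
// so the same archive + parse pipeline can be reused across them.
type ScrapePattern interface {
	// ArchiveKey decides where the raw HTML should live in S3.
	ArchiveKey(pageURL string) string
	// Parse extracts whatever this pattern cares about from the raw HTML.
	Parse(html string) ([]Story, error)
}

// Scrape is the shared entry point: fetch, archive, then parse.
// fetch and archive are stand-ins for the real HTTP and S3 helpers.
func Scrape(p ScrapePattern, pageURL string, fetch func(string) (string, error), archive func(key, html string) error) ([]Story, error) {
	html, err := fetch(pageURL)
	if err != nil {
		return nil, err
	}
	if err := archive(p.ArchiveKey(pageURL), html); err != nil {
		return nil, err
	}
	return p.Parse(html)
}
```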

rivernews commented Aug 18, 2022

All of the above is valuable, but it could be over-complicated at this point.

What we want to do is just download stories, i.e. just fetch the HTML. No parsing of anything, so actually no scraping of the story.

Now, where should the fetch logic be placed?

  1. The most optimized way is to integrate with the landing page scraper. You are already scraping links there and posting to Slack; why not fetch the stories there too?
  2. The modularized way is to only read the landing page HTML from S3, almost like working offline. You do have to repeat what's done in the landing page scraper though (the extract-story-links part), which is a lot of duplicate logic indeed.

🍓 It seems that option 1 is better. We want to be careful here (or at least previously we wanted to be), because later we could end up writing some of the same logic again once we decide to scrape story pages. But look, the landing page and story page processes could be quite different:

  • (Fetch landing page)
  • Landing page scraping: we want the story links and titles. Input = URL = fixed news site homepage
  • (Fetch story pages)
  • Story page scraping: we want the story's main text content. Input = URL = story links

Yes, the input part is the same (both are URLs), but the scraping goal is very different.

We were previously exploring a single SQS pipeline to handle both processes. That's where we started thinking through the practical details, and started 🍓 feeling that option 2 may actually be better:

  • We want concurrent processing, with the potential goal of switching IPs.
  • How do we organize story HTMLs in S3? By date? Later we may want to do a daily word cloud, so grouping stories together ahead of time could be useful. But then it'll be hard to de-dup, so the story HTML is better stored in its own independent location. Yet we already spent the time computing the story links associated with each landing page ➡️ maybe storing today's stories in a separate JSON is the better idea; that JSON can also store other landing page metadata (a rough sketch of such a struct follows this list). Problem solved!
  • We want to take care of all historically fetched landing pages. So it makes sense to use SQS and pipe an S3 directory over to let it flow, versus adding logic to the existing landing page scraper, in which case only newer landing pages' stories would start getting fetched.
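One hypothetical shape for that per-landing-page metadata JSON, sketched as a Go struct (field names are invented for illustration, not an existing schema):

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// LandingMetadata is a hypothetical metadata JSON stored next to landing.html.
type LandingMetadata struct {
	LandingPageURL string    `json:"landingPageUrl"`
	FetchedAt      time.Time `json:"fetchedAt"`
	S3Key          string    `json:"s3Key"`
	Stories        []struct {
		Title string `json:"title"`
		URL   string `json:"url"`
	} `json:"stories"`
}

func main() {
	meta := LandingMetadata{
		LandingPageURL: "https://example.com",
		FetchedAt:      time.Now().UTC(),
		S3Key:          "daily-headlines/2022-08-21T12:15:42Z/landing.html",
	}
	out, _ := json.MarshalIndent(meta, "", "  ")
	fmt.Println(string(out))
}
```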

rivernews commented Aug 21, 2022

Break it down more?

  1. Fetch landing page
  2. Parse landing page, store metadata (including story URLs) in JSON
  3. Fetch story pages

In this way, you can reuse the "fetching" logic for SQS.
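A minimal sketch of that reusable fetch step (plain HTTP GET plus an S3 put), assuming aws-sdk-go v1; the bucket, key, and function names are placeholders, not the repo's actual helpers:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// fetchAndArchive downloads a page and stores the raw HTML in S3.
// The same function can serve landing pages and story pages.
func fetchAndArchive(client *s3.S3, bucket, key, pageURL string) error {
	resp, err := http.Get(pageURL)
	if err != nil {
		return fmt.Errorf("fetch %s: %w", pageURL, err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}

	_, err = client.PutObject(&s3.PutObjectInput{
		Bucket:      aws.String(bucket),
		Key:         aws.String(key),
		Body:        bytes.NewReader(body),
		ContentType: aws.String("text/html"),
	})
	return err
}

func main() {
	sess := session.Must(session.NewSession())
	_ = fetchAndArchive(s3.New(sess), "media-literacy-archives", "daily-headlines/example/landing.html", "https://example.com")
}
```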

  1. Cronjob: send landing page URL to SQS
  2. SQS-Lambda: fetch the landing page, store it in S3
    • S3 dir s3://media-literacy-archives/{redacted}/daily-headlines/2022-08-21T12:15:42Z/landing.html
  3. S3 EventBridge: new files (HTML) created in the landing page S3 directory trigger the landing page scraper (see the handler sketch at the end of this comment)
    • How to determine the S3 dir?
  4. Lambda: scrape landing page:
    • extract story titles and links, store them in S3 as JSON
    • the landing page metadata JSON created in the S3 directory (later used for the word cloud, grouping stories by day / landing page) triggers the fetch-story logic
    • (read the JSON) send story URLs to SQS

In the future, we can create an S3 EventBridge trigger for scraping stories.
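A rough sketch of the S3-triggered side of step 3, using aws-lambda-go's events.S3Event; scrapeLandingPage is a placeholder for the real parsing logic:

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
)

// handler fires when new landing page HTML lands in the S3 directory,
// then hands each object off to the landing page scraper.
func handler(ctx context.Context, event events.S3Event) error {
	for _, record := range event.Records {
		bucket := record.S3.Bucket.Name
		key := record.S3.Object.Key
		log.Printf("new landing page: s3://%s/%s", bucket, key)

		// placeholder: extract story titles/links, write the metadata JSON,
		// and send story URLs to SQS
		if err := scrapeLandingPage(ctx, bucket, key); err != nil {
			return err
		}
	}
	return nil
}

// scrapeLandingPage is a stand-in for the real scraping logic.
func scrapeLandingPage(ctx context.Context, bucket, key string) error {
	return nil
}

func main() {
	lambda.Start(handler)
}
```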

rivernews commented Aug 21, 2022

Now that we've come up with a cloud-component plan, we need to implement it. Should we either

  • Modify existing, seeking the minimal change; OR
  • Start from scratch, reuse golang if necessary.

After rethinking it, instead of trying to land on an optimized solution, let's leave some redundancy. The Terraform code you want to leave there as a POC as well, so let's not change that. Just start adding on top of the existing stuff.

  1. Let's kick-start by reading through the S3 landing page dir {then set up an event for when S3 creates a new landing file}
  2. Parse the landing page (yes, even though we did this before, we only posted to Slack and did not preserve the outcome, so we have to do it over again)

rivernews commented Sep 17, 2022

Root pull request (now actually dev): https://github.com/rivernews/media-literacy/pull/28/files
We should probably create a separate PR for each specific issue.

Next steps
