
Backend Scraper #87

Merged: 24 commits from add-backend into master, Mar 21, 2020
Conversation

@bloudermilk (Member)

Re #40

@bloudermilk (Member Author)

@leventov heads up, I double-reverted your last commit so we can work on it in this PR before merging to master. I'll use a PR review to share my notes!

@bloudermilk linked an issue on Feb 24, 2020 that may be closed by this pull request

@bloudermilk (Member Author) left a comment


@leventov provided my initial thoughts on this implementation inline, including a few questions for you. I was really impressed with the level of documentation you provided, especially the "decisions" log, which could be really helpful down the road when new contributors join.

backend/doc/decisions/5-use-yarn.md (resolved)
backend/doc/decisions/6-use-fastify-server.md (outdated, resolved)

const { pgBossQueue } = require('./pg');

fastify.route({
@bloudermilk (Member Author)

Can you elaborate on how you were planning to trigger this job? I had envisioned this working w/o any kind of web server to trigger jobs. For example, using a "cron job" (but the Heroku equivalent) to queue a job every N minutes.

@leventov (Collaborator)

The Netlify hook should call this endpoint. Even a cron job (which will kick off periodic batch re-scraping of all organisations' data, for example) will need to call some simplistic web endpoint that just puts a job into a queue.

I don't see how it could be more direct. Push jobs directly to PostgreSQL via a query? graphile/worker supports this, but pg-boss doesn't (at least, it doesn't document it), so it's a slippery slope. But I also don't see this as a problem.

Even with a direct SQL query to create a job, you eventually need some basic auth/protection, I suppose, so you can't really avoid having a thin web layer.

Cron batch processing in Heroku can also be done by starting a dyno, which does the batch processing and finishes. But this won't work for first-time scraping of newly added orgs; there needs to be a worker listening in the background. And since we need this type of worker anyway, it can handle batch scraping, too.
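For illustration, here's a minimal sketch of that thin web layer, building on the fastify.route context above; the route path, the shared-secret check, the queue name, and the pgBossQueue publish API are assumptions, not the actual implementation:

// Hypothetical sketch: a fastify endpoint whose only job is to enqueue work.
// Assumes ./pg exports a started pg-boss instance as pgBossQueue.
const fastify = require('fastify')();
const { pgBossQueue } = require('./pg');

fastify.route({
  method: 'POST',
  url: '/scrape', // illustrative path
  handler: async (request, reply) => {
    // Thin auth layer: a shared secret passed in a header (illustrative)
    if (request.headers['x-api-key'] !== process.env.SCRAPE_API_KEY) {
      reply.code(401);
      return { error: 'unauthorized' };
    }
    // pg-boss of this era enqueues with publish(name, data)
    await pgBossQueue.publish('scrape-organization', request.body);
    return { queued: true };
  },
});

fastify.listen(process.env.PORT || 3000);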

@bloudermilk (Member Author)

> will need to call some simplistic web endpoint which just puts a job into a queue.

The way we usually do this is to run a script within the project that has the job of simply queueing the jobs you want to run. To elaborate, the script would work like this:

$ npm run queue-jobs

  • Fetch organizations from Airtable that either A) have not been scraped or B) are "outdated"
  • For each organization
    • Queue a job to process that organization

The script runs in a couple seconds and is only responsible for adding the right jobs to the queue. The worker does the rest. This avoids needing the web interface, keeps the logic of which orgs to scrape internal to the app, and removes the dependency on Zapier.
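For illustration, a minimal sketch of such a queue-jobs script, assuming pg-boss's publish() API and the official airtable client; the 'Scraped At' field, the 7-day staleness window, and the queue name are hypothetical:

// Hypothetical sketch of `npm run queue-jobs`: enqueue work, then exit.
const PgBoss = require('pg-boss');
const Airtable = require('airtable');

async function main() {
  const boss = new PgBoss(process.env.DATABASE_URL);
  await boss.start();

  const base = new Airtable({ apiKey: process.env.AIRTABLE_API_KEY })
    .base(process.env.AIRTABLE_BASE_ID);

  // A) never scraped, or B) "outdated" (7 days is an illustrative cutoff)
  const records = await base('Organizations')
    .select({
      filterByFormula:
        "OR({Scraped At} = BLANK(), IS_BEFORE({Scraped At}, DATEADD(NOW(), -7, 'days')))",
    })
    .all();

  // Queue one job per organization; the background worker does the rest
  for (const record of records) {
    await boss.publish('scrape-organization', { organizationId: record.id });
  }

  await boss.stop();
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});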

const pgLocalConnectionString = `postgres://postgres:postgres@${host}:5432/postgres`;

// See https://stackoverflow.com/a/28489160
const isProduction = !!(process.env._ && (process.env._.indexOf('heroku') >= 0));
@bloudermilk (Member Author)

Any reason we can't just use NODE_ENV to find out if we're running in production?

@leventov (Collaborator)

I guess it requires extra setup by developers locally (export NODE_ENV=dev) and on Heroku (heroku config:set NODE_ENV="production"), imagining some developer wants to test the deployment in their own free Heroku account (as I do right now, but other developers may want to do this, too).

The line above doesn't require anything like that.

@bloudermilk (Member Author)

@leventov Heroku sets that variable by default, and I would assume we never set it locally. It's standard practice to use NODE_ENV in NodeJS apps.
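For reference, the conventional check is a one-liner; Heroku sets NODE_ENV=production for Node apps by default, and locally the variable is simply left unset:

// Anything other than 'production' (including unset) counts as development
const isProduction = process.env.NODE_ENV === 'production';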

backend/src/worker.js (outdated, resolved)
@bloudermilk (Member Author)

@leventov would be great to catch up this week on a call before you get too much further into this. I have some thoughts on the architecture that I'd like to discuss.

After thinking more about this over the last week, I'm leaning toward your original idea of pushing all data back to Airtable, mainly because you made a good point: some of these attributes we really want in Airtable anyway. It also avoids having to pull the data into Gatsby from a different source or maintain an API, and keeps everything viewable within Airtable.

I was thinking we could add new columns to "Organizations":

  • Twitter Followers - the number of Twitter followers
  • Enrichment Data - JSON blob of any other data that we don't need in the Airtable UI but want in Gatsby. For example, I'd like to wire up the Crunchbase ODM API

I was thinking we would have another queue for updating Airtable, which lets us leverage some great pg-boss features (rough sketch after this list):

  • rate limiting - so we can control the speed at which we hit the Airtable API (we're limited to 5 requests/second)
  • batching - the Airtable API lets us update several records in one call
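For illustration, a rough sketch of that updater; it approximates rate limiting by polling one batch per second, and the queue name, field names, and pg-boss fetch()/complete() usage are assumptions rather than the final design:

// Hypothetical sketch: drain the 'update-airtable' queue in batches.
// Airtable's update endpoint accepts up to 10 records per call, and one
// batched call per second stays well under the 5 requests/second limit.
const PgBoss = require('pg-boss');
const Airtable = require('airtable');

async function startAirtableUpdater() {
  const boss = new PgBoss(process.env.DATABASE_URL);
  await boss.start();

  const base = new Airtable({ apiKey: process.env.AIRTABLE_API_KEY })
    .base(process.env.AIRTABLE_BASE_ID);

  setInterval(async () => {
    const jobs = await boss.fetch('update-airtable', 10); // batch of up to 10
    if (!jobs) return; // nothing queued right now

    await base('Organizations').update(
      jobs.map((job) => ({
        id: job.data.organizationId,
        // e.g. { 'Twitter Followers': 1234, 'Enrichment Data': '{...}' }
        fields: job.data.fields,
      }))
    );

    await Promise.all(jobs.map((job) => boss.complete(job.id)));
  }, 1000);
}

startAirtableUpdater();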

@bloudermilk (Member Author)
@leventov looks like you've made a ton of progress over the last few days! Let me know when you're ready for me to dig in for another review :)

@leventov (Collaborator)
@bloudermilk I'm currently developing the first part of the cron pull: backing up the data from Airtable into the backend. (The next step is to see which backed-up organizations don't have corresponding entries in the scraping_results table, thus determining the organizations that require first-time scraping.)

I realized that there will be quite a few more tables in the backend database than I previously thought - probably we are going to eventually back up most (or all) of Airtable's tables (there are currently 11 of them) and model the relationships between them. So writing SQL by hand is not sustainable.

From raw SQL, I jumped to using the Sequelize ORM. I didn't research the topic and didn't consider any alternatives. That was a mistake: I've sunk quite a bit of time into learning Sequelize and looking for ways to work around its problems. I bumped into several, including sequelize/sequelize#11967, sequelize/sequelize#11968, and sequelize/sequelize#3759.

I think we should optimize the backend (and the Climatescape project in general) for enabling people to accomplish something useful quickly with a little time investment (e.g. if they just have a short-lived inspiration to contribute to the project). Therefore, I think using an ORM with a relatively steep learning curve is not optimal.

So I'm currently thinking about switching from Sequelize to either Knex.js (a query builder library, not an ORM) or Objection.js, which advertises itself as a lightweight ORM, or a "relational query builder".

What do you think? Do you have experience with either of these technologies?

@bloudermilk (Member Author)
@leventov thanks for explaining your thought process. I completely agree with this:

> I think we should optimize the backend (and the Climatescape project in general) for enabling people to accomplish something useful quickly with a little time investment

Personally I only have experience with Sequelize, and I agree it's not a great library. It's hard for me to say which of the other two is the better option without digging in, but here are my thoughts:

  • Knex seems to be much more popular (7X downloads, 2X GitHub stars)
  • Given our limited scope in the short term, a query builder is probably adequate
  • Objection, being built on Knex, should be easy to move toward down the line if we need more features

One thing I'm not clear on is why we need to mirror tables from Airtable at this time. It seems like, since we're pulling from the Airtable API on each run, we could just put that data in the queue?

Also, I wanted to clarify one point that we only talked about briefly. You'll see we have a lot of tables in Airtable for enrichment right now, but that's mainly due to Airtable limitations and our current workflow. For the enrichment output from the backend, I think we would be much better off writing to columns directly on Organizations.

@leventov (Collaborator) commented Mar 3, 2020

@bloudermilk thanks for the feedback. I'll proceed with using Knex.js.

> One thing I'm not clear on is why we need to mirror tables from Airtable at this time. It seems like, since we're pulling from the Airtable API on each run, we could just put that data in the queue?

I conceived the first-time scraping kickoff as follows:

  1. Back up all organizations from Airtable to Postgres, into an organizations table with the following schema:
    • id - Airtable's org ID
    • fields - JSONB, all of Airtable's textual fields, corresponding to fields in the Airtable API response. Only textual fields, no images (yet)
    • created_at - the time the org was first backed up in Postgres
    • updated_at - the time the org was last updated in Postgres
    • deleted - we don't delete orgs in Postgres, only mark them as deleted
  2. Find the orgs that need first-time scraping by joining the organizations and scraping_results tables, possibly taking into account the timestamp columns in both tables.

I think this approach is robust (e.g. in the face of pg-boss failures), and it doesn't require any extra fields in Airtable.

Additional benefits of having a backup of data in the backend:

  • Well, this is a backup - some sort of protection against data loss in Airtable
  • Maybe we can run some ad-hoc or periodic analysis on this data that Airtable's interface and/or API doesn't permit
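For illustration, here's roughly what step 1's table could look like as a Knex migration, along with the step-2 join; the scraping_results column names are assumptions:

// Hypothetical Knex migration for the organizations backup table
exports.up = (knex) =>
  knex.schema.createTable('organizations', (table) => {
    table.string('id').primary(); // Airtable's org ID
    table.jsonb('fields').notNullable(); // textual fields from the Airtable API
    table.timestamp('created_at').notNullable().defaultTo(knex.fn.now());
    table.timestamp('updated_at').notNullable().defaultTo(knex.fn.now());
    table.boolean('deleted').notNullable().defaultTo(false); // soft delete only
  });

exports.down = (knex) => knex.schema.dropTable('organizations');

The step-2 lookup is then a single left join:

// Orgs with no scraping_results row yet (org_id is a hypothetical column)
const orgsNeedingFirstScrape = (knex) =>
  knex('organizations')
    .leftJoin('scraping_results', 'organizations.id', 'scraping_results.org_id')
    .whereNull('scraping_results.org_id')
    .andWhere('organizations.deleted', false)
    .select('organizations.*');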

@bloudermilk (Member Author)
@leventov I'll defer to you on the decision to mirror Airtable since you're the one implementing it. I may have done it a different way, but that doesn't make it the "right way" by any means!

@leventov (Collaborator) commented Mar 9, 2020

@bloudermilk instead of a "Twitter Followers" column, I suggest adding "Climatescape Rank", which will equal the Twitter Followers count for now but could be changed in the future to account for other factors of an organization's importance. Twitter Followers per se could be stored inside the JSON in the "Enrichment Data" column.

@bloudermilk (Member Author)
@leventov agree with your suggestion. Maybe we can just call it "Rank"?

@leventov (Collaborator) commented Mar 9, 2020

> Maybe we can just call it "Rank"?

Sounds good

@bloudermilk (Member Author) left a comment

@leventov this looks really close! Would you be able to connect some time this week to walk me through everything? It's a lot of code and I just want to make sure I'm understanding it all.

.idea/inspectionProfiles/Project_Default.xml (resolved)
backend/.eslintrc.js (resolved)
backend/src/app.js (resolved)
@bloudermilk (Member Author)
@leventov I see the updates in Airtable, awesome! Should we move the Heroku app over to the Climatescape team or create a new app?

Is this ready to merge?

@leventov merged commit 73563ab into master on Mar 21, 2020
@leventov deleted the add-backend branch on March 21, 2020 at 07:15
@leventov (Collaborator)
@bloudermilk thanks for the review

Successfully merging this pull request may close these issues:

Epic: automated data scraping pipeline