
Backend Scraper #87

Merged: 24 commits from add-backend into master, Mar 21, 2020
Conversation

@bloudermilk (Member)

Re #40

@bloudermilk (Member Author)

@leventov heads up, I double-reverted your last commit so we can work on it in this PR before merging to master. I'll use a PR review to share my notes!

@bloudermilk linked an issue on Feb 24, 2020 that may be closed by this pull request

@bloudermilk (Member Author) left a comment


@leventov provided my initial thoughts on this implementation inline, including a few questions for you. I was really impressed with the level of documentation you provided, especially the "decisions" log, which could be really helpful down the road when new contributors join.

backend/doc/decisions/5-use-yarn.md (resolved)
backend/doc/decisions/6-use-fastify-server.md (outdated, resolved)

const { pgBossQueue } = require('./pg');

fastify.route({
@bloudermilk (Member Author)

Can you elaborate on how you were planning to trigger this job? I had envisioned this working w/o any kind of web server to trigger jobs. For example, using a "cron job" (but the Heroku equivalent) to queue a job every N minutes.

@leventov (Collaborator)

The Netlify hook should call this endpoint. Even a cron job (which will kick off periodic batch re-scraping of all organisations' data, for example) will need to call some simplistic web endpoint that just puts a job into a queue.

I don't see how it could be more direct. Push jobs directly to PostgreSQL via a query? graphile/worker supports this, but pg-boss doesn't (at least, it doesn't document it), so it's a slippery slope. But I also don't see this as a problem.

Even with a direct SQL query to create a job, you eventually need some basic auth/protection, I suppose, so you can't really avoid having a thin web layer.

Cron batch processing in Heroku can also be done by starting a dyno, which does the batch processing and finishes. But this won't work for first-time scraping of newly added orgs; there needs to be a worker listening in the background. And since we need this type of worker anyway, it can handle batch scraping, too.
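For illustration, here's a minimal sketch of that thin web layer, building on the fastify.route context above; the route path, the shared-secret check, the queue name, and the pgBossQueue publish API are assumptions, not the actual implementation:

// Hypothetical sketch: a fastify endpoint whose only job is to enqueue work.
// Assumes ./pg exports a started pg-boss instance as pgBossQueue.
const fastify = require('fastify')();
const { pgBossQueue } = require('./pg');

fastify.route({
  method: 'POST',
  url: '/scrape', // illustrative path
  handler: async (request, reply) => {
    // Thin auth layer: a shared secret passed in a header (illustrative)
    if (request.headers['x-api-key'] !== process.env.SCRAPE_API_KEY) {
      reply.code(401);
      return { error: 'unauthorized' };
    }
    // pg-boss of this era enqueues with publish(name, data)
    await pgBossQueue.publish('scrape-organization', request.body);
    return { queued: true };
  },
});

fastify.listen(process.env.PORT || 3000);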

@bloudermilk (Member Author)

> will need to call some simplistic web endpoint which just puts a job into a queue.

The way we usually do this is to run a script within the project that has the job of simply queueing the jobs you want to run. To elaborate, the script would work like this:

$ npm run queue-jobs

  • Fetch organizations from Airtable that either A) have not been scraped or B) are "outdated"
  • For each organization
    • Queue a job to process that organization

The script runs in a couple seconds and is only responsible for adding the right jobs to the queue. The worker does the rest. This avoids needing the web interface, keeps the logic of which orgs to scrape internal to the app, and removes the dependency on Zapier.
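For illustration, a minimal sketch of such a queue-jobs script, assuming pg-boss's publish() API and the official airtable client; the 'Scraped At' field, the 7-day staleness window, and the queue name are hypothetical:

// Hypothetical sketch of `npm run queue-jobs`: enqueue work, then exit.
const PgBoss = require('pg-boss');
const Airtable = require('airtable');

async function main() {
  const boss = new PgBoss(process.env.DATABASE_URL);
  await boss.start();

  const base = new Airtable({ apiKey: process.env.AIRTABLE_API_KEY })
    .base(process.env.AIRTABLE_BASE_ID);

  // A) never scraped, or B) "outdated" (7 days is an illustrative cutoff)
  const records = await base('Organizations')
    .select({
      filterByFormula:
        "OR({Scraped At} = BLANK(), IS_BEFORE({Scraped At}, DATEADD(NOW(), -7, 'days')))",
    })
    .all();

  // Queue one job per organization; the background worker does the rest
  for (const record of records) {
    await boss.publish('scrape-organization', { organizationId: record.id });
  }

  await boss.stop();
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});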

const pgLocalConnectionString = `postgres://postgres:postgres@${host}:5432/postgres`;

// See https://stackoverflow.com/a/28489160
const isProduction = !!(process.env._ && (process.env._.indexOf('heroku') >= 0));
@bloudermilk (Member Author)

Any reason we can't just use NODE_ENV to find out if we're running in production?

@leventov (Collaborator)

I guess it requires extra setup by developers locally (export NODE_ENV=dev) and on Heroku (heroku config:set NODE_ENV="production"), imagining some developer wants to test the deployment in their own free Heroku account (as I do right now, but other developers may want to do this, too).

The line above doesn't require anything like that.

@bloudermilk (Member Author)

@leventov Heroku sets that variable by default, and I would assume we never set it locally. It's standard practice to use NODE_ENV in NodeJS apps.
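For reference, the conventional check is a one-liner; Heroku sets NODE_ENV=production for Node apps by default, and locally the variable is simply left unset:

// Anything other than 'production' (including unset) counts as development
const isProduction = process.env.NODE_ENV === 'production';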

backend/src/worker.js (outdated, resolved)
@bloudermilk (Member Author)

@leventov would be great to catch up this week on a call before you get too much further into this. I have some thoughts on the architecture that I'd like to discuss.

After thinking more about this over the last week, I'm leaning toward your original idea of pushing all data back to Airtable, mainly because you made a good point: some of these attributes we really want in Airtable anyway. It also avoids having to pull the data into Gatsby from a different source or maintain an API, and keeps everything viewable within Airtable.

I was thinking we could add new columns to "Organizations":

  • Twitter Followers - the number of Twitter followers
  • Enrichment Data - JSON blob of any other data that we don't need in the Airtable UI but want in Gatsby. For example, I'd like to wire up the Crunchbase ODM API

I was thinking we would have another queue for updating Airtable, which lets us leverage some great pg-boss features (rough sketch after this list):

  • rate limiting - so we can control the speed at which we hit the Airtable API (we're limited to 5 requests/second)
  • batching - the Airtable API lets us update several records in one call
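For illustration, a rough sketch of that updater; it approximates rate limiting by polling one batch per second, and the queue name, field names, and pg-boss fetch()/complete() usage are assumptions rather than the final design:

// Hypothetical sketch: drain the 'update-airtable' queue in batches.
// Airtable's update endpoint accepts up to 10 records per call, and one
// batched call per second stays well under the 5 requests/second limit.
const PgBoss = require('pg-boss');
const Airtable = require('airtable');

async function startAirtableUpdater() {
  const boss = new PgBoss(process.env.DATABASE_URL);
  await boss.start();

  const base = new Airtable({ apiKey: process.env.AIRTABLE_API_KEY })
    .base(process.env.AIRTABLE_BASE_ID);

  setInterval(async () => {
    const jobs = await boss.fetch('update-airtable', 10); // batch of up to 10
    if (!jobs) return; // nothing queued right now

    await base('Organizations').update(
      jobs.map((job) => ({
        id: job.data.organizationId,
        // e.g. { 'Twitter Followers': 1234, 'Enrichment Data': '{...}' }
        fields: job.data.fields,
      }))
    );

    await Promise.all(jobs.map((job) => boss.complete(job.id)));
  }, 1000);
}

startAirtableUpdater();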

@bloudermilk (Member Author)
@leventov looks like you've made a ton of progress over the last few days! Let me know when you're ready for me to dig in for another review :)

@leventov (Collaborator)
@bloudermilk I'm currently developing the first part of the cron pull: backing up the data from Airtable into the backend. (The next step is to see which backed-up organizations don't have corresponding entries in the scraping_results table, thus determining the organizations that require first-time scraping.)

I realized that there will be quite a few more tables in the backend database than I previously thought - probably we are going to eventually back up most (or all) of Airtable's tables (there are currently 11 of them) and model the relationships between them. So writing SQL by hand is not sustainable.

From raw SQL, I jumped to using the Sequelize ORM. I didn't research the topic and didn't consider any alternatives. That was a mistake: I've sunk quite a bit of time into learning Sequelize and looking for ways to work around its problems. I bumped into several, including sequelize/sequelize#11967, sequelize/sequelize#11968, and sequelize/sequelize#3759.

I think we should optimize the backend (and the Climatescape project in general) for enabling people to accomplish something useful quickly with a little time investment (e.g. if they just have a short-lived inspiration to contribute to the project). Therefore, I think using an ORM with a relatively steep learning curve is not optimal.

So I'm currently thinking about switching from Sequelize to either Knex.js (a query builder library, not an ORM) or Objection.js, which advertises itself as a lightweight ORM, or a "relational query builder".

What do you think? Do you have experience with either of these technologies?

@bloudermilk (Member Author)
@leventov thanks for explaining your thought process. I completely agree with this:

> I think we should optimize the backend (and the Climatescape project in general) for enabling people to accomplish something useful quickly with a little time investment

Personally I only have experience with Sequelize, and I agree it's not a great library. It's hard for me to say which of the other two is the better option without digging in, but here are my thoughts:

  • Knex seems to be much more popular (7X downloads, 2X GitHub stars)
  • Given our limited scope in the short term, a query builder is probably adequate
  • Objection, being built on Knex, should be easy to move toward down the line if we need more features

One thing I'm not clear on is why we need to mirror tables from Airtable at this time. It seems like, since we're pulling from the Airtable API on each run, we could just put that data in the queue?

Also, I wanted to clarify one point that we only talked about briefly. You'll see we have a lot of tables in Airtable for enrichment right now, but that's mainly due to Airtable limitations and our current workflow. For the enrichment output from the backend, I think we would be much better off writing to columns directly on Organizations.

@leventov (Collaborator) commented Mar 3, 2020

@bloudermilk thanks for the feedback. I'll proceed with using Knex.js.

> One thing I'm not clear on is why we need to mirror tables from Airtable at this time. It seems like, since we're pulling from the Airtable API on each run, we could just put that data in the queue?

I conceived the first-time scraping kickoff as follows:

  1. Back up all organizations from Airtable to Postgres, into an organizations table with the following schema:
    • id - Airtable's org ID
    • fields - JSONB, all of Airtable's textual fields, corresponding to fields in the Airtable API response. Only textual fields, no images (yet)
    • created_at - the time the org was first backed up in Postgres
    • updated_at - the time the org was last updated in Postgres
    • deleted - we don't delete orgs in Postgres, only mark them as deleted
  2. Find the orgs that need first-time scraping by joining the organizations and scraping_results tables, possibly taking into account the timestamp columns in both tables.

I think this approach is robust (e.g. in the face of pg-boss failures), and it doesn't require any extra fields in Airtable.

Additional benefits of having a backup of data in the backend:

  • Well, this is a backup - some sort of protection against data loss in Airtable
  • Maybe we can run some ad-hoc or periodic analysis on this data that Airtable's interface and/or API doesn't permit
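For illustration, here's roughly what step 1's table could look like as a Knex migration, along with the step-2 join; the scraping_results column names are assumptions:

// Hypothetical Knex migration for the organizations backup table
exports.up = (knex) =>
  knex.schema.createTable('organizations', (table) => {
    table.string('id').primary(); // Airtable's org ID
    table.jsonb('fields').notNullable(); // textual fields from the Airtable API
    table.timestamp('created_at').notNullable().defaultTo(knex.fn.now());
    table.timestamp('updated_at').notNullable().defaultTo(knex.fn.now());
    table.boolean('deleted').notNullable().defaultTo(false); // soft delete only
  });

exports.down = (knex) => knex.schema.dropTable('organizations');

The step-2 lookup is then a single left join:

// Orgs with no scraping_results row yet (org_id is a hypothetical column)
const orgsNeedingFirstScrape = (knex) =>
  knex('organizations')
    .leftJoin('scraping_results', 'organizations.id', 'scraping_results.org_id')
    .whereNull('scraping_results.org_id')
    .andWhere('organizations.deleted', false)
    .select('organizations.*');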

@bloudermilk (Member Author)
@leventov I'll defer to you on the decision to mirror Airtable since you're the one implementing it. I may have done it a different way, but that doesn't make it the "right way" by any means!

@leventov (Collaborator) commented Mar 9, 2020

@bloudermilk instead of a "Twitter Followers" column, I suggest adding "Climatescape Rank", which will equal the Twitter Followers count for now but could be changed in the future to account for other factors of an organization's importance. Twitter Followers per se could be stored inside the JSON in the "Enrichment Data" column.

@bloudermilk (Member Author)
@leventov agree with your suggestion. Maybe we can just call it "Rank"?

@leventov (Collaborator) commented Mar 9, 2020

> Maybe we can just call it "Rank"?

Sounds good

@bloudermilk (Member Author) left a comment

@leventov this looks really close! Would you be able to connect some time this week to walk me through everything? It's a lot of code and I just want to make sure I'm understanding it all.

.idea/inspectionProfiles/Project_Default.xml (resolved)
backend/.eslintrc.js (resolved)
backend/src/app.js (resolved)
@bloudermilk (Member Author)
@leventov I see the updates in Airtable, awesome! Should we move the Heroku app over to the Climatescape team or create a new app?

Is this ready to merge?

@leventov merged commit 73563ab into master on Mar 21, 2020
@leventov deleted the add-backend branch on March 21, 2020 at 07:15
@leventov (Collaborator)
@bloudermilk thanks for the review

Successfully merging this pull request may close these issues:

Epic: automated data scraping pipeline