Backend Scraper #87
Conversation
This reverts commit 8f9e916.
@leventov heads up, I double-reverted your last commit so we can work on it in this PR before merging to master. I'll use a PR review to share my notes!
@leventov I provided my initial thoughts on this implementation inline, including a few questions for you. I was really impressed with the level of documentation you provided, especially the "decisions" log, which could be really helpful down the road when new contributors join.
backend/src/app.js (outdated):
const { pgBossQueue } = require('./pg');
…
fastify.route({
Can you elaborate on how you were planning to trigger this job? I had envisioned this working without any kind of web server to trigger jobs. For example, using a "cron job" (or rather, the Heroku equivalent) to queue a job every N minutes.
The Netlify hook should call this endpoint. Even a cron job (which will kick off periodic batch re-scraping of all organisations' data, for example) will need to call some simplistic web endpoint which just puts a job into a queue.

I don't see how it could be any more direct. Push jobs directly to PostgreSQL via a query? graphile/worker supports this, but pg-boss doesn't (at least, it doesn't document it), so that's a slippery slope. But I also don't see this as a problem. Even with a direct SQL query to create a job, you eventually need some basic auth/protection, I suppose, so you can't really avoid having a thin web layer.

Cron batch processing in Heroku can also be done by starting a dyno which does the batch processing and finishes. But this won't work for first-time scraping of newly added orgs: there should be a worker listening in the background. And since we need this type of worker anyway, it can handle batch scraping, too.
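To make that "thin web layer" concrete, here is a minimal sketch, assuming fastify v4 and a recent pg-boss; the route path and queue name are illustrative assumptions, not the actual implementation:

```js
const fastify = require("fastify")();
const PgBoss = require("pg-boss");

const pgBossQueue = new PgBoss(process.env.DATABASE_URL);

// Thin endpoint: all it does is put a job into the queue.
// Real auth/protection would go in a preHandler hook.
fastify.route({
  method: "POST",
  url: "/scrape", // hypothetical path
  handler: async (request, reply) => {
    const jobId = await pgBossQueue.send("scrape-org", request.body);
    reply.code(202).send({ jobId });
  },
});

pgBossQueue
  .start()
  .then(() => fastify.listen({ port: Number(process.env.PORT) || 3000 }));
```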
> will need to call some simplistic web endpoint which just puts a job into a queue.
The way we usually do this is to run a script within the project whose only job is to queue the jobs you want to run. To elaborate, the script would work like this:

$ npm run queue-jobs

- Fetch organizations from Airtable that either A) have not been scraped or B) are "outdated"
- For each organization:
  - Queue a job to process that organization

The script runs in a couple of seconds and is only responsible for adding the right jobs to the queue. The worker does the rest. This avoids needing the web interface, keeps the logic of which orgs to scrape internal to the app, and removes the dependency on Zapier.
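A minimal sketch of such a script, assuming pg-boss for the queue and the official airtable client; the queue name and the "Last Scraped"/"Outdated" field names are hypothetical:

```js
// queueJobs.js - hypothetical sketch of the script described above
const PgBoss = require("pg-boss");
const Airtable = require("airtable");

async function main() {
  const boss = new PgBoss(process.env.DATABASE_URL);
  await boss.start();

  const base = new Airtable({ apiKey: process.env.AIRTABLE_API_KEY })
    .base(process.env.AIRTABLE_BASE_ID);

  // Fetch organizations that have never been scraped or are outdated
  // (field names in the formula are assumptions)
  const orgs = await base("Organizations")
    .select({ filterByFormula: "OR({Last Scraped} = BLANK(), {Outdated})" })
    .all();

  // Queue one job per organization; the background worker does the rest
  for (const org of orgs) {
    await boss.send("scrape-org", { airtableId: org.id });
  }

  await boss.stop();
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```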
backend/src/pg.js (outdated):
const pgLocalConnectionString = `postgres://postgres:postgres@${host}:5432/postgres`;
// See https://stackoverflow.com/a/28489160
const isProduction = !!(process.env._ && (process.env._.indexOf('heroku') >= 0));
Any reason we can't just use NODE_ENV to find out if we're running in production?
I guess it requires more setup actions by developers locally (export NODE_ENV=dev) and in the Heroku setup (heroku config:set NODE_ENV="production"), imagining some developer wants to test the deployment in their own free Heroku account (like I do right now, but other developers may want to do this, too). The line above doesn't require anything like that.
@leventov Heroku sets that variable by default, and I would assume we never set it locally. It's standard practice to use NODE_ENV in NodeJS apps.
@leventov it would be great to catch up this week on a call before you get too much further into this. I have some thoughts on the architecture that I'd like to discuss. After thinking more about this over the last week, I'm leaning toward your original idea to push all data back to Airtable, mainly because you made a good point that some of these attributes we really want in Airtable anyway. It also avoids having to pull the data into Gatsby from a different source or maintain an API, and keeps everything viewable within Airtable. I was thinking we could add new columns to "Organizations":
I was thinking we would have another queue for updating Airtable, which allows us to leverage some great pg-boss features:
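As one example of those features, pg-boss has built-in retry handling. A minimal sketch, assuming a recent pg-boss where send()/work() are the queue methods; the queue name and payload are hypothetical:

```js
const PgBoss = require("pg-boss");

async function setupAirtableUpdateQueue() {
  const boss = new PgBoss(process.env.DATABASE_URL);
  await boss.start();

  // pg-boss handles retries for us: retryLimit, retryDelay (seconds),
  // and retryBackoff are built-in job options
  await boss.send(
    "update-airtable", // hypothetical queue name
    { orgId: "rec123", fields: { Rank: 42 } },
    { retryLimit: 5, retryDelay: 60, retryBackoff: true }
  );

  // A worker picks jobs off the queue and writes back to Airtable
  await boss.work("update-airtable", async (job) => {
    // ...push job.data.fields to the Organizations table here
  });
}
```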
@leventov looks like you've made a ton of progress over the last few days! Let me know when you're ready for me to dig in for another review :)
@bloudermilk I'm currently developing the first part of the cron pull thing - backing up the data from Airtable into the backend. (The next step is to see which backed-up organizations don't have corresponding entries in …)

I realized that there will be quite a few more tables in the backend database than I previously thought - probably we are going to eventually back up most (or all) of Airtable's tables (there are currently 11 of them) and model the relationships between them. So writing SQL by hand is not sustainable.

From raw SQL, I jumped to using the Sequelize ORM. I didn't research the topic and didn't consider any alternatives. That was a mistake - I sank quite some time into learning Sequelize and looking for ways to work around its problems: I bumped into several, including sequelize/sequelize#11967, sequelize/sequelize#11968, and sequelize/sequelize#3759.

I think we should optimize the backend (and the Climatescape project in general) for enabling people to accomplish something useful quickly with a small time investment (e.g. if they just have a short-lived inspiration to contribute to the project). Therefore, I think using an ORM with a relatively steep learning curve is not optimal.

So currently I'm thinking about switching from Sequelize to either Knex.js (a query builder library, not an ORM) or Objection.js, which advertises itself as a lightweight ORM or a "relational query builder". What do you think? Do you have experience with either of these technologies?
@leventov thanks for explaining your thought process. I completely agree with this:
I only have experience personally with Sequelize and agree it's not a great library. It's hard for me to say what the better option is between the other two without digging in, but here are my thoughts:
One thing I'm not clear on is why we need to mirror tables from Airtable at this time. It seems like, since we're pulling from the Airtable API on each run, we could just put that data in the queue? Also wanted to clarify one point that we only talked about briefly. You'll see we have a lot of tables in Airtable for enrichment right now, but that's mainly due to limitations/workflow. For the enrichment output from the backend, I think we would be much better off writing to columns directly on Organizations.
@bloudermilk thanks for the feedback. I'll proceed with using Knex.js.
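For illustration, a minimal sketch of the query-builder style this implies, assuming Knex.js with the pg client; the organizations/scraping_results schema follows the names mentioned in this thread, but the exact columns are assumptions:

```js
const knex = require("knex")({
  client: "pg",
  connection: process.env.DATABASE_URL,
});

// Find backed-up organizations that have no scraping results yet
// (hypothetical schema, for illustration only)
async function findUnscrapedOrgs() {
  return knex("organizations")
    .leftJoin("scraping_results", "organizations.id", "scraping_results.org_id")
    .whereNull("scraping_results.org_id")
    .select("organizations.*");
}
```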
I conceived of the first-time scraping kickoff as follows:

I think this approach is robust (e.g. in the face of some pg-boss failures), and it doesn't require any extra fields in Airtable. Additional benefits of having a backup of the data in the backend:
@leventov I'll defer to you on the decision to mirror Airtable since you're the one implementing it. I may have done it a different way, but that doesn't make it the "right way" by any means!
@bloudermilk instead of a "Twitter Followers" column, I suggest adding "Climatescape Rank", which will be equal to the Twitter followers count for now, but could be changed in the future to account for other factors of an organization's importance. Twitter followers per se could be stored inside the JSON in the "Enrichment Data" column.
@leventov agree with your suggestion. Maybe we can just call it "Rank"?
Sounds good
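For concreteness, a sketch of what writing those columns back could look like with the airtable client; the record ID and follower count are placeholders, and the exact write path is an assumption:

```js
const Airtable = require("airtable");

const base = new Airtable({ apiKey: process.env.AIRTABLE_API_KEY })
  .base(process.env.AIRTABLE_BASE_ID);

// "Rank" mirrors the Twitter followers count for now; the raw value also
// lives inside the "Enrichment Data" JSON blob, per the discussion above
async function writeEnrichment(recordId, twitterFollowers) {
  await base("Organizations").update(recordId, {
    Rank: twitterFollowers,
    "Enrichment Data": JSON.stringify({ twitterFollowers }),
  });
}
```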
… Twitter user objects in scraping_results table
…pingJobs.js, add setupPgBossQueue() function and pass pgBossQueue around explicitly
@leventov this looks really close! Would you be able to connect some time this week to walk me through everything? It's a lot of code and I just want to make sure I'm understanding it all.
@leventov I see the updates in Airtable - awesome! Should we move the Heroku app over to the Climatescape team or create a new app? Is this ready to merge?
@bloudermilk thanks for the review
Re #40