Merge pull request #87 from climatescape/add-backend
Backend Scraper
Showing 39 changed files with 10,016 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
node_modules | ||
npm-debug.log | ||
yarn-error.log | ||
.env |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
module.exports = { | ||
env: { | ||
browser: false, | ||
node: true, | ||
es6: true, | ||
}, | ||
plugins: ["prettier", "jest"], | ||
extends: [ | ||
"airbnb", | ||
"plugin:prettier/recommended", | ||
"plugin:jest/recommended", | ||
], | ||
rules: { | ||
"no-console": "off", | ||
"no-return-await": "off", // See https://youtrack.jetbrains.com/issue/WEB-39569 | ||
"func-names": ["error", "as-needed"], | ||
"no-unused-vars": ["error", { "args": "none"}], | ||
"jest/valid-expect": "off", // jest-expect-message adds a parameter to expect() | ||
}, | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
node_modules | ||
npm-debug.log | ||
yarn-error.log |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
# The version should stay in sync with package.json | ||
# -alpine version doesn't work because yarn needs Python during installation of packages | ||
FROM node:12.15.0 | ||
|
||
# Install app dependencies. | ||
COPY package.json yarn.lock ./ | ||
|
||
# ignore-engines to skip trying to install fsevents on Linux | ||
RUN yarn config set ignore-engines true && yarn install | ||
|
||
RUN yarn global add pm2 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# 12.1 is the current version in Heroku, it should be periodically checked and synced (see https://data.heroku.com/) | ||
FROM postgres:12.1-alpine | ||
|
||
COPY postgres-init.sql /docker-entrypoint-initdb.d |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
web: node src/web.js | ||
worker: node src/worker.js |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
# Backend | ||
The backend automatically scrapes data for the Climatescape website. | ||
|
||
## Overview | ||
The backend is written in [Node.js](doc/decisions/2-use-node.md) and is deployed on [Heroku](doc/decisions/1-use-heroku.md). | ||
It consists of two apps: *web* (not to be confused with the Climatescape website itself) and *worker* for [background | ||
processing](doc/decisions/3-background-task-processing.md). *web* pushes jobs to a persistent [pg-boss]( | ||
doc/decisions/4-use-pg-boss-queue.md) queue backed by Postgres. | ||
|
||
## Local setup | ||
|
||
1. Install Node 12.15.0 using [`nvm`](https://github.com/nvm-sh/nvm#install--update-script) | ||
2. [Install yarn 1.x](https://classic.yarnpkg.com/en/docs/install) | ||
3. Run `yarn config set ignore-engines true && yarn install` | ||
4. Install and start [Docker Desktop](https://www.docker.com/products/docker-desktop) | ||
|
||
If you have problems installing dependencies (running the `yarn` command) on macOS, try the following: | ||
1. Follow the instructions on [this page](https://github.com/nodejs/node-gyp/blob/master/macOS_Catalina.md) | ||
2. Run `brew install libpq` and follow the instructions about modifying the `PATH`, `LDFLAGS`, `CPPFLAGS`, and `PKG_CONFIG_PATH` | ||
variables printed by Homebrew at the end of the installation. | ||
|
||
Run tests via `yarn test`. | ||
|
||
For a faster test or debug loop, first start the `db` and `worker` containers separately with `docker-compose up -d db worker`, | ||
and then run `yarn jest`. | ||
|
||
To test the full formation, run `docker-compose up -d` and send a request to *web* via | ||
``` | ||
curl -X POST http://127.0.0.1:3000/twitterUserObject --header 'Content-type: application/json' --data '{"orgId":"climatescape", "twitterScreenName":"climatescape"}' | ||
``` | ||
To enter Postgres container for debugging, use `docker exec -it backend_db_1 psql -U postgres` |
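
The request from the README above can also be sent from Node. The following is a minimal sketch using only the built-in `http` module. The helper names `buildScrapeRequest` and `postScrapeRequest` are hypothetical, and the sketch assumes that *web* is served over plain HTTP on port 3000, as published in `docker-compose.yml`.

```javascript
const http = require("http")

// Hypothetical helper: builds the request options and the JSON body for the
// POST /twitterUserObject endpoint described in the README.
function buildScrapeRequest(orgId, twitterScreenName) {
  const body = JSON.stringify({ orgId, twitterScreenName })
  return {
    options: {
      host: "127.0.0.1",
      port: 3000, // assumption: the port published in docker-compose.yml
      path: "/twitterUserObject",
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Content-Length": Buffer.byteLength(body),
      },
    },
    body,
  }
}

// Hypothetical helper: sends the request and resolves with the status code
// and the response text.
function postScrapeRequest(orgId, twitterScreenName) {
  const { options, body } = buildScrapeRequest(orgId, twitterScreenName)
  return new Promise((resolve, reject) => {
    const req = http.request(options, res => {
      let text = ""
      res.setEncoding("utf8")
      res.on("data", chunk => {
        text += chunk
      })
      res.on("end", () => resolve({ statusCode: res.statusCode, text }))
    })
    req.on("error", reject)
    req.end(body)
  })
}
```

With the containers running, `postScrapeRequest("climatescape", "climatescape").then(console.log)` should behave like the curl command.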
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
Decided to implement the automatic data scraping backend as a custom-built system deployed on Heroku, rather than using | ||
a specialized platform for scraping such as Apify, because we think that the backend system will eventually outgrow mere | ||
data scraping. | ||
|
||
See [this message](https://github.com/climatescape/climatescape.org/issues/40#issuecomment-583264142) and the preceding | ||
messages in the thread. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
Decided to use the Node.js platform for the backend system ([deployed on Heroku](1-use-heroku.md)) for two main reasons: | ||
- This is a platform familiar to the key stakeholder of the project, Brendan. | ||
- Uniformity with the frontend (Netlify) part of the project and potentially sharing some model code in the future. | ||
|
||
See [this message in Slack](https://climatescape.slack.com/archives/CT42YRV3P/p1580385766005700) and messages around it. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
The automatic data scraping system [deployed on Heroku](1-use-heroku.md) is organized as follows: | ||
|
||
1. A "worker" dyno processes jobs (scraping tasks) from queue(s) in the background. | ||
2. Scraping tasks are added to the queue(s) via [one-off dynos]( | ||
https://devcenter.heroku.com/articles/one-off-dynos) which are scheduled to run periodically via [Heroku | ||
Scheduler](https://devcenter.heroku.com/articles/scheduler). For example, a one-off script can pull the data from | ||
Climatescape's Airtable and schedule initial scraping tasks for newly added orgs. | ||
|
||
This decision to use a background worker with task queue(s) is driven by the [Heroku | ||
documentation](https://devcenter.heroku.com/articles/background-jobs-queueing) which justifies this approach as scalable | ||
and reliable. The worker fetches the scraping tasks in the background, does the scraping (e.g. accesses the Twitter API), and puts | ||
the results into the Postgres database. | ||
|
||
As an alternative to the "pull approach" (scraping tasks are populated by scheduled one-off dynos), a "push approach" | ||
was considered: the backend maintains a web interface (on a separate dyno) which the Climatescape website (via [Netlify | ||
Functions](https://docs.netlify.com/functions/overview)) or [Zapier](https://zapier.com/home) would call to submit scraping tasks. | ||
|
||
The pull-based approach was chosen because it has the following advantages: | ||
- The dependency on Netlify Functions or Zapier can be avoided, reducing the number of concepts that developers have to | ||
learn and environments to manage. On the other hand, even with the push approach, one-off scripts in Heroku could be | ||
needed anyway to schedule periodic re-scraping of information about all organizations. | ||
- The backend doesn't need to expose a POST or PUT interface, so no need to worry about protection and authentication. | ||
Some form of backend web interface might be eventually added, e. g. for monitoring of the number of scraping tasks in | ||
the queue(s), but that interface could probably be read-only so authentication might not be required. | ||
|
||
The decision to use a background worker was made [here]( | ||
https://github.com/climatescape/climatescape.org/issues/40#issuecomment-584658556) (see also a few preceding messages in | ||
that thread), with the push-based approach. Then, we decided to use the pull-based instead of push-based approach in | ||
[this discussion](https://github.com/climatescape/climatescape.org/pull/87#discussion_r383368702). |
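
The pull approach described above can be sketched with an in-memory stand-in for the queue. This is an illustration only: the backend itself uses pg-boss, and `InMemoryQueue`, `enqueueFirstTimeJobs`, `runWorker`, the job name, and the sample orgs are all hypothetical.

```javascript
// In-memory stand-in for a persistent job queue (the backend itself uses pg-boss).
class InMemoryQueue {
  constructor() {
    this.jobs = []
  }

  publish(name, data) {
    this.jobs.push({ name, data })
  }

  fetch() {
    return this.jobs.shift() || null
  }
}

// Scheduled one-off script: finds orgs that have not been scraped yet and
// enqueues one job per org. `orgs` and `scrapedOrgIds` are hypothetical inputs.
function enqueueFirstTimeJobs(queue, orgs, scrapedOrgIds) {
  const newOrgs = orgs.filter(org => !scrapedOrgIds.has(org.id))
  newOrgs.forEach(org =>
    queue.publish("twitterUserObject", {
      orgId: org.id,
      twitterScreenName: org.twitter,
    })
  )
  return newOrgs.length
}

// Worker: drains the queue and stores the results. `scrape` is a placeholder
// for the real work, such as calling the Twitter API.
function runWorker(queue, scrape, results) {
  let job = queue.fetch()
  while (job !== null) {
    results.set(job.data.orgId, scrape(job.data))
    job = queue.fetch()
  }
}

const queue = new InMemoryQueue()
const orgs = [
  { id: "org-a", twitter: "orgA" },
  { id: "org-b", twitter: "orgB" },
]
const results = new Map()
const added = enqueueFirstTimeJobs(queue, orgs, new Set(["org-a"]))
runWorker(queue, data => ({ screenName: data.twitterScreenName }), results)
console.log(added, [...results.keys()])
```

Note that nothing in this flow accepts requests from outside: the one-off script and the worker only share the queue, which is the property the pull approach was chosen for.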
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
Decided to use [pg-boss](https://github.com/timgit/pg-boss) queue for managing [background task processing]( | ||
3-background-task-processing.md) because: | ||
- It uses Postgres for persistence, which the backend system uses anyway (for storing the scraping results), thus | ||
reducing the operational variability of the backend compared to running a separate persistent queue such as RabbitMQ. | ||
- pg-boss is chosen over [graphile-worker](https://github.com/graphile/worker) because it supports multiple queues | ||
(at least I couldn't figure out how to create multiple different queues from the docs of graphile-worker, as of the | ||
version 0.4.0). We will need multiple queues for managing processing with different priorities, e. g. periodic batch | ||
scraping vs. first-time scraping for orgs just added to the website. | ||
|
||
See [this message](https://github.com/climatescape/climatescape.org/issues/40#issuecomment-585112177). | ||
|
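
To illustrate why separate queues help with priorities, here is a small in-memory sketch. It does not use pg-boss; the queue names and the helpers `publish` and `fetchNextJob` are hypothetical and only show the idea of serving first-time scraping before periodic batch scraping.

```javascript
// Hypothetical queue names; the decision above only says that several queues are needed.
const FIRST_TIME = "first-time-scraping"
const PERIODIC = "periodic-batch-scraping"

// In-memory stand-in: one FIFO array per named queue.
const queues = new Map([
  [FIRST_TIME, []],
  [PERIODIC, []],
])

function publish(queueName, data) {
  queues.get(queueName).push(data)
}

// Checks the queues in priority order and returns the first available job,
// so first-time jobs do not wait behind a long periodic batch.
function fetchNextJob(priorityOrder) {
  const queueName = priorityOrder.find(name => queues.get(name).length > 0)
  if (queueName === undefined) {
    return null
  }
  return { queueName, job: queues.get(queueName).shift() }
}

publish(PERIODIC, { orgId: "org-a" })
publish(PERIODIC, { orgId: "org-b" })
publish(FIRST_TIME, { orgId: "org-new" })

const processingOrder = []
let next = fetchNextJob([FIRST_TIME, PERIODIC])
while (next !== null) {
  processingOrder.push(next.job.orgId)
  next = fetchNextJob([FIRST_TIME, PERIODIC])
}
console.log(processingOrder)
```

Although `org-new` is published last, it is processed first because its queue is checked before the periodic one.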
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
Decided to use the yarn package manager instead of npm because of an [issue](https://github.com/yarnpkg/yarn/issues/5962) | ||
with node-gyp package installation on Node 12.15.0 and Mac OS X, which I was able to resolve only with yarn by running | ||
``` | ||
$ yarn global add node-gyp | ||
$ yarn global remove node-gyp | ||
``` | ||
|
||
As described [here](https://github.com/yarnpkg/yarn/issues/5962#issuecomment-435576447). | ||
|
||
See [extra context and discussion](https://github.com/climatescape/climatescape.org/pull/87#discussion_r383366463) about | ||
the issue and the choice of yarn over npm. | ||
|
||
Using Yarn 1.x because there are still some problems with using Yarn 2 on Heroku. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
In order to make use of the scraping results in the Gatsby-generated Climatescape website, we decided to push the | ||
scraped data back to Airtable in the [background worker](3-background-task-processing.md), *in addition* to storing this | ||
data in the Postgres database. | ||
|
||
An alternative approach is to connect Gatsby to Postgres via [gatsby-source-pg]( | ||
https://www.gatsbyjs.org/packages/gatsby-source-pg/) plugin. | ||
|
||
We chose pushing data to Airtable from the backend to simplify the Gatsby setup, considering that some data pushing from | ||
backend to Airtable is needed anyway to enable sorting the organizations in the Airtable content management interface | ||
according to their weight ("Climatescape rank"), which is [one of the goals of the scraping automation project]( | ||
https://github.com/climatescape/climatescape.org/issues/40#issue-558680900). | ||
|
||
See [this message](https://github.com/climatescape/climatescape.org/pull/87#issuecomment-590864830). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
version: '3.7' | ||
services: | ||
web: | ||
build: . | ||
depends_on: | ||
- db | ||
volumes: | ||
- ./src:/opt/src | ||
working_dir: /opt/src | ||
command: pm2-dev web.js | ||
ports: | ||
- '3000:3000' | ||
environment: | ||
WITHIN_CONTAINER: 'true' | ||
worker: | ||
build: . | ||
depends_on: | ||
- db | ||
volumes: | ||
- ./src:/opt/src | ||
working_dir: /opt/src | ||
command: pm2-dev worker.js | ||
environment: | ||
WITHIN_CONTAINER: 'true' | ||
db: | ||
build: | ||
context: . | ||
dockerfile: Dockerfile.postgres | ||
volumes: | ||
- dbdata:/var/lib/postgresql/data | ||
environment: | ||
POSTGRES_DB: postgres | ||
POSTGRES_USER: postgres | ||
POSTGRES_PASSWORD: postgres | ||
ports: | ||
- published: 5432 | ||
target: 5432 | ||
healthcheck: | ||
test: ["CMD-SHELL", "pg_isready -U postgres"] | ||
interval: 1s | ||
timeout: 5s | ||
retries: 10 | ||
|
||
volumes: | ||
dbdata: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
{ | ||
"engines": { | ||
"node": "12.15.0", | ||
"yarn": "1.x" | ||
}, | ||
"dependencies": { | ||
"airtable": "^0.8.1", | ||
"dotenv": "^8.2.0", | ||
"fastify": "^2.12.0", | ||
"global": "^4.4.0", | ||
"knex": "^0.20.10", | ||
"lodash": "^4.17.15", | ||
"log-timestamp": "^0.3.0", | ||
"node-gyp": "^6.1.0", | ||
"p-memoize": "^4.0.0", | ||
"pg": "^7.18.1", | ||
"pg-boss": "^4.0.0-beta5", | ||
"pg-hstore": "^2.3.3", | ||
"pg-pool": "^2.0.10", | ||
"sequelize": "^5.21.5", | ||
"swagger": "^0.7.5", | ||
"twitter-lite": "^0.9.4", | ||
"url": "^0.11.0" | ||
}, | ||
"devDependenciesComment": "Have to use eslint 5.x until https://youtrack.jetbrains.com/issue/WEB-43692 is fixed", | ||
"devDependencies": { | ||
"eslint": "^5.16.0", | ||
"eslint-plugin-jest": "^23.8.0", | ||
"jest": "^25.1.0", | ||
"jest-expect-message": "^1.0.2", | ||
"supertest": "^4.0.2", | ||
"wait-for-expect": "^3.0.2" | ||
}, | ||
"optionalDependenciesComment": "explicitly making fsevents optional to skip trying to install fsevents on Linux in Docker image", | ||
"optionalDependencies": { | ||
"fsevents": "2.1.2" | ||
}, | ||
"scripts": { | ||
"about:test": "echo '--forceExit is needed because of https://github.com/facebook/jest/issues/9473'", | ||
"test": "docker-compose up -d db worker && docker-compose rm -fsv web && jest --detectOpenHandles --runInBand --forceExit", | ||
"about:jest": "echo '--forceExit is needed because of https://github.com/facebook/jest/issues/9473'", | ||
"jest": "jest --detectOpenHandles --runInBand --forceExit" | ||
}, | ||
"jest": { | ||
"setupFilesAfterEnv": [ | ||
"jest-expect-message" | ||
], | ||
"verbose": true | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
-- Needed as per https://github.com/timgit/pg-boss/blob/master/docs/usage.md#database-installation | ||
CREATE EXTENSION pgcrypto; |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
const { setupPgBossQueue } = require("./pg") | ||
const { | ||
addFirstTimeTwitterUserObjectScrapingJobs, | ||
} = require("./twitterUserObjectScraping") | ||
|
||
async function addFirstTimeScrapingJobs() { | ||
const pgBossQueue = await setupPgBossQueue() | ||
await addFirstTimeTwitterUserObjectScrapingJobs(pgBossQueue) | ||
} | ||
|
||
if (require.main === module) { | ||
;(async () => { | ||
try { | ||
await addFirstTimeScrapingJobs() | ||
} catch (e) { | ||
console.error("Error adding first-time scraping jobs", e) | ||
} | ||
})() | ||
} | ||
|
||
module.exports = { addFirstTimeScrapingJobs } |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
const Airtable = require("airtable") | ||
const { configureEnvironment, sleep } = require("./utils") | ||
|
||
configureEnvironment() | ||
Airtable.configure({ | ||
endpointUrl: "https://api.airtable.com", | ||
apiKey: process.env.AIRTABLE_API_KEY, | ||
}) | ||
const airtableBase = Airtable.base("appNYMWxGF1jMaf5V") | ||
|
||
/** | ||
* This function is adapted from airtable.js: https://github.com/Airtable/airtable.js/blob/ | ||
* 31fd0a089fee87832760f35c7270eae283972e35/lib/query.js#L116-L135 to include waits to avoid hitting Airtable's rate | ||
* limits accidentally (see https://github.com/Airtable/airtable.js/issues/30), and add more logging. | ||
* @returns {Promise<Array<Object>>} | ||
*/ | ||
function fetchAllRecords(airtableQuery) { | ||
return new Promise((resolve, reject) => { | ||
const allRecords = [] | ||
airtableQuery.eachPage( | ||
function page(pageRecords, fetchNextPage) { | ||
allRecords.push(...pageRecords) | ||
console.log( | ||
`Fetched ${pageRecords.length} records, now ${allRecords.length} in total` | ||
) | ||
// The rate limit is 5 rps, but we don't try to be even close to that because there may be concurrent operations | ||
// with Airtable API happening in the backend system, e. g. writing scraped data to Airtable. | ||
sleep(1000) | ||
.then(() => fetchNextPage()) | ||
.catch(error => reject(error)) | ||
}, | ||
function error(err) { | ||
if (err) { | ||
reject(err) | ||
} else { | ||
console.log(`Fetched ${allRecords.length} records in total`) | ||
resolve(allRecords) | ||
} | ||
} | ||
) | ||
}) | ||
} | ||
|
||
module.exports = { airtableBase, fetchAllRecords } |
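
To see how the paging logic in `fetchAllRecords` behaves without calling Airtable, it can be driven by a fake query object. The sketch below restates a condensed variant of the function with a configurable delay. `fetchAllRecordsWithDelay`, its `delayMs` parameter, and `fakeQuery` are hypothetical names introduced for this example; `fakeQuery` mimics only the `eachPage(page, done)` callback shape used above, not the real Airtable client.

```javascript
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

// Condensed variant of fetchAllRecords; `delayMs` is added so the example runs quickly.
function fetchAllRecordsWithDelay(query, delayMs) {
  return new Promise((resolve, reject) => {
    const allRecords = []
    query.eachPage(
      (pageRecords, fetchNextPage) => {
        allRecords.push(...pageRecords)
        // Pause between pages to stay well below the API rate limit.
        sleep(delayMs)
          .then(() => fetchNextPage())
          .catch(reject)
      },
      err => (err ? reject(err) : resolve(allRecords))
    )
  })
}

// Hypothetical stand-in for an Airtable query: serves fixed pages, then calls `done`.
function fakeQuery(pages) {
  return {
    eachPage(page, done) {
      let index = 0
      const fetchNextPage = () => {
        if (index < pages.length) {
          const current = pages[index]
          index += 1
          page(current, fetchNextPage)
        } else {
          done(null)
        }
      }
      fetchNextPage()
    },
  }
}

const pending = fetchAllRecordsWithDelay(
  fakeQuery([["rec1", "rec2"], ["rec3"]]),
  10
)
pending.then(records => console.log(`Fetched ${records.length} records`))
```

The next page is requested only after the pause has elapsed, so the request rate is bounded by the delay regardless of how quickly each page arrives.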
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
const fastify = require("fastify")({ logger: true }) | ||
|
||
const { setupPgBossQueue } = require("./pg") | ||
const { TWITTER_USER_OBJECT } = require("./twitterUserObjectScraping") | ||
|
||
async function buildFastify() { | ||
const pgBossQueue = await setupPgBossQueue() | ||
|
||
fastify.route({ | ||
method: "POST", | ||
url: "/twitterUserObject", | ||
schema: { | ||
body: { | ||
type: "object", | ||
required: ["orgId", "twitterScreenName"], | ||
properties: { | ||
orgId: { type: "string" }, | ||
twitterScreenName: { type: "string" }, | ||
}, | ||
}, | ||
}, | ||
handler(req, res) { | ||
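// Pass the body as pino's merging object; extra arguments without a printf-style placeholder are not logged | ||
fastify.log.info( | ||
{ body: req.body }, | ||
"Received request to scrape Twitter user object" | ||
) | ||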
return pgBossQueue.publish(TWITTER_USER_OBJECT, req.body) | ||
}, | ||
}) | ||
|
||
await fastify.listen(process.env.PORT || 3000, "0.0.0.0") | ||
fastify.log.info(`server listening on ${fastify.server.address().port}`) | ||
return fastify | ||
} | ||
|
||
module.exports = buildFastify |
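
The `schema.body` option in the route above lets Fastify reject malformed requests before the handler runs. As a rough illustration of which bodies that schema is meant to accept, here is a hand-written check. `isValidScrapeRequest` is a hypothetical helper: Fastify validates against the JSON schema itself rather than with code like this, and its validator may additionally coerce types, which this sketch ignores.

```javascript
// Hypothetical helper mirroring the JSON schema of the /twitterUserObject route:
// the body must be an object with string `orgId` and `twitterScreenName` properties.
function isValidScrapeRequest(body) {
  if (body === null || typeof body !== "object" || Array.isArray(body)) {
    return false
  }
  return ["orgId", "twitterScreenName"].every(
    key => typeof body[key] === "string"
  )
}

const complete = { orgId: "climatescape", twitterScreenName: "climatescape" }
const missingScreenName = { orgId: "climatescape" }

console.log(isValidScrapeRequest(complete))
console.log(isValidScrapeRequest(missingScreenName))
```

Because both properties are listed under `required`, a body like `missingScreenName` should never reach the handler, so the handler can publish `req.body` to the queue without further checks.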