Merge pull request #87 from climatescape/add-backend

Backend Scraper

leventov authored Mar 21, 2020
2 parents 2462a54 + de6bd42 commit 73563ab
Showing 39 changed files with 10,016 additions and 0 deletions.
14 changes: 14 additions & 0 deletions .idea/inspectionProfiles/Project_Default.xml

Some generated files are not rendered by default.

4 changes: 4 additions & 0 deletions backend/.dockerignore
@@ -0,0 +1,4 @@
node_modules
npm-debug.log
yarn-error.log
.env
20 changes: 20 additions & 0 deletions backend/.eslintrc.js
@@ -0,0 +1,20 @@
module.exports = {
  env: {
    browser: false,
    node: true,
    es6: true,
  },
  plugins: ["prettier", "jest"],
  extends: [
    "airbnb",
    "plugin:prettier/recommended",
    "plugin:jest/recommended",
  ],
  rules: {
    "no-console": "off",
    "no-return-await": "off", // See https://youtrack.jetbrains.com/issue/WEB-39569
    "func-names": ["error", "as-needed"],
    "no-unused-vars": ["error", { "args": "none" }],
    "jest/valid-expect": "off", // jest-expect-message adds a parameter to expect()
  },
}
3 changes: 3 additions & 0 deletions backend/.gitignore
@@ -0,0 +1,3 @@
node_modules
npm-debug.log
yarn-error.log
11 changes: 11 additions & 0 deletions backend/Dockerfile
@@ -0,0 +1,11 @@
# The version should stay in sync with package.json
# The -alpine version doesn't work because yarn needs Python during installation of packages
FROM node:12.15.0

# Install app dependencies.
COPY package.json yarn.lock ./

# ignore-engines to skip trying to install fsevents on Linux
RUN yarn config set ignore-engines true && yarn install

RUN yarn global add pm2
4 changes: 4 additions & 0 deletions backend/Dockerfile.postgres
@@ -0,0 +1,4 @@
# 12.1 is the current version on Heroku; it should be periodically checked and kept in sync (see https://data.heroku.com/)
FROM postgres:12.1-alpine

COPY postgres-init.sql /docker-entrypoint-initdb.d
2 changes: 2 additions & 0 deletions backend/Procfile
@@ -0,0 +1,2 @@
web: node src/web.js
worker: node src/worker.js
31 changes: 31 additions & 0 deletions backend/README.md
@@ -0,0 +1,31 @@
# Backend
The backend automatically scrapes data for the Climatescape website.

## Overview
The backend is written in [Node.js](doc/decisions/2-use-node.md) and deployed on [Heroku](doc/decisions/1-use-heroku.md).
It consists of two apps: *web* (not to be confused with the Climatescape website itself) and *worker* for [background
processing](doc/decisions/3-background-task-processing.md). *web* pushes jobs to a persistent [pg-boss](
doc/decisions/4-use-pg-boss-queue.md) queue backed by Postgres.
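
A minimal sketch of that flow, assuming pg-boss's `publish`/`subscribe` API (the queue name here is illustrative, not
necessarily the one the backend uses):
```
const PgBoss = require("pg-boss")

async function sketch() {
  const boss = new PgBoss(process.env.DATABASE_URL)
  await boss.start()

  // web side: publish() persists the job in Postgres
  await boss.publish("twitterUserObject", { orgId: "climatescape" })

  // worker side: pick jobs up in the background
  await boss.subscribe("twitterUserObject", async job => {
    console.log("scraping Twitter for", job.data.orgId)
  })
}

sketch().catch(console.error)
```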

## Local setup

1. Install Node 12.15.0 using [`nvm`](https://github.com/nvm-sh/nvm#install--update-script)
2. [Install yarn 1.x](https://classic.yarnpkg.com/en/docs/install)
3. Run `yarn config set ignore-engines true && yarn install`
4. Install and start [Docker Desktop](https://www.docker.com/products/docker-desktop)

If you have problems installing dependencies (running the `yarn` command) on macOS, try the following:
1. Follow the instructions on [this page](https://github.com/nodejs/node-gyp/blob/master/macOS_Catalina.md)
2. `brew install libpq` and follow the instructions about modifying the `PATH`, `LDFLAGS`, `CPPFLAGS`, and
`PKG_CONFIG_PATH` variables printed by Homebrew at the end of the installation.

Run tests via `yarn test`.

For a faster testing or debugging loop, first start the `db` and `worker` containers separately: `docker-compose up -d
db worker`, and then run `yarn jest`.

For full-formation testing, use `docker-compose up -d` and ping the *web* app via
```
curl -X POST http://127.0.0.1:3000/twitterUserObject --header 'Content-type: application/json' --data '{"orgId":"climatescape", "twitterScreenName":"climatescape"}'
```
To enter the Postgres container for debugging, use `docker exec -it backend_db_1 psql -U postgres`.
6 changes: 6 additions & 0 deletions backend/doc/decisions/1-use-heroku.md
@@ -0,0 +1,6 @@
Decided to implement the automatic data scraping backend as a custom-built system deployed on Heroku, rather than using
a specialized scraping platform such as Apify, because we think that the backend system will eventually outgrow mere
data scraping.

See [this message](https://github.com/climatescape/climatescape.org/issues/40#issuecomment-583264142) and the preceding
messages in the thread.
5 changes: 5 additions & 0 deletions backend/doc/decisions/2-use-node.md
@@ -0,0 +1,5 @@
Decided to use the Node.js platform for the backend system ([deployed on Heroku](1-use-heroku.md)) for two main reasons:
- It is a platform familiar to the key stakeholder of the project, Brendan.
- Uniformity with the frontend (Netlify) part of the project, and the potential to share some model code in the future.

See [this message in Slack](https://climatescape.slack.com/archives/CT42YRV3P/p1580385766005700) and messages around it.
29 changes: 29 additions & 0 deletions backend/doc/decisions/3-background-task-processing.md
@@ -0,0 +1,29 @@
The automatic data scraping system [deployed on Heroku](1-use-heroku.md) is organized as follows:

1. A "worker" dyno processes jobs (scraping tasks) from queue(s) in background.
2. Scraping tasks are populated to the queue(s) via [one-off dynos](
https://devcenter.heroku.com/articles/one-off-dynos) which are scheduled to be run periodically via [Heroku
Scheduler](https://devcenter.heroku.com/articles/scheduler). For example, a one-off script can pull the data from the
Climatescape's Airtable and schedule initial scraping tasks for newly added orgs.
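
A hedged sketch of such a one-off script (mirroring `src/addFirstTimeScrapingJobs.js` from this PR; the Airtable table
name, field name, and queue name below are illustrative assumptions, not the actual schema):
```
// Run periodically as a one-off dyno, e.g. via Heroku Scheduler:
//   node src/enqueueScrapingJobs.js
const { setupPgBossQueue } = require("./pg")
const { airtableBase, fetchAllRecords } = require("./airtable")

async function enqueueScrapingJobs() {
  const pgBossQueue = await setupPgBossQueue()
  // "Organizations" and "Twitter" are assumed names for illustration
  const orgs = await fetchAllRecords(airtableBase("Organizations").select())
  for (const org of orgs) {
    // publish() persists the job in Postgres until the worker picks it up
    await pgBossQueue.publish("twitterUserObject", {
      orgId: org.id,
      twitterScreenName: org.get("Twitter"),
    })
  }
}

enqueueScrapingJobs().catch(e => console.error("Error enqueueing jobs", e))
```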

This decision to use a background worker with task queue(s) is driven by the [Heroku
documentation](https://devcenter.heroku.com/articles/background-jobs-queueing), which justifies this approach as
scalable and reliable. The worker fetches the scraping tasks in the background, does the scraping (e.g. accesses the
Twitter API), and puts the results into the Postgres database.

As an alternative to this "pull" approach (scraping tasks are populated by scheduled one-off dynos), a "push" approach
was considered: the backend maintains a web interface (on a separate dyno) which the Climatescape website (via [Netlify
Functions](https://docs.netlify.com/functions/overview)) or [Zapier](https://zapier.com/home) would call to submit
scraping tasks.

The pull-based approach was chosen because it has the following advantages:
- The dependency on Netlify Functions or Zapier is avoided, reducing the number of concepts that developers have to
learn and environments to manage. Moreover, even with the push approach, one-off scripts in Heroku would likely be
needed anyway to schedule periodic re-scraping of information about all organizations.
- The backend doesn't need to expose a POST or PUT interface, so there is no need to worry about protection and
authentication. Some form of backend web interface might eventually be added, e.g. for monitoring the number of
scraping tasks in the queue(s), but that interface could probably be read-only, so authentication might not be
required.

The decision to use a background worker was made [here](
https://github.com/climatescape/climatescape.org/issues/40#issuecomment-584658556) (see also a few preceding messages
in that thread), initially with the push-based approach in mind. We then decided to use the pull-based approach instead
in [this discussion](https://github.com/climatescape/climatescape.org/pull/87#discussion_r383368702).
12 changes: 12 additions & 0 deletions backend/doc/decisions/4-use-pg-boss-queue.md
@@ -0,0 +1,12 @@
Decided to use the [pg-boss](https://github.com/timgit/pg-boss) queue for managing [background task processing](
3-background-task-processing.md) because:
- It uses Postgres for persistence, which the backend uses anyway (for storing the scraping results), thus reducing the
operational variability of the backend compared to running a separate persistent queue such as RabbitMQ.
- pg-boss is chosen over [graphile-worker](https://github.com/graphile/worker) because it supports multiple queues
(at least I couldn't figure out how to create multiple different queues from graphile-worker's docs, as of version
0.4.0). We will need multiple queues to manage processing with different priorities, e.g. periodic batch scraping vs.
first-time scraping for orgs just added to the website (see the sketch below).
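
A hedged sketch of what multiple queues could look like with pg-boss (queue names and options are illustrative):
```
const PgBoss = require("pg-boss")

async function startWorker() {
  const boss = new PgBoss(process.env.DATABASE_URL)
  await boss.start()

  // First-time scraping for orgs just added to the website: served promptly
  await boss.subscribe("firstTimeScraping", { teamSize: 5 }, async job => {
    console.log("first-time scraping", job.data)
  })

  // Periodic batch re-scraping: lower priority, may lag behind
  await boss.subscribe("batchRescraping", { teamSize: 1 }, async job => {
    console.log("batch re-scraping", job.data)
  })
}

startWorker().catch(console.error)
```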

See [this message](https://github.com/climatescape/climatescape.org/issues/40#issuecomment-585112177).

13 changes: 13 additions & 0 deletions backend/doc/decisions/5-use-yarn.md
@@ -0,0 +1,13 @@
Decided to use the yarn package manager instead of npm because of an [issue](https://github.com/yarnpkg/yarn/issues/5962)
with node-gyp package installation on Node 12.15.0 and macOS, which I was able to resolve only in yarn by running
```
$ yarn global add node-gyp
$ yarn global remove node-gyp
```

As described [here](https://github.com/yarnpkg/yarn/issues/5962#issuecomment-435576447).

See [extra context and discussion](https://github.com/climatescape/climatescape.org/pull/87#discussion_r383366463) about
the issue and the choice of yarn over npm.

Using Yarn 1.x because there are still some problems with using Yarn 2 on Heroku.
13 changes: 13 additions & 0 deletions backend/doc/decisions/6-push-data-to-airtable.md
@@ -0,0 +1,13 @@
In order to make use of the scraping results in the Gatsby-generated Climatescape website, we decided to push the
scraped data back to Airtable in the [background worker](3-background-task-processing.md), *in addition* to storing this
data in the Postgres database.

An alternative approach would be to connect Gatsby to Postgres via the [gatsby-source-pg](
https://www.gatsbyjs.org/packages/gatsby-source-pg/) plugin.

We chose to push data to Airtable from the backend to simplify the Gatsby setup, considering that some pushing of data
from the backend to Airtable is needed anyway to enable sorting the organizations in the Airtable content management
interface according to their weight ("Climatescape rank"), which is [one of the goals of the scraping automation
project](https://github.com/climatescape/climatescape.org/issues/40#issue-558680900).
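
A hypothetical sketch of such a push, using airtable.js's `update` (the base id appears in `backend/src/airtable.js`;
the table and field names are illustrative assumptions, not the actual schema):
```
const Airtable = require("airtable")

const base = new Airtable({ apiKey: process.env.AIRTABLE_API_KEY }).base(
  "appNYMWxGF1jMaf5V"
)

// "Organizations" and "Climatescape Rank" are assumed names for illustration
async function pushClimatescapeRank(recordId, rank) {
  await base("Organizations").update(recordId, { "Climatescape Rank": rank })
}

module.exports = { pushClimatescapeRank }
```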

See [this message](https://github.com/climatescape/climatescape.org/pull/87#issuecomment-590864830).
45 changes: 45 additions & 0 deletions backend/docker-compose.yml
@@ -0,0 +1,45 @@
version: '3.7'
services:
  web:
    build: .
    depends_on:
      - db
    volumes:
      - ./src:/opt/src
    working_dir: /opt/src
    command: pm2-dev web.js
    ports:
      - '3000:3000'
    environment:
      WITHIN_CONTAINER: 'true'
  worker:
    build: .
    depends_on:
      - db
    volumes:
      - ./src:/opt/src
    working_dir: /opt/src
    command: pm2-dev worker.js
    environment:
      WITHIN_CONTAINER: 'true'
  db:
    build:
      context: .
      dockerfile: Dockerfile.postgres
    volumes:
      - dbdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: postgres
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
    ports:
      - published: 5432
        target: 5432
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 1s
      timeout: 5s
      retries: 10

volumes:
  dbdata:
50 changes: 50 additions & 0 deletions backend/package.json
@@ -0,0 +1,50 @@
{
  "engines": {
    "node": "12.15.0",
    "yarn": "1.x"
  },
  "dependencies": {
    "airtable": "^0.8.1",
    "dotenv": "^8.2.0",
    "fastify": "^2.12.0",
    "global": "^4.4.0",
    "knex": "^0.20.10",
    "lodash": "^4.17.15",
    "log-timestamp": "^0.3.0",
    "node-gyp": "^6.1.0",
    "p-memoize": "^4.0.0",
    "pg": "^7.18.1",
    "pg-boss": "^4.0.0-beta5",
    "pg-hstore": "^2.3.3",
    "pg-pool": "^2.0.10",
    "sequelize": "^5.21.5",
    "swagger": "^0.7.5",
    "twitter-lite": "^0.9.4",
    "url": "^0.11.0"
  },
  "devDependenciesComment": "Have to use eslint 5.x until https://youtrack.jetbrains.com/issue/WEB-43692 is fixed",
  "devDependencies": {
    "eslint": "^5.16.0",
    "eslint-plugin-jest": "^23.8.0",
    "jest": "^25.1.0",
    "jest-expect-message": "^1.0.2",
    "supertest": "^4.0.2",
    "wait-for-expect": "^3.0.2"
  },
  "optionalDependenciesComment": "explicitly making fsevents optional to skip trying to install fsevents on Linux in Docker image",
  "optionalDependencies": {
    "fsevents": "2.1.2"
  },
  "scripts": {
    "about:test": "echo '--forceExit is needed because of https://github.com/facebook/jest/issues/9473'",
    "test": "docker-compose up -d db worker && docker-compose rm -fsv web && jest --detectOpenHandles --runInBand --forceExit",
    "about:jest": "echo '--forceExit is needed because of https://github.com/facebook/jest/issues/9473'",
    "jest": "jest --detectOpenHandles --runInBand --forceExit"
  },
  "jest": {
    "setupFilesAfterEnv": [
      "jest-expect-message"
    ],
    "verbose": true
  }
}
2 changes: 2 additions & 0 deletions backend/postgres-init.sql
@@ -0,0 +1,2 @@
-- Needed as per https://github.com/timgit/pg-boss/blob/master/docs/usage.md#database-installation
CREATE EXTENSION pgcrypto;
21 changes: 21 additions & 0 deletions backend/src/addFirstTimeScrapingJobs.js
@@ -0,0 +1,21 @@
const { setupPgBossQueue } = require("./pg")
const {
  addFirstTimeTwitterUserObjectScrapingJobs,
} = require("./twitterUserObjectScraping")

async function addFirstTimeScrapingJobs() {
  const pgBossQueue = await setupPgBossQueue()
  await addFirstTimeTwitterUserObjectScrapingJobs(pgBossQueue)
}

if (require.main === module) {
  ;(async () => {
    try {
      await addFirstTimeScrapingJobs()
    } catch (e) {
      console.error("Error adding first-time scraping jobs", e)
    }
  })()
}

module.exports = { addFirstTimeScrapingJobs }
44 changes: 44 additions & 0 deletions backend/src/airtable.js
@@ -0,0 +1,44 @@
const Airtable = require("airtable")
const { configureEnvironment, sleep } = require("./utils")

configureEnvironment()
Airtable.configure({
  endpointUrl: "https://api.airtable.com",
  apiKey: process.env.AIRTABLE_API_KEY,
})
const airtableBase = Airtable.base("appNYMWxGF1jMaf5V")

/**
 * This function is adapted from airtable.js: https://github.com/Airtable/airtable.js/blob/
 * 31fd0a089fee87832760f35c7270eae283972e35/lib/query.js#L116-L135 to include waits (to avoid accidentally hitting
 * Airtable's rate limits, see https://github.com/Airtable/airtable.js/issues/30) and to add more logging.
 * @returns {Promise<Array<Object>>}
 */
function fetchAllRecords(airtableQuery) {
  return new Promise((resolve, reject) => {
    const allRecords = []
    airtableQuery.eachPage(
      function page(pageRecords, fetchNextPage) {
        allRecords.push(...pageRecords)
        console.log(
          `Fetched ${pageRecords.length} records, now ${allRecords.length} in total`
        )
        // The rate limit is 5 rps, but we don't try to be even close to that because there may be concurrent
        // operations with the Airtable API happening in the backend system, e.g. writing scraped data to Airtable.
        sleep(1000)
          .then(() => fetchNextPage())
          .catch(error => reject(error))
      },
      function error(err) {
        if (err) {
          reject(err)
        } else {
          console.log(`Fetched ${allRecords.length} records in total`)
          resolve(allRecords)
        }
      }
    )
  })
}

module.exports = { airtableBase, fetchAllRecords }
36 changes: 36 additions & 0 deletions backend/src/app.js
@@ -0,0 +1,36 @@
const fastify = require("fastify")({ logger: true })

const { setupPgBossQueue } = require("./pg")
const { TWITTER_USER_OBJECT } = require("./twitterUserObjectScraping")

async function buildFastify() {
  const pgBossQueue = await setupPgBossQueue()

  fastify.route({
    method: "POST",
    url: "/twitterUserObject",
    schema: {
      body: {
        type: "object",
        required: ["orgId", "twitterScreenName"],
        properties: {
          orgId: { type: "string" },
          twitterScreenName: { type: "string" },
        },
      },
    },
    handler(req, res) {
      fastify.log.info(
        "Received request to scrape Twitter user object: ",
        req.body
      )
      return pgBossQueue.publish(TWITTER_USER_OBJECT, req.body)
    },
  })

  await fastify.listen(process.env.PORT || 3000, "0.0.0.0")
  fastify.log.info(`server listening on ${fastify.server.address().port}`)
  return fastify
}

module.exports = buildFastify