Backend Scraper #87

Merged — 24 commits, Mar 21, 2020

Commits
29879c6
Revert "Revert "Add backend, part of #40""
bloudermilk Feb 24, 2020
56c3e7d
Fix backend readme to mention yarn 1.x, not 2 install, and link to th…
leventov Feb 25, 2020
d6ccd83
More links from decision docs to discussions.
leventov Feb 25, 2020
4f70864
Add comment to isProduction
leventov Feb 25, 2020
6b3a569
backend: add airtable.js
leventov Feb 26, 2020
14c7c1b
backend: reformat pg.js according to the project's style
leventov Feb 26, 2020
3ca9d7c
backend: reformat setupScraping
leventov Feb 26, 2020
34a99c9
backend: reformat worker.js
leventov Feb 26, 2020
b68663e
backend: migrate to sequelize
leventov Feb 26, 2020
91c6899
backend: add backupAirtable.js, setupAirtableBackup.js
leventov Feb 27, 2020
7166a87
backend: update yarn instructions in README
leventov Feb 27, 2020
6fb93a4
backend: in backupAirtable.js, write orgs to Postgres
leventov Feb 27, 2020
003bbc8
Update decision docs
leventov Mar 2, 2020
9563e73
Migrate to Knex, add backupAirtable.test.js
leventov Mar 5, 2020
ae3615e
Add decision doc about pushing data to Airtable from backend
leventov Mar 5, 2020
e5369a1
Add determineOrgsToScrapeFirstTime function
leventov Mar 6, 2020
53f1866
Add addFirstTimeScrapingJobs() function
leventov Mar 7, 2020
337cf32
Connect to Twitter API
leventov Mar 8, 2020
127f3d7
Refactor twitter.js and addFirstTimeScrapingJobs.js
leventov Mar 9, 2020
450614e
Complete first-time scraping of Twitter followers
leventov Mar 9, 2020
651595b
backend: refactoring: store just Twitter followers -> store the whole…
leventov Mar 10, 2020
1a4d346
backend: refactor: extract twitterUserObjectScraping.js from worker.js
leventov Mar 10, 2020
e87151c
backend: refactor: extract firstTimeScraping.js from addFirstTimeScra…
leventov Mar 10, 2020
de6bd42
Fix bug in getTwitterScreenName()
leventov Mar 10, 2020
4 changes: 4 additions & 0 deletions backend/.dockerignore
@@ -0,0 +1,4 @@
node_modules
npm-debug.log
yarn-error.log
.env
3 changes: 3 additions & 0 deletions backend/.gitignore
@@ -0,0 +1,3 @@
node_modules
npm-debug.log
yarn-error.log
11 changes: 11 additions & 0 deletions backend/Dockerfile
@@ -0,0 +1,11 @@
# The version should stay in sync with package.json
# -alpine version doesn't work because yarn needs Python during installation of packages
FROM node:12.15.0

# Install app dependencies.
COPY package.json yarn.lock ./

# ignore-engines to skip trying to install fsevents on Linux
RUN yarn config set ignore-engines true && yarn install

RUN yarn global add pm2
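
# Note: the application source isn't copied into the image here; for local development,
# docker-compose.yml mounts ./src into the container and supplies the pm2-dev start command.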
4 changes: 4 additions & 0 deletions backend/Dockerfile.postgres
@@ -0,0 +1,4 @@
# 12.1 is the current version on Heroku; it should be periodically checked and kept in sync (see https://data.heroku.com/)
FROM postgres:12.1-alpine

COPY postgres-init.sql /docker-entrypoint-initdb.d
2 changes: 2 additions & 0 deletions backend/Procfile
@@ -0,0 +1,2 @@
web: node src/web.js
worker: node src/worker.js
29 changes: 29 additions & 0 deletions backend/README.md
@@ -0,0 +1,29 @@
# Backend
The backend automatically scrapes data for the Climatescape website.

## Overview
The backend is written in [Node.js](doc/decisions/2-use-node.md) and is deployed on [Heroku](doc/decisions/1-use-heroku.md).
It consists of two apps: *web* (not to be confused with the Climatescape website itself) and *worker*, for [background
processing](doc/decisions/3-background-task-processing.md). web pushes jobs to a persistent [pg-boss](
doc/decisions/4-use-pg-boss-queue.md) queue backed by Postgres.

## Local setup

1. Install Node 12.15.0 using [`nvm`](https://github.com/nvm-sh/nvm#install--update-script)
2. Install [yarn 1.x](https://classic.yarnpkg.com/en/docs/install) ([why yarn 1.x](doc/decisions/5-use-yarn.md))
3. Install and start [Docker Desktop](https://www.docker.com/products/docker-desktop)

If you have problems installing the dependencies (running the `yarn` command) on Mac OS, try the following:
1. Follow the instructions on [this page](https://github.com/nodejs/node-gyp/blob/master/macOS_Catalina.md)
2. `brew install libpq` and follow the instructions about modifying the `PATH`, `LDFLAGS`, `CPPFLAGS`, and `PKG_CONFIG_PATH`
variables that Homebrew prints at the end of the installation.

Run tests via `yarn --ignore-engines test`.

For a faster test or debug loop, first start the `db` and `worker` containers separately: `docker-compose up -d db worker`,
and then run `yarn --ignore-engines jest`.

For full formation testing, use `docker-compose up -d` and ping the web app via
```
curl -X POST http://127.0.0.1:3000/twitterFollowers --header 'Content-type: application/json' --data '{"orgId":"climatescape", "twitterUrl":"twitter.com/climatescape"}'
```
6 changes: 6 additions & 0 deletions backend/doc/decisions/1-use-heroku.md
@@ -0,0 +1,6 @@
Decided to implement the automatic data scraping backend as a custom-built system deployed on Heroku, rather than using
a specialized scraping platform such as Apify, because we think the backend system will eventually outgrow mere
data scraping.

See [this message](https://github.com/climatescape/climatescape.org/issues/40#issuecomment-583264142) and the preceding
messages in the thread.
5 changes: 5 additions & 0 deletions backend/doc/decisions/2-use-node.md
@@ -0,0 +1,5 @@
Decided to use the Node.js platform for the backend system ([deployed on Heroku](1-use-heroku.md)) for two main reasons:
- It is a platform familiar to the key stakeholder of the project, Brendan.
- Uniformity with the frontend (Netlify) part of the project, and the potential to share some model code in the future.

See [this message in Slack](https://climatescape.slack.com/archives/CT42YRV3P/p1580385766005700) and messages around it.
13 changes: 13 additions & 0 deletions backend/doc/decisions/3-background-task-processing.md
@@ -0,0 +1,13 @@
Decided to split the automatic data scraping system [deployed on Heroku](1-use-heroku.md) into two dynos:
1. "web" (not to be confused with the Climatescape website itself)
2. "worker", which does the scraping in the background.

web is expected to be called from hooks on the Climatescape website or from Zapier. web puts scraping jobs into
a persistent queue and responds to the caller immediately. worker fetches the jobs in the background, does the scraping
(e.g. accesses the Twitter API), and puts the results into a Postgres database.

This decision is driven by the [Heroku documentation](https://devcenter.heroku.com/articles/background-jobs-queueing)
which justifies this approach as scalable and reliable.

See [this message](https://github.com/climatescape/climatescape.org/issues/40#issuecomment-584658556) and a few
preceding messages in the thread.
12 changes: 12 additions & 0 deletions backend/doc/decisions/4-use-pg-boss-queue.md
@@ -0,0 +1,12 @@
Decided to use the [pg-boss](https://github.com/timgit/pg-boss) queue for managing [background task processing](
3-background-task-processing.md) because:
- It uses Postgres for persistence, which the backend system uses anyway (for storing the scraping results), thus
reducing the operational variability of the backend compared to running a separate persistent queue such as RabbitMQ.
- pg-boss is chosen over [graphile-worker](https://github.com/graphile/worker) because it supports multiple queues
(at least I couldn't figure out how to create multiple different queues from the docs of graphile-worker, as of
version 0.4.0). We will need multiple queues to manage processing with different priorities, e.g. periodic batch
scraping vs. first-time scraping for orgs just added to the website (see the sketch below).

See [this message](https://github.com/climatescape/climatescape.org/issues/40#issuecomment-585112177).
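
For illustration, a minimal sketch of two such queues coexisting on one pg-boss instance (the queue names here are
hypothetical, not final):
```js
// Sketch: two independent pg-boss queues with separate subscribers.
// pgBossQueue is the instance exported from src/pg.js.
const { pgBossQueue } = require('../../src/pg');

async function main() {
  await pgBossQueue.start();
  // High-priority: first-time scraping for orgs just added to the website.
  await pgBossQueue.subscribe('firstTimeScraping', async job => {
    console.log('first-time scrape:', job.data.orgId);
  });
  // Lower-priority: periodic batch re-scraping.
  await pgBossQueue.subscribe('batchScraping', async job => {
    console.log('batch re-scrape:', job.data.orgId);
  });
  // Producers publish to either queue independently.
  await pgBossQueue.publish('firstTimeScraping', { orgId: 'climatescape' });
}

main().catch(console.error);
```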

10 changes: 10 additions & 0 deletions backend/doc/decisions/5-use-yarn.md
@@ -0,0 +1,10 @@
Decided to use the yarn package manager instead of npm because of an [issue](https://github.com/yarnpkg/yarn/issues/5962)
with node-gyp package installation on Node 12.15.0 and Mac OS X, which I was able to resolve only with yarn, by running
```
$ yarn global add node-gyp
$ yarn global remove node-gyp
```

As described [here](https://github.com/yarnpkg/yarn/issues/5962#issuecomment-435576447).

Using Yarn 1.x because there are still some problems with using Yarn 2 on Heroku.
3 changes: 3 additions & 0 deletions backend/doc/decisions/6-use-fastify-server.md
@@ -0,0 +1,3 @@
Chose [Fastify](https://www.fastify.io/) as the Node web server over [Express](https://expressjs.com/) because the latter
doesn't support async/await handlers out of the box as of version 4.x. Fastify also comes with JSON Schema validation
and logging, which is useful.
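
For illustration, a minimal sketch of an async handler in Fastify 2.x (Express 4.x needs extra wrappers for this):

```js
// Sketch: Fastify accepts async handlers directly; the resolved value
// is serialized as the JSON response body.
const fastify = require('fastify')({ logger: true });

fastify.get('/ping', async () => {
  return { pong: true };
});

fastify.listen(3000).catch(err => {
  fastify.log.error(err);
  process.exit(1);
});
```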
45 changes: 45 additions & 0 deletions backend/docker-compose.yml
@@ -0,0 +1,45 @@
version: '3.7'
services:
  web:
    build: .
    depends_on:
      - db
    volumes:
      - ./src:/opt/src
    working_dir: /opt/src
    command: pm2-dev web.js
    ports:
      - '3000:3000'
    environment:
      WITHIN_CONTAINER: 'true'
  worker:
    build: .
    depends_on:
      - db
    volumes:
      - ./src:/opt/src
    working_dir: /opt/src
    command: pm2-dev worker.js
    environment:
      WITHIN_CONTAINER: 'true'
  db:
    build:
      context: .
      dockerfile: Dockerfile.postgres
    volumes:
      - dbdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: postgres
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
    ports:
      - published: 5432
        target: 5432
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 1s
      timeout: 5s
      retries: 10

volumes:
  dbdata:
30 changes: 30 additions & 0 deletions backend/package.json
@@ -0,0 +1,30 @@
{
  "engines": {
    "node": "12.15.0",
    "yarn": "1.x"
  },
  "dependencies": {
    "fastify": "^2.12.0",
    "global": "^4.4.0",
    "node-gyp": "^6.1.0",
    "pg": "^7.18.1",
    "pg-boss": "^4.0.0-beta5",
    "pg-pool": "^2.0.10",
    "swagger": "^0.7.5",
    "url": "^0.11.0"
  },
  "devDependencies": {
    "jest": "^25.1.0",
    "supertest": "^4.0.2"
  },
  "optionalDependenciesComment": "explicitly making fsevents optional to skip trying to install fsevents on Linux in Docker image",
  "optionalDependencies": {
    "fsevents": "2.1.2"
  },
  "scripts": {
    "about:test": "echo '--forceExit is needed because of https://github.com/facebook/jest/issues/9473'",
    "test": "docker-compose up -d db worker && docker-compose rm -fsv web && jest --detectOpenHandles --runInBand --verbose --forceExit",
    "about:jest": "echo '--forceExit is needed because of https://github.com/facebook/jest/issues/9473'",
    "jest": "jest --detectOpenHandles --runInBand --verbose --forceExit"
  }
}
2 changes: 2 additions & 0 deletions backend/postgres-init.sql
@@ -0,0 +1,2 @@
-- Needed as per https://github.com/timgit/pg-boss/blob/master/docs/usage.md#database-installation
CREATE EXTENSION pgcrypto;
32 changes: 32 additions & 0 deletions backend/src/app.js
@@ -0,0 +1,32 @@
const fastify = require('fastify')({ logger: true });

const { pgBossQueue } = require('./pg');

fastify.route({
bloudermilk (Member Author): Can you elaborate on how you were planning to trigger this job? I had envisioned this working w/o any kind of web server to trigger jobs. For example, using a "cron job" (but the Heroku equivalent) to queue a job every N minutes.

leventov (Collaborator): A Netlify hook should call this endpoint. Even a cron job (which will kick off periodic batch re-scraping of all organisations' data, for example) will need to call some simplistic web endpoint which just puts a job into a queue.

I don't see how it could be more direct. Push jobs directly to PostgreSQL via a query? graphile/worker supports this, but pg-boss doesn't (at least, it doesn't document this), so it's a slippery slope. But I also don't see this as a problem.

Even with a direct SQL query to create a job, you'd eventually need some basic auth/protection, I suppose, so you can't really avoid having a thin web layer.

Cron batch processing in Heroku can also be done by starting a dyno which does the batch processing and finishes. But this won't work for first-time scraping of newly added orgs: there should be a worker listening in the background. And since we need this type of worker anyway, it can handle batch scraping, too.

bloudermilk (Member Author):

> will need to call some simplistic web endpoint which just puts a job into a queue.

The way we usually do this is to run a script within the project whose only job is to queue the jobs you want to run. To elaborate, the script would work like this:

$ npm run queue-jobs

  • Fetch organizations from Airtable that either A) have not been scraped or B) are "outdated"
  • For each organization
    • Queue a job to process that organization

The script runs in a couple of seconds and is only responsible for adding the right jobs to the queue. The worker does the rest. This avoids needing the web interface, keeps the logic of which orgs to scrape internal to the app, and removes the dependency on Zapier. (A sketch of such a script appears after this file's diff.)

  method: 'POST',
  url: '/twitterFollowers',
  schema: {
    body: {
      type: 'object',
      required: ['orgId', 'twitterUrl'],
      properties: {
        orgId: { type: 'string' },
        twitterUrl: { type: 'string' }
      }
    }
  },
  handler: function (req, res) {
    fastify.log.info('Received request to scrape Twitter followers: ', req.body);
    return pgBossQueue.publish('twitterFollowers', req.body);
  }
});

async function buildFastify() {
  await pgBossQueue.start();
  await fastify.listen(process.env.PORT || 3000, '0.0.0.0');
  fastify.log.info(`server listening on ${fastify.server.address().port}`);
  return fastify;
}


module.exports = buildFastify;
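
For reference, a minimal sketch of the `queue-jobs`-style script described in the review discussion above (not part of
this PR; `fetchOrgsToScrape` is a hypothetical helper that would query Airtable for unscraped or outdated orgs):

```js
// Sketch of a standalone queueing script, run e.g. via `node src/queueJobs.js`.
const { pgBossQueue } = require('./pg');
// Hypothetical helper: returns orgs that haven't been scraped yet or are outdated.
const fetchOrgsToScrape = require('./fetchOrgsToScrape');

(async () => {
  await pgBossQueue.start();
  const orgs = await fetchOrgsToScrape();
  for (const org of orgs) {
    await pgBossQueue.publish('twitterFollowers', {
      orgId: org.id,
      twitterUrl: org.twitterUrl,
    });
  }
  console.log(`Queued ${orgs.length} scraping jobs`);
  process.exit(0);
})().catch(e => {
  console.error('Error queueing jobs', e);
  process.exit(1);
});
```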
84 changes: 84 additions & 0 deletions backend/src/pg.js
@@ -0,0 +1,84 @@
const Url = require('url');
const Pool = require('pg-pool');

// WITHIN_CONTAINER is set in docker-compose.yml
// We are *not* running within a local container when we run jest tests, e. g. via `yarn test`
const isRunningWithinLocalContainer = process.env.WITHIN_CONTAINER === 'true';
const host = isRunningWithinLocalContainer ? 'db' : 'localhost';
// user, password, and the database name correspond to those set in docker-compose.yml
const pgLocalConnectionString = `postgres://postgres:postgres@${host}:5432/postgres`;

// See https://stackoverflow.com/a/28489160
const isProduction = !!(process.env._ && (process.env._.indexOf('heroku') >= 0));
bloudermilk (Member Author): Any reason we can't just use NODE_ENV to find out if we're running in production?

leventov (Collaborator): I guess it requires more setup actions by the developers locally (export NODE_ENV=dev) and in the Heroku setup (heroku config:set NODE_ENV="production"), imagining some developer wants to test the deployment in their own free Heroku account (like I do right now, but other developers may want to do this, too).

The above line doesn't require anything like that.

bloudermilk (Member Author): @leventov Heroku sets that variable by default, and I would assume we never set it locally. It's standard practice to use NODE_ENV in Node.js apps.

console.log('isProduction: ' + isProduction);
const pgConnectionString = isProduction ? process.env.DATABASE_URL : pgLocalConnectionString;

const params = Url.parse(pgConnectionString);
const auth = params.auth.split(':');

const pgConfig = {
  user: auth[0],
  password: auth[1],
  host: params.hostname,
  port: params.port,
  database: params.pathname.split('/')[1],
  // If some problems with different SSL config for local dev and Heroku ever arise, they could be matched:
  // 1. Run the following commands (copied from https://gist.github.com/mrw34/c97bb03ea1054afb551886ffc8b63c3b):
  //    openssl req -new -text -passout pass:abcd -subj /CN=localhost -out server.req -keyout privkey.pem
  //    openssl rsa -in privkey.pem -passin pass:abcd -out server.key
  //    openssl req -x509 -in server.req -text -key server.key -out server.crt
  // 2. Add the following lines to Dockerfile.postgres (copied from https://stackoverflow.com/a/55072885):
  //    COPY server.key /var/lib/postgresql/server.key
  //    COPY server.crt /var/lib/postgresql/server.crt
  //
  //    RUN chmod 600 /var/lib/postgresql/server.key
  //    RUN chown postgres:postgres /var/lib/postgresql/server.key
  // 3. Add to the db service in docker-compose.yml:
  //    command:
  //      -c ssl=on -c ssl_cert_file=/var/lib/postgresql/server.crt -c ssl_key_file=/var/lib/postgresql/server.key
  ssl: isProduction
};

const pgPool = new Pool(pgConfig);
pgPool.on('error', (err, client) => console.error(err));

// Should be removed when https://github.com/brianc/node-postgres/issues/1789 is fixed
const connectWrapper = async function() {
  for (let nRetry = 1; ; nRetry++) {
    try {
      const client = await pgPool.connect();
      if (nRetry > 1) {
        console.info('Now successfully connected to Postgres');
      }
      return client;
    } catch (e) {
      if (e.toString().includes('ECONNREFUSED') && nRetry < 5) {
        console.info('ECONNREFUSED connecting to Postgres, ' +
          'maybe container is not ready yet, will retry ' + nRetry);
        // Wait 1 second
        await new Promise(resolve => setTimeout(resolve, 1000));
      } else {
        throw e;
      }
    }
  }
};

const pgPoolWrapper = {
  connect: connectWrapper,
  async query(text, values) {
    const client = await connectWrapper();
    try {
      return client.query(text, values);
    } finally {
      client.release();
    }
  },
};

const PgBoss = require('pg-boss');

const pgBossQueue = new PgBoss({db: { executeSql: pgPoolWrapper.query }});

pgBossQueue.on('error', err => console.error(err));

module.exports = {pgConfig, pgBossQueue, pgPool: pgPoolWrapper};
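
If the NODE_ENV convention suggested in the review above were adopted (Heroku sets NODE_ENV=production by default), the
production check would reduce to a one-liner, sketched here:

```js
const isProduction = process.env.NODE_ENV === 'production';
```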
21 changes: 21 additions & 0 deletions backend/src/setupScraping.js
@@ -0,0 +1,21 @@
const { pgPool } = require('./pg.js');

async function setupScraping() {
  const client = await pgPool.connect();
  try {
    await client.query(
      'CREATE TABLE IF NOT EXISTS scraping_results (' +
      '  org_id text not null,' +
      '  request_type text not null,' +
      '  scraping_time timestamp DEFAULT current_timestamp,' +
      '  result jsonb not null,' +
      '  PRIMARY KEY(org_id, request_type, scraping_time)' +
      ');'
    );
  } finally {
    client.release();
  }
  return pgPool;
}

module.exports = setupScraping;
9 changes: 9 additions & 0 deletions backend/src/web.js
@@ -0,0 +1,9 @@
const buildFastify = require('./app');

(async () => {
  try {
    await buildFastify();
  } catch (e) {
    console.error('Error starting fastify server', e);
  }
})();
43 changes: 43 additions & 0 deletions backend/src/worker.js
@@ -0,0 +1,43 @@
const { pgBossQueue } = require('./pg');
const setupScraping = require('./setupScraping');

async function startWorker() {
  await pgBossQueue.start();
  const scrapingPool = await setupScraping();
  await pgBossQueue.subscribe('twitterFollowers', async job => {
    const data = job.data;
    console.log('Scraping Twitter followers: ' + JSON.stringify(data));
    const numTwitterFollowers = await twitterFollowers(data);
    console.log(`Twitter followers of ${data.orgId}: ${numTwitterFollowers}`);
    const client = await scrapingPool.connect();
    try {
      const result = await client.query(
        'INSERT INTO scraping_results (org_id, request_type, result) VALUES ($1, $2, $3) ' +
        'ON CONFLICT (org_id, request_type, scraping_time) DO UPDATE ' +
        'SET result = $3;',
        [data.orgId, 'twitterFollowers', numTwitterFollowers]
      );
      if (result.rowCount !== 1) {
        console.error('ERROR! Expected one updated row, got ' + result.rowCount);
      }
      console.log(`Twitter followers for ${data.orgId} were successfully stored in the database`);
    } catch (e) {
      console.error(`Error while storing twitter followers for ${data.orgId} in the database`, e);
    } finally {
      client.release();
    }
  });
}

(async () => {
  try {
    await startWorker();
  } catch (e) {
    console.error('Error starting worker', e);
  }
})();

// Stub: returns a constant until the real Twitter API integration lands (added in later commits of this PR).
async function twitterFollowers(data) {
  return 100;
}
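
The `twitterFollowers` stub above returns a constant; later commits in this PR ("Connect to Twitter API") replace it.
A rough sketch of what the real implementation could look like, assuming the `twitter` npm client with credentials in
environment variables (the names below are illustrative, not the code this PR eventually lands):

```js
// Sketch only: a real implementation would also need rate-limit and error handling.
const Twitter = require('twitter');

const twitterClient = new Twitter({
  consumer_key: process.env.TWITTER_CONSUMER_KEY,
  consumer_secret: process.env.TWITTER_CONSUMER_SECRET,
  bearer_token: process.env.TWITTER_BEARER_TOKEN,
});

async function twitterFollowers(data) {
  // e.g. "twitter.com/climatescape" -> "climatescape"
  const screenName = data.twitterUrl.replace(/\/+$/, '').split('/').pop();
  const user = await twitterClient.get('users/show', { screen_name: screenName });
  return user.followers_count;
}
```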
