Merge pull request #87 from climatescape/add-backend

Backend Scraper

leventov authored Mar 21, 2020
2 parents 2462a54 + de6bd42 commit 73563ab
Showing 39 changed files with 10,016 additions and 0 deletions.
14 changes: 14 additions & 0 deletions .idea/inspectionProfiles/Project_Default.xml

Some generated files are not rendered by default.

4 changes: 4 additions & 0 deletions backend/.dockerignore
@@ -0,0 +1,4 @@
node_modules
npm-debug.log
yarn-error.log
.env
20 changes: 20 additions & 0 deletions backend/.eslintrc.js
@@ -0,0 +1,20 @@
module.exports = {
  env: {
    browser: false,
    node: true,
    es6: true,
  },
  plugins: ["prettier", "jest"],
  extends: [
    "airbnb",
    "plugin:prettier/recommended",
    "plugin:jest/recommended",
  ],
  rules: {
    "no-console": "off",
    "no-return-await": "off", // See https://youtrack.jetbrains.com/issue/WEB-39569
    "func-names": ["error", "as-needed"],
    "no-unused-vars": ["error", { "args": "none" }],
    "jest/valid-expect": "off", // jest-expect-message adds a parameter to expect()
  },
}
3 changes: 3 additions & 0 deletions backend/.gitignore
@@ -0,0 +1,3 @@
node_modules
npm-debug.log
yarn-error.log
11 changes: 11 additions & 0 deletions backend/Dockerfile
@@ -0,0 +1,11 @@
# The version should stay in sync with package.json
# The -alpine version doesn't work because yarn needs Python during installation of packages
FROM node:12.15.0

# Install app dependencies.
COPY package.json yarn.lock ./

# ignore-engines to skip trying to install fsevents on Linux
RUN yarn config set ignore-engines true && yarn install

RUN yarn global add pm2
4 changes: 4 additions & 0 deletions backend/Dockerfile.postgres
@@ -0,0 +1,4 @@
# 12.1 is the current version on Heroku; it should be periodically checked and kept in sync (see https://data.heroku.com/)
FROM postgres:12.1-alpine

COPY postgres-init.sql /docker-entrypoint-initdb.d
2 changes: 2 additions & 0 deletions backend/Procfile
@@ -0,0 +1,2 @@
web: node src/web.js
worker: node src/worker.js
31 changes: 31 additions & 0 deletions backend/README.md
@@ -0,0 +1,31 @@
# Backend
The backend automatically scrapes data for the Climatescape website.

## Overview
The backend is written in [Node.js](doc/decisions/2-use-node.md) and deployed on [Heroku](doc/decisions/1-use-heroku.md).
It consists of two apps: *web* (not to be confused with the Climatescape website itself) and *worker* for [background
processing](doc/decisions/3-background-task-processing.md). *web* pushes jobs to a persistent [pg-boss](
doc/decisions/4-use-pg-boss-queue.md) queue backed by Postgres.
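
A minimal sketch of that flow, assuming pg-boss's `publish`/`subscribe` API (the queue name here is illustrative, not
necessarily the one the backend uses):
```
const PgBoss = require("pg-boss")

async function sketch() {
  const boss = new PgBoss(process.env.DATABASE_URL)
  await boss.start()

  // web side: publish() persists the job in Postgres
  await boss.publish("twitterUserObject", { orgId: "climatescape" })

  // worker side: pick jobs up in the background
  await boss.subscribe("twitterUserObject", async job => {
    console.log("scraping Twitter for", job.data.orgId)
  })
}

sketch().catch(console.error)
```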

## Local setup

1. Install Node 12.15.0 using [`nvm`](https://github.com/nvm-sh/nvm#install--update-script)
2. [Install yarn 1.x](https://classic.yarnpkg.com/en/docs/install)
3. Run `yarn config set ignore-engines true && yarn install`
4. Install and start [Docker Desktop](https://www.docker.com/products/docker-desktop)

If you have problems installing dependencies (running the `yarn` command) on macOS, try the following:
1. Follow the instructions on [this page](https://github.com/nodejs/node-gyp/blob/master/macOS_Catalina.md)
2. `brew install libpq` and follow the instructions about modifying the `PATH`, `LDFLAGS`, `CPPFLAGS`, and
`PKG_CONFIG_PATH` variables printed by Homebrew at the end of the installation.

Run tests via `yarn test`.

For a faster testing or debugging loop, first start the `db` and `worker` containers separately: `docker-compose up -d
db worker`, and then run `yarn jest`.

For full-formation testing, use `docker-compose up -d` and ping the *web* app via
```
curl -X POST http://127.0.0.1:3000/twitterUserObject --header 'Content-type: application/json' --data '{"orgId":"climatescape", "twitterScreenName":"climatescape"}'
```
To enter the Postgres container for debugging, use `docker exec -it backend_db_1 psql -U postgres`.
6 changes: 6 additions & 0 deletions backend/doc/decisions/1-use-heroku.md
@@ -0,0 +1,6 @@
Decided to implement the automatic data scraping backend as a custom-built system deployed on Heroku, rather than using
a specialized scraping platform such as Apify, because we think that the backend system will eventually outgrow mere
data scraping.

See [this message](https://github.com/climatescape/climatescape.org/issues/40#issuecomment-583264142) and the preceding
messages in the thread.
5 changes: 5 additions & 0 deletions backend/doc/decisions/2-use-node.md
@@ -0,0 +1,5 @@
Decided to use the Node.js platform for the backend system ([deployed on Heroku](1-use-heroku.md)) for two main reasons:
- It is a platform familiar to the key stakeholder of the project, Brendan.
- Uniformity with the frontend (Netlify) part of the project, and the potential to share some model code in the future.

See [this message in Slack](https://climatescape.slack.com/archives/CT42YRV3P/p1580385766005700) and messages around it.
29 changes: 29 additions & 0 deletions backend/doc/decisions/3-background-task-processing.md
@@ -0,0 +1,29 @@
The automatic data scraping system [deployed on Heroku](1-use-heroku.md) is organized as follows:

1. A "worker" dyno processes jobs (scraping tasks) from queue(s) in background.
2. Scraping tasks are populated to the queue(s) via [one-off dynos](
https://devcenter.heroku.com/articles/one-off-dynos) which are scheduled to be run periodically via [Heroku
Scheduler](https://devcenter.heroku.com/articles/scheduler). For example, a one-off script can pull the data from the
Climatescape's Airtable and schedule initial scraping tasks for newly added orgs.
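
A hedged sketch of such a one-off script (mirroring `src/addFirstTimeScrapingJobs.js` from this PR; the Airtable table
name, field name, and queue name below are illustrative assumptions, not the actual schema):
```
// Run periodically as a one-off dyno, e.g. via Heroku Scheduler:
//   node src/enqueueScrapingJobs.js
const { setupPgBossQueue } = require("./pg")
const { airtableBase, fetchAllRecords } = require("./airtable")

async function enqueueScrapingJobs() {
  const pgBossQueue = await setupPgBossQueue()
  // "Organizations" and "Twitter" are assumed names for illustration
  const orgs = await fetchAllRecords(airtableBase("Organizations").select())
  for (const org of orgs) {
    // publish() persists the job in Postgres until the worker picks it up
    await pgBossQueue.publish("twitterUserObject", {
      orgId: org.id,
      twitterScreenName: org.get("Twitter"),
    })
  }
}

enqueueScrapingJobs().catch(e => console.error("Error enqueueing jobs", e))
```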

This decision to use a background worker with task queue(s) is driven by the [Heroku
documentation](https://devcenter.heroku.com/articles/background-jobs-queueing), which justifies this approach as
scalable and reliable. The worker fetches the scraping tasks in the background, does the scraping (e.g. accesses the
Twitter API), and puts the results into the Postgres database.

As an alternative to this "pull" approach (scraping tasks are populated by scheduled one-off dynos), a "push" approach
was considered: the backend maintains a web interface (on a separate dyno) which the Climatescape website (via [Netlify
Functions](https://docs.netlify.com/functions/overview)) or [Zapier](https://zapier.com/home) would call to submit
scraping tasks.

The pull-based approach was chosen because it has the following advantages:
- The dependency on Netlify Functions or Zapier is avoided, reducing the number of concepts that developers have to
learn and environments to manage. Moreover, even with the push approach, one-off scripts in Heroku would likely be
needed anyway to schedule periodic re-scraping of information about all organizations.
- The backend doesn't need to expose a POST or PUT interface, so there is no need to worry about protection and
authentication. Some form of backend web interface might eventually be added, e.g. for monitoring the number of
scraping tasks in the queue(s), but that interface could probably be read-only, so authentication might not be
required.

The decision to use a background worker was made [here](
https://github.com/climatescape/climatescape.org/issues/40#issuecomment-584658556) (see also a few preceding messages
in that thread), initially with the push-based approach in mind. We then decided to use the pull-based approach instead
in [this discussion](https://github.com/climatescape/climatescape.org/pull/87#discussion_r383368702).
12 changes: 12 additions & 0 deletions backend/doc/decisions/4-use-pg-boss-queue.md
@@ -0,0 +1,12 @@
Decided to use the [pg-boss](https://github.com/timgit/pg-boss) queue for managing [background task processing](
3-background-task-processing.md) because:
- It uses Postgres for persistence, which the backend uses anyway (for storing the scraping results), thus reducing the
operational variability of the backend compared to running a separate persistent queue such as RabbitMQ.
- pg-boss is chosen over [graphile-worker](https://github.com/graphile/worker) because it supports multiple queues
(at least I couldn't figure out how to create multiple different queues from graphile-worker's docs, as of version
0.4.0). We will need multiple queues to manage processing with different priorities, e.g. periodic batch scraping vs.
first-time scraping for orgs just added to the website (see the sketch below).
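
A hedged sketch of what multiple queues could look like with pg-boss (queue names and options are illustrative):
```
const PgBoss = require("pg-boss")

async function startWorker() {
  const boss = new PgBoss(process.env.DATABASE_URL)
  await boss.start()

  // First-time scraping for orgs just added to the website: served promptly
  await boss.subscribe("firstTimeScraping", { teamSize: 5 }, async job => {
    console.log("first-time scraping", job.data)
  })

  // Periodic batch re-scraping: lower priority, may lag behind
  await boss.subscribe("batchRescraping", { teamSize: 1 }, async job => {
    console.log("batch re-scraping", job.data)
  })
}

startWorker().catch(console.error)
```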

See [this message](https://github.com/climatescape/climatescape.org/issues/40#issuecomment-585112177).

13 changes: 13 additions & 0 deletions backend/doc/decisions/5-use-yarn.md
@@ -0,0 +1,13 @@
Decided to use the yarn package manager instead of npm because of an [issue](https://github.com/yarnpkg/yarn/issues/5962)
with node-gyp package installation on Node 12.15.0 and macOS, which I was able to resolve only in yarn by running
```
$ yarn global add node-gyp
$ yarn global remove node-gyp
```

As described [here](https://github.com/yarnpkg/yarn/issues/5962#issuecomment-435576447).

See [extra context and discussion](https://github.com/climatescape/climatescape.org/pull/87#discussion_r383366463) about
the issue and the choice of yarn over npm.

Using Yarn 1.x because there are still some problems with using Yarn 2 on Heroku.
13 changes: 13 additions & 0 deletions backend/doc/decisions/6-push-data-to-airtable.md
@@ -0,0 +1,13 @@
In order to make use of the scraping results in the Gatsby-generated Climatescape website, we decided to push the
scraped data back to Airtable in the [background worker](3-background-task-processing.md), *in addition* to storing this
data in the Postgres database.

An alternative approach would be to connect Gatsby to Postgres via the [gatsby-source-pg](
https://www.gatsbyjs.org/packages/gatsby-source-pg/) plugin.

We chose to push data to Airtable from the backend to simplify the Gatsby setup, considering that some pushing of data
from the backend to Airtable is needed anyway to enable sorting the organizations in the Airtable content management
interface according to their weight ("Climatescape rank"), which is [one of the goals of the scraping automation
project](https://github.com/climatescape/climatescape.org/issues/40#issue-558680900).
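
A hypothetical sketch of such a push, using airtable.js's `update` (the base id appears in `backend/src/airtable.js`;
the table and field names are illustrative assumptions, not the actual schema):
```
const Airtable = require("airtable")

const base = new Airtable({ apiKey: process.env.AIRTABLE_API_KEY }).base(
  "appNYMWxGF1jMaf5V"
)

// "Organizations" and "Climatescape Rank" are assumed names for illustration
async function pushClimatescapeRank(recordId, rank) {
  await base("Organizations").update(recordId, { "Climatescape Rank": rank })
}

module.exports = { pushClimatescapeRank }
```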

See [this message](https://github.com/climatescape/climatescape.org/pull/87#issuecomment-590864830).
45 changes: 45 additions & 0 deletions backend/docker-compose.yml
@@ -0,0 +1,45 @@
version: '3.7'
services:
  web:
    build: .
    depends_on:
      - db
    volumes:
      - ./src:/opt/src
    working_dir: /opt/src
    command: pm2-dev web.js
    ports:
      - '3000:3000'
    environment:
      WITHIN_CONTAINER: 'true'
  worker:
    build: .
    depends_on:
      - db
    volumes:
      - ./src:/opt/src
    working_dir: /opt/src
    command: pm2-dev worker.js
    environment:
      WITHIN_CONTAINER: 'true'
  db:
    build:
      context: .
      dockerfile: Dockerfile.postgres
    volumes:
      - dbdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: postgres
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
    ports:
      - published: 5432
        target: 5432
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 1s
      timeout: 5s
      retries: 10

volumes:
  dbdata:
50 changes: 50 additions & 0 deletions backend/package.json
@@ -0,0 +1,50 @@
{
  "engines": {
    "node": "12.15.0",
    "yarn": "1.x"
  },
  "dependencies": {
    "airtable": "^0.8.1",
    "dotenv": "^8.2.0",
    "fastify": "^2.12.0",
    "global": "^4.4.0",
    "knex": "^0.20.10",
    "lodash": "^4.17.15",
    "log-timestamp": "^0.3.0",
    "node-gyp": "^6.1.0",
    "p-memoize": "^4.0.0",
    "pg": "^7.18.1",
    "pg-boss": "^4.0.0-beta5",
    "pg-hstore": "^2.3.3",
    "pg-pool": "^2.0.10",
    "sequelize": "^5.21.5",
    "swagger": "^0.7.5",
    "twitter-lite": "^0.9.4",
    "url": "^0.11.0"
  },
  "devDependenciesComment": "Have to use eslint 5.x until https://youtrack.jetbrains.com/issue/WEB-43692 is fixed",
  "devDependencies": {
    "eslint": "^5.16.0",
    "eslint-plugin-jest": "^23.8.0",
    "jest": "^25.1.0",
    "jest-expect-message": "^1.0.2",
    "supertest": "^4.0.2",
    "wait-for-expect": "^3.0.2"
  },
  "optionalDependenciesComment": "explicitly making fsevents optional to skip trying to install fsevents on Linux in Docker image",
  "optionalDependencies": {
    "fsevents": "2.1.2"
  },
  "scripts": {
    "about:test": "echo '--forceExit is needed because of https://github.com/facebook/jest/issues/9473'",
    "test": "docker-compose up -d db worker && docker-compose rm -fsv web && jest --detectOpenHandles --runInBand --forceExit",
    "about:jest": "echo '--forceExit is needed because of https://github.com/facebook/jest/issues/9473'",
    "jest": "jest --detectOpenHandles --runInBand --forceExit"
  },
  "jest": {
    "setupFilesAfterEnv": [
      "jest-expect-message"
    ],
    "verbose": true
  }
}
2 changes: 2 additions & 0 deletions backend/postgres-init.sql
@@ -0,0 +1,2 @@
-- Needed as per https://github.com/timgit/pg-boss/blob/master/docs/usage.md#database-installation
CREATE EXTENSION pgcrypto;
21 changes: 21 additions & 0 deletions backend/src/addFirstTimeScrapingJobs.js
@@ -0,0 +1,21 @@
const { setupPgBossQueue } = require("./pg")
const {
  addFirstTimeTwitterUserObjectScrapingJobs,
} = require("./twitterUserObjectScraping")

async function addFirstTimeScrapingJobs() {
  const pgBossQueue = await setupPgBossQueue()
  await addFirstTimeTwitterUserObjectScrapingJobs(pgBossQueue)
}

if (require.main === module) {
  ;(async () => {
    try {
      await addFirstTimeScrapingJobs()
    } catch (e) {
      console.error("Error adding first-time scraping jobs", e)
    }
  })()
}

module.exports = { addFirstTimeScrapingJobs }
44 changes: 44 additions & 0 deletions backend/src/airtable.js
@@ -0,0 +1,44 @@
const Airtable = require("airtable")
const { configureEnvironment, sleep } = require("./utils")

configureEnvironment()
Airtable.configure({
  endpointUrl: "https://api.airtable.com",
  apiKey: process.env.AIRTABLE_API_KEY,
})
const airtableBase = Airtable.base("appNYMWxGF1jMaf5V")

/**
 * This function is adapted from airtable.js: https://github.com/Airtable/airtable.js/blob/
 * 31fd0a089fee87832760f35c7270eae283972e35/lib/query.js#L116-L135 to include waits (to avoid accidentally hitting
 * Airtable's rate limits, see https://github.com/Airtable/airtable.js/issues/30) and to add more logging.
 * @returns {Promise<Array<Object>>}
 */
function fetchAllRecords(airtableQuery) {
  return new Promise((resolve, reject) => {
    const allRecords = []
    airtableQuery.eachPage(
      function page(pageRecords, fetchNextPage) {
        allRecords.push(...pageRecords)
        console.log(
          `Fetched ${pageRecords.length} records, now ${allRecords.length} in total`
        )
        // The rate limit is 5 rps, but we don't try to be even close to that because there may be concurrent
        // operations with the Airtable API happening in the backend system, e.g. writing scraped data to Airtable.
        sleep(1000)
          .then(() => fetchNextPage())
          .catch(error => reject(error))
      },
      function error(err) {
        if (err) {
          reject(err)
        } else {
          console.log(`Fetched ${allRecords.length} records in total`)
          resolve(allRecords)
        }
      }
    )
  })
}

module.exports = { airtableBase, fetchAllRecords }
36 changes: 36 additions & 0 deletions backend/src/app.js
@@ -0,0 +1,36 @@
const fastify = require("fastify")({ logger: true })

const { setupPgBossQueue } = require("./pg")
const { TWITTER_USER_OBJECT } = require("./twitterUserObjectScraping")

async function buildFastify() {
  const pgBossQueue = await setupPgBossQueue()

  fastify.route({
    method: "POST",
    url: "/twitterUserObject",
    schema: {
      body: {
        type: "object",
        required: ["orgId", "twitterScreenName"],
        properties: {
          orgId: { type: "string" },
          twitterScreenName: { type: "string" },
        },
      },
    },
    handler(req, res) {
      fastify.log.info(
        "Received request to scrape Twitter user object: ",
        req.body
      )
      return pgBossQueue.publish(TWITTER_USER_OBJECT, req.body)
    },
  })

  await fastify.listen(process.env.PORT || 3000, "0.0.0.0")
  fastify.log.info(`server listening on ${fastify.server.address().port}`)
  return fastify
}

module.exports = buildFastify