Backend Scraper #87

Merged 24 commits on Mar 21, 2020

Commits:
- `29879c6` Revert "Revert "Add backend, part of #40"" (bloudermilk, Feb 24, 2020)
- `56c3e7d` Fix backend readme to mention yarn 1.x, not 2 install, and link to th… (leventov, Feb 25, 2020)
- `d6ccd83` More links from decision docs to discussions. (leventov, Feb 25, 2020)
- `4f70864` Add comment to isProduction (leventov, Feb 25, 2020)
- `6b3a569` backend: add airtable.js (leventov, Feb 26, 2020)
- `14c7c1b` backend: reformat pg.js according to the project's style (leventov, Feb 26, 2020)
- `3ca9d7c` backend: reformat setupScraping (leventov, Feb 26, 2020)
- `34a99c9` backend: reformat worker.js (leventov, Feb 26, 2020)
- `b68663e` backend: migrate to sequelize (leventov, Feb 26, 2020)
- `91c6899` backend: add backupAirtable.js, setupAirtableBackup.js (leventov, Feb 27, 2020)
- `7166a87` backend: update yarn instructions in README (leventov, Feb 27, 2020)
- `6fb93a4` backend: in backupAirtable.js, write orgs to Postgres (leventov, Feb 27, 2020)
- `003bbc8` Update decision docs (leventov, Mar 2, 2020)
- `9563e73` Migrate to Knex, add backupAirtable.test.js (leventov, Mar 5, 2020)
- `ae3615e` Add decision doc about pushing data to Airtable from backend (leventov, Mar 5, 2020)
- `e5369a1` Add determineOrgsToScrapeFirstTime function (leventov, Mar 6, 2020)
- `53f1866` Add addFirstTimeScrapingJobs() function (leventov, Mar 7, 2020)
- `337cf32` Connect to Twitter API (leventov, Mar 8, 2020)
- `127f3d7` Refactor twitter.js and addFirstTimeScrapingJobs.js (leventov, Mar 9, 2020)
- `450614e` Complete first-time scraping of Twitter followers (leventov, Mar 9, 2020)
- `651595b` backend: refactoring: store just Twitter followers -> store the whole… (leventov, Mar 10, 2020)
- `1a4d346` backend: refactor: extract twitterUserObjectScraping.js from worker.js (leventov, Mar 10, 2020)
- `e87151c` backend: refactor: extract firstTimeScraping.js from addFirstTimeScra… (leventov, Mar 10, 2020)
- `de6bd42` Fix bug in getTwitterScreenName() (leventov, Mar 10, 2020)
14 changes: 14 additions & 0 deletions .idea/inspectionProfiles/Project_Default.xml

(Generated file; diff not rendered.)

4 changes: 4 additions & 0 deletions backend/.dockerignore
@@ -0,0 +1,4 @@
node_modules
npm-debug.log
yarn-error.log
.env
20 changes: 20 additions & 0 deletions backend/.eslintrc.js
@@ -0,0 +1,20 @@
module.exports = {
  env: {
    browser: false,
    node: true,
    es6: true,
  },
  plugins: ["prettier", "jest"],
  extends: [
    "airbnb",
    "plugin:prettier/recommended",
    "plugin:jest/recommended",
  ],
  rules: {
    "no-console": "off",
    "no-return-await": "off", // See https://youtrack.jetbrains.com/issue/WEB-39569
    "func-names": ["error", "as-needed"],
    "no-unused-vars": ["error", { "args": "none" }],
    "jest/valid-expect": "off", // jest-expect-message adds a parameter to expect()
  },
}
3 changes: 3 additions & 0 deletions backend/.gitignore
@@ -0,0 +1,3 @@
node_modules
npm-debug.log
yarn-error.log
11 changes: 11 additions & 0 deletions backend/Dockerfile
@@ -0,0 +1,11 @@
# The version should stay in sync with package.json
# The -alpine variant doesn't work because yarn needs Python during package installation
FROM node:12.15.0

# Install app dependencies.
COPY package.json yarn.lock ./

# ignore-engines to skip trying to install fsevents on Linux
RUN yarn config set ignore-engines true && yarn install

RUN yarn global add pm2
4 changes: 4 additions & 0 deletions backend/Dockerfile.postgres
@@ -0,0 +1,4 @@
# 12.1 is the current version on Heroku; it should be periodically checked and kept in sync (see https://data.heroku.com/)
FROM postgres:12.1-alpine

COPY postgres-init.sql /docker-entrypoint-initdb.d
2 changes: 2 additions & 0 deletions backend/Procfile
@@ -0,0 +1,2 @@
web: node src/web.js
worker: node src/worker.js
31 changes: 31 additions & 0 deletions backend/README.md
@@ -0,0 +1,31 @@
# Backend
The backend automatically scrapes data for the Climatescape website.

## Overview
The backend is written in [Node.js](doc/decisions/2-use-node.md) and deployed on [Heroku](doc/decisions/1-use-heroku.md).
It consists of two apps: *web* (not to be confused with the Climatescape website itself) and *worker*, which does the
[background processing](doc/decisions/3-background-task-processing.md). The *web* app pushes jobs to a persistent
[pg-boss](doc/decisions/4-use-pg-boss-queue.md) queue backed by Postgres.

## Local setup

1. Install Node 12.15.0 using [`nvm`](https://github.com/nvm-sh/nvm#install--update-script)
2. [Install yarn 1.x](https://classic.yarnpkg.com/en/docs/install)
3. Run `yarn config set ignore-engines true && yarn install`
4. Install and start [Docker Desktop](https://www.docker.com/products/docker-desktop)

If you have problems installing dependencies (running the `yarn` command) on macOS, try the following:
1. Follow the instructions on [this page](https://github.com/nodejs/node-gyp/blob/master/macOS_Catalina.md)
2. `brew install libpq` and follow the instructions about modifying the `PATH`, `LDFLAGS`, `CPPFLAGS`, and `PKG_CONFIG_PATH`
variables printed by Homebrew at the end of the installation.

Run tests via `yarn test`.

For a faster test/debug loop, first start the `db` and `worker` containers separately (`docker-compose up -d db worker`),
and then run `yarn jest`.

To test the full formation, use `docker-compose up -d` and ping the web app via
```
curl -X POST http://127.0.0.1:3000/twitterUserObject --header 'Content-type: application/json' --data '{"orgId":"climatescape", "twitterScreenName":"climatescape"}'
```
To enter the Postgres container for debugging, use `docker exec -it backend_db_1 psql -U postgres`.
6 changes: 6 additions & 0 deletions backend/doc/decisions/1-use-heroku.md
@@ -0,0 +1,6 @@
Decided to implement the automatic data scraping backend as a custom-built system deployed on Heroku, rather than
using a specialized scraping platform such as Apify, because we think that the backend system will eventually outgrow
mere data scraping.

See [this message](https://github.com/climatescape/climatescape.org/issues/40#issuecomment-583264142) and the preceding
messages in the thread.
5 changes: 5 additions & 0 deletions backend/doc/decisions/2-use-node.md
@@ -0,0 +1,5 @@
Decided to use the Node.js platform for the backend system ([deployed on Heroku](1-use-heroku.md)) for two main reasons:
- It is a platform familiar to the key stakeholder of the project, Brendan.
- Uniformity with the frontend (Netlify) part of the project, and the potential to share some model code in the future.

See [this message in Slack](https://climatescape.slack.com/archives/CT42YRV3P/p1580385766005700) and messages around it.
29 changes: 29 additions & 0 deletions backend/doc/decisions/3-background-task-processing.md
@@ -0,0 +1,29 @@
The automatic data scraping system [deployed on Heroku](1-use-heroku.md) is organized as follows:

1. A "worker" dyno processes jobs (scraping tasks) from queue(s) in background.
2. Scraping tasks are populated to the queue(s) via [one-off dynos](
https://devcenter.heroku.com/articles/one-off-dynos) which are scheduled to be run periodically via [Heroku
Scheduler](https://devcenter.heroku.com/articles/scheduler). For example, a one-off script can pull the data from the
Climatescape's Airtable and schedule initial scraping tasks for newly added orgs.

This decision to use a background worker with task queue(s) is driven by the [Heroku
documentation](https://devcenter.heroku.com/articles/background-jobs-queueing), which describes this approach as
scalable and reliable. The worker fetches scraping tasks in the background, does the scraping (e.g. accesses the
Twitter API), and puts the results into the Postgres database.

As an alternative to this "pull approach" (scraping tasks are added by scheduled one-off dynos), a "push approach"
was considered: the backend maintains a web interface (on a separate dyno) which the Climatescape website (via [Netlify
Functions](https://docs.netlify.com/functions/overview)) or [Zapier](https://zapier.com/home) would call to push
scraping tasks.

The pull-based approach was chosen because it has the following advantages:
- The dependency on Netlify Functions or Zapier is avoided, reducing the number of concepts that developers have to
learn and environments to manage. Moreover, even with the push approach, one-off scripts in Heroku would likely be
needed anyway to schedule periodic re-scraping of information about all organizations.
- The backend doesn't need to expose a POST or PUT interface, so there is no need to worry about protection and
authentication. Some form of backend web interface might eventually be added, e.g. for monitoring the number of
scraping tasks in the queue(s), but that interface could probably be read-only, so authentication might not be
required.

The decision to use a background worker was made [here](
https://github.com/climatescape/climatescape.org/issues/40#issuecomment-584658556) (see also a few preceding messages
in that thread), initially with the push-based approach. We then switched from the push-based to the pull-based
approach in [this discussion](https://github.com/climatescape/climatescape.org/pull/87#discussion_r383368702).
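A minimal sketch of this pull-based flow with pg-boss: the queue name and the `fetchNewlyAddedOrgs`/`scrapeOrg` helpers are illustrative placeholders, not the actual functions in this PR.

```
const PgBoss = require("pg-boss")

// Run from a scheduled one-off dyno: enqueue one job per newly added org.
// fetchNewlyAddedOrgs() is a hypothetical helper, e.g. pulling from Airtable.
async function scheduleScrapingJobs() {
  const boss = new PgBoss(process.env.DATABASE_URL)
  await boss.start()
  const newOrgs = await fetchNewlyAddedOrgs()
  for (const org of newOrgs) {
    await boss.publish("firstTimeScraping", { orgId: org.id })
  }
}

// Run on the worker dyno: pick jobs up in the background and store the
// results in Postgres. scrapeOrg() is likewise hypothetical.
async function startWorker() {
  const boss = new PgBoss(process.env.DATABASE_URL)
  await boss.start()
  await boss.subscribe("firstTimeScraping", job => scrapeOrg(job.data.orgId))
}
```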
12 changes: 12 additions & 0 deletions backend/doc/decisions/4-use-pg-boss-queue.md
@@ -0,0 +1,12 @@
Decided to use the [pg-boss](https://github.com/timgit/pg-boss) queue for managing [background task processing](
3-background-task-processing.md) because:
- It uses Postgres for persistence, which the backend uses anyway (for storing the scraping results), thus reducing
the operational variability of the backend compared to running a separate persistent queue such as RabbitMQ.
- pg-boss was chosen over [graphile-worker](https://github.com/graphile/worker) because it supports multiple queues
(at least I couldn't figure out how to create multiple different queues from the graphile-worker docs, as of
version 0.4.0). We will need multiple queues to manage processing with different priorities, e.g. periodic batch
scraping vs. first-time scraping for orgs just added to the website.

See [this message](https://github.com/climatescape/climatescape.org/issues/40#issuecomment-585112177).
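For illustration, a hedged sketch of subscribing to two pg-boss queues with different concurrency; the queue names and `handleScrapingJob` are assumptions, not names from this PR.

```
const PgBoss = require("pg-boss")

async function startQueues(handleScrapingJob) {
  const boss = new PgBoss(process.env.DATABASE_URL)
  await boss.start()
  // Give urgent first-time scraping more worker concurrency (teamSize) than
  // the periodic batch re-scraping queue.
  await boss.subscribe("firstTimeScraping", { teamSize: 4 }, handleScrapingJob)
  await boss.subscribe("periodicBatchScraping", { teamSize: 1 }, handleScrapingJob)
}
```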

13 changes: 13 additions & 0 deletions backend/doc/decisions/5-use-yarn.md
@@ -0,0 +1,13 @@
Decided to use the yarn package manager instead of npm because of an [issue](https://github.com/yarnpkg/yarn/issues/5962)
with node-gyp package installation on Node 12.15.0 and Mac OS X, which I was able to resolve only with yarn, by running
```
$ yarn global add node-gyp
$ yarn global remove node-gyp
```

As described [here](https://github.com/yarnpkg/yarn/issues/5962#issuecomment-435576447).

See [extra context and discussion](https://github.com/climatescape/climatescape.org/pull/87#discussion_r383366463) about
the issue and the choice of yarn over npm.

Using Yarn 1.x because there are still some problems with using Yarn 2 on Heroku.
13 changes: 13 additions & 0 deletions backend/doc/decisions/6-push-data-to-airtable.md
@@ -0,0 +1,13 @@
In order to make use of the scraping results in the Gatsby-generated Climatescape website, we decided to push the
scraped data back to Airtable in the [background worker](3-background-task-processing.md), *in addition* to storing this
data in the Postgres database.

An alternative approach is to connect Gatsby to Postgres via the [gatsby-source-pg](
https://www.gatsbyjs.org/packages/gatsby-source-pg/) plugin.

We chose pushing data to Airtable from the backend to simplify the Gatsby setup, considering that some data pushing
from the backend to Airtable is needed anyway to enable sorting the organizations in the Airtable content management
interface according to their weight ("Climatescape rank"), which is [one of the goals of the scraping automation
project](https://github.com/climatescape/climatescape.org/issues/40#issue-558680900).

See [this message](https://github.com/climatescape/climatescape.org/pull/87#issuecomment-590864830).
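A minimal sketch of the push-back step, assuming an `Organizations` table and a `Twitter Followers` field (both names are illustrative, not taken from this PR):

```
const { airtableBase } = require("./airtable")

// Write a scraped metric back to the org's record in Airtable. The table and
// field names here are assumptions for illustration.
async function pushFollowerCount(airtableRecordId, followersCount) {
  await airtableBase("Organizations").update(airtableRecordId, {
    "Twitter Followers": followersCount,
  })
}
```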
45 changes: 45 additions & 0 deletions backend/docker-compose.yml
@@ -0,0 +1,45 @@
version: '3.7'
services:
  web:
    build: .
    depends_on:
      - db
    volumes:
      - ./src:/opt/src
    working_dir: /opt/src
    command: pm2-dev web.js
    ports:
      - '3000:3000'
    environment:
      WITHIN_CONTAINER: 'true'
  worker:
    build: .
    depends_on:
      - db
    volumes:
      - ./src:/opt/src
    working_dir: /opt/src
    command: pm2-dev worker.js
    environment:
      WITHIN_CONTAINER: 'true'
  db:
    build:
      context: .
      dockerfile: Dockerfile.postgres
    volumes:
      - dbdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: postgres
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
    ports:
      - published: 5432
        target: 5432
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 1s
      timeout: 5s
      retries: 10

volumes:
  dbdata:
50 changes: 50 additions & 0 deletions backend/package.json
@@ -0,0 +1,50 @@
{
  "engines": {
    "node": "12.15.0",
    "yarn": "1.x"
  },
  "dependencies": {
    "airtable": "^0.8.1",
    "dotenv": "^8.2.0",
    "fastify": "^2.12.0",
    "global": "^4.4.0",
    "knex": "^0.20.10",
    "lodash": "^4.17.15",
    "log-timestamp": "^0.3.0",
    "node-gyp": "^6.1.0",
    "p-memoize": "^4.0.0",
    "pg": "^7.18.1",
    "pg-boss": "^4.0.0-beta5",
    "pg-hstore": "^2.3.3",
    "pg-pool": "^2.0.10",
    "sequelize": "^5.21.5",
    "swagger": "^0.7.5",
    "twitter-lite": "^0.9.4",
    "url": "^0.11.0"
  },
  "devDependenciesComment": "Have to use eslint 5.x until https://youtrack.jetbrains.com/issue/WEB-43692 is fixed",
  "devDependencies": {
    "eslint": "^5.16.0",
    "eslint-plugin-jest": "^23.8.0",
    "jest": "^25.1.0",
    "jest-expect-message": "^1.0.2",
    "supertest": "^4.0.2",
    "wait-for-expect": "^3.0.2"
  },
  "optionalDependenciesComment": "explicitly making fsevents optional to skip trying to install fsevents on Linux in Docker image",
  "optionalDependencies": {
    "fsevents": "2.1.2"
  },
  "scripts": {
    "about:test": "echo '--forceExit is needed because of https://github.com/facebook/jest/issues/9473'",
    "test": "docker-compose up -d db worker && docker-compose rm -fsv web && jest --detectOpenHandles --runInBand --forceExit",
    "about:jest": "echo '--forceExit is needed because of https://github.com/facebook/jest/issues/9473'",
    "jest": "jest --detectOpenHandles --runInBand --forceExit"
  },
  "jest": {
    "setupFilesAfterEnv": [
      "jest-expect-message"
    ],
    "verbose": true
  }
}
2 changes: 2 additions & 0 deletions backend/postgres-init.sql
@@ -0,0 +1,2 @@
-- Needed as per https://github.com/timgit/pg-boss/blob/master/docs/usage.md#database-installation
CREATE EXTENSION pgcrypto;
21 changes: 21 additions & 0 deletions backend/src/addFirstTimeScrapingJobs.js
@@ -0,0 +1,21 @@
const { setupPgBossQueue } = require("./pg")
const {
  addFirstTimeTwitterUserObjectScrapingJobs,
} = require("./twitterUserObjectScraping")

async function addFirstTimeScrapingJobs() {
  const pgBossQueue = await setupPgBossQueue()
  await addFirstTimeTwitterUserObjectScrapingJobs(pgBossQueue)
}

if (require.main === module) {
  ;(async () => {
    try {
      await addFirstTimeScrapingJobs()
    } catch (e) {
      console.error("Error adding first-time scraping jobs", e)
    }
  })()
}

module.exports = { addFirstTimeScrapingJobs }
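Per the background-processing decision doc, this script would presumably run as a scheduled one-off dyno on Heroku; for a manual run, something along these lines:

```
heroku run node src/addFirstTimeScrapingJobs.js
```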
44 changes: 44 additions & 0 deletions backend/src/airtable.js
@@ -0,0 +1,44 @@
const Airtable = require("airtable")
const { configureEnvironment, sleep } = require("./utils")

configureEnvironment()
Airtable.configure({
  endpointUrl: "https://api.airtable.com",
  apiKey: process.env.AIRTABLE_API_KEY,
})
const airtableBase = Airtable.base("appNYMWxGF1jMaf5V")

/**
 * This function is adapted from airtable.js: https://github.com/Airtable/airtable.js/blob/
 * 31fd0a089fee87832760f35c7270eae283972e35/lib/query.js#L116-L135 to include waits to avoid hitting Airtable's rate
 * limits accidentally (see https://github.com/Airtable/airtable.js/issues/30), and add more logging.
 * @returns {Promise<Array<Object>>}
 */
function fetchAllRecords(airtableQuery) {
  return new Promise((resolve, reject) => {
    const allRecords = []
    airtableQuery.eachPage(
      function page(pageRecords, fetchNextPage) {
        allRecords.push(...pageRecords)
        console.log(
          `Fetched ${pageRecords.length} records, now ${allRecords.length} in total`
        )
        // The rate limit is 5 rps, but we don't try to be even close to that because there may be concurrent
        // operations with Airtable API happening in the backend system, e.g. writing scraped data to Airtable.
        sleep(1000)
          .then(() => fetchNextPage())
          .catch(error => reject(error))
      },
      function error(err) {
        if (err) {
          reject(err)
        } else {
          console.log(`Fetched ${allRecords.length} records in total`)
          resolve(allRecords)
        }
      }
    )
  })
}

module.exports = { airtableBase, fetchAllRecords }
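`src/utils.js` is not part of this diff; a plausible minimal sketch of the two helpers it exports, given that `dotenv` is among the dependencies (an assumption, not the actual file):

```
// Hypothetical reconstruction of src/utils.js (not included in this diff).
const dotenv = require("dotenv")

// Load variables from .env in local development; on Heroku the config vars
// are already present in process.env.
function configureEnvironment() {
  dotenv.config()
}

// Promise-based delay, used to stay well under Airtable's rate limit.
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms))
}

module.exports = { configureEnvironment, sleep }
```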
36 changes: 36 additions & 0 deletions backend/src/app.js
@@ -0,0 +1,36 @@
const fastify = require("fastify")({ logger: true })

const { setupPgBossQueue } = require("./pg")
const { TWITTER_USER_OBJECT } = require("./twitterUserObjectScraping")

async function buildFastify() {
  const pgBossQueue = await setupPgBossQueue()

  fastify.route({
    method: "POST",
    url: "/twitterUserObject",
    schema: {
      body: {
        type: "object",
        required: ["orgId", "twitterScreenName"],
        properties: {
          orgId: { type: "string" },
          twitterScreenName: { type: "string" },
        },
      },
    },
    handler(req, res) {
      fastify.log.info(
        "Received request to scrape Twitter user object: ",
        req.body
      )
      return pgBossQueue.publish(TWITTER_USER_OBJECT, req.body)
    },
  })

  await fastify.listen(process.env.PORT || 3000, "0.0.0.0")
  fastify.log.info(`server listening on ${fastify.server.address().port}`)
  return fastify
}

module.exports = buildFastify
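Since `supertest` is among the devDependencies, the route could be exercised in a test along these lines (a sketch assuming the `db` container is running; not a test from this PR):

```
const supertest = require("supertest")
const buildFastify = require("./app")

test("enqueues a Twitter user object scraping job", async () => {
  const fastify = await buildFastify()
  await supertest(fastify.server)
    .post("/twitterUserObject")
    .send({ orgId: "climatescape", twitterScreenName: "climatescape" })
    .expect(200)
  await fastify.close()
})
```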