Backend Scraper #87

Merged — 24 commits, Mar 21, 2020

Commits
29879c6
Revert "Revert "Add backend, part of #40""
bloudermilk Feb 24, 2020
56c3e7d
Fix backend readme to mention yarn 1.x, not 2 install, and link to th…
leventov Feb 25, 2020
d6ccd83
More links from decision docs to discussions.
leventov Feb 25, 2020
4f70864
Add comment to isProduction
leventov Feb 25, 2020
6b3a569
backend: add airtable.js
leventov Feb 26, 2020
14c7c1b
backend: reformat pg.js according to the project's style
leventov Feb 26, 2020
3ca9d7c
backend: reformat setupScraping
leventov Feb 26, 2020
34a99c9
backend: reformat worker.js
leventov Feb 26, 2020
b68663e
backend: migrate to sequelize
leventov Feb 26, 2020
91c6899
backend: add backupAirtable.js, setupAirtableBackup.js
leventov Feb 27, 2020
7166a87
backend: update yarn instructions in README
leventov Feb 27, 2020
6fb93a4
backend: in backupAirtable.js, write orgs to Postgres
leventov Feb 27, 2020
003bbc8
Update decision docs
leventov Mar 2, 2020
9563e73
Migrate to Knex, add backupAirtable.test.js
leventov Mar 5, 2020
ae3615e
Add decision doc about pushing data to Airtable from backend
leventov Mar 5, 2020
e5369a1
Add determineOrgsToScrapeFirstTime function
leventov Mar 6, 2020
53f1866
Add addFirstTimeScrapingJobs() function
leventov Mar 7, 2020
337cf32
Connect to Twitter API
leventov Mar 8, 2020
127f3d7
Refactor twitter.js and addFirstTimeScrapingJobs.js
leventov Mar 9, 2020
450614e
Complete first-time scraping of Twitter followers
leventov Mar 9, 2020
651595b
backend: refactoring: store just Twitter followers -> store the whole…
leventov Mar 10, 2020
1a4d346
backend: refactor: extract twitterUserObjectScraping.js from worker.js
leventov Mar 10, 2020
e87151c
backend: refactor: extract firstTimeScraping.js from addFirstTimeScra…
leventov Mar 10, 2020
de6bd42
Fix bug in getTwitterScreenName()
leventov Mar 10, 2020
4 changes: 4 additions & 0 deletions backend/.dockerignore
@@ -0,0 +1,4 @@
node_modules
npm-debug.log
yarn-error.log
.env
3 changes: 3 additions & 0 deletions backend/.gitignore
@@ -0,0 +1,3 @@
node_modules
npm-debug.log
yarn-error.log
11 changes: 11 additions & 0 deletions backend/Dockerfile
@@ -0,0 +1,11 @@
# The version should stay in sync with package.json
# -alpine version doesn't work because yarn needs Python during installation of packages
FROM node:12.15.0

# Install app dependencies.
COPY package.json yarn.lock ./

# ignore-engines to skip trying to install fsevents on Linux
RUN yarn config set ignore-engines true && yarn install

RUN yarn global add pm2
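
# Note: the application source isn't copied into the image here; for local development,
# docker-compose.yml mounts ./src into the container and supplies the pm2-dev start command.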
4 changes: 4 additions & 0 deletions backend/Dockerfile.postgres
@@ -0,0 +1,4 @@
# 12.1 is the current version on Heroku; it should be periodically checked and kept in sync (see https://data.heroku.com/)
FROM postgres:12.1-alpine

COPY postgres-init.sql /docker-entrypoint-initdb.d
2 changes: 2 additions & 0 deletions backend/Procfile
@@ -0,0 +1,2 @@
web: node src/web.js
worker: node src/worker.js
29 changes: 29 additions & 0 deletions backend/README.md
@@ -0,0 +1,29 @@
# Backend
The backend automatically scrapes data for the Climatescape website.

## Overview
The backend is written in [Node.js](doc/decisions/2-use-node.md) and is deployed on [Heroku](doc/decisions/1-use-heroku.md).
It consists of two apps: *web* (not to be confused with the Climatescape website itself) and *worker*, for [background
processing](doc/decisions/3-background-task-processing.md). web pushes jobs to a persistent [pg-boss](
doc/decisions/4-use-pg-boss-queue.md) queue backed by Postgres.

## Local setup

1. Install Node 12.15.0 using [`nvm`](https://github.com/nvm-sh/nvm#install--update-script)
2. Install [yarn 1.x](https://classic.yarnpkg.com/en/docs/install) ([why yarn 1.x](doc/decisions/5-use-yarn.md))
3. Install and start [Docker Desktop](https://www.docker.com/products/docker-desktop)

If you have problems installing the dependencies (running the `yarn` command) on Mac OS, try the following:
1. Follow the instructions on [this page](https://github.com/nodejs/node-gyp/blob/master/macOS_Catalina.md)
2. `brew install libpq` and follow the instructions about modifying the `PATH`, `LDFLAGS`, `CPPFLAGS`, and `PKG_CONFIG_PATH`
variables that Homebrew prints at the end of the installation.

Run tests via `yarn --ignore-engines test`.

For a faster test or debug loop, first start the `db` and `worker` containers separately: `docker-compose up -d db worker`,
and then run `yarn --ignore-engines jest`.

For full formation testing, use `docker-compose up -d` and ping the web app via
```
curl -X POST http://127.0.0.1:3000/twitterFollowers --header 'Content-type: application/json' --data '{"orgId":"climatescape", "twitterUrl":"twitter.com/climatescape"}'
```
6 changes: 6 additions & 0 deletions backend/doc/decisions/1-use-heroku.md
@@ -0,0 +1,6 @@
Decided to implement the automatic data scraping backend as a custom-built system deployed on Heroku, rather than using
a specialized scraping platform such as Apify, because we think the backend system will eventually outgrow mere
data scraping.

See [this message](https://github.com/climatescape/climatescape.org/issues/40#issuecomment-583264142) and the preceding
messages in the thread.
5 changes: 5 additions & 0 deletions backend/doc/decisions/2-use-node.md
@@ -0,0 +1,5 @@
Decided to use the Node.js platform for the backend system ([deployed on Heroku](1-use-heroku.md)) for two main reasons:
- It is a platform familiar to the key stakeholder of the project, Brendan.
- Uniformity with the frontend (Netlify) part of the project, and the potential to share some model code in the future.

See [this message in Slack](https://climatescape.slack.com/archives/CT42YRV3P/p1580385766005700) and messages around it.
13 changes: 13 additions & 0 deletions backend/doc/decisions/3-background-task-processing.md
@@ -0,0 +1,13 @@
Decided to split the automatic data scraping system [deployed on Heroku](1-use-heroku.md) into two dynos:
1. "web" (not to be confused with the Climatescape website itself)
2. "worker", which does the scraping in the background.

web is expected to be called from hooks on the Climatescape website or from Zapier. web puts scraping jobs into
a persistent queue and responds to the caller immediately. worker fetches the jobs in the background, does the scraping
(e.g. accesses the Twitter API), and puts the results into a Postgres database.

This decision is driven by the [Heroku documentation](https://devcenter.heroku.com/articles/background-jobs-queueing)
which justifies this approach as scalable and reliable.

See [this message](https://github.com/climatescape/climatescape.org/issues/40#issuecomment-584658556) and a few
preceding messages in the thread.
12 changes: 12 additions & 0 deletions backend/doc/decisions/4-use-pg-boss-queue.md
@@ -0,0 +1,12 @@
Decided to use the [pg-boss](https://github.com/timgit/pg-boss) queue for managing [background task processing](
3-background-task-processing.md) because:
- It uses Postgres for persistence, which the backend system uses anyway (for storing the scraping results), thus
reducing the operational variability of the backend compared to running a separate persistent queue such as RabbitMQ.
- pg-boss is chosen over [graphile-worker](https://github.com/graphile/worker) because it supports multiple queues
(at least I couldn't figure out how to create multiple different queues from the docs of graphile-worker, as of
version 0.4.0). We will need multiple queues to manage processing with different priorities, e.g. periodic batch
scraping vs. first-time scraping for orgs just added to the website (see the sketch below).

See [this message](https://github.com/climatescape/climatescape.org/issues/40#issuecomment-585112177).
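
For illustration, a minimal sketch of two such queues coexisting on one pg-boss instance (the queue names here are
hypothetical, not final):
```js
// Sketch: two independent pg-boss queues with separate subscribers.
// pgBossQueue is the instance exported from src/pg.js.
const { pgBossQueue } = require('../../src/pg');

async function main() {
  await pgBossQueue.start();
  // High-priority: first-time scraping for orgs just added to the website.
  await pgBossQueue.subscribe('firstTimeScraping', async job => {
    console.log('first-time scrape:', job.data.orgId);
  });
  // Lower-priority: periodic batch re-scraping.
  await pgBossQueue.subscribe('batchScraping', async job => {
    console.log('batch re-scrape:', job.data.orgId);
  });
  // Producers publish to either queue independently.
  await pgBossQueue.publish('firstTimeScraping', { orgId: 'climatescape' });
}

main().catch(console.error);
```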

10 changes: 10 additions & 0 deletions backend/doc/decisions/5-use-yarn.md
@@ -0,0 +1,10 @@
Decided to use the yarn package manager instead of npm because of an [issue](https://github.com/yarnpkg/yarn/issues/5962)
with node-gyp package installation on Node 12.15.0 and Mac OS X, which I was able to resolve only with yarn, by running
```
$ yarn global add node-gyp
$ yarn global remove node-gyp
```

As described [here](https://github.com/yarnpkg/yarn/issues/5962#issuecomment-435576447).

Using Yarn 1.x because there are still some problems with using Yarn 2 on Heroku.
3 changes: 3 additions & 0 deletions backend/doc/decisions/6-use-fastify-server.md
@@ -0,0 +1,3 @@
Chose [Fastify](https://www.fastify.io/) as the Node web server over [Express](https://expressjs.com/) because the latter
doesn't support async/await handlers out of the box as of version 4.x. Fastify also comes with JSON Schema validation
and logging, which is useful.
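
For illustration, a minimal sketch of an async handler in Fastify 2.x (Express 4.x needs extra wrappers for this):

```js
// Sketch: Fastify accepts async handlers directly; the resolved value
// is serialized as the JSON response body.
const fastify = require('fastify')({ logger: true });

fastify.get('/ping', async () => {
  return { pong: true };
});

fastify.listen(3000).catch(err => {
  fastify.log.error(err);
  process.exit(1);
});
```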
45 changes: 45 additions & 0 deletions backend/docker-compose.yml
@@ -0,0 +1,45 @@
version: '3.7'
services:
  web:
    build: .
    depends_on:
      - db
    volumes:
      - ./src:/opt/src
    working_dir: /opt/src
    command: pm2-dev web.js
    ports:
      - '3000:3000'
    environment:
      WITHIN_CONTAINER: 'true'
  worker:
    build: .
    depends_on:
      - db
    volumes:
      - ./src:/opt/src
    working_dir: /opt/src
    command: pm2-dev worker.js
    environment:
      WITHIN_CONTAINER: 'true'
  db:
    build:
      context: .
      dockerfile: Dockerfile.postgres
    volumes:
      - dbdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: postgres
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
    ports:
      - published: 5432
        target: 5432
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 1s
      timeout: 5s
      retries: 10

volumes:
  dbdata:
30 changes: 30 additions & 0 deletions backend/package.json
@@ -0,0 +1,30 @@
{
  "engines": {
    "node": "12.15.0",
    "yarn": "1.x"
  },
  "dependencies": {
    "fastify": "^2.12.0",
    "global": "^4.4.0",
    "node-gyp": "^6.1.0",
    "pg": "^7.18.1",
    "pg-boss": "^4.0.0-beta5",
    "pg-pool": "^2.0.10",
    "swagger": "^0.7.5",
    "url": "^0.11.0"
  },
  "devDependencies": {
    "jest": "^25.1.0",
    "supertest": "^4.0.2"
  },
  "optionalDependenciesComment": "explicitly making fsevents optional to skip trying to install fsevents on Linux in Docker image",
  "optionalDependencies": {
    "fsevents": "2.1.2"
  },
  "scripts": {
    "about:test": "echo '--forceExit is needed because of https://github.com/facebook/jest/issues/9473'",
    "test": "docker-compose up -d db worker && docker-compose rm -fsv web && jest --detectOpenHandles --runInBand --verbose --forceExit",
    "about:jest": "echo '--forceExit is needed because of https://github.com/facebook/jest/issues/9473'",
    "jest": "jest --detectOpenHandles --runInBand --verbose --forceExit"
  }
}
2 changes: 2 additions & 0 deletions backend/postgres-init.sql
@@ -0,0 +1,2 @@
-- Needed as per https://github.com/timgit/pg-boss/blob/master/docs/usage.md#database-installation
CREATE EXTENSION pgcrypto;
32 changes: 32 additions & 0 deletions backend/src/app.js
@@ -0,0 +1,32 @@
const fastify = require('fastify')({ logger: true });

const { pgBossQueue } = require('./pg');

fastify.route({
bloudermilk (Member Author): Can you elaborate on how you were planning to trigger this job? I had envisioned this working w/o any kind of web server to trigger jobs. For example, using a "cron job" (but the Heroku equivalent) to queue a job every N minutes.

leventov (Collaborator): A Netlify hook should call this endpoint. Even a cron job (which will kick off periodic batch re-scraping of all organisations' data, for example) will need to call some simplistic web endpoint which just puts a job into a queue.

I don't see how it could be more direct. Push jobs directly to PostgreSQL via a query? graphile/worker supports this, but pg-boss doesn't (at least, it doesn't document this), so it's a slippery slope. But I also don't see this as a problem.

Even with a direct SQL query to create a job, you'd eventually need some basic auth/protection, I suppose, so you can't really avoid having a thin web layer.

Cron batch processing in Heroku can also be done by starting a dyno which does the batch processing and finishes. But this won't work for first-time scraping of newly added orgs: there should be a worker listening in the background. And since we need this type of worker anyway, it can handle batch scraping, too.

bloudermilk (Member Author):

> will need to call some simplistic web endpoint which just puts a job into a queue.

The way we usually do this is to run a script within the project whose only job is to queue the jobs you want to run. To elaborate, the script would work like this:

$ npm run queue-jobs

  • Fetch organizations from Airtable that either A) have not been scraped or B) are "outdated"
  • For each organization
    • Queue a job to process that organization

The script runs in a couple of seconds and is only responsible for adding the right jobs to the queue. The worker does the rest. This avoids needing the web interface, keeps the logic of which orgs to scrape internal to the app, and removes the dependency on Zapier. (A sketch of such a script appears after this file's diff.)

  method: 'POST',
  url: '/twitterFollowers',
  schema: {
    body: {
      type: 'object',
      required: ['orgId', 'twitterUrl'],
      properties: {
        orgId: { type: 'string' },
        twitterUrl: { type: 'string' }
      }
    }
  },
  handler: function (req, res) {
    fastify.log.info('Received request to scrape Twitter followers: ', req.body);
    return pgBossQueue.publish('twitterFollowers', req.body);
  }
});

async function buildFastify() {
  await pgBossQueue.start();
  await fastify.listen(process.env.PORT || 3000, '0.0.0.0');
  fastify.log.info(`server listening on ${fastify.server.address().port}`);
  return fastify;
}


module.exports = buildFastify;
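
For reference, a minimal sketch of the `queue-jobs`-style script described in the review discussion above (not part of
this PR; `fetchOrgsToScrape` is a hypothetical helper that would query Airtable for unscraped or outdated orgs):

```js
// Sketch of a standalone queueing script, run e.g. via `node src/queueJobs.js`.
const { pgBossQueue } = require('./pg');
// Hypothetical helper: returns orgs that haven't been scraped yet or are outdated.
const fetchOrgsToScrape = require('./fetchOrgsToScrape');

(async () => {
  await pgBossQueue.start();
  const orgs = await fetchOrgsToScrape();
  for (const org of orgs) {
    await pgBossQueue.publish('twitterFollowers', {
      orgId: org.id,
      twitterUrl: org.twitterUrl,
    });
  }
  console.log(`Queued ${orgs.length} scraping jobs`);
  process.exit(0);
})().catch(e => {
  console.error('Error queueing jobs', e);
  process.exit(1);
});
```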
84 changes: 84 additions & 0 deletions backend/src/pg.js
@@ -0,0 +1,84 @@
const Url = require('url');
const Pool = require('pg-pool');

// WITHIN_CONTAINER is set in docker-compose.yml
// We are *not* running within a local container when we run jest tests, e. g. via `yarn test`
const isRunningWithinLocalContainer = process.env.WITHIN_CONTAINER === 'true';
const host = isRunningWithinLocalContainer ? 'db' : 'localhost';
// user, password, and the database name correspond to those set in docker-compose.yml
const pgLocalConnectionString = `postgres://postgres:postgres@${host}:5432/postgres`;

// See https://stackoverflow.com/a/28489160
const isProduction = !!(process.env._ && (process.env._.indexOf('heroku') >= 0));
bloudermilk (Member Author): Any reason we can't just use NODE_ENV to find out if we're running in production?

leventov (Collaborator): I guess it requires more setup actions by the developers locally (export NODE_ENV=dev) and in the Heroku setup (heroku config:set NODE_ENV="production"), imagining some developer wants to test the deployment in their own free Heroku account (like I do right now, but other developers may want to do this, too).

The above line doesn't require anything like that.

bloudermilk (Member Author): @leventov Heroku sets that variable by default, and I would assume we never set it locally. It's standard practice to use NODE_ENV in Node.js apps.

console.log('isProduction: ' + isProduction);
const pgConnectionString = isProduction ? process.env.DATABASE_URL : pgLocalConnectionString;

const params = Url.parse(pgConnectionString);
const auth = params.auth.split(':');

const pgConfig = {
  user: auth[0],
  password: auth[1],
  host: params.hostname,
  port: params.port,
  database: params.pathname.split('/')[1],
  // If some problems with different SSL config for local dev and Heroku ever arise, they could be matched:
  // 1. Run the following commands (copied from https://gist.github.com/mrw34/c97bb03ea1054afb551886ffc8b63c3b):
  //    openssl req -new -text -passout pass:abcd -subj /CN=localhost -out server.req -keyout privkey.pem
  //    openssl rsa -in privkey.pem -passin pass:abcd -out server.key
  //    openssl req -x509 -in server.req -text -key server.key -out server.crt
  // 2. Add the following lines to Dockerfile.postgres (copied from https://stackoverflow.com/a/55072885):
  //    COPY server.key /var/lib/postgresql/server.key
  //    COPY server.crt /var/lib/postgresql/server.crt
  //
  //    RUN chmod 600 /var/lib/postgresql/server.key
  //    RUN chown postgres:postgres /var/lib/postgresql/server.key
  // 3. Add to the db service in docker-compose.yml:
  //    command:
  //      -c ssl=on -c ssl_cert_file=/var/lib/postgresql/server.crt -c ssl_key_file=/var/lib/postgresql/server.key
  ssl: isProduction
};

const pgPool = new Pool(pgConfig);
pgPool.on('error', (err, client) => console.error(err));

// Should be removed when https://github.com/brianc/node-postgres/issues/1789 is fixed
const connectWrapper = async function() {
  for (let nRetry = 1; ; nRetry++) {
    try {
      const client = await pgPool.connect();
      if (nRetry > 1) {
        console.info('Now successfully connected to Postgres');
      }
      return client;
    } catch (e) {
      if (e.toString().includes('ECONNREFUSED') && nRetry < 5) {
        console.info('ECONNREFUSED connecting to Postgres, ' +
          'maybe container is not ready yet, will retry ' + nRetry);
        // Wait 1 second
        await new Promise(resolve => setTimeout(resolve, 1000));
      } else {
        throw e;
      }
    }
  }
};

const pgPoolWrapper = {
  connect: connectWrapper,
  async query(text, values) {
    const client = await connectWrapper();
    try {
      return client.query(text, values);
    } finally {
      client.release();
    }
  },
};

const PgBoss = require('pg-boss');

const pgBossQueue = new PgBoss({db: { executeSql: pgPoolWrapper.query }});

pgBossQueue.on('error', err => console.error(err));

module.exports = {pgConfig, pgBossQueue, pgPool: pgPoolWrapper};
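
If the NODE_ENV convention suggested in the review above were adopted (Heroku sets NODE_ENV=production by default), the
production check would reduce to a one-liner, sketched here:

```js
const isProduction = process.env.NODE_ENV === 'production';
```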
21 changes: 21 additions & 0 deletions backend/src/setupScraping.js
@@ -0,0 +1,21 @@
const { pgPool } = require('./pg.js');

async function setupScraping() {
  const client = await pgPool.connect();
  try {
    await client.query(
      'CREATE TABLE IF NOT EXISTS scraping_results (' +
      '  org_id text not null,' +
      '  request_type text not null,' +
      '  scraping_time timestamp DEFAULT current_timestamp,' +
      '  result jsonb not null,' +
      '  PRIMARY KEY(org_id, request_type, scraping_time)' +
      ');'
    );
  } finally {
    client.release();
  }
  return pgPool;
}

module.exports = setupScraping;
9 changes: 9 additions & 0 deletions backend/src/web.js
@@ -0,0 +1,9 @@
const buildFastify = require('./app');

(async () => {
  try {
    await buildFastify();
  } catch (e) {
    console.error('Error starting fastify server', e);
  }
})();
43 changes: 43 additions & 0 deletions backend/src/worker.js
@@ -0,0 +1,43 @@
const { pgBossQueue } = require('./pg');
const setupScraping = require('./setupScraping');

async function startWorker() {
  await pgBossQueue.start();
  const scrapingPool = await setupScraping();
  await pgBossQueue.subscribe('twitterFollowers', async job => {
    const data = job.data;
    console.log('Scraping Twitter followers: ' + JSON.stringify(data));
    const numTwitterFollowers = await twitterFollowers(data);
    console.log(`Twitter followers of ${data.orgId}: ${numTwitterFollowers}`);
    const client = await scrapingPool.connect();
    try {
      const result = await client.query(
        'INSERT INTO scraping_results (org_id, request_type, result) VALUES ($1, $2, $3) ' +
        'ON CONFLICT (org_id, request_type, scraping_time) DO UPDATE ' +
        'SET result = $3;',
        [data.orgId, 'twitterFollowers', numTwitterFollowers]
      );
      if (result.rowCount !== 1) {
        console.error('ERROR! Expected one updated row, got ' + result.rowCount);
      }
      console.log(`Twitter followers for ${data.orgId} were successfully stored in the database`);
    } catch (e) {
      console.error(`Error while storing twitter followers for ${data.orgId} in the database`, e);
    } finally {
      client.release();
    }
  });
}

(async () => {
  try {
    await startWorker();
  } catch (e) {
    console.error('Error starting worker', e);
  }
})();

// Stub: returns a constant until the real Twitter API integration lands (added in later commits of this PR).
async function twitterFollowers(data) {
  return 100;
}
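
The `twitterFollowers` stub above returns a constant; later commits in this PR ("Connect to Twitter API") replace it.
A rough sketch of what the real implementation could look like, assuming the `twitter` npm client with credentials in
environment variables (the names below are illustrative, not the code this PR eventually lands):

```js
// Sketch only: a real implementation would also need rate-limit and error handling.
const Twitter = require('twitter');

const twitterClient = new Twitter({
  consumer_key: process.env.TWITTER_CONSUMER_KEY,
  consumer_secret: process.env.TWITTER_CONSUMER_SECRET,
  bearer_token: process.env.TWITTER_BEARER_TOKEN,
});

async function twitterFollowers(data) {
  // e.g. "twitter.com/climatescape" -> "climatescape"
  const screenName = data.twitterUrl.replace(/\/+$/, '').split('/').pop();
  const user = await twitterClient.get('users/show', { screen_name: screenName });
  return user.followers_count;
}
```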
