EPSO stands for European Personnel Selection Office; in brief, it is responsible for recruiting people who want to work for the European institutions. More can be read at https://eu-careers.europa.eu/en
Their website also includes a list of open vacancies, which can be found at https://eu-careers.europa.eu/en/job-opportunities/open-vacancies. This list is the subject of this crawler.
The crawler loads the configured website, parses the vacancies and saves them as a plain text file in an Amazon S3 bucket. It then compares the downloaded list of open vacancies with the one from the previous run of the crawler, and if there are any completely new vacancies, they are stored in a PostgreSQL database.
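For illustration, here is a minimal sketch of the comparison step in TypeScript, assuming the plain text files hold one vacancy per line (the actual file format and function names in the repository may differ):

```typescript
// Sketch of the comparison step: given the contents of previous.txt and
// latest.txt (assumed here to be one vacancy per line), keep only the
// entries that did not appear in the previous run.
function findNewVacancies(previousTxt: string, latestTxt: string): string[] {
  const previous = new Set(
    previousTxt.split("\n").map((line) => line.trim()).filter(Boolean)
  );
  return latestTxt
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !previous.has(line));
}

// Only "Vacancy C" would end up being stored in PostgreSQL.
const newOnes = findNewVacancies(
  "Vacancy A\nVacancy B",
  "Vacancy A\nVacancy B\nVacancy C"
);
```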
Based on the above, the prerequisites are:

- Amazon S3 (or another type of compatible storage, such as MinIO) - used for storing the plain text files (`previous.txt` and `latest.txt`)
- PostgreSQL database - used for storing the newly opened vacancies
One possibility is to build the Docker image and deploy it as-is to AWS Lambda (the Dockerfile is already adjusted to extend the Lambda Node.js base image), so that it is triggered at regular intervals to crawl the EPSO website for new vacancies. The triggering events have to carry a `detail` property (explained below) with `URL_TO_CRAWL` and `POSITION_TYPE`.
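As a sketch, the event payload the handler expects could be typed like this (illustrative only; the actual types in the repository may differ):

```typescript
// Assumed shape of the triggering event - not the repository's real type.
interface CrawlerEvent {
  detail: {
    // e.g. "https://eu-careers.europa.eu/en/job-opportunities/open-vacancies/cast"
    URL_TO_CRAWL: string;
    // one of the allowed position types listed further below
    POSITION_TYPE: string;
  };
}
```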
Another possibility would be to run the script as-is. In that case:

- a modification of the `index` is necessary - especially the commented part at the end that would actually trigger the function - or some sort of wrapper needs to be implemented (see the sketch after this list)
- the code then needs to be built from .ts to .js (`npm run build`)
- the code can be run - for example via `npm run start` (or by invoking the wrapper from the first step)
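A minimal wrapper could look like the sketch below; the import path and the exported `handler` name are assumptions and need to be adjusted to whatever `index.ts` actually exports:

```typescript
// run-local.ts - hypothetical wrapper that triggers the crawler outside Lambda.
import { handler } from "./index"; // assumed export name

async function main(): Promise<void> {
  await handler({
    detail: {
      URL_TO_CRAWL:
        "https://eu-careers.europa.eu/en/job-opportunities/open-vacancies/cast",
      POSITION_TYPE: "cast",
    },
  });
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```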
The crawler has to be configured via environment variables. They are all mandatory, except when deploying to AWS Lambda - in that case the `AWS_`-prefixed variables will be handled automatically by Lambda itself.
| Config Parameter | Description |
| --- | --- |
| `AWS_REGION` | Amazon region used for the S3 service |
| `AWS_BUCKET` | Amazon S3 bucket to store the plain text files in |
| `AWS_ACCESS_KEY_ID` | Amazon access key ID |
| `AWS_SECRET_ACCESS_KEY` | Amazon secret access key |
| `PG_HOST` | PostgreSQL hostname |
| `PG_PORT` | PostgreSQL port |
| `PG_DATABASE` | PostgreSQL database name |
| `PG_USERNAME` | PostgreSQL username |
| `PG_PASSWORD` | PostgreSQL password |
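For local runs, the variables above could be pulled together along these lines (illustrative only; the repository's own configuration code may be structured differently):

```typescript
// Sketch: reading the mandatory environment variables listed above.
const config = {
  aws: {
    region: process.env.AWS_REGION,
    bucket: process.env.AWS_BUCKET,
    accessKeyId: process.env.AWS_ACCESS_KEY_ID,
    secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
  },
  postgres: {
    host: process.env.PG_HOST,
    port: Number(process.env.PG_PORT ?? 5432),
    database: process.env.PG_DATABASE,
    user: process.env.PG_USERNAME,
    password: process.env.PG_PASSWORD,
  },
};
```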
There are two more parameters necessary for a successful crawl run, but these are expected to arrive in the `detail` property of the event fired by an Amazon trigger. Example piece of JSON:
```json
...
"detail": {
  "URL_TO_CRAWL": "https://eu-careers.europa.eu/en/job-opportunities/open-vacancies/cast",
  "POSITION_TYPE": "cast"
},
...
```
Where the two input parameters are:

- `URL_TO_CRAWL` - URL to be crawled. Should look like `https://eu-careers.europa.eu/en/job-opportunities/open-vacancies/{xxx}`, where `{xxx}` represents the position type.
- `POSITION_TYPE` - represents the sub-section on the EPSO website. The value corresponds to the last part of the URL (except for `permanent_staff`). It is an enum. Allowed values are: `permanent_staff`, `ec_vacancies`, `temp`, `cast`, `seconded`, `others`.
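Expressed in TypeScript, the allowed values could be modelled roughly like this (a sketch; the repository may represent the enum differently):

```typescript
// Sketch of POSITION_TYPE as a union type.
// Note: except for "permanent_staff", the value matches the last URL segment.
type PositionType =
  | "permanent_staff"
  | "ec_vacancies"
  | "temp"
  | "cast"
  | "seconded"
  | "others";
```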
The crawler needs at least one table to be present - `jobs`. SQL CREATE scripts can be found in the `src/models/` subfolder.