BQin is a BigQuery data importer with AWS S3 and SQS messaging.
Inspired by http://github.com/fujiwara/Rin
- (Someone) creates an S3 object.
- S3 event notifications send a message to SQS.
- BQin fetches messages from SQS.
- BQin copies the S3 object to a temporary Google Cloud Storage bucket and creates a BigQuery load job.
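As an illustration of the third step, the sketch below pulls the bucket and object key out of an S3 event notification body as it would arrive in an SQS message. It is written in Python for brevity and is not BQin's own code; the function name `extract_s3_objects` is hypothetical. It assumes the standard notification layout (`Records[].s3.bucket.name` and `Records[].s3.object.key`) and that S3 delivers directly to SQS without an SNS wrapper.

```python
import json
from urllib.parse import unquote_plus


def extract_s3_objects(message_body: str) -> list:
    """Return (bucket, key) pairs found in an S3 event notification body."""
    event = json.loads(message_body)
    objects = []
    # The "s3:TestEvent" message sent when notifications are enabled has no
    # "Records" field, so default to an empty list.
    for record in event.get("Records", []):
        if record.get("eventSource") != "aws:s3":
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys are URL-encoded in notifications (a space becomes "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects
```

Each returned pair identifies one object that would then be copied to the temporary bucket and loaded.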
Configuring Amazon S3 Event Notifications.
- Create an SQS queue.
- Attach an SQS access policy to the queue (see Example Walkthrough 1).
- Enable Event Notifications on an S3 bucket.
- Create a temporary bucket on Google Cloud Storage and create the target dataset on BigQuery.
- Run the bqin process with a configuration that points to the SQS queue and the S3 bucket.
queue_name: my_queue_name # SQS queue name
cloud:
  aws:
    region: ap-northeast-1
s3:
  bucket: bqin.bucket.test
  region: ap-northeast-1
big_query:
  project_id: bqin-test
  dataset: test
option:
  temporary_bucket: my_bucket_name # GCP temporary bucket
  gzip: true
  source_format: json # one of csv, json, parquet
  auto_detect: true # works only with csv or json
# define load rules
rules:
  - big_query: # standard rule
      table: user
    s3:
      key_prefix: data/user
  - big_query: # expand values captured by key_regexp, e.g. for date-sharded tables
      table: $1_$2
    s3:
      key_regexp: data/(.+)/part-([0-9]+).gz
  - big_query: # override the default sections in this rule
      project_id: hoge
      dataset: bqin_test
      table: role
    s3:
      bucket: bqin.bucket.test
      key_regexp: data/(.+)/part-([0-9]+).csv
    option:
      gzip: false
      source_format: csv
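To make the rule semantics concrete, here is a minimal Python sketch of how a key could be matched against a rule and how `$1_$2` could be expanded from the `key_regexp` captures. It is an illustration under assumptions, not BQin's implementation: the rules are flattened into plain dicts, `resolve_table` is a hypothetical name, and the regular expression is applied unanchored.

```python
import re

# Flattened copies of the first two rules in the sample configuration.
RULES = [
    {"table": "user", "key_prefix": "data/user"},
    {"table": "$1_$2", "key_regexp": r"data/(.+)/part-([0-9]+).gz"},
]


def resolve_table(rule: dict, key: str):
    """Return the table name for key under rule, or None if the rule does not apply."""
    if "key_prefix" in rule:
        return rule["table"] if key.startswith(rule["key_prefix"]) else None
    match = re.search(rule["key_regexp"], key)  # assumption: unanchored match
    if match is None:
        return None
    # Replace each $N placeholder with the N-th captured group.
    return re.sub(r"\$(\d+)", lambda m: match.group(int(m.group(1))), rule["table"])
```

With these assumptions, the key `data/access_log/part-20200101.gz` resolves to the table `access_log_20200101` under the second rule.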
The configuration file is parsed by kayac/go-config.
go-config expands environment variables in the configuration file using the syntax {{ env "FOO" }} or {{ must_env "FOO" }}.
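The effect of the two template functions can be approximated with the following Python sketch. It is a simplified stand-in and not go-config itself, which is a Go library built on templates. The behavior assumed here is that `env` yields an empty string for an unset variable while `must_env` fails; the function name `expand` is hypothetical.

```python
import os
import re

# Matches {{ env "NAME" }} and {{ must_env "NAME" }}.
_PATTERN = re.compile(r'\{\{\s*(env|must_env)\s+"([^"]+)"\s*\}\}')


def expand(text: str) -> str:
    """Replace env/must_env placeholders in text with environment values."""

    def substitute(m):
        func, name = m.group(1), m.group(2)
        if func == "must_env":
            if name not in os.environ:
                # Assumption: a missing variable is a hard error for must_env.
                raise KeyError(f"environment variable {name} is not defined")
            return os.environ[name]
        # Assumption: env falls back to an empty string when unset.
        return os.environ.get(name, "")

    return _PATTERN.sub(substitute, text)
```

Using must_env for required values such as credentials makes a missing variable fail at startup rather than produce an empty setting.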
BQin requires the following credentials.
- AWS credentials for access to SQS and S3. These are looked up in the same way as for the AWS CLI: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html
- GCP credentials for access to BigQuery and Cloud Storage. These are referenced through GOOGLE_APPLICATION_CREDENTIALS: https://cloud.google.com/docs/authentication/getting-started?hl=en
Alternatively, the credentials can be embedded in the config.
cloud:
  aws:
    region: ap-northeast-1
    access_key_id: {{ must_env "ACCESS_KEY_ID" }}
    secret_access_key: {{ must_env "SECRET_ACCESS_KEY" }}
  gcp:
    base64_credential: {{ must_env "GCP_CREDENTIAL_BASE64_JSON" }}
Note: For GCP credentials, specify a Base64-encoded string of the contents of the credential JSON file.
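One way to produce that string is sketched below in Python. It assumes the standard Base64 alphabet (not the URL-safe variant) with no line breaks; the helper names are hypothetical and any file path passed in is a placeholder.

```python
import base64


def encode_credential(json_bytes: bytes) -> str:
    """Return the standard Base64 encoding of the credential file contents."""
    return base64.b64encode(json_bytes).decode("ascii")


def encode_credential_file(path: str) -> str:
    """Read a service account JSON file and return its Base64 encoding."""
    with open(path, "rb") as f:
        return encode_credential(f.read())
```

The resulting value can then be exported as GCP_CREDENTIAL_BASE64_JSON before starting bqin.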
BQin waits for new SQS messages and processes them continuously.
$ bqin run -config config.yaml [-debug]
BQin receives SQS messages, processes them, and exits when all messages in the queue have been read.
$ bqin batch -config config.yaml -queue <dlq-queue-name> [-debug]
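The batch behavior amounts to a drain loop, in contrast to the endless loop of `bqin run`. A minimal Python sketch follows, with the queue abstracted as a hypothetical `receive` callable that returns a list of messages. How BQin itself decides that the queue is empty is not described here; a real SQS receive call can return no messages while some remain, so treat the stop condition as an assumption.

```python
from typing import Callable, List


def drain(receive: Callable[[], List[str]], handle: Callable[[str], None]) -> int:
    """Process messages until receive() returns an empty batch.

    Returns the number of messages handled.
    """
    handled = 0
    while True:
        batch = receive()
        if not batch:
            # Assumption: an empty batch means the queue has been read completely.
            return handled
        for message in batch:
            handle(message)
            handled += 1
```

Passing a dead-letter queue name via -queue, as in the command above, would replay the messages that failed earlier.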
$ echo "s3://bucket.example.com/object.txt" | bqin check -config config.yaml
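The check command above is given an s3:// URL on standard input. What bqin does with it is not spelled out here, but such a URL conventionally splits into a bucket and an object key, which is what the rules match against. A small Python sketch of that split, with the hypothetical helper name `parse_s3_url`:

```python
from urllib.parse import urlparse


def parse_s3_url(url: str):
    """Split an s3://bucket/key URL into (bucket, key)."""
    parsed = urlparse(url.strip())  # strip the trailing newline added by echo
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URL: {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")
```

For the URL in the command above, this gives the bucket `bucket.example.com` and the key `object.txt`.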
MIT
KAYAC Inc.