Vietnam stock data analyzer

Objectives

The goal of this project is to collect and visualize the stock prices of all tickers in Vietnam. Since API access is quite limited for a single business user, this project scrapes data from the web, then cleans, extracts, and loads it into a data warehouse. The final product is a maintainable, reliable data pipeline with an analytics dashboard hosted on the cloud, which authorized end users can access 24/7 with daily updated data.

Data are collected from 2010 onward, covering almost 10 years, to support later analytics steps (such as backtesting a trading strategy for a particular ticker). Within the scope of this project, stock prices are visualized in the time series domain for monitoring purposes only.

The live report is available here: Link To Report Dashboard

Proposed Architecture

To prove the concept at a reasonable cost, the architecture below is proposed.

The solution consists of 4 main components:

  • Data collection: The project draws on 2 data sources hosted by VNDirect, where all Vietnam tickers are officially registered: a ticker URL and a price URL.

    Ticker URL: https://www.vndirect.com.vn/portal/thong-tin-co-phieu/nhap-ma-chung-khoan.shtml?request_locale=en_GB

    Price URL: https://www.vndirect.com.vn/portal/thong-ke-thi-truong-chung-khoan/lich-su-gia.shtml?request_locale=en

  • ETL/Orchestration: Airflow plays the orchestration role, running and scheduling the ETL jobs. Airflow is dockerized and deployed on the Digital Ocean cloud service; the Docker setup is chosen to ease CI/CD tasks later on. Refer here: Airflow Docker Repository

  • Data storage: So far, 1710 tickers are registered and ~2.4M records are stored in a Postgres database managed by Digital Ocean. Since there are not many concurrent users at the moment and the data size is not significantly large, this choice is cost effective without hurting performance (see the connection sketch after this list).

  • Data Visualization/Analytics: Because of time constraints, only visualization is implemented in this project, charting the collected data from the Postgres database. A future feature will be data analytics with Spark, leveraging the 10 years of available data to run backtests, for example.

    The Superset visualization tool is also dockerized and hosted on an EC2 t2.micro instance, which is free for up to 1 year. This option is a good fit since visualization is not a hardware-heavy task. Refer here: Superset Docker Repository
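
As a quick sanity check on the data storage component, a short script along these lines can verify the counts quoted above. It is a minimal sketch only: the host, credentials, and table names (ticker, price) are placeholders, not values from this repository.

    # sketch: verify row counts in the managed Postgres instance;
    # host, credentials, and table names are placeholders, not repo values
    import psycopg2

    conn = psycopg2.connect(
        host="your-db.db.ondigitalocean.com",  # hypothetical managed-DB host
        port=25060,
        dbname="vn_stock",
        user="readonly_user",
        password="***",
    )
    with conn, conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM ticker;")  # expect ~1710 tickers
        print("tickers:", cur.fetchone()[0])
        cur.execute("SELECT COUNT(*) FROM price;")   # expect ~2.4M rows
        print("price rows:", cur.fetchone()[0])
    conn.close()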

Data sources

As mentioned before, there are 2 main data sources: ticker and price. The data come in HTML format, which requires the ETL job to apply some special techniques to extract the needed information.

Data model

The data model is simple and is described below.
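
In short, the model presumably boils down to a ticker table plus a price table keyed by ticker code and date. The column names and types in this sketch are illustrative guesses in the style of tasks/sql_queries.py, not the repository's actual DDL.

    # assumed schema sketch, in the style of tasks/sql_queries.py;
    # column names and types are illustrative guesses, not the actual DDL
    CREATE_TICKER_TABLE = """
    CREATE TABLE IF NOT EXISTS ticker (
        code     VARCHAR(10) PRIMARY KEY,  -- e.g. 'VNM'
        name     TEXT,                     -- company name
        exchange VARCHAR(10)               -- HOSE / HNX / UPCOM
    );
    """

    CREATE_PRICE_TABLE = """
    CREATE TABLE IF NOT EXISTS price (
        code   VARCHAR(10) REFERENCES ticker(code),
        date   DATE,
        open   NUMERIC,
        high   NUMERIC,
        low    NUMERIC,
        close  NUMERIC,
        volume BIGINT,
        PRIMARY KEY (code, date)
    );
    """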

ETL Job

Although the data model looks simple, the heavy part of the ETL job is extracting the needed data from HTML. In this project, the Selenium and BeautifulSoup libraries are used to automate crawling of the stock data, as in the sketch below.
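
The general pattern looks like this: Selenium renders the page and BeautifulSoup parses the resulting table. The table selector here is an illustrative guess, not the actual one from etl_vndirect_price.py.

    # scraping sketch: Selenium renders the page, BeautifulSoup parses it;
    # the table selector below is hypothetical, check the real page structure
    from bs4 import BeautifulSoup
    from selenium import webdriver

    PRICE_URL = ("https://www.vndirect.com.vn/portal/"
                 "thong-ke-thi-truong-chung-khoan/lich-su-gia.shtml?request_locale=en")

    driver = webdriver.Chrome()
    driver.get(PRICE_URL)
    # ... fill in the ticker symbol and date range via the page's form here ...
    soup = BeautifulSoup(driver.page_source, "html.parser")
    driver.quit()

    for row in soup.select("table tr")[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        print(cells)  # date, open, high, low, close, volume, ...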

The pipeline includes 2 main steps: ticker ingestion and price ingestion. The price ingestion step is divided into separate flows by exchange to enable multithreaded processing for better performance. Within each thread, a data quality check runs before every insert command, as sketched below.
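
A minimal sketch of that fan-out, with hypothetical crawl and insert helpers stubbed in and a simple quality check; the real logic lives in tasks/etl_vndirect_price.py.

    # sketch: one worker thread per exchange, quality check before insert
    from concurrent.futures import ThreadPoolExecutor

    EXCHANGES = ["HOSE", "HNX", "UPCOM"]

    def crawl_prices(exchange):
        # stub standing in for the real Selenium crawl in etl_vndirect_price.py
        yield {"code": "VNM", "date": "2020-01-02", "close": 116.0, "exchange": exchange}

    def insert_price(record):
        # stub standing in for the real parameterized INSERT into Postgres
        print("insert", record)

    def passes_quality_check(record):
        # illustrative check only: require a code, a date, and a positive close
        return bool(record.get("code")) and bool(record.get("date")) \
            and (record.get("close") or 0) > 0

    def ingest_prices_for_exchange(exchange):
        for record in crawl_prices(exchange):
            if passes_quality_check(record):
                insert_price(record)

    with ThreadPoolExecutor(max_workers=len(EXCHANGES)) as pool:
        list(pool.map(ingest_prices_for_exchange, EXCHANGES))  # wait, surface errors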

How the project is organized and set up

  • Files in the repository: In the main local DAG directory (defined by AIRFLOW_HOME), each project lives in its own subdirectory to keep the setup scalable. With this file structure, when another stock data source is added, simply create another project beside vn_stock.

Within the vn_stock project, dag.py defines the pipeline and its scheduler settings. It calls the ticker and price ingestion tasks via PythonOperator (see the DAG sketch at the end of this section).

.
└── vn_stock
    ├── dag.py
    ├── README.md
    └── tasks
        ├── config.py
        ├── data
        ├── etl_vndirect_price.py
        ├── etl_vndirect_ticker.py
        ├── requirements.txt
        ├── setup_database.py
        ├── sql_queries.py
        └── utils.py
  • Build and run docker-airflow with LocalExecutor setup to enable parallelizing task instances locally:

    docker build --rm --build-arg AIRFLOW_DEPS="datadog,dask" --build-arg PYTHON_DEPS="flask_oauthlib>=0.9" -t puckel/docker-airflow .

    docker-compose -f docker-compose-LocalExecutor.yml up -d

  • Set up the database on the first run (from inside the Airflow container):

    docker exec -it -u 0 docker-airflow_webserver_1 bash

    python vn_stock/tasks/setup_database.py
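
As referenced above, the wiring in dag.py via PythonOperator could look roughly like the sketch below; the DAG id, schedule, and callable names are assumptions based on the description above, not the repository's actual code.

    # dag.py sketch: ticker ingestion runs before price ingestion, daily;
    # DAG id, schedule, and callable names are illustrative assumptions
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    from tasks.etl_vndirect_ticker import ingest_tickers  # hypothetical entry points
    from tasks.etl_vndirect_price import ingest_prices

    with DAG(
        dag_id="vn_stock",
        start_date=datetime(2020, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        ticker_task = PythonOperator(task_id="ticker_ingestion",
                                     python_callable=ingest_tickers)
        price_task = PythonOperator(task_id="price_ingestion",
                                    python_callable=ingest_prices)
        ticker_task >> price_task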

Future improvements

  • When the data grow by 100x: move storage to AWS Redshift, whose columnar store is designed for better analytical performance at that scale.
  • When the database needs to be accessed by 100+ people: as user access increases, set up roles and groups for particular purposes (view/modify) to improve security management and performance (see the sketch after this list). Refer here: Managing PostgreSQL users
  • Integrate Spark for ticker analytics.
  • Crawl more data sources to enrich the data features.
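
As a starting point for that role setup, here is a sketch in the style of setup_database.py; the role names and grants are illustrative assumptions only.

    # sketch: read-only "viewer" group plus read-write "editor" group;
    # role names and grants are illustrative, adapt to your own policy
    import psycopg2

    ROLE_SETUP = """
    CREATE ROLE viewer NOLOGIN;
    GRANT SELECT ON ALL TABLES IN SCHEMA public TO viewer;

    CREATE ROLE editor NOLOGIN;
    GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO editor;

    -- individual users then inherit from a group role
    CREATE ROLE analyst LOGIN PASSWORD 'change-me' IN ROLE viewer;
    """

    conn = psycopg2.connect(host="localhost", dbname="vn_stock",
                            user="postgres", password="***")  # placeholders
    with conn, conn.cursor() as cur:
        cur.execute(ROLE_SETUP)
    conn.close()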
