Consumption-Analysis

所得房價消費分析-鄉鎮市區電子發票B2C開立資料集是由政府資料開方平台所提供的資料集。由於對全台的B2C電子發票發票發布狀況感到好奇，進而想對此資料進行分析。原始資料集的時間從2021/01記錄到2024/12，本專案在機器學習訓練階段時只使用2021年的資料。

The Income and Housing Price Consumption Analysis -B2C E-Invoice Dataset for Townships and Districts is provided by the Government Open Data Platform. Out of curiosity about the issuance of B2C e-invoices across Taiwan, I decided to analyze this dataset.The original dataset is from 2021/01 to 2024/12, but this model is trained by the data from 2021.

資料型態(Data Structure)

此資料集共記錄了發票年月、縣市代碼、縣市名稱、鄉鎮市區代碼、鄉鎮市區名稱、行業名稱、平均開立張數、平均開立金額、平均客單價九個欄位。其中需要特別注意

平均開立張數 = 開立總張數 \ 該維度下營業人的家數
平均開立金額 = 開立總金額 \ 該維度下營業人的家數
平均客單價 = 開立客單價 \ 該維度下營業人的家數

This dataset records nine fields: invoice year-month, county/city code, county/city name, township/district code, township/district name, industry name, average number of invoices issued, average issued amount, and average transaction price.

Some key points to note:

Average number of invoices issued = Total number of invoices issued / Number of businesses in that category
Average issued amount = Total issued amount / Number of businesses in that category
Average transaction price = Total transaction price / Number of businesses in that category

EDA

	mean	std	var	min	max
平均開立張數(Avg. invoices issued)	1.389867e+04	1.381884e+04	1.909603e+08	16	309155
平均開立金額(Avg. issued amount)	4.881683e+06	1.797752e+07	3.231912e+14	33590	542486725
平均客單價(Avg. transaction price)	4.279705e+02	9.122108e+02	8.321416e+05	61	48869

清理 (Data Cleaning)

發票年月：格視為年月合併，如202406。拆分成year和month
丟棄縣市代碼和鄉鎮市區代碼
將縣市名稱和鄉鎮市區名稱兩欄位合併成「縣鄉鎮市區」，並進行經緯度轉換
針對「行業名稱」進行one-hot encoding
Invoice year-month: Initially stored as a combined "YYYYMM" format (e.g., 202406), split into year and month.
Dropped fields: county/city code and township/district code.
Merged location fields: Combined county/city name and township/district name into a single "county-township-district" field and converted them into latitude and longitude coordinates.
Industry name encoding: Applied one-hot encoding to the industry name field.

Pipeline

使用Catboost作為模型訓練資料並使用grid search調參最後使用MAPE判斷訓練結果。訓練完成後 MAPE = 0.2 CatBoost was used as the model for training, and hyperparameter tuning was performed using grid search. The final model was evaluated using Mean Absolute Percentage Error (MAPE), achieving a MAPE of 0.2.

開源(Open source)

使用streamlit開源數據統計與機器學習預測結果若你想要將本專案在地端開發，請遵循以下步驟 Use Streamlit to Open Source Data Statistics and Machine Learning Predictions
If you want to develop this project locally, please follow the steps below.

使用poetry將套件安裝完成並開啟虛擬環境

poetry env use python
poetry shell
poetry lock

執行csv_2_sql.py將csv轉換成sql

python csv_2_sql.py

將pyproject.toml轉換成requirements.txt

poetry export -f requirements.txt --output requirements.txt --without-hashes

前往streamlit完成設定即可部屬

結論 Conclude

這是我第一次將自己的訓練開源，我知道預測結果不盡理想我會繼續學習的 This is my first time open-sourcing my training process. I know the prediction results are not perfect, but I will keep learning and improving!

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
.gitignore		.gitignore
README.md		README.md
app.py		app.py
catboost_model_ALL_grid.cbm		catboost_model_ALL_grid.cbm
csv_2_sql.py		csv_2_sql.py
data.db		data.db
main.ipynb		main.ipynb
ml.ipynb		ml.ipynb
ml_optimize.ipynb		ml_optimize.ipynb
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
start.sh		start.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Consumption-Analysis

資料型態(Data Structure)

EDA

清理 (Data Cleaning)

Pipeline

開源(Open source)

結論 Conclude

About

Releases

Packages

Languages

Ayukevin/Consumption-Analysis

Folders and files

Latest commit

History

Repository files navigation

Consumption-Analysis

資料型態(Data Structure)

EDA

清理 (Data Cleaning)

Pipeline

開源(Open source)

結論 Conclude

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages