Live demo: https://vasilios.io
For the full implementation details and write-up, please refer to report.pdf
Vasilios.io is a dynamic, interactive NLP dashboard that provides insights into software engineering and data science articles. Designed to explore trends and reader interests in these fields, it presents exploratory data analysis and text-mining visualizations. Users can uncover patterns, popular topics, and the article types that resonate most with audiences, making it a useful tool for understanding content trends in tech and data science.
High-Level Architecture of the system:
- Data Collection (in a separate repo, to be published soon):
- An Airflow orchestrator schedules and runs the tasks in this stage
- The first step scrapes the publicly available archives of the most popular tech publishers on Medium.com
- The next step transforms and cleans the data to fit the PostgreSQL schema
- The cleaned data is then published to a RabbitMQ message queue, with every row stored as a JSON message (see the producer/worker sketch after this list)
- A database worker fetches messages from the queue and inserts the rows into PostgreSQL
- Data Persistence:
- Data from the collection step is saved in a PostgreSQL database
- Backend System:
- The backend system is written in Python using the Django framework and includes:
- RESTful API endpoints that accept GET requests from the frontend (see the endpoint sketch after this list)
- Data analysis that queries the PostgreSQL database and processes the results with data analysis and NLP libraries such as NumPy, Pandas, and NLTK
- Testing with unit tests, integration tests, and mocks run against sample data
- Web App / Frontend:
- The frontend is written in HTML, CSS, and JavaScript. JavaScript fetches data from the backend API endpoints, and the charts are rendered with Chart.js
- Production Environment:
- The app is containerized with Docker, with Gunicorn serving as the WSGI HTTP server and Nginx acting as a reverse proxy that handles incoming web traffic and forwards it to the application
- The app is deployed on AWS running the following services:
- AWS Elastic IP for static IP address
- AWS EC2 running web app in Docker container
- AWS ECR where the Docker images are stored
- AWS RDS for PostgreSQL hosting the database
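The data-collection pipeline lives in a separate repo, but the RabbitMQ hand-off described above can be sketched in a few lines with the pika client. Everything below is illustrative, not the actual implementation: the queue name "articles" and the insert helper are assumptions.

```python
import json
import pika

def insert_into_postgres(row):
    """Placeholder: the real worker would run an INSERT via psycopg2."""
    ...

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="articles", durable=True)  # hypothetical queue name

def publish_row(row):
    """Producer side: publish one cleaned article row as a persistent JSON message."""
    channel.basic_publish(
        exchange="",
        routing_key="articles",
        body=json.dumps(row),
        properties=pika.BasicProperties(delivery_mode=2),  # persist across broker restarts
    )

def handle_message(ch, method, properties, body):
    """Worker side: insert the row into PostgreSQL, then acknowledge the message."""
    insert_into_postgres(json.loads(body))
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="articles", on_message_callback=handle_message)
channel.start_consuming()
```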
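On the backend side, each chart is typically served by a small GET endpoint that queries PostgreSQL through the Django ORM and shapes the result with Pandas. A minimal sketch, assuming a hypothetical Article model with tags and claps fields (not the project's actual schema):

```python
# views.py (sketch)
import pandas as pd
from django.http import JsonResponse
from .models import Article  # hypothetical model

def top_tags(request):
    """GET endpoint: return the ten most frequent article tags for Chart.js."""
    rows = list(Article.objects.values("tags", "claps"))
    counts = pd.DataFrame(rows)["tags"].value_counts().head(10)
    return JsonResponse({"labels": counts.index.tolist(), "data": counts.tolist()})
```

The frontend then fetches this JSON and passes the labels and data arrays straight into a Chart.js dataset.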
Please make sure you have PostgreSQL installed on your local computer before you start the following steps.
You can set up the database by following the Airflow steps in report.pdf or by restoring the provided backup.dump.
Set up your Postgres Database:
pg_restore -U <your_username> -h localhost -d new_database_name -v /path/to/backup.dump
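Note that pg_restore restores into an existing database, so create the target database first if it does not exist yet (assuming the standard PostgreSQL client tools):
createdb -U <your_username> -h localhost new_database_name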
Navigate into the project's directory:
cd nlp_dashboard/nlp_dashboard
Comment out or delete the following lines in settings.py, as they are for production only:
DEFAULT_AUTO_FIELD = "django.db.models.BigAutoField"
SECURE_PROXY_SSL_HEADER = ("HTTP_X_FORWARDED_PROTO", "https")
CSRF_TRUSTED_ORIGINS = os.environ.get("CSRF_TRUSTED_ORIGINS").split(" ")
Create a .env file and include the following information:
DEBUG=1 # leave as is
SECRET_KEY=insert-a-secret-key-here-can-be-anything # leave as is or change to your preference
DJANGO_ALLOWED_HOSTS=localhost,0.0.0.0,127.0.0.1 # leave as is
SQL_ENGINE=django.db.backends.postgresql # leave as is
SQL_NAME=db-name-you-set-up-earlier
SQL_USER=your-postgres-username
SQL_PASSWORD=your-postgres-password
SQL_HOST=localhost # leave as is
SQL_PORT=5432 # leave as is
DATABASE=postgres # leave as is
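For reference, settings.py presumably consumes these variables via os.environ, roughly along these lines (a sketch, not the project's exact code):

```python
# settings.py (sketch) -- maps the .env values above onto Django's DATABASES setting
import os

DATABASES = {
    "default": {
        "ENGINE": os.environ.get("SQL_ENGINE", "django.db.backends.sqlite3"),
        "NAME": os.environ.get("SQL_NAME"),
        "USER": os.environ.get("SQL_USER"),
        "PASSWORD": os.environ.get("SQL_PASSWORD"),
        "HOST": os.environ.get("SQL_HOST", "localhost"),
        "PORT": os.environ.get("SQL_PORT", "5432"),
    }
}
```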
Create and activate your virtual environment, then install the dependencies:
python -m venv venv_name
source venv_name/bin/activate
pip install -r requirements.txt
# inspect the existing database and copy the generated models into models.py
python manage.py inspectdb
python manage.py makemigrations
python manage.py migrate
python manage.py runserver
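Once the server starts, the dashboard should be reachable at http://127.0.0.1:8000/ (Django's default development address).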
Note: The live website was previously called teas.cafe