This project automates the extraction of data from user-provided online sources and summarizes it efficiently. The system handles large-scale data scraping and processing tasks, ensuring the data is stored, analyzed, and summarized in a structured format. It is built on modern, widely used tooling, which keeps it reliable and scalable across diverse use cases.
The main objectives of the project are:
- Automating the process of data collection from websites (user-provided URLs).
- Storing and organizing data in a database for further analysis.
- Generating summaries from the scraped data to provide insights.
The key features are:
- Data Scraping: Extracts data from targeted web pages using custom logic and tools.
- Data Summarization: Analyzes and condenses extracted data into meaningful summaries.
- Database Integration: Stores scraped data and summaries in a PostgreSQL database.
- Error Handling: Handles edge cases such as invalid URLs and model failures.
- REST API: Allows interaction with the system for adding, updating, and retrieving summaries.
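For illustration, interacting with the API might look like the sketch below. The `/summaries` routes and payload fields are assumptions made for this example (they are not documented here), so adjust them to match the actual Express routes.

```js
// Hypothetical example: submit a URL for scraping/summarization, then fetch the stored summary.
// The /summaries routes and field names are assumptions, not the documented API.
const BACKEND_URL = process.env.BACKEND_URL || 'http://localhost:5050';

async function run() {
  // Ask the backend to scrape and summarize a user-provided URL.
  const createRes = await fetch(`${BACKEND_URL}/summaries`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url: 'https://example.com/article' }),
  });
  const created = await createRes.json();

  // Retrieve the stored summary by id.
  const getRes = await fetch(`${BACKEND_URL}/summaries/${created.id}`);
  console.log(await getRes.json());
}

run().catch(console.error);
```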
Tech stack:
- Backend: Node.js with Sequelize ORM.
- Database: PostgreSQL
- The schema was well-defined and the requirements were strict, which favored a relational database (see the Sequelize model sketch below).
- Summaries can grow very large, which makes handling them in a NoSQL store more complicated.
- Strong ACID compliance; supports complex multi-row transactions.
- Tradeoffs:
- Scaling: Vertical scaling is common; horizontal scaling requires sharding.
- Cost of Scaling: Vertical scaling can become expensive for high loads.
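As a rough illustration of how the schema and transactional writes can look with Sequelize on PostgreSQL; the `Summary` model and its columns here are assumptions for the sketch, not the project's actual schema.

```js
// Minimal sketch, assuming a Summary table; the model name and columns are illustrative.
const { Sequelize, DataTypes } = require('sequelize');

const sequelize = new Sequelize(process.env.DB_NAME, process.env.DB_USER, process.env.DB_PASSWORD, {
  host: process.env.DB_HOST,
  port: process.env.DB_PORT,
  dialect: 'postgres',
});

const Summary = sequelize.define('Summary', {
  url: { type: DataTypes.STRING, allowNull: false, unique: true },
  scrapedText: { type: DataTypes.TEXT }, // raw scraped content can be large
  summary: { type: DataTypes.TEXT },     // generated summary
});

// Multi-row writes can be wrapped in a transaction to get the ACID guarantees mentioned above.
async function saveResult(url, scrapedText, summary) {
  await sequelize.transaction(async (t) => {
    await Summary.create({ url, scrapedText, summary }, { transaction: t });
  });
}
```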
Tools:
- Puppeteer: Used for web scraping; it runs headless by default, so no browser UI is needed (see the scraping sketch after this list).
- GroqCloud: Used for summarizing the content gathered from websites. OpenAI was initially considered for its model quality, but we switched to Groq because OpenAI's free tier was discontinued. Groq is reliable, provides an SDK, and is easy to integrate (see the summarization sketch after this list).
- Lodash: Used for error-handling utilities.
- Custom Upsert Function: Implements `exists(instance) ? update : create(instance)` logic for database writes (see the sketch after this list).
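A minimal sketch of the scraping step with Puppeteer; which elements are actually extracted is an assumption here, so adapt the selector logic to the real extraction rules.

```js
// Minimal Puppeteer sketch: launch a headless browser and grab the page's visible text.
// Extracting document.body.innerText is an assumption; the project's custom logic may target specific selectors.
const puppeteer = require('puppeteer');

async function scrapePage(url) {
  const browser = await puppeteer.launch(); // headless by default
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    return await page.evaluate(() => document.body.innerText);
  } finally {
    await browser.close();
  }
}
```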
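The summarization call via the Groq SDK might look roughly like the following; the model name and prompt wording are assumptions, not the project's exact configuration.

```js
// Minimal Groq SDK sketch; the model name and prompt are assumptions.
const Groq = require('groq-sdk');

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });

async function summarize(text) {
  const completion = await groq.chat.completions.create({
    model: 'llama-3.1-8b-instant', // assumed model; use whichever Groq model the project configures
    messages: [
      { role: 'system', content: 'Summarize the following web page content concisely.' },
      { role: 'user', content: text },
    ],
  });
  return completion.choices[0].message.content;
}
```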
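The custom upsert can be sketched with Sequelize as follows; the `Summary` model and the `url` lookup key are placeholders for the project's actual model and unique key.

```js
// Illustrative upsert helper: update the row if it exists, otherwise create it.
async function upsertSummary(Summary, data) {
  const existing = await Summary.findOne({ where: { url: data.url } });
  if (existing) {
    return existing.update(data); // exists(instance) ? update
  }
  return Summary.create(data);    // : create(instance)
}
```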
Languages: JavaScript, SQL.
To set up the project locally:
- Clone the repository:
git clone https://github.com/RaKAsHASH/RankMeScrapperTask.git
cd RankMeScrapperTask
- Install dependencies:
npm i
- Configure environment variables: Create a .env file in the root directory and add necessary configurations for the database and other services:
DB_USER=<your_db_user>
DB_PASSWORD=<your_db_password>
DB_NAME=<your_db_name>
DB_TEST_NAME=<your_testing_db_name> (database name used for tests; add a similar variable for production if needed)
DB_HOST=<your_db_host> (e.g. localhost)
DB_PORT=<db_port>
NODE_ENV=development
GROQ_API_KEY="YOUR_API_KEY"
BACKEND_URL='http://localhost:5050'
- Start the application: Once everything is set up, run:
npm start