web-scrybe is an open-source SERP (Search Engine Results Page) scraping and social media web scraping tool built using Spring Boot. It provides a simple and efficient way to extract data from websites and integrate it into your applications.
- Available SERP scraping Integrations: Currently, Google, Bing and DuckDuckGo are supported. More features will be added subsequently.
- Available Social Media Scraping Integrations: Reddit Hot Topic. More features will be added subsequently.
- Easy-to-use API: Developers can quickly integrate web scraping functionality into their applications using the provided API.
- Scalable and Distributed: The application is designed to be highly scalable and can be deployed in a distributed environment using Docker.
- Unlimited Scraping: The automation-based approach is not subject to the rate limits or per-request charges of paid APIs, so data can be scraped at scale without API-imposed throttling.
- Docker and Docker Compose Support: web-scrybe is designed to be easily deployable and scalable using Docker and Docker Compose. This makes it an ideal choice for quick prototyping, development, and testing environment deployments.
NB: Search engines may have strict anti-scraping measures; read their terms and conditions before using SERP scraping in production. Social media scraping is implemented using official APIs, which makes it suitable for use in any environment.
- Java 21 or higher
- Docker (optional, for running the application in a container)
- Reddit Account (optional)
- Clone the repository:
  git clone https://github.com/r7b7/web-scrybe.git
- Navigate to the project directory
- Add the following environment variables (not required if you do not plan to use the Reddit API): REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET, USER_AGENT
  Environment variables can be set in runtime configurations, YAML files, or properties files.
  In Visual Studio Code, environment variables can be set in launch.json inside the .vscode folder, as shown below:
  [
    {
      "type": "java",
      "name": "Launch Java Program",
      "request": "launch",
      "mainClass": "com.r7b7.webscrybe.WebSearchApplication",
      "env": {
        "REDDIT_CLIENT_ID": "<YOUR_CLIENT_ID>",
        "REDDIT_CLIENT_SECRET": "<YOUR_CLIENT_SECRET>",
        "USER_AGENT": "<YOUR_USER_AGENT>"
      }
    }
  ]
  Alternatively, the keys can be set at runtime; see step 5, option 2 for details.
  Reddit API keys can be generated at https://www.reddit.com/prefs/apps: click the "Create Another App" button, fill in the details, and copy the generated key and secret. The User-Agent header can be set to a unique string similar to "redditdev scraper by x/MyRedditId".
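Before starting the app with the Reddit integration, it can help to confirm that the three variables are actually visible to the JVM. The following is a minimal sketch of such a check; the class and method names (RedditEnvCheck, missingVariables) are hypothetical and not part of web-scrybe, and only the variable names come from this README.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of the web-scrybe codebase.
final class RedditEnvCheck {
    // Variable names taken from this README.
    static final List<String> REQUIRED =
            List.of("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "USER_AGENT");

    // Returns the required variables that are absent or blank in the given environment.
    static List<String> missingVariables(Map<String, String> env) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            String value = env.get(name);
            if (value == null || value.isBlank()) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missingVariables(System.getenv());
        if (missing.isEmpty()) {
            System.out.println("All Reddit variables are set.");
        } else {
            System.out.println("Missing: " + String.join(", ", missing));
        }
    }
}
```

Passing the environment in as a map keeps the check easy to exercise without changing the real process environment.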
- Build the app using Maven:
  mvn clean package
- Run the app (use any of the following options)
  - In an IDE: use the Run option
  - From a terminal
    - Export the environment variables if not done already
      export REDDIT_CLIENT_ID=<YOUR_CLIENT_ID>
      export REDDIT_CLIENT_SECRET=<YOUR_CLIENT_SECRET>
      export USER_AGENT="<YOUR_USER_AGENT>"
    - Run the app using Maven
      mvn spring-boot:run
    - Alternatively, run the app as a jar. From the target directory within the project folder, run the following command
      java -jar web-scrybe-0.0.1-SNAPSHOT.jar
  - In Docker
    - Run the following commands in a terminal
      docker build -t web-scrybe .
      docker-compose up -d
    - Check the logs
      docker-compose logs -f
- Available APIs
  - Reddit Search API
    http://localhost:8080/api/v1/reddit/hot?subreddit=ClaudeAI&limit=2
  - Google Search API
    http://localhost:8080/api/v1/search?query=what%20is%20latest%20in%20AI&driver=GOOGLE
  - Bing Search API
    http://localhost:8080/api/v1/search?query=what%20is%20latest%20in%20AI&driver=BING
- Swagger Documentation
  http://localhost:8080/v3/api-docs
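As a rough illustration of calling the search and Reddit endpoints listed above from Java, the sketch below builds each URL with a percent-encoded query and sends a GET request using the JDK's built-in HttpClient. The class and method names (WebScrybeClient, buildSearchUri, buildRedditHotUri) are hypothetical and not part of web-scrybe. The base URL assumes the app is running locally on port 8080, and since the response format is not described here, the body is simply printed as returned.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical client sketch; not part of the web-scrybe codebase.
final class WebScrybeClient {
    // Assumption: the app is running locally on port 8080.
    static final String BASE_URL = "http://localhost:8080/api/v1";

    // Builds the search URL; driver is GOOGLE or BING, as in the example URLs.
    static URI buildSearchUri(String query, String driver) {
        // URLEncoder emits '+' for spaces; switch to %20 to match the example URLs.
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8).replace("+", "%20");
        return URI.create(BASE_URL + "/search?query=" + encoded + "&driver=" + driver);
    }

    // Builds the Reddit hot-topics URL for a subreddit and result limit.
    static URI buildRedditHotUri(String subreddit, int limit) {
        String encoded = URLEncoder.encode(subreddit, StandardCharsets.UTF_8);
        return URI.create(BASE_URL + "/reddit/hot?subreddit=" + encoded + "&limit=" + limit);
    }

    // Sends a GET request and returns the raw response body.
    static String get(URI uri) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Requires the application to be running locally.
        System.out.println(get(buildSearchUri("what is latest in AI", "GOOGLE")));
        System.out.println(get(buildRedditHotUri("ClaudeAI", 2)));
    }
}
```

Keeping URL construction separate from the network call makes the encoding logic easy to check without a running server.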
- Stop the application
  - Terminal: use Ctrl + C to stop a running application
  - Docker
    - To stop the container, use the following command
      docker-compose down
    - To remove the container and image
      docker rm -f <container-name>
      docker rmi <image-name>
    - If you don't know the container ID or name, list all running containers using:
      docker ps
Contributions are welcome from the open-source community! Please read the contributing guidelines to get started.
To maintain code quality and ensure stability, please follow these guidelines before submitting a Pull Request (PR).
- Fork and Clone the Repository: Fork the repository on GitHub, then clone your fork locally.
  git clone https://github.com/yourusername/yourproject.git
- Code Formatting: Ensure your code follows the standard Java code style. Use the built-in code formatting feature of your IDE (IntelliJ IDEA, Eclipse, or VS Code).
- Run Tests Locally: Make sure that all existing tests pass before submitting your PR.
  ./mvnw test
- Run SonarQube Analysis: Run SonarQube locally to check for code quality issues and ensure the code meets the standards. To set up SonarQube on your machine, see the section "Run Sonarqube Locally".
  mvn clean verify sonar:sonar -Dsonar.login=<YOUR_TOKEN>
- Ensure the build is stable.
- Commit Message Guidelines: Use meaningful and concise commit messages. Follow this format:
  [Type]: Brief description
  Types can include:
  - feat: New feature
  - fix: Bug fix
  - docs: Documentation changes
  - test: Adding or updating tests
  - refactor: Code refactoring
- Creating a Pull Request: Before creating a Pull Request, ensure that your branch is up to date with the main branch.
- Code Review Process: The PR will be reviewed by maintainers and other contributors. Please be patient and respond to any requested changes.
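The commit message format from the guidelines above can be checked mechanically. The following is a minimal sketch, assuming that only the five listed types are accepted, that they are written in lowercase, and that only the first line of the message is checked; the guidelines say types "can include" these, so the project may accept others. The class name CommitMessageCheck is hypothetical and not part of web-scrybe.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of the web-scrybe codebase.
final class CommitMessageCheck {
    // Assumption: exactly these five types, followed by ": " and a non-empty description.
    private static final Pattern FORMAT =
            Pattern.compile("(feat|fix|docs|test|refactor): \\S.*");

    // Validates only the first line (the subject) of the commit message.
    static boolean isValid(String message) {
        String subject = message.lines().findFirst().orElse("");
        return FORMAT.matcher(subject).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("feat: add Bing driver"));
        System.out.println(isValid("update stuff"));
    }
}
```

A check like this could be wired into a local Git hook, though this README does not prescribe one.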
- Install Docker
- Run the following command
  docker run -d --name sonarqube -p 9000:9000 sonarqube
- The SonarQube server should be up and running at http://localhost:9000
- To stop the server, use this command
  docker stop sonarqube
- To run the SonarQube analysis locally, you need to pass a login token for authentication. Follow these steps to generate the token from your local SonarQube server:
  1. Go to your SonarQube instance at http://localhost:9000.
  2. Log in with your admin credentials (admin/admin by default).
  3. Navigate to My Account > Security > Generate Tokens.
  4. Enter a token name (e.g., my-project-token) and click Generate.
If you encounter any issues, please open a discussion or create an issue. We're here to help!