Basic Java scraper that searches for a specific pattern in the main page of each web page listed in a file. The scraper finds the matching strings and writes them to a new file per site.
The project includes a scraper.properties file with a pattern.className property. This property sets the pattern to use when no pattern is provided through the system properties at runtime.
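The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class name PatternClassNameResolver and the resolve method are hypothetical, the properties content is loaded from an inline string instead of the real scraper.properties file, and the fully qualified name of AtPattern is assumed to share the package of HashtagPattern.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper: not taken from the project sources.
class PatternClassNameResolver {
    static final String KEY = "pattern.className";

    // Returns the system property when it is set and non-blank,
    // otherwise falls back to the value loaded from the properties file.
    static String resolve(Properties fileProperties) {
        String fromSystem = System.getProperty(KEY);
        if (fromSystem != null && !fromSystem.trim().isEmpty()) {
            return fromSystem.trim();
        }
        return fileProperties.getProperty(KEY);
    }

    public static void main(String[] args) throws IOException {
        Properties fileProperties = new Properties();
        // Stand-in for scraper.properties; the AtPattern package is an assumption.
        fileProperties.load(new StringReader(
                "pattern.className=com.belatrixsf.scraper.pattern.AtPattern\n"));
        System.out.println("Pattern class in use: " + resolve(fileProperties));
    }
}
```

With this approach, a value passed with -Dpattern.className on the command line takes precedence, and the file only supplies the default.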
There are two patterns implemented:
- AtPattern (default): Finds Twitter usernames (including the @)
- HashtagPattern: Finds hashtags (including the #)
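As an illustration of what the two patterns look for, here is a minimal sketch using java.util.regex. The regular expressions and the PatternSketch class are assumptions for illustration only; the actual AtPattern and HashtagPattern classes may use different rules.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: the real pattern classes may be implemented differently.
class PatternSketch {
    // Assumed rule: '@' followed by one or more letters, digits or underscores.
    static final Pattern AT = Pattern.compile("@\\w+");
    // Assumed rule: '#' followed by one or more letters, digits or underscores.
    static final Pattern HASHTAG = Pattern.compile("#\\w+");

    // Collects every distinct match in the text, in order of first appearance.
    static Set<String> findAll(Pattern pattern, String text) {
        Set<String> matches = new LinkedHashSet<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "Follow @belatrix and @java_dev for #scraping and #java news";
        System.out.println(findAll(AT, text));      // prints [@belatrix, @java_dev]
        System.out.println(findAll(HASHTAG, text)); // prints [#scraping, #java]
    }
}
```

Note that such a simple expression would also match the domain part of an email address (for example "@example" in "user@example.com"), so a real implementation would likely need a stricter rule.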
Maven can be used to run the application. To use the default pattern, simply run the mvn exec:java command.
To use a different pattern, pass its class name through the pattern.className parameter, like this:
mvn exec:java -Dpattern.className=com.belatrixsf.scraper.pattern.HashtagPattern
Every time the application runs, it creates a log file called basic-scraper.log in the root directory of the project.
All the messages are also printed on the command line.
To run all the tests, run the command mvn clean test.
The application depends on the following libraries:
- JSoup (version 1.11.3): Used to connect to and fetch the web pages.
- JUnit (version 5.3.1): Used for unit testing. Both the Jupiter engine (regular tests) and the Vintage engine (PowerMock tests) are required.
- PowerMock (version 1.7.1): Used for unit tests that need to mock static methods. This library is used together with Mockito.
- Logback (version 1.2.3): Used for logging. This library is a native SLF4J implementation.
This application is intended to be used with Java 1.8 or later.