Basic Java Web Scraper - Belatrix test for back-end Java tech lead

davidjgomez/basic-scraper

basic-scraper

Basic Java scraper that searches for a specific pattern in the main page of each web site listed in a file. The scraper writes the strings it finds to a new file per site.
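The flow described above (read a list of sites from a file, search each page, write one result file per site) can be sketched roughly as follows. This is only an illustration, not the project's actual code: the class name ScrapeLoopSketch, the method names, and the host-based output file naming are all assumptions, and the page fetch is passed in as a plain function so the sketch runs offline.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the scrape loop; names are not taken from the project.
final class ScrapeLoopSketch {

    // Assumed naming scheme: one output file per site, named after the host.
    static String outputNameFor(String url) {
        String host = URI.create(url).getHost();
        return (host == null ? "unknown-site" : host) + ".txt";
    }

    // Reads one URL per line, applies the pattern to the fetched text,
    // and writes the matches (one per line) to a file per site.
    static List<Path> run(Path urlFile, Path outDir, Pattern pattern,
                          Function<String, String> fetcher) throws IOException {
        List<Path> written = new ArrayList<>();
        for (String line : Files.readAllLines(urlFile, StandardCharsets.UTF_8)) {
            String url = line.trim();
            if (url.isEmpty()) {
                continue; // skip blank lines in the input file
            }
            List<String> matches = new ArrayList<>();
            Matcher m = pattern.matcher(fetcher.apply(url));
            while (m.find()) {
                matches.add(m.group());
            }
            Path out = outDir.resolve(outputNameFor(url));
            Files.write(out, matches, StandardCharsets.UTF_8);
            written.add(out);
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("basic-scraper-sketch");
        Path urls = dir.resolve("urls.txt");
        Files.write(urls, Arrays.asList("https://example.com/"), StandardCharsets.UTF_8);
        // Stub fetcher standing in for a real HTTP request.
        List<Path> out = run(urls, dir, Pattern.compile("@\\w+"),
                url -> "Contact @alice or @bob");
        System.out.println(out);
    }
}
```

In the real application the fetch step would presumably be handled by JSoup, which is listed under Dependencies; the stub function here only stands in for it.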

Configuration and Patterns

The project includes a scraper.properties file. Its pattern.className property sets the pattern to use when no pattern is provided through the system properties at runtime.
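The precedence described above (a system property, if present, overrides the value in scraper.properties) could look something like the sketch below. The class PatternClassResolver and its methods are hypothetical, and the fully qualified name used for AtPattern assumes it lives in the same package as HashtagPattern; the project may resolve the setting differently.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper; not taken from the project's source.
final class PatternClassResolver {

    static final String KEY = "pattern.className";

    // Loads properties from any reader (for example, a reader over scraper.properties).
    static Properties load(Reader reader) throws IOException {
        Properties props = new Properties();
        props.load(reader);
        return props;
    }

    // A non-blank system property wins; otherwise fall back to the file value.
    static String resolve(Properties fileProps) {
        String fromSystem = System.getProperty(KEY);
        if (fromSystem != null && !fromSystem.trim().isEmpty()) {
            return fromSystem.trim();
        }
        return fileProps.getProperty(KEY);
    }

    public static void main(String[] args) throws IOException {
        // Assumed file content; the AtPattern package name is a guess.
        Properties props = load(new StringReader(
                "pattern.className=com.belatrixsf.scraper.pattern.AtPattern\n"));
        System.out.println(resolve(props));
    }
}
```

Running the sketch with -Dpattern.className=... on the command line would print the overriding value instead of the one read from the file content.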

There are two patterns implemented:

  • AtPattern (default): Finds Twitter usernames (with the @)
  • HashtagPattern: Finds hashtags (with the #)
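As a rough illustration of what the two patterns match, the sketch below uses plain java.util.regex. The regular expressions and the TokenPatterns class are assumptions made for this example; the actual AtPattern and HashtagPattern classes may use different expressions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; the regexes are assumed, not copied from the project.
final class TokenPatterns {

    // "@" followed by word characters, not preceded by a word character or "@"
    // (so e-mail addresses such as me@example.com are skipped).
    static final Pattern AT = Pattern.compile("(?<![\\w@])@\\w+");

    // "#" followed by word characters, not preceded by a word character or "#".
    static final Pattern HASHTAG = Pattern.compile("(?<![\\w#])#\\w+");

    // Collects every match of the pattern in the text, in order of appearance.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Follow @belatrix and @java_dev, mail me@example.com #java #web_scraping";
        System.out.println(findAll(AT, text));      // prints [@belatrix, @java_dev]
        System.out.println(findAll(HASHTAG, text)); // prints [#java, #web_scraping]
    }
}
```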

Instructions

Running the Application

Maven can be used to run the application. To use the default pattern, it is only necessary to run the mvn exec:java command.

To change the pattern, pass it with the pattern.className parameter:

mvn exec:java -Dpattern.className=com.belatrixsf.scraper.pattern.HashtagPattern

Every time the application runs, a log file called basic-scraper.log is created in the root directory of the project.

All messages are also printed to the command line.

Running Tests

To run all the tests, run the mvn clean test command.

Dependencies

The application depends on the following libraries:

  • JSoup (version 1.11.3): Used to connect to the web pages and fetch them.
  • JUnit (version 5.3.1): Used for unit testing. Both the Jupiter (normal tests) and Vintage (PowerMock tests) engines were needed.
  • PowerMock (version 1.7.1): Used for unit tests that needed to mock static methods. This library was used specifically with Mockito.
  • Logback (version 1.2.3): Used for logging. This library is a native SLF4J implementation.

This application was made to be used with Java 1.8 or later.
