Scraphead

Scraphead scrapes HTML from a URL to retrieve OpenGraph, Twitter Card, and other meta information from the HTML head tag.

Description

Scraphead is divided into two modules, core and netty. The core module contains all the logic: HTML head parsing and mapping into the OpenGraph and Twitter Card models. The netty module is one of several possible implementations of the web client.

Main features

Non-blocking
Downloads only the <head/>, not the entire HTML file
Multiple web client implementations available
Detects file encoding
Reads OpenGraph, Twitter Card, and more
Allows plugins for specific processing (depending on the domain, for example)
Built for Java 17 and modules
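
To make the head-only download, encoding detection, and OpenGraph reading more concrete, here is a minimal sketch using only the JDK. It is not Scraphead's implementation: the class `HeadSketch` and its methods are hypothetical names chosen for illustration, the regular expressions are a simplification of real HTML parsing, and the sketch assumes an ASCII-compatible encoding with UTF-8 as the fallback.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class, not part of Scraphead.
final class HeadSketch {
    private static final Pattern HEAD_END =
            Pattern.compile("</head\\s*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern CHARSET =
            Pattern.compile("<meta\\b[^>]*?\\bcharset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern PROPERTY =
            Pattern.compile("\\bproperty\\s*=\\s*\"(og:[^\"]+)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern CONTENT =
            Pattern.compile("\\bcontent\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    private HeadSketch() {
    }

    /** Reads the stream chunk by chunk and stops once the closing head tag has been seen. */
    public static String readHead(InputStream in, int chunkSize) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        byte[] buf = new byte[chunkSize];
        int end = -1;
        int n;
        while (end < 0 && (n = in.read(buf)) != -1) {
            bytes.write(buf, 0, n);
            // ISO-8859-1 maps one byte to one char, so a match index is also a byte offset.
            Matcher m = HEAD_END.matcher(bytes.toString(StandardCharsets.ISO_8859_1));
            if (m.find()) {
                end = m.end();
            }
        }
        byte[] raw = bytes.toByteArray();
        int length = end < 0 ? raw.length : end;
        String asciiView = new String(raw, 0, length, StandardCharsets.ISO_8859_1);
        return new String(raw, 0, length, detectCharset(asciiView));
    }

    /** Looks for a charset declaration in a meta tag; falls back to UTF-8 (an assumption). */
    public static Charset detectCharset(String asciiView) {
        Matcher m = CHARSET.matcher(asciiView);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the fallback below.
            }
        }
        return StandardCharsets.UTF_8;
    }

    /** Collects og:* properties from meta tags that use double-quoted attributes. */
    public static Map<String, String> extractOpenGraph(String head) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher tag = META_TAG.matcher(head);
        while (tag.find()) {
            Matcher property = PROPERTY.matcher(tag.group());
            Matcher content = CONTENT.matcher(tag.group());
            if (property.find() && content.find()) {
                result.put(property.group(1), content.group(1));
            }
        }
        return result;
    }
}
```

Calling `HeadSketch.readHead(stream, 1024)` followed by `HeadSketch.extractOpenGraph(head)` yields a map of `og:*` properties while leaving the rest of the body unread. A real client would additionally close the connection at that point and handle single-quoted or unquoted attributes.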

Installation

<dependency>
    <groupId>fr.ght1pc9kc</groupId>
    <artifactId>scraphead-core</artifactId>
    <version>${scraphead.version}</version>
</dependency>

<dependency>
    <groupId>fr.ght1pc9kc</groupId>
    <artifactId>scraphead-netty</artifactId>
    <version>${scraphead.version}</version>
</dependency>

Usage

With all collectors:

ScrapClient scrapHttpClient = new NettyScrapClient();
HeadScraper scraper = HeadScrapers.builder(scrapHttpClient).build();
scraper.scrap(URI.create("https://blog.ght1pc9kc.fr/2021/server-sent-event-vs-websocket-avec-spring-webflux.html"))
    .map(doWhatEverYouWantWithMeta)
    .subscribe();

With a limited set of collectors:

ScrapClient scrapClient = new NettyScrapClient();
HeadScraper scraper = HeadScrapers.builder(scrapClient)
  .useMetaTitleAndDescr()
  .useOpengraph()
  .build();
scraper.scrap(URI.create("https://blog.ght1pc9kc.fr/2021/server-sent-event-vs-websocket-avec-spring-webflux.html"))
  .map(doWhatEverYouWantWithMeta)
  .subscribe();
