Scraphead

Scraphead scrapes the HTML served at a URL to retrieve OpenGraph, Twitter Card, and other metadata from the HTML head tag.

Description

Scraphead is divided into two modules, core and netty. The core module contains all the logic: parsing the HTML head and mapping it to the OpenGraph and Twitter Card models. The netty module is one of several possible implementations of the web client.
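To give a feel for the kind of mapping the core performs, here is a minimal, self-contained sketch that pulls `og:*` properties out of a head fragment. It is not Scraphead's code: the class `OpenGraphSketch` and its `parse` method are hypothetical names, and the regular expression only handles `property` followed by `content` with double-quoted values, which a real HTML parser would not require.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not part of the Scraphead API.
final class OpenGraphSketch {
    // Matches <meta property="og:xxx" content="yyy"> with the attributes in that order only.
    private static final Pattern OG_META = Pattern.compile(
            "<meta\\s+property=\"(og:[^\"]+)\"\\s+content=\"([^\"]*)\"\\s*/?>",
            Pattern.CASE_INSENSITIVE);

    // Returns the OpenGraph properties found in the head, in document order.
    static Map<String, String> parse(String head) {
        Map<String, String> properties = new LinkedHashMap<>();
        Matcher matcher = OG_META.matcher(head);
        while (matcher.find()) {
            // Keep the first value when a property appears more than once.
            properties.putIfAbsent(matcher.group(1), matcher.group(2));
        }
        return properties;
    }
}
```

Keeping this step independent of any HTTP code mirrors the split described above: the parsing logic only needs the head as text, whichever web client fetched it.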

Main features

  • Non-blocking
  • Downloads only the <head/>, not the entire HTML file
  • Multiple web client implementations available
  • Detects file encoding
  • Reads OpenGraph, Twitter Card, and more
  • Allows plugins for specific treatment (depending on the domain, for example)
  • Built for Java 17 and Java modules
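The "download only the <head/>" idea can be sketched as reading the response in chunks and stopping as soon as the closing `</head>` tag shows up. The example below is a simplified, blocking illustration under that assumption, not Scraphead's non-blocking implementation; `HeadOnlyReader` and `readHead` are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical illustration, not part of the Scraphead API.
final class HeadOnlyReader {
    private static final String END_OF_HEAD = "</head>";

    // Reads chunk by chunk and returns everything up to and including </head>.
    // If the tag never appears, the whole stream content is returned.
    static String readHead(InputStream in, int chunkSize) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[chunkSize];
        try {
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
                // ISO-8859-1 maps each byte to one char, so the ASCII tag can be
                // searched for before the real encoding of the page is known.
                String soFar = buffer.toString(StandardCharsets.ISO_8859_1);
                int end = soFar.toLowerCase(Locale.ROOT).indexOf(END_OF_HEAD);
                if (end >= 0) {
                    // Stop here: the rest of the stream is never read.
                    return soFar.substring(0, end + END_OF_HEAD.length());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toString(StandardCharsets.ISO_8859_1);
    }
}
```

Rescanning the whole buffer after each chunk keeps the sketch short and handles a tag split across two chunks. A production client would more likely scan incrementally and close the connection once the tag is found.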

Installation

<dependency>
    <groupId>fr.ght1pc9kc</groupId>
    <artifactId>scraphead-core</artifactId>
    <version>${scraphead.version}</version>
</dependency>

<dependency>
    <groupId>fr.ght1pc9kc</groupId>
    <artifactId>scraphead-netty</artifactId>
    <version>${scraphead.version}</version>
</dependency>

Usage

With all collectors:

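// Web client implementation provided by the scraphead-netty module
ScrapClient scrapHttpClient = new NettyScrapClient();
// No collector selected on the builder: all collectors are used
HeadScraper scraper = HeadScrapers.builder(scrapHttpClient).build();
scraper.scrap(URI.create("https://blog.ght1pc9kc.fr/2021/server-sent-event-vs-websocket-avec-spring-webflux.html"))
    // doWhatEverYouWantWithMeta is a placeholder for your own function
    .map(doWhatEverYouWantWithMeta)
    .subscribe();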

With a limited set of collectors:

ScrapClient scrapClient = new NettyScrapClient();
// Enable only the collectors you need
HeadScraper scraper = HeadScrapers.builder(scrapClient)
    .useMetaTitleAndDescr()
    .useOpengraph()
    .build();
scraper.scrap(URI.create("https://blog.ght1pc9kc.fr/2021/server-sent-event-vs-websocket-avec-spring-webflux.html"))
    // doWhatEverYouWantWithMeta is a placeholder for your own function
    .map(doWhatEverYouWantWithMeta)
    .subscribe();