keepValidPages incorrectly filters out pages with mime-type text/html followed by charset #199

dportabella · 2018-04-23T11:45:27Z

Sometimes the Content-Type header includes a charset parameter, such as text/html;charset=ISO-8859-1 for the page http://www.patentbuddy.com/.
In those cases keepValidPages does not work properly: it compares the MIME type against "text/html" by exact equality, so it incorrectly filters out such pages.

More info: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type

package io.archivesunleashed.spark.rdd

// Excerpt only: `rdd` is defined in the enclosing code, which is elided here.
object RecordRDD extends java.io.Serializable {
  // ...
  def keepValidPages(): RDD[ArchiveRecord] = {
    rdd.filter(r =>
      r.getCrawlDate != null
        // Exact string equality: fails when the value carries ";charset=..."
        && (r.getMimeType == "text/html"
        || r.getMimeType == "application/xhtml+xml"
        || r.getUrl.endsWith("htm")
        || r.getUrl.endsWith("html"))
        && !r.getUrl.endsWith("robots.txt"))
  }
  // ...
}
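
To make the failure concrete, here is a minimal standalone sketch of the same equality check. It assumes getMimeType returns the Content-Type value with its parameters still attached, as the report describes; the object and method names are hypothetical and not part of aut.

```scala
// Hypothetical demo, independent of aut and Spark.
object ContentTypeEqualityDemo {
  // Mirrors the MIME-type comparison used in keepValidPages.
  def isHtmlByEquality(mimeType: String): Boolean =
    mimeType == "text/html" || mimeType == "application/xhtml+xml"

  def main(args: Array[String]): Unit = {
    println(isHtmlByEquality("text/html"))                    // prints "true"
    println(isHtmlByEquality("text/html;charset=ISO-8859-1")) // prints "false"
  }
}
```

The second call returns false even though the page is HTML, which is why such records are dropped unless their URL happens to end in "htm" or "html".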

dportabella · 2018-04-23T11:53:38Z

Proposed fix:

import java.util.regex.Pattern

val htmlMimeTypePattern: Pattern = Pattern.compile("(text/html|application/xhtml\\+xml)(;.*)?")

def keepValidPages(): RDD[ArchiveRecord] = {
  rdd.filter(r =>
    r.getCrawlDate != null
      && ((r.getMimeType != null && htmlMimeTypePattern.matcher(r.getMimeType).matches)
      || r.getUrl.endsWith("htm")
      || r.getUrl.endsWith("html"))
      && !r.getUrl.endsWith("robots.txt"))
}
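
The pattern above can be exercised on its own, outside Spark. The following is only a sketch: the wrapper object and the `matchesHtml` helper are hypothetical names, while the regular expression is the one proposed in this comment.

```scala
import java.util.regex.Pattern

// Hypothetical demo wrapper around the proposed pattern.
object HtmlMimeTypePatternDemo {
  val htmlMimeTypePattern: Pattern =
    Pattern.compile("(text/html|application/xhtml\\+xml)(;.*)?")

  // Null-safe full match, as in the proposed filter.
  def matchesHtml(mimeType: String): Boolean =
    mimeType != null && htmlMimeTypePattern.matcher(mimeType).matches

  def main(args: Array[String]): Unit = {
    println(matchesHtml("text/html"))                    // prints "true"
    println(matchesHtml("text/html;charset=ISO-8859-1")) // prints "true"
    println(matchesHtml("text/plain"))                   // prints "false"
    println(matchesHtml(null))                           // prints "false"
  }
}
```

Note that the pattern is case-sensitive and expects the semicolon to follow the type directly, so a value such as "TEXT/HTML" or "text/html ;charset=utf-8" would still not match.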

ruebot · 2018-04-23T12:17:03Z

@dportabella if you have a fix, feel free to put in a PR.

dportabella · 2018-04-23T17:57:00Z

It turned out to be better to fix the function ArchiveRecord.getMimeType itself rather than patch keepValidPages.
The pull request is here: #200
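
This thread does not show the change made in #200. As a rough illustration of the general idea only (normalizing the Content-Type value once, so that every caller sees a bare MIME type), a sketch might look like the following. The object name and the `stripParameters` helper are hypothetical and are not taken from aut.

```scala
// Hypothetical sketch; not the code from #200.
object MimeTypeNormalizationSketch {
  // Returns the media type without parameters such as ";charset=...",
  // or None when the header value is missing.
  def stripParameters(contentType: String): Option[String] =
    Option(contentType).map(_.split(";", 2)(0).trim)

  def main(args: Array[String]): Unit = {
    println(stripParameters("text/html;charset=ISO-8859-1")) // prints "Some(text/html)"
    println(stripParameters("text/html"))                    // prints "Some(text/html)"
    println(stripParameters(null))                           // prints "None"
  }
}
```

With the value normalized at the source, the original equality checks in keepValidPages would work unchanged, and other code comparing MIME types would benefit as well.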

dportabella added a commit to dportabella/aut that referenced this issue Apr 23, 2018

fix archivesunleashed#199: mime-type was incorrectly parsed from cont…

36d8700

…ent-type when charset param exists

dportabella mentioned this issue Apr 23, 2018

fix #199: mime-type was incorrectly parsed from content-type when cha… #200

Merged

dportabella added a commit to dportabella/aut that referenced this issue Apr 23, 2018

fix archivesunleashed#199: mime-type was incorrectly parsed from cont…

3f9fab6

…ent-type when charset param exists

ruebot closed this as completed in #200 Apr 26, 2018

ruebot pushed a commit that referenced this issue Apr 26, 2018

Resolves #199: mime-type was incorrectly parsed from content-type whe…

b90c559

…n cha… (#200)
