keepValidPages incorrectly filters out pages with mime-type text/html followed by charset #199

Closed
dportabella opened this issue Apr 23, 2018 · 3 comments · Fixed by #200

dportabella commented Apr 23, 2018

Sometimes the Content-Type header includes a charset parameter, for example text/html;charset=ISO-8859-1 for the page http://www.patentbuddy.com/.
keepValidPages compares the MIME type to "text/html" by exact equality, so it incorrectly filters out such pages.

More info: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type

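package io.archivesunleashed.spark.rdd

object RecordRDD extends java.io.Serializable {
  // Excerpt: the enclosing definition that provides `rdd` is omitted here.
  def keepValidPages(): RDD[ArchiveRecord] = {
    rdd.filter(r =>
      r.getCrawlDate != null
        // Exact string comparison: "text/html;charset=ISO-8859-1"
        // is not equal to "text/html", so the record is dropped.
        && (r.getMimeType == "text/html"
        || r.getMimeType == "application/xhtml+xml"
        || r.getUrl.endsWith("htm")
        || r.getUrl.endsWith("html"))
        && !r.getUrl.endsWith("robots.txt"))
  }
}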
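
To make the failure concrete, here is a minimal standalone sketch of the MIME-type part of the predicate. `isHtmlByEquality` and `EqualityCheckDemo` are hypothetical names introduced only for illustration; they are not part of the library, and the sketch assumes getMimeType returns the header value with its charset parameter still attached, as described above.

```scala
// Standalone sketch; `isHtmlByEquality` is a hypothetical stand-in for the
// MIME-type comparison inside keepValidPages.
object EqualityCheckDemo {
  def isHtmlByEquality(mimeType: String): Boolean =
    mimeType == "text/html" || mimeType == "application/xhtml+xml"

  def main(args: Array[String]): Unit = {
    // Bare MIME type: accepted.
    println(isHtmlByEquality("text/html"))
    // Same MIME type with a charset parameter: rejected by exact equality.
    println(isHtmlByEquality("text/html;charset=ISO-8859-1"))
  }
}
```

The second call returns false, which is why a page whose URL does not end in "htm" or "html" gets filtered out.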

dportabella commented Apr 23, 2018

fix:

import java.util.regex.Pattern

val htmlMimeTypePattern: Pattern = Pattern.compile("(text/html|application/xhtml\\+xml)(;.*)?")

def keepValidPages(): RDD[ArchiveRecord] = {
  rdd.filter(r =>
    r.getCrawlDate != null
      && ((r.getMimeType != null && htmlMimeTypePattern.matcher(r.getMimeType).matches)
      || r.getUrl.endsWith("htm")
      || r.getUrl.endsWith("html"))
      && !r.getUrl.endsWith("robots.txt"))
}
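
The pattern above can be exercised on its own, without Spark. The sketch below reuses the same regular expression; the wrapper `isHtml`, the object name `MimePatternDemo`, and the sample values are assumptions added for illustration.

```scala
import java.util.regex.Pattern

object MimePatternDemo {
  // Same pattern as in the proposed fix.
  val htmlMimeTypePattern: Pattern =
    Pattern.compile("(text/html|application/xhtml\\+xml)(;.*)?")

  // Hypothetical helper: null-safe check against the pattern.
  def isHtml(mimeType: String): Boolean =
    mimeType != null && htmlMimeTypePattern.matcher(mimeType).matches

  def main(args: Array[String]): Unit = {
    val samples = Seq(
      "text/html",
      "text/html;charset=ISO-8859-1",
      "application/xhtml+xml; charset=utf-8",
      "text/plain",
      "text/html ; charset=utf-8", // whitespace before ';'
      "TEXT/HTML"                  // different letter case
    )
    samples.foreach(s => println(s"$s -> ${isHtml(s)}"))
  }
}
```

Note that the last two samples do not match: the pattern expects the semicolon to follow the MIME type immediately and compares case-sensitively. Whether that matters depends on the headers found in real crawls.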

ruebot commented Apr 23, 2018

@dportabella if you have a fix, feel free to put in a PR.

dportabella commented

It turned out to be better to fix ArchiveRecord.getMimeType itself rather than patch keepValidPages.
Here is the pull request: #200
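
The code of #200 is not reproduced in this thread. As a rough sketch of the general idea only, normalizing the value once at the source could look like the following. `normalizeMimeType` and `MimeTypeNormalizer` are hypothetical names, and trimming plus lowercasing are assumptions of this sketch, not necessarily what the pull request does.

```scala
import java.util.Locale

object MimeTypeNormalizer {
  // Hypothetical helper, NOT the code from the pull request: strips any
  // parameters (such as charset) from a Content-Type value.
  def normalizeMimeType(contentType: String): Option[String] =
    Option(contentType).map { value =>
      // Keep only the part before the first ';', then trim and lowercase it.
      value.split(";", 2)(0).trim.toLowerCase(Locale.ROOT)
    }

  def main(args: Array[String]): Unit = {
    println(normalizeMimeType("text/html;charset=ISO-8859-1"))
    println(normalizeMimeType("Text/HTML ; charset=utf-8"))
    println(normalizeMimeType(null))
  }
}
```

With a normalized value, the original equality checks in keepValidPages would work unchanged, and every other caller of the MIME type would benefit as well.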

dportabella added a commit to dportabella/aut that referenced this issue Apr 23, 2018