Describe the bug
The conditional logic in textFiles does not filter records properly. You can reproduce this with the example WARC in src/test/resources.
To Reproduce
scala> import io.archivesunleashed.df._
import io.archivesunleashed.df._
scala> val arcPath = "/home/nruest/Projects/au/aut/src/test/resources/warc/example.warc.gz"
arcPath: String = /home/nruest/Projects/au/aut/src/test/resources/warc/example.warc.gz
scala> val df = RecordLoader.loadArchives(arcPath, sc).textFiles()
df: org.apache.spark.sql.DataFrame = [url: string, filename: string ... 6 more fields]
scala> df.select("url").orderBy(desc("md5")).show(5, false)
[Stage 0:> (0 + 1) / 1]19/12/16 22:04:33 WARN PDFParser: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
19/12/16 22:04:33 WARN SQLite3Parser: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
+--------------------------------------------+
|url |
+--------------------------------------------+
|http://ia311518.us.archive.org/robots.txt |
|http://ia331306.us.archive.org/robots.txt |
|http://ia300230.us.archive.org/robots.txt |
|http://ia360602.us.archive.org/robots.txt |
|http://ia340915.us.archive.org/robots.txt |
+--------------------------------------------+
Expected behavior
textFiles should filter out robots.txt files, along with all js, css, html, and htm files.
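To make the expected behavior concrete, here is a minimal sketch of the exclusion rule in plain Scala. It is not the toolkit's actual code: `TextFileFilter` and `keepUrl` are hypothetical names, the `example.org` URL is made up for illustration, and matching on the URL suffix is an assumption (a URL with a query string, such as `app.js?v=1`, would not be caught by this check).

```scala
// Hypothetical helper, not part of aut: a sketch of the rule described above.
object TextFileFilter {
  // Suffixes that the issue says textFiles should exclude.
  private val excludedSuffixes: Seq[String] =
    Seq("robots.txt", ".js", ".css", ".html", ".htm")

  // A URL is kept only when it matches none of the excluded suffixes.
  def keepUrl(url: String): Boolean = {
    val lower = url.toLowerCase
    !excludedSuffixes.exists(suffix => lower.endsWith(suffix))
  }

  def main(args: Array[String]): Unit = {
    val urls = Seq(
      "http://ia311518.us.archive.org/robots.txt", // from the output above
      "http://example.org/notes.txt"               // made-up URL that should survive
    )
    // Only the second URL passes the filter.
    urls.filter(keepUrl).foreach(println)
  }
}
```

With this rule, none of the five robots.txt URLs shown in the output above would be returned.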