DataFrame commands throwing java.lang.NullPointerException on example data #320

Closed
ianmilligan1 opened this issue Jun 18, 2019 · 7 comments

@ianmilligan1
Member

Right now on 0.17.0, using Docker, running any DataFrame command leads to a java.lang.NullPointerException error.

For example,

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("example.arc.gz", sc)
  .extractValidPagesDF()

df.printSchema()

leads to

// Exiting paste mode, now interpreting.

java.lang.NullPointerException
  at scala.collection.mutable.ArrayOps$ofRef$.newBuilder$extension(ArrayOps.scala:190)
  at scala.collection.mutable.ArrayOps$ofRef.newBuilder(ArrayOps.scala:186)
  at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:246)
  at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
  at scala.collection.mutable.ArrayOps$ofRef.filter(ArrayOps.scala:186)
  at io.archivesunleashed.package$RecordLoader$.getFiles(package.scala:53)
  at io.archivesunleashed.package$RecordLoader$.loadArchives(package.scala:66)
  ... 54 elided

We should make the DataFrame commands work out of the box on Docker (which I think they did before).
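
The trace above ends in `filter` being called from `getFiles`, which is consistent with filtering an array that was actually `null`. The source of `getFiles` is not shown in this thread, so the following is only a minimal sketch of that failure mode, using `java.io.File.listFiles` (which returns `null` when the path is not a directory) as a stand-in. The object and method names are hypothetical and not part of aut.

```scala
import java.io.File

// Hypothetical demo object; not part of aut.
object NullListingDemo {
  // listFiles returns null (not an empty array) when `dir` is not a
  // directory, so calling .filter on the result throws a
  // NullPointerException for a missing path.
  def gzFilesUnsafe(dir: String): Array[File] =
    new File(dir).listFiles().filter(_.getName.endsWith(".gz"))

  // Wrapping the result in Option turns null into None, so a missing
  // path yields an empty array instead of an exception.
  def gzFilesSafe(dir: String): Array[File] =
    Option(new File(dir).listFiles())
      .getOrElse(Array.empty[File])
      .filter(_.getName.endsWith(".gz"))
}
```

Whether aut should return an empty result or raise a clearer "no such file" error in this case is a design choice this sketch does not settle.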

@ianmilligan1
Member Author

ianmilligan1 commented Jun 18, 2019

Works when running natively with

alias aut45='/home/i2millig/spark-2.3.2-bin-hadoop2.7/bin/spark-shell --driver-memory 45G --packages "io.archivesunleashed:aut:0.17.0"'

but fails when running with

docker run --rm -it -v "/Users/ianmilligan1/desktop/data:/data" archivesunleashed/docker-aut:0.17.0

Apologies, this probably belongs in the docker repo.

@ianmilligan1
Member Author

Works if we read in a directory, i.e.

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/data/*", sc)
  .extractValidPagesDF()

df.printSchema()

@ruebot
Member

ruebot commented Jun 20, 2019

So, is it just a documentation issue on archivesunleashed.org/aut?

@ianmilligan1
Member Author

No, it can't read example.arc.gz, as it doesn't seem to support *.gz wildcarding without throwing an error. For consistency, it'd be nice if it was always able to read example.arc.gz.

i.e. this doesn't work

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("*.gz", sc)
  .extractValidPagesDF()

df.printSchema()

Or we can just say not to use it with Docker?

@ruebot
Member

ruebot commented Jun 20, 2019

I can't reproduce it:

Standalone:

Spark context Web UI available at http://172.17.0.1:4040
Spark context available as 'sc' (master = local[*], app id = local-1561031350339).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/
         
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/home/nruest/Projects/au/aut-resources/Sample-Data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .extractValidPagesDF()

df.printSchema()

// Exiting paste mode, now interpreting.

root
 |-- CrawlDate: string (nullable = true)
 |-- Url: string (nullable = true)
 |-- MimeType: string (nullable = true)
 |-- Content: string (nullable = true)

import io.archivesunleashed._
import io.archivesunleashed.df._
df: org.apache.spark.sql.DataFrame = [CrawlDate: string, Url: string ... 2 more fields]

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/home/nruest/Projects/au/aut-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz", sc)
  .extractValidPagesDF()

df.printSchema()

// Exiting paste mode, now interpreting.

root
 |-- CrawlDate: string (nullable = true)
 |-- Url: string (nullable = true)
 |-- MimeType: string (nullable = true)
 |-- Content: string (nullable = true)

import io.archivesunleashed._
import io.archivesunleashed.df._
df: org.apache.spark.sql.DataFrame = [CrawlDate: string, Url: string ... 2 more fields]

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/home/nruest/Projects/au/aut-resources/Sample-Data/*.gz", sc)
  .extractValidPagesDF()

df.printSchema()


// Exiting paste mode, now interpreting.

root
 |-- CrawlDate: string (nullable = true)
 |-- Url: string (nullable = true)
 |-- MimeType: string (nullable = true)
 |-- Content: string (nullable = true)

import io.archivesunleashed._
import io.archivesunleashed.df._
df: org.apache.spark.sql.DataFrame = [CrawlDate: string, Url: string ... 2 more fields]

Docker:

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://9092d9b58a11:4040
Spark context available as 'sc' (master = local[*], app id = local-1561031732106).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/
         
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/aut-resources/Sample-Data/*.gz", sc)
  .extractValidPagesDF()

df.printSchema()

// Exiting paste mode, now interpreting.

2019-06-20 11:56:05 WARN  ObjectStore:6666 - Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
2019-06-20 11:56:05 WARN  ObjectStore:568 - Failed to get database default, returning NoSuchObjectException
2019-06-20 11:56:06 WARN  ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
root
 |-- CrawlDate: string (nullable = true)
 |-- Url: string (nullable = true)
 |-- MimeType: string (nullable = true)
 |-- Content: string (nullable = true)

import io.archivesunleashed._
import io.archivesunleashed.df._
df: org.apache.spark.sql.DataFrame = [CrawlDate: string, Url: string ... 2 more fields]

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/aut-resources/Sample-Data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz", sc)
  .extractValidPagesDF()

df.printSchema()

// Exiting paste mode, now interpreting.

root
 |-- CrawlDate: string (nullable = true)
 |-- Url: string (nullable = true)
 |-- MimeType: string (nullable = true)
 |-- Content: string (nullable = true)

import io.archivesunleashed._
import io.archivesunleashed.df._
df: org.apache.spark.sql.DataFrame = [CrawlDate: string, Url: string ... 2 more fields]

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("/aut-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz", sc)
  .extractValidPagesDF()

df.printSchema()

// Exiting paste mode, now interpreting.

root
 |-- CrawlDate: string (nullable = true)
 |-- Url: string (nullable = true)
 |-- MimeType: string (nullable = true)
 |-- Content: string (nullable = true)

import io.archivesunleashed._
import io.archivesunleashed.df._
df: org.apache.spark.sql.DataFrame = [CrawlDate: string, Url: string ... 2 more fields]

I'm certain it is a documentation issue, or a misreading of the documentation. There is no example.arc.gz in docker-aut. There are a sample ARC and a sample WARC in /aut-resources/Sample-Data:

  • ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz
  • ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz

All of the documentation here uses example.arc.gz as an example file, and the lesson we use with Docker doesn't have a DataFrame example in it.
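
Since the error came down to a path that did not exist in the container, a quick existence check before loading can make the mistake obvious. This is a rough sketch using only `java.nio.file` against the local filesystem (it would not see HDFS or other remote stores); `ArchivePathCheck` and `matching` are hypothetical names, not part of aut.

```scala
import java.nio.file.{Files, Path, Paths}
import scala.collection.JavaConverters._

// Hypothetical helper; not part of aut.
object ArchivePathCheck {
  // Lists entries directly inside `dir` whose file names match `glob`.
  // Returns an empty list when `dir` is not an existing directory.
  def matching(dir: String, glob: String): List[Path] = {
    val d = Paths.get(dir)
    if (!Files.isDirectory(d)) Nil
    else {
      val stream = Files.newDirectoryStream(d, glob)
      try stream.asScala.toList
      finally stream.close()
    }
  }
}
```

In spark-shell, one could call `ArchivePathCheck.matching("/aut-resources/Sample-Data", "*.gz")` first and pass the path to `RecordLoader.loadArchives` only if the result is non-empty.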

@ianmilligan1
Member Author

🤦‍♂

Oh, of course. I'll close this with egg on my face. Sorry @ruebot.

@ruebot
Member

ruebot commented Jun 20, 2019

No worries! :-D
