DataFrames for image analysis #220

Closed
lintool opened this issue May 14, 2018 · 9 comments

@lintool
Member

lintool commented May 14, 2018

Currently, we have RDD-based analytics for image analysis here:
https://archivesunleashed.org/aut/#image-analysis

Let's DataFrame-ify it - that is, build the DataFrame infrastructure that would support image analysis.

I'm thinking of creating two separate DataFrames:

  • The first for the image links, something like (source page, image url)
  • The second for the images themselves, something like (img url, type, height, width, md5, raw bytes)
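
As a rough illustration of the first shape, (source page, image url), here is a minimal sketch in plain Python rather than Spark. It is not the project's implementation: the names `ImageLink` and `extract_image_links` are hypothetical, and it assumes that image links are simply the `src` attributes of `<img>` tags, resolved against the page URL.

```python
from html.parser import HTMLParser
from typing import List, NamedTuple
from urllib.parse import urljoin


class ImageLink(NamedTuple):
    """One row of the hypothetical (source page, image url) table."""
    src: str        # URL of the page on which the image was found
    image_url: str  # absolute URL of the image


class _ImgCollector(HTMLParser):
    """Collects the raw src attribute of every <img> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.srcs: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)


def extract_image_links(page_url: str, html: str) -> List[ImageLink]:
    """Return one row per <img src=...>, with relative URLs made absolute."""
    collector = _ImgCollector()
    collector.feed(html)
    return [ImageLink(page_url, urljoin(page_url, s)) for s in collector.srcs]
```

Each returned row maps directly onto one record of the first DataFrame; an `<img>` tag without a `src` attribute produces no row.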
@jwli229
Contributor

jwli229 commented May 14, 2018

I will work on this.

@lintool
Member Author

lintool commented May 14, 2018

@JWZ2018

Great. BTW - one end game would be to create something like this in a single e2e pipeline: http://ruebot.net/elxn42.html

Fork the repo, start working on a branch - send an initial PR when you're ready and we can discuss iteratively.

@ruebot
Member

ruebot commented May 14, 2018

Here's the background on how I make those.

@jwli229
Contributor

jwli229 commented May 14, 2018

@lintool
Just making sure I'm on the right track and understanding the codebase properly. For the two DataFrames above, I'm planning to add extractImageLinksDF and extractImageDetailsDF to WARecordRDD in io/archivesunleashed/package.scala, and then add them to DataframeLoader, similar to how extractValidPagesDF and extractHyperlinksDF are implemented.
Does that sound right?

@lintool
Member Author

lintool commented May 14, 2018

Yes, that's a good start!

ruebot pushed a commit that referenced this issue May 15, 2018
* Extract Image Links DF API
* Add extract image links text
* Remove unnecessary comment from test
* Add doc comments
* Addresses #220
@jwli229
Contributor

jwli229 commented May 15, 2018

@lintool
Clarifying for the second df:

@lintool
Member Author

lintool commented May 15, 2018

However, I'm not opposed to poking around for other options... I found this, for example: http://imglib2.net/

Might be a better option, as opposed to messing with JNI.

@ruebot has experience with ImageMagick - thoughts?

@ruebot
Member

ruebot commented May 15, 2018

You can get the image info with Apache Tika, which we already use in the project for language and MIME type extraction. https://tika.apache.org/1.7/formats.html#Image_formats

ruebot pushed a commit that referenced this issue May 21, 2018
* Add Extract Image Details API
* Change check for jpeg and fix spacing
* Add tiff parser
* Use AutoDetectParser and read Numeric fields
* Use ComputeImageSize
* Hex encode hash and base64 encode image bytes
* Fix test
* Change df column names
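
The commit messages above mention hex-encoding the hash and base64-encoding the image bytes. A minimal sketch of what one such image-details row could look like, in plain Python, follows. It is an illustration under stated assumptions, not the code from the commit: the field names and the `image_details` helper are hypothetical, and instead of Tika or ComputeImageSize it reads the width and height directly from a PNG header, so it handles PNG input only.

```python
import base64
import hashlib
import struct
from typing import NamedTuple, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


class ImageDetails(NamedTuple):
    """One row of the hypothetical image-details table."""
    url: str
    mime_type: str
    width: int
    height: int
    md5: str        # hex-encoded MD5 digest of the raw bytes
    bytes_b64: str  # base64-encoded raw bytes


def png_size(data: bytes) -> Tuple[int, int]:
    """Read (width, height) from the IHDR chunk that must open a PNG file."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG with a leading IHDR chunk")
    # Width and height are two big-endian unsigned 32-bit integers.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def image_details(url: str, data: bytes) -> ImageDetails:
    """Build one row: dimensions, hex MD5 and base64 payload (PNG only)."""
    width, height = png_size(data)
    return ImageDetails(
        url=url,
        mime_type="image/png",
        width=width,
        height=height,
        md5=hashlib.md5(data).hexdigest(),
        bytes_b64=base64.b64encode(data).decode("ascii"),
    )
```

Storing the digest as hex and the payload as base64 keeps both columns as plain strings, which makes the rows easy to write out as text; decoding `bytes_b64` recovers the original image bytes exactly.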
@lintool
Member Author

lintool commented May 21, 2018

With #226 this is done. Closing.

@lintool lintool closed this as completed May 21, 2018