add image analysis w/ tensorflow #318

Merged
merged 5 commits into from
Jul 5, 2019

Conversation

h324yang
Contributor

JCDL2019 demo

Uses AUT and an SSD model with TensorFlow to run object detection on web archives.


  1. The default setting is Spark standalone mode, so the master and slaves need to be set up first.
  2. Run detect.py to compute and store the object probabilities and the image byte strings.
  3. Run extract_images.py to get the image files from the output of step 2.

@codecov-io

codecov-io commented Apr 25, 2019

Codecov Report

Merging #318 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master     #318   +/-   ##
=======================================
  Coverage   75.95%   75.95%           
=======================================
  Files          41       41           
  Lines        1148     1148           
  Branches      200      200           
=======================================
  Hits          872      872           
  Misses        209      209           
  Partials       67       67


@ruebot
Member

ruebot commented Apr 28, 2019

@h324yang thanks for getting this started. Can you update your PR to use the PR template? That'll help us flesh out documentation that we'll need to run examples, and then write it all up here. Also, I'm not seeing any tests. Can you provide some?

@lintool do you want #241 open still? Does this supersede it?

@ruebot
Member

ruebot commented Apr 28, 2019

...and is this a part of everything that should be included, or just helpers for the work you did on the paper?

@h324yang
Contributor Author

h324yang commented May 6, 2019

Distributed image analysis via the integration of AUT and Tensorflow

GitHub issue(s): #240 #241

What does this Pull Request do?

  • Integrates AUT and TensorFlow through a Python interface (PySpark).
  • Contains the code for the JCDL 2019 paper.
  • Uses the Single Shot MultiBox Detector (SSD) for now, because it balances speed and accuracy.
  • Stores the inference scores and the byte strings of the images first.
  • Uses the image extractor to write out the image files (JPEG, GIF, etc.) whose scores are higher than a user-defined threshold.

How should this be tested?

Step 1: Run detection

python aut/src/main/python/tf/detect.py \
		--web_archive "/tuna1/scratch/nruest/geocites/warcs/1/*" \
		--aut_jar aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar \
		--spark spark-2.3.2-bin-hadoop2.7/bin \
		--master spark://127.0.1.1:7077 \
		--img_model ssd \
		--filter_size 640 640 \
		--output_path warc_res

Step 2: Extract Images

python aut/src/main/python/tf/extract_images.py \
		--res_dir warc_res \
		--output_dir warc_imgs \
		--threshold 0.85
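The thresholding in Step 2 can be pictured with a small sketch. The record layout (a dict with `url`, `scores`, and `image_bytes` keys) and the function name `select_images` are assumptions made for illustration; the actual format written by detect.py may differ.

```python
from typing import Dict, Iterable, List


def select_images(records: Iterable[Dict], threshold: float) -> List[Dict]:
    """Keep records whose best detection score meets the threshold."""
    kept = []
    for record in records:
        scores = record.get("scores", [])
        # An image with no detections never passes the filter.
        if scores and max(scores) >= threshold:
            kept.append(record)
    return kept


# Hypothetical detection results, standing in for the output of Step 1.
records = [
    {"url": "a.jpg", "scores": [0.91, 0.40], "image_bytes": b"..."},
    {"url": "b.gif", "scores": [0.50], "image_bytes": b"..."},
    {"url": "c.png", "scores": [], "image_bytes": b"..."},
]
selected = select_images(records, threshold=0.85)
print([r["url"] for r in selected])  # ['a.jpg']
```

With a threshold of 0.85, only the first record survives; lowering the threshold keeps more images at the cost of more false positives.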

Additional Notes:

Python Dependency

My Python environment is listed here. Although it is not the minimal requirement, for a quick setup you can download it and then run pip install -r req.txt.

Note that you should ensure that the driver and the workers use the same Python version. You might set this as follows:

export PYSPARK_PYTHON=[YOUR PYTHON]
export PYSPARK_DRIVER_PYTHON=[YOUR PYTHON]
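A quick sanity check before launching might look like the sketch below. The helper name `pyspark_python_mismatch` is hypothetical, and comparing the two paths is only a rough proxy: the same path can still resolve to different Python versions on different machines.

```python
import os
from typing import Mapping, Optional


def pyspark_python_mismatch(env: Mapping[str, str]) -> Optional[str]:
    """Return a warning string if the two variables disagree, else None."""
    worker = env.get("PYSPARK_PYTHON")
    driver = env.get("PYSPARK_DRIVER_PYTHON")
    if worker is None or driver is None:
        return "PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON should both be set"
    if worker != driver:
        return f"driver uses {driver!r} but workers use {worker!r}"
    return None


# Check the current shell environment; None means the two settings agree.
print(pyspark_python_mismatch(os.environ))
```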

Spark Mode

The default mode is standalone. For example, you can launch Spark in this mode as follows:

cd spark-2.3.2-bin-hadoop2.7
./sbin/start-master.sh
./sbin/start-slave.sh spark://127.0.1.1:7077

The Spark parameters are set by init_spark() in src/main/python/tf/util/init.py.

Design Details

  • The pre-trained model and the corresponding dictionary for label mapping are stored in src/main/python/tf/model/graph/ and src/main/python/tf/model/category/, respectively.
  • For each pre-trained model (there is only one so far), we define a model class and an extractor class, such as SSD and SSDExtractor in src/main/python/tf/model/object_detection.py.
  • The model class (e.g., SSD) is used to derive the pandas UDF for inference.

Interested parties

@lintool

@ruebot
Member

ruebot commented May 30, 2019

@h324yang can you remove the binaries from the PR, and provide code comments and instructions in the PR testing comment on where to locate, download, and place them?

@ruebot
Member

ruebot commented Jun 5, 2019

@h324yang I'm unable to get this to run.

$ cat warc-image-classification/run_detection.sh 
export PYSPARK_PYTHON=/home/ruestn/anaconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/home/ruestn/anaconda3/bin/python

python /home/ruestn/aut/src/main/python/tf/detect.py --web_archive "/tuna1/scratch/nruest/geocites/warcs/1/*" \
    --aut_jar /home/ruestn/aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar \
    --aut_py /home/ruestn/aut/src/main/python \
    --spark /home/ruestn/spark-2.4.3-bin-hadoop2.7/bin \
    --master spark://127.0.1.1:7077 \
    --img_model ssd \
    --filter_size 100 100 \
    --output_path /home/ruestn/aut_318_test

I get:

$ ./run_detection.sh 
Traceback (most recent call last):
  File "/home/ruestn/aut/src/main/python/tf/detect.py", line 3, in <module>
    from util.init import *
  File "/home/ruestn/aut/src/main/python/tf/util/init.py", line 4, in <module>
    from pyspark import SparkConf, SparkContext, SQLContext
ModuleNotFoundError: No module named 'pyspark'

@ruebot
Member

ruebot commented Jun 5, 2019

Chatting with Leo in Slack; guess who did a 🤦‍♂️?

I was giving a path to Python, not PySpark, without having PySpark installed for Anaconda Python.
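One way to catch this kind of mix-up early is to ask the interpreter itself whether it can import PySpark. This is a minimal sketch; `module_available` is a hypothetical helper name, and it should be run with the same interpreter that PYSPARK_PYTHON points to.

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """Report whether a top-level module can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


# Show which interpreter is running and whether it can see PySpark.
print("interpreter:", sys.executable)
print("pyspark importable:", module_available("pyspark"))
```

If the second line prints False, detect.py would fail with the ModuleNotFoundError shown earlier in this thread.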

@ruebot
Member

ruebot commented Jun 6, 2019

First pass worked with some tweaks; changed "spark.cores.max", "48" and added "spark.network.timeout", "1000000".

We should definitely figure out a way to pass the Spark conf settings, since a user will definitely need to tweak them depending on their setup. I don't think we should have the conf settings hard coded in src/main/python/tf/util/init.py.

With auk we just pass a whole bunch of flags when we run Spark. That might not be ideal here since we already pass a lot of flags. Or we just roll with it. Or, we include a sample conf file in the repo, and tell folks to copy that and tweak it as needed.
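To make the options concrete, here is a rough sketch that combines both ideas: defaults, an optional conf file, and per-run overrides. The function name `build_spark_conf`, the JSON file format, and the `key=value` override syntax are all assumptions for illustration, not what the PR implements; the two default keys and values are the ones mentioned in this thread.

```python
import json
from typing import Dict, List, Optional

# Hypothetical defaults, taken from the values discussed in this thread.
DEFAULT_CONF = {
    "spark.cores.max": "48",
    "spark.network.timeout": "1000000",
}


def build_spark_conf(conf_path: Optional[str] = None,
                     overrides: Optional[List[str]] = None) -> Dict[str, str]:
    """Merge defaults, an optional JSON conf file, and key=value overrides."""
    conf = dict(DEFAULT_CONF)
    if conf_path:
        with open(conf_path) as f:
            # Spark conf values are strings, so normalise whatever JSON gives us.
            conf.update({k: str(v) for k, v in json.load(f).items()})
    for item in overrides or []:
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        conf[key.strip()] = value.strip()
    return conf


conf = build_spark_conf(overrides=["spark.cores.max=8"])
print(conf["spark.cores.max"])  # 8
```

The resulting dictionary could then be handed to PySpark, for example with SparkConf().setAll(list(conf.items())), so nothing needs to stay hard coded in init.py.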

What do you think @h324yang @lintool @ianmilligan1?

@ianmilligan1
Member

All of the options sound good to me for various reasons! But I think at this stage as a prototype function we could probably just have people add some flags and roll with it – down the line, perhaps as a separate issue, come up with a conf file to try to reduce some of the flag soup? @ruebot

@ruebot
Member

ruebot commented Jun 19, 2019

We might want to address this message from when we run the initial pass too:

WARNING:tensorflow:From /home/ruestn/aut/src/main/python/tf/model/object_detection.py:49: FastGFile.__init__ (from tensorflow.python.platform.gfile) is deprecated and will be removed in a future version.

@ruebot
Member

ruebot commented Jun 21, 2019

@h324yang did y'all get a lot of this when you ran the first pass script? Just trying to understand what's normal/expected behaviour here.

@h324yang
Contributor Author

@h324yang did y'all get a lot of this when you ran the first pass script? Just trying to understand what's normal/expected behaviour here.

Seems like an OOM error. The arguments I set in util/init.py were tuned for Tuna and ran well there. I got some errors, but I don't think OOM was a frequent one. Are you also running on Tuna?

Maybe a lower value of "spark.sql.execution.arrow.maxRecordsPerBatch" could help, e.g., 1280 -> 640. (Indeed, tuning such settings bothered me a lot :-/)

@ruebot
Member

ruebot commented Jun 24, 2019

@h324yang I ended up dropping it down to 320, and doing 10 WARCs instead of the previous attempts of doing 1000, and 100. It was a lot more stable with 10, and the initial job completed successfully.
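The manual tuning described here (1280, then 640, then 320) amounts to halving the batch size until a run succeeds. A sketch of that search follows; `first_stable_batch_size`, the `run_job` callback, and the floor of 160 are hypothetical, and `run_job` stands in for launching the real Spark job and reporting whether it finished.

```python
from typing import Callable, Optional


def first_stable_batch_size(run_job: Callable[[int], bool],
                            start: int = 1280,
                            floor: int = 160) -> Optional[int]:
    """Halve the batch size until run_job succeeds or the floor is passed."""
    size = start
    while size >= floor:
        if run_job(size):
            return size
        size //= 2
    return None


# Stand-in for a real run: pretend anything above 320 runs out of memory.
def fake_job(batch_size: int) -> bool:
    return batch_size <= 320


print(first_stable_batch_size(fake_job))  # 320
```

In practice each attempt is a full Spark run, so it is cheaper to try this on a small sample of WARCs first.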

@h324yang
Contributor Author

We might want to address this message from when we run the initial pass too:

WARNING:tensorflow:From /home/ruestn/aut/src/main/python/tf/model/object_detection.py:49: FastGFile.__init__ (from tensorflow.python.platform.gfile) is deprecated and will be removed in a future version.

I updated the code to the TF 1.14.0 API, i.e., tf.io.gfile.GFile.

@h324yang
Contributor Author

@ruebot I have made all the requested changes except for --img_model; I replied with the reason in the review thread. I also added a conf file. Please re-review the new commits.

Member

@ruebot ruebot left a comment


@h324yang we still have the model files. Those need to be pulled out. I don't believe we can distribute them based on a discussion with @lintool.

@h324yang
Contributor Author

h324yang commented Jul 3, 2019

Sorry! That slipped my mind. I have now removed it.
The model is from the TF detection model zoo: ssd_mobilenet_v1_fpn_coco ☆

We can download it and move frozen_inference_graph.pb to the designated folder aut/src/main/python/tf/model/graph/ssd_mobilenet_v1_fpn_640x640

For example:

wget http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03.tar.gz
tar -xzvf ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03.tar.gz
mkdir -p aut/src/main/python/tf/model/graph/ssd_mobilenet_v1_fpn_640x640/
cp ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03/frozen_inference_graph.pb aut/src/main/python/tf/model/graph/ssd_mobilenet_v1_fpn_640x640/
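The same unpacking step can be done from Python with the standard library, which may be handy inside a setup script. This is a sketch only: `extract_frozen_graph` is a hypothetical helper, and it assumes the archive has already been downloaded.

```python
import os
import shutil
import tarfile


def extract_frozen_graph(archive_path: str, target_dir: str,
                         member_name: str = "frozen_inference_graph.pb") -> str:
    """Copy the frozen graph out of a .tar.gz archive into target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            # Match on the file name only, whatever directory it sits in.
            if member.isfile() and os.path.basename(member.name) == member_name:
                out_path = os.path.join(target_dir, member_name)
                source = tar.extractfile(member)
                with source, open(out_path, "wb") as out:
                    shutil.copyfileobj(source, out)
                return out_path
    raise FileNotFoundError(f"{member_name} not found in {archive_path}")
```

Calling it with the downloaded tarball and aut/src/main/python/tf/model/graph/ssd_mobilenet_v1_fpn_640x640/ as the target directory would mirror the tar, mkdir, and cp commands above.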

Then we need the category mapping file mscoco_label_map.pbtxt, which can be downloaded from here and should also be moved to the designated folder aut/src/main/python/tf/model/category/

For example:

mkdir -p aut/src/main/python/tf/model/category/
cd aut/src/main/python/tf/model/category/
wget https://raw.githubusercontent.com/tensorflow/models/master/research/object_detection/data/mscoco_label_map.pbtxt
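For orientation, the label map is a text file of `item { ... }` blocks, each carrying an `id` and a `display_name`. The sketch below turns such text into an id-to-name dictionary with regular expressions; `parse_label_map` is a hypothetical helper and not necessarily how the PR reads the file, and the sample text is an illustrative excerpt in the same format.

```python
import re
from typing import Dict

ITEM_RE = re.compile(r"item\s*\{(.*?)\}", re.DOTALL)
ID_RE = re.compile(r"\bid:\s*(\d+)")
NAME_RE = re.compile(r'display_name:\s*"([^"]*)"')


def parse_label_map(text: str) -> Dict[int, str]:
    """Map each numeric class id to its human-readable display name."""
    labels = {}
    for body in ITEM_RE.findall(text):
        id_match = ID_RE.search(body)
        name_match = NAME_RE.search(body)
        # Skip malformed items rather than guessing at missing fields.
        if id_match and name_match:
            labels[int(id_match.group(1))] = name_match.group(1)
    return labels


# Illustrative excerpt in the label map format.
sample = '''
item {
  name: "/m/01g317"
  id: 1
  display_name: "person"
}
item {
  name: "/m/0199g"
  id: 2
  display_name: "bicycle"
}
'''
print(parse_label_map(sample))  # {1: 'person', 2: 'bicycle'}
```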

@ruebot ruebot merged commit 7a61f0e into archivesunleashed:master Jul 5, 2019