POC of a scoring REST service for Spark ML Pipelines.
Serve Apache Spark pipelines from Kubernetes:
REST JSON Request -> K8s service -> Apache Spark
The repository's data/pipline-archive folder contains a PySpark pipeline. Create a simple pipeline and save it:
from pyspark import SparkContext
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql.session import SparkSession

sc = SparkContext("local[2]", "SparkML")
spark = SparkSession(sc)
# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([
(0, "a b c d e spark", 1.0),
(1, "b d", 0.0),
(2, "spark f g h", 1.0),
(3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])
# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
# Fit the pipeline to training documents.
model = pipeline.fit(training)
# Save the fitted pipeline model to disk so the server can load it.
model.save("/some-where-ml/project1/pipeline")
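
To sanity-check the saved model, you can load it back and score a few unlabeled rows (a minimal sketch using the same SparkSession and save path as above):

from pyspark.ml import PipelineModel

# Reload the saved pipeline and run it on new text.
model = PipelineModel.load("/some-where-ml/project1/pipeline")
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n")
], ["id", "text"])
model.transform(test).select("id", "text", "prediction").show()

A JSON descriptor (kept with the saved pipeline in the project folder) names the pipeline, declares the input schema, gives a sample payload, and lists the output columns: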
{
  "name": "Test pipeline",
  "pipeline": "simple_pipline1",
  "schema": [
    {"name": "name", "type": "STRING", "isNullable": false}
  ],
  "sample": {
    "name": "Bob"
  },
  "output": ["name", "features"]
}
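
To see how the descriptor maps onto Spark, here is a sketch (an illustration, not the server's actual code; the filename descriptor.json is assumed) that builds the declared schema and a DataFrame from the sample payload:

import json
from pyspark.sql.types import StringType, StructField, StructType

# Hypothetical mapping from descriptor type names to Spark types;
# the real server may support more than STRING.
type_map = {"STRING": StringType()}

with open("descriptor.json") as f:  # assumed filename
    desc = json.load(f)

schema = StructType([
    StructField(col["name"], type_map[col["type"]], col["isNullable"])
    for col in desc["schema"]
])
df = spark.createDataFrame([desc["sample"]], schema)
df.show()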
Now zip the folder /some-where-ml/project1 and put the archive in your pipelines.folder (note that the zip's base name, spark-sample-pipeline, matches the pipeline name used in the REST path below):

zip -r spark-sample-pipeline.zip /some-where-ml/project1
In application.properties, set pipelines.folder to your pipeline store folder, or pass it as a parameter at run time (like below).
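
For example, with the folder used above:

pipelines.folder=/some-where-ml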
# Build with Maven
mvn install package
# Run
java -Dpipelines.folder=/some-where-ml -jar target/mlserver.jar
curl -X POST \
-H "Content-Type: application/json" \
-d '[{"text":"Yehuda"}]' \
http://localhost:9900/predict/spark-sample-pipeline
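
Because the request body is a JSON array, several rows can be sent in one call (assuming the server scores every element, which the array form suggests):

curl -X POST \
  -H "Content-Type: application/json" \
  -d '[{"text":"spark f g h"},{"text":"hadoop mapreduce"}]' \
  http://localhost:9900/predict/spark-sample-pipeline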
- Add warm-up for models (a sketch follows this list)
- Dockerize and serve it as a service
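
One plausible warm-up, sketched under the assumption that the descriptor's sample payload exists for exactly this purpose: score it once at load time so Spark's lazy initialization happens before the first real request.

from pyspark.ml import PipelineModel

def warm_up(spark, model_path, sample_row, schema):
    # Load the pipeline and run one throwaway prediction;
    # collect() forces execution, since transform() is lazy.
    model = PipelineModel.load(model_path)
    df = spark.createDataFrame([sample_row], schema)
    model.transform(df).collect()
    return model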
To build the Docker image:

mvn clean install package && docker build -t alefbt/spring-mlspark-serving .

Then run it (the container listens on 8080; the flag maps it to host port 9900):

docker run --rm -it -p 9900:8080 alefbt/spring-mlspark-serving
- This code is not optimized for sub-second serving. That is possible (I did it on another project ;-) ), but you would need to add some caching.
- Code contributions are welcome!
- Remember: this project is a POC.
- FIXED: Snappy (the xerial.snappy package) causes problems when running on a vanilla openjdk:8-jdk-alpine Docker image; see the Dockerfile, which explains the workaround.
MIT - Free.
Use and contribute.