Improve dataprep CI and fix pptx file ingesting bug #1334

Open
wants to merge 2 commits into
base: main
11 changes: 5 additions & 6 deletions comps/dataprep/src/utils.py
@@ -256,13 +256,12 @@ def load_pptx(pptx_path):
                 if table_contents:
                     text += table_contents + "\n"
             if hasattr(shape, "image") and hasattr(shape.image, "blob"):
-                img_path = f"./{shape.image.filename}"
-                with open(img_path, "wb") as f:
+                with tempfile.NamedTemporaryFile() as f:
                     f.write(shape.image.blob)
-                img_text = load_image(img_path)
-                if img_text:
-                    text += img_text + "\n"
-                os.remove(img_path)
+                    f.flush()
+                    img_text = load_image(f.name)
+                    if img_text:
+                        text += img_text + "\n"
     return text
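
For reference, the pattern this hunk adopts can be sketched in isolation. The snippet below is a minimal illustration, not the project's code: `read_via_tempfile` and `count_bytes` are hypothetical names, and `count_bytes` merely stands in for `load_image`. It assumes a POSIX system, where a `NamedTemporaryFile` can be reopened by name while it is still open; the Python documentation notes that this does not hold on Windows.

```python
import os
import tempfile


def read_via_tempfile(blob: bytes, reader):
    """Write blob to a temporary file, pass its path to reader, return the result."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(blob)
        # Push buffered bytes to disk before another consumer opens f.name.
        f.flush()
        # The reader must run inside the with block: the file is deleted on exit.
        return reader(f.name)


def count_bytes(path: str) -> int:
    """Hypothetical stand-in for load_image: report the size of the file at path."""
    return os.path.getsize(path)
```

The `flush()` call matters because the reader opens the file by path rather than through `f`, so bytes still sitting in the write buffer would otherwise be invisible to it. The file is removed when the `with` block exits, which replaces the explicit `os.remove` call and also covers the case where the reader raises.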


136 changes: 136 additions & 0 deletions tests/dataprep/dataprep_utils.sh
@@ -0,0 +1,136 @@
#!/usr/bin/env bash

# Copyright (C) 2025 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"

# call_curl <url> <http_header> <remaining params>
function call_curl() {
local url=$1
local header=$2
shift 2
# Quote "$@" and the response so arguments and bodies with spaces or glob characters survive intact
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -H "$header" "${url}" "$@")
HTTP_STATUS=$(echo "$HTTP_RESPONSE" | tr -d '\n' | sed -e 's/.*HTTPSTATUS://')
RESPONSE_BODY=$(echo "$HTTP_RESPONSE" | sed -e 's/HTTPSTATUS\:.*//g')
}

# _invoke_curl <fqdn> <port> <action> <remaining params passed to curl ...>
function _invoke_curl() {
local url="http://$1:$2/v1/dataprep/$3"
local action=$3
shift 3
local header=""
case $action in
ingest)
header='Content-Type: multipart/form-data'
;;
delete|get)
header='Content-Type: application/json'
;;
*)
echo "Error: Unsupported dataprep action $action!"
exit 1
;;
esac

# Quote "$@" and the response so arguments and bodies with spaces or glob characters survive intact
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -H "$header" "${url}" "$@")
HTTP_STATUS=$(echo "$HTTP_RESPONSE" | tr -d '\n' | sed -e 's/.*HTTPSTATUS://')
RESPONSE_BODY=$(echo "$HTTP_RESPONSE" | sed -e 's/HTTPSTATUS\:.*//g')
}

# validate_ingest <service fqdn> <port>
function ingest_doc() {
local fqdn=$1
local port=$2
shift 2
_invoke_curl $fqdn $port ingest -F "files=@${SCRIPT_DIR}/ingest_dataprep.doc" $@
}

function ingest_docx() {
local fqdn=$1
local port=$2
shift 2
_invoke_curl $fqdn $port ingest -F "files=@${SCRIPT_DIR}/ingest_dataprep.docx" $@
}

function ingest_pdf() {
local fqdn=$1
local port=$2
shift 2
_invoke_curl $fqdn $port ingest -F "files=@${SCRIPT_DIR}/ingest_dataprep.pdf" $@
}

function ingest_pptx() {
local fqdn=$1
local port=$2
shift 2
_invoke_curl $fqdn $port ingest -F "files=@${SCRIPT_DIR}/ingest_dataprep.pptx" $@
}

function ingest_txt() {
local fqdn=$1
local port=$2
shift 2
_invoke_curl $fqdn $port ingest -F "files=@${SCRIPT_DIR}/ingest_dataprep.txt" $@
}

function ingest_xlsx() {
local fqdn=$1
local port=$2
shift 2
_invoke_curl $fqdn $port ingest -F "files=@${SCRIPT_DIR}/ingest_dataprep.xlsx" $@
}

function ingest_external_link() {
local fqdn=$1
local port=$2
shift 2
_invoke_curl $fqdn $port ingest -F 'link_list=["https://www.ces.tech/"]' $@
}

function delete_all() {
local fqdn=$1
local port=$2
shift 2
_invoke_curl $fqdn $port delete -d '{"file_path":"all"}' $@
}

function delete_single() {
local fqdn=$1
local port=$2
shift 2
_invoke_curl $fqdn $port delete -d '{"file_path":"ingest_dataprep.txt"}' $@
}

function get_all() {
local fqdn=$1
local port=$2
shift 2
_invoke_curl $fqdn $port get $@
}

function check_result() {
local service_name=$1
local expected_response=$2
local container_name=$3
local logfile=$4
local http_status="${5:-200}"

if [ "$HTTP_STATUS" -ne ${http_status} ]; then
echo "[ $service_name ] HTTP status is not ${http_status}. Received status was $HTTP_STATUS"
docker logs $container_name >> $logfile
exit 1
else
echo "[ $service_name ] HTTP status is ${http_status}. Checking content..."
fi

# check response body
if [[ "$RESPONSE_BODY" != *"${expected_response}"* ]]; then
echo "[ $service_name ] Content does not match the expected result: $RESPONSE_BODY"
docker logs $container_name >> $logfile
exit 1
else
echo "[ $service_name ] Content is as expected."
fi
}
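
The helpers in this file rely on one trick: curl's `--write-out` appends `HTTPSTATUS:<code>` to the response body, `sed` splits the two apart again, and `check_result` then compares the status and looks for an expected substring in the body. As a rough illustration of that logic only, the same split and check can be sketched in Python. `split_response` and `response_ok` are hypothetical names that do not exist in the repository, and the payloads used with them are made up.

```python
from typing import Tuple

MARKER = "HTTPSTATUS:"


def split_response(raw: str) -> Tuple[str, int]:
    """Split '<body>HTTPSTATUS:<code>' into (body, code)."""
    # rpartition splits at the last marker, like the greedy sed pattern for the status.
    body, sep, status = raw.rpartition(MARKER)
    if not sep:
        raise ValueError("status marker not found in response")
    return body, int(status.strip())


def response_ok(raw: str, expected: str, expected_status: int = 200) -> bool:
    """Mirror check_result: the status must match and the body must contain expected."""
    body, status = split_response(raw)
    return status == expected_status and expected in body
```

Unlike the shell version, this sketch raises when the marker is missing instead of silently treating the whole response as the status, which is one way a failed curl invocation could be surfaced earlier.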
Binary file added tests/dataprep/ingest_dataprep.doc
Binary file not shown.
Binary file added tests/dataprep/ingest_dataprep.docx
Binary file not shown.
Binary file added tests/dataprep/ingest_dataprep.pdf
Binary file not shown.
Binary file added tests/dataprep/ingest_dataprep.pptx
Binary file not shown.
1 change: 1 addition & 0 deletions tests/dataprep/ingest_dataprep.txt
@@ -0,0 +1 @@
Like many companies in the O&G sector, the stock of Chevron (NYSE:CVX) has declined about 10% over the past 90-days despite the fact that Q2 consensus earnings estimates have risen sharply (~25%) during that same time frame. Over the years, Chevron has kept a very strong balance sheet. FirstEnergy (NYSE:FE – Get Rating) posted its earnings results on Tuesday. The utilities provider reported $0.53 earnings per share for the quarter, topping the consensus estimate of $0.52 by $0.01, RTT News reports. FirstEnergy had a net margin of 10.85% and a return on equity of 17.17%. The Dáil was almost suspended on Thursday afternoon after Sinn Féin TD John Brady walked across the chamber and placed an on-call pager in front of the Minister for Housing Darragh O’Brien during a debate on retained firefighters. Mr O’Brien said Mr Brady had taken part in an act of theatre that was obviously choreographed.Around 2,000 retained firefighters around the country staged a second day of industrial action on Tuesday and are due to start all out-strike action from next Tuesday. The mostly part-time workers, who keep the services going outside of Ireland’s larger urban centres, are taking industrial action in a dispute over pay and working conditions. Speaking in the Dáil, Sinn Féin deputy leader Pearse Doherty said firefighters had marched on Leinster House today and were very angry at the fact the Government will not intervene. Reintroduction of tax relief on mortgages needs to be considered, O’Brien says. Martin withdraws comment after saying People Before Profit would ‘put the jackboot on people’ Taoiseach ‘propagated fears’ farmers forced to rewet land due to nature restoration law – Cairns An intervention is required now. 
I’m asking you to make an improved offer in relation to pay for retained firefighters, Mr Doherty told the housing minister.I’m also asking you, and challenging you, to go outside after this Order of Business and meet with the firefighters because they are just fed up to the hilt in relation to what you said.Some of them have handed in their pagers to members of the Opposition and have challenged you to wear the pager for the next number of weeks, put up with an €8,600 retainer and not leave your community for the two and a half kilometres and see how you can stand over those type of pay and conditions. At this point, Mr Brady got up from his seat, walked across the chamber and placed the pager on the desk in front of Mr O’Brien. Ceann Comhairle Seán Ó Fearghaíl said the Sinn Féin TD was completely out of order and told him not to carry out a charade in this House, adding it was absolutely outrageous behaviour and not to be encouraged.Mr O’Brien said Mr Brady had engaged in an act of theatre here today which was obviously choreographed and was then interrupted with shouts from the Opposition benches. Mr Ó Fearghaíl said he would suspend the House if this racket continues.Mr O’Brien later said he said he was confident the dispute could be resolved and he had immense regard for firefighters. The minister said he would encourage the unions to re-engage with the State’s industrial relations process while also accusing Sinn Féin of using the issue for their own political gain.
Binary file added tests/dataprep/ingest_dataprep.xlsx
Binary file not shown.
86 changes: 32 additions & 54 deletions tests/dataprep/test_dataprep_elasticsearch.sh
@@ -10,6 +10,9 @@ ip_address=$(hostname -I | awk '{print $1}')
 DATAPREP_PORT=11100
 export TAG="comps"

+SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
+source ${SCRIPT_DIR}/dataprep_utils.sh
+
 function build_docker_images() {
     cd $WORKPATH
     echo $WORKPATH
@@ -40,64 +43,39 @@ function start_service() {
 }

 function validate_microservice() {
-    cd $LOG_PATH
-
-    # test /v1/dataprep
-    URL="http://${ip_address}:$DATAPREP_PORT/v1/dataprep/ingest"
-    echo "Deep learning is a subset of machine learning that utilizes neural networks with multiple layers to analyze various levels of abstract data representations. It enables computers to identify patterns and make decisions with minimal human intervention by learning from large amounts of data." > $LOG_PATH/dataprep_file.txt
-
-    HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" -X POST -F 'files=@./dataprep_file.txt' -H 'Content-Type: multipart/form-data' "$URL")
-    if [ "$HTTP_STATUS" -eq 200 ]; then
-        echo "[ dataprep ] HTTP status is 200. Checking content..."
-        cp ./dataprep_file.txt ./dataprep_file2.txt
-        local CONTENT=$(curl -s -X POST -F 'files=@./dataprep_file2.txt' -H 'Content-Type: multipart/form-data' "$URL" | tee ${LOG_PATH}/dataprep.log)
-
-        if echo "$CONTENT" | grep -q "Data preparation succeeded"; then
-            echo "[ dataprep ] Content is as expected."
-        else
-            echo "[ dataprep ] Content does not match the expected result: $CONTENT"
-            docker logs dataprep-elasticsearch >> ${LOG_PATH}/dataprep.log
-            exit 1
-        fi
-    else
-        echo "[ dataprep ] HTTP status is not 200. Received status was $HTTP_STATUS"
-        docker logs dataprep-elasticsearch >> ${LOG_PATH}/dataprep.log
-        exit 1
-    fi
+    # test /v1/dataprep/ingest upload file
+    ingest_doc ${ip_address} ${DATAPREP_PORT}
+    check_result "dataprep - upload - doc" "Data preparation succeeded" dataprep-elasticsearch ${LOG_PATH}/dataprep_elastic.log

-    # test /v1/dataprep/get_file
-    URL="http://${ip_address}:$DATAPREP_PORT/v1/dataprep/get"
-    HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" -X POST -H 'Content-Type: application/json' "$URL")
-    if [ "$HTTP_STATUS" -eq 200 ]; then
-        echo "[ dataprep - file ] HTTP status is 200. Checking content..."
-        local CONTENT=$(curl -s -X POST -H 'Content-Type: application/json' "$URL" | tee ${LOG_PATH}/dataprep_file.log)
-
-        if echo "$CONTENT" | grep -q '{"name":'; then
-            echo "[ dataprep - file ] Content is as expected."
-        else
-            echo "[ dataprep - file ] Content does not match the expected result: $CONTENT"
-            docker logs dataprep-elasticsearch >> ${LOG_PATH}/dataprep_file.log
-            exit 1
-        fi
-    else
-        echo "[ dataprep - file ] HTTP status is not 200. Received status was $HTTP_STATUS"
-        docker logs dataprep-elasticsearch >> ${LOG_PATH}/dataprep_file.log
-        exit 1
-    fi
+    ingest_docx ${ip_address} ${DATAPREP_PORT}
+    check_result "dataprep - upload - docx" "Data preparation succeeded" dataprep-elasticsearch ${LOG_PATH}/dataprep_elastic.log

-    # test /v1/dataprep/delete_file
-    URL="http://${ip_address}:$DATAPREP_PORT/v1/dataprep/delete"
-    HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" -X POST -d '{"file_path": "dataprep_file.txt"}' -H 'Content-Type: application/json' "$URL")
-    if [ "$HTTP_STATUS" -eq 200 ]; then
-        echo "[ dataprep - del ] HTTP status is 200."
-        docker logs dataprep-elasticsearch >> ${LOG_PATH}/dataprep_del.log
-    else
-        echo "[ dataprep - del ] HTTP status is not 200. Received status was $HTTP_STATUS"
-        docker logs dataprep-elasticsearch >> ${LOG_PATH}/dataprep_del.log
-        exit 1
-    fi
+    ingest_pdf ${ip_address} ${DATAPREP_PORT}
+    check_result "dataprep - upload - pdf" "Data preparation succeeded" dataprep-elasticsearch ${LOG_PATH}/dataprep_elastic.log
+
+    ingest_pptx ${ip_address} ${DATAPREP_PORT}
+    check_result "dataprep - upload - pptx" "Data preparation succeeded" dataprep-elasticsearch ${LOG_PATH}/dataprep_elastic.log
+
+    ingest_txt ${ip_address} ${DATAPREP_PORT}
+    check_result "dataprep - upload - txt" "Data preparation succeeded" dataprep-elasticsearch ${LOG_PATH}/dataprep_elastic.log
+
+    ingest_xlsx ${ip_address} ${DATAPREP_PORT}
+    check_result "dataprep - upload - xlsx" "Data preparation succeeded" dataprep-elasticsearch ${LOG_PATH}/dataprep_elastic.log
+
+    # test /v1/dataprep/ingest upload link
+    ingest_external_link ${ip_address} ${DATAPREP_PORT}
+    check_result "dataprep - upload - link" "Data preparation succeeded" dataprep-elasticsearch ${LOG_PATH}/dataprep_elastic.log
+
+    # test /v1/dataprep/get
+    get_all ${ip_address} ${DATAPREP_PORT}
+    check_result "dataprep - get" '{"name":' dataprep-elasticsearch ${LOG_PATH}/dataprep_elastic.log
+
+    # test /v1/dataprep/delete
+    delete_single ${ip_address} ${DATAPREP_PORT}
+    check_result "dataprep - del" '{"status":true}' dataprep-elasticsearch ${LOG_PATH}/dataprep_elastic.log
 }

+
 function stop_docker() {
     cid=$(docker ps -aq --filter "name=elasticsearch-vector-db" --filter "name=dataprep-elasticsearch")
     if [[ ! -z "$cid" ]]; then docker stop $cid && docker rm $cid && sleep 1s; fi
98 changes: 27 additions & 71 deletions tests/dataprep/test_dataprep_milvus.sh
@@ -11,6 +11,9 @@ DATAPREP_PORT=11101
 service_name="dataprep-milvus tei-embedding-serving etcd minio standalone"
 export TAG="comps"

+SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
+source ${SCRIPT_DIR}/dataprep_utils.sh
+
 function build_docker_images() {
     cd $WORKPATH
     echo $(pwd)
@@ -38,84 +41,37 @@ function start_service() {
     sleep 1m
 }

-function validate_service() {
-    local URL="$1"
-    local EXPECTED_RESULT="$2"
-    local SERVICE_NAME="$3"
-    local DOCKER_NAME="$4"
-    local INPUT_DATA="$5"
-
-    if [[ $SERVICE_NAME == *"dataprep_upload_file"* ]]; then
-        cd $LOG_PATH
-        HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F 'files=@./dataprep_file.txt' -H 'Content-Type: multipart/form-data' "$URL")
-    elif [[ $SERVICE_NAME == *"dataprep_upload_link"* ]]; then
-        HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F 'link_list=["https://www.ces.tech/"]' -F 'chunk_size=400' "$URL")
-    elif [[ $SERVICE_NAME == *"dataprep_get"* ]]; then
-        HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -H 'Content-Type: application/json' "$URL")
-    elif [[ $SERVICE_NAME == *"dataprep_del"* ]]; then
-        HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -d '{"file_path": "all"}' -H 'Content-Type: application/json' "$URL")
-    else
-        HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -d "$INPUT_DATA" -H 'Content-Type: application/json' "$URL")
-    fi
-    HTTP_STATUS=$(echo $HTTP_RESPONSE | tr -d '\n' | sed -e 's/.*HTTPSTATUS://')
-    RESPONSE_BODY=$(echo $HTTP_RESPONSE | sed -e 's/HTTPSTATUS\:.*//g')
+function validate_microservice() {
+    # test /v1/dataprep/ingest upload file
+    ingest_doc ${ip_address} ${DATAPREP_PORT}
+    check_result "dataprep - upload - doc" "Data preparation succeeded" dataprep-milvus-server ${LOG_PATH}/dataprep_milvus.log

-    docker logs ${DOCKER_NAME} >> ${LOG_PATH}/${SERVICE_NAME}.log
+    ingest_docx ${ip_address} ${DATAPREP_PORT}
+    check_result "dataprep - upload - docx" "Data preparation succeeded" dataprep-milvus-server ${LOG_PATH}/dataprep_milvus.log

-    # check response status
-    if [ "$HTTP_STATUS" -ne "200" ]; then
-        echo "[ $SERVICE_NAME ] HTTP status is not 200. Received status was $HTTP_STATUS"
+    ingest_pdf ${ip_address} ${DATAPREP_PORT}
+    check_result "dataprep - upload - pdf" "Data preparation succeeded" dataprep-milvus-server ${LOG_PATH}/dataprep_milvus.log

-        if [[ $SERVICE_NAME == *"dataprep_upload_link"* ]]; then
-            docker logs test-comps-dataprep-milvus-tei-server >> ${LOG_PATH}/tei-embedding.log
-        fi
-        exit 1
-    else
-        echo "[ $SERVICE_NAME ] HTTP status is 200. Checking content..."
-    fi
-    # check response body
-    if [[ "$RESPONSE_BODY" != *"$EXPECTED_RESULT"* ]]; then
-        echo "[ $SERVICE_NAME ] Content does not match the expected result: $RESPONSE_BODY"
-        exit 1
-    else
-        echo "[ $SERVICE_NAME ] Content is as expected."
-    fi
+    ingest_pptx ${ip_address} ${DATAPREP_PORT}
+    check_result "dataprep - upload - pptx" "Data preparation succeeded" dataprep-milvus-server ${LOG_PATH}/dataprep_milvus.log

-    sleep 5s
-}
+    ingest_txt ${ip_address} ${DATAPREP_PORT}
+    check_result "dataprep - upload - txt" "Data preparation succeeded" dataprep-milvus-server ${LOG_PATH}/dataprep_milvus.log

-function validate_microservice() {
-    cd $LOG_PATH
+    ingest_xlsx ${ip_address} ${DATAPREP_PORT}
+    check_result "dataprep - upload - xlsx" "Data preparation succeeded" dataprep-milvus-server ${LOG_PATH}/dataprep_milvus.log

-    # test /v1/dataprep/delete
-    validate_service \
-        "http://${ip_address}:${DATAPREP_PORT}/v1/dataprep/delete" \
-        '{"status":true}' \
-        "dataprep_del" \
-        "dataprep-milvus-server"
-
-    # test /v1/dataprep upload file
-    echo "Deep learning is a subset of machine learning that utilizes neural networks with multiple layers to analyze various levels of abstract data representations. It enables computers to identify patterns and make decisions with minimal human intervention by learning from large amounts of data." > $LOG_PATH/dataprep_file.txt
-    validate_service \
-        "http://${ip_address}:${DATAPREP_PORT}/v1/dataprep/ingest" \
-        "Data preparation succeeded" \
-        "dataprep_upload_file" \
-        "dataprep-milvus-server"
-
-    # test /v1/dataprep upload link
-    validate_service \
-        "http://${ip_address}:${DATAPREP_PORT}/v1/dataprep/ingest" \
-        "Data preparation succeeded" \
-        "dataprep_upload_link" \
-        "dataprep-milvus-server"
-
-    # test /v1/dataprep/get_file
-    validate_service \
-        "http://${ip_address}:${DATAPREP_PORT}/v1/dataprep/get" \
-        '{"name":' \
-        "dataprep_get" \
-        "dataprep-milvus-server"
+    # test /v1/dataprep/ingest upload link
+    ingest_external_link ${ip_address} ${DATAPREP_PORT}
+    check_result "dataprep - upload - link" "Data preparation succeeded" dataprep-milvus-server ${LOG_PATH}/dataprep_milvus.log

+    # test /v1/dataprep/get
+    get_all ${ip_address} ${DATAPREP_PORT}
+    check_result "dataprep - get" '{"name":' dataprep-milvus-server ${LOG_PATH}/dataprep_milvus.log
+
+    # test /v1/dataprep/delete
+    delete_all ${ip_address} ${DATAPREP_PORT}
+    check_result "dataprep - del" '{"status":true}' dataprep-milvus-server ${LOG_PATH}/dataprep_milvus.log
 }

 function stop_docker() {