
[Issue#14129] Fix for spark.jsl.settings.storage.cluster_tmp_dir configuration #14132

Conversation

@jiamaozheng (Contributor) commented on Jan 11, 2024:

fixes #14129


Verifications

1. DBFS - AWS Databricks (DBR 9.1 LTS ML)

The Databricks notebook was adapted from spark-nlp-training-and-inference-example.

Before the fix:

  • Libraries:
spark-nlp==5.2.2
com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2
  • Advanced options -> Spark Config
spark.kryoserializer.buffer.max 2000M
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.jsl.settings.storage.cluster_tmp_dir dbfs:/tmp/spark_nlp/standard
  • Outcome:
    Error thrown from glove_embeddings = WordEmbeddingsModel.load("dbfs:/FileStore/pzn_ai/nlp_pretrained_models/glove_100d"):

_java.nio.file.AccessDeniedException: nvirginia-prod/42307XXX92305032/dbfs:/tmp/spark_nlp/standard/ca628fbc03c8_cdx/EMBEDDINGS_glove_100d/: PUT 0-byte object on nvirginia-prod/4230797092305032/dbfs:/tmp/spark_nlp/standard/ca628fbc03c8_cdx/EMBEDDINGS_glove_100d/: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied; request: PUT https://audix-prod-root.s3-fips.us-east-1.amazonaws.com nvirginia-prod/423079XXX305032/dbfs%3A/tmp/spark_nlp/standard/ca628fbc03c8_cdx/EMBEDDINGS_glove_100d/ {} Hadoop 2.7.4, aws-sdk-java/1.11.678 Linux/5.4.0-1116-aws-fips OpenJDK_64-Bit_Server_VM/25.362-b09 java/1.8.0_362 scala/2.12.10 vendor/Azul_Systems,_Inc. com.amazonaws.services.s3.model.PutObjectRequest; Request ID: 6T6PP67TRDG77BC3, Extended Request ID: /9WZK/wlhMxzFNR7j0NtCxqA5msaIFGj9HGOl8fOEJZ1G59sGls8uSqts31aryjXc6HHp99f1vo=, Cloud Provider: AWS, Instance ID: i-0855670b7e4a1edf4 (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 6T6PP67TRDG77BC3; S3 Extended Request ID: /9WZK/wlhMxzFNR7j0NtCxqA5msaIFGj9HGOl8fOEJZ1G59sGls8uSqts31aryjXc6HHp99f1vo=), S3 Extended Request ID: /9WZK/wlhMxzFNR7j0NtCxqA5msaIFGj9HGOl8fOEJZ1G59sGls8uSqts31aryjXc6HHp99f1vo=:AccessDenied_

  • FILES from spark.jsl.settings.storage.cluster_tmp_dir
dbutils.fs.ls('dbfs:/tmp/spark_nlp/standard/ca628fbc03c8_cdx/')
Out[2]: []

After the fix:

  • Libraries:
spark_nlp-5.2.2-py2.py3-none-any.whl (local build with the fix)
spark_nlp_assembly_5_2_2.jar (local build with the fix)
  • Advanced options -> Spark Config
spark.kryoserializer.buffer.max 2000M
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.jsl.settings.storage.cluster_tmp_dir dbfs:/tmp/spark_nlp/test
  • Outcome:
    Databricks notebook runs successfully.

  • FILES from spark.jsl.settings.storage.cluster_tmp_dir

dbutils.fs.ls('dbfs:/tmp/spark_nlp/test/7273b40b614e_cdx/EMBEDDINGS_glove_100d/')
Out[6]: 
[FileInfo(path='dbfs:/tmp/spark_nlp/test/7273b40b614e_cdx/EMBEDDINGS_glove_100d/000034.sst', name='000034.sst', size=33544959),
 FileInfo(path='dbfs:/tmp/spark_nlp/test/7273b40b614e_cdx/EMBEDDINGS_glove_100d/000036.sst', name='000036.sst', size=4206926),
........

Conclusion

The PR resolves #14129: with the fix, the embeddings index files are written to the configured dbfs:/ location (see the listing above), whereas before the fix the dbfs:/ prefix ended up embedded in an S3 object key and the write failed with AccessDenied.

2. S3 - AWS Databricks (not supported, as expected)

The Databricks notebook was adapted from spark-nlp-training-and-inference-example.

The settings were the same as above, except for spark.jsl.settings.storage.cluster_tmp_dir:

S3:

s3://audix-prod-1-rs-ephemeral/tmp/personalization_ml/spark_nlp/standard/

Error thrown from ner_model = ner_pipeline.fit(training_data):
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 19) (10.171.87.166 executor 1): org.apache.spark.SparkException: Failed to fetch s3://audix-prod-1-rs-ephemeral/tmp/personalization_ml/spark_nlp/standard/7ab643ad115f_cdx/EMBEDDINGS_glove_100d during dependency update

DBFS S3 bucket mounts:

dbfs:/mnt/audix-prod-1-ephemeral/tmp/personalization_ml/spark_nlp/standard/

Error thrown from ner_model = ner_pipeline.fit(training_data):
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 440.0 failed 4 times, most recent failure: Lost task 0.3 in stage 440.0 (TID 36114) (10.171.14.248 executor 286): org.apache.spark.SparkException: Failed to fetch dbfs:/mnt/audix-prod-1-ephemeral/tmp/personalization_ml/spark_nlp/d5e82b9b2bc1_cdx/EMBEDDINGS_glove_100d during dependency update

Conclusion

S3 paths and DBFS S3 bucket mounts are not supported for the spark.jsl.settings.storage.cluster_tmp_dir configuration.

3. Local - Jupyter Notebook (macOS Monterey v12.7.1)

The Jupyter notebooks were adapted from the Databricks notebook.

from pyspark.sql import SparkSession 

# before the fix
spark = SparkSession.builder \
    .appName("Spark NLP")\
    .master("local[*]")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2")\
    .config("spark.jsl.settings.storage.cluster_tmp_dir", "file:/tmp/spark_nlp/standard") \
    .getOrCreate()

# after the fix
spark = SparkSession.builder \
    .appName("Spark NLP")\
    .master("local[*]")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars", "/Users/xxxxxx/Desktop/spark-nlp/spark-nlp-assembly-5.2.2.jar") \
    .config("spark.driver.extraClassPath", "/Users/xxxxxxx/Desktop/spark-nlp/spark-nlp-assembly-5.2.2.jar") \
    .config("spark.jsl.settings.storage.cluster_tmp_dir", "file:/tmp/spark_nlp/test") \
    .getOrCreate()

from sparknlp.training import CoNLL

training_data = CoNLL().readDataset(spark, 'eng.train')
test_data = CoNLL().readDataset(spark, 'eng.testa')

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline  # needed for Pipeline(stages=[...]) below

max_epochs=1
lr=0.003
batch_size=32
random_seed=0
verbose=1
validation_split= 0.2
evaluation_log_extended= True
enable_output_logs= True
include_confidence= True
output_logs_path="/tmp/ner_logs/"
 
 
nerTagger = NerDLApproach()\
  .setInputCols(["sentence", "token", "embeddings"])\
  .setLabelColumn("label")\
  .setOutputCol("ner")\
  .setMaxEpochs(max_epochs)\
  .setLr(lr)\
  .setBatchSize(batch_size)\
  .setRandomSeed(random_seed)\
  .setVerbose(verbose)\
  .setValidationSplit(validation_split)\
  .setEvaluationLogExtended(evaluation_log_extended)\
  .setEnableOutputLogs(enable_output_logs)\
  .setIncludeConfidence(include_confidence)\
  .setOutputLogsPath(output_logs_path)

glove_embeddings = WordEmbeddingsModel.load('glove_100d')\
          .setInputCols(["document", "token"])\
          .setOutputCol("embeddings")
 
ner_pipeline = Pipeline(stages=[
          glove_embeddings,
          nerTagger
 ])
 
ner_model = ner_pipeline.fit(training_data)
  • Outcome:
    Both local Jupyter notebooks with and without the fix run successfully.
Training started - total epochs: 1 - lr: 0.003 - batch size: 32 - labels: 9 - chars: 84 - training examples: 11204
Epoch 1/1 started, lr: 0.003, dataset size: 11204
Epoch 1/1 - 58.17s - loss: 1358.5903 - batches: 353
Quality on validation dataset (20.0%), validation examples = 2240
time to finish evaluation: 6.71s
label	 tp	 fp	 fn	 prec	 rec	 f1
B-LOC	 1208	 59	 266	 0.95343333	 0.81953865	 0.88143015
I-ORG	 613	 287	 130	 0.6811111	 0.82503366	 0.746196
I-MISC	 85	 12	 124	 0.87628865	 0.40669855	 0.5555555
I-LOC	 120	 5	 152	 0.96	 0.44117647	 0.60453403
I-PER	 868	 16	 45	 0.98190045	 0.95071197	 0.96605456
B-MISC	 504	 52	 199	 0.9064748	 0.71692747	 0.8006354
B-ORG	 1153	 481	 134	 0.70563036	 0.8958819	 0.78945565
B-PER	 1260	 81	 111	 0.9395973	 0.9190372	 0.92920357
tp: 5811 fp: 993 fn: 1161 labels: 8
Macro-average	 prec: 0.87555444, rec: 0.7468757, f1: 0.80611223
Micro-average	 prec: 0.8540564, rec: 0.8334768, f1: 0.84364116

Conclusion

The PR does not change the behaviour of spark.jsl.settings.storage.cluster_tmp_dir for local runs. As expected, no intermediate files were generated by the local runs; only the empty directories shown below were created:

$ pwd
/tmp/spark_nlp
$ tree
├── standard
│   └── d0ec6772da79_cdx
└── test
    └── 88955cc442d7_cdx

4. HDFS - AWS EMR

The AWS EMR clusters were created following the guidelines in How to create EMR cluster via CLI.

Before the fix:

  • AWS CLI:
aws emr create-cluster \
--name sparknlp-standard \
--release-label emr-6.5.0 \
--applications Name=Hadoop Name=Spark Name=Hive \
--ec2-attributes KeyName=sparknlp \
--instance-type m5.xlarge \
--instance-count 3 \
--bootstrap-actions Path=s3://<xxxxx>.com/jsl_emr_bootstrap_standard.sh,Name=sparknlp \
--configurations "https://<xxxxxx>.com/standard.configuration.json" \
--use-default-roles
  • jsl_emr_bootstrap_standard.sh
#!/bin/bash
set -x -e

echo -e 'export PYSPARK_PYTHON=/usr/bin/python3
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_JARS_DIR=/usr/lib/spark/jars
export SPARK_HOME=/usr/lib/spark' >> $HOME/.bashrc && source $HOME/.bashrc

sudo python3 -m pip install awscli boto3 spark-nlp numpy

set +x
exit 0
  • standard.configuration.json
[{
  "Classification": "spark-env",
  "Properties": {},
  "Configurations": [{
    "Classification": "export",
    "Properties": {
      "PYSPARK_PYTHON": "/usr/bin/python3"
    }
  }]
},
{
  "Classification": "spark-defaults",
    "Properties": {
      "spark.yarn.stagingDir": "hdfs:///tmp",
      "spark.yarn.preserve.staging.files": "true",
      "spark.kryoserializer.buffer.max": "2000M",
      "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
      "spark.driver.maxResultSize": "0",
      "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.2.2",
      "spark.jsl.settings.storage.cluster_tmp_dir": "hdfs:///tmp/sparknlp/standard"
    }
}
]
  • pyspark script
from pyspark.sql import SparkSession
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import CoNLL
from pyspark.ml import Pipeline  # needed for Pipeline(stages=[...]) below

training_data = CoNLL().readDataset(spark, 'hdfs:///sparknlp/eng.train')
test_data = CoNLL().readDataset(spark, 'hdfs:///sparknlp/eng.testa')


max_epochs=1
lr=0.003
batch_size=32
random_seed=0
verbose=1
validation_split= 0.2
evaluation_log_extended= True
enable_output_logs= True
include_confidence= True
output_logs_path="hdfs:///tmp/ner_logs"
 
 
nerTagger = NerDLApproach()\
  .setInputCols(["sentence", "token", "embeddings"])\
  .setLabelColumn("label")\
  .setOutputCol("ner")\
  .setMaxEpochs(max_epochs)\
  .setLr(lr)\
  .setBatchSize(batch_size)\
  .setRandomSeed(random_seed)\
  .setVerbose(verbose)\
  .setValidationSplit(validation_split)\
  .setEvaluationLogExtended(evaluation_log_extended)\
  .setEnableOutputLogs(enable_output_logs)\
  .setIncludeConfidence(include_confidence)\
  .setOutputLogsPath(output_logs_path)

glove_embeddings = WordEmbeddingsModel.load('hdfs:///sparknlp/glove_100d')\
          .setInputCols(["document", "token"])\
          .setOutputCol("embeddings")
          
ner_pipeline = Pipeline(stages=[
          glove_embeddings,
          nerTagger
 ])
 
ner_model = ner_pipeline.fit(training_data)
  • Outcome:
    "java.io.IOException: Incomplete HDFS URI, no host" Exception thrown from glove_embeddings = WordEmbeddingsModel.load( 'hdfs:///sparknlp/glove_100d')

Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/lib/spark/python/pyspark/ml/util.py", line 332, in load return cls.read().load(path) File "/usr/lib/spark/python/pyspark/ml/util.py", line 282, in load java_obj = self._jread.load(path) File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/usr/lib/spark/python/pyspark/sql/utils.py", line 111, in deco return f(*a, **kw) File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o108.load. : java.io.IOException: Incomplete HDFS URI, no host: hdfs://ip-172-31-18-38.ec2.internal:8020hdfs:/tmp/sparknlp/standard/05ab51ba5bad_cdx/EMBEDDINGS_glove_100d at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:168) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3364) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:123) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3413) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3381) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:486) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365) at com.johnsnowlabs.storage.StorageHelper$.copyIndexToCluster(StorageHelper.scala:100) at com.johnsnowlabs.storage.StorageHelper$.sendToCluster(StorageHelper.scala:90) at com.johnsnowlabs.storage.StorageHelper$.load(StorageHelper.scala:50) at com.johnsnowlabs.storage.HasStorageModel.$anonfun$deserializeStorage$1(HasStorageModel.scala:43) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at com.johnsnowlabs.storage.HasStorageModel.deserializeStorage(HasStorageModel.scala:42) at com.johnsnowlabs.storage.HasStorageModel.deserializeStorage$(HasStorageModel.scala:40) at com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel.deserializeStorage(WordEmbeddingsModel.scala:147) at com.johnsnowlabs.storage.StorageReadable.readStorage(StorageReadable.scala:34) at com.johnsnowlabs.storage.StorageReadable.readStorage$(StorageReadable.scala:33) at com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel$.readStorage(WordEmbeddingsModel.scala:357) at com.johnsnowlabs.storage.StorageReadable.$anonfun$$init$$1(StorageReadable.scala:37) at com.johnsnowlabs.storage.StorageReadable.$anonfun$$init$$1$adapted(StorageReadable.scala:37) at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1(ParamsAndFeaturesReadable.scala:50) at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$onRead$1$adapted(ParamsAndFeaturesReadable.scala:49) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.onRead(ParamsAndFeaturesReadable.scala:49) at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1(ParamsAndFeaturesReadable.scala:61) at com.johnsnowlabs.nlp.ParamsAndFeaturesReadable.$anonfun$read$1$adapted(ParamsAndFeaturesReadable.scala:61) at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:38) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:750)

  • FILES from spark.jsl.settings.storage.cluster_tmp_dir
[hadoop@ip-172-31-18-38 ~]$ hdfs dfs -ls /tmp/sparknlp/standard/05ab51ba5bad_cdx
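The malformed URI in the traceback above points at the likely root cause: the cluster's default filesystem URI is string-prepended to a cluster_tmp_dir value that already carries its own scheme (see the StorageHelper path construction quoted later in this conversation). A minimal Scala sketch of that pre-fix construction, using the values visible in the error message (an illustration of the failure mode, not the actual Spark NLP source):

// Sketch only: how prepending the default FS URI to an already-schemed
// cluster_tmp_dir produces the rejected URI from the traceback above.
val fileSystemUri      = "hdfs://ip-172-31-18-38.ec2.internal:8020" // default FS of the EMR cluster
val clusterTmpLocation = "hdfs:/tmp/sparknlp/standard"              // configured cluster_tmp_dir
val clusterFilePath    = fileSystemUri + clusterTmpLocation + "/05ab51ba5bad_cdx/EMBEDDINGS_glove_100d"
// => "hdfs://ip-172-31-18-38.ec2.internal:8020hdfs:/tmp/sparknlp/standard/05ab51ba5bad_cdx/EMBEDDINGS_glove_100d"
// The second "hdfs:" is embedded in the middle of the URI, which Hadoop rejects
// with "java.io.IOException: Incomplete HDFS URI, no host".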

After the fix:

  • AWS CLI:
aws emr create-cluster \
--name sparknlp-test \
--release-label emr-6.5.0 \
--applications Name=Hadoop Name=Spark Name=Hive \
--ec2-attributes KeyName=sparknlp \
--instance-type m5.xlarge \
--instance-count 3 \
--bootstrap-actions Path=s3://<xxxxxx>.com/jsl_emr_bootstrap_test.sh,Name=sparknlptest \
--configurations "https://<xxxxxx>.com/test.configuration.json" \
--use-default-roles
  • jsl_emr_bootstrap_test.sh
#!/bin/bash
set -x -e

sudo aws s3 cp s3://<xxxxx>.com/spark-nlp-assembly-5.2.2.jar $SPARK_HOME/jars

sudo mkdir /whl 
sudo aws s3 cp s3://<xxxxx>.com/spark_nlp-5.2.2-py2.py3-none-any.whl /whl/

echo -e 'export PYSPARK_PYTHON=/usr/bin/python3
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_JARS_DIR=/usr/lib/spark/jars
export SPARK_HOME=/usr/lib/spark' >> $HOME/.bashrc && source $HOME/.bashrc

sudo python3 -m pip install awscli boto3 numpy /whl/spark_nlp-5.2.2-py2.py3-none-any.whl

set +x
exit 0
  • test.configuration.json
[{
  "Classification": "spark-env",
  "Properties": {},
  "Configurations": [{
    "Classification": "export",
    "Properties": {
      "PYSPARK_PYTHON": "/usr/bin/python3"
    }
  }]
},
{
  "Classification": "spark-defaults",
    "Properties": {
      "spark.yarn.stagingDir": "hdfs:///tmp",
      "spark.yarn.preserve.staging.files": "true",
      "spark.kryoserializer.buffer.max": "2000M",
      "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
      "spark.driver.maxResultSize": "0",
      "spark.executor.extraClassPath": "/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*",
      "spark.driver.extraClassPath": "/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*",
      "spark.jsl.settings.storage.cluster_tmp_dir": "hdfs:///tmp/sparknlp/test"
    }
}
]
  • pyspark script
from pyspark.sql import SparkSession
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import CoNLL
from pyspark.ml import Pipeline  # needed for Pipeline(stages=[...]) below

training_data = CoNLL().readDataset(spark, 'hdfs:///sparknlp/eng.train')
test_data = CoNLL().readDataset(spark, 'hdfs:///sparknlp/eng.testa')


max_epochs=1
lr=0.003
batch_size=32
random_seed=0
verbose=1
validation_split= 0.2
evaluation_log_extended= True
enable_output_logs= True
include_confidence= True
output_logs_path="hdfs:///tmp/ner_logs"
 
 
nerTagger = NerDLApproach()\
  .setInputCols(["sentence", "token", "embeddings"])\
  .setLabelColumn("label")\
  .setOutputCol("ner")\
  .setMaxEpochs(max_epochs)\
  .setLr(lr)\
  .setBatchSize(batch_size)\
  .setRandomSeed(random_seed)\
  .setVerbose(verbose)\
  .setValidationSplit(validation_split)\
  .setEvaluationLogExtended(evaluation_log_extended)\
  .setEnableOutputLogs(enable_output_logs)\
  .setIncludeConfidence(include_confidence)\
  .setOutputLogsPath(output_logs_path)

glove_embeddings = WordEmbeddingsModel.load('hdfs:///sparknlp/glove_100d')\
          .setInputCols(["document", "token"])\
          .setOutputCol("embeddings")
          
ner_pipeline = Pipeline(stages=[
          glove_embeddings,
          nerTagger
 ])
 
ner_model = ner_pipeline.fit(training_data)
  • Outcome:
    pyspark script runs successfully.
Training started - total epochs: 1 - lr: 0.003 - batch size: 32 - labels: 9 - chars: 84 - training examples: 11254
Epoch 1/1 started, lr: 0.003, dataset size: 11254
Epoch 1/1 - 75.34s - loss: 981.09045 - batches: 354
Quality on validation dataset (20.0%), validation examples = 2250
time to finish evaluation: 6.22s
label  tp  fp  fn  prec  rec   f1
B-LOC  1384  100   107   0.93261456  0.92823607  0.9304202
I-ORG  557   78  181   0.8771654   0.75474256  0.81136197
I-MISC   122   20  100   0.85915494  0.5495495   0.67032963
I-LOC  177   31  48  0.85096157  0.7866667   0.8175519
I-PER  929   48  20  0.95087004  0.97892517  0.9646936
B-MISC   539   49  128   0.9166667   0.80809593  0.8589641
B-ORG  1027  111   218   0.90246046  0.8248996   0.8619388
B-PER  1314  87  48  0.9379015   0.9647577   0.95114005
tp: 6049 fp: 524 fn: 850 labels: 8
Macro-average  prec: 0.9034744, rec: 0.8244842, f1: 0.86217386
Micro-average  prec: 0.9202799, rec: 0.87679374, f1: 0.89801073
  • FILES from spark.jsl.settings.storage.cluster_tmp_dir
[hadoop@ip-172-31-38-165 ~]$ hdfs dfs -ls /tmp/sparknlp/test/78ac1a86e9a2_cdx/EMBEDDINGS_glove_100d
Found 83 items
-rw-r--r--   1 hadoop hdfsadmingroup   33544959 2024-01-15 02:06 /tmp/sparknlp/test/78ac1a86e9a2_cdx/EMBEDDINGS_glove_100d/000034.sst
-rw-r--r--   1 hadoop hdfsadmingroup    4206926 2024-01-15 02:06 /tmp/sparknlp/test/78ac1a86e9a2_cdx/EMBEDDINGS_glove_100d/000036.sst
-rw-r--r--   1 hadoop hdfsadmingroup    4207987 2024-01-15 02:06 /tmp/sparknlp/test/78ac1a86e9a2_cdx/EMBEDDINGS_glove_100d/000039.sst
........

Conclusion

The PR resolves the HDFS failure (java.io.IOException: Incomplete HDFS URI, no host) for spark.jsl.settings.storage.cluster_tmp_dir, which is very similar to the DBFS failure reported in #14129.

@maziyarpanahi (Member) commented:

@danilojsl perhaps this part can have logic for different storage layers like s3, dbfs, hdfs, local, etc.?

val clusterFilePath: Path = {
     if (!getTmpLocation.matches("s3[a]?:/.*")) {
       Path.mergePaths(
         new Path(fileSystem.getUri.toString + clusterTmpLocation),
         new Path("/" + clusterFileName))
     } else new Path(clusterTmpLocation + "/" + clusterFileName)
   }

Could you please have a look and run some tests?
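For illustration only, one possible shape of such scheme-aware handling is sketched below. This is not the diff merged for this PR; the method name and scheme regex are assumptions. The idea is simply to prepend the default filesystem URI only when the configured location carries no scheme of its own:

import org.apache.hadoop.fs.Path

// Sketch: trust the scheme that cluster_tmp_dir already carries (dbfs:/, hdfs:///,
// s3a://, file:/ ...); fall back to the session's default filesystem otherwise.
def buildClusterFilePath(
    clusterTmpLocation: String,
    clusterFileName: String,
    defaultFsUri: String): Path = {
  val hasScheme = clusterTmpLocation.matches("^[A-Za-z][A-Za-z0-9+.-]*:/.*")
  val base =
    if (hasScheme) new Path(clusterTmpLocation)
    else Path.mergePaths(new Path(defaultFsUri), new Path("/" + clusterTmpLocation))
  new Path(base, clusterFileName)
}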

@jiamaozheng jiamaozheng marked this pull request as ready for review January 15, 2024 03:15
@jiamaozheng (Contributor, Author) commented:

> @danilojsl perhaps this part can have logic for different storage layers like s3, dbfs, hdfs, local, etc.? Could you please have a look and run some tests?

@maziyarpanahi, I have done some tests in HDFS, DBFS, S3, and local environments, and the outcomes of these runs are as expected. @danilojsl, please feel free to run more tests in your environments if needed. Also, please approve the three pending CI builds. Thanks.

@maziyarpanahi maziyarpanahi changed the base branch from master to release/523-release-candidate January 18, 2024 12:35
@danilojsl (Contributor) commented:
@maziyarpanahi I also ran several tests and the change is working. Thanks for the contribution @jiamaozheng

@jiamaozheng (Contributor, Author) commented:

> @maziyarpanahi I also ran several tests and the change is working. Thanks for the contribution @jiamaozheng

@maziyarpanahi, if there are no other concerns, would you please approve, merge, and release this bug fix? Thanks.

@maziyarpanahi (Member) commented:

> @maziyarpanahi I also ran several tests and the change is working. Thanks for the contribution @jiamaozheng
>
> @maziyarpanahi, if there are no other concerns, would you please approve, merge, and release this bug fix? Thanks.

Thanks @jiamaozheng
This will be merged into our next release candidate and will be included in the 5.2.4 release.

@maziyarpanahi maziyarpanahi changed the base branch from release/523-release-candidate to release/524-release-candidate January 24, 2024 06:48
@maziyarpanahi maziyarpanahi changed the base branch from release/524-release-candidate to release/530-release-candidate February 6, 2024 11:55
@maziyarpanahi maziyarpanahi merged commit 9377bb3 into JohnSnowLabs:release/530-release-candidate Feb 6, 2024
maziyarpanahi added a commit that referenced this pull request Feb 27, 2024
…date

* fixed all sbt warnings

* remove file system url prefix (#14132)

* SPARKNLP-942: MPNet Classifiers (#14147)

* SPARKNLP-942: MPNetForSequenceClassification

* SPARKNLP-942: MPNetForQuestionAnswering

* SPARKNLP-942: MPNet Classifiers Documentation

* Restore RobertaforQA bugfix

* adding import notebook + changing default model + adding onnx support (#14158)

* Sparknlp 876: Introducing LLAMA2  (#14148)

* introducing LLAMA2

* Added option to read model from model path to onnx wrapper

* Added option to read model from model path to onnx wrapper

* updated text description

* LLAMA2 python API

* added method to save onnx_data

* added position ids

* - updated Generate.scala to accept onnx tensors
- added beam search support for LLAMA2

* updated max input length

* updated python default params
changed test to slow test

* fixed serialization bug

* Doc sim rank as retriever (#14149)

* Added retrieval interface to the doc sim rank approach

* Added Python interface as retriever in doc sim ranker

---------

Co-authored-by: Stefano Lori <s.lori@izicap.com>

* 812 implement de berta for zero shot classification annotator (#14151)

* adding code

* adding notebook for import

---------

Co-authored-by: Maziyar Panahi <maziyar.panahi@iscpif.fr>

* Add notebook for fine tuning sbert (#14152)

* [SPARKNLP-986] Fixing optional input col validations (#14153)

* [SPARKNLP-984] Fixing Deberta notebooks URIs (#14154)

* SparkNLP 933: Introducing M2M100 : multilingual translation model (#14155)

* introducing LLAMA2

* Added option to read model from model path to onnx wrapper

* Added option to read model from model path to onnx wrapper

* updated text description

* LLAMA2 python API

* added method to save onnx_data

* added position ids

* - updated Generate.scala to accept onnx tensors
- added beam search support for LLAMA2

* updated max input length

* updated python default params
changed test to slow test

* fixed serialization bug

* Added Scala code for M2M100

* Documentation for scala code

* Python API for M2M100

* added more tests for scala

* added tests for python

* added pretrained

* rewording

* fixed serialization bug

* fixed serialization bug

---------

Co-authored-by: Maziyar Panahi <maziyar.panahi@iscpif.fr>

* SPARKNLP-985: Add flexible naming for onnx_data (#14165)

Some annotators might have different naming schemes
for their files. Added a parameter to control this.

* Add LLAMA2Transformer and M2M100Transformer to annotator

* Add LLAMA2Transformer and M2M100Transformer to ResourceDownloader

* bump version to 5.3.0 [skip test]

* SPARKNLP-999: Fix remote model loading for some onnx models

* used filesystem to check for the onnx_data file (#14169)

* [SPARKNLP-940] Adding changes to correctly copy cluster index storage… (#14167)

* [SPARKNLP-940] Adding changes to correctly copy cluster index storage when defined

* [SPARKNLP-940] Moving local mode control to its right place

* [SPARKNLP-940] Refactoring sentToCLuster method

* [SPARKNLP-988] Updating EntityRuler documentation (#14168)

* [SPARKNLP-940] Adding changes to support storage temp directory (cluster_tmp_dir)

* SPARKNLP-1000: Disable init_all_tables for GPT2 (#14177)

Fixes `java.lang.IllegalArgumentException: No Operation named [init_all_tables] in the Graph` when the model needs to be deserialized.
The deserialization is skipped when the model is already loaded (so it will only appear on the worker nodes and not the driver).

GPT2 does not contain tables and so does not require this command.

* fixes python documentation (#14172)

* revert MarianTransformer.scala

* revert HasBatchedAnnotate.scala

* revert Preprocessor.scala

* Revert ViTClassifier.scala

* disable hard exception

* Replace hard exception with soft logs (#14179)

This reverts commit eb91fde.

* move the example from root to examples/ [skip test]

* Cleanup some code [skip test]

* Update onnxruntime to 1.17.0 [skip test]

* Fix M2M100 default model's name [skip test]

* Update docs [run doc]

* Update Scala and Python APIs

---------

Co-authored-by: ahmedlone127 <ahmedlone127@gmail.com>
Co-authored-by: Jiamao Zheng <jiamaozheng@users.noreply.github.com>
Co-authored-by: Devin Ha <33089471+DevinTDHa@users.noreply.github.com>
Co-authored-by: Prabod Rathnayaka <prabod@rathnayaka.me>
Co-authored-by: Stefano Lori <wolliq@users.noreply.github.com>
Co-authored-by: Stefano Lori <s.lori@izicap.com>
Co-authored-by: Danilo Burbano <37355249+danilojsl@users.noreply.github.com>
Co-authored-by: Devin Ha <t.ha@tu-berlin.de>
Co-authored-by: Danilo Burbano <danilo@johnsnowlabs.com>
Co-authored-by: github-actions <action@github.com>

Successfully merging this pull request may close these issues.

Spark NLP Configuration's spark.jsl.settings.storage.cluster_tmp_dir: Databricks DBFS location does not work