Unclear spark/databricks support #2262

Closed
guybartal opened this issue Apr 30, 2023 · 17 comments · Fixed by #2339

@guybartal

Expected behavior

Spark drivers included in WebAPI

Actual behavior

We get an error on WebAPI application startup:

2023-04-30T08:00:58.824720185Z 2023-04-30 08:00:58.824 INFO main org.ohdsi.webapi.DataAccessConfig - [] - error loading com.simba.spark.jdbc.Driver

Steps to reproduce behavior

Start WebAPI application with ohdsi/webapi:2.12.1 docker image

Description

The steps to support a Spark source are unclear and undocumented.
What should we do to use a Spark source without changing the Dockerfile?

The steps we took to load the Spark drivers were (condensed into a sketch after the list):

  1. Download JDBC spark drivers (version 2.6.22) from Databricks website: https://www.databricks.com/spark/jdbc-drivers-archive
  2. Extract and rename the driver jar file to /src/main/extras/spark/spark-2.6.22.1040.jar
  3. Edit Dockerfile:
  • Comment out these lines:
# Download dependencies
COPY pom.xml /code/
# RUN mkdir .git \
#     && mvn package \
#      -P${MAVEN_PROFILE}
  • Add an initialize step to the mvn command:
# Compile code and repackage it
COPY src /code/src
RUN mvn initialize \
    -Dgit.branch=${GIT_BRANCH} \
    -Dgit.commit.id.abbrev=${GIT_COMMIT_ID_ABBREV} \
    -P${MAVEN_PROFILE} \
    && mvn package \
    -Dgit.branch=${GIT_BRANCH} \
    -Dgit.commit.id.abbrev=${GIT_COMMIT_ID_ABBREV} \
    -P${MAVEN_PROFILE} \
    && mkdir war \
    && mv target/WebAPI.war war \
    && cd war \
    && jar -xf WebAPI.war \
    && rm WebAPI.war
  4. Build the Dockerfile with the following build args and maven profile:
docker build -t webapi --build-arg MAVEN_PROFILE=webapi-docker,webapi-spark --build-arg GIT_BRANCH=v2.12.1 .
  5. Run the custom image and see startup logs:
    2023-04-30T08:00:58.824720185Z 2023-04-30 08:00:58.824 INFO main org.ohdsi.webapi.DataAccessConfig - [] - driver loaded: com.simba.spark.jdbc.Driver
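
For reference, the workaround above can be condensed into a short build script. This is only a sketch: the name of the extracted driver jar (SparkJDBC42.jar below) and the local paths are assumptions, and the actual archive must be downloaded from the Databricks JDBC driver archive page linked in step 1.

# Sketch of the manual workaround above; jar name and local paths are assumptions.
# Place the Simba Spark JDBC driver where the webapi-spark profile expects it:
mkdir -p src/main/extras/spark
cp /path/to/extracted/SparkJDBC42.jar src/main/extras/spark/spark-2.6.22.1040.jar
# Build the image with the Dockerfile modified as in step 3 (initialize + package):
docker build -t webapi \
  --build-arg MAVEN_PROFILE=webapi-docker,webapi-spark \
  --build-arg GIT_BRANCH=v2.12.1 .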
@guybartal (Author)

@anton-abushkevich

@konstjar (Contributor) commented Apr 30, 2023

You may need to specify the exact path to the driver in pom.xml.

@guybartal (Author)

Thanks @konstjar but it's already set by default to:

<spark.classpath>${basedir}/src/main/extras/spark</spark.classpath>

And as I wrote above in item 2 of the list, we extracted and copied the driver to that path.

But we are looking for a way to use the WebAPI image, or at least build it, without needing to fork the repository and change the source code; for example, we saw that the public image supports Redshift, PostgreSQL, and MS SQL Server.

@konstjar (Contributor) commented May 1, 2023

@guybartal I'm not sure that this path is correct. We specify the exact path in our build.

Also, I see there are some copy instructions after the build; maybe they cause issues. In our case we build the .war file and run the application under a Tomcat server.

@leeevans Do you have an idea?

@guybartal (Author)

@leeevans did you have time to look at this? We'd appreciate more clarity on this matter.
Thanks!

@anatbal commented May 11, 2023

Hello @konstjar, @leeevans. It seems that some other folks have faced this issue in the past. As you can see here and here, the OHDSI WebAPI repo was forked, changes were made to the Dockerfile and pom file, and the jar file was added. What we are looking for, especially since others have tried to tackle this in the past, is a way to build OHDSI WebAPI with Spark "almost" out of the box, meaning passing a build argument to the Dockerfile. What are your thoughts? Are there any plans to make OHDSI WebAPI with Spark simpler to build, or is it something you are open for us to contribute?

@konstjar (Contributor)

@anatbal As far as I know there are no plans for making "Spark simpler to build". Feel free to submit a PR.

@anthonysena (Collaborator)

Bumping this as I'm trying to build WebAPI with support for Databricks with @fdefalco. I ran into the same issues as @anatbal @guybartal. I was hoping it would be as simple as using Maven to compile the project with the webapi-spark profile included in the command. Unfortunately, I've run into issues since that driver (2.6.22.1040) no longer appears to be on Maven Central. I've made some attempts to update the pom.xml to use the latest JDBC drivers, but it is still attempting to find this file in

<spark.classpath>${basedir}/src/main/extras/spark</spark.classpath>

which is not currently a directory under the ${basedir}.

Tagging @TomWhite-MedStar @konstjar hoping they can share any insights for compiling/deploying WebAPI w/ spark support.

@konstjar (Contributor)

@anthonysena I do not think we do anything different.

We keep the Spark drivers in an external folder, which is why we specify the path using the spark.classpath parameter during the build process:

mvn --batch-mode initialize -Dspark.classpath=${SPARK_PATH} -Pwebapi-spark,webapi-snowflake
mvn --batch-mode package -e -X -Dspark.classpath=${SPARK_PATH} -Dmaven.test.skip -Pwebapi-spark,webapi-snowflake
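
For example (a hypothetical illustration, not our actual pipeline), if the jar sits at /opt/jdbc-drivers/spark/spark-2.6.22.1040.jar, the two-step build would be:

# SPARK_PATH is a hypothetical local directory containing spark-2.6.22.1040.jar;
# install-file picks the jar up from there during the initialize phase.
export SPARK_PATH=/opt/jdbc-drivers/spark
mvn --batch-mode initialize -Dspark.classpath=${SPARK_PATH} -Pwebapi-spark,webapi-snowflake
mvn --batch-mode package -e -X -Dspark.classpath=${SPARK_PATH} -Dmaven.test.skip -Pwebapi-spark,webapi-snowflake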

@konstjar (Contributor)

And you can grab the drivers manually here

@chrisknoll (Collaborator)

@konstjar, just a clarification on the Maven lifecycle (maybe this worked once but not anymore?).

From pom.xml:

    <profile>
      <id>webapi-spark</id>
      <properties>
        <spark.enabled>true</spark.enabled>
        <!-- Spark JDBC driver path -->
        <spark.classpath>${basedir}/src/main/extras/spark</spark.classpath>
      </properties>
      <dependencies>
        <dependency>
          <groupId>com.simba</groupId>
          <artifactId>spark</artifactId>
          <version>2.6.22.1040</version>
        </dependency>
      </dependencies>
      <build>
        <plugins>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-install-plugin</artifactId>
            <version>3.1.0</version>
            <executions>
              <execution>
                <id>spark-jdbc</id>
                <phase>initialize</phase>
                <goals>
                  <goal>install-file</goal>
                </goals>
                <configuration>
                  <groupId>com.simba</groupId>
                  <artifactId>spark</artifactId>
                  <version>2.6.22.1040</version>
                  <packaging>jar</packaging>
                  <file>${spark.classpath}/spark-2.6.22.1040.jar</file>
                </configuration>
              </execution>
            </executions>
          </plugin>
        </plugins>
      </build>
    </profile>

This seems to be calling out that there should be an action during the initialize phase to perform the install-file goal, so we'd expect the artifact to be installed any time the Maven toolchain is invoked in a way that includes the initialize phase (which is most of them... maybe not clean, but definitely package). So do you know if there's something that used to work in the past (but no longer) where you can't do the install-file goal in the same mvn command that will use the installed dependency (i.e. where the drivers get installed into the m2 cache)? My theory is that we're getting the failure because all dependencies are resolved at mvn launch, and any deps that are installed during the build will not be seen by the same build command... which is why it might work for you, where you do the install (via initialize) in your first command and then run the package command in the next (where the deps have been installed into the Maven repo via install-file).
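
A minimal sketch of the two-invocation workaround this theory implies, assuming the driver jar is already at the location configured in spark.classpath:

# First invocation: the install-file execution bound to the initialize phase
# copies the driver jar into the local ~/.m2 repository.
mvn initialize -Pwebapi-spark
# Second invocation: dependencies are resolved at startup of this fresh run,
# so com.simba:spark:2.6.22.1040 is now found in ~/.m2 and the war is packaged.
mvn package -Pwebapi-spark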

@konstjar (Contributor)

@chrisknoll Thank you for the comments.

We do CI/CD builds internally using pipelines, and every time they create the build environment from scratch. That's why I can confirm it's not a one-time success case.

Since we build a WebAPI version with all available profiles, we know that the initialize command should be run in advance, so most of the JDBC drivers are cached in the m2 repository before the package command. This approach has been in use for a long time already.

You can see in the original message that two commands are used, initialize and package, and I do not think this is a problem.
The only difference is the path to the drivers and how it is passed to Maven.

@chrisknoll (Collaborator)

Thanks @konstjar. My question was whether the two separate commands are redundant, since the second command should cover the initialize lifecycle phase because the package target includes initialize.

It's fine if it must be a two-phase build in Maven when using external JDBC drivers; I just don't think our documentation reflects that.

@konstjar (Contributor)

Yes, I can confirm that two commands are required when additional JDBC profiles are in use.

Though my understanding of the lifecycle phases was the same; I thought it would be enough to have just package.

@chrisknoll (Collaborator)

I'm no expert on the Maven pipeline, but my guess is that when the command starts up it has a cache of all the installed dependencies, so anything that gets added to the m2 repo during the pipeline isn't known about until the process restarts, which is why we need to do it in two different invocations. Lame, but it would explain it.

anthonysena linked a pull request Jan 23, 2024 that will close this issue
@anthonysena (Collaborator)

I put together a PR that will both update the driver and remove the Maven install goal discussed here. I was able to deploy a version of WebAPI with a recent Databricks driver; here is the Maven command I used:

mvn clean package -s D:/apache-maven/apache-maven-3.9.6/conf/settings.xml -DskipTests=true -DskipUnitTests=true -DskipITtests=true -Pwebapi-postgresql-security-enabled,webapi-solr,webapi-spark

You will notice that all I did was add webapi-spark to the list of profiles, and that was enough to include the Databricks driver in the WebAPI.war.

@anthonysena (Collaborator)

Update here based on review of #2339: we've decided to add a new profile to WebAPI's pom.xml called 'webapi-databricks' to allow using the latest Databricks driver. This preserves backwards compatibility for those who are currently using the previous Spark driver and may have connection strings that start with jdbc:spark.... Changing that profile would then require changing the respective connection strings, etc.

Once #2339 is merged, the command to build WebAPI with the Databricks driver would look like this:

mvn clean package -s D:/apache-maven/apache-maven-3.9.6/conf/settings.xml -DskipTests=true -DskipUnitTests=true -DskipITtests=true -Pwebapi-postgresql-security-enabled,webapi-solr,webapi-databricks
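
To illustrate the backwards-compatibility point (these JDBC URLs are illustrative assumptions, not taken from the PR): the legacy Simba Spark driver and the newer Databricks driver expect different URL prefixes, so sources configured with the old prefix keep working only if the old profile remains available.

# Illustrative placeholders only; host and connection options are not from this thread.
# Legacy Simba Spark driver (webapi-spark profile):
#   jdbc:spark://<workspace-host>:443/default;transportMode=http;ssl=1
# Newer Databricks driver (webapi-databricks profile):
#   jdbc:databricks://<workspace-host>:443/default;transportMode=http;ssl=1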

anthonysena changed the title from "Unclear spark support" to "Unclear spark/databricks support" on Jan 30, 2024
anthonysena added this to the v2.15 milestone on Feb 13, 2024
anthonysena self-assigned this on Feb 13, 2024