Not able to save to Redshift using spark-redshift package v0.6 #158
Can you please share more of the log? Are there any additional log messages which follow the one that you posted here? The "The AWS Access Key Id you provided does not exist in our records" message makes this sound like some sort of credentials problem, so it would be helpful if you could let me know more about how you've configured access to S3.
Also, since you're on EMR, have you seen the "IAM instance profiles" section of https://github.com/databricks/spark-redshift#aws-credentials? If you're using instance profiles rather than S3 keys, that could explain why Spark is able to authenticate to S3 when writing the Avro files. If this is the case and you can't use keys, you'll want to use the security token service (STS) to obtain temporary credentials to pass to Redshift. There are some examples of this at https://stackoverflow.com/questions/33797693/how-to-properly-provide-credentials-for-spark-redshift-in-emr-instances, but I haven't tried them out or vetted them for accuracy. It would be great if someone submitted a patch to automatically obtain the tokens; I don't have time to work on that feature myself, but I'd be glad to review pull requests for it.
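As a rough sketch of what obtaining such temporary credentials with the AWS Java SDK's STS client can look like (the duration and variable names are illustrative assumptions, not code from this thread):

```scala
import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClient
import com.amazonaws.services.securitytoken.model.GetSessionTokenRequest

// Sketch only: ask STS for short-lived session credentials using the default
// credential provider chain; the duration and names are placeholders.
val stsClient = new AWSSecurityTokenServiceClient()
val session = stsClient.getSessionToken(
  new GetSessionTokenRequest().withDurationSeconds(3600))
val creds = session.getCredentials
val tempAccessKey = creds.getAccessKeyId      // temporary access key id
val tempSecretKey = creds.getSecretAccessKey  // temporary secret key
val tempToken     = creds.getSessionToken     // session token that must accompany the keys
```

These three values are what the `temporary_aws_*` options shown later in this thread expect.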
Thanks for your quick response, Josh. Yes, I'm using this example (https://stackoverflow.com/questions/33797693/how-to-properly-provide-credentials-for-spark-redshift-in-emr-instances) to set up the credentials. I'm setting the Hadoop credentials as follows:
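A minimal sketch of setting s3n credentials on the SparkContext's Hadoop configuration, with placeholder key values (not the poster's actual snippet):

```scala
// Sketch only: s3n credentials in the Hadoop configuration; values are placeholders.
// Note that the fs.s3n.* properties have no slot for an STS session token.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<temporary-access-key-id>")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<temporary-secret-key>")
```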
The following is not working, but temp files are generated in the S3 bucket:
Here's the partial stack trace:
Here's the save routine:
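A sketch of what such a save typically looks like with spark-redshift, assuming placeholder identifiers such as `eventsDF`, `jdbcURL`, and `tempS3Dir` (not the poster's actual routine):

```scala
// Sketch only: a typical spark-redshift write; all identifiers are placeholders.
eventsDF.write
  .format("com.databricks.spark.redshift")
  .option("url", jdbcURL)          // Redshift JDBC URL, including user and password
  .option("dbtable", "event_copy") // target Redshift table
  .option("tempdir", tempS3Dir)    // S3 staging directory for the intermediate Avro files
  .mode("error")
  .save()
```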
Any help would really be appreciated.
In the example code that you posted above, it looks like you're taking STS session credentials and are configuring the fs.s3n.* Hadoop properties with them. Those properties only carry an access key and a secret key; there is nowhere to supply the STS session token, and temporary credentials are not valid without their token.
Similarly, according to https://docs.aws.amazon.com/redshift/latest/dg/r_copy-temporary-security-credentials.html, a COPY that uses temporary security credentials must also include the session token.
What I think is happening here is that these invalid, token-less credentials are what spark-redshift ends up using when it talks to S3 and Redshift on your behalf, which is why you see "The AWS Access Key Id you provided does not exist in our records" even though Spark itself was able to write the Avro files.
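For reference, a sketch of the approach being described here: hand the STS session credentials to the connector through its `temporary_aws_*` options rather than the fs.s3n.* Hadoop properties (all identifiers are placeholders, not code from this thread):

```scala
// Sketch only: pass temporary STS credentials directly to spark-redshift so that the
// COPY issued against Redshift includes the session token; identifiers are placeholders.
eventsDF.write
  .format("com.databricks.spark.redshift")
  .option("url", jdbcURL)
  .option("dbtable", "event_copy")
  .option("tempdir", tempS3Dir)
  .option("temporary_aws_access_key_id", tempAccessKey)
  .option("temporary_aws_secret_access_key", tempSecretKey)
  .option("temporary_aws_session_token", tempToken)
  .mode("error")
  .save()
```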
Regarding that bucket lifecycle exception, I think that's a harmless but annoying bug in spark-redshift's check of the temp directory bucket's lifecycle configuration.
By the way, I would appreciate any suggestions on how I can make the documentation more clear. If my suggestion fixes things for you and you have time, feel free to submit a pull request to add clarifications to the README.
I've opened #159 to fix the spurious bucket lifecycle check warning.
Based on your recommendations, I went ahead and made the following updates:
I'm wondering if this line of code in RedshiftRelation.scala needs to retain the credentials in the S3 URL: https://github.com/databricks/spark-redshift/blob/master/src/main/scala/com/databricks/spark/redshift/RedshiftRelation.scala#L155.
This is the stack trace:
You can't embed the credentials into the URI when specifying it as part of the tempdir option.
To clarify, when using IAM instance profiles you should not specify the credentials in the Hadoop configuration or the URI; you should only specify them via the temporary_aws_* options.
Do you have an example showing how to provide S3 credentials via the temporary_aws_* options?
Sure; here's an example: src/it/scala/com/databricks/spark/redshift/STSIntegrationSuite.scala, line 60, at commit 3b2ce8b.
Okay, thanks. I'm now using:

```scala
val eventsDF = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", jdbcURL)
  .option("tempdir", tempS3Dir)
  .option("dbtable", "event")
  .option("temporary_aws_access_key_id", awsAccessKey)
  .option("temporary_aws_secret_access_key", awsSecretKey)
  .option("temporary_aws_session_token", token)
  .load()
```

However, I'm getting the following stack trace:

Do you have an example that uses InstanceProfile?
Aha, that is a bug. This will be fixed as soon as I merge #159. I guess we haven't had other users try this out (I don't have end-to-end integration tests for this STS/IAM path yet; pull requests are welcome for this, since I don't have time to work on this myself).
I'm going to merge #159. If you have the means to test out a custom build of spark-redshift with that fix, it would be great if you could confirm that it resolves your issue.
That's great news! Let me know how to access the custom build and test it later today. |
@wmdoble, the easiest way to test this out is probably to clone this repository and build it yourself with sbt, either publishing a local snapshot or building an assembly JAR to put on your classpath.
Confirming your code fix works. Thank you! Running `eventsDF.show()` returns the expected output. I appreciate your help.
I'm having issues with that. I rebuilt the master branch and am seeing the following behaviour:
1. I do get:
2. With the following I do get something better (a permission issue):
Exception at the end:
I'll try with 's3a'.
With s3a there are problems, but those are related to Hadoop: hadoop-aws 2.7.2 calls a nonexistent method:
Same problem as @AndriyLytvynskyy for me:

```
py4j.protocol.Py4JJavaError: An error occurred while calling o30.load.
```

I wonder if this is specific to the Python version of this library. My snippet is running with these versions of packages:

```
spark-submit --jars ./RedshiftJDBC41-1.1.13.1013.jar --packages org.apache.hadoop:hadoop-aws:2.6.4,com.amazonaws:aws-java-sdk:1.11.29,com.databricks:spark-redshift_2.10:1.1.0 --driver-class-path ./RedshiftJDBC41-1.1.13.1013.jar main.py
```
Hey @austinchau, let's continue this discussion over at #260, which I think is a duplicate of your specific issue. |
I'm running your tutorial for the spark-redshift package on Amazon EMR (emr-4.2.0, Spark 1.5.2, spark-redshift 0.6.0), but I can't seem to get any tables written to Redshift. Both the S3 buckets and the Redshift cluster are in us-east-1. I can see the temp files created in S3, despite the error:
An error occurred while trying to read the S3 bucket lifecycle configuration com.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: Amazon S3; Status Code: 403...)
Is this a known issue? Any best practices or workaround?