I have a Java Spark job which accesses an S3 bucket using the s3a protocol, but with localstack I'm having trouble accessing it. Here are the details of what I've done:

In my pom.xml I imported:

    <dependency>
      <groupId>cloud.localstack</groupId>
      <artifactId>localstack-utils</artifactId>
      <version>0.1.9</version>
    </dependency>

The class I'm trying to test looks like this:

public class Session {
    private static SparkSession sparkSession = null;

    public SparkSession getSparkSession() {
        if (sparkSession == null) {
            sparkSession = SparkSession.builder()
                .appName("my-app")
                .master("local")
                .getOrCreate();
            Configuration hadoopConf = sparkSession.sparkContext().hadoopConfiguration();
            hadoopConf.set("fs.s3a.access.key", "test");
            hadoopConf.set("fs.s3a.secret.key", "test");
            hadoopConf.set("fs.s3a.endpoint", "http://test.localhost.atlassian.io:4572");
        }
        return sparkSession;
    }

    public Dataset<Row> readJson(Seq<String> paths) {
        // Paths example: "s3a://testing-bucket/key1", "s3a://testing-bucket/key2", ...
        return this.getSparkSession().read().json(paths);
    }
}

My test class looks like this:

@RunWith(LocalstackTestRunner.class)
public class BaseTest {
    private static AmazonS3 s3Client = null;

    @BeforeClass
    public static void setup() {
        EndpointConfiguration configuration = new AwsClientBuilder.EndpointConfiguration(
            LocalstackTestRunner.getEndpointS3(),
            LocalstackTestRunner.getDefaultRegion());
        s3Client = AmazonS3ClientBuilder.standard()
            .withEndpointConfiguration(configuration)
            .withCredentials(TestUtils.getCredentialsProvider())
            .withChunkedEncodingDisabled(true)
            .withPathStyleAccessEnabled(true)
            .build();
        String bucketName = "testing-bucket";
        s3Client.createBucket(bucketName);
        // load files here: I followed the example under the folder ext/java
        // to put some objects in there
    }

    @Test
    public void testOne() {
        Session s = new Session();
        // Build the Scala Seq from a Java list (e.g. via JavaConverters); the paths themselves are elided here
        Seq<String> paths = scala.collection.JavaConverters
            .asScalaBufferConverter(java.util.Arrays.asList("....")).asScala();
        Dataset<Row> files = s.readJson(paths); // <--- error here
    }
}

I receive two different errors depending on the protocol I use. With the s3a protocol I get the following:

org.apache.spark.sql.AnalysisException: Path does not exist: s3a://testing-bucket/key1;
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:360)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:348)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
	at scala.collection.immutable.List.flatMap(List.scala:344)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:348)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
	at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:333)
	at com.company.Session.readJson(Session.java:63)
	at test.SessionTest.testSession(SessionTest.java:23)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
	at cloud.localstack.LocalstackTestRunner.run(LocalstackTestRunner.java:129)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
	at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)

If I use s3n (with the appropriate Spark configuration), I receive this:

       .....
	at test.SessionTest.testSession(SessionTest.java:23)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
	at cloud.localstack.LocalstackTestRunner.run(LocalstackTestRunner.java:129)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
	at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
Caused by: org.jets3t.service.S3ServiceException: Service Error Message. -- ResponseCode: 403, ResponseStatus: Forbidden, XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>AllAccessDisabled</Code><Message>All access to this object has been disabled</Message><RequestId>EDDCE84A7470E7EC</RequestId><HostId>pKR4W3eUs6UCc0LU+QGifEvja3xMA4SfaBZRDt8JKoiu450VJtKoOaRhK/B5wD4f1iBu6HwUlWk=</HostId></Error>
	at org.jets3t.service.S3Service.getObject(S3Service.java:1470)
	at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.get(Jets3tFileSystemStore.java:164)
	... 51 more

I can't tell whether this is a bug or I'm missing something. Thanks!

Thanks for reporting @davideberdin. Can you try running your example with S3 path-style access enabled? (This is also briefly mentioned in the Troubleshooting section of the README.)

        hadoopConf.set("fs.s3a.path.style.access", "true");

Hi @whummer, thanks for your answer! I tried setting that property, but I still have the same issue:

java.nio.file.AccessDeniedException: s3a://testing-bucket/key1: getFileStatus on s3a://testing-bucket/key1: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 5A704DAC25D2DA09), S3 Extended Request ID: CE3wcJ/I5mF0t3CHoqAAf6fLkFJYVgKzIehHeXvYKWt0FZLOYFg6kFiLRRGTuEFl8SOZ9/h/IGQ=

Do you know if I should somehow configure the ACL of the bucket, or something related to the file in S3? This is how I put the object into my bucket:

AmazonS3 client = TestUtils.getClientS3();
byte[] dataBytes = .... // my content here
ObjectMetadata metaData = new ObjectMetadata();
metaData.setContentType(ContentType.APPLICATION_JSON.toString());
metaData.setContentEncoding(StandardCharsets.UTF_8.name());
metaData.setContentLength(dataBytes.length);
byte[] resultByte = DigestUtils.md5(dataBytes);
String streamMD5 = new String(Base64.encodeBase64(resultByte));
metaData.setContentMD5(streamMD5);
PutObjectRequest putObjectRequest = new PutObjectRequest("testing-bucket", "key1", new ByteArrayInputStream(dataBytes), metaData);
client.putObject(putObjectRequest);

One thing I noticed: can you double-check the bucket names? The example in the first post uses "testing-bucket" for createBucket, but "test-bucket" for the paths in readJson. Also, please use the s3a protocol (it appears that s3n is deprecated). It should not be necessary to set a bucket ACL, since you're already using a (fake) access key and secret key.
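
For reference, here is a minimal sketch of the s3a-side settings that are usually needed to talk to localstack from Spark; the endpoint URL, bucket name, and credentials below are placeholders rather than values confirmed in this thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.sql.SparkSession;

public class LocalstackS3aExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("localstack-s3a-example")
            .master("local[*]")
            .getOrCreate();

        Configuration hadoopConf = spark.sparkContext().hadoopConfiguration();
        // Dummy credentials; localstack does not validate them
        hadoopConf.set("fs.s3a.access.key", "test");
        hadoopConf.set("fs.s3a.secret.key", "test");
        // Point s3a at the localstack S3 endpoint (placeholder host/port)
        hadoopConf.set("fs.s3a.endpoint", "http://localhost:4572");
        // Path-style access avoids virtual-host bucket resolution against localhost
        hadoopConf.set("fs.s3a.path.style.access", "true");
        // The endpoint above is plain HTTP, so disable SSL on the connector
        hadoopConf.set("fs.s3a.connection.ssl.enabled", "false");

        spark.read().json("s3a://testing-bucket/key1").show();
    }
}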

I am facing a similar problem, but using the hadoop-aws library directly. My program tries to read a Parquet file (using parquet-avro) from localstack, but it cannot:

// Create bucket
val testBucket = s"test-bucket-${UUID.randomUUID()}"
val localstack = new EndpointConfiguration("http://s3:4572", Regions.US_EAST_1.getName)
val s3 = AmazonS3ClientBuilder.standard()
    .withEndpointConfiguration(localstack)
    .withPathStyleAccessEnabled(true)
    .build()
s3.createBucket(testBucket)
// Upload test file
val parquet = new File(getClass.getResource("/track.parquet").toURI)
val obj = new PutObjectRequest(testBucket, "track.parquet", parquet)
s3.putObject(obj)
// Hadoop configuration
val configuration = new Configuration()
configuration.set("fs.s3a.endpoint", "http://s3:4572")
configuration.set("fs.s3a.access.key", "<empty>")
configuration.set("fs.s3a.secret.key", "<empty>")
configuration.set("fs.s3a.path.style.access", "true")
// Read parquet
val path = new Path(s"s3a://$testBucket/track.parquet")
val reader = AvroParquetReader.builder[GenericRecord](path).withConf(configuration).build()
println(reader.read()) // This line is never reached, but no exception is thrown

I can access the file normally with the AWS CLI and SDK, but have no clue what is going on with the ParquetReader.

Is there any way to debug the S3 calls made to localstack? That way I think it would be easier to track down whether the error is in localstack, in parquet-avro, or in my code.

One last thing: the AvroParquetReader works fine when it is called with a local path.
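
(Not an answer to the Parquet question itself, but one way to see every request the AWS Java SDK v1 sends, including the client that s3a builds internally, is to enable the SDK's request logging. A minimal sketch, assuming a log4j 1.x backend such as the one bundled with Spark:)

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class AwsRequestLogging {
    public static void enable() {
        // The AWS SDK v1 logs a summary of each request/response on this logger at DEBUG
        Logger.getLogger("com.amazonaws.request").setLevel(Level.DEBUG);
        // Optional: raw HTTP wire logging (very verbose)
        Logger.getLogger("org.apache.http.wire").setLevel(Level.DEBUG);
    }
}

Tailing the localstack container logs on the other side shows what actually arrives at the server.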

This was broken for me as well. It was related to Content-Length always being returned as 0 on a HEAD request.
I upgraded to localstack 0.8.6 and it works 😸
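
A quick way to check whether you are hitting the same Content-Length problem is to issue a HEAD request for one of the affected objects and inspect what the server reports. A minimal sketch with the AWS Java SDK v1; the endpoint, bucket, and key are placeholders:

import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;

public class HeadObjectCheck {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
            .withEndpointConfiguration(new EndpointConfiguration("http://localhost:4572", "us-east-1"))
            .withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials("test", "test")))
            .withPathStyleAccessEnabled(true)
            .build();

        // getObjectMetadata performs a HEAD request under the hood
        ObjectMetadata meta = s3.getObjectMetadata("testing-bucket", "key1");
        System.out.println("Content-Length: " + meta.getContentLength());
        System.out.println("ETag:           " + meta.getETag());
    }
}

If Content-Length comes back as 0 for a non-empty object, you are on an affected localstack version.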

Late to the party, but it seems that writing to S3 also isn't working from Spark when using the s3a protocol. It gives me MD5 check errors.

Exception in thread "main" org.apache.hadoop.fs.s3a.AWSClientIOException: innerMkdirs on s3a://bucket-958abef2-b13e-4778-8a89-dc5d0a6aae21/input.csv/_temporary/0: com.amazonaws.SdkClientException: Unable to verify integrity of data upload.  Client calculated content hash (contentMD5: 1B2M2Y8AsgTpgAmY7PhCfg== in base 64) didn't match hash (etag: accd48352b8de701213f0d8fa29bf438 in hex) calculated by Amazon S3.  You may need to delete the data stored in Amazon S3. (metadata.contentMD5: null, md5DigestStream: com.amazonaws.services.s3.internal.MD5DigestCalculatingInputStream@487069c6, bucketName: bucket-958abef2-b13e-4778-8a89-dc5d0a6aae21, key: input.csv/_temporary/0/): Unable to verify integrity of data upload.  Client calculated content hash (contentMD5: 1B2M2Y8AsgTpgAmY7PhCfg== in base 64) didn't match hash (etag: accd48352b8de701213f0d8fa29bf438 in hex) calculated by Amazon S3.  You may need to delete the data stored in Amazon S3. (metadata.contentMD5: null, md5DigestStream: com.amazonaws.services.s3.internal.MD5DigestCalculatingInputStream@487069c6, bucketName: bucket-958abef2-b13e-4778-8a89-dc5d0a6aae21, key: input.csv/_temporary/0/)

Spark Code:

from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import SparkSession
from uuid import uuid4
spark = SparkSession\
        .builder\
        .appName("S3Write")\
        .getOrCreate()
spark.sparkContext.setLogLevel("DEBUG")
sc = spark.sparkContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('file:///genie/datafabric/power/spark_executor/latest/299423.csv')
fname = str(uuid4()) + '.csv'
df.write.csv('s3a://bucket-958abef2-b13e-4778-8a89-dc5d0a6aae21/'+fname)
df.show(5)
print(df.count())

Spark Version: 2.1.0
Hadoop Version: 2.8.0
AWS SDK: 1.11.228
Localstack (Docker): tried with latest, 0.8.7, 0.8.6

P.S. Able to write to Actual S3 with the same code.

I am trying to read a CSV file from S3 using the following code:

s3_read_df = spark.read \
        .option("delimiter", ",") \
        .format("csv") \
        .load("s3a://tutorial/test.csv")

I am getting an XML parsing error for it:

py4j.protocol.Py4JJavaError: An error occurred while calling o36.load. : com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler). Response Code: 200, Response Text: OK at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:738) at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:399) at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232) at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528) at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3480)

I was able to read/write against localstack S3 using the Spark 2.3.1 "without Hadoop" build together with a Hadoop 2.8.4 installation. You will need to set the parameters mentioned in this link (e.g. SPARK_DIST_CLASSPATH) to make the whole thing work:
http://spark.apache.org/docs/latest/hadoop-provided.html

(localstack#538 (comment)) One of the external symptoms of the bug is that S3 interactions for zero-length objects (objects whose body is the empty string) throw an error like `Client calculated content hash (contentMD5: 1B2M2Y8AsgTpgAmY7PhCfg== in base 64) didn't match hash (etag: ...`, which is telling because that `contentMD5` value is the base64-encoded MD5 of the empty string. The correct etag value should be `d41d8cd98f00b204e9800998ecf8427e`. On multiple executions of the test case the etag generated by the S3 server is different, so some non-deterministic value is being fed into the hash function. I haven't been able to find where the actual issue is in the code, as I don't use Python very often.

One workaround for avoiding this issue is to disable the MD5 validations in the S3 client. For Java that means setting the following system properties to any value: `com.amazonaws.services.s3.disableGetObjectMD5Validation` and `com.amazonaws.services.s3.disablePutObjectMD5Validation`.
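
A minimal sketch of that workaround in Java, assuming the AWS SDK v1 (the properties are read by the SDK's SkipMd5CheckStrategy and just need to be set before the S3 client is created):

public class DisableS3Md5Validation {
    public static void disableMd5Checks() {
        // Any non-null value disables the corresponding client-side MD5 validation
        System.setProperty("com.amazonaws.services.s3.disableGetObjectMD5Validation", "true");
        System.setProperty("com.amazonaws.services.s3.disablePutObjectMD5Validation", "true");
    }
}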


@avn4986, were you able to sort this out? I have the same issue.

Created https://github.com/risdenk/s3a-localstack to demonstrate the S3A rename issue. This uses localstack docker with JUnit.

Example output:

testS3ALocalStackFileSystem(io.github.risdenk.hadoop.s3a.TestS3ALocalstack)  Time elapsed: 4.562 sec  <<< ERROR!
org.apache.hadoop.fs.s3a.AWSBadRequestException: copyFile(YQuGJ, lzKBU) on YQuGJ: com.amazonaws.services.s3.model.AmazonS3Exception: The Content-MD5 you specified was invalid (Service: Amazon S3; Status Code: 400; Error Code: InvalidDigest; Request ID: null; S3 Extended Request ID: null), S3 Extended Request ID: null:InvalidDigest: The Content-MD5 you specified was invalid (Service: Amazon S3; Status Code: 400; Error Code: InvalidDigest; Request ID: null; S3 Extended Request ID: null)
	at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:224)
	at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:111)
	at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:125)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.copyFile(S3AFileSystem.java:2541)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.innerRename(S3AFileSystem.java:996)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.rename(S3AFileSystem.java:863)
	at io.github.risdenk.hadoop.s3a.TestS3ALocalstack.testS3ALocalStackFileSystem(TestS3ALocalstack.java:123)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
	at cloud.localstack.docker.LocalstackDockerTestRunner.run(LocalstackDockerTestRunner.java:43)
	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
	at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
	at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: The Content-MD5 you specified was invalid (Service: Amazon S3; Status Code: 400; Error Code: InvalidDigest; Request ID: null; S3 Extended Request ID: null), S3 Extended Request ID: null
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1640)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1304)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1058)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4368)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4315)
	at com.amazonaws.services.s3.AmazonS3Client.copyObject(AmazonS3Client.java:1890)
	at com.amazonaws.services.s3.transfer.internal.CopyCallable.copyInOneChunk(CopyCallable.java:146)
	at com.amazonaws.services.s3.transfer.internal.CopyCallable.call(CopyCallable.java:134)
	at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:132)
	at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:43)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
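
For reference, the failing pattern in that repo boils down to an s3a rename against localstack. A rough, hedged sketch of it (the class and bucket names are made up here, and the bucket is assumed to have been created in localstack beforehand):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;
import org.junit.runner.RunWith;

import cloud.localstack.docker.LocalstackDockerTestRunner;

@RunWith(LocalstackDockerTestRunner.class)
public class S3ARenameSketch {
    @Test
    public void renameTriggersCopy() throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.s3a.access.key", "test");
        conf.set("fs.s3a.secret.key", "test");
        conf.set("fs.s3a.endpoint", "http://localhost:4572");
        conf.set("fs.s3a.path.style.access", "true");

        FileSystem fs = FileSystem.get(new URI("s3a://testing-bucket/"), conf);
        Path src = new Path("s3a://testing-bucket/src.txt");
        try (FSDataOutputStream out = fs.create(src)) {
            out.writeBytes("hello");
        }
        // rename() is implemented as copy + delete in s3a; the copy sends the
        // Content-MD5 that localstack then rejects with InvalidDigest
        fs.rename(src, new Path("s3a://testing-bucket/dst.txt"));
    }
}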

Problem Statement - s3a rename
So finally had some time to sit down and look at the s3a rename failure. The AWS Java SDK in the case of s3a is doing a CopyObjectRequest here [1]. The ObjectMetadata is being copied to this request and so the Content-MD5 is being sent. It is unclear from the AWS docs [2] if Content-MD5 should be set on the request.

Localstack thinks that if the Content-MD5 header is set, then the MD5 of data must be checked [3]. The content ends up being empty on a copy request. When Localstack calculates the MD5 of '' it doesn't match what the AWS SDK/s3a set as the Content-MD5 [4].
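
For reference, the "empty body" MD5 values quoted in this thread are easy to reproduce; a small self-contained check:

import java.security.MessageDigest;
import java.util.Base64;

public class EmptyBodyMd5 {
    public static void main(String[] args) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(new byte[0]);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        // Prints d41d8cd98f00b204e9800998ecf8427e
        System.out.println(hex);
        // Prints 1B2M2Y8AsgTpgAmY7PhCfg==
        System.out.println(Base64.getEncoder().encodeToString(digest));
    }
}

So whichever side reports one of these values is hashing an empty payload.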

In the case of say Minio, there is a separate copy request handler that doesn't check the Content-MD5 header [5].

Potential Solution
So I think there are two options here:

  • Make localstack more lenient and check whether the request is a copy before checking the MD5
    • Something like if 'Content-MD5' in headers and 'x-amz-copy-source' not in headers:
    • https://github.com/localstack/localstack/blob/master/localstack/services/s3/s3_listener.py#L444
  • An alternative is to make Apache Hadoop S3AFilesystem not set the Content-MD5 metadata for a copy request. This isn't ideal for 2 reasons:
    • There are existing s3a users out there; this wouldn't work for existing clients
    • S3AFilesystem works against Amazon S3 currently

  [1] https://github.com/apache/hadoop/blob/branch-3.2.0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L2541
  [2] https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectCOPY.html
  [3] https://github.com/localstack/localstack/blob/master/localstack/services/s3/s3_listener.py#L444
  [4] https://github.com/localstack/localstack/blame/master/localstack/services/s3/s3_listener.py#L311
  [5] https://github.com/minio/minio/blob/master/cmd/object-handlers.go#L623

    An alternative is to make Apache Hadoop S3AFilesystem not set the Content-MD5 metadata for a copy request. This isn't ideal for 2 reasons:

    There are existing s3a users out there; this wouldn't work for existing clients

    valid point

    S3AFilesystem works against Amazon S3 currently

    true, but we don't want to preclude other stores. Looking at the last time anyone touched the metadata propagation code it actually came from EMC HADOOP-11687.

    One suggestion here is that someone checks out the hadoop source code and runs the hadoop-aws tests against this new s3a endpoint. Then if there are problems, they can be fixed at either end. If we shouldn't be setting the Content-MD5, then we shouldn't be setting the Content-MD5.

  • I am actually doing some copy stuff in HADOOP-16490 ("Improve S3Guard handling of FNFEs in copy", apache/hadoop#1229), but this should be separate if it stands any chance of being backported.
  • the probability of me finding the time to work on this is ~0
  • but if others were to do a patch and, after complying with our test requirements, repeatedly harass me to look at it, I'll take a look
  • my first recommendation would be for you to use the hadoop-aws test suite as part of your own integration testing, "this is what the big data apps expect". Let's see what happens

    I can make a pull request for not checking the MD5 if it is a copy. It is a simple one line change. Not sure where the tests are. Let me put up the PR and go from there.

    my first recommendation would be for you to use the hadoop-aws test suite as part of your own integration testing, "this is what the big data apps expect". Let's see what happens

    I tried to take a crack at this tonight and ran into a bunch of issues with "Change detection policy requires ETag" errors.