I have a Java Spark job that accesses an S3 bucket using the s3a protocol, but with LocalStack I'm running into accessibility issues. Here are the details of what I've done:
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import scala.collection.Seq;

public class Session {

    private static SparkSession sparkSession = null;

    public SparkSession getSparkSession() {
        if (sparkSession == null) {
            sparkSession = SparkSession.builder()
                    .appName("my-app")
                    .master("local")
                    .getOrCreate();
            Configuration hadoopConf = sparkSession.sparkContext().hadoopConfiguration();
            hadoopConf.set("fs.s3a.access.key", "test");
            hadoopConf.set("fs.s3a.secret.key", "test");
            hadoopConf.set("fs.s3a.endpoint", "http://test.localhost.atlassian.io:4572");
        }
        return sparkSession;
    }

    public Dataset<Row> readJson(Seq<String> paths) {
        // Paths example: "s3a://testing-bucket/key1", "s3a://testing-bucket/key2", ...
        return this.getSparkSession().read().json(paths);
    }
}
My testing class looks like this:
import cloud.localstack.LocalstackTestRunner;
import cloud.localstack.TestUtils;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.junit.BeforeClass;
import org.junit.Test;
import org.junit.runner.RunWith;
import scala.collection.Seq;

@RunWith(LocalstackTestRunner.class)
public class BaseTest {

    private static AmazonS3 s3Client = null;

    @BeforeClass
    public static void setup() {
        AwsClientBuilder.EndpointConfiguration configuration = new AwsClientBuilder.EndpointConfiguration(
                LocalstackTestRunner.getEndpointS3(),
                LocalstackTestRunner.getDefaultRegion());
        s3Client = AmazonS3ClientBuilder.standard()
                .withEndpointConfiguration(configuration)
                .withCredentials(TestUtils.getCredentialsProvider())
                .withChunkedEncodingDisabled(true)
                .withPathStyleAccessEnabled(true)
                .build();
        String bucketName = "testing-bucket";
        s3Client.createBucket(bucketName);
        // load files here: I followed the example under the folder ext/java
        // to put some objects in there
    }

    @Test
    public void testOne() {
        Session s = new Session();
        Seq<String> paths = ...; // a Seq of "s3a://testing-bucket/..." paths
        Dataset<Row> files = s.readJson(paths); // <--- error here
    }
}
I receive two types of errors depending on which protocol I use. With the s3a protocol I get the following:
org.apache.spark.sql.AnalysisException: Path does not exist: s3a://testing-bucket/key1;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:360)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:348)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:348)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:333)
at com.company.Session.readJson(Session.java:63)
at test.SessionTest.testSession(SessionTest.java:23)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at cloud.localstack.LocalstackTestRunner.run(LocalstackTestRunner.java:129)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
If I use s3n (with the appropriate Spark configurations), I receive this:
.....
at test.SessionTest.testSession(SessionTest.java:23)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at cloud.localstack.LocalstackTestRunner.run(LocalstackTestRunner.java:129)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
Caused by: org.jets3t.service.S3ServiceException: Service Error Message. -- ResponseCode: 403, ResponseStatus: Forbidden, XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>AllAccessDisabled</Code><Message>All access to this object has been disabled</Message><RequestId>EDDCE84A7470E7EC</RequestId><HostId>pKR4W3eUs6UCc0LU+QGifEvja3xMA4SfaBZRDt8JKoiu450VJtKoOaRhK/B5wD4f1iBu6HwUlWk=</HostId></Error>
at org.jets3t.service.S3Service.getObject(S3Service.java:1470)
at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.get(Jets3tFileSystemStore.java:164)
... 51 more
I don't understand whether it's a bug or whether I'm missing something. Thanks!
Thanks for reporting @davideberdin. Can you try running your example with S3 path-style access enabled? (This is also briefly mentioned in the Troubleshooting section of the README.)
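Something along these lines in the Hadoop configuration (a sketch only; fs.s3a.path.style.access is the standard s3a property, and http://localhost:4572 is assumed here as the default LocalStack S3 port):

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.sql.SparkSession;

public class LocalstackS3aConfig {
    public static void configure(SparkSession spark) {
        Configuration hadoopConf = spark.sparkContext().hadoopConfiguration();
        hadoopConf.set("fs.s3a.endpoint", "http://localhost:4572"); // assumed LocalStack S3 endpoint
        hadoopConf.set("fs.s3a.path.style.access", "true");         // send the bucket in the URL path
        hadoopConf.set("fs.s3a.access.key", "test");
        hadoopConf.set("fs.s3a.secret.key", "test");
    }
}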
Do you know if I should somehow configure the ACL of the bucket, or something related to the file in S3? This is how I put the object in my bucket:
AmazonS3 client = TestUtils.getClientS3();
byte[] dataBytes = .... // my content here
ObjectMetadata metaData = new ObjectMetadata();
metaData.setContentType(ContentType.APPLICATION_JSON.toString());
metaData.setContentEncoding(StandardCharsets.UTF_8.name());
metaData.setContentLength(dataBytes.length);
byte[] resultByte = DigestUtils.md5(dataBytes);
String streamMD5 = new String(Base64.encodeBase64(resultByte));
metaData.setContentMD5(streamMD5);
PutObjectRequest putObjectRequest = new PutObjectRequest("testing-bucket", "key1", new ByteArrayInputStream(dataBytes), metaData);
client.putObject(putObjectRequest);
One thing I noticed - can you double-check the bucket names? The example in the first post uses "testing-bucket" for createBucket, but "test-bucket" for the paths in readJson. Also, please use the s3a protocol (it appears that s3n is deprecated). It should not be necessary to set the bucket ACL, since you're already using a (fake) access key and secret key.
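To rule out an upload problem, you could also list the bucket with the same client and verify the keys exist before Spark reads them; a minimal sketch using the SDK calls you're already using (the class name is just for illustration):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class ListTestBucket {
    public static void printKeys(AmazonS3 client) {
        // List the bucket that setup() populated and print each key,
        // so the paths passed to readJson can be checked against it.
        for (S3ObjectSummary summary : client.listObjects("testing-bucket").getObjectSummaries()) {
            System.out.println(summary.getKey() + " (" + summary.getSize() + " bytes)");
        }
    }
}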
I am facing a problem like this, but using the hadoop-aws library directly. My program tries to read a Parquet file (using parquet-avro) from localstack, but it cannot:
// Create bucket
val testBucket = s"test-bucket-${UUID.randomUUID()}"
val localstack = new EndpointConfiguration("http://s3:4572", Regions.US_EAST_1.getName)
val s3 = AmazonS3ClientBuilder.standard()
  .withEndpointConfiguration(localstack)
  .withPathStyleAccessEnabled(true)
  .build()
s3.createBucket(testBucket)

// Upload test file
val parquet = new File(getClass.getResource("/track.parquet").toURI)
val obj = new PutObjectRequest(testBucket, "track.parquet", parquet)
s3.putObject(obj)

// Hadoop configuration
val configuration = new Configuration()
configuration.set("fs.s3a.endpoint", "http://s3:4572")
configuration.set("fs.s3a.access.key", "<empty>")
configuration.set("fs.s3a.secret.key", "<empty>")
configuration.set("fs.s3a.path.style.access", "true")

// Read parquet
val path = new Path(s"s3a://$testBucket/track.parquet")
val reader = AvroParquetReader.builder[GenericRecord](path).withConf(configuration).build()
println(reader.read()) // This line never executes, but no exception is thrown
I can access the file normally with the AWS CLI and SDK, but I have no clue what is going on with the ParquetReader.
Is there any way to debug the S3 calls made to localstack? That way I think it would be easier to track down whether the error is in localstack, in parquet-avro, or in my code.
One last thing: the AvroParquetReader works fine when it is called with a local path.
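The closest I can think of so far is to raise the AWS SDK's own request logging to DEBUG (a sketch only, assuming log4j 1.x is on the classpath as it usually is with Hadoop/Spark; the class name is just for illustration):

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class EnableS3RequestLogging {
    public static void main(String[] args) {
        // Request/response summaries logged by the AWS Java SDK (v1).
        Logger.getLogger("com.amazonaws.request").setLevel(Level.DEBUG);
        // Raw HTTP traffic from the underlying Apache HttpClient (very verbose).
        Logger.getLogger("org.apache.http.wire").setLevel(Level.DEBUG);
        // ... create the S3 client / ParquetReader after this point ...
    }
}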
This was broken for me as well. It was related to Content-Length always being returned as 0 on a HEAD request.
I upgraded to localstack 0.8.6 and it works 😸
Late to the party, but it seems that the write-to-S3 path also isn't working from Spark when using the s3a protocol. It gives me MD5 check errors.
Exception in thread "main" org.apache.hadoop.fs.s3a.AWSClientIOException: innerMkdirs on s3a://bucket-958abef2-b13e-4778-8a89-dc5d0a6aae21/input.csv/_temporary/0: com.amazonaws.SdkClientException: Unable to verify integrity of data upload. Client calculated content hash (contentMD5: 1B2M2Y8AsgTpgAmY7PhCfg== in base 64) didn't match hash (etag: accd48352b8de701213f0d8fa29bf438 in hex) calculated by Amazon S3. You may need to delete the data stored in Amazon S3. (metadata.contentMD5: null, md5DigestStream: com.amazonaws.services.s3.internal.MD5DigestCalculatingInputStream@487069c6, bucketName: bucket-958abef2-b13e-4778-8a89-dc5d0a6aae21, key: input.csv/_temporary/0/): Unable to verify integrity of data upload. Client calculated content hash (contentMD5: 1B2M2Y8AsgTpgAmY7PhCfg== in base 64) didn't match hash (etag: accd48352b8de701213f0d8fa29bf438 in hex) calculated by Amazon S3. You may need to delete the data stored in Amazon S3. (metadata.contentMD5: null, md5DigestStream: com.amazonaws.services.s3.internal.MD5DigestCalculatingInputStream@487069c6, bucketName: bucket-958abef2-b13e-4778-8a89-dc5d0a6aae21, key: input.csv/_temporary/0/)
Spark Code:
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import SparkSession
from uuid import uuid4
spark = SparkSession\
.builder\
.appName("S3Write")\
.getOrCreate()
spark.sparkContext.setLogLevel("DEBUG")
sc = spark.sparkContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('file:///genie/datafabric/power/spark_executor/latest/299423.csv')
fname = str(uuid4()) + '.csv'
df.write.csv('s3a://bucket-958abef2-b13e-4778-8a89-dc5d0a6aae21/'+fname)
df.show(5)
print(df.count())
py4j.protocol.Py4JJavaError: An error occurred while calling o36.load.
: com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler). Response Code: 200, Response Text: OK
at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:738)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:399)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3480)
I was able to read/write against localstack S3 with the Spark 2.3.1 "without Hadoop" build and Hadoop 2.8.4.
We will need to set the parameters mentioned in this link to make the whole thing work: http://spark.apache.org/docs/latest/hadoop-provided.html
localstack#538 (comment)
One of the external symptoms of the bug is that S3 interactions for objects that are the empty string (zero-length objects) throw an error like
`Client calculated content hash (contentMD5: 1B2M2Y8AsgTpgAmY7PhCfg== in base 64) didn't match hash (etag: ...`
which is telling, because the `contentMD5` value is the base64-encoded MD5 of the empty string. The correct etag value should be `d41d8cd98f00b204e9800998ecf8427e`.
On multiple executions of the test case the etag generated by the S3 server is different, so some non-deterministic value is being fed into the hash function. I haven't been able to find where the actual issue lives in the code, as I don't use Python very often.
One workaround for avoiding this issue is to disable the MD5 validations in the S3 client. For Java, that means setting the following properties to any value:
com.amazonaws.services.s3.disableGetObjectMD5Validation
com.amazonaws.services.s3.disablePutObjectMD5Validation
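A minimal sketch of that workaround (the wrapper class is only for illustration; the same two properties can equally be passed as -D flags on the JVM command line):

public class DisableS3Md5Validation {
    public static void main(String[] args) {
        // The AWS SDK only checks whether these properties are defined,
        // so the value itself does not matter.
        System.setProperty("com.amazonaws.services.s3.disableGetObjectMD5Validation", "true");
        System.setProperty("com.amazonaws.services.s3.disablePutObjectMD5Validation", "true");
        // ... build the S3 client / start the Spark job after this point ...
    }
}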
testS3ALocalStackFileSystem(io.github.risdenk.hadoop.s3a.TestS3ALocalstack) Time elapsed: 4.562 sec <<< ERROR!
org.apache.hadoop.fs.s3a.AWSBadRequestException: copyFile(YQuGJ, lzKBU) on YQuGJ: com.amazonaws.services.s3.model.AmazonS3Exception: The Content-MD5 you specified was invalid (Service: Amazon S3; Status Code: 400; Error Code: InvalidDigest; Request ID: null; S3 Extended Request ID: null), S3 Extended Request ID: null:InvalidDigest: The Content-MD5 you specified was invalid (Service: Amazon S3; Status Code: 400; Error Code: InvalidDigest; Request ID: null; S3 Extended Request ID: null)
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:224)
at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:111)
at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:125)
at org.apache.hadoop.fs.s3a.S3AFileSystem.copyFile(S3AFileSystem.java:2541)
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerRename(S3AFileSystem.java:996)
at org.apache.hadoop.fs.s3a.S3AFileSystem.rename(S3AFileSystem.java:863)
at io.github.risdenk.hadoop.s3a.TestS3ALocalstack.testS3ALocalStackFileSystem(TestS3ALocalstack.java:123)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at cloud.localstack.docker.LocalstackDockerTestRunner.run(LocalstackDockerTestRunner.java:43)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: The Content-MD5 you specified was invalid (Service: Amazon S3; Status Code: 400; Error Code: InvalidDigest; Request ID: null; S3 Extended Request ID: null), S3 Extended Request ID: null
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1640)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1304)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1058)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4368)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4315)
at com.amazonaws.services.s3.AmazonS3Client.copyObject(AmazonS3Client.java:1890)
at com.amazonaws.services.s3.transfer.internal.CopyCallable.copyInOneChunk(CopyCallable.java:146)
at com.amazonaws.services.s3.transfer.internal.CopyCallable.call(CopyCallable.java:134)
at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:132)
at com.amazonaws.services.s3.transfer.internal.CopyMonitor.call(CopyMonitor.java:43)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Problem Statement - s3a rename
So I finally had some time to sit down and look at the s3a rename failure. In the s3a case, the AWS Java SDK is doing a CopyObjectRequest here [1]. The ObjectMetadata is copied onto this request, and so the Content-MD5 is sent along. It is unclear from the AWS docs [2] whether Content-MD5 should be set on the request.
Localstack assumes that if the Content-MD5 header is set, then the MD5 of the data must be checked [3]. On a copy request the content ends up being empty, and when Localstack calculates the MD5 of '' it doesn't match what the AWS SDK/s3a set as the Content-MD5 [4].
In the case of, say, Minio, there is a separate copy request handler that doesn't check the Content-MD5 header [5].
Potential Solution
So I think the best option here is to:
Make localstack be more lenient and check if this is a copy request before checking MD5
Something like if 'Content-MD5' in headers and 'x-amz-copy-source' not in headers:
An alternative is to make the Apache Hadoop S3AFileSystem not set the Content-MD5 metadata for a copy request. This isn't ideal for two reasons:
There are existing s3a users out there; this wouldn't work for existing clients
valid point
S3AFileSystem works against Amazon S3 currently
true, but we don't want to preclude other stores. Looking at the last time anyone touched the metadata propagation code it actually came from EMC HADOOP-11687.
the probability of me finding the time to work on this is ~0
but if others were to do a patch and, after complying with our test requirements, repeatedly harass me to look at it, I'll take a look
my first recommendation would be for you to use the hadoop-aws test suite as part of your own integration testing, "this is what the big data apps expect". Let's see what happens
I can make a pull request for not checking the MD5 if it is a copy. It is a simple one-line change. Not sure where the tests are. Let me put up the PR and go from there.
my first recommendation would be for you to use the hadoop-aws test suite as part of your own integration testing, "this is what the big data apps expect". Let's see what happens
I tried to take a crack at this tonight and ran into a bunch of issues with "ETag Change detection policy requires ETag".