Hi, I’ve been trying to change the references for a large number of publications (close to 3000) within a dataset. Each calculation was uploaded individually, so I have to use the API, and I’ve been having issues with it.
I can do it for a single calculation like this:
import requests

base_url = 'https://nomad-lab.eu/prod/v1/staging/api/v1'
query = {'calc_id': 'hlb48O0vNx9NtuO5txg_QWBfqdfZ'}
response = requests.post(
    base_url + '/entries/edit',
    headers={'Authorization': 'Bearer {}'.format(token_access)},  # token_access is my API access token
    json={
        'query': query,
        'metadata': {'references': ['https://www.nature.com/articles/s41524-023-01113-5',
                                    'http://htp-ahe.fzu.cz/']},
    },
)
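For reference, a quick way to confirm the request was accepted is to check the response status and body (standard requests calls, nothing NOMAD-specific):

# Raise if the API returned an error status, then inspect the edit result.
response.raise_for_status()
print(response.json())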
I’ve tried it for a few calculations and that works.
When I instead try specifying the whole dataset using this query:
query = {'datasets': {'dataset_id': 'w-aD3WUATVqP_GpRd-GU1g'}}
it does not seem to have any effect, even though the same query works using /entries/query.
Should I change every calculation individually?
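If that is the way to go, I suppose I can fetch the entry ids with /entries/query and then edit them one at a time, roughly like this (just a sketch; the pagination fields, the required/include option, and entry_id being the same id as calc_id are how I understand the API docs, so they may need adjusting):

# Sketch: edit the references of every entry in the dataset one by one.
# Reuses requests, base_url and token_access from the snippet above.
dataset_query = {'datasets': {'dataset_id': 'w-aD3WUATVqP_GpRd-GU1g'}}
references = ['https://www.nature.com/articles/s41524-023-01113-5',
              'http://htp-ahe.fzu.cz/']
headers = {'Authorization': 'Bearer {}'.format(token_access)}

pagination = {'page_size': 100}
while True:
    # Fetch one page of entry ids belonging to the dataset.
    page = requests.post(
        base_url + '/entries/query',
        headers=headers,
        json={'query': dataset_query,
              'required': {'include': ['entry_id']},
              'pagination': pagination},
    ).json()
    for entry in page['data']:
        # Edit each entry individually, like in the single-calculation example.
        requests.post(
            base_url + '/entries/edit',
            headers=headers,
            json={'query': {'entry_id': entry['entry_id']},
                  'metadata': {'references': references}},
        )
    next_value = page['pagination'].get('next_page_after_value')
    if not next_value:
        break
    pagination['page_after_value'] = next_value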
Thanks.
It looks good to me and should work. The query is ok, and there are no errors in the logs either, so I can’t tell you what is going wrong. It has to be some kind of bug.
The simplest solution would be for me to just make the changes manually in our backend.
Unfortunately, you created individual uploads for each entry. Is there a reason why you did this? It makes it very hard to change the metadata. We also try to avoid this as much as possible; we have limits on simultaneously unpublished uploads, etc. You did a good job circumventing this via the API, but it is still less than ideal for us. Right now, your dataset constitutes 30% of all NOMAD uploads because of it. This makes future maintenance, backup, and migration tasks unnecessarily hard for us and has negative performance implications.
Can I ask you to re-upload the data into one upload again? E.g. via one zip file with directories 0, 1, …, 2871 or something. I can then remove the tiny uploads, make sure that the references are those you want, and migrate the dataset while keeping the DOI intact.
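Something along these lines would do it (just a sketch with placeholder paths, assuming each calculation currently sits in its own local directory):

import os
import zipfile

# Sketch: pack every calculation directory into a single zip as 0/, 1/, 2/, ...
# 'calculations' is a placeholder for wherever the original files live locally.
calc_dirs = sorted(os.listdir('calculations'))

with zipfile.ZipFile('dataset_upload.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
    for index, calc_dir in enumerate(calc_dirs):
        root = os.path.join('calculations', calc_dir)
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # Store the file under '<index>/<path relative to the calculation dir>'.
                arcname = os.path.join(str(index), os.path.relpath(path, root))
                zf.write(path, arcname)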
Oh, sorry about this, I didn’t realize that uploading many calculations individually is bad practice. I did it like this simply because it was easier to implement. I think that when put all together the zip file will be very large and will probably have to be split, but either way I can upload it in a few large increments.
I’ve put everything in one zip file, but it is 580 GB. I can split it into smaller ones if that’s better.
If this is too big, I can remove the Wannier Hamiltonians, as they take up most of the space, although they were the main reason I wanted to upload this data.
Hi @Zeleznyj,
I read “Wannier Hamiltonians” and just wanted to jump in.
May I ask which code you are using to calculate the Wannier orbitals? Wannier90? Furthermore, are you referring to storing the full reciprocal-space Hamiltonian? If that is the case, I’d say you would only need to store the hopping matrix plus the material information; that would be enough to reconstruct the full k-space Hamiltonian. Maybe that is a further way of saving some GB.
If you are using Wannier90, you would only need to keep the .win, .wout, and _hr.dat files for each Wannier interpolation. If it is not Wannier90, we can help you further.
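If it helps, filtering while packing is straightforward; something like this keeps only those three file types (a sketch with a placeholder directory name, which could also be combined with the packing sketch above):

import os
import zipfile

# Sketch: keep only the Wannier90 files needed to rebuild the Hamiltonian.
keep_suffixes = ('.win', '.wout', '_hr.dat')

with zipfile.ZipFile('wannier_only.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
    for dirpath, _, filenames in os.walk('calculations'):  # placeholder directory
        for name in filenames:
            if name.endswith(keep_suffixes):
                path = os.path.join(dirpath, name)
                zf.write(path, os.path.relpath(path, 'calculations'))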
Best regards,
I have split the files into zip files below 32 GB, but I’m still having trouble with uploading.
When I try it using Python requests, it crashes with the error:
OverflowError: string longer than 2147483647 bytes
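As far as I can tell, this happens because requests builds the whole request body in memory before sending, so streaming the file should avoid it. Something along these lines is what I’m aiming for (the /uploads endpoint and its file_name parameter are my reading of the API docs, so they may need adjusting):

import os
import requests

# base_url and token_access as in the earlier snippets; zip_path is a placeholder.
zip_path = 'dataset_part_01.zip'

with open(zip_path, 'rb') as f:
    response = requests.post(
        base_url + '/uploads',
        params={'file_name': os.path.basename(zip_path)},
        headers={'Authorization': 'Bearer {}'.format(token_access)},
        data=f,  # a file object is streamed instead of being read into one huge string
    )
print(response.json())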
I also tried curl, but I get 413 Request Entity Too Large. This happens even with small files, so I must be doing something wrong.
Sorry for not replying earlier, I was swamped with other stuff.
I figured out the problem with curl: I was using the wrong URL.
I managed to upload all the files. It’s split into 21 uploads. Can you replace the old uploads now? I can put the new uploads into a separate dataset if it would help.
Either you publish with a new dataset and I remove the old dataset and uploads later, or the other way around. We could also reuse the existing dataset if that is somehow beneficial for you. Whatever you prefer.
Some criteria for finding the old ones would help. But I guess I can use the upload time? They were all done on the same day/week/etc.? I just want to make sure not to delete the wrong things.
I have:
- moved the existing DOI to the new dataset
- deleted the old dataset
- deleted the 2871 uploads associated with the old dataset
Everything looks good now from my end. Thanks for your help and sorry again for all the extra work and inconvenience.