tags : AWS, Filesystems, Distributed Filesystems, Storage, Storage Engines

FAQ

S3 presigned URL

It’s an AWS S3 concept, but S3-API-based object store providers like R2 also support it (with some limitations)

Generation of R2 presigned URLs

Generating the signed URL is a static operation: NO CONNECTION WITH R2 is made. Whatever generates the presigned URL just needs the correct access creds so that the URL can be signed correctly. (Verify)

“Presigned URLs are generated with no communication with R2 and must be generated by an application with access to your R2 bucket’s credentials.” all that’s required is the secret + an implementation of the signing algorithm, so you can generate them anywhere.
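
A minimal sketch of this with boto3 (the endpoint env vars, bucket, and key below are placeholders, not from any real setup); nothing here talks to R2 until someone actually uses the URL:

```python
# Sketch: presigned GET URL generation with boto3. No request is sent to
# R2/S3 here; the URL is computed locally with SigV4 from the credentials.
# Assumed placeholders: env vars R2_ENDPOINT / R2_ACCESS_KEY_ID /
# R2_SECRET_ACCESS_KEY, bucket "my-backups", key "reports/2024-01.tar.gz".
import os
import boto3

s3 = boto3.client(
    "s3",
    # For R2, point at the account's S3-compatible endpoint; omit
    # endpoint_url to target AWS S3 itself.
    endpoint_url=os.environ.get("R2_ENDPOINT"),
    aws_access_key_id=os.environ["R2_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["R2_SECRET_ACCESS_KEY"],
)

url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-backups", "Key": "reports/2024-01.tar.gz"},
    ExpiresIn=3600,  # seconds
)
print(url)
```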

  • Presigned URLs are something of a security tradeoff.
  • There are also examples on the interwebs which suggest sending out direct AWS credentials as a security alternative.
  • They’re specific to one resource: if someone tries to alter the identifiers in the generated URL, they’ll usually get a signature mismatch error from S3.

Gotchas

  • Another thing that bit me in the past: if I create a pre-signed URL using temp creds, the URL expires when the creds expire.
  • The most important thing about presigned URLs is that they authenticate you, but they do not authorize you.

S3 signed URL vs CloudFront signed URL

State as of 2024

Consistency in object stores

Concurrent Writes

E.g. Backups

  • You have backups on S3; backups are taken from multiple locations, so there are lots of duplicate files floating around in the bucket
  • You need to design a system which will
    • Deduplicate these backups to save some sweet space/disk/gandhi.
    • Expire data which was last modified more than X days ago
  • Concurrency issues with these 2 cases:
    • De-duplication: if we run 2 instances of the de-duplication program concurrently, each may decide the other copy is the “duplicate” and delete it, so we can end up with every copy of the file deleted!
    • Old data expiry: there is a race where one process is appending new data to an object while another process deletes it, because the object looked expired when the deleter first checked but is being updated right now. The new data goes missing because the object itself gets deleted.
  • Solution: external locks (e.g. DynamoDB) or CAS. (CAS is supported by other object stores, but not by S3; a rough lock sketch follows this list.)
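
A rough sketch of the external-lock option, assuming a DynamoDB table named "s3-mutex" with a string partition key "LockKey" (all names here are made up); each de-dup/expiry worker takes the lock for an object before deleting it:

```python
# Sketch of an external lock built on DynamoDB conditional writes.
# Assumptions (mine, not from the note): table "s3-mutex", partition key
# "LockKey"; callers delete the item to release the lock.
import time
import boto3
from botocore.exceptions import ClientError

ddb = boto3.client("dynamodb")

def acquire_lock(key: str, owner: str, ttl_seconds: int = 300) -> bool:
    try:
        ddb.put_item(
            TableName="s3-mutex",
            Item={
                "LockKey": {"S": key},
                "Owner": {"S": owner},
                # TTL attribute so abandoned locks eventually expire
                "ExpiresAt": {"N": str(int(time.time()) + ttl_seconds)},
            },
            # The lock is only taken if nobody else currently holds it.
            ConditionExpression="attribute_not_exists(LockKey)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # someone else holds the lock
        raise

def release_lock(key: str, owner: str) -> None:
    ddb.delete_item(
        TableName="s3-mutex",
        Key={"LockKey": {"S": key}},
        # Only the owner may release it.
        ConditionExpression="#o = :owner",
        ExpressionAttributeNames={"#o": "Owner"},
        ExpressionAttributeValues={":owner": {"S": owner}},
    )

# Usage: a worker wraps its S3 delete in the lock, e.g.
# if acquire_lock("backups/foo.tar.gz", worker_id): ... ; release_lock(...)
```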

CAS

  • In object-store speak, CAS is sometimes referred to as a “precondition”
  • It guarantees that no other thread wrote to the memory location in between the "compare" and the "swap"
  • See Lockfree and Compare-And-Swap Loops (CAS): CAS allows you to implement a Lock-free system
    • Guaranteed system-wide progress (FORWARD PROGRESS GUARANTEE)
    • Some operations can be blocked on specific parts, but the rest of the system continues to work without stalling
    • CAS is about supporting atomic renames
  • CAS is supported by other object stores via HTTP headers (a rough CAS-loop sketch against GCS follows this list)
    • GCS: x-goog-if-generation-match header
    • R2: cf-copy-destination-if-none-match: *
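
A rough CAS-loop sketch against GCS, using the Python client's if_generation_match (which maps to the x-goog-if-generation-match precondition); the bucket and key names are made up, and the object is assumed to already exist:

```python
# Sketch of a compare-and-swap loop using GCS generation preconditions.
# Assumptions (mine): bucket "my-bucket", object already exists; use
# if_generation_match=0 instead if you need create-if-absent semantics.
from google.cloud import storage
from google.api_core.exceptions import PreconditionFailed

client = storage.Client()
bucket = client.bucket("my-bucket")

def cas_append_line(key: str, line: str) -> None:
    blob = bucket.blob(key)
    while True:  # classic CAS retry loop
        blob.reload()                        # "compare": read current generation
        current = blob.download_as_bytes()
        try:
            blob.upload_from_string(
                current + line.encode() + b"\n",
                # "swap": only succeeds if the object's generation is still
                # the one we read, i.e. nobody wrote in between.
                if_generation_match=blob.generation,
            )
            return
        except PreconditionFailed:
            continue  # lost the race; re-read and retry
```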


S3

The old docs (before strong consistency) promised “read-after-write consistency for PUTS of new objects in your S3 bucket [but only] eventual consistency for overwrite PUTS and DELETES”.

Consistency Model of S3 (Strong consistency)

  • Initially S3 was eventually consistent, but around 2020/2021 strong consistency was added.
    • It’s similar to Causal consistency - Wikiwand

      “So this means that the “system” that contains the witness(es) is a single point of truth and failure (otherwise we would lose consistency again), but because it does not have to store a lot of information, it can be kept in-memory and can be exchanged quickly in case of failure.

      Or in other words: minimize the amount of information that is strictly necessary to keep a system consistent and then make that part its own in-memory and quickly failover-able system which is then the bar for the HA component.”

  • Strong Consistency
    • What is Amazon S3? - Amazon Simple Storage Service
    • All S3 GET, PUT, and LIST operations, as well as operations that change object tags, ACLs, or metadata, are now strongly consistent. What you write is what you will read, and the results of a LIST will be an accurate reflection of what’s in the bucket.

Last writer wins, no locking (Concurrent Writes)

From the docs:

  • Amazon S3 does not support object locking for concurrent writers. If two PUT requests are simultaneously made to the same key, the request with the latest timestamp wins. If this is an issue, you must build an object-locking mechanism into your application.
  • Updates are key-based. There is no way to make atomic updates across keys. For example, you cannot make the update of one key dependent on the update of another key unless you design this functionality into your application.
  • After 2020, under AWS S3’s consistency model most operations are strongly consistent, but concurrent operations on the same key are not coordinated (last writer wins).
  • The S3 API doesn’t have the concurrency control primitives necessary to guarantee consistency in the face of concurrent writes (a small illustration follows this list).
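
A small illustration of the last-writer-wins behavior the docs describe, with placeholder bucket/key names: two plain PUTs race on the same key, neither carries any precondition, and a later GET returns whichever body S3 timestamped last (no merge, no error):

```python
# Sketch: two concurrent PUTs to the same key. Bucket/key are placeholders.
import threading
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-backups", "manifests/latest.json"

def writer(body: bytes) -> None:
    # A plain PUT carries no precondition, so neither writer can detect
    # that it is overwriting the other.
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=body)

a = threading.Thread(target=writer, args=(b'{"writer": "A"}',))
b = threading.Thread(target=writer, args=(b'{"writer": "B"}',))
a.start(); b.start(); a.join(); b.join()

# Strong read-after-write consistency: this GET reflects the surviving PUT,
# but which writer "won" depends on request timestamps, not program order.
print(s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read())
```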