tags : Cryptography, Data Structures, HMAC, Encryption
FAQ
Are collisions expected?
- Yes, because
input
can be any string, but the number returned iswithin some promised range
, it’s possible that two different inputs can return the same number. This itself is the definition of an collision. - Good hash functions try to minimise how many collisions they produce. It’s not possible to completely eliminate collisions.
- We have ways to handle collisions, when they occur.
- Few hash functions are collision-free, but most are designed so that collisions are minimized. See Perfect hash function
Avalanche Effect
- Refers to how many bits in the output value change when just a single bit of the input changes.
- To say that a hash function has a good avalanche effect, a single bit flip in the input should result in an average of 50% the output bits flipping.
Non-Crypographic hashes?
- Can be reversed using various methods, including brute-force attacks, rainbow table attacks, or other techniques depending on the specific hash function and its design.
Is MD5 cryptographic hash or no?
- Originally it was a cryptographic hash function but later attacks were shown
- Now its popularly used for non-cryptographic usage
TODO Length Extension Attack? what it do with HMAC?
MAC and HMAC
See HMAC
Types by usage
- See The many flavors of hashing — Volution Notes
- See Use Fast Data Algorithms | Joey Lynch’s Site (You might as-well use something other than md5)
Cryptographic | Type | Example |
---|---|---|
No | Shuffling Hashes/Lookup Hashes | xxh3/SipHash |
No | Integrity hashes | md5/xxh3 |
No | Similar Items | soundex/perceptual hash |
Yes | Signature hashes |
Shuffling/Lookup hashes/Similar
- What: Given an input, return a small number (16, 32, or 64 bits).
- performance for small inputs
- Eg. xxh3/xxh128/highwayhash/SipHash/djb2/Murmur3 (See comparision)
- SipHash is mostly used as the hash function in hash-table implementation in most programming languages
- Usage
- Implementing data structures like
hash tables
- Use a hash function to determine which bucket an item belongs to(evenly distribute items among the buckets)
- Converting a key into an index for memory addresses when constructing hash tables
- Classifying inputs into a small number of buckets in sharding (see Scaling Databases)
- Can be used for storage, but for storage, usually cryptographically strong hashes are used.
- Implementing data structures like
- Tips
- Do not use for security
- If input is controlled by user, consider using seed/key.
- Expect regular collisions
- Go with prime number or powers of 2 (avoids expensive modulo operations) for bucket count.
Find similar items hashes
- Text: Soundex / metaphone / doublemetaphone.
- Images: ahash/phash/dhash/whash/blockmeat/colormoment/colorhas/marrHildreth/Perceptual hashing - Wikipedia
- perceptual hash functions do not provide the same properties as cryptographic hash functions.
- AsuharietYgvar/AppleNeuralHash2ONNX#1 Working Collision?
- Audio: acoustid
- Digest: MD5
Integrity hashes
Checksums based on these hash functions cannot be 100% reliable due to the possibility of collisions. They provide a probabilistic guarantee of integrity, not an absolute one.
- Usually not cryptographic, Eg. CRC, MD5, xxh3
- In modern systems, go w xxh3 for performance advantages etc.
- We can use cryptographic hash functions as integrity hashes(they’ll be slower) if there’s a strong need for collision resistance but usually non-cryptographic functions suffice.
Signature Hashes/Cryptographic Hashes/Secure hashes
- See Cryptographic and Non-Cryptographic Hash Functions · Dadario’s Blog
- Even theoretically ‘broken’ ones like MD5, these hashes exhibit distribution that’s indistinguishable from randomness
- Non-invertible and must be resistant to
- Collision attacks (but doesn’t have to be collision free)
- Preimage attacks
- Larger output size than general hashing functions
- Eg. BLAKE3, SHA256(with HMAC).
- Whether or not cryptographic hash functions are truly one-way functions is like P vs NP (Complexity Theory)
- Not to be used for password hashes, to be used with libraries like libsodium etc.
Usecases of hash functions
Key Derivation Hashes (KDF) / Token Hash & Password hashing
- Taking a known (and secret) key to generate any number of unguessable tokens.
Input (password) -> KDF -> Derived Key
- They’re like cryptographic hashing algorithms but with:
- Salt: A random value added to the input to prevent rainbow table attacks.
- Iteration Count: Number of times the function is applied, increasing computational cost.
- Memory-Hard Function: Some modern KDFs use memory-intensive operations.
- Example
- HKDF is a valid KDF (stricter requirements than a PRF, Based on HMAC)
- Blake3
How is KDF and PKI different?
They’re not really related, but you could build a PKI with KDF if you wanted
- It is possible to construct a public key signature using a cryptographic hash as the only underlying primitive.
- These can be proven secure based on the usual assumptions about the security of the underlying cryptographic hash.
- A major drawback is that they are expensive to compute, and the signatures require a huge number of bits compared to other public key algorithms.
Password Hashing
- This is essentially KDF applied to passwords
- BUT
- In password hashing, we don’t like fast hashing functions, because password crackers do like fast hashing functions. The faster they can do, the sooner they can recover the password.
DIFFERENCE with KDF
- specifically designed password hashing functions, where a custom work factor is included as a cost.
- These have a
work factor
- These have a
- digest length can be arbitrary with KDFs, whereas password hashing functions will have a
fixed length output
.
- specifically designed password hashing functions, where a custom work factor is included as a cost.
- Eg. Argon2, scrypt, bcrypt, PBKDF2.
Encryption
- See Encryption
- Hashes are used as
mixing functions
in encryption - But you can also trivially turn a hash function into a Cipher and encrypt with it (apologies if I missed an explanation of this in the article). Just hash a key and a counter together to create a keystream and XOR your plaintext to it. That’s how Salsa20 and ChaCha20 work.
Hash tables
See Data Structures
Other uses
Properties of hash functions
See HashMap in Data Structures. Hashmap/Hashtable use a hash function to determine the collision.
Property Name | Description | Crypographic HF |
---|---|---|
collision resistance | It should be computationally infeasible to find any two different inputs that hash to the same value. | NEEDED |
pre-image resistance | Given a hash, it should be computationally infeasible to find any input that hashes to that value. | NEEDED |
Second pre-image resistance | Given an input, it should be computationally infeasible to find a different input with the same hash. | NEEDED |
One way function | NEEDED | |
Avalanche behavior | Small changes in input cause significant changes in output | NEEDED |
Uniform Distribution | ||
collision free(perfect hashing) | Not the same as collision resistant | |
Performance | ||
Input type | Some functions only work on numbers, strings etc | |
Other params | Some functions take a second param, seed etc |
Collision free vs Collision resistant & Birthday Paradox
- Due to the
pigeonhole principle
, collisions “must exist” for any hash function that maps a larger input space to a smaller output space. - cryptographic hash function need to be
collision restistant
NOTcollision free
- i.e the computational in-feasibility of finding a collision from the output.
- Due to the
birthday paradox
, collisions can be found more easily than one might intuitively expect.- For an n-bit hash, the probability of finding a collision becomes significant after about 2^(n/2) hashes, rather than 2^n.
- For 64-bit, with 2^32 files hashed
- There’s a 50% chance that at least two of those files will have the same hash value.
- The 50% probability refers to the chance of any collision existing among the 2^32 files, not the likelihood of a new file colliding.
- TODO: Need to be more clear on the math here
- Solution:
- Use larger hash sizes to reduce collision probability based on requirement
- Combine multiple hash functions for more robust checksums.