Hashing

tags : Cryptography, Data Structures, HMAC, Encryption

FAQ

Are collisions expected?

Yes, because input can be any string, but the number returned is within some promised range, it’s possible that two different inputs can return the same number. This itself is the definition of an collision.
Good hash functions try to minimise how many collisions they produce. It’s not possible to completely eliminate collisions.
We have ways to handle collisions, when they occur.
Few hash functions are collision-free, but most are designed so that collisions are minimized. See Perfect hash function

Avalanche Effect

Refers to how many bits in the output value change when just a single bit of the input changes.
To say that a hash function has a good avalanche effect, a single bit flip in the input should result in an average of 50% the output bits flipping.

Non-Crypographic hashes?

Can be reversed using various methods, including brute-force attacks, rainbow table attacks, or other techniques depending on the specific hash function and its design.

Is MD5 cryptographic hash or no?

Originally it was a cryptographic hash function but later attacks were shown
Now its popularly used for non-cryptographic usage

TODO Length Extension Attack? what it do with HMAC?

MAC and HMAC

See HMAC

Types by usage

See The many flavors of hashing — Volution Notes
See Use Fast Data Algorithms | Joey Lynch’s Site (You might as-well use something other than md5)

Cryptographic	Type	Example
No	Shuffling Hashes/Lookup Hashes	xxh3/SipHash
No	Integrity hashes	md5/xxh3
No	Similar Items	soundex/perceptual hash
Yes	Signature hashes

Shuffling/Lookup hashes/Similar

What: Given an input, return a small number (16, 32, or 64 bits).
performance for small inputs
Eg. xxh3/xxh128/highwayhash/SipHash/djb2/Murmur3 (See comparision)
- SipHash is mostly used as the hash function in hash-table implementation in most programming languages
Usage
- Implementing data structures like hash tables
  - Use a hash function to determine which bucket an item belongs to(evenly distribute items among the buckets)
  - Converting a key into an index for memory addresses when constructing hash tables
- Classifying inputs into a small number of buckets in sharding (see Scaling Databases)
- Can be used for storage, but for storage, usually cryptographically strong hashes are used.
Tips
- Do not use for security
- If input is controlled by user, consider using seed/key.
- Expect regular collisions
- Go with prime number or powers of 2 (avoids expensive modulo operations) for bucket count.

Find similar items hashes

Text: Soundex / metaphone / doublemetaphone.
Images: ahash/phash/dhash/whash/blockmeat/colormoment/colorhas/marrHildreth/Perceptual hashing - Wikipedia
- perceptual hash functions do not provide the same properties as cryptographic hash functions.
- AsuharietYgvar/AppleNeuralHash2ONNX#1 Working Collision?
Audio: acoustid
Digest: MD5

Integrity hashes

Checksums based on these hash functions cannot be 100% reliable due to the possibility of collisions. They provide a probabilistic guarantee of integrity, not an absolute one.

Usually not cryptographic, Eg. CRC, MD5, xxh3
- In modern systems, go w xxh3 for performance advantages etc.
We can use cryptographic hash functions as integrity hashes(they’ll be slower) if there’s a strong need for collision resistance but usually non-cryptographic functions suffice.

Signature Hashes/Cryptographic Hashes/Secure hashes

See Cryptographic and Non-Cryptographic Hash Functions · Dadario’s Blog
Even theoretically ‘broken’ ones like MD5, these hashes exhibit distribution that’s indistinguishable from randomness
Non-invertible and must be resistant to
- Collision attacks (but doesn’t have to be collision free)
- Preimage attacks
Larger output size than general hashing functions
Eg. BLAKE3, SHA256(with HMAC).
Whether or not cryptographic hash functions are truly one-way functions is like P vs NP (Complexity Theory)
Not to be used for password hashes, to be used with libraries like libsodium etc.

Usecases of hash functions

Key Derivation Hashes (KDF) / Token Hash & Password hashing

Taking a known (and secret) key to generate any number of unguessable tokens.
- Input (password) -> KDF -> Derived Key
They’re like cryptographic hashing algorithms but with:
1. Salt: A random value added to the input to prevent rainbow table attacks.
2. Iteration Count: Number of times the function is applied, increasing computational cost.
3. Memory-Hard Function: Some modern KDFs use memory-intensive operations.
Example
- HKDF is a valid KDF (stricter requirements than a PRF, Based on HMAC)
  - Understanding HKDF - Dhole Moments
- Blake3

How is KDF and PKI different?

They’re not really related, but you could build a PKI with KDF if you wanted

It is possible to construct a public key signature using a cryptographic hash as the only underlying primitive.
These can be proven secure based on the usual assumptions about the security of the underlying cryptographic hash.
A major drawback is that they are expensive to compute, and the signatures require a huge number of bits compared to other public key algorithms.

Password Hashing

This is essentially KDF applied to passwords
BUT
- In password hashing, we don’t like fast hashing functions, because password crackers do like fast hashing functions. The faster they can do, the sooner they can recover the password.
DIFFERENCE with KDF
- specifically designed password hashing functions, where a custom work factor is included as a cost.
  - These have a work factor
- digest length can be arbitrary with KDFs, whereas password hashing functions will have a fixed length output.
Eg. Argon2, scrypt, bcrypt, PBKDF2.

Encryption

See Encryption
Hashes are used as mixing functions in encryption
But you can also trivially turn a hash function into a Cipher and encrypt with it (apologies if I missed an explanation of this in the article). Just hash a key and a counter together to create a keystream and XOR your plaintext to it. That’s how Salsa20 and ChaCha20 work.

Hash tables

See Data Structures

Other uses

Properties of hash functions

See HashMap in Data Structures. Hashmap/Hashtable use a hash function to determine the collision.

Property Name	Description	Crypographic HF
collision resistance	It should be computationally infeasible to find any two different inputs that hash to the same value.	NEEDED
pre-image resistance	Given a hash, it should be computationally infeasible to find any input that hashes to that value.	NEEDED
Second pre-image resistance	Given an input, it should be computationally infeasible to find a different input with the same hash.	NEEDED
One way function		NEEDED
Avalanche behavior	Small changes in input cause significant changes in output	NEEDED
Uniform Distribution
collision free(perfect hashing)	Not the same as `collision resistant`
Performance
Input type	Some functions only work on numbers, strings etc
Other params	Some functions take a second param, seed etc

Collision free vs Collision resistant & Birthday Paradox

Due to the pigeonhole principle, collisions “must exist” for any hash function that maps a larger input space to a smaller output space.
cryptographic hash function need to be collision restistant NOT collision free
- i.e the computational in-feasibility of finding a collision from the output.
Due to the birthday paradox, collisions can be found more easily than one might intuitively expect.
- For an n-bit hash, the probability of finding a collision becomes significant after about 2^(n/2) hashes, rather than 2^n.
- For 64-bit, with 2^32 files hashed
  - There’s a 50% chance that at least two of those files will have the same hash value.
  - The 50% probability refers to the chance of any collision existing among the 2^32 files, not the likelihood of a new file colliding.
  - TODO: Need to be more clear on the math here
Solution:
- Use larger hash sizes to reduce collision probability based on requirement
- Combine multiple hash functions for more robust checksums.

🐏 mogoz

Table of Contents

Hashing

FAQ

Are collisions expected?

Avalanche Effect

Non-Crypographic hashes?

Is MD5 cryptographic hash or no?

TODO Length Extension Attack? what it do with HMAC?

MAC and HMAC

Types by usage

Shuffling/Lookup hashes/Similar

Find similar items hashes

Integrity hashes

Signature Hashes/Cryptographic Hashes/Secure hashes

Usecases of hash functions

Key Derivation Hashes (KDF) / Token Hash & Password hashing

How is KDF and PKI different?

Password Hashing

Encryption

Hash tables

Other uses

Properties of hash functions

Collision free vs Collision resistant & Birthday Paradox

Links

Graph View

Backlinks

🐏 mogoz

Table of Contents

Hashing

FAQ §

Are collisions expected? §

Avalanche Effect §

Non-Crypographic hashes? §

Is MD5 cryptographic hash or no? §

TODO Length Extension Attack? what it do with HMAC? §

MAC and HMAC §

Types by usage §

Shuffling/Lookup hashes/Similar §

Find similar items hashes §

Integrity hashes §

Signature Hashes/Cryptographic Hashes/Secure hashes §

Usecases of hash functions §

Key Derivation Hashes (KDF) / Token Hash & Password hashing §

How is KDF and PKI different? §

Password Hashing §

Encryption §

Hash tables §

Other uses §

Properties of hash functions §

Collision free vs Collision resistant & Birthday Paradox §

Links §

Graph View

Backlinks

FAQ

Are collisions expected?

Avalanche Effect

Non-Crypographic hashes?

Is MD5 cryptographic hash or no?

TODO Length Extension Attack? what it do with HMAC?

MAC and HMAC

Types by usage

Shuffling/Lookup hashes/Similar

Find similar items hashes

Integrity hashes

Signature Hashes/Cryptographic Hashes/Secure hashes

Usecases of hash functions

Key Derivation Hashes (KDF) / Token Hash & Password hashing

How is KDF and PKI different?

Password Hashing

Encryption

Hash tables

Other uses

Properties of hash functions

Collision free vs Collision resistant & Birthday Paradox

Links