tags : Cryptography, Data Structures, HMAC, Encryption

FAQ

Are collisions expected?

  • Yes, because input can be any string, but the number returned is within some promised range, it’s possible that two different inputs can return the same number. This itself is the definition of an collision.
  • Good hash functions try to minimise how many collisions they produce. It’s not possible to completely eliminate collisions.
  • We have ways to handle collisions, when they occur.
  • Few hash functions are collision-free, but most are designed so that collisions are minimized. See Perfect hash function

Avalanche Effect

  • Refers to how many bits in the output value change when just a single bit of the input changes.
  • To say that a hash function has a good avalanche effect, a single bit flip in the input should result in an average of 50% the output bits flipping.

Non-Crypographic hashes?

  • Can be reversed using various methods, including brute-force attacks, rainbow table attacks, or other techniques depending on the specific hash function and its design.

Is MD5 cryptographic hash or no?

  • Originally it was a cryptographic hash function but later attacks were shown
  • Now its popularly used for non-cryptographic usage

TODO Length Extension Attack? what it do with HMAC?

MAC and HMAC

See HMAC

Types by usage

CryptographicTypeExample
NoShuffling Hashes/Lookup Hashesxxh3/SipHash
NoIntegrity hashesmd5/xxh3
NoSimilar Itemssoundex/perceptual hash
YesSignature hashes

Shuffling/Lookup hashes/Similar

  • What: Given an input, return a small number (16, 32, or 64 bits).
  • performance for small inputs
  • Eg. xxh3/xxh128/highwayhash/SipHash/djb2/Murmur3 (See comparision)
    • SipHash is mostly used as the hash function in hash-table implementation in most programming languages
  • Usage
    • Implementing data structures like hash tables
      • Use a hash function to determine which bucket an item belongs to(evenly distribute items among the buckets)
      • Converting a key into an index for memory addresses when constructing hash tables
    • Classifying inputs into a small number of buckets in sharding (see Scaling Databases)
    • Can be used for storage, but for storage, usually cryptographically strong hashes are used.
  • Tips

Find similar items hashes

Integrity hashes

Checksums based on these hash functions cannot be 100% reliable due to the possibility of collisions. They provide a probabilistic guarantee of integrity, not an absolute one.

  • Usually not cryptographic, Eg. CRC, MD5, xxh3
    • In modern systems, go w xxh3 for performance advantages etc.
  • We can use cryptographic hash functions as integrity hashes(they’ll be slower) if there’s a strong need for collision resistance but usually non-cryptographic functions suffice.

Signature Hashes/Cryptographic Hashes/Secure hashes

  • See Cryptographic and Non-Cryptographic Hash Functions · Dadario’s Blog
  • Even theoretically ‘broken’ ones like MD5, these hashes exhibit distribution that’s indistinguishable from randomness
  • Non-invertible and must be resistant to
    • Collision attacks (but doesn’t have to be collision free)
    • Preimage attacks
  • Larger output size than general hashing functions
  • Eg. BLAKE3, SHA256(with HMAC).
  • Whether or not cryptographic hash functions are truly one-way functions is like P vs NP (Complexity Theory)
  • Not to be used for password hashes, to be used with libraries like libsodium etc.

Usecases of hash functions

Key Derivation Hashes (KDF) / Token Hash & Password hashing

  • Taking a known (and secret) key to generate any number of unguessable tokens.
    • Input (password) -> KDF -> Derived Key
  • They’re like cryptographic hashing algorithms but with:
    1. Salt: A random value added to the input to prevent rainbow table attacks.
    2. Iteration Count: Number of times the function is applied, increasing computational cost.
    3. Memory-Hard Function: Some modern KDFs use memory-intensive operations.
  • Example

How is KDF and PKI different?

They’re not really related, but you could build a PKI with KDF if you wanted

  • It is possible to construct a public key signature using a cryptographic hash as the only underlying primitive.
  • These can be proven secure based on the usual assumptions about the security of the underlying cryptographic hash.
  • A major drawback is that they are expensive to compute, and the signatures require a huge number of bits compared to other public key algorithms.

Password Hashing

  • This is essentially KDF applied to passwords
  • BUT
    • In password hashing, we don’t like fast hashing functions, because password crackers do like fast hashing functions. The faster they can do, the sooner they can recover the password.
  • DIFFERENCE with KDF
    • specifically designed password hashing functions, where a custom work factor is included as a cost.
      • These have a work factor
    • digest length can be arbitrary with KDFs, whereas password hashing functions will have a fixed length output.
  • Eg. Argon2, scrypt, bcrypt, PBKDF2.

Encryption

  • See Encryption
  • Hashes are used as mixing functions in encryption
  • But you can also trivially turn a hash function into a Cipher and encrypt with it (apologies if I missed an explanation of this in the article). Just hash a key and a counter together to create a keystream and XOR your plaintext to it. That’s how Salsa20 and ChaCha20 work.

Hash tables

See Data Structures

Other uses

Properties of hash functions

See HashMap in Data Structures. Hashmap/Hashtable use a hash function to determine the collision.

Property NameDescriptionCrypographic HF
collision resistanceIt should be computationally infeasible to find any two different inputs that hash to the same value.NEEDED
pre-image resistanceGiven a hash, it should be computationally infeasible to find any input that hashes to that value.NEEDED
Second pre-image resistanceGiven an input, it should be computationally infeasible to find a different input with the same hash.NEEDED
One way functionNEEDED
Avalanche behaviorSmall changes in input cause significant changes in outputNEEDED
Uniform Distribution
collision free(perfect hashing)Not the same as collision resistant
Performance
Input typeSome functions only work on numbers, strings etc
Other paramsSome functions take a second param, seed etc

Collision free vs Collision resistant & Birthday Paradox

  • Due to the pigeonhole principle, collisions “must exist” for any hash function that maps a larger input space to a smaller output space.
  • cryptographic hash function need to be collision restistant NOT collision free
    • i.e the computational in-feasibility of finding a collision from the output.
  • Due to the birthday paradox, collisions can be found more easily than one might intuitively expect.
    • For an n-bit hash, the probability of finding a collision becomes significant after about 2^(n/2) hashes, rather than 2^n.
    • For 64-bit, with 2^32 files hashed
      • There’s a 50% chance that at least two of those files will have the same hash value.
      • The 50% probability refers to the chance of any collision existing among the 2^32 files, not the likelihood of a new file colliding.
      • TODO: Need to be more clear on the math here
  • Solution:
    • Use larger hash sizes to reduce collision probability based on requirement
    • Combine multiple hash functions for more robust checksums.