Notes on hashing

A hash function takes some data of an arbitrary size and returns a value of a fixd size.

Properties of a hash function

  • Deterministic. The same input always produces the same output.
  • Minimal collisions. The numbner of different inputs that produce the same output is ideally none existent.
  • Uniformity. Every possible output value should be generated with the same probability.
  • Avalanche effect. Small change in the input have big impact in the output.
  • Irreversible. The input can not be determined from the output value.
  • Efficient. Use little CPU resources.

Common applications

  1. Cryptographic security (passwords, digital signatures, blockchain).
  2. Data indexing and retrieval (hash tables).
  3. Verifying data integrity in secure contexts.

Hashing function subsets

There is a large number of hashing functions and each one of them has different properties. Based on the function properties it might be suitable for a different application.

I have found two main different subsets: cryptographic functions and functions used for checksums.

Cryptographic hashing functions

For cryptographic applications the following properties are important:

  • Irreversible
  • Collision resistance. Not two inputs generate the same output.
  • Salting support. This is the possibility of adding a random string to each input as part of the hashing process.

Applications:

  • File verification
  • Digital signature
  • Password verification
  • Proof of work

For this use case a slower algorithm can be beneficial. The reason is that potential attackers will require more time to test each input. Also the time penalty paid by real users does not happen too often. For example, users log in once per session.

Algorithms used for this end: Argon2id, bcrypt, scrypt, PBKDF2. All of these algorithms have an adaptable work factor, this is the number of times the hashing algorithm is performed on the input is an input to the algorithm.

  • bcrypt (1999). Automatically salts the passwords.
  • PBKDF2 (2000).
  • scrypt (2009). Memory-hardness, not as memory hard as Argon2.
  • Argon2id (2016). Automatically salts the passwords. Memory-hardness. Configurable parallelism parameters.

Note that there are cryptocurrencies based on the scrypt algorithm (Bitcoin, Ethereum and Dogecoin among others). Therefore, cryptomining hardware which pursues to increase the number of hashes calculated per unit of time can be used for password cracking.

Avoid the use of:

  • MD5. Obsolete for security purposes.
  • SHA-1. Obsolete for security purposes.
  • SHA family. Although the algorithms in this family are cryptographically secure, they are not memory-hard and are fast to compute. They could be used in combinations with a large number of iterations and salting.
  • CRC family. These are designed for error detection and are extremely fast.
  • General hash functions: Blake2, MurmurHash, xxHash.

At the time of writing, the recomendations of the OWASP foundation is:

  1. Argon2id with a minimum use of 19MiB, iteration count 2 and 1 degree of parallelism.
  2. scrypt with min CPU/memory cost 2^17, min block size 8, parallelization param 1.
  3. bcrypt with min work factor 10 and password limit size 72 bytes.

Checksum hashing functions

Important properties:

  • Speed and efficiency
  • Low collision rate
  • Compounding. The hash of a string can be calculated as the compound of its substrings.

Note that compounding is very important since it will allow to calculate checksums for files which do not fit on the main memory. Large files can be read and hashed on chunks.

For the checksum purpose the algorithm does not need to be cryptographically secure.

Algorithms used for this purpose: CRC32, MD5, SHA-256, Blake2, xxHash.