String Hashing

Authors: Benjamin Qi, Andi Qu

Contributors: Andrew Wang, Kevin Sheng

Quickly test equality of substrings with a small probability of failure.

Prerequisites

Gold - Modular Arithmetic

Tutorial


Resources
	CPH	26.3 - String Hashing	good intro
	cp-algo	String Hashing	code
	PAPS	14.3 - Hashing	many applications

Optional

If "small" isn't a satisfying enough answer to "what's the probability of collision?", check out rng-58's blog post on hashing. It covers the Schwartz-Zippel lemma and how to use it to bound the probability of a collision.

It also explains how to hash rooted trees - an uncommon technique, but still useful to know!

Implementation - Single Base

As mentioned in the articles above, there is no need to calculate modular inverses.

C++

#include <vector>
#include <string>

using namespace std;

class HashedString {
	private:
		// change M and P if you want
		static const long long M = 1e9 + 9;
		static const long long P = 9973;

Java

import java.util.*;

public class HashedString {
	// Change M and P if you want
	public static final long M = (long) 1e9 + 9;
	public static final long P = 9973;

	// pow[i] contains P^i % M
	private static ArrayList<Long> pow = new ArrayList<>();

Python

class HashedString:
	# Change M and P if you want
	M = int(1e9) + 9
	P = 9973

	# pow[i] contains P^i % M
	_pow = [1]

	def __init__(self, s: str):
		while len(self._pow) < len(s):

This implementation calculates

$$\texttt{hsh}[i] = \left(\sum_{x = 0}^i P^{i - x} \cdot S[x]\right) \bmod M$$

The hash of any particular substring $S[a : b]$ (both endpoints inclusive) is then calculated as

$$\left(\sum_{x = a}^b P^{b - x} \cdot S[x] \right) \bmod M = (\texttt{hsh}[b] - \texttt{hsh}[a - 1] \cdot P^{b - a + 1}) \bmod M$$

using prefix sums, where $\texttt{hsh}[-1]$ is taken to be $0$. This is nice because the highest power of $P$ in that polynomial will always be $P^{b - a}$, so equal substrings get equal hashes no matter where they start, and no modular inverse is needed.
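The prefix-sum idea can be sketched in Python as follows. This is a minimal sketch, not the module's own implementation: the names `build_prefix_hashes`, `substring_hash`, `pref`, and `pw` are introduced here for illustration, and `pref` is shifted by one (so `pref[i + 1]` holds $\texttt{hsh}[i]$ and `pref[0] = 0` stands for the empty prefix), which avoids a special case when the substring starts at index 0.

```python
M = 10**9 + 9  # modulus from the text
P = 9973       # base from the text


def build_prefix_hashes(s: str):
    """Return (pref, pw) where pref[i + 1] equals hsh[i] and pw[i] = P^i mod M."""
    pref = [0] * (len(s) + 1)  # pref[0] = 0 is the hash of the empty prefix
    pw = [1] * (len(s) + 1)
    for i, ch in enumerate(s):
        pref[i + 1] = (pref[i] * P + ord(ch)) % M
        pw[i + 1] = pw[i] * P % M
    return pref, pw


def substring_hash(pref, pw, a: int, b: int) -> int:
    """Hash of S[a : b] with both endpoints inclusive (0-indexed)."""
    # Python's % always returns a non-negative result for a positive modulus.
    return (pref[b + 1] - pref[a] * pw[b - a + 1]) % M


pref, pw = build_prefix_hashes("abcabc")
# Equal substrings at different offsets share a hash.
same = substring_hash(pref, pw, 0, 2) == substring_hash(pref, pw, 3, 5)
```

Here `same` compares the two occurrences of `"abc"`; because both hashes use the powers $P^2, P^1, P^0$, they can be compared directly.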

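Since $10^9 + 9$ is prime, the probability of collision when using this hash is at most $\frac{N}{10^9 + 9} < 10^{-4}$ for $N \le 10^5$, by the Schwartz-Zippel lemma. This means that if you fix any two different strings of length at most $N$ and then choose the base uniformly at random modulo $10^9 + 9$, the probability that they hash to the same value is at most $10^{-4}$. Note that the bound relies on the base being random: a fixed base such as the $9973$ in the code works on tests that were not built against it, but it can be targeted by an adversary who knows it.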

Implementation - Multiple Bases

Resources
	CF	dacin21 - Anti-Hash Tests	regarding CF educational rounds in particular
	Benq	HashRange

It's generally a good idea to use two randomized bases rather than just one to decrease the probability that two random strings hash to the same value.
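As a rough illustration, the sketch below hashes a string under two bases drawn at run time. The names `BASES` and `double_hash` and the lower bound of 256 on the bases are assumptions of this sketch, not taken from the linked resources. If the two bases are independent, the single-base bound applies to each separately, so both hashes colliding at once is far less likely.

```python
import random

M = 10**9 + 9  # prime modulus from the text

# Two bases drawn at run time so that they cannot be targeted in advance.
# The lower bound 256 is an arbitrary choice for this sketch.
BASES = (random.randrange(256, M), random.randrange(256, M))


def double_hash(s: str):
    """Polynomial hash of s under each base, returned as a pair."""
    result = []
    for base in BASES:
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % M
        result.append(h)
    return tuple(result)
```

Two strings are then treated as equal only when both components of their hash pairs agree.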

Searching For Strings

CCC - Easy

Focus Problem – read through this problem before continuing!

Solution - Searching For Strings

One Hash

Time Complexity: $\mathcal O((|N| + |H|) \cdot \Sigma)$ , where $\Sigma$ is the size of the alphabet.

We'll use a sliding window over $H$ to find the "matches" with $N$ .

Since we don't care about relative order when comparing two substrings, we can store frequency tables of the characters in the current window and in $N$ . When we slide the window, at most two values in that table change. To compare two substrings, we simply compare the 26 values in each table.
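The sliding window over frequency tables might look like the following sketch, which only counts matching windows. The function name `count_anagram_windows` is hypothetical, and the input is assumed to consist of lowercase letters `a` to `z`.

```python
def count_anagram_windows(needle: str, haystack: str) -> int:
    """Count windows of haystack that are permutations of needle (lowercase a-z)."""
    k = len(needle)
    if k == 0 or k > len(haystack):
        return 0
    target = [0] * 26
    window = [0] * 26
    for ch in needle:
        target[ord(ch) - ord("a")] += 1
    matches = 0
    for i, ch in enumerate(haystack):
        window[ord(ch) - ord("a")] += 1  # character entering the window
        if i >= k:
            window[ord(haystack[i - k]) - ord("a")] -= 1  # character leaving
        if i >= k - 1 and window == target:  # compares all 26 counters
            matches += 1
    return matches
```

Each step touches at most two counters, and the list comparison is the $\Sigma$ factor in the time complexity.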

If we only needed to count the number of matches, then the above alone would suffice (in fact, IOI 2006 Writing is just that). However, we need to count the distinct permutations of $N$ in $H$ , so we need to be a bit more clever.

One way to solve this is by storing the polynomial hashes of each match in a hashset, since we expect different permutations to have different polynomial hashes. The answer would simply be the size of that hashset at the end.

Since the test data for this particular problem is very strong, we will probably get hash collisions with only one hash. To remedy this, we use two hashes for each match - this significantly decreases the probability of collisions.

Using the base $9973$ with the two moduli $10^9 + 9$ and $10^9 + 7$ works for this problem. (Note that using two bases with the same modulus works too.)
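Putting the pieces together, here is a hedged Python sketch of the approach described above: frequency tables detect matches, and a pair of polynomial hashes (one per modulus) identifies each matching window in a set. The function name `count_distinct_permutations` and the helper arrays `pref` and `pw` are names chosen for this sketch, and lowercase input is assumed. It illustrates the idea and is not tuned for the judge's time limits.

```python
P = 9973
MODS = (10**9 + 9, 10**9 + 7)


def count_distinct_permutations(needle: str, haystack: str) -> int:
    """Number of distinct permutations of needle that occur in haystack."""
    k, n = len(needle), len(haystack)
    if k == 0 or k > n:
        return 0
    # Prefix hashes and powers of P under each modulus.
    pref = [[0] * (n + 1) for _ in MODS]
    pw = [[1] * (n + 1) for _ in MODS]
    for j, m in enumerate(MODS):
        for i, ch in enumerate(haystack):
            pref[j][i + 1] = (pref[j][i] * P + ord(ch)) % m
            pw[j][i + 1] = pw[j][i] * P % m

    target = [0] * 26
    window = [0] * 26
    for ch in needle:
        target[ord(ch) - ord("a")] += 1

    seen = set()  # hash pairs of the matching windows found so far
    for i, ch in enumerate(haystack):
        window[ord(ch) - ord("a")] += 1
        if i >= k:
            window[ord(haystack[i - k]) - ord("a")] -= 1
        if i >= k - 1 and window == target:
            start = i - k + 1
            seen.add(tuple(
                (pref[j][i + 1] - pref[j][start] * pw[j][k]) % m
                for j, m in enumerate(MODS)
            ))
    return len(seen)
```

For the needle `aab` and haystack `abacabaa`, the matching windows are `aba`, `aba`, and `baa`, so the set ends up with two entries.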

C++

#include <bits/stdc++.h>
typedef long long ll;
using namespace std;

const ll P = 9973, M1 = 1e9 + 9, M2 = 1e9 + 7;

int freq_target[26], freq_curr[26];
string n, h;

int main() {

Two Hashes

Time Complexity: $\mathcal O((|N| + |H|) \log M)$

An alternative solution without frequency tables would be to hash the substrings that we're trying to match. Since order doesn't matter, we need to modify our hash function slightly.

In particular, instead of computing the polynomial hash of the substrings, compute the product $(P + s_1)(P + s_2) \cdots (P + s_k) \bmod M$ as the hash (again, using two moduli). This hash is nice because the relative order of the letters doesn't matter, as multiplication is commutative.

Since this hash requires the modular inverse, there's an extra $\log M$ factor in the time complexity.
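The sketch below shows where the inverse comes in: sliding the window multiplies in the factor for the new character and multiplies by the inverse of the factor for the old one. It only covers the match detection step, and the names `product_hash` and `matching_window_starts` are hypothetical. The inverse is computed with Fermat's little theorem, which is valid because both moduli are prime and every factor $P + s_i$ is nonzero modulo them.

```python
P = 9973
MODS = (10**9 + 9, 10**9 + 7)


def product_hash(s: str):
    """Order-independent hash: product of (P + c) over the characters, per modulus."""
    out = []
    for m in MODS:
        h = 1
        for ch in s:
            h = h * (P + ord(ch)) % m
        out.append(h)
    return tuple(out)


def matching_window_starts(needle: str, haystack: str):
    """Start indices of windows whose product hash equals that of needle."""
    k = len(needle)
    if k == 0 or k > len(haystack):
        return []
    target = product_hash(needle)
    cur = list(product_hash(haystack[:k]))
    starts = [0] if tuple(cur) == target else []
    for i in range(k, len(haystack)):
        for j, m in enumerate(MODS):
            # Fermat's little theorem: x^(m - 2) is the inverse of x modulo prime m.
            inv_old = pow(P + ord(haystack[i - k]), m - 2, m)
            cur[j] = cur[j] * (P + ord(haystack[i])) % m * inv_old % m
        if tuple(cur) == target:
            starts.append(i - k + 1)
    return starts
```

Each call to `pow` with a modulus costs $\mathcal O(\log M)$ multiplications, which is the extra factor in the stated complexity.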

Alternative hashes (e.g. computing the sum $(P + s_1)^2 + (P + s_2)^2 + \dots + (P + s_k)^2 \bmod M$ ) also work for other hashing problems, but the test cases are too strong for that to pass here.
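For comparison, a sum-based hash of this kind can be maintained with additions and subtractions only, so no inverse is needed. The following is a small sketch with a single modulus; the function name `sum_of_squares_hashes` is made up for illustration.

```python
P = 9973
M = 10**9 + 9


def sum_of_squares_hashes(s: str, k: int):
    """Hashes of all length-k windows of s, using the sum of (P + c)^2 mod M."""
    if k == 0 or k > len(s):
        return []
    term = [(P + ord(ch)) ** 2 % M for ch in s]
    cur = sum(term[:k]) % M
    hashes = [cur]
    for i in range(k, len(s)):
        cur = (cur + term[i] - term[i - k]) % M  # add the new term, drop the old one
        hashes.append(cur)
    return hashes
```

As with the product hash, windows that are permutations of each other always receive the same value.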

C++

#include <bits/stdc++.h>
typedef long long ll;
using namespace std;

const ll P = 9973, M1 = 1e9 + 9, M2 = 1e9 + 7;

ll inv(ll base, ll MOD) {
	ll ans = 1, expo = MOD - 2;
	while (expo) {
		if (expo & 1) ans = ans * base % MOD;

Problems

Source	Problem Name	Difficulty	Tags
CEOI	2017 - Palindromic Partitions	Easy	Greedy, Hashing
CF	Palindromic Characteristics	Easy	DP, Hashing
CF	Check Transcription	Easy	Hashing
Gold	Bovine Genomics	Normal	Hashing
Gold	Lights Out	Normal	Hashing, Simulation
RMI	2017 - Hangman 2	Normal	Hashing
COCI	2017 - Osmosmjerka	Normal	Hashing, Probability
COCI	2021 - Sateliti	Hard	Binary Search, Hashing
CF	Liar	Hard	DP, Hashing
Baltic OI	2018 - Genetics	Hard	Hashing
COCI	2016 - Zamjene	Very Hard	DSU, Hashing
COI	2016 - Palinilap	Very Hard	Binary Search, Hashing
