0w1

Hacking hash algorithms on string with birthday attack

Many people are familiar with hash algorithms on string because they are easy to write. However, not many are really aware of the probability of collision. Like me, I had always thought the probability of collision is ( 1 / MOD ), and I had always omitted because I thought declaring it as unsigned long long and letting it overflow naturally was enough. However, it is found to be quite easy to generate test cases to make it WA. Take the code below as an example. It is to solve a problem "Given string a and b, answer the occurrence of b in a. 1 ≤ |a|, |b| ≤ 1e5".

#include <bits/stdc++.h>
using namespace std;
typedef long long LL;
const int N=100003;
const int mod=1000000007, X=991;
char a[N],b[N];
LL pre[N];
LL hashval(char s[],int n) {
  LL ret=0;
  for(int i=0;i<n;i++)
    ret = (ret*X + s[i]) % mod;
  return ret;
}
void make_pre(int n) {
  for(int i=1;i<=n;i++)
    pre[i] = (pre[i-1]*X + a[i]) %mod;
}
int main() {
  int T;
  scanf("%d",&T);
  while(T--) {
    scanf("%s%s",a+1,b+1);
    int n=strlen(a+1), m=strlen(b+1);
    if(n < m) {puts("0");continue;}
    LL bval=hashval(b+1,m);
    make_pre(n);
    int ans=0;
    LL offset = 1;
    for(int i=0;i<m;i++) offset = (offset*X) % mod;
    for(int i=m;i<=n;i++)
      if((pre[i] - pre[i-m]*offset%mod + mod)%mod == bval)
        ans++;
    printf("%d\n",ans);
  }
}

It will get AC if it were on any usual judge. However, this is where it gets exciting.
There is a famous paradox against our instinct, called the "birthday paradox". The original problem was, "what's the percentage of any 2 people having the same birthday, among 23 random people?". Surprisingly, it is around 50%. It is not difficult to do some math to find that out, by doing 1 - 365 / 365 - 364 / 365 - .. - ( 365 - 23 + 1 ) / 365. Anyways, the spirit is same when it comes to hashing. Looking into the birthday problem a little more further, we can find that in order to produce 50% collision, only a number of strings that are linear to the square root of the MOD is required to enumerate, in order to find someone that has the same identity ( hash value ), but are actually different at all!!
However, note that the birthday paradox is talking about any 2 having collision. The reason why this will pass for random test cases is because it is one string for multiple matches. In other words, the probability that it will match to something it shouldn't is kind of equal to the probability a particular one walks into a room and finds out someone has exactly the same birthday with him. This is not the original birthday paradox, and the probability will become relatively small ( 23 / 365 for 23 people ).
So what we will do, is to craft an easy program as below. And all these are already a well-known method called the birthday attack.

#include <bits/stdc++.h>
using namespace std;
typedef long long LL;
const int N=100003;
const int mod=1000000007, X=991;
LL hashval(string s,int n) {
  LL ret=0;
  for(int i=0;i<n;i++)
    ret = (ret*X + s[i]) % mod;
  return ret;
}

const int LEN = 20; // enough to cause overflow
string generate_random_string(int len){
    string res;
    for(int i = 0; i < len; ++i)
        res += (char)( rand() % 26 + 'a' );
    return res;
}

int main() {
  int T;
  int sqrmod = sqrt( mod ); // actually you can just set INF to let it run until found
  map<LL, string> st;
  srand( time(NULL) ); // you will find out that it really is around 50%
  for(int i = 0; i < sqrmod; ++i){
    string s = generate_random_string( LEN );
    LL hv = hashval( s, LEN );
    if( st.count( hv ) ){
      if( st[ hv ] != s ){
        cout << s << " " << st[ hv ] << endl;
        return 0;
      }
    } else
      st[ hv ] = s;
  }
  return 0;
}

Soon you will get 2 different strings that share the same hash value.
And I still thought my MOD = limit of unsigned long long was impenetrable. Until I saw this
Anti-hash test. - Codeforces.
I had better run at least two different hashes to check, if I had to use hash in the future!
f:id:h0rnet:20160208010912j:plain