[Solved] Counting efficiently pairs of strings, which together contain all the vowels


Suppose you have a very large number of strings. Your algorithm will compare all the strings between them, and thats a terrible number of iterations.

Radical improvement:

New approach

Build a map that associates a string of ordered unique vowels (e.g. “ae”) to a list of all the string found that contains exactly these unique vowels, whatever the number of repetition and the order. For example:

ao -> aaooaoaooa, aoa, aaooaaooooooo  (3 words)
uie -> uiieieiieieuuu, uuuuiiiieeee   (2 words)
aeio -> aeioooeeiiaiei                (1 word)

Of course, that’s a lot of strings, so in your code, you would use your bitmap rather than the string of ordered unique vowels. Also note that you don’t want to produce the list of combined string, but only their count. So you don’t need the list of all string occurenes, but just to maintain the count of strings matching the bitmap:

16385 -> 3 
1048848 -> 2
16657 -> 1  

Then look at the winning combinations between the mapping existing indexes, like you did it. For a large list of strings, you would have a much smaller list of mapping indexes, so that would be a significant improvement.

For each winning combination take the size of the first list of strings time the size of the second list of strings to increase your count.

16385 1048848 is complete -> 3 x 2 = 6 combinations
1048848 16657 is complete -> 2 x 1 = 2 combinations 
                                    ---
                                     8 potential combinations  

What’s the improvement ?

These combinations were found by analysing 3×2 bitmaps, rather than looking at 6×5 bitmaps corresponding to the unique strings. A gain by a significant order of magnitude if you have larger number of strings.

To be more general, as you have 5 wovels and there must be at least one, you can have only a maximum of 2<<5-1 so 31 different bitmaps and therefore a maximum of C(31,2) that’s 465 31*30 that’s 930 combinations to check whether you have 100 strings or 10 million strings as input. Would I be correct to state that it’s roughly O(n) ?

Possible implementation:

map<int, int> mymap; 
int n;
cin>>n;

for(int i=0;i<n;i++) {
    string s;
    cin>>s;
    int bm=0;
    for(int j=0;j<s.length();j++)
        bm |= (1<<(s[j]-'a'));
    mymap[bm]++;
}
int complete = (1<<('a'-'a')) | (1<<('e'-'a')) | (1<<('i'-'a')) | (1<<('o'-'a')) | (1<<('u'-'a'));
int count = 0;
int comparisons = 0; 
for (auto i=mymap.begin(); i!=mymap.end(); i++)  {
    auto j=i; 
    for(++j;j!=mymap.end();j++) {
        comparisons++; 
        if((i->first | j->first)==complete) {
            count += i->second * j->second; 
            cout << i->first <<" "<<j->first<<" :"<<i->second<<" "<<j->second<<endl;
        }
    }
}
auto special = mymap.find(complete);  // special case: all strings having all letters
if (special!=mymap.end()) {           // can be combined with themselves as well
    count += special->second * (special->second -1) / 2; 
}
cout<<"Result: "<<count<<" (found in "<<comparisons<<" comparisons)\n";

Online demo with 4 examples (only strings with all letters, your initial example, my example above, and an example with couple of more strings)

6

solved Counting efficiently pairs of strings, which together contain all the vowels