Sunday, 11 June 2017

Repeating String Pattern Search

In this post I am going to introduce basic repeating string pattern search algorithm that I have written for fun. This algorithm’s average running time is Θ(N), worst running time is O(N^2) (see analysis below). This algorithm takes ~1 minute to search 10 million digits of Pi.

Example usage

  
string raw = "091283091283091831093810933";
Dictionary<string, List<int>> patterns = DiscoverPatterns(raw);
foreach (KeyValuePair<string, List<int>> pattern in patterns)
{
    Console.WriteLine($"Pattern: {pattern.Key}, Indexes: {String.Join(", ", pattern.Value)}");
}


Example output

Pattern: 091283091, Indexes: 0, 6
Pattern: 1093, Indexes: 17, 22


Source Code (licensed under the GNU licence v3.0)

  
public static Dictionary<string, List<int>> DiscoverPatterns(string raw, int patternSizeMin = 2)
{
    if (patternSizeMin < 1)
        throw new Exception("Minimum pattern size should not be less than 1.");

    Dictionary<string, HashSet<int>> initialPatternSearch = new Dictionary<string, HashSet<int>>();
    Dictionary<string, HashSet<int>> discoveredPatterns = new Dictionary<string, HashSet<int>>();

    int numberOfInitialSearches = raw.Length - (patternSizeMin - 1);
    for (int i = 0; i < numberOfInitialSearches; i++)
    {
        string pattern = raw.Substring(i, patternSizeMin);

        HashSet<int> patternIndexes;
        if (initialPatternSearch.TryGetValue(pattern, out patternIndexes))
        {
            discoveredPatterns[pattern] = patternIndexes;
        }
        else
        {
            patternIndexes = new HashSet<int>();
            initialPatternSearch.Add(pattern, patternIndexes);
        }

        patternIndexes.Add(i);
    }

    initialPatternSearch.Clear();

    Dictionary<string, HashSet<int>> basePatterns = discoveredPatterns;
    do
    {
        Dictionary<string, HashSet<int>> superPatterns = discoverSuperPatterns(raw, ++patternSizeMin, basePatterns);

        foreach (KeyValuePair<string, HashSet<int>> superPattern in superPatterns)
        {
            string basePattern = superPattern.Key.Substring(0, patternSizeMin - 1);
            discoveredPatterns.Remove(basePattern);

            string baseInnerPattern = superPattern.Key.Substring(1, patternSizeMin - 1);
            discoveredPatterns.Remove(baseInnerPattern);

            discoveredPatterns.Add(superPattern.Key, superPattern.Value);
        }

        basePatterns = superPatterns;
    } while (basePatterns.Count != 0);

    return discoveredPatterns.OrderByDescending(pattern => pattern.Key.Length).ToDictionary(pattern => pattern.Key, x => x.Value.ToList());
}

private static Dictionary<string, HashSet<int>> discoverSuperPatterns(string raw, int patternSize, Dictionary<string, HashSet<int>> basePatterns)
{
    Dictionary<string, HashSet<int>> initialSuperPatternSearch = new Dictionary<string, HashSet<int>>();
    Dictionary<string, HashSet<int>> discoveredSuperPatterns = new Dictionary<string, HashSet<int>>();

    foreach (KeyValuePair<string, HashSet<int>> basePattern in basePatterns)
    {
        foreach (int patternIndex in basePattern.Value)
        {
            if (raw.Length >= (patternIndex + patternSize))
            {
                string superPattern = raw.Substring(patternIndex, patternSize);

                HashSet<int> superPatternIndexes;
                if (initialSuperPatternSearch.TryGetValue(superPattern, out superPatternIndexes))
                {
                    discoveredSuperPatterns[superPattern] = superPatternIndexes;
                }
                else
                {
                    superPatternIndexes = new HashSet<int>();
                    initialSuperPatternSearch.Add(superPattern, superPatternIndexes);
                }

                superPatternIndexes.Add(patternIndex);
            }
        }
    }

    return discoveredSuperPatterns;
}


Analysis


Worst Case

If you provide input such as 010101…01 then running time will be ((N(N-1))/2)-4, this gives us a triangular number. This is because this input is just one big continuous pattern and it’s discovering patterns one character at the time i.e. (N-1)+(N-2)+(N-3)…+(N-(N-2)). If we drop all terms that grow slowly we end up with O(N^2).


Average Case

If input is not a big continuous pattern but has many small sub patterns then running time will be X*N. If we drop all terms that grow slowly we end up with average running time Θ(N).

No comments:

Post a Comment