Sunday, 2 July 2017

Testing Data for Randomness

Movie "Contact 1997"
When I was a young man I watched “Contact 1997” movie, I don’t know why but it really stuck with me. I have recently re-watched it. There were few scenes where Eleanor Arroway listened to space noise on certain frequency, eventually she heard something. It was awesome. Now, in reality I imagine what she was doing would not work at all. There are many dishes recording data on different frequencies, there would be simply too much data for one person to go through. For fun, I have decided to write a program that can detect non-randomness automatically, this way Eleanor Arroway would not need to waste hours listening to space noise.

Non Randomness Detection

To make non randomness detection work I will use pattern discovery algorithm and birthday problem calculation that I have blogged about earlier. Pattern discovery algorithm will discover patterns in the provided data. Birthday problem calculation will be given three parameters, size of the data, all possible pattern permutations given the number of unique characters and length in identified pattern, and number of unique pattern matches. If probability of a given pattern is not in the range then it’s highly likely that pattern is not random.

   

public class RandomnessDetection
{
    public static List<KeyValuePair<string, List<int>>> FindNonRandomPatterns(Dictionary<string, List<int>> patterns, string raw)
    {
        if (patterns == null || patterns.Count == 0)
            throw new Exception("No patterns found, cannot test.");

        int uniqueChars = raw.Distinct().Count();
        int rawLength = raw.Length;

        if (uniqueChars == 1)
            throw new Exception("Need to have more than one character");

        if (rawLength < 29)
            throw new Exception("String needs to be at least 30 chars long");

        List<KeyValuePair<string, List<int>>> nonRandomPatterns = new List<KeyValuePair<string, List<int>>>();

        foreach (KeyValuePair<string, List<int>> pattern in patterns)
        {
            double possibleCombinations = Math.Pow(uniqueChars, pattern.Key.Length);
            double prob = BirthdayProblemPoissonApprox.ProbabilityOfPair(rawLength - 2, possibleCombinations, pattern.Value.Count);

            if (Math.Truncate(prob * 100) / 100 <= 0.01)
                nonRandomPatterns.Add(pattern);
        }

        return nonRandomPatterns;
    }
}


Command-line interface

Now we have all the key ingredients. We just need create a program that we can interact with and we can start testing data for randomness. I have decided to use Command Line Parser Library to parse args for me, it’s awesome. You will notice that I have allowed for binary content checking. Unfortunately when I have designed pattern discovery algorithm I have not designed it to work with the binary data, so to make that work I had to encode data to base64.

   

class Program
{
    class Options
    {
        [Option('f', "file", Required = true, HelpText = "Input file to be processed.")]
        public string InputFile { get; set; }

        [Option('b', "fileisbinary", DefaultValue = false, HelpText = "If file is binary it will be encoded before pattern search.")]
        public bool FileIsBinary { get; set; }

        [Option('s', "showallpatterns", DefaultValue = false, HelpText = "Shows all patterns random and non-random.")]
        public bool ShowAllPatterns { get; set; }

        [Option('v', "verbose", DefaultValue = true,
          HelpText = "Prints all messages to standard output.")]
        public bool Verbose { get; set; }

        [ParserState]
        public IParserState LastParserState { get; set; }

        [HelpOption]
        public string GetUsage()
        {
            return HelpText.AutoBuild(this,
              (HelpText current) => HelpText.DefaultParsingErrorsHandler(this, current));
        }
    }

    static void Main(string[] args)
    {
        var options = new Options();
        if (CommandLine.Parser.Default.ParseArguments(args, options))
        {
            if (options.Verbose)
            {
                Console.WriteLine("file: " + options.InputFile);
                Console.WriteLine("fileisbinary: " + options.FileIsBinary);
                Console.WriteLine("showallpatterns: " + options.ShowAllPatterns);
            }

            try
            {
                if (!File.Exists(options.InputFile))
                {
                    Console.WriteLine("File path doesn't exist, can't do pattern discovery");
                    return;
                }

                string raw = string.Empty;

                if (options.FileIsBinary)
                    raw = System.Convert.ToBase64String(File.ReadAllBytes(options.InputFile));
                else
                    raw = File.ReadAllText(options.InputFile);

                if (String.IsNullOrEmpty(raw))
                {
                    Console.WriteLine("Input File is empty");
                    return;
                }

                Console.WriteLine("Starting to discover patterns");

                Dictionary<string, List<int>> patterns = RepeatingPatternDiscovery.DiscoverPatterns(raw);

                Console.WriteLine($"Discovered {patterns.Count} patterns, testing them for randomness");

                List<KeyValuePair<string, List<int>>> nonRandomPatterns = RandomnessDetection.FindNonRandomPatterns(patterns, raw);

                if (nonRandomPatterns.Count != 0)
                {
                    Console.WriteLine($"Data is not random, {nonRandomPatterns.Count} non random patterns found:");

                    foreach (KeyValuePair<string, List<int>> pattern in nonRandomPatterns)
                    {
                        Console.WriteLine($"Pattern: {pattern.Key}, found: {pattern.Value.Count} times, Indexes: [{String.Join(", ", pattern.Value)}]");
                    }
                }
                else
                {
                    Console.WriteLine("Data is random");
                }

                if (options.ShowAllPatterns)
                {
                    Console.WriteLine("All patterns found:");

                    foreach (KeyValuePair<string, List<int>> pattern in patterns)
                    {
                        Console.WriteLine($"Pattern: {pattern.Key}, found: {pattern.Value.Count} times, Indexes: [{String.Join(", ", pattern.Value)}]");
                    }
                }

            }
            catch (Exception ex)
            {
                Console.WriteLine($"Error: {ex.Message}");
            }
        }
    }


Testing text for randomness

To start, I am going to test text_blogpost.txt and text_pi.txt for randomness. Is π random?
C:\data>randomnesstest -f C:\data\text_pi.txt
file: C:\data\text_pi.txt
fileisbinary: False
showallpatterns: False
Starting to discover patterns
Discovered 228688 patterns, testing them for randomness
Data is random

Apparently π is random. How about one of my blog posts?
C:\data>randomnesstest -f C:\data\text_blogpost.txt
file: C:\data\text_blogpost.txt
fileisbinary: False
showallpatterns: False
Starting to discover patterns
Discovered 559 patterns, testing them for randomness
Data is not random, non random patterns found:
Pattern:
    public Purchase Checkout(Cart cart)
    {
        Purchase purchase = new Purchase()
        {
             Customer = this,
             Products = cart.Products,
             Created = DateTime.Now
        };
        this.purchases.Add(purchase);
        return purchase;
    }
}

, found: 2 times, Indexes: [6718, 8033]
...
My blog post is not random. That’s fantastic news! Sometimes I feel like it is.

Testing pictures for randomness

Random noise picture:
white noise
C:\data>randomnesstest -f C:\data\pic_noise.png -b
file: C:\data\pic_noise.png
fileisbinary: True
showallpatterns: False
Starting to discover patterns
Discovered 7685 patterns, testing them for randomness
Data is random

Van Gogh, Bridge In The Rain picture:
van gogh rain
C:\data>randomnesstest -f C:\data\pic_vangogh.jpg -b
file: C:\data\pic_vangogh.jpg
fileisbinary: True
showallpatterns: False
Starting to discover patterns
Discovered 432021 patterns, testing them for randomness
Data is not random, non random patterns found:
Pattern: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA, found: 2 times, Indexes: [1311, 1312]
Pattern: CgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCg, found: 2 times, Indexes: [3388, 3392]
Pattern: DQ0NDQ0NDQ0NDQ0NDQ0NDQ0NDQ0NDQ0NDQ0NDQ0NDQ0N, found: 2 times, Indexes: [3508, 3512]
Pattern: AAAAAWFlaIAAAAAAAA, found: 2 times, Indexes: [1383, 3319]
Pattern: TWF1cmljZSBUcm9tc, found: 2 times, Indexes: [100, 548]
Pattern: ZGM6Y3JlYXRvcj4gP, found: 2 times, Indexes: [508, 596]
Pattern: EBAQAAAAAAAAAAAA, found: 2 times, Indexes: [3594, 3742]
...

Testing audio for randomness

For a final test I am going to test white noise and building noise for randomness. I have downloaded white noise file from here.
C:\data>randomnesstest -f C:\data\sound_whitenoise.wav -b
file: C:\data\sound_whitenoise.wav
fileisbinary: True
showallpatterns: False
Starting to discover patterns
Discovered 204117 patterns, testing them for randomness
Data is random
C:\data>randomnesstest -f C:\data\sound_buildingnoise.m4a -b
file: C:\data\sound_buildingnoise.m4a
fileisbinary: True
showallpatterns: False
Starting to discover patterns
Discovered 61762 patterns, testing them for randomness
Data is not random, non random patterns found:
Pattern: AwMDAwMDAwMCAwMDAwMDAwMCAwMDAwMDAwMCAwMDAwMDAwMCAwMDAwMDAwMCAwMDAwMDAwMCAwMDAwMDAwM, found: 2 times, Indexes: [262938, 262950]
Pattern: AAAAAAQAAAAAAAAAAAAAAAAAAAAEAAAAAAAAAAAAAAAAAAEAAAAAAAAAAAAAAAAAA, found: 2 times, Indexes: [256227, 256387]
Pattern: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA, found: 2 times, Indexes: [256274, 256275]
Pattern: aGQAAAAA0v3TfNL903wAAKxEAB, found: 2 times, Indexes: [256180, 256468]
Pattern: AAAAAAAAAAQAAAAAAAAAAAA, found: 2 times, Indexes: [256223, 256691]
Pattern: AAACvAAAAqgAAAKoAAAC, found: 2 times, Indexes: [256999, 261399]
Pattern: AAAArwAAAKwAAAClAAAA, found: 2 times, Indexes: [259132, 260604]
Pattern: AAACrAAAArwAAAKwAAAC, found: 2 times, Indexes: [259159, 260599]
...


Conclusion

So, there you go, we have just saved Eleanor Arroway centuries worth of work. Just to be clear, I’ve done this for fun, I am not expecting this to be super precise and to be used in production systems, there are many other much more sophisticated and proven ways that you can use to test for randomness.

No comments:

Post a Comment