
Probability of a pattern being repeated...

Options
  • 02-09-2009 12:24pm
    #1
    Registered Users Posts: 60 ✭✭


    Hi all

    I'm trying to get a definitive answer to the following probability-related question. If you do have an answer, can you please include a reference to literature that covers this issue?

    I want to try to calculate the probability of 'guessing' or accidentally identifying a pattern. The pattern is a string of binary digits (so each can only be a '1' or a '0'), and is, for example, 100 bits long. As I understand it, the probability of identifying each and every bit in the sequence is 1/2, so the chance of getting every one of the 100 bits is 1/2^100 (1 / (2 to the power of 100)).

    If that is accurate, then what I need to know is: if the pattern is a twice-repeated pattern, therefore 200 bits (a 100-bit sequence repeated twice), but there is no pre-knowledge of the fact that it is a repeated pattern, is the probability then 1/2^200, or is it ((1/2^100)*(1/2^100)), or something else entirely? I'm trying to calculate the probability of a bit sequence being identified with absolutely no knowledge of the length of the sequence pattern, nor of any other information except the length of the *overall* sequence, which may contain 1 or more repeated patterns.

    Any advice, pointers to relevant literature etc, would be appreciated

    Ron, IrishUnsigned.com


Comments

  • Registered Users Posts: 2,481 ✭✭✭Fremen


    Well if you don't know a priori that there's a repetition in the pattern, it's 1/2^200.

    A fact that you may have missed is that 1/2^200 = (1/2^100)*(1/2^100).

    Of course, if you do know that there is repetition, the probability becomes 1/2^100. If you want to get your head around it properly, reduce the size of the key to something you can work on with a pen and paper, say 4 bits.

    I don't know where you'd find a reference in the literature; it's a slightly unusual problem. If it's for a college report, you could just prove it yourself. Prove it for two bits, then show that adding one more bit to each copy of the pattern (two bits in total) multiplies the probability of a correct guess by 1/4 if the guesser doesn't know there's repetition, or by 1/2 if the guesser does know. That's actually a more complicated proof than you need, but at least it's rigorous.
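
    Fremen's suggestion to check a small case with pen and paper can also be done exhaustively by machine. A minimal Python sketch (my own, not from the thread): fix one repeated secret and enumerate every guess the guesser considers possible, with and without knowledge of the repetition.

    ```python
    from fractions import Fraction
    from itertools import product

    def guess_probability(n, knows_repetition):
        """P(one uniform guess hits a fixed twice-repeated secret of length 2n)."""
        pattern = "0" * n            # any fixed pattern; by symmetry the choice is irrelevant
        secret = pattern + pattern   # the twice-repeated secret, 2n bits
        if knows_repetition:
            # The guesser only ranges over the 2^n repeated strings.
            guesses = ["".join(p) * 2 for p in product("01", repeat=n)]
        else:
            # The guesser ranges over all 2^(2n) strings of length 2n.
            guesses = ["".join(g) for g in product("01", repeat=2 * n)]
        hits = sum(g == secret for g in guesses)
        return Fraction(hits, len(guesses))

    # Fremen's pen-and-paper size, n = 4:
    assert guess_probability(4, knows_repetition=False) == Fraction(1, 2 ** 8)
    assert guess_probability(4, knows_repetition=True) == Fraction(1, 2 ** 4)
    ```

    The enumeration confirms both claims above: without knowledge of the repetition the probability is 1/2^(2n), with it 1/2^n.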


  • Registered Users Posts: 1,922 ✭✭✭fergalr


    I'm not a mathematician, but...

    An n-digit binary string can represent 2^n numbers,[1] in the same way that an n-digit decimal string can represent 10^n numbers.
    healyron wrote: »
    I'm trying to calculate the probability of a bit sequence being identified with absolutely no knowledge of the length of the sequence pattern, nor of any other information except the length of the *overall* sequence, which may contain 1 or more repeated patterns.

    Are you essentially saying here that every string is equally likely[2]?

    Then whether there is a pattern in the string or not doesn't matter to the chances of guessing it correctly. With one guess, you are choosing one string from the total set of strings. Because an n-digit binary string can represent 2^n numbers, there are 2^n possible choices, so the probability of getting it right is 1/2^n.


    [1] I guess you might try to prove that an n-digit binary string can represent 2^n different numbers. This seems very fundamental to me, to the point where I'm not sure how I'd go about proving it from something more fundamental. Is this what you are after?

    [2] I'm not sure if saying you've no information about which strings are more likely than others is the same as saying they are all equally likely, but I don't think that's what you were getting at.
    If there were something else to change the likelihood of certain strings (e.g. those with patterns) appearing, then the problem would be different.


    Anyone got more insight on this? Be curious to hear if I got that all right...
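
    The counting claim above (an n-digit binary string has 2^n possible values, so one uniform guess succeeds with probability 1/2^n) can be checked by brute enumeration. A quick sketch, not from the thread:

    ```python
    from fractions import Fraction
    from itertools import product

    # Enumerate all n-bit strings and check there are exactly 2^n of them,
    # so a single uniform guess is correct with probability 1/2^n.
    for n in range(1, 10):
        strings = {"".join(bits) for bits in product("01", repeat=n)}
        assert len(strings) == 2 ** n
        assert Fraction(1, len(strings)) == Fraction(1, 2 ** n)
    ```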


  • Registered Users Posts: 60 ✭✭healyron


    Hi folks

    Thanks for the replies. I'm going to work on the premise that if it can be proven for a small number, then expanding it to larger numbers is valid. However, as far as the calculation of possible strings is concerned, I should point out (I thought I did, but maybe it's either not obvious or not relevant!) that since there is no a priori knowledge that the string is repeated in a pattern, or that there may or may not be more than one string, the calculation becomes a little more complex, as I understand it.

    For example, if a string is 10 bits long, each bit could be a 1 or 0, therefore the possible number of strings is 1/10^2. Having said that, if the bit-sequence could be 1 bit (from the 10 bits in the message), or two, or 5 or 9, or even 0, does that make the possible permutations and combinations larger, because it does not have to, but might, include any number of the bits (not necessarily all 10)?

    fergalr wrote: »
    Are you essentially saying here that every string is equally likely?

    Yes, and that the pattern could be anything from 0 bits, to any upper bound number of bits, but that the decode process knows nothing about it, except it would know how many bits could theoretically be the maximum (upper bound).

    In case you're curious, this is in relation to the brute-force decryption complexity of an encrypted message, given no a priori information about it except the size of the encrypted message.

    Thanks

    Ron, IrishUnsigned.com


  • Registered Users Posts: 2,481 ✭✭✭Fremen


    healyron wrote: »
    therefore the possible number of strings is 1/10^2. Having said that, if the bit-sequence could be 1 bit (from the 10 bits in the message), or two, or 5 or 9, or even 0, does that make the possible permutations and combinations larger, because it does not have to, but might, include any number of the bits (not necessarily all 10)?

    I guess you mean 1/2^10, or have I misunderstood something?

    The number of strings of length N or less is 2^(N+1) - 1: each string of length k has 2^k possibilities, so you sum the geometric series from k = 0 to N.
    The probability of picking the right one depends a little on the rules of the game you're playing, so I won't guess at it.
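
    The geometric-series identity Fremen uses here can be verified numerically. A small sketch (mine, not from the thread):

    ```python
    # Verify that sum_{k=0}^{N} 2^k == 2^(N+1) - 1 for a range of N.
    for N in range(0, 20):
        assert sum(2 ** k for k in range(N + 1)) == 2 ** (N + 1) - 1
    ```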


  • Registered Users Posts: 60 ✭✭healyron


    Fremen wrote: »
    I guess you mean 1/2^10, or have I misunderstood something?

    Nope, you're right, it should have been 1/2^10. That was my left-hand, right-brain doing stuff backwards :confused:

    Fremen wrote: »
    The number of strings of length N or less is 2^(N+1) - 1: each string of length k has 2^k possibilities, so you sum the geometric series from k = 0 to N.

    So, a 10-bit sequence could be said to have as many as 2^(10+1)-1 possible sub-sequences, and each of these sub-sequences could have 1/(2^x) possible combinations, where x is the length of the sub-sequence? When you get to "so you sum the geometric series from k=0 to N", I get lost. I know, in English, how to explain that, but mathematically, it's beyond me.


    Fremen wrote: »
    The probability of picking the right one depends a little on the rules of the game you're playing, so I won't guess at it.

    Yeah, that's a fair point, but as I said, I'm trying to mathematically write the 'worst case / best case' brute-force formula, and that's not really case-dependent. Basically, I have the question:

    "If I know a sequence is N bits long, how many guesses would I have to make, of any length up to N, before I guessed the correct sequence?"

    and I want to write that calculation in a mathematical formula
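
    For what it's worth, under the assumption that every candidate string of length 0 to N must be tried exactly once, the worst case is simply the number of candidates. A sketch using Fremen's count (the `include_empty` flag is my own addition, since the thread hasn't settled whether the empty sequence counts as a guess):

    ```python
    def worst_case_guesses(N, include_empty=True):
        """Worst-case number of guesses to hit a binary sequence of unknown
        length up to N, assuming each candidate is tried exactly once."""
        total = 2 ** (N + 1) - 1      # sum of 2^k for k = 0..N (Fremen's formula)
        return total if include_empty else total - 1

    assert worst_case_guesses(10) == 2047
    assert worst_case_guesses(10, include_empty=False) == 2046
    ```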

    Having said that, what you added has certainly helped to clarify my own thoughts on this area of Maths :D

    Regards

    Ron, IrishUnsigned.com


  • Registered Users Posts: 2,481 ✭✭✭Fremen


    healyron wrote: »
    So, a 10-bit sequence could be said to have as many as 2^(10+1)-1 possible sub-sequences, and each of these sub-sequences could have 1/(2^x) possible combinations, where x is the length of the sub-sequence?

    Close, but not quite. What I'm saying is

    0 bits = 1 possibility (empty space) = 2^0

    1 bit = 2 possibilities (1 or 0) = 2^1

    2 bits = 4 possibilities (00,01,10,11) = 2^2

    3 bits = 8 possibilities (000,001,010,011,100,101,110,111) = 2^3

    4 bits = 16 possibilities = 2^4

    etc...

    so the number of strings with length 4 or less is
    (2^0) + (2^1) + (2^2) + (2^3) + (2^4)
    and there's a formula which says that this sum is equal to
    2^(4+1) - 1

    So for ten bits, there are (2^11) - 1 = 2047 possibilities. This includes all possible combinations.
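
    Fremen's enumeration can be reproduced directly by listing every string of each length and counting. A short sketch (not from the thread):

    ```python
    from itertools import product

    def strings_up_to(N):
        """Every binary string of length 0..N inclusive (the empty string counts once)."""
        return ["".join(s) for k in range(N + 1) for s in product("01", repeat=k)]

    assert len(strings_up_to(4)) == 2 ** 5 - 1     # 31 strings of length <= 4
    assert len(strings_up_to(10)) == 2 ** 11 - 1   # 2047, matching the post above
    ```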


  • Registered Users Posts: 60 ✭✭healyron


    Fremen wrote: »
    ...for ten bits, there are (2^11) - 1 = 2047 possibilities. This includes all possible combinations.


    Thanks. That's going to be useful

    Ron, IrishUnsigned.com


  • Registered Users Posts: 1,922 ✭✭✭fergalr


    healyron wrote: »
    In case you're curious, this is in relation to the brute-force decryption complexity of an encrypted message, given no a priori information about it except the size of the encrypted message.

    Sounds interesting.

    You need to have some a priori information though, to even talk about the brute force complexity, right? Like, you need to know something about the cryptoscheme in question or the likely set of cryptoschemes.

    If Alice and Bob decide on an encryption scheme where they use the ciphertext '0001' to represent the text of the Lisbon Treaty, and all you come across is the ciphertext '0001', and you don't know anything else, surely there isn't very much you can say about brute-forcing that?

    Don't you need some assumptions about the cryptoscheme, or encodings, or key lengths etc?
    Generally when I hear people talk about brute forcing encryption schemes, they are talking about key sizes and stuff like that, more so than ciphertext sizes?


  • Registered Users Posts: 689 ✭✭✭JoeB-


    fergalr wrote: »
    If Alice and Bob decide on an encryption scheme where they use the ciphertext '0001' to represent the text of the Lisbon Treaty, and all you come across is the ciphertext '0001', and you don't know anything else, surely there isn't very much you can say about brute-forcing that?

    I think there's a distinction drawn between a code and a cipher...

    A code is like your example, where there need not be any correspondence between what is coded and the resulting code, i.e. if I answer the phone with a 'hello, it's Joe' as opposed to 'Hi', both answers mean different things, but this code is impossible to break without more info...


    A cipher on the other hand is where there is a correspondence between the coded message and the original, even if a key is used to scramble the message..

    The Enigma story from WWII is a great example... although it's described as 'codebreaking', it was actually a cipher... and it's incredible that it was broken at all. Apparently it was only broken when details of the Enigma machine became available, which supports your statement that some details of the cipher scheme would be very useful... although probably not essential in principle.


    Disclaimer: The above (difference between code and cipher) is based on what I read in a children's code book over 15 years ago... so it's probably very simplistic. They gave the example of clothes drying on a clothes line as an unbreakable code... you can only understand the message if you know what the code is... i.e. one shirt = 'come in', whereas a pair of knickers means 'stay away'.

