We’re going to talk about 3-bit strings. Please note that there are 8 of them:

 000 001 010 011 100 101 110 111

In particular situations we may want to concern ourselves with a subset of these strings. For example if our strings were substantially larger than 3 bits and taken to represent images, we may be interested in the subset of strings (images) that contain a face.

We can represent such subsets of strings by assigning a true/1 to strings that are in the subset, and false/0 to strings that are not.

For example,

 000 0 001 0 010 1 011 1 100 0 101 1 110 0 111 1

is a way to denote the arbitrarily chosen subset of strings #{010 011 101 111}.

It is easy to see that there are 28 = 256 possible subsets of 3-bit strings (since every subset is representable as an 8-bit string).

Another way to represent sets of strings is by using predicate logic.

For example, the predicate logic expression (AND 0 1) represents the set of strings #{110 111} because the 0th-indexed bit is true/1 and the 1st-indexed bit is true/1 in both strings.

If we randomly select a logic expression from say the first 5000 such expressions in order of increasing complexity, what subset of strings are we likely to end up with? Are all of the 256 possible subsets equally likely when we are uniformly sampling logic expressions?

Here is a frequency chart demonstrating all 256 concepts and the number of logic expressions that represent them in the first 5000 predicate logic expressions:

Here is the same data in frequency-ranked order (note that subsets with 0 representation are not drawn):

It seems, then, that uniformly sampling logic expressions does not yield uniformly sampled subsets of strings.

The most commonly represented subset was the empty set. (Contradictions are common, e.g. (AND 0 (NOT 0))).

The next most common subsets are representable as 0, 1, 2; in other words a single bit required to be true.

The next most common subsets are where two bits are required to be true (e.g. (AND 0 1)).

Theories on why subsets are represented nonuniformly:

• Some subsets/concepts are more complex. If we view the predicate logic expression as a form of data compression, it is well known that a given compression scheme cannot compress every instance of the data. Given a compression scheme, some data is simpler to represent than other data.

• AND/OR are biased; it is statistically unlikely that two randomly drawn bits will satisfy AND. It is statistically unlikely that two randomly drawn bits will not satisfy OR.