Puzzle: Shakespeare numbers…


Imagine that there might be a real number that contains (in an encoded form) the entire works of Shakespeare. Let’s call this number a “Shakespeare number”.

Can you give an example of an algorithm to generate Shakespeare numbers?

[update]

The problem, as originally stated, is too vague. I added more specifics:
1) No tricks… You should not hardcode in your algorithm the entire text of Shakespeare works, or assume prior knowledge.
2) The real number must be well defined.

Comments (19)

  1. BillT says:

    Just ZIP the entire works of Shakespeare. Interpret the resulting file as a number.

    (Am I missing the gist of the question, or is this just trivially easy?)

  2. AndrewSeven says:

    Encode every letter in binary and save it as a string.

    Concatenate all the strings together.

    Treat the whole thing as a binary value to determine a single numeric value.

  3. I don’t have the whole algorithm, but in C# it starts off like this:

    2 * b || !(2 * b) ?

  4. Adi Oltean says:

    Funny comments 🙂

    But I was looking for an answer that would NOT require knowledge of Shakespeare works…

    Thanks, Adi

  5. AndrewSeven says:

    42 is the answer to life the universe and everything. 6 x 9 is 42.

    Can you clarify what you mean?

  6. Hilarious, Matt! I think you’re onto something. 🙂

    Theoretically, arithmetic compression could accomplish this.

  7. Mr Blobby says:

    Take aleph-zero monkeys. Give them aleph-zero copies of vi. Sample regularly, take the aleph-zeroth output. Encode to your heart’s content. Voila.

  8. Adi Oltean says:

    OK – here is one way to build a Shakespeare number:

    1) First, you take all binary sequences of 0 and 1: 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, etc.

    2) Then, you concatenate all those strings. (you get 0100011011000001010011… and so on)

    3) Now, you create an infinite-precision floating point number, where the mantissa (the fraction bits) being the string above. You get this number:

    x = 0.0100011011000001010011 …

    Done. So, every random sequence of bits can be found somewhere in this number. All you need is an index to decode this information.

    Therefore, the entire works of Shakespeare are there too.

    Silly trick, isn’t? 🙂

  9. AndrewSeven says:

    Seems to be the same as the monkeys 😛

  10. Adi Oltean says:

    One colleague pointed me out to normal numbers.

    Given that the numeric base is fixed, these numbers are also called Normal numbers: http://en.wikipedia.org/wiki/Normal_number

    It is unknown whether constants like Pi, e, etc. are normal, but it is very likely, based on empirical evidence. For example, if we represent Pi in base 2, then it looks like we can find any finite random sequence at some position…

    But it is known that real numbers are normal almost everywhere (see the link above).

  11. Mr Blobby says:

    Unrelated, but given how mindless the original question was…

    Green and Tao proved last year that there exist arbirarily long arithmetic progressions of primes. The record for an explicit one is 22 primes long.

    As a geometer with absolutely no background in additive combinatorics, I have no idea whatsoever about their proof, but it’s still a mindblowing result!

  12. Mr Blobby says:

    A note for mathematical historians: the monkeys and typewriters are an example of the Kolmogorov 0-1 law (look it up, don’t be lazy).

    It is thought to have been first conceived by Borel, who was by all accounts a giant in the world of measure theory.

  13. Lets assume you have a dictionary of all words used in Shakespear. Let each word have an associated id, a unique number. So for example the word To might have an id of 555 and the word Be might have an id of 111. Then to encode this number, simply append numbers in a string. The cool thing is that you can use your own dictionary so that the ordering of words can be such that those that occur more frequently can be higher up with smaller number. So lets say a word, THE, occurs the most, then give it a rank of 1. So you just condensed the lengthy text into a less lengthy text. Of course there are issues here. First, encoding the text will require that you lookup the id’s. Decoding the text will require a reverse lookup. Now lets say you want to transmit the data to someone else, then they will also need to know what your dictionary is in order to decode the text.

  14. Jeremy says:

    I don’t see the point of the question.

    If you’re looking to generate a number that contains arbitrary information (Shakespeare in this case) with no prior knowledge of that string you would need to choose an encoding scheme (e.g. an ASCII bit stream) ahead of time. Then you would need to find the single(!) number which exactly represented the information in the prespecified encoding. I would assert that there is no sane way to compose the exact encoding of a piece of information without prior knowledge of said information.

    However, if you’re bringing normal numbers into the mix you’ve changed the focus. Sure, any normal number would contain the entire works of Shakespeare in any encoding imaginable. It would also contain any other arbitrary piece of information somewhere. Given this, normal numbers would be no better than (and seem to be mathematically quite closely related to) pure random sequences of digits. In reality saying that "a normal number contains X information" is pointless because you have no reliable system of retrieval. You might as well ‘cat /dev/urandom’ until you see the text you’re looking for.

    Of course, I think this is just a long-winded way of making the monkeys/typewriters comment. 🙂

  15. Adi Oltean says:

    Of course, I agree that the problem doesn’t have any practical applicability in computers. At least, in the sense of storing/encoding information… But I find it fascinating that there are a special class of numbers that contain any piece of imaginable (but finite in size) information.

    BTW – a correction: Random infinite sequences of bits do NOT classify as a complete solution to the problem. You can easily get a sequence which is not normal, for example the string 010101010101010… can be randomly generated but obviously is not "normal". Doesn’t matter how do yo umodel these monkeys (you can use a Markov chain with two states for example) you can get such a string, correct?

    But the funny thing is that a randomly generated string can be a solution in almost all cases. This comes easy as a generalization of the fact that the "normal" property is true almost everywhere on R (the set of real numbers). If you define a measure over the set of infinite sequences of bits, it follows that these sequences are "normal" almost everwhere (i.e. a, infinite sequence is normal if any given finite subsequence of bits can be found somewhere in it). Therefore, the probability of getting a normal sequence is 1, since the property is true except for a set of measure zero.

  16. Mark Mullin says:

    Ummm, it sounds like you’re referring to a Godel number – Mr Blobby was on the right track

    Basically, if I define a program for a universal turing machine whos output is the collected work of shakespeare (I don’t actually need to know the work of Shakespeare, I can just posit it produces output acceptable to an another turing machine that accepts as input the output of the former), then I can say that a single number that is the product of the primes that represents the instructions is the ‘Shakespeare number’

    Two notes – First, someone has to know the Shakespeare, otherwise you can’t set up the equality and the whole thing is not well formed. Second…… expect a very very very big number

    Heres a link

    http://mathworld.wolfram.com/GoedelNumber.html

  17. Mr Blobby says:

    A more accessible overview of Godel numbers:

    The ‘standard way’ of forming Godel numbers is to assign each letter in your ‘alphabet’ to the next available prime.

    Example:

    0 –> 2

    1 –> 3

    + –> 5

    – –> 7

    etc

    Then, to calculate the Godel number of a ‘statement’, simply consider 2^[first letter of statement].3^[second letter of statement].5^[third letter of statement]… and so on.

    eg Godel("1+1=0") = 2^3.3^5.5^3.7^7.11^2 = 24214634829000

    eg2 Godel("0=0") = 2^2.3^7.5^2 = 218700

    To calculate the Godel number of a sequence of statements, simply consider 2^[Godel number of first statement].3^[Godel number of second statement].5^[Godel number of third statement]… and so on.

    eg Godel

    {

    1+1=0

    0=0

    }

    = 2^24214634829000.3^218700

    = something very big, call it H.

    How does one recover the original proof from a

    Godel number G? Well, factorize it! (Note: This could take a LONG time!). Test each exponent (or, if you work tidily, just the exponent of 2) for primality. (Note: why not use the Agrawal-Kayal-Saxena test, proven to be a polynomial time algorithm?)

    If the exponenets are PRIME, then you know you have a statement straight away. Use the exponents to decode the statement.

    If, on the other hand, the exponents are COMPOSITE then you know they represent a statement. Factorize again to recover what the statement is in terms of your alphabet.

    eg. when we factorize H, we get 2^24214634829000.3^218700. We can see that the exponents are composite, so we are dealing with a sequence of stataements (in fact, a sequence of length 2, as we have 2 primes).

    We now decompose each exponent (24214634829000 and 218700) to recover the initial sequence of statements, in order.

    This is the way Godel initially encoded his numbers uniquely, and recoverably (albeit in a completely non-feasible time!). This is in the language of ordinary arithmetic – however, there are equivalent functions (sometimes called beta functions) in any ‘equivalent’ language).

    Godel’s original proof of the first incompleteness theorem also involved these elements, which I don’t have time to go over in detail.

    FREE VARIABLES: Basically, statements can depend on free variables, and their Godel numbers vary depending on what those free variables are.

    SELF-REFERENCE LEMMA: Every ‘property’ of a statement can be encoded in a Godel number – ie, the Godel number shouts out ‘The sequence of statements I encode has this property’.

    NON-PROVABILITY: We can encode any potential proof using Godel numbers. Call such a number a ‘proof number’. Something is provable if we can deduce it from the axioms we start with and modus ponens (implication/deduction) alone. We can easily test whether a proof number actually represents the proof of a statement, given the two numbers in Godel form. This is testable in finite time (again, not necessarily very fast!). [Think ‘theorem proving’ software.]

    Provability is a testable property of a statement. Far more interestingly, non-provability is also a testable property of a statement.

    Using the self-reference lemma on the property of non-provability, and using Godel numbers/proof numbers of various things in clever ways as free variables, Godel constructs a Godel number, representing a concrete statement, which shouts out ‘The statement I encode is not provable’. [Does this remind you of the liar’s paradox?] The only consistent interpretation of this is that the statement is true, but not provable.

    This is very far from a proof, but I hope it explains the more accessible parts of Godel theory more comprehensively than just saying ‘a mathematical statement of the liar’s paradox’.

    Many computer scientists choose to work in the language of Turing machines, but I find this very complicated. However, there are deep connections with undecidability problems more generally, including the Turing halting problem.

    Exercise for the dedicated reader: Using a very simple but non-trivial system or arithmetic (addition modulo 2 would probably suffice), the latest advances in accessible beta numbers, and a stonking supercomputer, calculate a true but non-provable statement explicitly.

  18. Lauri says:

    Yes, it’s possible. Use arithmetic coding and a suitable language model. Note that the language model can start off as a uniform distribution over the alphabet, so no hardcoded algorithm goes into the real number.

    As an example of this method, see Dasher (http://www.inference.phy.cam.ac.uk/dasher/).