More Cryptic Crosswords

Recently, I’ve done a bunch of work to clean up the cryptic crossword clue solver which I first described in an earlier post. I have simplified the code layout and added more flexibility to the solver’s interface, so it should be easier to experiment and play around with. What follows are a few examples of what the new solver can do. You can also view these examples as a rendered IPython notebook on nbviewer.org.

In [1]:
from pycryptics.grammar.clue_parse import generate_clues
from pycryptics.solve_clue import Constraints
from pycryptics.grammar.clue_tree import ClueUnsolvableError
Loading synonyms from file...
...done.
Loading ngrams from file...
...done.
Loading indicators from file...
...done.

Solving a single clue

Let's demonstrate what we can do with a simple cryptic clue. Here's our clue, broken up into a list of words:

In [2]:
phrases = 'spin broken shingle'.split()

We also have constraints on the final answer, namely that it will have 7 letters and the answer won't be any of the words which directly appear in the clue.

In [3]:
con = Constraints(phrases=phrases,
                  lengths=(7,),
                  pattern='',
                  known_answer=None)

Now we can generate all possible clue interpretations of those phrases:

In [4]:
clues = generate_clues(con)
clues.sort() # sort the clues list so that we can reliably find 
# specific clues for this demo. You won't need to do this.

For example, here's one interpretation:

In [5]:
print clues[0]
(top (d spin) (clue_arg (first broken)) (clue_arg (first shingle)))

And another:

In [6]:
print clues[10]
(top (d spin) (clue_arg (ana (ana_arg (lit broken)) (ana_ shingle))))

Here are all the possible ways 'spin broken shingle' could be interpreted:

In [7]:
for c in clues:
    print c
(top (d spin) (clue_arg (first broken)) (clue_arg (first shingle)))
(top (d spin) (clue_arg (first broken)) (clue_arg (lit shingle)))
(top (d spin) (clue_arg (first broken)) (clue_arg (syn shingle)))
(top (d spin) (clue_arg (lit broken)) (clue_arg (first shingle)))
(top (d spin) (clue_arg (lit broken)) (clue_arg (lit shingle)))
(top (d spin) (clue_arg (lit broken)) (clue_arg (syn shingle)))
(top (d spin) (clue_arg (syn broken)) (clue_arg (first shingle)))
(top (d spin) (clue_arg (syn broken)) (clue_arg (lit shingle)))
(top (d spin) (clue_arg (syn broken)) (clue_arg (syn shingle)))
(top (d spin) (clue_arg (ana (ana_ broken) (ana_arg (lit shingle)))))
(top (d spin) (clue_arg (ana (ana_arg (lit broken)) (ana_ shingle))))
(top (clue_arg (sub (sub_ spin) (sub_arg (lit broken)))) (d shingle))
(top (clue_arg (sub (sub_ spin) (sub_arg (syn broken)))) (d shingle))
(top (clue_arg (first spin)) (clue_arg (first broken)) (d shingle))
(top (clue_arg (first spin)) (clue_arg (lit broken)) (d shingle))
(top (clue_arg (first spin)) (clue_arg (syn broken)) (d shingle))
(top (clue_arg (ana (ana_ spin) (ana_arg (lit broken)))) (d shingle))
(top (clue_arg (ana (ana_arg (lit spin)) (ana_ broken))) (d shingle))
(top (clue_arg (lit spin)) (clue_arg (first broken)) (d shingle))
(top (clue_arg (lit spin)) (clue_arg (lit broken)) (d shingle))
(top (clue_arg (lit spin)) (clue_arg (syn broken)) (d shingle))
(top (clue_arg (syn spin)) (clue_arg (first broken)) (d shingle))
(top (clue_arg (syn spin)) (clue_arg (lit broken)) (d shingle))
(top (clue_arg (syn spin)) (clue_arg (syn broken)) (d shingle))
(top (clue_arg (sub_init (sub_init_ spin) (sub_arg (lit broken)))) (d shingle))
(top (clue_arg (sub_init (sub_init_ spin) (sub_arg (syn broken)))) (d shingle))
(top (clue_arg (sub_final (sub_final_ spin) (sub_arg (lit broken)))) (d shingle))
(top (clue_arg (sub_final (sub_final_ spin) (sub_arg (syn broken)))) (d shingle))
(top (clue_arg (rev (rev_ spin) (rev_arg (lit broken)))) (d shingle))
(top (clue_arg (rev (rev_ spin) (rev_arg (syn broken)))) (d shingle))

Let's try solving one of these clues:

In [8]:
try:
    print clues[0].answers
except ClueUnsolvableError as e:
    print e
This clue has no solutions under the given constraints: Constraints(phrases=['spin', 'broken', 'shingle'], lengths=(7,), pattern='', known_answer=None)

That clue had no answers which matched our constraints. This raises an exception so that when we're solving a clue, any of its subparts being unsolvable will trigger the exception and let us skip out of solving the other subparts.

Let's look at the subparts of a clue. Each clue is a tree, and we can access the children of any node in the tree using the [] notation.

In [9]:
print clues[-1]
(top (clue_arg (rev (rev_ spin) (rev_arg (syn broken)))) (d shingle))

In [10]:
print clues[-1][0]
(clue_arg (rev (rev_ spin) (rev_arg (syn broken))))

In [11]:
print clues[-1][0][0]
(rev (rev_ spin) (rev_arg (syn broken)))

In [12]:
print clues[-1][0][0].answers
{'emat': ['', 'tame'], 'detsub': ['', 'busted'], 'deyortsed': ['', 'destroyed'], 'tespu': ['', 'upset'], 'degluvid': ['', 'divulged'], 'tpeknu': ['', 'unkept'], 'detcarfni': ['', 'infracted'], 'luftif': ['', 'fitful'], 'depoleved': ['', 'developed'], 'detomed': ['', 'demoted'], 'desuap': ['', 'paused'], 'degamad': ['', 'damaged'], 'nevig': ['', 'given'], 'dettod': ['', 'dotted'], 'deliaf': ['', 'failed'], 'wol': ['', 'low'], 'dehguor': ['', 'roughed'], 'deid': ['', 'died'], 'depmub': ['', 'bumped'], 'demat': ['', 'tamed'], 'detaloiv': ['', 'violated'], 'delaever': ['', 'revealed'], 'dehsad': ['', 'dashed'], 'depparwnu': ['', 'unwrapped'], 'dedneffo': ['', 'offended'], 'detageler': ['', 'relegated'], 'depmad': ['', 'damped'], 'tilps': ['', 'split'], 'deggur': ['', 'rugged'], 'desufnoc': ['', 'confused'], 'dehcaerb': ['', 'breached'], 'tcefrepmi': ['', 'imperfect'], 'dehsurc': ['', 'crushed'], 'denepmad': ['', 'dampened'], 'detarapes': ['', 'separated'], 'detrap': ['', 'parted'], 'dehsams': ['', 'smashed'], 'desolcsid': ['', 'disclosed'], 'despalloc': ['', 'collapsed'], 'enog': ['', 'gone'], 'detlah': ['', 'halted'], 'htoomsnu': ['', 'unsmooth'], 'derutcarf': ['', 'fractured'], 'deniur': ['', 'ruined'], 'denekaew': ['', 'weakened'], 'derednuof': ['', 'foundered'], 'deppots': ['', 'stopped'], 'denetfos': ['', 'softened'], 'deriapmi': ['', 'impaired'], 'delbmuh': ['', 'humbled'], 'nrow': ['', 'worn'], 'detpure': ['', 'erupted'], 'tsrub': ['', 'burst']}

The .answers property returns the clue's possible answers or raises a ClueUnsolvableException if there are no answers. The first time a clue's .answers is requested, it automatically calls ClueTree.solve() and caches that answer so that subsequent requests for .answers will be instananeous.

The format of a clue's answers is a dictionary in which each key is a possible answer and the corresponding value is the answers to the children of that clue which yielded the given answer. For example, the first answer given above is 'emat', with corresponding child answers of '' and 'tame'. The ordering of these sub-answers corresponds to the ordering in the (rev (rev_ spin) (rev_arg (syn broken))), so the rev_ reversal marker produced an empty string ('') and the syn synonym operator acting on 'broken' produced 'tame'. Reversing 'tame' gave the answer 'emit'.

We can also ask any clue how it got a particular answer:

In [13]:
print clues[-1][0][0].derivation('emat')
(rev (rev_ "spin") (syn "broken" -> TAME) -> EMAT)

If it's not clear what this means, we can also ask for a human-readible derivation:

In [14]:
print clues[-1][0][0].long_derivation('emat')

Take a synonym of 'broken' to get TAME.
'spin' means to reverse 'tame' to get EMAT.

We can ask for the derivation or long derivation of any of this sub-clue's answers:

In [15]:
print clues[-1][0][0].long_derivation('tsrub')

Take a synonym of 'broken' to get BURST.
'spin' means to reverse 'burst' to get TSRUB.

Now, let's find a clue that actually produces a final answer. How about this one:

In [16]:
print clues[9]
print "\nAnswers:", clues[9].answers
print clues[9].long_derivation('english')
(top (d spin) (clue_arg (ana (ana_ broken) (ana_arg (lit shingle)))))

Answers: {'english': ['', 'english']}

'spin' is the definition.
'broken' means to anagram 'shingle' to get ENGLISH.

In general, it would seem like generating all of these different clue interpretations and then solving them separately would be terribly inefficient. However, the way our parser is constructed ensures that identical sub-clues actually refer to the same python object, so solving one sub-clue solves all of its instances contained in other clues. We can check this easily:

In [17]:
print clues[0]
(top (d spin) (clue_arg (first broken)) (clue_arg (first shingle)))

In [18]:
print clues[1]
(top (d spin) (clue_arg (first broken)) (clue_arg (lit shingle)))

In [19]:
print clues[0][1]
(clue_arg (first broken))

In [20]:
print clues[1][1]
(clue_arg (first broken))

In [21]:
clues[0][1] is clues[1][1] # 'is' tests whether two variables refer to the same python object
Out[21]:
True

Scoring Answers

Now let's try actually searching for the best answer. For this, we'll need the full CrypticClueSolver

In [22]:
from pycryptics.solve_clue import CrypticClueSolver
In [23]:
solver = CrypticClueSolver()
In [24]:
annotated_answers = solver.solve_constraints(con)
In [25]:
for a in annotated_answers:
    print a
['english', 1, '(top (d "spin") (ana (ana_ "broken") (lit "shingle") -> ENGLISH))']
['violate', 0, '(top (sub_init (sub_init_ "spin") (syn "broken" -> VIOLATED) -> VIOLATE) (d "shingle"))']
['reached', 0, '(top (sub_final (sub_final_ "spin") (syn "broken" -> BREACHED) -> REACHED) (d "shingle"))']
['divulge', 0, '(top (sub_init (sub_init_ "spin") (syn "broken" -> DIVULGED) -> DIVULGE) (d "shingle"))']
['confuse', 0, '(top (sub_init (sub_init_ "spin") (syn "broken" -> CONFUSED) -> CONFUSE) (d "shingle"))']

The annotated_answers contains a list of possible solutions, including the actual answer, its confidence score (between 0 and 1), and its short derivation, and it's automatically sorted to put the best answers first. We can also check the long derivation of any answer:

In [26]:
print annotated_answers[0].long_derivation()

'spin' is the definition.
'broken' means to anagram 'shingle' to get ENGLISH.
ENGLISH matches 'spin' with confidence score 100%.

Combining Phrases

Now let's try a slightly harder clue:

In [27]:
phrases = 'couch unfinished until now'.split()
con = Constraints(phrases=phrases,
                  lengths=(4,),
                  pattern='',
                  known_answer=None)

First, let's just try solving this the same way as before:

In [28]:
ann = solver.solve_constraints(con)
In [29]:
print ann[0].long_derivation()

'couch' is the definition.
Take a synonym of 'until' to get BEFORE.
'unfinished' means to take an initial substring of 'before' to get BE.
Take a synonym of 'now' to get AD.
Combine 'BE' and 'AD'  to get BEAD.
BEAD matches 'couch' with confidence score 56%.

This is not a very good answer, since it only matches with 56% confidence. In fact, the correct answer is SOFA, which we get by taking a synonym of "until now" to get "SO FAR" and then removing the last letter to get SOFA. So what went wrong? Well, the solver treated "until" and "now" as separate phrases, so it looked for synonyms of both of them, but not for the phrase "until now". To make sure we get the right combinations of phrases, the solver can actually search over all possible combinations of phrases to find the best answer:

In [30]:
ann = solver.solve_all_phrasings(con)
print ann[0].long_derivation()
['couch', 'unfinished', 'until', 'now']
['couch', 'unfinished', 'until_now']
['couch', 'unfinished_until', 'now']
['couch', 'unfinished_until_now']
['couch_unfinished', 'until', 'now']
['couch_unfinished', 'until_now']
['couch_unfinished_until', 'now']

'couch' is the definition.
Take a synonym of 'until_now' to get SO_FAR.
'unfinished' means to take an initial substring of 'so_far' to get SOFA.
SOFA matches 'couch' with confidence score 100%.

The lists which are printed above show all the combinations of phrases that the solver tried. We can see that it got the correct answer by combining "until now" into a single phrase, which we've shown by joining those words with an underscore.

Robin Deits 26 December 2013