A programming language for natural language generation
Just a quick update on something I have been working on lately. What I really wanted to do is an exploratory procedurally generated text adventure. However, I am not there yet.
After playing around with natural language generation (NLG) I came to realise that it’s much harder than I first expected. I tried several approaches, such as context-free grammars (CFGs) and statistical generation from rules I had extracted from Wiktionary. These did generate proper sentences, but the sentences often landed in the wrong context or were really hard to understand. There are also so many subtle things in language that make it hard.
Textese
So I took a step back and started to think of ways this could be achieved. One source of inspiration was SciGen, which uses a CFG to generate research papers. It works so well that some papers got accepted for conferences. I thought that I could probably do something similar: have a set of grammar rules that expand into sentences. However, I needed more than just static rules that expand. I first tried to write the rules directly in Java, but that was not efficient enough: the rules were really hard to read (as they are a mix of natural language and CFG rules). So instead I created a new domain-specific language, Textese.
To show an example of this language, consider Tranströmer’s poem ‘Storm’ (translated from Swedish by Robin Robertson):
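The basic idea of grammar rules that expand into sentences can be sketched in a few lines of Python. This is only a toy illustration, not Textese and not SciGen's actual grammar: the `GRAMMAR` table, the symbol names, and the `expand` and `generate` helpers are all made up for the example.

```python
import random

# Toy grammar: each non-terminal maps to a list of alternative expansions.
# Symbols and words are invented for illustration only.
GRAMMAR = {
    "SENTENCE": [["NP", "VP", "."]],
    "NP": [["the", "NOUN"], ["the", "ADJ", "NOUN"]],
    "VP": [["VERB", "NP"]],
    "NOUN": [["walker"], ["oak"], ["sea"]],
    "ADJ": [["ancient"], ["huge"]],
    "VERB": [["guards"], ["meets"]],
}


def expand(symbol, rng):
    """Recursively expand a symbol; anything not in GRAMMAR is a terminal word."""
    if symbol not in GRAMMAR:
        return [symbol]
    words = []
    for part in rng.choice(GRAMMAR[symbol]):
        words.extend(expand(part, rng))
    return words


def generate(seed=None):
    """Generate one sentence; the same seed always gives the same sentence."""
    rng = random.Random(seed)
    words = expand("SENTENCE", rng)
    # Attach the final period to the last word instead of spacing it out.
    return " ".join(words[:-1]) + words[-1]


print(generate())
```

Static rules like these always produce grammatical output, but nothing ties the chosen words to each other, which is the limitation described above.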
Suddenly the walker comes upon the
ancient oak: a huge
rooted elk whose hardwood antlers, wide
as this horizon, guard the stone-green
walls of the sea.
Now I “ported” this poem to Textese:
{
    // Storm by Tranströmer
    d($sub, $obj) {
        $metafore = entity(drawMetafore($obj))
        $metaforePart = drawPart($metafore) > drawPart($obj) > leaf(animal%1:03:00::)
        suddenly|abruptly np($sub) verbFrom(meeting%1:11:00::) npa($obj): np(attrFrom(motion%1:26:00::), $metafore) whose drawAdj($metaforePart) $metaforePart, adjFrom(size%1:07:00::) as np(a, nounFrom(animal%1:03:00::)), verbFrom(guard%1:18:03::) the nounFrom(blueness%1:07:00::) nounFrom(partition%1:06:00::) of np(nounFrom(body_of_water%1:17:00::)).
    }
    $sub = leaf(person%1:03:00::)
    $obj = leaf(tree%1:20:00::)
    d($sub, $obj)
}
Running this program generates new poems but with different meanings:
Abruptly an emperor encounters the
assorted common osier: a flying
leaf whose orderly venation, aired
as this desert, spies the cyan
rood screen of the gulf.
The language itself is backed by the Princeton database WordNet, which represents words in a hierarchical manner. This allows me to make meaningful connections between words.
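To illustrate what a hierarchy buys you, here is a minimal Python sketch. It is not the Textese implementation and does not use real WordNet data: the `HYPONYMS` table is a tiny hand-written stand-in, and `leaves` and `random_leaf` are hypothetical helpers that pick a specific word from anywhere under a general concept.

```python
import random

# Hand-written toy hierarchy standing in for WordNet.
# Each concept maps to its more specific children; entries are illustrative.
HYPONYMS = {
    "person": ["ruler", "traveler"],
    "ruler": ["emperor", "king"],
    "traveler": ["walker", "pilgrim"],
    "tree": ["oak", "willow"],
    "willow": ["common osier"],
}


def leaves(node):
    """Return all leaf words below node (a childless node is its own leaf)."""
    children = HYPONYMS.get(node, [])
    if not children:
        return [node]
    found = []
    for child in children:
        found.extend(leaves(child))
    return found


def random_leaf(node, rng=random):
    """Pick one specific word from anywhere under a general concept."""
    return rng.choice(leaves(node))


print(random_leaf("person"), "/", random_leaf("tree"))
```

Because every pick stays inside the subtree of the requested concept, a slot that asks for a person always gets some kind of person, however unusual.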
Let’s see where this goes.
Interesting!
Maybe you can improve on how much sense the generated poem makes.
E.g. you could score the poem on how often its word combinations occur together, using Google searches.
Instead of pure random, you get familiar tuples.
E.g. ‘Tree’ would more likely be paired with ‘tall’ than ‘disobedient’.
And if you have a scoring criterion, you could add genetic programming as well.
Interesting idea there Bram, basically crowd-sourcing via Google for a “sanity” check on word pairings… does create a dependency tho.
Good Idea!
I did consider using the Google n-grams dataset to do something similar, but decided to leave it for now (the files are so huge). I can do something like that with WordNet too, but the data may be too sparse (e.g. looking at the gloss of all nodes under ‘tree’ to find common nodes under ‘size’). It may also be slow, but it is perhaps worth a try to see what quality I can expect.
I meant online queries, not a database.
First generate a poem, then look up the popularity of its n-grams in Google, then score the poem.
With scores you can either find the top scorer and display that,
or you could combine and mutate the top scorers.
To score an n-gram, you do a query like:
tree AROUND(1) tall
And then the score can be the logarithm of the number of results reported.
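As a rough sketch of this scoring idea in Python: the hit counts in `HITS` are made up and stand in for whatever a search query would report, and `pair_score` and `poem_score` are hypothetical names. The sketch uses `log1p` (the logarithm of one plus the count) rather than a plain logarithm, so that a pair with zero results scores 0 instead of raising an error.

```python
import math

# Made-up hit counts standing in for reported search results.
HITS = {
    ("tree", "tall"): 1_200_000,
    ("tree", "disobedient"): 35,
    ("sea", "green"): 480_000,
}


def pair_score(pair, hits=HITS):
    """Log-scaled result count; unknown pairs count as zero hits."""
    return math.log1p(hits.get(pair, 0))


def poem_score(pairs, hits=HITS):
    """Sum of pair scores over all word pairs extracted from a poem."""
    return sum(pair_score(p, hits) for p in pairs)


# Two candidate poems, each reduced to the word pairs it contains.
candidates = {
    "A": [("tree", "tall"), ("sea", "green")],
    "B": [("tree", "disobedient"), ("sea", "green")],
}
best = max(candidates, key=lambda name: poem_score(candidates[name]))
print(best)
```

The same scores could also serve as the fitness function if top scorers are combined and mutated instead of just displayed.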
I see what you mean! This would probably work very well for smaller independent projects and such, but it may not be suitable to integrate into a programming language (dependency-wise). If so, I would probably have to make an HTTP query function so that it’s up to the coder to implement the Google, Yahoo or whatever call. However, I would love to try out using the n-grams just to see how it improves the quality.
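One way to keep the dependency out of the language could look like the Python sketch below, where the scorer only knows about a caller-supplied function. Everything here is an assumption for illustration: `HitCounter`, `make_scorer` and `fake_provider` are invented names, and the canned numbers replace what would be a real HTTP call.

```python
import math
from typing import Callable

# A provider is any function mapping a query string to a result count.
HitCounter = Callable[[str], int]


def make_scorer(count_hits: HitCounter) -> Callable[[str, str], float]:
    """Build a word-pair scorer around a caller-supplied provider."""
    def score(a: str, b: str) -> float:
        # Clamp negative counts to zero, then log-scale.
        return math.log1p(max(0, count_hits(f"{a} {b}")))
    return score


def fake_provider(query: str) -> int:
    """Stand-in provider with canned numbers instead of an HTTP request."""
    return {"tree tall": 1000}.get(query, 0)


score = make_scorer(fake_provider)
print(score("tree", "tall") > score("tree", "disobedient"))
```

Swapping search services then only means passing a different function to `make_scorer`; the scoring code itself stays unchanged.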