Sunday, May 24, 2015

Generating correct English sentences from a list of allowed words in Python

I need to solve a sentence generation problem in Python. Given keywords that describe an image, I need to generate a long, multi-sentence description. In more detail:

Given (not everything must be used):

  • list of ~100 keywords that strongly describe an image
  • list of ~120 keywords that are somewhat related to that image, but don't describe anything crucial about it
  • In addition to this rough prioritization, I have a more detailed priority ranking within each of those groups of words
  • list of ~150 keywords that clearly don't fit the image
  • I have about a million low-quality example sentences, of which about 10% contain some kind of error (grammar, spelling, word order, truncation, etc.)
  • list of default words that are always allowed in sentences (the, and, or, in, a, at, by, is...)

Constraints:

  • I need to maximize word variety. Each word may be used only once (except default words such as the, and, or, in, a, at, is...). This means that once a word has been used, it is no longer available for generating the next sentence. (I guess this could be solved by updating the list of allowed words each time a new sentence is generated.)
  • My keywords are prioritized, which means higher-priority keywords should have a higher probability of occurring in the initial sentences.
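
To make these two constraints concrete, here is a minimal sketch of the bookkeeping I have in mind. The class name WordPool, its method names, and the numeric weights are placeholders of my own, not part of any library, and the weighting scheme is only an assumption about how the priorities could be turned into probabilities.

```python
import random

# Default words are always allowed and never used up.
DEFAULT_WORDS = {"the", "and", "or", "in", "a", "at", "by", "is"}


class WordPool:
    """Tracks which keywords are still available (hypothetical helper)."""

    def __init__(self, priorities, default_words=DEFAULT_WORDS, seed=None):
        # priorities: dict mapping keyword -> positive weight (higher = more likely early)
        self.remaining = dict(priorities)
        self.default_words = set(default_words)
        self.rng = random.Random(seed)

    def is_allowed(self, word):
        return word in self.default_words or word in self.remaining

    def pick(self, k):
        """Draw up to k distinct keywords, biased toward high-priority ones."""
        chosen = []
        candidates = dict(self.remaining)
        while candidates and len(chosen) < k:
            words = list(candidates)
            weights = [candidates[w] for w in words]
            word = self.rng.choices(words, weights=weights, k=1)[0]
            chosen.append(word)
            del candidates[word]  # weighted sampling without replacement
        return chosen

    def mark_used(self, sentence_words):
        """Remove the keywords of an accepted sentence from the pool."""
        for w in sentence_words:
            if w not in self.default_words:
                self.remaining.pop(w, None)
```

The idea would be to call pick() to seed each new sentence and mark_used() once a sentence is accepted, so later sentences can only draw on the words that are left.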

Freedoms:

  • If really necessary, words can be turned into different forms: love -> loving -> loves, decoration -> decorative, blob -> blobs, is -> be

  • I am aware that keywords like ["mouse", "elephant", "fear"] can result in either "Mouse fears elephant" or "Elephant fears mouse". I will throw the wrong sentences away by hand. It would be nice to detect automatically which one is more probable, but it's NOT necessary (the count of Google search results might help, I think).

  • Not all words must be used. It's OK if some words are left over because there is little chance of making a correct sentence with them.
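
As an illustration of the optional word-order check, one alternative to search result counts would be to score each candidate by how often its adjacent word pairs occur in my example sentences. This is only a rough sketch of that idea: the function names are made up, it splits on whitespace, and it ignores word forms entirely.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count adjacent word pairs across the example sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def order_score(candidate, counts):
    """Sum of corpus counts for each adjacent word pair in the candidate."""
    tokens = candidate.lower().split()
    return sum(counts[pair] for pair in zip(tokens, tokens[1:]))


# Toy corpus standing in for the real example sentences.
counts = bigram_counts(["mouse fears elephant", "the mouse fears cats"])
best = max(["mouse fears elephant", "elephant fears mouse"],
           key=lambda s: order_score(s, counts))
```

With noisy example sentences the counts will be noisy too, so this could only rank candidates, not prove that one of them is correct.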

I took a peek at TextBlob, but I'm not sure it's the right tool for my goal. I don't want to waste time learning something that turns out to be useless. I also found some information on Markov chain sentence generators, but I'm not sure whether they are powerful enough or whether I could use them in combination with something else.
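
For reference, the kind of Markov chain generator I have been reading about could be sketched roughly as follows, restricted to my allowed words. This is a first-order (bigram) chain with plain whitespace tokenization. The function names and the START/END markers are my own placeholders, and I have not checked how well this holds up on noisy example sentences.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence boundary markers


def build_bigrams(sentences):
    """Map each token to the list of tokens that followed it in the corpus."""
    follow = defaultdict(list)
    for sentence in sentences:
        tokens = [START] + sentence.lower().split() + [END]
        for prev, nxt in zip(tokens, tokens[1:]):
            follow[prev].append(nxt)
    return follow


def generate(follow, allowed, default_words, rng, max_len=12):
    """Random walk over the bigrams, using each non-default word at most once.

    Returns a list of words, or None if the walk gets stuck or runs too long.
    """
    used, words, current = set(), [], START
    while len(words) < max_len:
        options = [w for w in follow.get(current, [])
                   if w == END
                   or w in default_words
                   or (w in allowed and w not in used)]
        if not options:
            return None  # dead end: no permitted successor
        current = rng.choice(options)
        if current == END:
            return words or None
        words.append(current)
        if current not in default_words:
            used.add(current)
    return None  # never reached an end marker within max_len words
```

A walk that returns None would simply be retried, and the keywords of every accepted sentence would be removed from the allowed set before generating the next one.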

Does anyone have experience with this, or know how to generate such random sentences?

I hope I have described my problem sufficiently. If there is something I forgot to mention, I will add it as the discussion progresses.



