    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)

    5 25 26 austen-emma.txt
    5 26 17 austen-persuasion.txt
    5 28 22 austen-sense.txt
    4 34 79 bible-kjv.txt
    5 19 5 blake-poems.txt
    4 19 14 bryant-stories.txt
    4 18 12 burgess-busterbrown.txt
    4 20 13 carroll-alice.txt
    5 20 12 chesterton-ball.txt
    5 23 11 chesterton-brown.txt
    5 18 11 chesterton-thursday.txt
    4 21 25 edgeworth-parents.txt
    5 26 15 melville-moby_dick.txt
    5 52 11 milton-paradise.txt
    4 12 9 shakespeare-caesar.txt
    4 12 8 shakespeare-hamlet.txt
    4 12 7 shakespeare-macbeth.txt
    5 36 12 whitman-leaves.txt

This program displays three statistics for each text: average word length, average sentence length, and the number of times each vocabulary item appears in the text on average (our lexical diversity score). Observe that average word length appears to be a general property of English, since it takes nearly the same value (4 or 5) for every text listed. (In fact, the average word length is really 3 not 4, since the num_chars variable counts space characters.) By contrast, average sentence length and lexical diversity appear to be characteristics of particular authors.

The previous example also showed how we can access the "raw" text of the book, not split up into tokens. The raw() function gives us the contents of the file as a single string, without any linguistic processing. So, for example, len(gutenberg.raw('blake-poems.txt')) tells us how many characters occur in the text, including the spaces between words. The sents() function divides the text up into its sentences, where each sentence is a list of words.

The opening characters of each file in NLTK's collection of web text look like this:

    firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se ...
    grail.txt SCENE 1: KING ARTHUR: Whoa there! [clop ...
    overheard.txt White guy: So, do you have any plans for this evening? Asian girl ...
    pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr ...
    singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun ...
    wine.txt Lovely delicate, fragrant Rhone wine.

There is also a corpus of instant messaging chat sessions, originally collected by the Naval Postgraduate School for research on automatic detection of Internet predators. The corpus contains over 10,000 posts, anonymized by replacing usernames with generic names of the form "UserNNN", and manually edited to remove any other identifying information. The corpus is organized into 15 files, where each file contains several hundred posts collected on a given date, for an age-specific chatroom (teens, 20s, 30s, 40s, plus a generic adults chatroom).

Shakespeare's Monkeys refers to the monkeys from the Infinite Monkey Theorem, which states that an infinite number of monkeys typing on typewriters for an infinite amount of time will eventually type a work of Shakespeare. By replacing the monkeys with a random character generator that uses Markov models to take language-specific character sequence statistics into account, I created an implementation which is not only more practical than placing actual monkeys in front of typewriters, but also generates texts which look surprisingly similar to actual German or English texts, albeit without any meaningful structure or content.

For convenience, my implementation comes with character sequence statistics for German and English and can easily be extended by using the provided training script to process a large amount of training text in the desired language. As a more serious side effect of the need for training, the obtained character sequence statistics can be compared and used to determine the language of arbitrary texts with surprisingly high accuracy (given a large amount of training data). Hence, my implementation enables language recognition, including German and English by default.

My implementation consists of multiple Perl scripts with HTML pages for convenience. This way, the language generation, recognition and training can be accessed easily by copying the files onto a Web server. My implementation as well as the elaboration of the corresponding theoretical background (see below) have been a term project for the Pattern Recognition course which I took during the winter term 2006/07.
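To make the idea of character sequence statistics more concrete, here is a minimal Python sketch of a character-level Markov model that can both generate text and guess a language. It is not the Perl implementation described above, whose details are not given here. The context length of 2 characters, the add-one smoothing with an alphabet size of 256, the function names and the two tiny training sentences are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter, defaultdict

ORDER = 2  # assumed context length; the project's actual model order is not stated


def train(text, order=ORDER):
    """Count which character follows each `order`-character context."""
    model = defaultdict(Counter)
    padded = " " * order + text.lower()
    for i in range(order, len(padded)):
        model[padded[i - order:i]][padded[i]] += 1
    return model


def generate(model, length, order=ORDER, seed=None):
    """Sample `length` characters, each drawn from the counts for its context."""
    rng = random.Random(seed)
    context = " " * order
    out = []
    for _ in range(length):
        counts = model.get(context)
        if not counts:  # unseen context: fall back to the start-of-text context
            context = " " * order
            counts = model[context]
        chars, weights = zip(*counts.items())
        ch = rng.choices(chars, weights=weights)[0]
        out.append(ch)
        context = (context + ch)[-order:]
    return "".join(out)


def log_likelihood(model, text, order=ORDER, alphabet_size=256):
    """Average log-probability per character, with add-one smoothing (assumed)."""
    padded = " " * order + text.lower()
    total = 0.0
    for i in range(order, len(padded)):
        counts = model.get(padded[i - order:i], Counter())
        prob = (counts[padded[i]] + 1) / (sum(counts.values()) + alphabet_size)
        total += math.log(prob)
    return total / max(1, len(text))


def recognize(models, text):
    """Return the language whose model gives `text` the highest likelihood."""
    return max(models, key=lambda lang: log_likelihood(models[lang], text))


# Toy training texts, invented for this sketch; real use needs far more data.
ENGLISH = ("the quick brown fox jumps over the lazy dog and then the dog sleeps "
           "in the warm sun while the fox runs through the woods")
GERMAN = ("der schnelle braune fuchs springt über den faulen hund und dann "
          "schläft der hund in der warmen sonne während der fuchs durch den "
          "wald läuft")

models = {"english": train(ENGLISH), "german": train(GERMAN)}
print(generate(models["english"], 60, seed=1))
print(recognize(models, "the dog and the fox"))
print(recognize(models, "der hund und der fuchs"))
```

Generation and recognition share the same counts: generation samples from them, while recognition asks which language's counts make a given text most probable. With training texts this small the generated output mostly recombines fragments of the input, which is why a large amount of training text matters.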