Next week we will continue with implementations of stacks and queues.
Assume that we want to imitate the saga Njála. The primitive version of Drifhöfundur would find the frequency of all letters in Njála and each time pick a letter at random based on those probabilities. For example if a space, ' ', is every fifth letter on averate in Njála, then the probability that the program outputs a space is 20%. The second most common letter might be the letter 'a', and then there is less probability of the program picking it, etc. The text that comes out has the same frequence distribution on the letters as Njála, but doesn't look much like Njála.
However, we can continue with this idea and base the probability on which letter came last For instance, there is a much higher probability for the letter 'ð' if the last letter was an 'a', than if it was some other letter. This is because of how common words "að" and "það" are in Njála (as in most other icelandic books). Below is an example of how such a random text could look like:
Now we can increase the number of letters that we base the probability on, such that we look at the last k letters used. In the example above we had k = 1, but if we use the last two letters we can get the random text:
As k increases the random text starts looking more like the original text. Let's look at k = 5:
and k = 10:
The algorithm that you are to use is not very complicated. The program reads the value of k, how long
the random text should be and the name of the input file. It starts by reading the whole input file
into a string (string variable or char array) and chooses a k-letter initial
string at random from the input text as a precursor and prints it out. We then search for
this k-letter precursor in the text and each time we find it we add the letter that
immediately follows it in the text to a list. For example if k is 2 and the first precursor
is "þe" then the list of letters that that come after it could be ['s', 't', 's', 'i', 'i', ... ]. We
then choose a letter at random from this list. Those letters that appear often in the list have a
higher probability of being chosen than those that appear less often.
When the first random letter has been chosen it is printed out and the precursor updated by
throwing away the first letter and appending the new letter. For example if the letter 't' had been
chosen in the example above, then the next precursor would be "et", and we would then again
make a list of all the letters that immediately follow the string "et" in the text.
There are various data structures that would make the seach for the letters that follow precursor faster, but in this project it is sufficient to just search by going sequentially through the text each time, recording the letter that follows every time we find the precursor.
This project should not be very hard. Try to do it as much as possible by yourselves, since this will benefit you most. As before, it is all right to discuss the project with your fellow students, but avoid accepting any program code.
You can find many books in icelandic at the web page of Netútgáfan. They are all in HTML format, but there are not many HTML commands in the files, and you can always get the clean text by selecting the whole text in the browser (ctrl-a), copy it (ctrl-c), and pasting it into a texteditor (ctrl-v). There is also large amounts of books in english at the Project Gutenberg.
Hand in a short report on your program, where you describe in more detail the design of the program and show the results of running the program on 2-3 different books for the values of 1, 4, 7, and 10 of k. Also hand in the printout of the program to your section teacher before 6pm on friday February 25th. They will not accept solutions that arrive later.