Here’s the tool I wrote to import “Project Gutenberg” (link: http://www.gutenberg.org) texts into OneNote. The first link goes to the setup files, and the second has the code if you want to see that.
Update as of April 6, 2009: the updated download file is below my signature at the end of this article. The “setup” and source files are combined as well – if you just want to install the powertoy, simply run the GutenWin.exe file.
Remember to run the setup.exe file if you install this. I use it to set some registry keys (more on that later) which are needed for the tool to break the book into chapters correctly. I also include the legal information in a header and footer page to meet the requirements of the Gutenberg project.
Another goal was to be able to put an imported book into a specific notebook during import. I did not necessarily want all the new pages to go into the Unfiled Notes section. I created a simple tree control which shows you all the notebooks you have open in OneNote and lets you choose which notebook to add the new work of prose to. My tool even makes a guess at the name of the imported book to use as the new section name. If you downloaded “Pride and Prejudice” last week, it should get the name right. That book is included in the setup.zip file as well so you can use it to test the tool.
And if you don’t choose a specific notebook as an import location, I default to Unfiled Notes. Throw in a simple status bar (which bases the percentage complete on the number of chapters to import) and a completion dialog when done, and I’m done!
Now for the limitations. It just so happened that the first few books I tested were prose books with clearly delineated chapters, specifically Mark Twain and Jane Austen books. The tool worked great. Then I tried “Ulysses” by Joyce (link: http://www.gutenberg.org/dirs/etext03/ulyss12.txt) and got garbled results. That book doesn’t use the word “Chapter” or “CHAPTER” to delineate chapters. It just uses a Roman numeral at the beginning of each chapter. In this case, I could cook up a scheme to look for individual Roman numerals on a line by themselves and use them as chapter breaks. Unfortunately, it gets worse, since this particular text puts a pair of dashes on either side of the Roman numeral as a visual aid to make the chapter break easier to see on screen. I looked around at a few more books and found even more variety: some used a table of contents to give each chapter a unique name, some had unique chapter names but no table of contents, some used Roman numerals or Arabic numbers on a line by themselves, and so on.
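To make the problem concrete, here is a rough Python sketch of the kind of heading test described above. It is an illustration only, not the GutenWin source: the function names, the regular expression, and the choice to treat any leading keyword match as a heading are all assumptions.

```python
import re

# Assumed default keyword; the post only says the tool breaks on the word "Chapter".
DEFAULT_KEYWORDS = ("Chapter",)

# A Roman numeral alone on a line, optionally wrapped in dashes, e.g. "-- IV --".
ROMAN_LINE = re.compile(r"^\s*-*\s*([IVXLCDM]+)\s*-*\s*$")


def is_chapter_heading(line, keywords=DEFAULT_KEYWORDS):
    """Return True if the line looks like the start of a chapter."""
    stripped = line.strip()
    lowered = stripped.lower()
    # Keyword test: case-insensitive, so "Chapter" also catches "CHAPTER".
    for word in keywords:
        if lowered.startswith(word.lower()):
            return True
    # Fallback test: a bare Roman numeral, with or without surrounding dashes.
    return bool(ROMAN_LINE.match(stripped))


def split_into_chapters(text, keywords=DEFAULT_KEYWORDS):
    """Split text into a list of chapters, each a list of lines."""
    chapters = [[]]  # chapters[0] collects any front matter before the first heading
    for line in text.splitlines():
        if is_chapter_heading(line, keywords):
            chapters.append([])
        chapters[-1].append(line)
    return chapters
```

Even this version is only a heuristic. A line containing nothing but the word “I”, or a word such as “MIX” that happens to be spelled with Roman numeral letters, would be mistaken for a chapter break, and a book with named chapters would not be split at all.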
Then someone inside Microsoft imported a book of poetry. Ugh. Removing the line break at the end of each line makes no sense for poetry; that logic only applies to prose organized into paragraphs. I don’t recommend this tool for importing poetry. The formatting gets totally lost, and you wind up with three pages or so of bogus text.
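The line-break logic can be sketched in a few lines of Python. This is an assumed reconstruction of the general idea (join hard-wrapped lines, treat blank lines as paragraph breaks), not the tool’s actual code, and the function name is made up for the example.

```python
def unwrap_paragraphs(text):
    """Join hard-wrapped lines into paragraphs; blank lines separate paragraphs."""
    paragraphs = []
    current = []  # lines collected for the paragraph being built
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            # A blank line ends the current paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

For prose this is exactly what you want. For a poem, every stanza collapses into one long run-on line, which is why the imported poetry looked so bad.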
What I decided to do was get it to work for the books I was interested in reading and commenting on. “Wuthering Heights” was my final test: if that worked, I could “ship” my tool. It did and I did.
I left the tool slightly extensible, so users who don’t have Visual Studio, or who don’t want to rewrite or add to the code, can get around the limitation of using the word “Chapter” to break out individual chapters. You can add new separator words to the string registry value named “delim” under HKEY_CURRENT_USER\Software\Guinsoft\GutenWin. Just add the new keywords to the end of the list, using a comma as the separator.
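For anyone who would rather script that change than open the registry editor, here is a hedged Python sketch of working with a comma-separated “delim” string. The helper names are invented for the example, and the sample value "Chapter,CHAPTER" is an assumption about what setup writes. Only the key path and the value name come from the description above.

```python
def parse_delimiters(raw):
    """Split a comma-separated delimiter string into a list of keywords."""
    return [word.strip() for word in raw.split(",") if word.strip()]


def add_delimiter(raw, new_word):
    """Return the delimiter string with new_word appended, unless already present."""
    words = parse_delimiters(raw)
    if new_word not in words:
        words.append(new_word)
    return ",".join(words)


def read_delim_from_registry():
    """Windows only: read the "delim" string value described in the post."""
    import winreg  # imported here so the helpers above work on any platform

    with winreg.OpenKey(winreg.HKEY_CURRENT_USER, r"Software\Guinsoft\GutenWin") as key:
        value, _value_type = winreg.QueryValueEx(key, "delim")
    return value


# Example with an assumed starting value: add "BOOK" as a new separator word.
updated = add_delimiter("Chapter,CHAPTER", "BOOK")
```

Writing the updated string back is left out on purpose. Editing the value by hand is simple enough, and it is always worth exporting the key first in case you want to restore the original list.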
I learned quite a bit about our XML schema when writing this tool. Since English gives writers a free rein to create books in any manner, it’s very difficult to guess the author’s intention. A side goal of this particular tool was to give me a reason to create a “notebook picker” to choose where to send data I’m adding to OneNote. Let me know if you like this.
Questions, comments, concerns and criticisms always welcome,