Many months ago, I planned to build a database of photographs containing text to test our OCR (optical character recognition) routines. At the time, I was new to the team and had not seen the testing that was already in place, but I wanted to add to the repository anyway.
Whiteboards at Microsoft are everywhere. You get at least one in every office, and I have three. In our kitchen area, we have a huge whiteboard next to the coffee machines. It is normally used for brainstorming and the like, and I thought the text on the board would be nice to use for testing. To get everyone on my floor involved, I decided to write a "question of the week" on that board each Monday and make it general and interesting enough that everyone would want to answer. I would then photograph the board when it was full and add the photo to the repository for testing.
I thought my plan was pretty good. Since we have the new Starbucks iCup machines, each cup of coffee takes about 60 seconds to brew on demand. "While people are ‘perco-waiting*’," I reasoned, "they can write answers on the board."
Some of the questions received mediocre responses at best: the last book you read, the next movie you want to see, and so on. The board would normally not be full by the end of the week. More personal questions like "What is your favorite guilty pleasure website?" generated more responses (my answer was www.cuteoverload.com, FWIW), with the board sometimes filling up after three or four days. And this week things went haywire.
I asked about the proper use of the apostrophe.
Within half a day, the board was full. People even started hanging up pages from reference materials and a printout from the "Apostrophe Preservation Society" in the UK. I’m just waiting for the Bob the Angry Flower poster.
While this has been and still is fun, from a test point of view this was a bad plan. We (OneNote) do a very good job of recognizing text in images, and a very good job of recognizing handwriting in ink form on a page. Photographing handwritten text like this does not work well, since the writing on the whiteboard is almost literally ink, yet a photograph captures it only as an image. My plan would have taken an algorithm based on formatted text and tried to use it in a free-form handwriting environment. We don’t get the clues that ink typically provides, like the starting and ending points of the pen, directional information and the like. And as you can see below, the OCR does not work well. (This is the right half of the board, by the way.) My tests would have had erratic results and would not have tested anything useful at all.
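For the curious, here is a toy sketch in Python of why those pen clues matter. It is only an illustration under my own assumptions: the stroke format and the `rasterize` and `stroke_direction` helpers are hypothetical, not how OneNote stores ink or runs recognition. A stroke is an ordered list of points, so it knows where the pen started and which way it moved. A photograph is closer to an unordered set of inked pixels, and two strokes drawn in opposite directions collapse into the same picture.

```python
# Toy illustration only: these structures and helpers are hypothetical
# and are not OneNote's ink format or recognition pipeline.

def rasterize(strokes):
    """Flatten ordered pen strokes into an unordered set of inked pixels,
    roughly what a photograph of the whiteboard would keep."""
    return frozenset(point for stroke in strokes for point in stroke)


def stroke_direction(stroke):
    """Return the overall (dx, dy) from the pen-down point to the pen-up point."""
    (x0, y0), (x1, y1) = stroke[0], stroke[-1]
    return (x1 - x0, y1 - y0)


# The same horizontal line, drawn in two different directions.
left_to_right = [[(0, 0), (1, 0), (2, 0)]]
right_to_left = [[(2, 0), (1, 0), (0, 0)]]

# As ink, the two strokes are distinguishable by direction...
ink_differs = stroke_direction(left_to_right[0]) != stroke_direction(right_to_left[0])

# ...but as pixels, they are identical.
pixels_match = rasterize(left_to_right) == rasterize(right_to_left)
```

Once the board is photographed, the direction and ordering information is gone, and only the pixels are left for the recognizer to work with.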
But it was (and is) still a fun project.
And notice the last gasp at OCR from the bottom image. Slash. Dot. Kind of appropriate in a way, don’t you think?
Questions, concerns, comments and criticism always welcome,
?,r 7’3z(vt: Whah pRopL
(I’m snipping the rest. You get the point.)
yYbflcAR (‘f £01rb4
* Perco-waiting: the act of standing around waiting for a cup of coffee to complete brewing. I see it daily.