CAPTCHAs, reCAPTCHAs, and their cousins have become common anywhere you respond to something on a website. You may not know the name, but you have probably seen something that looks like the image below (taken from the CAPTCHA site).

A CAPTCHA is a computer-generated distortion of words (such as you see above) that lets the computer check that the responder is actually a person. Humans can readily see that the two words above are “overlooks” and “inquiry,” but a computer trying to do word recognition would have trouble identifying them because they are distorted by both the waves and the background images. (CAPTCHAs generally offer an audio option for those who are visually impaired.)

According to the official website,

The term CAPTCHA (for Completely Automated Public Turing Test To Tell Computers and Humans Apart) was coined in 2000 by Luis von Ahn, Manuel Blum, Nicholas Hopper and John Langford of Carnegie Mellon University.

Over 200 million CAPTCHAs are solved every day. You tend to see them on sites where you are registering (such as creating an email account) or voting (such as answering a poll). The goal is to prevent people from writing a program that performs actions automatically and thus overwhelms the system or biases the results. The CAPTCHA is used, for example, on sites that provide free email addresses to ensure that it is actually a human applying for that email address. Many people with questionable purposes were applying for those accounts and using them to send spam or break into applications. While it is still possible to open an email account for those purposes, the CAPTCHA requires a person to be involved, and thus slows down the bad guys.

CAPTCHAs are also used for online polls. Suppose, for example, your municipality wants people to vote on the best location for a festival. To avoid computer-generated votes that would bias the results, it might include a CAPTCHA to be sure there is a person casting each vote. Blog sites sometimes include a CAPTCHA to ensure the comments on posts are not just spam.

The CAPTCHA is an interesting use of the technology, but the reCAPTCHA does more. Early on we saw only one word used in a CAPTCHA, but now you usually see two, as in the image above. When there are two words involved, it is called a reCAPTCHA. These puzzles not only prevent bots and spam from causing problems, they also help to digitize books, newspapers, and old-time radio shows.

Think about how long it would take to type in the content of an old book. Even if you use scanning software and text recognition software, it will take a long time: because the original is old, yellowed, perhaps torn, and for other reasons hard to read, the results of the scan and text recognition are not very reliable. An example of a scanned bit of text (taken from the CAPTCHA site) is shown below. You can see the errors in the interpretation.

The results of a scan of an old text

It would be useful to find a group of people who would help identify and correct mistakes from the scanning of the old documents. As stated before, there are over 200 million CAPTCHAs solved each day, each taking about 10 seconds to do. If you could get the people using the websites to check the words, that would give you well over 150,000 hours of work each day (200 million puzzles at 10 seconds each comes to roughly 555,000 hours), all for FREE!
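The arithmetic behind that estimate is easy to check. The short sketch below uses only the two figures quoted in this post (200 million solves per day, about 10 seconds each); the function and constant names are made up for illustration.

```python
# Figures quoted in the post; both are rough estimates.
SOLVED_PER_DAY = 200_000_000   # CAPTCHAs solved per day
SECONDS_PER_SOLVE = 10         # approximate time spent on each one


def daily_hours(solved_per_day: int, seconds_each: float) -> float:
    """Total human time spent per day, converted from seconds to hours."""
    return solved_per_day * seconds_each / 3600


hours = daily_hours(SOLVED_PER_DAY, SECONDS_PER_SOLVE)
print(f"{hours:,.0f} hours per day")  # prints "555,556 hours per day"
```

So the quoted inputs give about 555,000 hours a day, comfortably above the 150,000-hour mark.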

But, you ask, how do you know if the person is entering the data correctly? The reCAPTCHA provides two words: one that the software already knows, and one taken from one of these old books or magazines. When the person enters the two words, the software checks the known word, and if that word matches, we can assume the second word is probably correct too. If multiple people receive the same second word (the one from the old document) and they all agree on what it says, we have more confidence that we have the correct word.
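The checking scheme described above can be sketched in a few lines of code. This is only an illustration of the idea, not the actual reCAPTCHA implementation: the class name, the method names, the `min_votes` threshold, and the rule of ignoring case and surrounding spaces are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import Optional


def normalize(word: str) -> str:
    """Compare words ignoring case and surrounding spaces (an assumption)."""
    return word.strip().lower()


class WordVerifier:
    """Hypothetical sketch: trust a reading of the unknown word only when the
    known word was typed correctly, and accept it once enough people agree."""

    def __init__(self, min_votes: int = 3):
        self.min_votes = min_votes          # assumed threshold, not from the post
        self.votes = defaultdict(Counter)   # unknown word id -> reading counts

    def submit(self, known_answer: str, known_truth: str,
               unknown_id: str, unknown_answer: str) -> bool:
        """Return True if the responder passed the known-word check."""
        if normalize(known_answer) != normalize(known_truth):
            return False                    # failed check: discard the other word too
        self.votes[unknown_id][normalize(unknown_answer)] += 1
        return True

    def accepted_reading(self, unknown_id: str) -> Optional[str]:
        """Return the agreed reading, or None if there is no consensus yet."""
        counts = self.votes.get(unknown_id, Counter())
        enough = sum(counts.values()) >= self.min_votes
        unanimous = len(counts) == 1        # "they all agree on the word"
        return next(iter(counts)) if enough and unanimous else None


if __name__ == "__main__":
    verifier = WordVerifier(min_votes=2)
    verifier.submit("overlooks", "overlooks", "scan-001", "inquiry")
    verifier.submit("Overlooks", "overlooks", "scan-001", "Inquiry")
    print(verifier.accepted_reading("scan-001"))
```

Requiring unanimous agreement follows the description above literally; a real system could just as well accept a clear majority.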

So, every time you enter a reCAPTCHA, you are not only preventing spam, but also helping to digitize an old document! The CAPTCHA folks have other projects that we will address at another time.