09 June 2009

Demystifying CAPTCHA & RECAPTCHA


Human Interaction Proof: Captcha and reCaptcha

A common joke about the Internet goes like this: “Anyone can be anyone on the Internet”, or “On the Internet, nobody knows you're a dog”. These humorous but realistic views rest on the fact that anyone can adopt whatever online identity they wish, however far removed from reality. A middle-aged man can pretend to be a teen interested in online games and enter a chat room where teens engage in candid conversations.

The same applies to creating profiles on social networking websites and obtaining email addresses. There is no need to declare one’s real-world identity, and the details submitted are not verified.

Anonymity online

Spammers exploit this anonymity by creating false accounts to send spam emails, and programmed software bots can make automated visits to inflate site traffic artificially. Hackers may also tempt you with free downloads that are really malware, hoping you will install them and share your contacts.

Spam emails are sent by software programs that log in automatically without any human intervention, and the accounts they create are increasingly misused. The only way to avert this is to restrict access based on human interaction, which requires proof that a human is sitting at the keyboard.


Gotcha a Captcha

To prevent such automated logins, a simple text-based solution is commonly used. This system, called CAPTCHA (Completely Automated Public Turing Test to tell Computers and Humans Apart), displays a meaningless word rendered as an image in a skewed graphic style.

The most common form of CAPTCHA is an image of several distorted letters. The visitor identifies the letters and types them into the form. If the typed letters match the ones in the distorted image, the visitor has passed the test and proceeds to the requested service page. For visually impaired users, there are alternative versions of Captcha that use audio.

This method has been quite successful in preventing malicious programs from signing in to thousands of accounts automatically. Some of the techniques used to create CAPTCHA images are complex, so software is less likely to identify the characters or to remove the background noise that has been added on purpose.
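The matching step described above can be sketched in a few lines of Python. This is only an illustration, not the code of any particular Captcha package: the names `ALPHABET`, `new_challenge` and `is_correct` are hypothetical, and rendering the string as a distorted image is left out.

```python
import hmac
import secrets
import string

# Hypothetical alphabet: look-alike characters (0/O, 1/I/L) are left out
# so that humans are not penalised for an ambiguous glyph.
ALPHABET = "".join(
    c for c in string.ascii_uppercase + string.digits if c not in "0O1IL"
)


def new_challenge(length: int = 6) -> str:
    """Return a random string that would then be drawn as a distorted image."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def is_correct(expected: str, typed: str) -> bool:
    """Compare the visitor's answer with the challenge, ignoring case and outer spaces."""
    normalised = typed.strip().upper()
    return hmac.compare_digest(expected.encode(), normalised.encode())
```

A form handler would call `new_challenge()` when it draws the image and `is_correct()` when the visitor submits the form.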

Server-side programming

For webmasters and web developers who want only legitimate humans to use their login forms, scripts are available for download and need to be installed on the web server. Once properly installed, these scripts generate a Captcha before the web page or form is rendered. Only when the user correctly identifies the Captcha does the server present the requested web form for further action on the website.

It is important for webmasters to test the Captcha-generating software on their own servers at random intervals, as the rendered graphic sometimes has so much noise or distortion that even humans find it difficult to decipher. Another common problem is that Captcha images normally come in small sizes and cannot be magnified. This makes them difficult for readers with poor eyesight.

Techniques such as overlapping characters or free-style connected characters are used to make a Captcha harder to crack with software. Shading, background patterns and resizing distortions are also sometimes used.

In recent times, malware that misuses the Captcha idea has been floating around the net. On infected machines it pops up text saying something like “if you don't solve this captcha within three minutes then your machine will shut down”. Readers are warned not to respond to such messages; just close the pop-up message box.
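On the server, the "solve first, then see the form" rule boils down to remembering which answer belongs to which visitor. The sketch below assumes a hypothetical in-memory `CaptchaGate` class; a real site would more likely keep this in its session store or a database, and downloadable Captcha scripts will differ in the details.

```python
import hmac
import secrets


class CaptchaGate:
    """Hypothetical in-memory store of challenges that are waiting for an answer."""

    def __init__(self) -> None:
        self._pending = {}  # token -> expected answer

    def issue(self, answer: str) -> str:
        """Remember the answer and return an opaque token to embed in the form."""
        token = secrets.token_urlsafe(16)
        self._pending[token] = answer
        return token

    def redeem(self, token: str, typed: str) -> bool:
        """Return True only for the first correct answer to a known token."""
        # pop() makes each challenge single-use, so a solved image cannot be replayed.
        expected = self._pending.pop(token, None)
        if expected is None:
            return False
        return hmac.compare_digest(expected.encode(), typed.strip().upper().encode())
```

Note that a wrong answer also consumes the token, so the visitor is shown a fresh image instead of getting unlimited guesses at the same one.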

Microsoft’s Asirra


Microsoft has its own version of Captcha called Asirra, which implements a HIP (Human Interactive Proof) using pets. ASIRRA stands for Animal Species Image Recognition for Restricting Access: images of animals are displayed and users are required to identify the species. To prevent brute-force attacks that rely on repeated images, Microsoft has partnered with a website for homeless pets (www.petfinder.com) that contains over 2 million images of pet animals. Randomly selected pictures from this collection are presented to the user, who must identify the species. Visit http://tinyurl.com/ltch5w to try Asirra on your website.
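Why a large image pool and several pictures per test matter can be shown with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the figures of two species and twelve pictures per challenge are assumptions used for the example, not numbers taken from this article.

```python
import random
from fractions import Fraction


def pick_images(pool_size: int, count: int) -> list:
    """Choose `count` distinct image indexes at random from a pool of `pool_size`."""
    return random.sample(range(pool_size), count)


def blind_guess_success(images_per_challenge: int, species_choices: int = 2) -> Fraction:
    """Chance that a bot guessing every species at random gets the whole challenge right."""
    return Fraction(1, species_choices) ** images_per_challenge


# Assumed example: 12 pictures, each either a cat or a dog.
odds = blind_guess_success(12)
print(odds)  # 1/4096
```

With a pool of millions of pictures, a bot is also unlikely to see the same image twice, so it cannot simply memorise earlier answers.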


Project Gutenberg


While the Captcha project aims to tell humans and computers apart, scientific research at Carnegie Mellon University (CMU), USA, puts this simple human effort of recognising letters to use in a totally different context. Some readers will remember reading about Project Gutenberg (http://www.gutenberg.org/).


Project Gutenberg aims to produce free electronic books through the digitisation of old books by tens of thousands of volunteers. Thousands of books and newspapers are being scanned using robotic devices, and their electronic text-based versions are created using Optical Character Recognition (OCR) software. The project site currently has a collection of over 28,000 free books according to its Online Book Catalogue, and a grand total of over 100,000 titles is available through Project Gutenberg Partners, Affiliates and Resources.


CMU’s ReCaptcha


For newer books, OCR is about 90% accurate, but this drops to as low as 60% for older texts, which often contain fonts that are blurry and less uniform. In addition, the robotic suction cups that flip pages for scanning sometimes introduce letter distortions.


CMU’s ReCaptcha project takes words from old books and newspapers that optical character recognition software has marked as unreadable by computers. By deciphering these words, users help to complete the conversion of old texts to digital form. As millions of Captchas are correctly recognised by real humans worldwide, valuable knowledge is being created for free exchange.


ReCaptcha for your websites


CMU’s ReCaptcha can also help digitize the text of books while protecting websites from bots attempting to access restricted areas. ReCaptcha supplies subscribing websites with images of words that optical character recognition (OCR) software has been unable to read. The subscribing websites (whose purposes are generally unrelated to the book digitization project) present these images for humans to decipher as CAPTCHA words, as part of their normal validation procedures.


They then return the results to the ReCaptcha service, which sends the results to the digitization projects. This provides the equivalent of about 160 books per day, or 12,000 man-hours per day of free labour for a cause that is valuable to the global community.


ReCaptcha is reported to deliver over 30 million images every day and is currently digitizing text from the Internet Archive and the archives of the New York Times. Apart from free mail-service sites, even social networking sites such as Facebook, Twitter and StumbleUpon support this project.
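How can a service trust an answer for a word that no computer could read? One widely described approach is to show the unreadable word next to a second word whose text is already known, and to count a visitor's reading of the unknown word only when the known word was typed correctly. The sketch below illustrates that idea; the class name `UnknownWord`, its methods and the agreement threshold are assumptions made for this example, not ReCaptcha's actual code.

```python
from collections import Counter


class UnknownWord:
    """Collects human readings of one scanned word that OCR could not decipher."""

    def __init__(self, votes_needed: int = 3) -> None:
        self.votes_needed = votes_needed  # hypothetical agreement threshold
        self.votes = Counter()

    def record(self, control_expected: str, control_typed: str, unknown_typed: str) -> None:
        """Count the reading only if the visitor typed the known control word correctly."""
        if control_typed.strip().lower() == control_expected.lower():
            self.votes[unknown_typed.strip().lower()] += 1

    def accepted_text(self):
        """Return the agreed reading once enough visitors have given the same answer."""
        if not self.votes:
            return None
        text, count = self.votes.most_common(1)[0]
        return text if count >= self.votes_needed else None
```

Once `accepted_text()` returns a reading, that word could be passed back to the digitization project, and it could itself serve as a control word for later visitors.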


For users who wish to protect their email addresses from being captured by spammers, ReCaptcha's Mailhide utility comes to the rescue. With this utility, your email address is encrypted and shielded with a Mailhide API key. Anyone wishing to see your mail ID is challenged with a Captcha that can be solved only by humans and not by spammers' automated programs. For more details, visit http://mailhide.recaptcha.net/. This is the miracle of harvesting, in ten-second increments, millions of hours of a most precious resource, human brain cycles, and using them for a more worthy cause.