Sunday, December 14, 2008

Funny reCAPTCHAs

Every day we serve over 30 million randomly chosen pairs of words from scanned books and newspapers to users around the world. Although we heavily filter the words presented to avoid offensive combinations (there are over 1,000 words in our block list), some amusing pairs slip through. Below are some of our favorites. All are real examples emailed by users.

While obtaining tickets for a concert:


What can we say?


Made an unlucky user insult themselves:


Marital advice:


This is one of the all-time best:

The user emailed the site 20 minutes later complaining he had followed the instructions to wait, but nothing was happening.

Heh.

Sunday, December 7, 2008

New Audio reCAPTCHA

One of the main goals when we launched reCAPTCHA was to provide a system accessible to visually impaired individuals (who surf the Web using screen-reading software). Most other CAPTCHAs do not provide an audio alternative, and therefore block blind people from freely navigating the Web. We're proud of the fact that reCAPTCHA has always had an audio alternative.

Today we are announcing a significantly improved audio CAPTCHA that is both easier for humans than our previous one and, most importantly, by far the most secure audio CAPTCHA we know of.

Like many other audio CAPTCHAs, our previous version consisted of distorted spoken digits. We collected thousands of voices saying the digits zero through nine, and formed audio CAPTCHAs by concatenating digits from different speakers and adding noise in the background. To maintain the security of the audio CAPTCHA, our distortions were quite heavy. We now believe that even such heavy distortions are not enough when audio CAPTCHAs are restricted to spoken digits or letters.
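For the curious, here is a minimal sketch of how a digit-based audio CAPTCHA of this kind could be assembled. It is an illustration only, not our production code: the function name `build_digit_captcha`, the `recordings` layout (digit mapped to a list of speaker and sample pairs), and the uniform-noise model are all assumptions made for the example.

```python
import random


def build_digit_captcha(recordings, length=8, noise_level=0.3, seed=None):
    """Concatenate spoken digits from random speakers and add background noise.

    recordings: dict mapping a digit string ("0".."9") to a list of
                (speaker_name, samples) pairs, where samples is a list of
                floats in [-1.0, 1.0]. This layout is hypothetical.
    Returns (answer, noisy_samples).
    """
    rng = random.Random(seed)
    answer = []
    audio = []
    for _ in range(length):
        digit = rng.choice(sorted(recordings))
        # Pick a random speaker for every digit, so voices change mid-clip.
        _speaker, samples = rng.choice(recordings[digit])
        answer.append(digit)
        audio.extend(samples)
    # Add uniform noise to every sample and clip back into the valid range.
    noisy = [
        max(-1.0, min(1.0, s + rng.uniform(-noise_level, noise_level)))
        for s in audio
    ]
    return "".join(answer), noisy
```

The weakness is visible even in the sketch: the answer is drawn from only ten possible sounds per position, which is a small target for a trained classifier no matter how much noise is layered on top.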

This week, Jennifer Tam, a PhD student at Carnegie Mellon University who has been working with us, will present her results about the security of audio CAPTCHAs at the Annual Conference on Neural Information Processing Systems. In her paper, she shows that audio CAPTCHAs based solely on distorted digits (or even letters) can be broken using machine learning techniques. This includes all commonly used audio CAPTCHAs.

Although we have not seen anybody abuse our previous audio CAPTCHA in the wild, we have taken preventive measures against this potential attack. So today we announce the release of a new audio CAPTCHA that is significantly more secure and in particular not susceptible to Jenn's attack. In fact, breaking this new audio CAPTCHA would require major advancements in speech recognition technology.

Instead of using spoken digits or letters, our new audio CAPTCHA presents entire spoken sentences or phrases that the best speech recognition algorithms failed to recognize. In other words, this new audio CAPTCHA uses the same idea as the standard visual reCAPTCHA: we play audio from old-time radio shows that speech recognition software could not decipher correctly, and then use the results of humans solving these CAPTCHAs to transcribe the shows. Not only is this audio CAPTCHA more secure, but it also has a positive side effect. Much like the visual reCAPTCHA has helped to digitize billions of printed words so far, we expect that the audio version will help transcribe large amounts of historical audio content.
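To make the "humans transcribe what software could not" idea concrete, here is a rough sketch of one way answers from several people could be combined into a transcription. This is an assumption for illustration, not a description of our pipeline: the function names and the `min_votes` and `min_share` thresholds are hypothetical.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not split votes."""
    return " ".join(text.lower().split())


def agreed_transcription(answers, min_votes=3, min_share=0.6):
    """Return the most common answer if enough people agree on it, else None.

    answers: list of strings typed by different users for the same clip.
    min_votes and min_share are illustrative thresholds, not real settings.
    """
    counts = Counter(normalize(a) for a in answers if a.strip())
    if not counts:
        return None
    best, votes = counts.most_common(1)[0]
    total = sum(counts.values())
    if votes >= min_votes and votes / total >= min_share:
        return best
    return None
```

With these thresholds, a clip is considered transcribed only once at least three people have typed the same thing and they make up a clear majority of all answers received for it.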

You can hear the new audio CAPTCHA by going here and clicking on the audio button. You'll hear a short clip with people speaking and will have to type what they are saying. To account for spelling mistakes and homophones, the verification algorithm uses a phoneme-based encoding and allows a small number of mistakes.
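As a rough illustration of tolerant, sound-based matching, the sketch below reduces each word to a crude phonetic key and then allows a small edit distance between the typed and expected key sequences. Everything here is an assumption for the example: the Soundex-style key is a simple stand-in for a real phoneme encoding, and the names `phonetic_key`, `matches`, and `max_mistakes` are hypothetical rather than part of the actual verification algorithm.

```python
# Consonant groups that sound alike share a digit (Soundex-style grouping).
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def phonetic_key(word):
    """Reduce a word to a 4-character key: first letter plus consonant codes."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        if ch not in "hw":  # h and w do not separate repeated sounds
            prev = code
    return (key + "000")[:4]


def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def matches(typed, expected, max_mistakes=1):
    """Accept the answer if the phonetic keys differ in at most max_mistakes words."""
    typed_keys = [k for k in map(phonetic_key, typed.split()) if k]
    expected_keys = [k for k in map(phonetic_key, expected.split()) if k]
    return edit_distance(typed_keys, expected_keys) <= max_mistakes
```

In this sketch, "their" and "there" map to the same key, as do "recieve" and "receive", so homophones and common misspellings cost nothing, while a missed or badly misheard word counts as one mistake.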

We'll be rolling this update out to all of our users over the next few weeks. For now, if you are using our custom theme option, we ask that you update the instructions for the audio CAPTCHA to say something along the lines of "type what you hear".

We have a blog!

After a year and a half of running reCAPTCHA, we finally had time to start a blog.

Perhaps the best way to begin is with a run-down of our milestones. The media has noticed us, with coverage in NPR, the Wall Street Journal, the Boston Globe, the Guardian, Wired, and hundreds of other outlets. We published a paper in the journal Science about the accuracy of the reCAPTCHA transcriptions. Over 75,000 Web sites have signed up to use our service (including some household names like Facebook, Ticketmaster and Craigslist), and to this day over 300 million people (more than 5% of the world's population!) have helped us digitize content from the New York Times and the Internet Archive. So far, close to 5 billion words have been served.

Needless to say, we're a bit overworked here :)