API Abuse – Lessons from the Duolingo Data Scraping Attack

It’s been reported that 2.6 million user records sourced from the Duolingo app are for sale. The attacker apparently obtained them from an open API provided by the company. There’s a more technical explanation available here.

While we talk a lot about the vulnerabilities in the OWASP API Security Top 10 and the exploits associated with them, this incident is a good reminder that not all vulnerabilities are flaws in code. In fact, this API was working as designed. The OWASP API Security Top 10 accounts for these kinds of attacks as API6:2023 Unrestricted Access to Sensitive Business Flows.

If you’re interested in seeing the API in action and you have a Duolingo account, you can access it via a browser. Just go to this URL, replacing the example email with your own: https://www.duolingo.com/2017-06-30/users?email=example@example.com. The response is in JSON, so it won’t produce a pretty web page for you, but you can see the information that’s publicly available via the API.

Duolingo's API Query JSON Response (Source: Black Owl Intelligence)
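
For readers who would rather script the lookup than paste it into a browser, here is a minimal sketch of how the query URL above is put together. The helper name `build_lookup_url` is our own, and the sketch only builds the URL; it makes no network request, and the endpoint may not behave today the way it did when the incident was reported.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL quoted in this post.
BASE_URL = "https://www.duolingo.com/2017-06-30/users"


def build_lookup_url(email: str) -> str:
    """Return the lookup URL for one email address (hypothetical helper)."""
    # urlencode percent-encodes reserved characters such as "@".
    return f"{BASE_URL}?{urlencode({'email': email})}"


print(build_lookup_url("example@example.com"))
# https://www.duolingo.com/2017-06-30/users?email=example%40example.com
```

The point of the sketch is how little is needed: one unauthenticated GET per email address, which is exactly what makes bulk enumeration cheap for a scraper.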

Is Scraped Data Dangerous?

The information shared via the API may seem relatively benign, but it’s important to consider how it might be combined with other data and used by an attacker. For example, if you have a list of email addresses that you’d like to phish, knowing some details about the associated Duolingo accounts could make for a much more effective attack. Imagine receiving an email that appears to come from Duolingo and contains information about the languages you’re learning, whether you’ve logged in recently, and how many ‘crowns’ or ‘xp’ you have. All of that accurate data serves as soft authentication to drive you to click a malicious link.

How to Protect Your APIs from Data Scrapers

If we assume that there’s a valid business purpose for this particular API to be open, then we have to ask how Duolingo could detect and prevent attacks while still meeting the business requirements. A good place to start is by making sure you’re aware of the API endpoint and the sensitive data it might expose.
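
As a rough illustration of what "being aware of the sensitive data an endpoint might expose" can mean in practice, the sketch below walks a JSON response and reports fields whose names look sensitive. The key list and the sample payload are assumptions made up for this example; they are not Duolingo's actual response schema, and this is not how any particular discovery product works.

```python
import json

# Assumed list of field names worth reviewing; tune it for your own APIs.
SENSITIVE_KEYS = {"email", "name", "username", "picture", "location"}


def find_sensitive_fields(payload, path=""):
    """Recursively collect the paths of keys that match SENSITIVE_KEYS."""
    hits = []
    if isinstance(payload, dict):
        for key, value in payload.items():
            child = f"{path}.{key}" if path else key
            if key.lower() in SENSITIVE_KEYS:
                hits.append(child)
            hits.extend(find_sensitive_fields(value, child))
    elif isinstance(payload, list):
        for index, item in enumerate(payload):
            hits.extend(find_sensitive_fields(item, f"{path}[{index}]"))
    return hits


# Invented sample payload, for illustration only.
sample = json.loads(
    '{"users": [{"username": "demo", "totalXp": 120,'
    ' "profile": {"name": "Demo User"}}]}'
)
print(find_sensitive_fields(sample))
# ['users[0].username', 'users[0].profile.name']
```

Matching on field names is a crude heuristic, but even a check like this, run against recorded responses, can surface an endpoint that returns more personal data than the business case requires.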

An API discovery tool should help here. It also might help to employ rate limiting, or even rate limiting based on user agents. It’s hard to say from the outside whether that kind of control would work in their specific situation, but it’s a start. Of course, detecting API abuse is a key capability. Pulling 2.6 million records should be hard to do without tripping query-abuse detection or other behavioral flags.
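
To make the two controls mentioned above concrete, here is a minimal in-memory sketch that combines a sliding-window rate limit with a simple behavioral flag: the number of distinct email addresses a single client looks up within the window. The class name `ScrapeGuard`, the thresholds, and the choice of client key (an IP address, API key, or user agent string) are all assumptions for illustration; a production deployment would need shared state and tuning, and this is not a description of how Duolingo or Wallarm implement it.

```python
import time
from collections import defaultdict, deque


class ScrapeGuard:
    """Sliding-window limiter that also flags lookup enumeration (sketch)."""

    def __init__(self, max_requests=30, window_seconds=60.0, max_distinct_lookups=10):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.max_distinct_lookups = max_distinct_lookups
        # client_id -> deque of (timestamp, email) within the current window
        self._events = defaultdict(deque)

    def check(self, client_id, email, now=None):
        """Return "allow", "rate_limited", or "flag_enumeration"."""
        now = time.monotonic() if now is None else now
        events = self._events[client_id]

        # Drop events that have aged out of the window.
        cutoff = now - self.window_seconds
        while events and events[0][0] <= cutoff:
            events.popleft()

        # Volume control: too many requests in the window.
        if len(events) >= self.max_requests:
            return "rate_limited"

        events.append((now, email.lower()))

        # Behavioral flag: one client querying many different accounts.
        distinct = {queried for _, queried in events}
        if len(distinct) > self.max_distinct_lookups:
            return "flag_enumeration"
        return "allow"


guard = ScrapeGuard(max_requests=5, window_seconds=60, max_distinct_lookups=3)
for i in range(4):
    print(guard.check("client-a", f"user{i}@example.com", now=float(i)))
```

A legitimate user of a lookup endpoint tends to query a handful of accounts, while a scraper working through a list of email addresses queries a different one on every request. Counting distinct lookups captures that difference even when the attacker stays under a plain requests-per-minute limit.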

The Wallarm platform can help with situations like this. API Discovery will enumerate APIs and endpoints, including whether they expose sensitive data. The platform offers rate limiting, including rate limiting by user agent, and our API Abuse Prevention is designed to address automated attacks like content scraping.