Cache Deception: How I discovered a vulnerability in Medium and helped them fix it

As a full-stack developer and security researcher, I spend a lot of time analyzing web applications and mobile apps, searching for interesting vulnerabilities to poke at. Recently, while reverse engineering the Medium Android app, I uncovered a tricky cache deception flaw that allowed me to access private user data. In this deep dive, we'll explore how this bug worked, the steps I took to get it fixed, and the wider implications for securely deploying web caches.

Web Caching 101

Before jumping into the juicy technical details, let's make sure we're on the same page about web caching. At a fundamental level, a web cache sits between the client (browser) and the server (origin), intercepting requests and storing copies of the server's responses. The next time a client requests the same resource, the cache can short-circuit the process and serve its stored response directly, without contacting the origin server again.

Web caching is used pervasively across the internet to reduce latency, save bandwidth, and decrease load on origin servers. Caches can exist at multiple layers:

  • Browser caches (e.g. Chrome, Firefox, Safari)
  • Intermediary caches (e.g. corporate proxies, anti-virus scanners)
  • Reverse proxy caches (e.g. Varnish, Squid, Nginx)
  • Content Delivery Networks or CDNs (e.g. Cloudflare, Akamai, Fastly)

While the implementation details vary, at a high level all these systems follow the same basic principle: when a cacheable response comes in, store it using the request URL as the key, and serve it up for subsequent requests to the same URL.
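To make the "URL as the key" idea concrete, here is a minimal sketch of an in-memory cache. The class name, the `cacheable` flag, and the `originServer` callback are all assumptions made up for illustration; real caches build richer keys and apply far more rules.

```javascript
// Minimal in-memory cache keyed only by request URL (illustrative sketch).
class UrlKeyedCache {
  constructor() {
    this.store = new Map(); // URL string -> stored response
  }

  // fetchFromOrigin is a hypothetical callback standing in for the origin server.
  get(url, fetchFromOrigin) {
    if (this.store.has(url)) {
      return { ...this.store.get(url), servedFrom: "cache" };
    }
    const response = fetchFromOrigin(url);
    if (response.cacheable) {
      this.store.set(url, response);
    }
    return { ...response, servedFrom: "origin" };
  }
}

// Fake origin that counts how often it is actually contacted.
let originHits = 0;
const originServer = (url) => {
  originHits += 1;
  return { body: `content for ${url}`, cacheable: true };
};

const cache = new UrlKeyedCache();
const first = cache.get("https://example.test/logo.png", originServer);
const second = cache.get("https://example.test/logo.png", originServer);
console.log(first.servedFrom, second.servedFrom, originHits);
```

The second request never reaches the origin, which is exactly the property that makes caches fast and, as we'll see, dangerous when the wrong response gets stored.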

So which responses are "cacheable"? That's where things get nuanced. HTTP caching has an entire RFC devoted to it (RFC 9111), but the tl;dr is that cacheability depends on a combination of the request method (e.g. GET vs POST), the response status code, and the cache-control headers.

For example, a response to a GET request with a 200 status code and Cache-Control: max-age=86400 header is considered cacheable for 86400 seconds (1 day). The exact caching behavior also depends on the type of cache (e.g. browser vs CDN) and its configured caching policies.
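As a rough illustration of that rule, the sketch below computes how long a shared cache (such as a CDN) might keep a response. It is a deliberate simplification: the function names are my own, only GET/200 responses are considered, and `no-cache` is treated as "do not reuse" even though real caches may store such responses and revalidate them.

```javascript
// Parse a Cache-Control header value into a Map of directive -> value (or true).
function parseCacheControl(headerValue) {
  const directives = new Map();
  for (const part of (headerValue || "").split(",")) {
    const [name, value] = part.trim().split("=");
    if (name) {
      directives.set(
        name.toLowerCase(),
        value === undefined ? true : value.replace(/"/g, "")
      );
    }
  }
  return directives;
}

// Simplified freshness lifetime (in seconds) for a shared cache; 0 means "do not reuse".
function sharedCacheLifetime(method, status, cacheControl) {
  if (method !== "GET" || status !== 200) return 0;
  const cc = parseCacheControl(cacheControl);
  if (cc.has("no-store") || cc.has("no-cache") || cc.has("private")) return 0;
  // s-maxage, when present, overrides max-age for shared caches.
  const raw = cc.has("s-maxage") ? cc.get("s-maxage") : cc.get("max-age");
  const seconds = Number.parseInt(raw, 10);
  return Number.isFinite(seconds) && seconds > 0 ? seconds : 0;
}

console.log(sharedCacheLifetime("GET", 200, "max-age=86400"));
console.log(sharedCacheLifetime("GET", 200, "private, max-age=600"));
```

Note that this models a cache that honors the origin's headers. A CDN configured with its own override rules can ignore them entirely, which matters later in this story.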

The crucial thing to understand is that caching is based on the URL requested, not the actual content returned. And that brings us to the crux of cache deception vulnerabilities: attackers manipulating the URL to trick a cache into storing something it shouldn't. Let's see how that played out with Medium's Android app.

Reverse Engineering Medium for Fun and Bounties

One of my favorite pastimes is digging into the internals of Android apps to understand how they tick. Not only is it a great learning experience, it occasionally leads to discovering security vulnerabilities, which is exactly what happened with Medium's app.

Using a combination of apktool, jadx, and Burp Suite, I began systematically reverse engineering the Medium APK. Apktool is a handy utility for decoding Android app packages into human-readable smali code and resources. Jadx goes a step further and decompiles the Dalvik bytecode into Java source code. And of course, Burp Suite is the go-to tool for intercepting and analyzing network traffic between the app and Medium's API servers.

After some initial reconnaissance, I homed in on the API client code, where I struck gold. Turns out, Medium's Android developers used the popular Retrofit library to cleanly define their REST API endpoints as Java interfaces.

Here's a simplified example of what those Retrofit interfaces looked like:

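import retrofit2.Call;
import retrofit2.http.GET;
import retrofit2.http.Path;

// Simplified reconstruction; UserProfile is the app's own model class.
public interface UserService {
  // Loads a profile page; the username is interpolated straight into the URL path.
  @GET("/@{username}")
  Call<UserProfile> getProfile(@Path("username") String username);

  @GET("/_/api/users/{userId}/follows")
  Call<Void> followUser(@Path("userId") String userId);
}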

In the decompiled code, I found dozens of these Retrofit interfaces specifying hundreds of API endpoints with varying path parameters, query strings, and request/response types. It was a treasure trove of information about how the app interfaced with the server. I wrote a quick script to extract all the API definitions and began probing them for unexpected behavior.
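For readers who want to try something similar, here is a rough sketch of what such an extraction script could look like. This is not the script I used at the time, just a regex-based approximation; the function name and the list of HTTP-method annotations it looks for are assumptions, and it only handles annotations whose path is a plain string literal.

```javascript
// Matches Retrofit-style annotations such as @GET("/some/{param}/path").
const RETROFIT_ANNOTATION = /@(GET|POST|PUT|DELETE|PATCH|HEAD)\(\s*"([^"]*)"\s*\)/g;

// Pull HTTP method, path, and path parameter names out of decompiled Java source.
function extractEndpoints(javaSource) {
  const endpoints = [];
  for (const match of javaSource.matchAll(RETROFIT_ANNOTATION)) {
    const path = match[2];
    const pathParams = [...path.matchAll(/\{(\w+)\}/g)].map((m) => m[1]);
    endpoints.push({ method: match[1], path, pathParams });
  }
  return endpoints;
}

// Sample input: the simplified interface shown earlier in this post.
const sample = `
public interface UserService {
  @GET("/@{username}")
  Call<UserProfile> getProfile(@Path("username") String username);

  @GET("/_/api/users/{userId}/follows")
  Call<Void> followUser(@Path("userId") String userId);
}`;

const endpoints = extractEndpoints(sample);
console.log(endpoints);
```

Endpoints with path parameters are the interesting ones here, because those are the places where input ends up inside the URL path.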

Discovering the Cache Deception Flaw

As I was iterating through the various API endpoints, I noticed a peculiar pattern in how user profile pages were accessed. The app would make a GET request to a URL like https://medium.com/@username whenever it needed to load a user's profile.

Intrigued, I tried requesting my own profile page with a slight tweak: I appended a file extension such as .css, .js, or .png to the username in the URL. To my surprise, the server still returned my profile page content, but with one key difference: the response was cached!

I confirmed this behavior by loading the profile page URL with a .png extension while logged into my account, then opening the same URL in an incognito window. Amazingly, the full HTML content of my profile page, including private details, was served from the cache without any authentication.
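The same two-step check can be scripted. The sketch below is an assumption-laden approximation rather than my original tooling: `probe` relies on the global `fetch` available in Node.js 18+, the function names are invented, and `privateMarker` is any string that should only appear for the logged-in owner (for instance your own email address). Only run it against your own account on targets you are authorized to test.

```javascript
// Decide whether an unauthenticated response exposes data that only the
// logged-in owner should see. privateMarker is chosen by the tester.
function confirmsCacheDeception(anonymousResponse, privateMarker) {
  const servedOk = anonymousResponse.status === 200;
  const leaksMarker = anonymousResponse.body.includes(privateMarker);
  return servedOk && leaksMarker;
}

// Network version of the manual test (defined here, not invoked).
async function probe(url, sessionCookie, privateMarker) {
  // Request 1: authenticated. If the flaw exists, this primes the cache.
  await fetch(url, { headers: { Cookie: sessionCookie } });
  // Request 2: no credentials at all, like an incognito window.
  const anon = await fetch(url);
  const snapshot = { status: anon.status, body: await anon.text() };
  return confirmsCacheDeception(snapshot, privateMarker);
}

// Offline example with canned responses.
const leakedPage = { status: 200, body: '{"email": "me@example.test"}' };
const publicPage = { status: 200, body: "<h1>Public profile</h1>" };
console.log(confirmsCacheDeception(leakedPage, "me@example.test"));
console.log(confirmsCacheDeception(publicPage, "me@example.test"));
```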

Here's an example of the kind of sensitive data that was leaked:

<script type="application/json" data-hypernova-key="user_sidebar_json">
  {
    "user": {
      "id": "1f5d07b378fc",
      "username": "jsmith",
      "name": "John Smith",
      "email": "[email protected]",
      "bio": "Just another hacker exploring the world of code.",
      ...
    },
    ...
  }
</script>

Example of cached profile data

The implications were severe. An attacker could lure a logged-in victim into opening a specially crafted profile URL (for example, through a link), causing Medium's caching layer to store a response containing the victim's private details, such as email addresses, session tokens, and CSRF tokens. The attacker could then request the same URL and retrieve that sensitive information from the cache without needing to authenticate as the victim.

Digging deeper, I observed that the cached responses were served with a Cache-Control: max-age=14400 header, meaning they would persist in the cache for 4 hours (14400 seconds). Furthermore, issuing new requests to my .png profile URL wouldn't overwrite the cached entry: the cache would continue serving the initial response for the full duration.

Based on this behavior, I suspect Medium was using some sort of CDN or reverse caching proxy (e.g. Fastly, Cloudflare) that was configured with a broad rule to cache all URLs containing a file extension. This is a common performance optimization to cache static assets like images, scripts, and stylesheets at the edge. The problem was that user-controlled input (the username) was used as part of the URL path, allowing attackers to "fool" the cache into storing dynamic, private content.

Figure: Overview of the cache deception attack flow

In summary, the key aspects of this vulnerability were:

  1. Medium's caching layer used URL paths to determine cacheability, regardless of the actual content
  2. Attackers could control part of the URL path by appending a file extension to the username segment
  3. Responses meant to be dynamic and private were cached and served to unauthenticated users
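The three points above can be reproduced in a toy simulation. Everything in it is hypothetical: I never saw Medium's actual edge configuration, so the extension rule, the `renderProfile` origin, and the session shape are assumptions chosen to mirror the behavior I observed from the outside.

```javascript
const STATIC_EXTENSIONS = [".css", ".js", ".png"];

// Hypothetical origin: ignores a trailing "static" extension and adds private
// fields when the logged-in user is viewing their own profile.
function renderProfile(path, session) {
  const username = path.replace(/^\/@/, "").replace(/\.(css|js|png)$/, "");
  let body = `public profile of ${username}`;
  if (session && session.username === username) {
    body += ` | email=${session.email}`;
  }
  return { status: 200, body };
}

// Hypothetical edge cache with a broad "cache anything with an extension" rule.
class ExtensionRuleEdge {
  constructor() {
    this.store = new Map();
  }

  handle(path, session) {
    const looksStatic = STATIC_EXTENSIONS.some((ext) => path.endsWith(ext));
    if (looksStatic && this.store.has(path)) {
      return this.store.get(path);
    }
    const response = renderProfile(path, session);
    // The flaw: storage depends only on the path, not on who asked or what came back.
    if (looksStatic) {
      this.store.set(path, response);
    }
    return response;
  }
}

const edge = new ExtensionRuleEdge();
// Step 1: the logged-in victim opens the crafted URL.
edge.handle("/@jsmith.png", { username: "jsmith", email: "jsmith@example.test" });
// Step 2: the attacker requests the same URL with no session.
const leaked = edge.handle("/@jsmith.png", null);
// For comparison: the normal profile URL requested anonymously.
const normal = edge.handle("/@jsmith", null);
console.log(leaked.body);
console.log(normal.body);
```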

Exploitation Scenarios

So what could a malicious hacker actually do by exploiting this cache deception vulnerability? Here are a few potential scenarios:

  • Steal CSRF tokens from a victim's cached profile page, then use them to perform actions on the victim's behalf (e.g. publishing a new blog post, changing account settings)
  • Access private user details like email addresses, bio, and social media links for targeted phishing, stalking, or harassment
  • Leak session tokens to hijack a victim's account and impersonate them on Medium
  • Poison the cache with misleading or inappropriate content that would be shown to other users who request the cached profile URL

Depending on the prevalence of vulnerable API endpoints and cache misconfigurations, the same exploits could be extended to other sensitive pages beyond just user profiles (e.g. private blog posts, payment forms, admin consoles). The impact really depends on what content the attacker can trick the cache into storing.

To make matters worse, attackers can automate the process of discovering and exploiting cache deception flaws. Using tools like Param Miner or Arjun, it's possible to quickly find URL paths with user-controlled input, fuzz them with common file extensions, then analyze the responses for caching headers and sensitive data. This is a great example of an area where traditional security scanners fall short, but a clever attacker can find valuable bugs.
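The "analyze the responses" step is easy to approximate. The sketch below is a hypothetical triage helper, not part of Param Miner or Arjun: the pattern list is deliberately crude, and the only cache signals it checks are a storable Cache-Control value and the standard Age header.

```javascript
// Crude patterns for data that should not appear in a shared cache.
const SENSITIVE_PATTERNS = {
  email: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/,
  csrfToken: /csrf/i,
  sessionToken: /session[_-]?(id|token)/i,
};

// Flag responses that both look cached/cacheable and contain sensitive-looking data.
function triageResponse({ headers, body }) {
  // Normalize header names to lowercase for lookup.
  const h = Object.fromEntries(
    Object.entries(headers).map(([k, v]) => [k.toLowerCase(), String(v)])
  );
  const cacheControl = h["cache-control"] || "";
  const cacheSignals = [];
  if (
    /(?:max-age|s-maxage)=[1-9]/.test(cacheControl) &&
    !/private|no-store/.test(cacheControl)
  ) {
    cacheSignals.push("cache-control allows storage");
  }
  if ("age" in h) {
    cacheSignals.push("Age header present");
  }
  const sensitiveHits = Object.keys(SENSITIVE_PATTERNS).filter((name) =>
    SENSITIVE_PATTERNS[name].test(body)
  );
  return {
    cacheSignals,
    sensitiveHits,
    suspicious: cacheSignals.length > 0 && sensitiveHits.length > 0,
  };
}

const cachedLeak = triageResponse({
  headers: { "Cache-Control": "max-age=14400", Age: "12" },
  body: '{"email": "a@example.test"}',
});
const safeResponse = triageResponse({
  headers: { "Cache-Control": "private, no-store" },
  body: '{"email": "a@example.test"}',
});
console.log(cachedLeak.suspicious, safeResponse.suspicious);
```

A hit from a helper like this is only a lead. Every flagged URL still needs the manual two-browser confirmation described earlier.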

Mitigation and Prevention Strategies

Fortunately, there are a number of defenses that developers and security teams can implement to prevent cache deception vulnerabilities:

  1. Use strict, granular caching policies: Instead of blindly caching based on file extensions, use explicit allow/deny lists for specific URLs that should be cached. Avoid caching any URL that includes user input.

  2. Separate static and dynamic content: Serve static assets like images, CSS, and JS from a separate (sub)domain that has a different caching policy than the main application. This helps prevent dynamic content from getting cached accidentally.

  3. Strip sensitive data from cached responses: Before caching a response, make sure to remove any user-specific information like CSRF tokens, session IDs, or PII. You can use a mechanism like Edge Side Includes (ESI) to selectively exclude blocks of content from being cached.

  4. Add cache-control headers: For authenticated pages or other dynamic content, explicitly set the Cache-Control header to no-store to prevent them from being cached entirely. For public responses that can be cached, use more granular headers like s-maxage, stale-while-revalidate, and must-revalidate to control the caching behavior.

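Here's an example of how to set a strict no-cache policy in Node.js/Express:

const express = require('express');
const app = express();

// Authenticated route: forbid every cache from storing the response.
app.get('/profile', (req, res) => {
  res.set('Cache-Control', 'no-store, no-cache, must-revalidate, private');
  res.send(/* profile data */);
});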
  5. Scan and test for cache vulnerabilities: Integrate automated scans for common cache-related misconfigurations into your SDLC. Tools like Web Cache Vulnerability Scanner and CacheFuzz can help identify areas of concern.

  6. Use cache-busting techniques: For critical pages that should never be cached, consider adding a random nonce or version number to the URL query string. This only helps when the cache includes the query string in its cache key; in that case, each unique parameter value misses the cache and forces a fresh response from the origin server.

  7. Validate cache keys: If user input must be included in a URL that gets cached, make sure to carefully validate and sanitize the input to prevent malicious characters or file extensions from being injected.
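Two of these defenses, the explicit allow list and the input validation, fit in a few lines. The sketch below is illustrative only: the path prefixes, the username rule, and the function names are assumptions, and it presumes the path has already been normalized (no `..` segments) before the check runs.

```javascript
// Allow list: only these (hypothetical) path prefixes are ever cacheable.
const CACHEABLE_PREFIXES = ["/static/", "/assets/"];

function isCacheablePath(path) {
  return CACHEABLE_PREFIXES.some((prefix) => path.startsWith(prefix));
}

// Input validation: a hypothetical username rule of letters, digits, and
// underscores, so a suffix like ".png" cannot be smuggled into the path.
const USERNAME_PATTERN = /^[A-Za-z0-9_]{1,30}$/;

function isValidUsername(username) {
  return USERNAME_PATTERN.test(username);
}

// Default to no-store; opt in to caching only for allow-listed paths.
function cacheControlFor(path) {
  return isCacheablePath(path) ? "public, max-age=86400" : "no-store";
}

console.log(cacheControlFor("/static/app.css"));
console.log(cacheControlFor("/@jsmith.png"));
console.log(isValidUsername("jsmith.png"));
```

The design choice worth copying is the default: anything not explicitly listed is treated as private, so a new route or a strange URL fails closed instead of landing in the cache.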

By adopting these best practices and educating developers about the risks of caching user-supplied data, we can make cache deception attacks a thing of the past.

Bug Bounty Insights

From a bug bounty perspective, cache deception vulnerabilities can be quite lucrative. While Medium only paid out $100 for my find, other companies have awarded much higher bounties for similar bugs.

For example, in 2020, security researcher Pouya Darabi earned a $3,000 bounty from Helium for a cache poisoning vulnerability that allowed arbitrary JavaScript injection. Similarly, Shubham Jain found a cache deception flaw in a private program that paid out $2,000.

Even within Medium's own bug bounty history, there are several public reports of cache-related issues with bounty awards ranging from $100 to $750. So while cache deception may not be as well-known as other bug classes, it's still a viable area for bounty hunters to explore.

The key is to focus on applications that heavily utilize caching (e.g. content-heavy sites like blogs, news outlets, forums) and look for URL paths that incorporate user-controlled input. Experiment with different file extensions, cache-buster query parameters, and variations of the URL structure to see how the cache behaves. And of course, automate as much as you can!
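As a starting point for that experimentation, here is a small sketch that generates URL variations to try. The function name, the `cb` parameter name, and the two variant shapes are my own choices for illustration; the cache-buster value gives each probe its own cache entry so you do not collide with entries created by real users.

```javascript
// Build probe URLs for one profile URL: each extension is tried both appended
// to the last path segment and as an extra path segment, with a cache buster.
function buildProbeUrls(profileUrl, extensions, cacheBuster) {
  const variants = [];
  for (const ext of extensions) {
    // Variant A: extension glued onto the last segment, e.g. /@user.png
    const appended = new URL(profileUrl);
    appended.pathname = `${appended.pathname}${ext}`;

    // Variant B: extra path segment, e.g. /@user/x.png
    const nested = new URL(profileUrl);
    nested.pathname = `${nested.pathname.replace(/\/$/, "")}/x${ext}`;

    for (const url of [appended, nested]) {
      url.searchParams.set("cb", cacheBuster);
      variants.push(url.toString());
    }
  }
  return variants;
}

const probes = buildProbeUrls("https://example.test/@alice", [".png", ".css"], "abc123");
console.log(probes);
```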

Conclusion

Web cache deception is a powerful technique that highlights the importance of implementing secure caching mechanisms. As we saw with the Medium example, even a subtle bug in how usernames are handled can lead to significant data exposure.

It's clear that web developers need to be extremely cautious about what content they allow to be cached, and put strong safeguards in place to prevent dynamic data from being stored and served to the wrong users. By adopting a defense-in-depth strategy that combines strict caching policies, input validation, and secure coding practices, we can keep our apps fast and safe.

For my fellow bug bounty hunters and security researchers, I encourage you to dig deeper into this underexplored vulnerability class. There are still many sites and APIs that haven't been thoroughly tested for cache deception flaws. With the right mindset and methodology, you could be the next person to discover a critical bug and make the web a bit safer.

If you enjoyed this deep dive into cache deception vulnerabilities, consider sharing it with your networks. You can also find me on Twitter where I regularly share web security insights and exploit demos. Until next time, happy hacking!
