From Out of Memory to Optimus Prime: Fixing a Mobile App Crash for Our Power Users

Imagine you're a small development team of just two people, building an Android app with a loyal user base of 40,000 daily active users. You release a new version, excited to deliver fresh features. But then, reports start trickling in—crashes and lags plague the app for some users. Yikes!

This article explores how we, as a small team, tackled a seemingly random out-of-memory crash that hit a specific group of our most valuable users: the highly engaged ones.

The Mystery of the Out of Memory Crash

Firebase Crashlytics, our crash reporting tool, pinpointed the culprit: an out-of-memory error. But here's the twist: the crash occurred for only a single user, and never on any of our own test devices. On top of that, previous app versions had worked flawlessly for everyone.

With a limited reproduction window and a crucial user affected, a quick fix wasn't the answer. We needed a deeper dive.

Uncovering the Power User Connection

Digging into user data revealed a fascinating detail – the affected user was a "power user" with a whopping 4,000 active conversations, placing them in the top 100 most engaged users. We suspected a connection between the high conversation volume and the crash, but replicating the issue remained elusive.

While waiting for user feedback (which can be slow!), we explored different theories. One suggestion was to optimize the app for a large number of conversations, potentially with features like paging. However, this conflicted with our app's core functionality – seamless offline access.

Reproducing the Problem: The Script Steps In

Knowing that a fix required replicating the issue, I created a script to automatically generate a large number of conversations, mirroring the affected user's scenario. Unfortunately, this script-generated account still didn't trigger the crash.
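
The sketch below shows the general shape such a seeding script could take. It is illustrative only: the `Conversation` fields, the `make_conversations` helper, and the JSON output are assumptions, and a real script would call whatever backend API or fixture actually creates conversations for a test account.

```python
import json
import random
from dataclasses import dataclass, asdict


@dataclass
class Conversation:
    """Hypothetical shape of a conversation record; real fields will differ."""
    conversation_id: int
    peer_user_id: int
    message_count: int


def make_conversations(count: int, seed: int = 0) -> list[Conversation]:
    """Build `count` synthetic conversations, each with a distinct peer."""
    rng = random.Random(seed)  # fixed seed keeps the test account reproducible
    return [
        Conversation(
            conversation_id=i,
            peer_user_id=100_000 + i,
            message_count=rng.randint(1, 200),
        )
        for i in range(count)
    ]


if __name__ == "__main__":
    # 4,000 mirrors the affected user's conversation count.
    conversations = make_conversations(4_000)
    payload = json.dumps([asdict(c) for c in conversations])
    print(f"generated {len(conversations)} conversations, {len(payload)} bytes")
```

Seeding the random generator means the same heavy account can be rebuilt for every test run, which matters when a crash depends on data volume.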

The Plot Thickens: More Information Emerges

Just as frustration began to set in, additional information surfaced. The crashes started happening for the user on the publicly available app version, not just the beta. We also asked the user to test two older versions, which was slow going because every round of feedback takes time.

This development raised a new question: was it a problem with the user's device, or a broader app bug? Our metrics provided some relief. Our North Star metric (a key indicator of app success) remained largely unaffected, while other metrics such as daily active users and crash rate actually improved for the production release.

The Missing Piece: Blocked Users

Determined to pinpoint the cause, my teammate noticed a connection – several crashes correlated with a high number of blocked users. The affected user, we discovered, had blocked a staggering 1,000 users! We identified sections of code that interacted with both blocked users and conversation data.

Here's where the script came in handy again. We modified it to not only create a ton of conversations but also a significant number of blocked users. This time, bingo! The crash was successfully reproduced, confirming the culprit – the interplay between a high number of conversations and blocked users.
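
As a rough illustration, the extended fixture might look like the sketch below. The `HeavyAccount` type, the `make_heavy_account` helper, and especially the `overlap` parameter are assumptions made for this example: all we know from the investigation is that the two volumes mattered together, not how blocked users and conversation peers overlapped.

```python
import random
from dataclasses import dataclass, field


@dataclass
class HeavyAccount:
    """Hypothetical test-account fixture; field names are illustrative."""
    conversation_peer_ids: list[int] = field(default_factory=list)
    blocked_user_ids: set[int] = field(default_factory=set)


def make_heavy_account(conversations: int, blocked: int,
                       overlap: float = 0.5, seed: int = 0) -> HeavyAccount:
    """Create an account with many conversations and many blocked users.

    `overlap` is the fraction of blocked users who are also conversation
    peers. This is an assumed knob, not something taken from real data.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0 and 1")
    rng = random.Random(seed)
    peers = list(range(100_000, 100_000 + conversations))
    from_peers = min(int(blocked * overlap), conversations)
    blocked_ids = set(rng.sample(peers, from_peers))
    # Fill the remainder with users who never had a conversation.
    next_id = 100_000 + conversations
    blocked_ids.update(range(next_id, next_id + (blocked - from_peers)))
    return HeavyAccount(peers, blocked_ids)


if __name__ == "__main__":
    # Volumes taken from the affected user: 4,000 conversations, 1,000 blocks.
    account = make_heavy_account(conversations=4_000, blocked=1_000)
    print(len(account.conversation_peer_ids), len(account.blocked_user_ids))
```

Exposing both volumes as parameters makes it easy to sweep them independently and see which combination first triggers the problem.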

The Fix and Beyond: Prioritizing Power Users

The fix itself was a simple code adjustment, just a few lines. We released a new version, proactively tested it with heavy user accounts, and received confirmation from the affected user within a day – the crash was gone!

Lessons Learned: Building a User-Centric Development Process

While this issue impacted a small fraction of users, it affected some of our most engaged ones, highlighting the importance of prioritizing their experience. Here's what we implemented to prevent similar situations:

Heavy User Approval: Before releasing a new version, we now seek approval from a group of power users.
Sanity Testing Enhancements: Our sanity testing script now includes scenarios with accounts mirroring high conversation and blocked user volumes.
Integration Testing: Every code change undergoes integration testing to ensure it doesn't crash with these power user scenarios.
Benchmarking: A new benchmark suite monitors login times and app responsiveness for power user accounts with each release.
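
To make the testing and benchmarking items above concrete, here is a minimal sketch of a budget check that runs a power-user scenario and fails when it gets too slow or allocates too much. Everything in it is an assumption for illustration: `visible_conversations` is a stand-in for real app logic, the time and memory budgets are arbitrary, and the check runs on a development machine rather than on a device, where the platform's own benchmarking tools would be the better fit.

```python
import time
import tracemalloc
from typing import Callable


def visible_conversations(peer_ids: list[int], blocked: set[int]) -> list[int]:
    """Stand-in for app logic under test: hide conversations with blocked peers."""
    return [peer for peer in peer_ids if peer not in blocked]


def measure(fn: Callable[[], object]) -> tuple[float, int]:
    """Run `fn` once and return (elapsed seconds, peak bytes allocated)."""
    tracemalloc.start()
    started = time.perf_counter()
    try:
        fn()
        elapsed = time.perf_counter() - started
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return elapsed, peak


def check_power_user_budget(max_seconds: float = 1.0,
                            max_peak_bytes: int = 10 * 1024 * 1024
                            ) -> tuple[float, int]:
    """Fail loudly if the power-user scenario exceeds its (assumed) budget."""
    peers = list(range(4_000))         # 4,000 conversations
    blocked = set(range(0, 4_000, 4))  # 1,000 blocked users
    elapsed, peak = measure(lambda: visible_conversations(peers, blocked))
    assert elapsed <= max_seconds, f"too slow: {elapsed:.3f}s"
    assert peak <= max_peak_bytes, f"too much memory: {peak} bytes"
    return elapsed, peak


if __name__ == "__main__":
    seconds, peak_bytes = check_power_user_budget()
    print(f"ok: {seconds:.4f}s, peak {peak_bytes} bytes")
```

Running a check like this on every change turns "works for power users" from a manual step into a gate that a regression cannot slip past unnoticed.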

These automated measures not only streamline our development process but also minimize the risk of future issues impacting our most valuable users. By focusing on a rapid feedback loop, proactive testing, and user-centric development, we aim to deliver a consistently smooth experience for everyone, from casual users to our app's champions – the power users.

Summary

This article details how our small development team resolved a critical crash affecting our Android app with 40,000 daily active users. The crash was difficult to reproduce and didn't occur on our test devices. By analyzing user data and simulating high conversation and blocked user volumes, we identified and fixed the issue. This experience led us to prioritize power users by enhancing our testing processes, seeking their feedback before releases, and monitoring performance metrics to ensure a smooth experience.
