Binary Size Woes
I wrote this post last week to share privately with some people who asked for it. Coincidentally, McLaren wrote a thread that blew up a few days later. It was surprising to me; I had no idea this was that interesting to people! So, I’m polishing up what I had already written and sharing it, more or less unchanged, below. Not much here is news after what was already disclosed on Twitter, but a few details are fleshed out. Maybe it’s interesting to some people.
Uber rewrote its iOS Rider app in Swift in 2016. We were early adopters of Swift, so we encountered a few surprises along the way.
About a year after the rewrite was released to users, in March 2017, a couple of engineers (robbert & mclaren) simultaneously & coincidentally realized that our app’s download size was growing much faster than our previous Objective-C app had grown. Robbert did some projections with historical size data he pulled using fastlane and calculated that we would surpass Apple’s then-current 100 MB over-the-air (OTA) download limit in three months. Crossing this threshold would mean that users would have to connect to WiFi to download the app. We didn’t know exactly how this would impact business metrics, but we hypothesized that it would depress growth.
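To make the projection idea concrete, here is a minimal sketch of a straight-line extrapolation from historical release sizes to a size limit. It is not the calculation Robbert actually ran; the function names and every number in `HISTORY` are invented for illustration, and a real projection would pull per-release sizes from build or App Store data.

```python
from datetime import date


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_limit(history, limit_mb):
    """history: list of (date, size_mb) pairs, oldest first.

    Returns the number of days after the last data point at which the
    fitted line reaches limit_mb, or None if the size is not growing.
    """
    start = history[0][0]
    xs = [(d - start).days for d, _ in history]
    ys = [size for _, size in history]
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        return None
    crossing_day = (limit_mb - intercept) / slope
    return max(0.0, crossing_day - xs[-1])


# Hypothetical download sizes, invented for illustration only.
HISTORY = [
    (date(2016, 12, 1), 76.0),
    (date(2017, 1, 1), 80.0),
    (date(2017, 2, 1), 84.0),
    (date(2017, 3, 1), 88.0),
]
OTA_LIMIT_MB = 100.0

if __name__ == "__main__":
    remaining = days_until_limit(HISTORY, OTA_LIMIT_MB)
    print(f"Projected days until the limit: {remaining:.0f}")
```

A linear fit is the simplest possible model; if growth were accelerating, it would overestimate the time remaining.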
Internally, this hypothesis was controversial. Colleagues who had worked at other companies with Very Large Apps said that they had seen no business impact to crossing the threshold — results that we were not sure generalized to our app (these companies had a better-known mobile web version of their app than we did). In addition, if we concluded that crossing the threshold had significant business impact, we would have to limit new code additions until we found a more sustainable solution to code growth. Naturally, product teams who depended on adding new functionality to the app to achieve their goals did not like the idea of being limited in what they could add.
In order for this to get the attention it needed, we needed to quantify the business impact of crossing the OTA download limit. If it was high, then we would:
- build tooling to give insights into binary size (e.g., across releases, commits, etc.);
- investigate solutions to reducing the binary size, short-term and long-term; and
- recommend a process for managing releases close to or over the OTA limit.
At the time, no team “owned” binary size, so this quantification exercise fell to a couple of us. It wasn’t straightforward. Looking at historical data from when we blew past the iOS 8 OTA limit, we couldn’t say anything definitive about the business impact. And if we intentionally bloated the app for one release, we would not have a real control group. Comparing growth metrics across releases was unreliable, due to seasonality, other experiments affecting the comparison, variable marketing spend, etc.
I proposed we take advantage of app thinning. A universal build, containing the executable architectures and resources for all devices, is uploaded to the App Store. Users download thinned variants, which contain a subset of the universal build’s executable architecture and resources — only those that are needed for the target device and operating system.
Our data scientist designed a switchback experiment, where we alternated bloating the download size for one group of devices (non-Plus devices), then the other (Plus devices). By manipulating the download size in a controlled way, we were able to observe the effect of download size independently of time effects or device-user characteristics. The Plus and non-Plus devices basically served as control groups for one another.
The release schedule looked something like this:
We analyzed the collected data and found an enormous drop in installs, with a disproportionately large effect on first trips (presumably people who download over cellular have higher intent to ride). Without going into specifics, the numbers were an order of magnitude larger than anybody had expected and triggered the immediate formation of a task force to fix it.
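As an illustration of why a switchback design lets the two device groups act as controls for one another, here is a minimal sketch of one possible estimator. It is not the analysis our data scientist ran. It assumes that time effects multiply both groups' installs equally within a period and that bloating multiplies the bloated group's installs by a constant factor. The dictionary keys and all install counts are hypothetical.

```python
import math


def switchback_effect(periods):
    """Estimate the multiplicative effect of bloating on installs.

    Each period is a dict with:
      'bloated'           -- which group got the bloated build: 'plus' or 'non_plus'
      'plus_installs'     -- installs on Plus devices in that period
      'non_plus_installs' -- installs on non-Plus devices in that period

    Within a period, the Plus/non-Plus install ratio cancels any time effect
    shared by both groups. Comparing that ratio between the two kinds of
    period cancels the fixed difference between the groups, leaving the
    square of the bloating effect.
    """
    log_ratios = {"plus": [], "non_plus": []}
    for period in periods:
        ratio = period["plus_installs"] / period["non_plus_installs"]
        log_ratios[period["bloated"]].append(math.log(ratio))
    if not log_ratios["plus"] or not log_ratios["non_plus"]:
        raise ValueError("need at least one period with each group bloated")
    mean_plus_bloated = sum(log_ratios["plus"]) / len(log_ratios["plus"])
    mean_non_plus_bloated = sum(log_ratios["non_plus"]) / len(log_ratios["non_plus"])
    return math.exp((mean_plus_bloated - mean_non_plus_bloated) / 2)


if __name__ == "__main__":
    # Hypothetical install counts, invented for illustration only.
    example_periods = [
        {"bloated": "plus", "plus_installs": 900, "non_plus_installs": 3100},
        {"bloated": "non_plus", "plus_installs": 1150, "non_plus_installs": 2700},
        {"bloated": "plus", "plus_installs": 870, "non_plus_installs": 2950},
        {"bloated": "non_plus", "plus_installs": 1080, "non_plus_installs": 2600},
    ]
    effect = switchback_effect(example_periods)
    print(f"Estimated install multiplier when bloated: {effect:.3f}")
```

A result below 1.0 would indicate fewer installs for the bloated variant. A real analysis would also need confidence intervals and a check that the two groups really do share time effects.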
Mapping Out the Solution Space
We found early on in our research that there was no silver bullet solution to this problem that met all of our constraints, meaning we would have to simultaneously investigate many ways to address it. At a high level, the work was organized into a few tracks:
- Talking to Apple
- Process changes
- Tooling improvements
- Build improvements
- Platform changes
- Cross-cutting changes
Talking to Apple
Given that Apple has control over the Swift compiler, the Xcode toolchain, and the App Store, they were in the best position to work on more permanent remedies. But, because of Apple’s longer release cadence, we knew we could not expect an immediate solution. Even if they acknowledged the problem, we would still have to pursue other solutions in the interim.
Our goal was to make Apple aware of the problem that Swift executables were much larger than their Objective-C equivalents in hopes they would prioritize work to improve executable size output.
The next major release of Swift, Swift 4, wound up including a few changes that delivered decent size improvements. And in September 2017, Apple increased the over-the-air limit to 150 MB.
Process changes
The process changes were actually pretty interesting, but fortunately we never had to rely on them too much. The engineering and product leads in charge of our Rider app and I worked on defining a process to keep our iOS binary size under the OTA limit. We wanted a process that:
- established a protocol on how to handle releases that were close to or over the OTA limit;
- gave us headroom to add essential fixes/features to a release if needed; and
- still allowed product teams to develop and release new features.
So, we began categorizing release candidates as Green, Yellow, or Red, based on their size:
- Green: sufficient headroom that we were not worried about the release.
- Yellow: small enough to not be over the OTA limit on any device, but with little headroom. A Yellow build would be released, but subsequent builds would not be until the size was Green again. In the meantime, critical new features and bug fixes would be cherry-picked onto the release branch.
- Red: over the OTA limit for at least one device class and would not be released. We never had a Red build, which was the goal.
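The categorization above can be sketched as a small check over the per-device variant sizes. This is an illustration only, not the tooling we used: the function name, the device-class labels, and the 3 MB headroom threshold are all hypothetical.

```python
from enum import Enum


class ReleaseStatus(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


def classify_release(variant_sizes_mb, ota_limit_mb, min_headroom_mb):
    """Classify a release candidate from its thinned variant sizes.

    variant_sizes_mb maps a device class to its download size in MB.
    The release is judged by its largest variant, since a single device
    class over the limit is enough to make the build Red.
    """
    largest = max(variant_sizes_mb.values())
    if largest > ota_limit_mb:
        return ReleaseStatus.RED
    if ota_limit_mb - largest < min_headroom_mb:
        return ReleaseStatus.YELLOW
    return ReleaseStatus.GREEN


if __name__ == "__main__":
    # Hypothetical sizes and threshold, invented for illustration only.
    sizes = {"plus": 98.2, "non_plus": 94.7}
    status = classify_release(sizes, ota_limit_mb=100.0, min_headroom_mb=3.0)
    print(status.name)
```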
In addition, we introduced per-team size accounting, with the aim of eventually enforcing quotas. Fortunately, this was never needed.
We also kicked off an effort to remove stale feature branches from the code, and required teams to condition features-in-progress out of the binary. We saw a few megabytes of one-time impact from removing stale branches. The Experiments team later added tooling to automatically detect and suggest the removal of stale experiments from the app.
Tooling improvements
The Amsterdam-based iOS Developer Experience team introduced a number of tooling improvements. When we detected the problem, we had no way to accurately measure the binary size before App Store submission. Xcode provides tooling to produce builds for all devices and measure their sizes, but its estimates are wildly inaccurate. This is primarily due to the opaque FairPlay encryption process used by Apple.
When a user downloads an app from the App Store, the app is signed and encrypted with FairPlay DRM for that specific user. Not every part of the app is encrypted, only certain parts of the __TEXT segment of the Mach-O executable file. Afterwards, the signed and encrypted IPA is compressed, and then delivered to the user’s device.
Note the order of operations here. Encryption is done before compression. Executable files are generally fairly compressible, since they contain repeated strings, function sequences, etc. However, encrypting the executable beforehand makes it virtually incompressible, since encrypted data looks random and has close to maximal entropy. When an app is built and tested locally, FairPlay DRM is not applied, so it compresses much better. Thus Xcode’s size estimates are far off.
To work around this, the iOS DevEx team reverse engineered which segments were encrypted for App Store distribution and mimicked the encryption with tools locally. We then compressed the app and outputted size metrics as part of CI builds. We never achieved 100% accuracy, but we were generally within a few hundred kilobytes, which was good enough to have reliable per-commit size metrics.
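The effect of encrypting before compressing can be demonstrated with a minimal sketch. This is not the DevEx team's tool, and it knows nothing about FairPlay or the Mach-O format: it simply overwrites chosen byte ranges with random bytes as a stand-in for encryption, then compresses with `zlib`. The function name, the toy "binary", and the choice of ranges are all assumptions made for illustration.

```python
import os
import zlib


def estimate_compressed_size(binary, encrypted_ranges):
    """Approximate the compressed size of a partly encrypted binary.

    binary           -- the raw bytes of the executable
    encrypted_ranges -- (start, end) byte offsets assumed to be encrypted

    Each range is overwritten with random bytes, which compress about as
    poorly as real ciphertext, before the whole buffer is compressed.
    """
    buf = bytearray(binary)
    for start, end in encrypted_ranges:
        buf[start:end] = os.urandom(end - start)
    return len(zlib.compress(bytes(buf), 9))


if __name__ == "__main__":
    # A toy, highly repetitive "executable", invented for illustration only.
    toy_binary = b"push;load;call;store;pop;ret;" * 4000
    half = len(toy_binary) // 2

    plain = estimate_compressed_size(toy_binary, [])
    mimicked = estimate_compressed_size(toy_binary, [(0, half)])

    print(f"raw size:                  {len(toy_binary)} bytes")
    print(f"compressed, no encryption: {plain} bytes")
    print(f"compressed, half mimicked: {mimicked} bytes")
```

The second estimate comes out far larger than the first because the randomized half is effectively incompressible, which is the same reason a local, unencrypted build understates the real download size.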
The team then added a number of features on top of this, such as size increase alerting and per-module size breakdown.
Build improvements
The Amsterdam-based iOS DevEx team and our Palo Alto-based Programming Languages team also investigated a number of build-level size improvements. We had a former LLVM engineer on the team, who uncovered a number of optimizations that reduced the executable size by close to 20%. Some of the changes were just using uncommon compiler/linker flags to tune the build process to be size-optimal, but many were quite brilliant, like using simulated annealing to determine the best compiler optimization pass ordering. The changes included:
- Enabling link time optimization for Objective-C
- Relocating strings to non-encrypted locations of the Mach-O executable
- Disabling loop unrolling
- Disabling Swift generic specialization
- Increasing the function inlining threshold
- Disabling Swift Whole-Module Optimization
- Running a simulated annealing algorithm to determine the binary size-optimal order of compiler optimizations and using this order instead of the standard order
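For readers unfamiliar with the last item, here is a minimal sketch of simulated annealing over an ordering. It is not the implementation our team used. The pass names are placeholders, and `toy_size_cost` is a made-up stand-in for the expensive real step, which would be to build the app with a given pass order and measure the resulting binary size.

```python
import math
import random


def anneal_order(items, cost, steps=5000, start_temp=1.0, cooling=0.999, seed=0):
    """Search for a low-cost ordering of items by simulated annealing.

    Each step swaps two positions. Improvements are always accepted; a
    worse ordering is accepted with probability exp(-delta / temp), which
    shrinks as the temperature cools. Returns (best_order, best_cost).
    """
    rng = random.Random(seed)
    current = list(items)
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    temp = start_temp
    for _ in range(steps):
        i, j = rng.sample(range(len(current)), 2)
        candidate = list(current)
        candidate[i], candidate[j] = candidate[j], candidate[i]
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
        temp *= cooling
    return best, best_cost


# Placeholder pass names and a hidden "ideal" order, for illustration only.
IDEAL = ["pass_a", "pass_b", "pass_c", "pass_d", "pass_e", "pass_f"]


def toy_size_cost(order):
    """Toy cost: the number of pass pairs out of order relative to IDEAL."""
    rank = {name: i for i, name in enumerate(IDEAL)}
    ranks = [rank[name] for name in order]
    return sum(
        1
        for i in range(len(ranks))
        for j in range(i + 1, len(ranks))
        if ranks[i] > ranks[j]
    )


if __name__ == "__main__":
    start = ["pass_f", "pass_c", "pass_a", "pass_e", "pass_b", "pass_d"]
    order, final_cost = anneal_order(start, toy_size_cost)
    print(order, final_cost)
```

With a real compiler in the loop, each cost evaluation is a full build, so the number of steps and the cooling schedule would have to be chosen with build time in mind.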
Platform changes
Here we explored changes to platform code that Uber wrote internally and that product developers built on top of. Most had some developer impact, potentially a large one-time impact, but would not have a large ongoing cost.
Some of our platform code, along with some of the generated networking code, was rewritten in Objective-C. We also changed most of our uses of Swift structs to classes due to size advantages.
A few binary-size-suboptimal code patterns were discouraged by adding lint checks for them. Heavy use of structs was one of them; there were others I no longer remember.
Cross-cutting changes
We investigated a number of larger-impact changes, including rewriting significant portions of the app in Objective-C. In our testing, we found Objective-C executables to be at least 50% smaller than executables built from similar Swift code at the time. If needed, as an absolute last resort, we believed we could rewrite parts of the app in Objective-C and encourage more future code to be written in Objective-C. This would require rewriting most of the platform libraries in Objective-C and porting features over, which would be an arduous process.
Simultaneously, a team was developing a framework for server-driven UI. It was not ready for general use at the time, but we advocated for increasing its funding since it would provide a scalable way to add more features to the app without increasing the app size significantly. We hoped this was going to be the solution for scalable product growth.
Our goal was to stay under the OTA limit while also not stopping product development for any significant time. We achieved this, and accomplished other important things along the way.
The experiment brought awareness to the problem, and tooling was built, and is still maintained, to track binary size. Our results proved compelling enough that further experiments were run to measure the impact of incremental megabytes of download size on growth metrics, on both iOS and Android. The Android team found a significant impact for each marginal megabyte of download size, and this helped them advocate for building a light version of the Android Rider app. We also began to measure the impact of other metrics, such as on-disk install size, on user retention.
After the initial scramble to make sense of the problem space, developer productivity was not significantly impacted and our process largely worked. We found sufficient fixes at the build and platform level to bring down the binary size without any significant impact to developer productivity.
It was a lot of effort, but we overcame most of the issues we saw with Swift, so we never switched back to Objective-C. There was an unfortunate side-effect to the problem we saw with Swift: we became very wary of new language adoption. When Android engineers wanted to adopt Kotlin, there was a lot of resistance to doing so.
Members from our iOS DevEx & Programming Languages teams participated more in the Swift mailing lists to encourage or add compiler optimizations to generate smaller executables and, indeed, future versions of Swift have improved in this regard (e.g., #8018, #8909, and -Osize).
A few months after we approached Apple, they increased the OTA limit to 150 MB. Some time later, they increased it to 200 MB. As of iOS 13, they removed the limit entirely and simply prompt users to confirm they want to download a large app over cellular data.
Did we have to rewrite the app in Swift in the first place? I think the rewrite was necessary, but Swift was a mistake. It was a mistake, though, that pushed me, and others, to learn and figure out things that we wouldn’t have otherwise. I got deeper into Xcode, the Swift compiler, the linker, and iOS than I would have otherwise. The language and tooling got better as a result.
Why is the app so big? It can be hard to imagine any justifiable reason for Uber, or almost any app for that matter, to be so large. Uber is deceptively complex. At some point there were multiple mapping libraries embedded in the app due to regulatory and business strategy reasons. Geofences for airports, no-pickup zones, etc. are included. And a ton of business logic, complying with municipal, state, and federal laws, has to be included. The app also ships with localization files for 43 languages, which adds a few megabytes. Add to that, every product team wants to add features to the app, so the architecture has to scale to let them do so, while also not breaking the app if they get it wrong. It’s complicated, and it kind of has to be.
Although Apple removed the hard limit on OTA downloads, I suspect binary size still matters, though I’m not aware of any recent studies that show it. Segment once did a study that suggests a ~0.5% install-rate drop per incremental MB, which is roughly in line with what we saw. It would be interesting to see what the impact is nowadays.
Thanks to Richard Howell & Reid Main for proofreading, feedback, and keeping me honest.