Hilary Roberts cut her teeth in product working with dozens of startups from the University of Edinburgh to test their value propositions and find their first customers. In 2013 she moved to Skyscanner, one of the world’s largest travel search sites. She is now product manager for the Flights Group, the company’s largest business vertical, with more than 50 million users per month globally.

Transcript

So, good afternoon. Thanks for sticking with us. It’s a big day today, loads of talks. Skyscanner started about 13 years ago as 3 guys in an attic in Leith with a spreadsheet that they used to find the cheapest flights to take their ski holidays every year. And 13 years later, we do effectively the same thing. We just do it at scale. So, we aggregate millions of prices and content from thousands of travel providers, and when you come to our site, we help you compare all of those options and find the best flight for your trip.

Now, I joined Skyscanner just over three years ago, and so I’ve had a chance to see some of that scale in action and some of the differences that it’s made in how we approach product development. So today I just want to share a few thoughts with you about the role of experimentation in particular, and go through three case studies which show some of the things we’re learning from experiments.

So, first…you know, I was warned about this clicker, so we’re just gonna do a test here. Okay. The first thing, in 2011 Eric Ries published his book The Lean Startup, and since then, it’s become best practice for how we do product development in startups and many other internet companies as well. And the basic premise of The Lean Startup is that we need to identify our hypotheses about who our customers are, what they need, and we need to validate each of these hypotheses in turn using an experiment. And in his book Eric Ries outlines this in three stages: the build, measure, learn loop, and we typically tackle it in reverse order.

So we think about which assumption or hypothesis we want to learn about, and then we figure out how we can measure whether or not that hypothesis is true, and then we build an experiment that helps us take that measurement, and finally we take the learning from that measurement back into our next iteration of the loop. And two more critical things about this methodology. So, one, the whole point of it is to reduce waste in our product development cycle and make sure we build something that somebody genuinely needs. And the second thing is, we want to go through that loop as quickly as possible because that’s our engine for growth. If we can quickly learn about what our customers need, and equally what they don’t, then we can drive accelerated growth in our company.

Now, if you think about those things, it’s called the Lean Startup methodology, but big companies want that, too, right? We want to eliminate waste in our product development cycle and we want to have the growth profile of a startup. And the thing is that some of these dynamics change as you scale, and Lean Startup doesn’t apply in the same way. So, just to give you some figures on how Skyscanner has grown, we now have 50 million users a month, we have 800 people in our company, and we have 40 product managers in the team. And some of the things that are different when we’re trying to apply the Lean Startup methodology at this scale is that, for one thing, when you have that many people, you have to divide them into teams, right, and they can’t all focus on the same thing. They can’t all focus on the major metric for your company. They have to start to optimize in different areas of the product.

So, that’s one big difference from a startup to a larger organization. The second thing is, you’re just focusing on different problems, right? When you’re a startup, you’re trying to figure out who’s your first customer? What value can you bring to them? When you start to scale, you’ve largely answered those questions already. So the kinds of problems you can use to continue to drive growth are very different. And the third thing that’s really different is just quite practical. As a startup, you have to be laser focused, because if you introduce waste into your development cycle or you build things that people don’t need, then you go out of business, right? But as you scale, you start to have a little more of a cushion, and so you can relax a little bit and not go out of business tomorrow. There’s always a risk that a competitor will eclipse you a year down the line, or some other startup will come up and make you redundant, but that risk is way more remote for your workforce as you start to scale.

So these are some practical differences when you’re trying to apply The Lean Startup as your company grows. And the risk then is that as you go from startup to scale, your build, measure, learn loop starts to look like this. Instead of being a really quick, tight iteration, it starts to kind of slow down and be a bigger iteration like this. And instead, what we really want is to have lots of tight iterations like this and we want to take advantage of the fact that we have lots of people, lots of teams who want to be running build, measure, learn loops in parallel, and we want them to have the intensity of focus that they have in a startup, so that we can have the same growth profile.

So, I want to talk to you now about something we’ve been doing at Skyscanner. We’ve had a lot of practice recently thinking about these loops and how to eliminate waste in our development cycle because we’ve run hundreds of them. And one advantage of having scale and lots of users is that you can run A/B tests in particular, and that’s just one form of experiment you could use in a build, measure, learn loop, but it’s one that’s particularly effective at scale. In an A/B test, you give two variants of your product simultaneously to two different cohorts of your user base. And by doing that simultaneously, the main thing you get from an A/B test that you can’t get anywhere else is that you can identify the causal relationship between the change you’ve made to the product and the impact it’s having on your users.
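
To make the mechanics concrete, here is a minimal sketch of one common way users are split into cohorts; it is an illustrative assumption, not Skyscanner’s own implementation, and the function and experiment names are hypothetical. Hashing a user identifier together with the experiment name gives each user a stable, repeatable variant assignment, which is what makes the simultaneous comparison possible.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministically bucket a user into one of the experiment's cohorts.

    Hashing the user id together with the experiment name keeps the
    assignment stable across sessions and independent between experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Hypothetical example: the same traveler always lands in the same cohort.
print(assign_variant("traveler-42", "hotels-homepage-redirect"))
```

Hash-based assignment is a common design choice because it needs no stored state and gives the same answer every time the same user comes back.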

So we decided that we really wanted to start doing A/B testing at scale, and about 18 months ago we’d never run a single A/B test, and this chart shows our acceleration, and last month we ran about 250 tests. Each of these was an opportunity for us to practice build, measure, learn at scale, right? So I’m gonna show you one other data point which is pretty interesting, which is the count of individual hypotheses that we were testing with these experiments, and this isn’t a perfect metric. We’re kind of collecting it manually. But even if it’s not perfect, I think it’s pretty obvious that there’s a big gulf between these two lines, right? So, what this shows is that for every hypothesis we want to test, we’re running multiple experiments, and that looks like a form of waste in our product development cycle. Okay, so what’s going on there? So let’s look at hypotheses first in more detail.

This is the standard format we use for all hypotheses at Skyscanner, and I know we saw another example from a talk earlier today. This is pretty similar. So, based on a particular insight, we believe or predict that a specific product change will cause a very specific impact. An example might be that we observe that travelers disproportionately purchase flights to very popular destinations for their holidays, and so we believe that by changing the sorting of our results to popularity instead of price, we can increase conversion rates. That would be an example of a hypothesis. And with a hypothesis like that, there are two possible outcomes, right? Either you run the test and it’s successful, so we change the sort order and, indeed, we do see an increase in conversion rate; or we run that test and we see a decrease in conversion rate. And from either of those, we learn something we can take into the next iteration of our build, measure, learn loop.
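
For readers who want that template in a more concrete form, here is one hedged way the insight / change / predicted-impact structure could be captured in code; the field names are illustrative assumptions and this is not the template Skyscanner has open-sourced.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """Insight -> proposed change -> predicted, measurable impact."""
    insight: str
    change: str
    predicted_impact: str
    success_metric: str

# The sort-by-popularity example from the talk, expressed in that shape.
sort_by_popularity = Hypothesis(
    insight="Travelers disproportionately purchase flights to very popular destinations.",
    change="Sort search results by destination popularity instead of price.",
    predicted_impact="Conversion rate increases.",
    success_metric="conversion_rate",
)
print(sort_by_popularity)
```

Writing the predicted impact and its metric down before the test runs is what later lets you call the outcome successful, failed, invalid, or flailed rather than rationalizing it afterwards.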

Now, if these are the only outcomes that we had at Skyscanner, we should see a one-to-one relationship between the two charts I showed you earlier, right? For every hypothesis, we run one experiment, and we get an answer. So that’s obviously not happening. So we looked at some of the tests we were running, and we found that there were two other outcomes… I’m glad the slide came with me. Okay. So, first successful and failed, and then two other outcomes, invalid and flailed. So an invalid experiment is one that looks a lot like a successful experiment, except that it is successful because it could not have failed. And we’ll talk a little bit more in one of the examples about how that happens.

The second example is flailed, and a flailed experiment is one that failed because it never could have succeeded. And I have a…yeah, this guy. This is how you feel when you run one of these experiments, and that’s why it’s called a flailed experiment. So, I’ll give you an example. Okay, so the first case study, we ran an A/B test where we have this insight that we have a flights product. A lot of people like us for flights comparison, but we also have a hotels product, and we do exactly the same thing for hotels that we do for flights. We aggregate hundreds of results, millions of results, and we allow you to do comparison on our site to find the best deal for your trip.

And so our insight was, if people like us for flights, they’ll probably like us for hotels if they like that value proposition. And also it’s pretty obvious that lots of people who buy flights also buy hotels for those same trips. So we ran this A/B test where instead of redirecting people or landing people on our flights homepage, which is on the left, we landed them on our hotels homepage. And the idea was, if we introduced our users to hotels, what we should see, if we’re successful, is more searches on our hotels product and more conversions on our hotels product.

Okay, so what happened? Well, we ran this test, and what we saw was, indeed, many more searches in hotels and many more conversions, and we also didn’t notice any degradation in the flights metric. So we saw that even though not everybody, obviously, wanted a hotel when they landed on our website, they found it pretty easy to navigate back to flights and continue with their journey, so it looked really successful. Well, it looked like that until we started to get some really interesting user comments, and I’ll show you this one. These are amazing, right? This guy is so angry that we’ve run this test, and he writes, “Skyscanner, you are really annoying me with your hotels and car hire options. You are not called Hotel Scanner or Car Scanner.” And I can’t argue with that, right? “And if you want to diversify into other services, you should have thought about that before you named your company.” And then he gives this amazing example of Compare the Market, and if they’d chosen to be called Compare the Car Insurance Market. And then finally he ends by saying, “Please stop defaulting me to hotels. I’m here for flights.”

And this example is so interesting because you can see he’s grasping at every piece of logic he can find to try and compel us not to run this test anymore and to change the behavior. And so, we called this test in the end invalid, and the reason is, it was just not capable of failing during the A/B test. We were able to measure the upside, which was the increase in conversion rate and searches in hotels, but we weren’t, in the A/B test, able to measure the agony that it was causing some of our users, and the friction that inevitably would have led them to leave our product. So that’s an example of an invalid test. And the hallmark of these tests, the thing to watch out for in your own organizations, is that they tend to happen when you’re trying to release a new product or a new feature, and you’re so concerned about seeing whether people engage with that feature that you don’t develop an experiment that actually tests the possible downside. That’s how they tend to occur.

Okay, so, next case study. We wanted to increase the propensity of travelers to enter our search funnel from our news pages. And we write lots and lots of travel content, and that content gets picked up by news organizations and it gets pushed around social media, and millions of users arrive on our website through this kind of content. And the thing that’s interesting about that for us is that the aha moment for our customers is when they do their first search. That’s when they understand the value that Skyscanner provides above and beyond going to an airline site or a travel agent, it’s when they do a search. And so, here we have millions of users landing on our site. They tend to read these articles and then leave again. And so, we wanted to run a test to see if we could bring more of them into the search funnel.

So, let’s see, yeah, okay, so now here’s the test. We’ve put our search controls, which is that big gray box on the right-hand side. We parked them under the news pages. And the hypothesis here was, these controls work very well, they’re highly optimized, they’re on our homepage, and they’re a trivial way for us to introduce travelers to search, or, sorry, readers to search. And also, we know that these people are interested in travel because they’re reading our travel content. Okay, so what happened? Well, we ran this test, and nothing happened. So we put these search controls on these pages, and you cannot miss them, right, but no one was entering the search funnel, or it was just a negligible amount.

And so, our verdict here was that this test was flailed. It could not possibly have succeeded, and it’s kind of easy to see that looking back. The people reading these pages, they don’t necessarily know what Skyscanner is yet, right? They just found a link on a social site. They found it on a news site. They wanted to read this content, but they’re not necessarily looking to book a flight today, and they don’t know that that’s what we offer. So this is a really clunky way of trying to introduce them to the value proposition that we have. And this test, in the end, probably wasn’t worth running. We should have just thought this through in advance, and then thought of a way to do a more targeted introduction to our product for this type of user.

So, again, the hallmark of a test like this is where you design something that is probably a little too minimal to accomplish your goals. The conversations that will start leading you to a flailed test are things like this, “Why don’t we just test this idea?” or, “I don’t know, why don’t we just throw it out there and see what happens?” That should be a big red flag that you’re about to do something that is totally worthless, okay?

So, final case study, the right metric for travelers. Earlier this year, I was writing one of the most difficult emails of my life to our senior management team, and the email was basically this, “We’d found two major problems for travelers on our site, and the good news was we’d found a solution, but it required voluntarily taking the single largest degradation to our conversion metric that we had ever seen.” So it was not a pleasant email to write, and I’ll just show you what the problems were. So, here’s a traveler on our site, and we started to see these in usability tests. They would find a flight, and then they would have this thought, “Okay, I like this flight, now I want to see more detailed information.”

And so, we have a very big call to action button. It’s bright green, like it shows in this cartoon, and they would tap that button, and immediately they would be redirected off our site to book with one of our partners. And the reason is, we think if you click that button, you’re ready to go book, so that’s what we do. We send you off to the partner site to make a booking, and we count that as success because that’s our number one conversion point in our company, right? So here we’d identified a case where we’re hurling people off our site who were not ready to go. We hadn’t solved a problem for them, and we were counting it as success. So that was problem number one.

Problem number two, we identified in our analytics, which was that travelers appeared to be only really choosing the cheapest provider that we had on the site. And this was important to us because our whole job is comparison, and if everybody just chooses the cheapest provider, it’s not obvious that travelers are really engaging in comparison on our site, and so maybe they’re missing part of the value we could be bringing to them. So we’d identified those two problems, and then we ran a test to try and fix it. So, here we are again, and I’ll show you the new layout we tried. So same user finds a flight that they like, and this time, when they hit the select button, we change the layout, and when they hit the button, instead of immediately ejecting them from the site, we drop down a list of all the providers who they could choose to book a flight with and the prices.

Okay, now, this doesn’t answer the traveler’s problem immediately, but crucially she’s still on our site, and she realizes, “That was not the button I was looking for,” right? And then she notices that we have a details button, and when she clicks that button it takes her to the content that she’s looking for. In this case, it was more detailed information about the stopover, how long it is, etc. Yeah, I think someone at the back is helping me because I’m just really struggling with the clicker. So, perfect. We solved her problem. What about the other problem, that people aren’t doing comparison on our site? Well, we also found that by running this test, people started to look at providers lower down the list, and, in fact, just by changing the layout in this way, we increased comparison behavior by 400%.

And I’ll show you that now. So, here’s a guy saying, “Hey, this flight looks good to me.” He clicks the Select button, and we drop down that list of providers, and now he realizes there’s somebody else he could book with, like British Airways, and he’s got a loyalty scheme with them, so he chooses them instead of the cheapest provider. So, that’s success on both counts, two major problems solved for travelers. We’re increasing engagement with travelers on our site. They’re getting more value out of our product, right, but in order to do that, we have to stop ejecting them off our site when they’re not ready, and the trouble is that’s how we count conversion at Skyscanner.

So, in the end, we called this test successful, but it was because we discovered that there was an underlying flaw in the way we were counting success, right? Because normally, if you introduce a major drop in conversion rate, that would be a pretty easy test to call failed. In this case, the value for travelers was still there, so we called the test successful, and we’ve implemented this change now in our product. Okay, so, what have we learned? Using Lean Startup methodologies is meant to help us reduce waste in our product management cycle. But when you start to scale, those dynamics change, and we’ve walked through two case studies today of things you can look for in your own experimentation where you might be introducing waste in your product development cycle, and those are invalid and flailed tests.

We also looked at an example where it would have been very easy to make the wrong decision about what was best to do for travelers because the metric itself had underlying biases and assumptions. And so what we’ve learned is that it’s really important when you’re building a data-driven culture to balance both science and sensibility. And Eric Ries wrote in his book something reasonably prescient, which was that any time a team attempts to justify its failures by resorting to learning as an excuse, it’s engaged in pseudoscience. And we find that as our company scales, we’re perhaps more susceptible to engaging in pseudoscience because we don’t have the concrete risk of going out of business like a startup does. It’s very easy to engage in lots of experimentation but not have a direct benefit for travelers.

So, as we learn about these things, we’re writing a lot about what we learn, and we’re also open-sourcing some tools on our engineering blog, “Code Voyagers.” And there are two specific things that we’ve open-sourced which may be of interest to you. We have a template for writing hypotheses and for experiment plans. We also have an analysis toolkit for A/B testing. So I invite you to go see what we have and use them if they’re useful to you. So my final thought for you today is that as we scale we have to be on our guard against this kind of pseudoscience, and the only way around it is to apply both science and sensibility. And as product managers and product people, you have a critical role to play in helping your organizations find that balance. Thank you very much.
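
To give a flavor of the kind of calculation an A/B testing analysis toolkit performs, here is a generic sketch of a two-proportion z-test comparing conversion rates between two cohorts; this is not the toolkit published on Code Voyagers, and the figures in the example are made up.

```python
from math import sqrt
from statistics import NormalDist

def conversion_z_test(conversions_a: int, users_a: int,
                      conversions_b: int, users_b: int) -> float:
    """Two-proportion z-test: two-sided p-value for the difference in
    conversion rate between cohorts A and B."""
    p_a = conversions_a / users_a
    p_b = conversions_b / users_b
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    se = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical figures: 1,200 conversions from 100,000 users in A
# versus 1,320 conversions from 100,000 users in B.
print(conversion_z_test(1200, 100_000, 1320, 100_000))
```

A production toolkit naturally has more to worry about, such as sample-size planning and the temptation to peek at results early; this sketch only shows the core comparison.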
