

Airbnb runs a two-sided marketplace where guests need to find destinations and hosts need to price homes — two problems most of the industry solves badly. It uses one AI system to help guests, a very different one to help hosts, and a third system, built on a large language model, that is now quietly rewiring how Airbnb's engineers build everything else.

Company: Airbnb Industry: Travel / Two-Sided Marketplace Core concept: Marketplace matching, dynamic pricing, generative AI in production
Also in this chapter: Lab 6: Build a Demand-Based Pricing Model in Python →
MIS 432 · AI in Business · Case Study

Airbnb:
Dynamic Pricing & Generative AI in a Two-Sided Marketplace

From destination recommendations to Smart Pricing to LLM-driven engineering — how one company uses three very different AI systems to reduce friction for three very different kinds of users
Level: Upper-division undergraduate Topics: Two-sided marketplaces, dynamic pricing, hybrid ML + structural modeling, generative AI in the enterprise Concepts introduced: 18 key business AI terms

Primary sources: This case study is based on three Airbnb Engineering blog posts: Recommending Travel Destinations to Help Users Explore (2026), Learning Market Dynamics for Optimal Pricing (2018), and Accelerating Large-Scale Test Migration with LLMs (2025).

Contents
1. Company Background
2. The Big Idea: Two Sides, One Platform
3. The AI Factory at Airbnb
Part I — The Guest Side
4. Data: The Language of Travel Intent
5. Model: Embeddings, Transformers, and Region/City Prediction
6. From Dormant Users to Autosuggest: Closing the Prediction–Decision Gap
Part II — The Host Side
7. Dynamic Pricing and Lead Time
8. Why Pure ML Wasn’t Enough: Hybrid Models & Demand Aggregations
9. The Prediction–Decision Gap on the Host Side
Part III — Turning AI Inward
10. LLMs for Test Migration: 1.5 Years to 6 Weeks
Synthesis
11. Responsible AI at Airbnb
12. Competitive Advantage: The Two-Sided Data Moat
13. Summary Table & Discussion Questions

1 Company Background

2008
Founded in San Francisco
5M+
Hosts worldwide
8M+
Active listings
220+
Countries & regions
2 sides
Guests & hosts

Airbnb was founded in 2008 by Brian Chesky, Joe Gebbia, and Nathan Blecharczyk. The original pitch was absurdly simple — rent out an air mattress in your apartment during a design conference when hotels were sold out. That idea grew into one of the largest travel platforms in the world, with more than 8 million active listings in over 220 countries and regions, and more than 5 million hosts who earn income by renting space to travelers.

What makes Airbnb different from a hotel chain, an airline, or even a platform like Booking.com is the shape of the business. Airbnb owns no properties. It does not employ the people who clean the rooms, answer the doors, or set the prices. Every one of its listings is a unique home offered by an independent host, and every booking is a transaction between two people who may never meet. Airbnb’s job is to be the layer in the middle that makes those transactions happen.

That middle-layer position turns out to be an enormous AI problem. Every night, millions of potential guests are trying to figure out where to go, and millions of hosts are trying to figure out what to charge. Neither side fully knows what they want, and neither side can be expected to. The guest has a vague idea of a trip; the host has a vague sense of what similar homes nearby are charging. Airbnb’s job is to turn those two vague signals into a specific match at a specific price on a specific date — hundreds of thousands of times a day.

Strategic context
Netflix uses AI to pick the best image for a show. Spotify uses AI to pick the next song. Uber uses AI to forecast demand. Airbnb has to do all of those kinds of things — but on two sides of a marketplace at the same time. That is the defining challenge of the business, and it is the lens through which every AI decision in this chapter should be read.

2 The Big Idea: Two Sides, One Platform

Every business in this course so far has had one primary customer. Netflix has subscribers. Spotify has listeners. Uber has riders (yes, drivers matter, but the business problem is framed around getting a rider a ride). Airbnb is structurally different. Airbnb has two customers who need completely different things from the same platform — and the platform only works when both sides get what they need.

Key concept
Two-sided marketplace and the matching problem
A two-sided marketplace is a platform whose value depends on bringing together two distinct user groups who need each other — buyers and sellers, riders and drivers, guests and hosts. The platform’s core job is matching: connecting a specific person on one side to the right specific person on the other side, at the right moment, at a price both sides will accept. Airbnb is a two-sided marketplace for stays. If there are no hosts, guests have nothing to book. If there are no guests, hosts have no income. If the matching is slow, wrong, or priced badly, both sides leave. Every major AI system Airbnb builds is ultimately in service of better matching.

Now here is the twist that shapes the rest of the chapter: neither side of the Airbnb marketplace fully knows what it wants. That might sound strange — aren’t people who visit a travel website the ones with an idea of where they want to go? Sometimes. But often, no.

Key concept
User intent modeling: exploratory vs. transactional users
Every person who lands on a product has some level of intent — some combination of what they want, how urgently, and how specifically. Transactional users know what they’re doing: they’re searching for a specific city on specific dates because their trip is already planned. Exploratory users don’t. They’re thinking “I want to go somewhere in France this summer” or just “I need a vacation.” Both types open the same app and click the same search bar — but the right experience to show them is completely different. A good recommendation for a transactional user is a listing; a good recommendation for an exploratory user is a destination. A business that can’t tell these users apart will frustrate both of them.

Exploratory users introduce an even harder problem for any recommendation system: if the user doesn’t know where they want to go, how is the system supposed to know what to suggest? This is a classic problem in recommender systems, and it shows up whenever a user has very little history for the system to learn from.

Key concept
The cold-start problem
The cold-start problem is what happens when a recommendation system is asked to personalize for someone it doesn’t know much about. A brand-new user has no viewing history, no listening history, no previous bookings — so standard “people like you also liked” logic has nothing to work with. At Airbnb, cold-start also includes a subtler version of itself: a user who has booked before but only returns to the platform every few months, or a user who knows only that they want to travel but hasn’t decided where. The cold-start problem on the demand side is the main problem Airbnb’s destination recommendation system is built to solve. The cold-start problem on the supply side shows up too — a brand-new listing has no booking history, which is one reason Smart Pricing leans heavily on patterns from similar listings rather than the listing itself.

On the host side, the parallel is nearly exact. A host has a home, a calendar, and an approximate sense of what to charge. But they don’t know what a Thursday in early October will actually be worth three months from now — nobody does. Markets move. Big events pull demand forward. Weather shifts bookings around. A host trying to price by intuition is guessing against information the platform already has in aggregate. That is the supply-side version of the same fundamental problem: the person on this side of the marketplace does not fully know what they want (or what they should charge), and the platform is better positioned to help them figure it out than they are on their own.

Guests need help discovering destinations. Hosts need help pricing listings. Both problems are really the same problem — understanding demand patterns at scale — viewed from opposite ends.

3 The AI Factory at Airbnb

We introduced the AI Factory framework in Chapter 3 with Spotify and applied it again with Uber in Chapter 4. The same five steps organize Airbnb’s AI systems too — data, model, prediction, decision, value — with one important twist. Airbnb runs parallel AI Factories: one on the guest side, one on the host side, both drawing from the same underlying data layer, and in the last few years, a third running inside the engineering organization itself.

1. Data
2. Model
3. Prediction
4. Decision
5. Value
Guest-side factory
Destination Recommendations
Uses guests’ browsing, searching, and booking history to suggest where to go — in the search bar and in re-engagement emails. Covered in Part I.
Host-side factory
Smart Pricing
Uses demand patterns, lead times, and similar-listing data to suggest how much to charge for each night on a host’s calendar. Covered in Part II.

Each track has its own data, its own model, and its own deployment surfaces — but they share a foundation. Every search a guest does becomes training data for the destination recommender, but it is also evidence about demand that feeds Smart Pricing. Every booking a host accepts teaches Smart Pricing about what prices clear the market, but it is also a signal the destination recommender uses to infer where travelers actually want to go. The two factories compound each other. The more guests Airbnb has, the better it can price for hosts. The better prices hosts offer, the more guests convert. This is the marketplace flywheel, and AI is what keeps it spinning.

Part III of this chapter looks at a third, newer use of AI at Airbnb: not customer-facing at all, but pointed at Airbnb’s own engineering organization. That will raise a different set of questions — but the fundamental framework stays the same.

Part I — The Guest Side

4 Data: The Language of Travel Intent

Here is a problem. A user opens the Airbnb app. They have done four things in the last three months: booked a home in Lisbon, searched twice for “Tokyo” without booking, clicked on a listing in Mexico City, and opened the app today without typing anything. What does that tell you about where they want to go next?

A human looking at that list can probably spot a pattern — an adventurous traveler who likes cities, maybe thinking about somewhere warm. But a machine cannot reason that way unless the data is structured to let it. The central insight in Airbnb’s destination recommendation system is that a user’s actions can be treated the way a language model treats words.

Key concept
Behavioral sequences as training data
In a language model, each word in a sentence is a token, and the model learns that certain tokens tend to follow certain other tokens — “peanut butter and” is usually followed by “jelly.” Airbnb applies the same idea to user behavior. Each action a user takes — a booking in Lisbon, a search for Tokyo, a click on a Mexico City listing — is treated as a token in the user’s personal travel history. Sequences of those tokens become training data. The model learns that certain patterns of past actions tend to lead to certain future destinations, in the same way a language model learns that certain sequences of words tend to lead to certain next words. This is a powerful reframe, because it means Airbnb can reuse the architecture of modern language models (transformers) to understand travel intent — a problem that has nothing to do with language on its face.
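The token analogy above can be made concrete. A minimal sketch, using an invented `Action` record and token format (Airbnb's real event schema is not public):

```python
# Sketch: representing a user's history as a sequence of action "tokens".
# The Action fields and the token format are illustrative, not Airbnb's schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    kind: str      # "booking", "search", or "view"
    city: str
    days_ago: int  # recency of the action

history = [
    Action("booking", "Lisbon", 90),
    Action("search", "Tokyo", 7),
    Action("search", "Tokyo", 6),
    Action("view", "Mexico City", 1),
]

# Each action becomes one token; the model sees the ordered sequence,
# exactly as a language model sees an ordered sequence of words.
tokens = [f"{a.kind}:{a.city}" for a in history]
print(tokens)
# ['booking:Lisbon', 'search:Tokyo', 'search:Tokyo', 'view:Mexico City']
```

Once behavior is in this form, everything a sequence model can do with words it can do with actions: the training question becomes "given these tokens, what destination comes next?"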

But not every action carries the same meaning. A booking three years ago and a search three minutes ago are both data points, but they say very different things about what a user wants right now. Airbnb’s system is explicitly designed to pull two different kinds of signal out of the same behavioral sequence.

Key concept
Short-term vs. long-term interest signals
A user’s behavior on Airbnb carries two kinds of information. Short-term signals — views and searches in the last few days — reveal what the user is actively considering right now. They’re urgent, specific, and fragile (the user could change their mind tomorrow). Long-term signals — historical bookings — reveal durable preferences: this person likes coastal cities, books in summer, tends toward mid-range prices. Neither signal alone is enough. If Airbnb only looks at recent searches, it overreacts to momentary curiosity. If it only looks at history, it ignores the fact that the user is actively researching Tokyo right now. The model explicitly combines both — short-term signals to catch active interest, long-term signals to ground it in durable taste. This is why the model looks at booking history, view history, and search history separately rather than mashing them into one giant list.

On top of these behavioral sequences, Airbnb layers contextual signals — things like what time of year it is. A user’s long-term pattern might say “beach destinations,” but if they’re searching in December, the model should probably weight warm-weather beaches over Cape Cod. The current date is an input to the model in the same way a word’s position in a sentence is an input to a language model.

5 Model: Embeddings, Transformers, and Region/City Prediction

Once you’ve decided to treat user actions as tokens, you need a way to represent each token as something a model can actually do math with. Spotify’s chapter introduced the answer: embeddings.

Key concept
Embeddings (at Airbnb)
Recall from Chapter 3: an embedding is a compact list of numbers that represents everything the system has learned about an item. Similar items get similar embeddings and end up close together in a mathematical space; very different items end up far apart. Spotify uses embeddings for songs and users. Airbnb uses them for cities, regions, and dates. Lisbon gets an embedding, Porto gets an embedding, and if they co-appear in enough guests’ travel patterns, the two embeddings end up close together — meaning someone who shows interest in one is likely to be interested in the other. Each user action in the sequence is represented as the sum of three embeddings: the city it involved, the region that city sits inside, and how many days ago it happened. That single compact representation captures both what the user did and when.
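The three-embedding sum can be sketched with tiny hand-set 4-dimensional vectors and an invented recency bucketing; real embeddings are learned during training and are much larger:

```python
# Sketch: each action's embedding is the sum of a city embedding, a region
# embedding, and a recency embedding. All vectors and the bucketing scheme
# below are made up for illustration; Airbnb's are learned from data.
CITY = {"Lisbon": [0.9, 0.1, 0.0, 0.2], "Porto": [0.8, 0.2, 0.1, 0.2]}
REGION = {"Portugal": [0.5, 0.5, 0.0, 0.0]}
RECENCY = {  # bucketed "days ago" embeddings
    "this_week": [0.0, 0.0, 1.0, 0.0],
    "this_year": [0.0, 0.0, 0.0, 1.0],
}

def recency_bucket(days_ago):
    return "this_week" if days_ago <= 7 else "this_year"

def action_embedding(city, region, days_ago):
    """One vector capturing what the user did (city, region) and when."""
    vecs = (CITY[city], REGION[region], RECENCY[recency_bucket(days_ago)])
    return [sum(xs) for xs in zip(*vecs)]

emb = action_embedding("Lisbon", "Portugal", days_ago=3)
print(emb)  # approximately [1.4, 0.6, 1.0, 0.2]
```

Note how the same city produces different embeddings depending on how long ago the action happened: recency is baked into the representation itself, not bolted on later.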

Once each action is an embedding, the model processes the whole sequence of actions with a transformer — the same class of architecture that powers modern language models like the ones behind ChatGPT and Claude. The technical details are beyond the scope of this course, but the business intuition is straightforward: a transformer is very good at looking at a sequence and figuring out which parts of it matter most for predicting what comes next. When the user’s history is “booked Lisbon two years ago, searched Tokyo three times last week, clicked Mexico City yesterday,” the transformer learns which of those actions deserves the most weight in predicting today’s destination.

But there is one more wrinkle. The model is not just trying to predict a city. It is trying to predict geography at multiple levels at once.

Key concept
Multi-task learning
Multi-task learning is a design choice where one model is trained to predict several related things simultaneously instead of just one. At Airbnb, the destination model predicts both which region a user is interested in (say, the San Francisco Bay Area) and which specific city within that region (San Francisco, Oakland, San Jose). The two predictions share almost all of the same underlying structure and are forced to be consistent with each other. This matters for two reasons. First, users who like San Francisco often end up booking in San Jose or Oakland — the system shouldn’t treat those as unrelated. Second, training on both tasks together gives the model more signal to learn from than training on either one alone — the patterns that help predict regions also help predict cities, and vice versa. Multi-task learning is a common and practical way to squeeze more value out of the same dataset.
[Figure: diagram of a shared transformer trunk over the user action sequence, feeding a region head ("Bay Area") and a city head ("San Francisco")]
Figure 1: Multi-task learning in Airbnb’s destination model. A single transformer processes the user’s action sequence and feeds two prediction heads — one for region, one for city. The two heads share their underlying understanding of geography, which produces richer and more consistent predictions than training two separate models would.
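The two-head idea can be sketched with hand-set linear scoring weights standing in for the learned transformer and heads. Constraining the city head to the predicted region is one simple way to force the consistency described above; the weights and place names here are illustrative:

```python
# Toy sketch of two prediction heads sharing one representation.
# Weights are hand-set; in the real system a transformer learns the shared
# representation and both heads are trained jointly on the same data.
shared = [1.4, 0.6, 1.0, 0.2]  # pooled representation of the action sequence

REGION_HEAD = {"Bay Area": [0.2, 0.9, 0.1, 0.0], "Portugal": [0.9, 0.3, 0.2, 0.1]}
CITY_HEAD = {
    ("Portugal", "Lisbon"): [1.0, 0.2, 0.1, 0.0],
    ("Portugal", "Porto"): [0.6, 0.1, 0.1, 0.0],
    ("Bay Area", "San Francisco"): [0.1, 0.8, 0.0, 0.0],
}

def score(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))

# Head 1: pick the region with the highest score.
region = max(REGION_HEAD, key=lambda r: score(REGION_HEAD[r], shared))

# Head 2: score cities only within the predicted region -> consistency.
city = max(
    (c for (r, c) in CITY_HEAD if r == region),
    key=lambda c: score(CITY_HEAD[(region, c)], shared),
)
print(region, city)  # Portugal Lisbon
```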

6 From Dormant Users to Autosuggest: Closing the Prediction–Decision Gap

Here is a training-data puzzle worth sitting with. Airbnb wants the destination model to work for everyone — not just the user who just searched five times in the last week, but also the user who booked once in 2024 and hasn’t logged in since. If you train the model only on users who are currently active, you get a model that is great at the obvious case and useless for the harder, more valuable case of bringing back a lapsed user. If you train it only on dormant users, you can’t serve the active ones well. Airbnb’s answer is to design the training data to include both.

Key concept
Active vs. dormant users as a training-data design choice
Most ML decisions happen in the modeling step — what architecture to use, what features to include, how to train. But some of the most consequential decisions happen one step earlier, in how the training data itself is constructed. At Airbnb, every past booking generates not one but fourteen training examples: seven examples that look as if the booking were made by an “active user” in the final week of planning (using recent searches and views), and seven examples that look as if the booking were made by a “dormant user” who hasn’t been on the platform for months (using only the user’s historical booking data, not their recent activity, because a dormant user has no recent activity). This deliberately teaches the model how to behave in both situations. It is a reminder that the people designing AI systems make decisions about who the system is being built for before any code is written — and those decisions are ethically and commercially consequential.
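The seven-plus-seven expansion can be sketched directly. The field names and feature layout below are invented for illustration; the point is the deliberate pairing of an "active" and a "dormant" view of the same booking:

```python
# Sketch: one booking expands into 14 training examples -- 7 "active user"
# snapshots (recent activity visible) and 7 "dormant user" snapshots
# (booking history only, recent activity hidden).
def make_examples(booking, recent_actions, past_bookings):
    examples = []
    for day in range(1, 8):  # the 7 days leading up to the booking
        examples.append({            # active-user view
            "label": booking, "snapshot_days_before": day,
            "features": {"recent": recent_actions, "history": past_bookings},
        })
        examples.append({            # dormant-user view: no recent activity
            "label": booking, "snapshot_days_before": day,
            "features": {"recent": [], "history": past_bookings},
        })
    return examples

ex = make_examples("Tokyo", ["search:Tokyo", "view:Osaka"], ["booking:Lisbon"])
print(len(ex))                                              # 14
print(sum(1 for e in ex if e["features"]["recent"] == []))  # 7
```

The dormant-user examples are not degraded copies of the active ones; they are the only way the model ever learns what to do when recent activity simply does not exist.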

Once the model is trained, Airbnb has a predicted region and a predicted city for each user. But a prediction is not a decision. What should the business actually do with it?

Key concept
The prediction–decision gap (guest side)
We met the prediction–decision gap in Chapter 2 with Netflix. A model tells you what a user is likely to do; a decision tells you what the business should actually do about it. The model here predicts “this user is interested in Tokyo, Osaka, and Kyoto,” but that prediction on its own does nothing. Airbnb has to turn it into specific actions at specific moments: which three cities to show in autosuggest when the user taps the search bar, which destination to feature in the subject line of an abandoned-search email, how prominently to surface each one. Different decisions will follow from the same prediction depending on the surface, the user’s recent behavior, and what Airbnb is trying to accomplish (drive a booking today? re-engage a dormant user? encourage discovery?). The destination model is the inference engine. The product decisions around it are where the business value actually lives.
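One way to picture the decision layer that sits on top of the model. The surface names and rules here are hypothetical, not Airbnb's actual logic; the point is that one ranked prediction feeds different decisions on different surfaces:

```python
# Sketch: the same ranked prediction produces different business decisions
# depending on the surface. Surfaces and rules are illustrative.
def decide(predicted_cities, surface):
    if surface == "autosuggest":
        # real-time, high intent: show the top three cities in the search bar
        return {"action": "show_suggestions", "cities": predicted_cities[:3]}
    if surface == "reengagement_email":
        # delayed, low intent: feature the single best destination
        return {"action": "send_email", "featured": predicted_cities[0]}
    return {"action": "noop"}

pred = ["Tokyo", "Osaka", "Kyoto", "Sapporo"]  # model output, best first
print(decide(pred, "autosuggest"))
print(decide(pred, "reengagement_email"))
```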

Airbnb has deployed the destination model in two places so far: autosuggest in the search bar, and abandoned-search re-engagement emails. The two deployments illustrate opposite ends of the user-intent spectrum.

Business insight
Notice how the same model powers two completely different product experiences. Autosuggest is a real-time, high-intent surface — the user is literally about to type. Abandoned-search email is a delayed, low-intent surface — the user has left and may or may not come back. One model, trained once on cleverly designed data, now serves both. This is the economics of AI that matures: the most expensive part of an AI system is usually not the model itself — it’s assembling the data and infrastructure. Once that’s done, the same prediction can be deployed across many surfaces at low marginal cost.

Part II — The Host Side

7 Dynamic Pricing and Lead Time

Now flip the marketplace around. A host has just listed a two-bedroom apartment in San Diego. They need to decide what to charge for every single night on their calendar — not just tonight, but six months from now, twelve months from now. Prices that are too high leave the calendar empty. Prices that are too low leave money on the table. And the “right” price for a Thursday in early October is different in July than it is in September, because the market keeps moving.

Key concept
Dynamic pricing
Dynamic pricing is the practice of adjusting the price of a product or service in near-real-time as market conditions change — demand rises, supply tightens, competitors move. Airlines have done this for decades; hotels and ride-sharing platforms do it now; Airbnb does it through a feature called Smart Pricing, which suggests a nightly price for each date on a host’s calendar and updates those suggestions as the booking date approaches. For Airbnb hosts specifically, dynamic pricing is harder than it is for a hotel, because every listing is unique. There’s no “standard double room” to benchmark against — every home has its own size, quality, location, and personality. That uniqueness is the heart of the problem the pricing model has to solve.

One of the most important variables in dynamic pricing has nothing to do with the listing itself. It has to do with time — specifically, when a booking is made relative to the check-in date.

Key concept
Lead time distribution
The lead time of a booking is the gap between when it was made and when the guest actually checks in. A guest booking New Year’s Eve on December 1st has a lead time of 30 days. A guest booking a last-minute stay for tonight has a lead time of zero. The lead time distribution is the full shape of how bookings spread out across those gaps for a given check-in date and location: what share are booked 90 days out, what share 30 days out, what share last-minute. This matters enormously for pricing, because the shape differs. High-demand dates like New Year’s Eve see a lot of early bookings — people plan ahead. A beach town like South Beach sees a lot of last-minute bookings. A supply-constrained city like San Francisco during a conference gets booked weeks ahead. If a host prices a peak date as if most bookings will come in the last two weeks, they will leave the calendar open too long and miss the early wave. Understanding the lead time distribution lets Smart Pricing tell the host when to raise prices, when to lower them, and by how much.
Key concept
Stochastic arrival process
The word stochastic just means “random with a pattern.” A stochastic arrival process is the technical name for what bookings do: they arrive over time in a way that looks random at any single moment, but has clear patterns when you zoom out. You can’t predict exactly when the next booking for a specific listing will come in — but across thousands of listings and check-in dates, the overall pattern is highly predictable. Smart Pricing’s job is not to guess the next individual booking. It’s to forecast the shape of that arrival pattern for each check-in date, so hosts can price appropriately as the date gets closer. Framed this way, dynamic pricing becomes a forecasting problem (Uber’s chapter is worth re-reading here) with a pricing decision bolted onto the end.
[Figure: two lead-time curves, share of bookings vs. days before check-in; high-demand dates (e.g. NYE) skew early, last-minute markets skew late]
Figure 2: Two different lead time distributions. The same nightly rate is wrong for both of these dates. A high-demand date like New Year’s Eve needs prices to rise early, because most bookings will happen far in advance. A last-minute market needs prices held back, because holding out longer yields better returns. The lead time distribution is what tells Smart Pricing which regime each night on the calendar is in.
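An empirical lead time distribution is straightforward to compute from booking records. A sketch with made-up bookings and arbitrary bucket boundaries:

```python
# Sketch: computing an empirical lead time distribution from
# (booking_date, checkin_date) pairs. The data is illustrative.
from datetime import date
from collections import Counter

bookings = [
    (date(2024, 10, 1), date(2024, 12, 31)),   # NYE stay, booked 91 days out
    (date(2024, 12, 1), date(2024, 12, 31)),   # NYE stay, booked 30 days out
    (date(2024, 12, 30), date(2024, 12, 31)),  # NYE stay, booked 1 day out
]

def lead_time_shares(bookings, buckets=(7, 30, 90)):
    """Share of bookings per lead-time bucket: <=7d, <=30d, <=90d, >90d."""
    counts = Counter()
    for booked, checkin in bookings:
        lead = (checkin - booked).days
        label = next((f"<={b}d" for b in buckets if lead <= b), f">{buckets[-1]}d")
        counts[label] += 1
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

print(lead_time_shares(bookings))
```

At real scale this is computed per cluster and per check-in date; the resulting shape is exactly the curve the structural model in the next section tries to capture with a handful of parameters.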

8 Why Pure ML Wasn’t Enough: Hybrid Models & Demand Aggregations

Here is where the Smart Pricing story gets interesting — and where it connects directly back to Uber. The obvious move for a company like Airbnb is to throw a powerful machine learning model at the problem. Collect every feature you can find about each listing and each check-in date, train a huge model to predict the optimal price, and deploy it. Airbnb’s data science team considered exactly that approach, and rejected it.

The reasons were specific. With millions of listings, each with its own characteristics, and each check-in date getting booked at most once, the data is desperately sparse at the individual-listing level. The prediction problem has three dimensions at once (listings × check-in dates × lead days), which makes it computationally unwieldy. And perhaps most importantly, Airbnb’s team had good domain intuition about the shape of the lead time distribution — they could see that bookings tended to accumulate in a pattern that matched a well-known family of statistical distributions. A pure ML approach would ignore all of that intuition.

Key concept
Structural modeling vs. pure machine learning
A structural model is one where a human expert specifies the general shape of the relationship between the inputs and the outputs, and the model’s job is just to fill in the parameters. A pure machine learning model specifies no shape — it learns the entire relationship from data. Each approach has a strength and a weakness that mirror each other. ML models tend to be more accurate when there’s enough data; structural models tend to be more interpretable when you need to explain what’s happening and why. Chapter 4 framed this as the black-box-vs.-interpretable tradeoff at Uber. Airbnb’s Smart Pricing team framed it as a choice: accept the accuracy of pure ML and lose the ability to explain it to hosts, or accept the interpretability of a structural model and lose some accuracy. They refused to pick one. Instead, they built a hybrid.
Key concept
Hybrid models: ML + domain knowledge
A hybrid model combines the best of both approaches: a human expert specifies the structure of the relationship (say, “we believe bookings arrive in a pattern that matches a known statistical distribution”), and an ML model’s job is reduced to predicting just the handful of parameters that define that pattern. This is exactly what Smart Pricing does. The team knew from both theory and data that lead time distributions had a specific mathematical shape. So instead of asking the ML model to predict a full curve for every listing and every date, they ask it to predict just five parameters that define the curve. The ML model does what ML is good at (learning from lots of features), and the structural assumption does what expert knowledge is good at (imposing a shape that prevents nonsense predictions and makes the outputs interpretable). In Airbnb’s own reporting, this hybrid approach cut forecast error roughly in half compared to a pure ML baseline — and, crucially, produced predictions the team could actually explain.
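The hybrid idea can be sketched end to end. The real model predicts five parameters of an unpublished distribution family; this illustration assumes a two-parameter logistic shape for the cumulative booking curve and fits it by grid search, which stands in for the ML model that would predict the parameters from listing features:

```python
# Sketch of the hybrid approach: a human-specified structural shape
# (here, a logistic curve -- an assumption for illustration only) plus a
# fitted parameter vector (here found by grid search; in the real system
# an ML model predicts the parameters from features).
import math

def booked_share(days_out, midpoint, steepness):
    """Structural assumption: cumulative share booked follows a logistic
    curve that rises as the check-in date approaches (days_out shrinks)."""
    return 1.0 / (1.0 + math.exp(steepness * (days_out - midpoint)))

# Observed cumulative share booked at each lead time (illustrative data).
observed = {90: 0.05, 60: 0.15, 30: 0.50, 14: 0.80, 3: 0.95}

def fit(observed):
    """Find the two parameters minimizing squared error on the curve."""
    best, best_err = None, float("inf")
    for midpoint in range(5, 61):
        for steep10 in range(1, 31):          # steepness 0.01 .. 0.30
            s = steep10 / 100
            err = sum((booked_share(d, midpoint, s) - y) ** 2
                      for d, y in observed.items())
            if err < best_err:
                best, best_err = (midpoint, s), err
    return best

midpoint, steepness = fit(observed)
print(midpoint, steepness)
```

Notice what the structure buys you: the fitted curve is guaranteed to be monotone and S-shaped no matter how noisy the data is, and the two numbers it produces ("bookings pick up around day 30") are something a host-facing product can actually explain.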

But there’s still one problem left. Airbnb has millions of listings, and each one is unique. Even with a hybrid model, trying to fit five parameters for every individual listing would run into the same sparsity wall — most listings simply don’t have enough bookings to estimate reliable numbers. The workaround is an idea that should feel familiar from Spotify.

Key concept
Clustering and demand aggregations
Recall from Chapter 3 that clustering is what happens when a system groups similar items together without being told what the groups should be — it’s an unsupervised learning technique. Spotify uses clustering to discover that certain songs appeal to the same kinds of listeners even when those songs sound nothing alike. Airbnb uses the same idea on listings. By looking at which listings guests tend to consider together in a single search session — what the team calls the “path to purchase” — the system learns which homes share similar audiences. Listings that frequently co-appear in guests’ browsing end up grouped together into clusters Airbnb calls demand aggregations. Each cluster then shares a lead time distribution, which fixes the sparsity problem: instead of fitting parameters for a single listing with three bookings a year, the model fits them for a cluster of thousands of listings that share a common audience. The lesson travels: when the data is too thin at the individual level, group intelligently and model at the group level.
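A toy version of the audience-based grouping: treat every pair of listings co-viewed in one session as linked, then take connected components with a union-find structure. Airbnb's actual demand-aggregation method is more sophisticated than this, and the session data below is invented:

```python
# Sketch: grouping listings into "demand aggregations" from co-view sessions.
# Listings that co-appear in a guest's session share an audience.
from itertools import combinations

sessions = [  # listings one guest viewed in a single session (illustrative)
    ["lisbon_loft", "lisbon_flat", "porto_house"],
    ["lisbon_flat", "porto_house"],
    ["tokyo_apt", "osaka_apt"],
]

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for session in sessions:
    for a, b in combinations(session, 2):  # co-viewed -> same cluster
        union(a, b)

clusters = {}
for listing in parent:
    clusters.setdefault(find(listing), set()).add(listing)
print(list(clusters.values()))
# two clusters: the Lisbon/Porto audience and the Tokyo/Osaka audience
```

Each resulting cluster is then large enough to estimate a shared lead time distribution reliably, which is precisely the sparsity fix described above.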
Why this matters for managers
Pure ML is not always the best answer — even when you have lots of data and lots of compute. Smart Pricing is a case where combining a human intuition about the shape of the problem with an ML model produced better results than pure ML alone. This is a recurring pattern in applied AI. Teams that know their domain well enough to say “the answer has to take this form” often beat teams that just throw bigger models at the problem. AI does not replace expertise; it amplifies it. A manager evaluating an AI proposal should ask not just “what model?” but “what do we know about this problem that the model doesn’t need to learn from scratch?”

9 The Prediction–Decision Gap on the Host Side

Smart Pricing takes the predicted lead time distribution for each cluster, combines it with information about the specific listing and the specific check-in date, and produces a suggested nightly price. That suggestion gets displayed in the host’s calendar view. The host can accept it, override it, or ignore it entirely.

That single design choice — suggest, don’t set — is worth pausing on, because it’s a live example of a tension this course has been tracking since Chapter 2.

Key concept
Automation vs. augmentation (on the host side)
Chapter 2 defined this tradeoff: an AI system can automate a decision (the machine chooses and acts) or augment a human decision (the machine recommends and the human chooses). Netflix automated artwork selection. EveryCure augmented drug researchers. Smart Pricing is explicitly augmentation, not automation. Airbnb could set the price directly — it has the data and the models to do it — but it chose not to. Hosts keep final control over their own calendars. Why? Partly legal (Airbnb doesn’t own the home), partly ethical (the host bears the consequences of a bad price, not Airbnb), and partly strategic (hosts who feel coerced into pricing decisions churn off the platform). The same AI capability that could have been automation is deployed as augmentation because of a business judgment about the relationship between platform and host.

That design choice does not make the ethical questions go away. When Smart Pricing suggests lowering a nightly rate to match an apparent market softening, the host is under real pressure to accept — the alternative is a dark calendar. The suggestion is not a command, but it is not neutral either. A suggestion from an algorithm that sees the entire market is hard to argue with from a position of seeing just your own property. We’ll return to this in §11.

Part III — Turning AI Inward

10 LLMs for Test Migration: 1.5 Years to 6 Weeks

The first two parts of this chapter looked at AI systems Airbnb built to serve its customers. This third part is different. It is about a system Airbnb built to serve itself — and it is the clearest example in this course so far of how generative AI is actually being integrated into companies that already have deep ML capability.

The problem: Airbnb had about 3,500 React component test files written with Enzyme, a testing framework the React ecosystem had largely moved on from. They needed to rewrite those files using a newer framework, React Testing Library, while preserving the original testing behavior — throwing the tests away would mean losing coverage of the code they exercise. Engineers originally estimated the rewrite at about 1.5 years of engineering time if done by hand. In 2025, Airbnb completed it in six weeks, using a pipeline built around a large language model.

3,500
Test files to migrate
1.5 yrs
Original manual estimate
6 weeks
Actual time with LLMs
75%
Files migrated in first 4 hours
97%
Files migrated after 4 days of tuning

That is not a small result, and the underlying mechanics matter more than the headline number. To understand what Airbnb actually did, you first need a clear distinction between the kind of AI we’ve been discussing in this course and the kind of AI that handled this migration.

Key concept
Generative AI vs. predictive AI
Every AI system in this course until now has been predictive. A predictive model is trained to answer a specific question (what’s the click-through rate? what’s the demand? where does this user want to travel?) by learning patterns in historical data. The output is usually a number or a ranked list. Generative AI is different. It produces new content — text, code, images — that didn’t exist before, by learning patterns from huge collections of unstructured data. Large language models (LLMs) like the ones behind ChatGPT and Claude are generative. Airbnb’s destination recommender is predictive: it takes a user’s history and predicts a destination. An LLM is generative: it takes a prompt and produces new code. The two kinds of AI are built differently, scale differently, fail differently, and should be governed differently — but an increasing number of real production systems (including the one in this section) stitch them together.

Why was this problem a good fit for an LLM and not for the kind of custom model built in Parts I and II? Three reasons. First, the task is code, not behavior — there’s an enormous amount of public JavaScript code in the world that an LLM has already been trained on. Second, the task is generative, not predictive — the output is a new test file, not a score or a ranking. Third, the task has an objective correctness check — the migrated test either passes or fails when you run it. You don’t need a human to rate the output; the machine can tell immediately whether it worked. That third property is crucial, and it led directly to the most important design decision in the whole project.

The pipeline: the LLM is the engine; the workflow is the engineering

Here is the part that business students should pay closest attention to. Airbnb’s engineers did not just hand 3,500 files to a language model and hope for the best. They built an automated pipeline around the model that looked more like a factory than a chatbot. The structure was roughly:

  1. Break the job into small, checkable steps. Each file was pushed through a sequence of distinct stages — refactor the Enzyme code, fix any issues in the test framework, fix linting and type errors, then mark complete. Each stage had a validation check: did the output actually work?
  2. When a step fails, retry with better context. If a stage failed, the system automatically re-prompted the LLM — this time including the specific error message and the most recent broken version of the file. Most files that failed the first attempt succeeded within about 10 retries.
  3. For the hard cases, give the model more to work with. On complex files, the prompts grew to include up to 50 related files from the same project — sibling test files written in the right style, the source code of the component being tested, examples of well-written tests. Prompts ballooned to 40,000–100,000 tokens each.
  4. Run the loop at industrial scale. Files ran in parallel, progress was tracked by automatically inserting a comment into each file recording its migration status, and engineers could re-run any specific step against any specific subset of files. After the first bulk run, 75% of files were done in four hours.
  5. Chip away at the long tail. For the remaining 25%, the team ran a “sample, tune, sweep” loop: look at a handful of stuck files, figure out the common issue, update the prompts to fix it, re-run everything. Four days of this got them to 97%. The last 3% were fixed by hand.
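The first three steps above can be compressed into a sketch. This is an illustrative reconstruction, not Airbnb’s code: `migrate_file`, `run_llm`, `run_validation`, and `gather_context` are hypothetical stand-ins for the pipeline components the blog post describes.

```python
# Illustrative sketch of the staged migrate -> validate -> retry loop.
# Not Airbnb's actual code: run_llm, run_validation, and gather_context
# are hypothetical stand-ins for the real pipeline components.

MAX_RETRIES = 10  # most failing files succeeded within ~10 retries
STAGES = ["refactor_enzyme", "fix_test_framework", "fix_lint_and_types"]

def migrate_file(source, run_llm, run_validation, gather_context):
    """Push one test file through the stages; return (status, result)."""
    current = source
    for stage in STAGES:
        error, broken_attempt = None, None
        for attempt in range(MAX_RETRIES):
            prompt = {
                "stage": stage,
                "file": current,
                # Retries include the error message and the most recent
                # broken version; hard cases widen the context with
                # related files (up to ~50 in the real pipeline).
                "error": error,
                "last_attempt": broken_attempt,
                "related_files": gather_context(attempt),
            }
            candidate = run_llm(prompt)
            ok, error = run_validation(stage, candidate)
            if ok:
                current = candidate
                break
            broken_attempt = candidate
        else:
            # Exhausted retries at this stage: long tail, fix by hand.
            return ("needs_human", stage)
    return ("migrated", current)
```

The design point is that the model never gets to declare itself done: every stage ends in a mechanical check, and the check — not the LLM — decides whether the file advances.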
Key concept
Human-in-the-loop engineering workflows
The most important lesson from Airbnb’s test migration is not about the LLM. It’s about everything around the LLM. The language model itself is one component in a much larger system: a validation pipeline, a retry loop, a context-gathering mechanism, a progress-tracking system, and a human-directed tuning loop. The LLM provides capability; the engineering around it provides reliability. You could swap in a different LLM tomorrow and the system would keep working — but if you removed the validation, retry, and context machinery, the raw LLM would be unusable at scale. This is the same lesson from Spotify’s chapter applied to a different generation of AI: the model is not the moat. The infrastructure around the model is the moat.

The business math

Now the arithmetic. The original estimate was 1.5 years of engineering time. The actual cost was six weeks of engineering time plus the LLM API bill. Airbnb reports the total came in at a fraction of the original estimate. But a careful student should not just celebrate the headline number — they should think about what those six weeks of engineer time were spent on. Almost none of it was “writing tests.” It was building the pipeline, tuning the prompts, reviewing the failures, and fixing the last 3% by hand. The LLM did the typing. The engineers did the architecting.
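The arithmetic is worth making explicit. A quick back-of-the-envelope check — the intermediate file counts below are derived from the reported percentages, not independently reported:

```python
# Back-of-the-envelope math on the reported migration numbers.
total_files = 3500
estimate_weeks = 1.5 * 52   # original manual estimate: ~78 weeks
actual_weeks = 6            # actual elapsed time with the LLM pipeline

speedup = estimate_weeks / actual_weeks          # 13x
after_first_run = round(0.75 * total_files)      # done in the first 4 hours
after_tuning = round(0.97 * total_files)         # done after 4 days of tuning
fixed_by_hand = total_files - after_tuning       # the hand-finished 3%

print(f"{speedup:.0f}x faster")                       # → 13x faster
print(after_first_run, after_tuning, fixed_by_hand)   # → 2625 3395 105
```

A roughly 13× compression of calendar time — with the residual 105 files absorbing most of the remaining human effort.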

Key concept
AI as labor substitution vs. labor augmentation
When an AI system takes over a task that used to be done by a person, two things can happen. In substitution, the work that was previously done by humans is now done by the machine, and fewer humans are needed. In augmentation, the same humans still do the work, but the machine makes them much more productive. The Airbnb test migration looks like substitution at first — the LLM wrote code that engineers would otherwise have written — but looked at more closely, it’s augmentation. The engineers didn’t go away. They moved up the stack: from writing individual test files (what the LLM did) to designing a pipeline, selecting context, and debugging edge cases (what the engineers did). The total output was dramatically higher, but the engineers were still there. A recurring business question as generative AI spreads through companies will be: does this deployment substitute for a role, or augment the people in it — and which does the company actually want?
What this says about GenAI in established businesses
Notice which kind of company pulled this off. Airbnb was not a startup tinkering with a chatbot. It was a company with more than a decade of ML experience, a mature engineering culture, well-developed monitoring and testing discipline, and engineers who already knew how to build production pipelines. The LLM was a new ingredient, but everything around it — the validation infrastructure, the retry logic, the progress tracking — was built using the same skills Airbnb’s engineers had been practicing for years. The companies best positioned to benefit quickly from generative AI are often the same ones that already invested in predictive AI. Generative AI doesn’t replace the AI Factory. It plugs into it.

11 Responsible AI at Airbnb

Every chapter in this course circles back to the same basic question: what happens when the algorithm is wrong, or right in a way the business didn’t intend? Airbnb’s version of this question is especially sharp because it touches three different constituencies — guests, hosts, and now the engineering organization itself.

On the guest side: geographic bias and the filter bubble

The destination recommender is trained on guests’ past bookings. If a user has only ever booked in Europe, the model will tend to recommend European destinations. That might be exactly what the user wants. But it might also narrow their world in ways neither they nor Airbnb would choose deliberately. A destination model that learns from past behavior will systematically underweight places where the user has never been — which is, by definition, where exploration would actually take them.

A subtler version of the same issue is geographic coverage. Airbnb reports that autosuggest drives the biggest booking gains in regions where English is not the primary language — which is great, but also tells you that the feature was helping users in non-English markets more than users in English-speaking ones. That might be equity in action; it might also mean earlier systems were tilted toward English-language markets, leaving non-English users with a worse baseline to begin with. A responsible team monitors for both.

On the host side: whose margin is Smart Pricing optimizing?

Risk: algorithmic price nudging
When Smart Pricing suggests lowering a host’s nightly rate to match an apparent market softening, the host faces a real choice: accept the suggestion and take a lower margin, or reject it and risk an empty calendar. A suggestion from an algorithm that sees the entire market is hard to argue against from the position of seeing just your own home. Over time, hosts who consistently override Smart Pricing may find their listings ranked lower in search results, which makes overriding the suggestion feel more costly. The “suggestion” frame is not neutral — and a student reading this chapter should be thinking about whose interests the algorithm is optimizing when its recommendations also happen to increase booking volume on the platform.

There is a structural reason to ask this question. Airbnb’s revenue depends on completed bookings. Lower prices produce more bookings. A pricing algorithm that very slightly biases toward “lower” produces more bookings at every individual decision, produces more revenue for Airbnb, and produces slightly less revenue per night for each host. None of this requires bad faith from anyone. It just requires that the objective function the algorithm optimizes is aligned with the platform’s interests rather than the hosts’. Asking whose interests a suggestion serves is one of the most basic questions of responsible AI, and it’s a question hosts have a legitimate right to ask.

On the pricing side: price discrimination

Dynamic pricing at scale raises a broader question that airlines have been living with for decades: is charging different prices to different people for the same service fair? At Airbnb, two guests booking the same home for the same dates will generally see the same price — but prices move dramatically across dates, and hosts in certain neighborhoods may be nudged toward systematically different price curves than hosts in other neighborhoods. When the data reflects a city’s history, the algorithm can reflect that history too. A responsible pricing team monitors for this, particularly at the intersection of neighborhood demographics and dynamic pricing suggestions.

On the engineering side: the LLM’s code was correct, but was it right?

The test migration in §10 is a governance story in miniature. The engineers built validation into every stage — a test either passes or it doesn’t, a file either compiles or it doesn’t. That worked because the task had hard, objective correctness checks. Not every LLM deployment does. When a generative model is used to draft a customer email, summarize a legal document, or write code that will ship to production, the question “did it work?” becomes much harder to answer mechanically. The discipline Airbnb applied to this project — validation at every step, automated retries, human review of the long tail — should be the default assumption for how LLMs get deployed in serious business contexts. It is not how they are currently deployed in most companies.

12 Competitive Advantage: The Two-Sided Data Moat

Why can’t a competitor just copy all of this? Airbnb’s engineering team has published detailed descriptions of every system in this chapter. The models are not secret. The architectures are not proprietary. And yet the advantage compounds.

The moat is not the algorithm. It is the data, the infrastructure, the organizational learning, and the compounding feedback loop between two sides of a marketplace that reinforce each other. Everything else is copyable. That combination is not.

13 Summary Table & Discussion Questions

AI Factory model: Airbnb mapped

| Step | Guest side (recommendations) | Host side (Smart Pricing) | Internal (test migration) |
|---|---|---|---|
| Data | Booking, view, and search history as action sequences | Historical bookings, calendar data, market context | 3,500 test files plus sibling files and examples |
| Model | Transformer on behavioral sequences with multi-task heads | Hybrid: ML predicts 5 parameters of a structural distribution | Frontier LLM with retry loops and dynamic prompting |
| Prediction | Predicted region and city for each user | Predicted lead time distribution per cluster & date | Generated replacement code for each test file |
| Decision | Autosuggest rankings; abandoned-search email content | Suggested nightly price; host keeps final control | Accept if validation passes; retry if not; human fix at tail |
| Value | Higher conversion; re-engagement of dormant users | Better calendar utilization; higher host earnings | 1.5 years of engineering compressed into 6 weeks |

Key vocabulary introduced in this chapter

Two-sided marketplace
A platform whose value depends on matching two distinct user groups that need each other — guests & hosts, riders & drivers
Matching problem
The core operational job of a two-sided marketplace: getting the right person on one side to the right person on the other at the right price and moment
User intent modeling
Distinguishing transactional users (know what they want) from exploratory users (still figuring it out) and serving each differently
Cold-start problem
What happens when a recommender has very little history to personalize from — new users, dormant users, or users with ambiguous intent
Behavioral sequences as tokens
Treating each user action like a word in a sentence so language-model architectures can learn patterns in travel behavior
Short-term vs. long-term signals
Active searches and views reveal current intent; historical bookings reveal durable preference — both needed for good recommendations
Embeddings
Compact numeric representations of items (cities, regions, dates) where similar items end up close together in space
Multi-task learning
Training one model to predict several related things at once — at Airbnb, both region and city — for richer shared understanding
Active vs. dormant users
A deliberate training-data design choice that teaches one model to handle both recently active and long-lapsed users
Prediction–decision gap
The gap between what a model predicts and what a business should actually do — closed differently on each side of a two-sided marketplace
Dynamic pricing
Adjusting prices in near-real-time as demand, supply, and time to check-in change
Lead time distribution
The shape of how far in advance bookings arrive — different for peak dates vs. last-minute markets, and central to good pricing
Stochastic arrival process
“Random with a pattern” — bookings arrive unpredictably one at a time but follow clear patterns in aggregate
Structural vs. pure ML
Structural models specify a shape; pure ML learns the whole thing — each has strengths. Hybrid combines both
Hybrid models
Combining domain knowledge about the structure of a problem with ML’s ability to learn parameters — halved Smart Pricing error
Demand aggregations
Clusters of listings grouped by shared guest audiences (not physical features), solving the sparsity problem for per-listing pricing
Generative vs. predictive AI
Predictive AI produces scores or rankings from historical patterns; generative AI produces new content (text, code) from huge unstructured training sets
Human-in-the-loop workflows
Wrapping an AI model in validation, retries, and human-directed tuning to make unreliable capability into reliable production output
Substitution vs. augmentation
Does an AI deployment replace people doing a task or make those same people dramatically more productive? Rarely as clear-cut as it sounds

Discussion questions

These work well as written assignments or in-class discussion prompts. Questions 2, 3, and 7 tend to generate the most debate.

  1. Two sides, one platform — advantage or risk? Is Airbnb’s decision to optimize for both sides of its marketplace simultaneously a competitive advantage or an ethical risk? Take a position and defend it using specific evidence from the destination recommender, Smart Pricing, or both.
  2. Whose margin is Smart Pricing optimizing? When Smart Pricing suggests a lower price to a host during a quiet period, whose interests is that algorithm actually serving? Argue it through — is it the host’s, Airbnb’s, the guest’s, or some combination? Would a host-first version of the algorithm look different?
  3. Who was underserved before the model? Airbnb’s destination recommender performs best in regions where English is not the primary language. What does that tell you about which users were underserved before the model existed — and what does it say about how Airbnb should measure the success of a recommender like this?
  4. When is interpretability worth losing accuracy? Smart Pricing’s hybrid approach happened to beat pure ML on both accuracy and interpretability at once — but that alignment is not guaranteed in other problems. Where is the line? Pick a business context where a manager should insist on interpretability even at a real cost to accuracy, and one where they shouldn’t.
  5. Travel intent as text. Airbnb treats user actions as “tokens” the way a language model treats words. What are the risks of borrowing a framework designed for text and applying it to human travel intent? Where could this abstraction break — and what would the failure mode look like in the product?
  6. 1.5 years to 6 weeks. Airbnb’s engineers originally estimated the test migration at 1.5 years of work. They finished it in 6 weeks with LLMs. What should managers conclude from that number — and what should they not conclude? Under what conditions would you generalize this result to other teams or other kinds of work, and when would you resist?
  7. Filter bubble or personalization? The destination model learns from a user’s past bookings. A user who has only ever booked in Europe will see mostly European recommendations. Is the model personalizing correctly — or narrowing the user’s world? Argue both sides.

MIS 432 · AI in Business · Case Study · For classroom discussion purposes.

← Chapter 5: Waymo Lab 6: Build a Demand-Based Pricing Model →