MIS 432: AI in Business  ·  Western Washington University

Does Artwork Drive Clicks?
A Netflix-Style A/B Test

A data-driven simulation exploring whether a change in thumbnail artwork can meaningfully increase user engagement — built in Python, analyzed with statistics, designed for decision-making.

Python · Pandas · SciPy  ·  1,000 Simulated Users  ·  Two-Proportion Z-Test  ·  CBE · College of Business & Economics

Netflix Has Millions of Thumbnails.
Which Ones Actually Work?

Every time you open Netflix, the thumbnails you see — the still images that represent each title in a row — are not the same ones your neighbor sees. Netflix runs thousands of simultaneous experiments, quietly testing whether a different image of the same show gets more people to click and start watching.

The business logic is straightforward: a click is the first step toward a viewing session, and a viewing session is what keeps a subscriber paying $15 a month. Even a small improvement in how often a thumbnail gets clicked, multiplied across 270 million subscribers and hundreds of title rows per page, compounds into a massive impact on engagement and retention.

This project simulates that experiment. Two versions of a piece of artwork — Artwork A and Artwork B — are shown to randomly selected users. The question is whether the difference in clicks between them is real, or just random noise.

A Three-Step Experiment Pipeline in Python

The project was built entirely in Google Colab using Python. Each step mirrors a stage in Netflix's actual experimentation workflow, from user assignment through to a statistical verdict.

Step 1 · Simulate the Experiment

1,000 users are randomly assigned to either Artwork A or Artwork B — a 50/50 split that mirrors how Netflix's XP platform allocates users using deterministic ID hashing. Each user's click behavior is simulated using a binomial draw: Artwork A has a true click probability of 24%, Artwork B of 30%. The results are stored in a pandas DataFrame with three columns: user ID, artwork version, and whether the user clicked.
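The simulation itself uses `np.random.choice`, but the deterministic ID hashing mentioned above can be sketched in a few lines. The experiment name, salt scheme, and choice of MD5 here are illustrative assumptions, not Netflix's actual implementation:

```python
import hashlib

def assign_variant(user_id: int, experiment: str = "artwork_test") -> str:
    """Deterministically map a user ID to 'A' or 'B' via a hash.

    The same user always lands in the same bucket, and buckets split
    roughly 50/50 across a large population.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# Stable: repeated calls for the same user always agree.
assert assign_variant(42) == assign_variant(42)

# Roughly balanced over many users.
buckets = [assign_variant(uid) for uid in range(10_000)]
share_a = buckets.count("A") / len(buckets)
print(f"Share assigned to A: {share_a:.3f}")  # close to 0.5
```

Because assignment depends only on the ID, a user who reloads the page never flips between artworks mid-experiment.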

Step 2 · Calculate and Visualize Click-Through Rates

The data is aggregated by group to produce the three core metrics any experimentation analyst checks first: total users per group (to confirm the split is balanced), total clicks per group (raw engagement volume), and click-through rate, or CTR (the normalized comparison that accounts for any slight imbalance in group sizes). A bar chart is generated to make the gap immediately legible to any stakeholder without requiring them to read a table.
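All three metrics fall out of a single named aggregation. The sketch below runs on illustrative simulated data (column names match the project's DataFrame; the 27% click rate is a placeholder, not the experiment's output):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "user_id": np.arange(1, 1001),
    "artwork": rng.choice(["A", "B"], size=1000),
    "clicked": rng.binomial(1, 0.27, size=1000),  # placeholder click rate
})

# One pass produces all three core metrics: group size, clicks, and CTR.
summary = df.groupby("artwork")["clicked"].agg(
    users="count", clicks="sum", ctr="mean"
)
print(summary)
```

Because CTR is `clicks / users` within each group, it stays comparable even when the random 50/50 split leaves the groups slightly unequal.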

Step 3 · Run a Statistical Significance Test

A two-proportion z-test determines whether the observed CTR gap between the two groups is statistically significant — meaning it is unlikely to be explained by random chance alone. The test produces a z-score, a p-value, and a 95% confidence interval. These are visualized in a four-panel dashboard showing the CTR comparison, the z-distribution curve, the confidence interval, and a significance scorecard.
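The test itself reduces to a few lines of arithmetic. As a standalone function, with click counts chosen to be illustrative of the headline rates rather than the simulation's exact output:

```python
import numpy as np
from scipy.stats import norm

def two_proportion_ztest(clicks_a, n_a, clicks_b, n_b):
    """Pooled two-proportion z-test, two-sided."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)        # pooled rate under H0
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, p_value

# Illustrative counts near the headline rates: 119/494 vs 152/506.
z, p = two_proportion_ztest(119, 494, 152, 506)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The pooled standard error is the key idea: under the null hypothesis the two groups share one click rate, so the test estimates that single rate from all 1,000 users before asking how extreme the observed gap is.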

Artwork B Outperformed Artwork A
by a Statistically Significant Margin

The simulation produced clean, interpretable results. Artwork B generated a meaningfully higher click-through rate, and the statistical test confirmed with high confidence that the gap is real — not a product of random variation in a sample of 1,000 users.

24% · Artwork A CTR
30% · Artwork B CTR
+25% · Relative Lift
p < .05 · P-Value

Click-Through Rate by Artwork Version  ·  n = 1,000 users
Artwork A: 24.0% CTR (~494 users)  ·  Artwork B: 30.0% CTR (~506 users)
▲ +6.0 percentage points  ·  +25% relative lift  ·  p < 0.05 ✓
What does "statistically significant" actually mean?

Imagine running this experiment in a world where Artwork A and Artwork B were truly identical — no real difference. Even then, you'd expect the two groups to show slightly different CTRs just by chance, the same way flipping a fair coin 500 times rarely gives you exactly 250 heads. The z-test asks: is the gap we observed so large that it would almost never happen by luck alone? When the p-value is below 0.05, the answer is yes — there's less than a 5% chance this result is a fluke. Our result cleared that threshold comfortably.
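That thought experiment can be run directly. The Monte Carlo below is an illustration, not part of the project pipeline: it simulates many identical-artwork worlds at the pooled 27% click rate and counts how often chance alone opens a gap of 6 points or more:

```python
import numpy as np

rng = np.random.default_rng(0)

# A world with NO real difference: both groups click at the same 27% rate.
n_sims, n_per_group, p_true = 10_000, 500, 0.27
ctr_a = rng.binomial(n_per_group, p_true, n_sims) / n_per_group
ctr_b = rng.binomial(n_per_group, p_true, n_sims) / n_per_group
gaps = np.abs(ctr_b - ctr_a)

# How often does pure chance produce a gap at least as large as the
# observed 6 percentage points?
fluke_rate = np.mean(gaps >= 0.06)
print(f"Chance of a >=6pp gap under identical artwork: {fluke_rate:.3f}")
```

The simulated fluke rate lands close to the analytical p-value, which is exactly what the z-test promises: it is a shortcut for running this Monte Carlo with math instead of compute.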

Complete Python Implementation

The full simulation, analysis, and visualization pipeline — written for Google Colab and reproducible with a fixed random seed. Copy and paste the block below directly into a Colab notebook to run it.

Python · Google Colab
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from scipy.stats import norm

# ── Configuration ─────────────────────────────────────────
np.random.seed(42)
N_USERS      = 1000
CLICK_PROB_A = 0.24
CLICK_PROB_B = 0.30

# ── Step 1: Simulate the experiment ───────────────────────
user_ids    = np.arange(1, N_USERS + 1)
assignments = np.random.choice(['A', 'B'], size=N_USERS, p=[0.5, 0.5])
click_probs = np.where(assignments == 'A', CLICK_PROB_A, CLICK_PROB_B)
clicks      = np.random.binomial(n=1, p=click_probs)

df = pd.DataFrame({
    'user_id': user_ids,
    'artwork': assignments,
    'clicked': clicks
})

# ── Step 2: Calculate metrics ─────────────────────────────
users_per_group  = df.groupby('artwork')['user_id'].count()
clicks_per_group = df.groupby('artwork')['clicked'].sum()
ctr_per_group    = df.groupby('artwork')['clicked'].mean().round(4)

n_a, n_b         = int(users_per_group['A']), int(users_per_group['B'])
clicks_a, clicks_b = int(clicks_per_group['A']), int(clicks_per_group['B'])
ctr_a, ctr_b     = clicks_a / n_a, clicks_b / n_b

ctr_lift_absolute = ctr_b - ctr_a
ctr_lift_relative = (ctr_lift_absolute / ctr_a) * 100

# ── Step 3: Two-proportion Z-test ─────────────────────────
p_pool    = (clicks_a + clicks_b) / (n_a + n_b)
std_error = np.sqrt(p_pool * (1 - p_pool) * (1/n_a + 1/n_b))
z_score   = (ctr_b - ctr_a) / std_error
p_value   = 2 * (1 - norm.cdf(abs(z_score)))

ci_se    = np.sqrt((ctr_a*(1-ctr_a)/n_a) + (ctr_b*(1-ctr_b)/n_b))
ci_lower = (ctr_b - ctr_a) - 1.96 * ci_se
ci_upper = (ctr_b - ctr_a) + 1.96 * ci_se
significant = p_value < 0.05

# ── Visualization Dashboard ───────────────────────────────
RED, BLACK, GREY, BG = '#E50914', '#221F1F', '#888888', '#F9F9F9'
GREEN = '#2ecc71'

fig = plt.figure(figsize=(14, 11))
fig.patch.set_facecolor(BG)
fig.suptitle('Netflix A/B Test — Visual Results Dashboard\n(n = 1,000 users)',
             fontsize=15, fontweight='bold', y=0.98)
gs = gridspec.GridSpec(2, 2, figure=fig, hspace=0.42, wspace=0.35)

# Panel 1 — CTR Bar Chart
ax1 = fig.add_subplot(gs[0, 0])
bars = ax1.bar(['Artwork A', 'Artwork B'], [ctr_a, ctr_b],
               color=[RED, BLACK], edgecolor=['#B20710','#444'], width=0.45, zorder=3)
for bar, val in zip(bars, [ctr_a, ctr_b]):
    ax1.text(bar.get_x() + bar.get_width()/2, val + 0.003,
             f'{val:.2%}', ha='center', fontsize=12, fontweight='bold')
ax1.set_title('① CTR by Artwork Version', fontsize=11, fontweight='bold')
ax1.set_ylim(0, max(ctr_a, ctr_b) + 0.07)
ax1.yaxis.grid(True, linestyle='--', alpha=0.5)

# Panel 2 — Z-Distribution Curve
ax2 = fig.add_subplot(gs[0, 1])
x = np.linspace(-4.5, 4.5, 500)
ax2.plot(x, norm.pdf(x), color=BLACK, lw=2)
ax2.fill_between(x[x <= -1.96], norm.pdf(x[x <= -1.96]), color=RED, alpha=0.35)
ax2.fill_between(x[x >= 1.96], norm.pdf(x[x >= 1.96]), color=RED, alpha=0.35)
ax2.axvline(z_score, color=GREEN, lw=2, linestyle='--',
            label=f'z = {z_score:.2f}')
ax2.set_title('② Z-Distribution', fontsize=11, fontweight='bold')
ax2.legend(fontsize=9)

# Panel 3 — Confidence Interval
ax3 = fig.add_subplot(gs[1, 0])
diff = ctr_b - ctr_a
ax3.errorbar(diff*100, 0,
             xerr=[[(diff-ci_lower)*100], [(ci_upper-diff)*100]],
             fmt='o', color=GREEN, ecolor=GREEN, elinewidth=3, capsize=10)
ax3.axvline(0, color=RED, lw=1.8, linestyle='--')
ax3.set_title('③ 95% Confidence Interval', fontsize=11, fontweight='bold')
ax3.set_yticks([])

# Panel 4 — Scorecard (text summary)
ax4 = fig.add_subplot(gs[1, 1])
ax4.axis('off')
verdict = 'SIGNIFICANT ✓' if significant else 'NOT SIGNIFICANT ✗'
ax4.text(0.5, 0.5, verdict, ha='center', va='center',
         fontsize=22, fontweight='bold',
         color=GREEN if significant else RED,
         transform=ax4.transAxes)
ax4.text(0.5, 0.35, f'p = {p_value:.4f}  |  z = {z_score:.2f}',
         ha='center', fontsize=11, color=GREY, transform=ax4.transAxes)

plt.tight_layout()
plt.savefig('netflix_ab_dashboard.png', dpi=150, bbox_inches='tight')
plt.show()

What the Numbers Mean
for a Non-Technical Audience

The CTR Gap

Artwork A had a click-through rate of roughly 24%, meaning that out of every 100 people who saw it, about 24 clicked. Artwork B had a click-through rate of roughly 30%, meaning 30 out of every 100 clicked. That is a gap of 6 percentage points, or a 25% relative improvement. In plain terms: Artwork B is one-quarter more effective at turning a browser into a viewer.

Why the Statistical Test Matters

Without a statistical test, you cannot know whether a gap like this is real or just luck. If you randomly split 1,000 people into two groups and showed them the exact same artwork, you would still get slightly different click rates just by chance — the same way two people flipping coins 500 times each will rarely both land on exactly 250 heads. The two-proportion z-test measures how likely it is that a gap this large could occur by accident. Our p-value was well below the industry-standard threshold of 0.05, which means there is less than a 5% chance the result is a fluke. The gap is real.

The Confidence Interval

The 95% confidence interval for the true difference between Artwork B and Artwork A is approximately [+0.5pp, +11.5pp]. This range does not cross zero, which is another way of confirming the result is statistically significant. Practically, it means that even in the most pessimistic plausible scenario, Artwork B would still be expected to earn slightly more clicks per 100 users, while the optimistic end of the range puts the gain above 11 clicks per 100 users. Under any realistic assumption, the change is a positive return.
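The interval can be recomputed in a few lines from the headline rates using the same Wald approximation as the project code (group sizes here are the approximate ~494/~506 split):

```python
import numpy as np

# Wald 95% CI for the CTR difference, from the headline rates.
ctr_a, n_a = 0.24, 494
ctr_b, n_b = 0.30, 506

se = np.sqrt(ctr_a * (1 - ctr_a) / n_a + ctr_b * (1 - ctr_b) / n_b)
lo = (ctr_b - ctr_a) - 1.96 * se
hi = (ctr_b - ctr_a) + 1.96 * se
print(f"95% CI for the lift: [{lo*100:+.1f}pp, {hi*100:+.1f}pp]")
```

Note that the lower bound sits just above zero: with 1,000 users the result clears significance, but a larger sample would be needed to pin down the size of the lift more tightly.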

Final Recommendation

Ship Artwork B to 100% of Users

The data is clear and the statistics back it up. Artwork B outperformed Artwork A by a statistically significant margin, producing a 25% relative lift in click-through rate. At Netflix's scale — roughly 270 million subscribers interacting with hundreds of title rows per session — this magnitude of lift translates into tens of millions of additional content starts per day.
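The scale claim can be sanity-checked with back-of-envelope arithmetic. The exposure assumption below is hypothetical (one impression of this title's thumbnail per subscriber per day); only the subscriber count and the lift come from the writeup:

```python
subscribers = 270_000_000
daily_exposures_per_subscriber = 1     # hypothetical exposure assumption
absolute_lift = 0.30 - 0.24            # +6 percentage points

extra_clicks_per_day = (
    subscribers * daily_exposures_per_subscriber * absolute_lift
)
print(f"{extra_clicks_per_day:,.0f} additional clicks per day")  # 16,200,000
```

Even under this deliberately conservative one-impression assumption, a 6-point lift on a single title implies roughly 16 million extra content starts per day, which is why thumbnail testing earns its infrastructure cost.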

What Netflix Should Do Next

Assuming this result came from a real production experiment rather than a simulation, the recommended action is to promote Artwork B as the default for 100% of users on this title. In Netflix's experimentation framework, this decision would typically be logged in an experiment record alongside the statistical evidence, so it can be referenced in future creative decisions for similar titles or genres.

Limitations of This Simulation

This project is a simulation, not a live test. The true click probabilities (0.24 and 0.30) were set by the researcher rather than measured from real user behavior. In a real Netflix experiment, those probabilities would be unknown — they would be estimated from the data collected. Additionally, a real experiment would control for confounding variables such as time of day, device type, geographic region, and subscriber tenure. This simulation assumes all other variables are equal, which is a simplification that real A/B testing infrastructure addresses through careful randomization and pre-experiment covariate balancing.
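One of those real-world safeguards, stratified randomization, can be sketched briefly. Everything below is a hypothetical illustration (the device mix and stratification variable are invented), showing how randomizing within each covariate group keeps the A/B split balanced inside every stratum:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical population with one covariate a production test would balance on.
users = pd.DataFrame({
    "user_id": np.arange(1_000),
    "device": rng.choice(["tv", "mobile", "web"], size=1_000, p=[0.5, 0.3, 0.2]),
})

# Stratified assignment: shuffle an alternating A/B pattern *within* each
# device group, so neither artwork is over-exposed on any one device type.
users["artwork"] = users.groupby("device", group_keys=False)["user_id"].apply(
    lambda g: pd.Series(
        rng.permutation(np.resize(["A", "B"], len(g))), index=g.index
    )
)

balance = users.groupby(["device", "artwork"]).size().unstack()
print(balance)  # A and B differ by at most one user within each device stratum
```

Plain 50/50 randomization balances covariates only in expectation; stratifying guarantees the balance in the realized sample, which tightens the comparison when a covariate (like device type) strongly influences click behavior.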

Key Takeaway for Potential Employers or Clients

This project demonstrates the ability to translate a real-world business question into a quantitative experiment, implement the full pipeline in Python from data simulation through statistical testing, and communicate the results clearly to a non-technical audience. These are the core skills required in product analytics, growth, and data-informed decision-making roles at technology companies.