Lab 2 · Netflix · A/B Testing

Build a Netflix-Style A/B Experimentation System

Simulate Netflix's artwork experiment in Python — randomly assign users, measure click-through rates, test for statistical significance, and make a data-driven business recommendation.

Tools: Claude AI · Google Colab · Python + pandas · GitHub Pages

  • 🎬 What you build: A/B testing system
  • 👥 Simulated users: 1,000
  • 📊 You measure: CTR + significance
  • 🌐 You publish: GitHub Pages site
How to use Claude in this lab
Use Claude as both a coding assistant and a thinking partner. Do not just ask for code — ask for explanations, push back on answers you do not understand, and use Claude to help you think through the business logic. This is closer to how AI tools are actually used in professional settings.

What Makes a Good Prompt?

A good prompt to Claude includes three things: context (what you are doing and why), the task (what you want Claude to produce), and the output format (how you want it explained or presented).

Weak prompt
"Write Python code for an A/B test."
Stronger prompt
"I am simulating a marketing experiment where two groups of customers see different versions of an ad. I want to randomly assign 500 customers to each group, simulate whether they click based on different probabilities, and store the results in a table. Can you write Python code that does this and explain each step in plain English?"
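For comparison, here is roughly the kind of code the stronger prompt tends to produce. This is a sketch, not Claude's guaranteed output, and the 0.10 and 0.15 click probabilities are illustrative placeholders, not values from this lab:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)  # seed so the simulation is repeatable

# Assign exactly 500 customers to each ad version, in random order
versions = np.repeat(["A", "B"], 500)
rng.shuffle(versions)

# Placeholder click probabilities for the two ads
click_prob = pd.Series(versions).map({"A": 0.10, "B": 0.15})
clicked = (rng.random(1000) < click_prob).astype(int)

customers = pd.DataFrame({
    "customer_id": np.arange(1, 1001),
    "ad_version": versions,
    "clicked": clicked,
})
print(customers.head())
```

Notice how every detail the stronger prompt specified — group size, random assignment, differing probabilities, tabular output — shows up directly in the code. The weak prompt gives Claude none of those constraints to work with.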

Connection to the Netflix Case

Your lab mirrors Netflix's real system:

  • Data: impressions & clicks
  • Model: compare variants
  • Prediction: which artwork wins
  • Decision: deploy at scale
  • Value: more engagement
Step 1 · Understand the Business Problem

Before writing any code, build a clear mental model of the business problem. This step has no coding — it is entirely about understanding. Use Claude as a thinking partner to deepen your understanding of the key concepts. Tell Claude what course you are in and ask for explanations in plain business terms, not technical jargon.

Write and reflect — answer these in your own words:

  • What problem is Netflix trying to solve with artwork testing, and why does it matter for the business?
  • Why might different images lead to different user behavior?
  • Why is guessing not sufficient — what does experimentation give you that intuition cannot?
  • Is Netflix's artwork system an example of AI automating a decision or augmenting a human decision? What is the difference?
  • Netflix shifted artwork selection from human designers to a data-driven system. What do you think that shift changed — and was it entirely a good thing?
Step 2 · Design the A/B Experiment

Before writing a single line of code, think through the logic of a well-designed experiment. In a real organization, data analysts, data scientists, product managers, and business stakeholders would all weigh in on experimental design. Do not use Claude for this step — work through the thinking yourself first.

Write and reflect:

  • What does random assignment mean in this context, and why does it matter?
  • What would go wrong if users were not randomly assigned?
  • Why does the size of your experiment matter — what is the risk of testing on too few users?
  • Before you build anything: what will you need to measure to determine which artwork performed better?
Step 3 · Simulate the Experiment in Python

Before opening Claude, talk with a partner about what your dataset should look like — what columns it needs, what each row represents, and what you are measuring. Then use this prompt:

Claude prompt
I am simulating a Netflix-style experiment where 1,000 users are randomly assigned to see one of two artwork versions — Artwork A with a click probability of 0.24 and Artwork B with a click probability of 0.30. I want to simulate whether each user clicks based on those probabilities and store the results in a pandas DataFrame with three columns: a user ID, the artwork version they were assigned to, and whether they clicked (1 for yes, 0 for no). Please do two things: First, give me the complete Python code in a single code block that I can copy and paste directly into Google Colab. Second, below the code block, explain each step in plain English and connect each step to what Netflix actually does in its real experimentation system.
Prompt reflection
Look at the prompt you used and the code Claude produced. What specifically in the prompt led to that output — and if you ran the same experiment with a weaker prompt, what do you think would have been missing or wrong?
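For reference, one plausible shape for the simulation code this prompt produces — your version from Claude may differ in structure, and the seed here is arbitrary:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(432)  # arbitrary seed so runs are repeatable

n_users = 1000
artwork = rng.choice(["A", "B"], size=n_users)  # random assignment

# Click probabilities from the prompt: A = 0.24, B = 0.30
click_prob = pd.Series(artwork).map({"A": 0.24, "B": 0.30})
clicked = (rng.random(n_users) < click_prob).astype(int)

experiment = pd.DataFrame({
    "user_id": np.arange(1, n_users + 1),
    "artwork": artwork,
    "clicked": clicked,
})
print(experiment.head())
```

Each row is one user: who they are, which artwork they were shown, and whether they clicked — the same unit of observation Netflix records as an impression.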

Write and reflect:

  • What does each row in your dataset represent?
  • Netflix runs experiments on millions of users simultaneously. You just ran one on 1,000 simulated users. What does that difference in scale mean — for the results, for the decisions, and for the organization?
Step 4 · Measure and Visualize Performance

Add a measurement layer. Think of this as Netflix asking: now that we have the data, what does it tell us?

Claude prompt
Can you give me the complete Python code from the previous step again in a single code block, and then add code that calculates the total number of users in each group, the total number of clicks in each group, and the click-through rate for each group? Please also add a clean bar chart that visually compares the click-through rates for Artwork A and Artwork B. Keep it all in one code block so I can copy and paste it into Google Colab, and then below the code explain what the results and the chart mean in a business context.
Prompt reflection
Look at the bar chart Claude produced. Does the visual tell a clearer story than the numbers alone — and what does that suggest about how data should be communicated in a business setting?
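A minimal sketch of what this measurement layer can look like. It re-creates the Step 3 simulation so it runs on its own; the group-by aggregation and chart styling are one reasonable approach, not the only one Claude might choose:

```python
import numpy as np
import pandas as pd

# Re-create the simulated experiment (same setup as Step 3)
rng = np.random.default_rng(432)
n_users = 1000
artwork = rng.choice(["A", "B"], size=n_users)
clicked = (rng.random(n_users) < pd.Series(artwork).map({"A": 0.24, "B": 0.30})).astype(int)
experiment = pd.DataFrame({"user_id": np.arange(1, n_users + 1),
                           "artwork": artwork, "clicked": clicked})

# Users, clicks, and click-through rate per group
summary = experiment.groupby("artwork")["clicked"].agg(
    users="count", clicks="sum", ctr="mean")
print(summary)

# Bar chart comparing CTRs (matplotlib is preinstalled in Google Colab)
try:
    import matplotlib.pyplot as plt
    summary["ctr"].plot(kind="bar", color=["#564d4d", "#e50914"])
    plt.ylabel("Click-through rate")
    plt.title("CTR by artwork version")
    plt.show()
except ImportError:
    pass  # skip the chart if matplotlib is not installed
```

Because clicks are simulated randomly, the observed CTRs will hover near — but rarely equal — the true 0.24 and 0.30 probabilities. That gap between true and observed rates is exactly why Step 5 exists.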

Write and reflect:

  • Which artwork performed better? By how much (difference in CTR)?
  • You now have a number — a difference in click-through rate. But a number alone is not a decision. What additional information would you want before recommending that Netflix switch to Artwork B across the entire platform? What could go wrong if you acted on this result too quickly?
Step 5 · Evaluate the Decision: Statistical Testing

Move from results to a decision. This is the prediction-decision gap in action — is the result trustworthy enough to act on?

Claude prompt
Can you give me the complete Python code from all of my previous steps in a single code block — including the simulation, the click-through rate calculations, and the bar chart — and then add code that runs a simple statistical test to determine whether the difference in click-through rates between Artwork A and Artwork B is likely real or due to chance? Keep everything in one code block so I can copy and paste it into Google Colab. Then below the code, explain the statistical test you chose, why you chose it, and what the result means in plain English without assuming I have a statistics background.
Prompt reflection
This prompt asked Claude to choose a statistical test rather than naming one. Did that produce a better or worse result than if you had named a specific test — and what does that tell you about when to give Claude more direction versus less?
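Claude will often reach for a two-proportion z-test or a chi-square test here. As one point of comparison, a hand-rolled two-proportion z-test on the simulated data might look like this — a sketch assuming the same Step 3 setup, not the only valid test:

```python
import numpy as np
import pandas as pd
from math import erf, sqrt

# Re-create the simulated experiment (same setup as Step 3)
rng = np.random.default_rng(432)
n_users = 1000
artwork = rng.choice(["A", "B"], size=n_users)
clicked = (rng.random(n_users) < pd.Series(artwork).map({"A": 0.24, "B": 0.30})).astype(int)
experiment = pd.DataFrame({"artwork": artwork, "clicked": clicked})

# Clicks and sample size per group
counts = experiment.groupby("artwork")["clicked"].agg(n="count", clicks="sum")
n_a, c_a = counts.loc["A", "n"], counts.loc["A", "clicks"]
n_b, c_b = counts.loc["B", "n"], counts.loc["B", "clicks"]

# Two-proportion z-test with a pooled standard error
p_a, p_b = c_a / n_a, c_b / n_b
p_pool = (c_a + c_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided; normal CDF via erf
print(f"CTR A = {p_a:.3f}, CTR B = {p_b:.3f}, z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value (conventionally below 0.05) suggests a difference this large would be unlikely if the two artworks truly performed the same — which is what makes the result trustworthy enough to act on.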

Write and reflect:

  • What does it mean for a result to be statistically meaningful?
  • Based on your results, would you switch to Artwork B? Why or why not?
Step 6 · Build and Publish Your A/B Testing Case Study Website

In industry, it is not enough to run an analysis — you need to communicate your results clearly. A hiring manager or client is unlikely to open a Jupyter notebook, but they will click a link.

Part 1: Ask Claude to Create Your Website

Claude prompt
I am a student in MIS 432: AI in Business at Western Washington University's College of Business and Economics. For class this week, we discussed a Netflix business case and spotlighted experimentation through A/B testing. I have just completed a Netflix-style A/B test in Python across three steps. Here is what my code does: Step 1: Simulates 1,000 users randomly assigned to two artwork versions (A and B) with click probabilities of 0.24 and 0.30 and stores the results in a pandas DataFrame with a user ID, artwork version, and clicked variable. Step 2: Calculates the total users, total clicks, and click-through rate for each group and displays the results in a bar chart comparing the two versions. Step 3: Runs a statistical test to determine whether the difference in click-through rates between the two groups is likely real or due to chance. Can you take the complete Python code from all of my previous steps and turn it into a single clean HTML file that I can publish on GitHub Pages? The website should: - Include all of the Python code from my previous steps in one place - Explain the Netflix case, the business problem, the role of experimentation in machine learning and A/B testing, what I did with the code, what I found, and what Netflix should do about it in plain English - Display the bar chart comparing click-through rates - End with a clear business recommendation - Look professional using Netflix style colors and be easy to read on any device - Make clear that this project was completed as part of MIS 432: AI in Business at Western Washington University's College of Business and Economics The audience is a potential employer or client who has never seen my code and has no technical background.
Prompt reflection
You gave Claude an audience — a potential employer with no technical background. Look at the website it produced and ask yourself: did it actually write for that audience? What would you change, and how would you modify the prompt to get a better result?

Part 2: Publish on GitHub

  1. Copy the HTML Claude gives you
  2. Open GitHub and create a new repository named netflix-ab-testing
  3. Check the box to add a README file, then create the repository
  4. Click Add File → Create new file
  5. Name it index.html and paste the code
  6. Click Commit Changes
  7. Go to Settings → Pages, select main branch, click Save
  8. Wait a minute — your URL will appear under Pages. Save it.

Write and reflect: You just turned a Python analysis into a public-facing website. What does the ability to communicate your analysis clearly — not just run it — tell an employer about your skills?

Step 7 · Connect Back to the Netflix Business Case

Step back and interpret the system as a whole:

  • How does your simulation mirror Netflix's real system? What parts of the real system are missing from your lab?
  • Could you improve the experiment by adding user segments — for example, documentary lovers vs. romance lovers seeing different artwork for the same horror movie? How might Netflix use machine learning to enable that kind of personalization?
Step 8 · AI Factory Reflection

Map each component of the AI Factory to what you actually built — be specific about what each component looked like in your simulation:

  • Data →
  • Model →
  • Prediction →
  • Decision →
  • Value →

Then answer: What did you just build? What does it do, and what does it not do?

What to Submit