
The Best Revenue Significance Calculator for A/B Testing

May 11, 2017

If you’re conducting A/B tests on your ecommerce website and are not tracking revenue, then you are missing out on a crucial component for successful testing: having the right KPI.

Tracking revenue allows your team to make effective business decisions, because you’re measuring performance in a way that actually impacts the bottom line.

So which revenue metrics should you choose?

Some common revenue metrics don’t tell the whole story, which is why we recommend using revenue per visitor (RPV). RPV measures the average revenue generated per user visiting your site:

RPV = Total Revenue / Total Users

We’re about to explain:

  • why revenue per visitor is such a crucial (composite) metric
  • why RPV’s formula should be rewritten to include transaction rate and AOV
  • the right way to measure its statistical significance
  • how to use our free online revenue significance calculator
  • how to hack your way around sampled data to get the most accurate results

Why Use Revenue Per Visitor in A/B Testing?

If your team tracks only transaction rate (the percentage of visitors that purchased) or average order value (AOV) as your primary metric for testing, your results are at risk of having blind spots.

graphic showing revenue metrics in a/b testing

Some people assume that AOV is relatively constant and they only need to focus their efforts on increasing transaction rate in order to increase revenue. However, this logic doesn’t always apply.

In some circumstances, increasing conversion rate can negatively affect your overall revenue.

For example, if you have a test variation that increases the conversion rate, but users choose to purchase the lower-priced product instead of the more expensive product, this can decrease AOV and overall revenue.

chart: a/b test variation with increased conversion rate

chart: negative impact on revenue a/b testing results

Alternatively, people may focus their efforts on improving only AOV to increase revenue which can lead to a decrease in transaction rate, ultimately hurting revenue.

For example, consider an ecommerce website test where the variation increases the spend threshold to qualify for free shipping. This can lead to a higher AOV, but can also decrease transaction rate because there may be visitors who want free shipping but don’t want to spend the extra money to qualify. As a result, they may choose not to purchase.

table: a/b test results with lower transaction rate

chart: a/b testing results that decreased revenue

The examples above illustrate the need to have a solid conversion strategy for revenue that incorporates both metrics. Revenue per visitor is that composite metric, which accounts for both transaction rate and AOV.

In fact, we can rewrite RPV’s formula to include these two elements:

Total Revenue = AOV x Transactions
Transaction Rate = Transactions / Total Users
RPV = AOV x Transaction Rate

So if your business had 1,000 transactions for every 15,000 users with an AOV of $50, the RPV would be:

Total Revenue = $50 x 1,000 = $50,000
Transaction Rate = 1,000/15,000 ≈ 0.067
RPV = $50,000/15,000 ≈ $3.33
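The arithmetic can be sketched in a few lines of Python, confirming that the direct formula and the composite (AOV x transaction rate) form agree:

```python
def rpv(total_revenue, total_users):
    """Revenue per visitor: total revenue divided by total users."""
    return total_revenue / total_users

def rpv_composite(aov, transactions, total_users):
    """Equivalent composite form: AOV x transaction rate."""
    return aov * (transactions / total_users)

aov, transactions, users = 50.00, 1_000, 15_000
total_revenue = aov * transactions  # $50,000

direct = rpv(total_revenue, users)
composite = rpv_composite(aov, transactions, users)
print(f"direct: ${direct:.2f}, composite: ${composite:.2f}")  # both ~$3.33
```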

Monitoring trends in RPV can help your team analyze sales performance. It’s useful for evaluating your new visitor acquisition and paid user acquisition efforts.

Generally, a positive trend in RPV shows that your company’s sales efforts are working well.

However, if your revenue per visitor is trending downward, this could be the result of an increase in unqualified users to the site or potential site problems (e.g. broken shopping cart), which negatively affects your transaction rate.

Or your visitors may be converting at the same rate but are spending money on lower value items (e.g. higher priced product is out of stock), which negatively impacts your AOV.

Taking the example above, let’s say the number of users increased to 20,000 due to a social campaign that recently launched. Assuming the AOV stayed the same, your team would find that RPV is trending negatively:

Transaction Rate = 1,000/20,000 = 0.05
RPV = $50 * 0.05 = $2.50

Now let’s assume that the traffic stayed the same but your most expensive product was out of stock, causing the AOV to decrease to $37.50:

Transaction Rate = 1,000/15,000 ≈ 0.067
RPV = $37.50 x 1,000/15,000 = $2.50

RPV does not replace the need to keep an eye on other metrics like AOV and transaction rate. Rather, it removes the potential blind spots that can occur if you track only those metrics. In essence, it gives your team a better sense of the bigger picture.

How NOT to Calculate Statistical Significance

If your team is already using revenue per visitor as the main KPI for your tests, you may have figured out why you shouldn’t use standard online revenue significance calculators to determine whether your test variation is having an actual impact on RPV. These standard tools perform their calculations using a t-test, which operates on one critical assumption: that the metric you’re tracking follows a normal distribution.

example: normal distribution in a/b testing statistical analysis

Source: Statistics Cheat Sheet

Revenue per visitor doesn’t follow a normal distribution and therefore violates this assumption, because the majority of visitors to your site will not convert or make a purchase. As a result, RPV’s distribution contains a large concentration of $0 values; and since there is no limit on how much a visitor can spend, the data may also contain some extreme high values.

a/b test data distribution chart for revenue per visitor

For these reasons, RPV’s distribution tends to be right-skewed, making the standard t-test less reliable for measuring statistical significance.

right skewed data distribution for revenue per visitor metric
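A quick simulation makes this shape concrete. This is only a sketch; the 5% conversion rate and log-normal order values are made-up assumptions, but they reproduce the spike at $0 and the long right tail:

```python
import random

random.seed(42)

def simulate_visitor_revenue(n_visitors, conversion_rate=0.05):
    """Per-visitor revenue: most visitors spend $0, and order values
    for the few who convert follow a long-tailed (log-normal) distribution."""
    revenue = []
    for _ in range(n_visitors):
        if random.random() < conversion_rate:
            revenue.append(round(random.lognormvariate(4, 0.6), 2))  # ~$55 median order
        else:
            revenue.append(0.0)
    return revenue

revenue = simulate_visitor_revenue(10_000)
mean = sum(revenue) / len(revenue)
median = sorted(revenue)[len(revenue) // 2]
zero_share = sum(1 for r in revenue if r == 0) / len(revenue)

print(f"share of $0 visitors: {zero_share:.0%}")
print(f"mean ${mean:.2f} vs median ${median:.2f}")
# The mean sits far above the median: the hallmark of a right-skewed distribution.
```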

The Right RPV Confidence Calculator for the Job

To solve this problem, we launched a free online Revenue Per Visitor confidence calculator designed specifically for calculating RPV’s statistical significance. Our RPV calculator utilizes the Wilcoxon Rank Sum Test, which is not based on the assumption that the data follows a normal distribution.

In fact, the Wilcoxon Rank Sum Test employs a non-parametric technique — a technique that does not rely on any specific distributional assumption — in order to test whether there is a difference.

This calculation is far more reliable in determining whether there is an actual impact on RPV. It includes a two-tailed calculation, so you can use it to determine whether the variation had a positive impact or a negative impact when compared to the control.
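For teams who prefer to script the analysis, SciPy exposes the same test as `scipy.stats.mannwhitneyu` (the Mann-Whitney U test is the two-sample form of the Wilcoxon rank sum test). A minimal sketch with made-up user-level revenue data:

```python
from scipy.stats import mannwhitneyu

# Per-visitor revenue, including the many $0 non-converters (made-up data).
control   = [0, 0, 0, 0, 49.99, 0, 0, 120.00, 0, 0, 35.50, 0]
variation = [0, 0, 79.99, 0, 0, 54.25, 0, 0, 0, 92.00, 0, 44.99]

# Two-tailed: detects both positive and negative shifts vs. the control.
stat, p_value = mannwhitneyu(control, variation, alternative="two-sided")

alpha = 0.05  # 95% confidence threshold
verdict = "significant" if p_value < alpha else "not significant"
print(f"U = {stat}, p = {p_value:.4f} -> {verdict} at 95% confidence")
```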

How to Use the RPV Calculator

If you take a sneak peek at our testing confidence calculator, you’ll notice it looks different from the standard statistical significance calculators.

Standard Online Calculators

example of standard revenue significance calculator

Blast’s Revenue Per Visitor (RPV) Calculator

screenshot of revenue significance calculator by blast

As mentioned above, you cannot simply enter total visitors and total revenue per variation to determine statistical significance.

To accurately measure whether there is an impact on RPV, you need to have user-level data.

Most businesses choose to integrate their A/B tests with their preferred analytics platform and analyze test performance there. This allows teams to make an apples-to-apples comparison when looking at performance across different channels, such as testing and marketing efforts.

The problem is that while you can see overall revenue for test variations within analytics, it is much more difficult to get access to user-level data.

Unsampled Google Analytics Data Hack

The Blast team has a solution for obtaining user-level data so you can make use of the revenue significance calculator.

It may take a little leg work in the beginning, but your team will reap the benefits for the long term. To get user-level data within Google Analytics, follow the steps below and you’ll be on your way to A/B testing success.

1. Create a Custom Dimension for Client ID

Google Analytics (GA) has recently started offering a new User Explorer report. The best part of this report is that it has a Client ID dimension that tracks user-level behavior, which is specific to browser and device.

screenshot client id dimension in google analytics user explorer report

Now the downside!

In its current state, you can’t access this dimension outside of this report, so your team can’t pull this data into a custom report.

To get around this problem, your team will need to create a custom dimension for the Client ID. This step should take roughly 1-2 hours for your analytics team to create, QA, and implement. Once it’s implemented, you’ll be able to use the Client ID dimension in your test reports as well as other Google Analytics reports.

Google Analytics screenshot: where to create custom dimension for revenue significance calculator

You may think this step isn’t worth the effort and that you can just export the data from the User Explorer report, but that will only work if you have minimal traffic to the site. The User Explorer report caps the data at 10,001 rows.

If your site receives more than 10,000 visitors within the time frame you select, then you won’t be able to see all user-level data and instead will get a sampling of the data. By creating the Client ID custom dimension, you can create a custom report for your test, containing the Client ID, where you’ll be able to capture all the rows of data.

User Explorer Report: Limits Client ID and accompanying revenue data to 10,001 rows.

screenshot: limited rows in google analytics user explorer report

Custom Report: Provides Client ID (via a custom dimension) and accompanying revenue data beyond the 10,001-row limit.

screenshot: google analytics custom report

2. Utilize unSampler to Export All Data

popup showing number of rows google analytics allows for export

As your team uses the Client ID custom dimension within other Google Analytics reports, another challenge lies ahead: Google Analytics caps the number of rows you can export at one time to 5,000.

If you really have the time, you can attempt to export your report data 5,000 rows at a time, but for most people this is completely inefficient. Previous hacks, like altering the number in the URL to show more rows, no longer work.

If your business has Google Analytics 360, then your team can already export all data by utilizing the Unsampled report feature.

Resolving the sampling issues in the standard version of Google Analytics is as simple as creating an unSampler account and linking your Google Analytics account to it. Doing so will enable your team to easily create a test report (where you will have access to your custom dimensions) and export all of your data to CSV.

screenshot of workaround for sampled data in google analytics

image showing where to export unsampled google analytics data to csv file

3. Format & Upload CSV

Once you’ve exported data from your unSampler Report, you’ll need to take a few quick steps to format it so it will be ready to use with the revenue significance calculator. First, you’ll need to filter your data for the control:

screenshot showing how to filter data before using revenue significance calculator

Then copy the revenue data and paste it in a new tab (optional: you can rename the header to Control Revenue).

Repeat this step with your test variation. After doing so, in the new tab you should have two columns for revenue (Control Revenue and Variation Revenue). Please note, if you have more than one test variation, you’ll need to create separate tabs for each one (e.g. Control vs Variation 1, Control vs Variation 2, Control vs Variation 3).

screenshot showing how to filter control data before using revenue significance calculator

Save this new tab as a CSV file (or multiple CSV files if you have more than one test variation) and then it’s ready for the RPV Calculator.
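If you’d rather script this formatting step than do it by hand in a spreadsheet, a small pandas sketch can produce one control-vs-variation CSV per variation. The column names `Experiment Variant` and `Revenue`, and the inline sample rows, are assumptions; match them to your actual export’s headers:

```python
import pandas as pd

# Stand-in for your unsampled export: one row per Client ID.
df = pd.DataFrame({
    "Client ID": ["1.a", "2.b", "3.c", "4.d", "5.e", "6.f"],
    "Experiment Variant": ["Control", "Control", "Control",
                           "Variation 1", "Variation 1", "Variation 1"],
    "Revenue": [0.0, 49.99, 0.0, 79.99, 0.0, 54.25],
})

control = df.loc[df["Experiment Variant"] == "Control", "Revenue"].reset_index(drop=True)

# One CSV per variation, each paired against the control,
# ready to upload to the RPV calculator.
for name, group in df.groupby("Experiment Variant"):
    if name == "Control":
        continue
    variation = group["Revenue"].reset_index(drop=True)
    pair = pd.DataFrame({"Control Revenue": control, "Variation Revenue": variation})
    pair.to_csv(f"control_vs_{name.replace(' ', '_').lower()}.csv", index=False)
```

With a real export you would replace the inline DataFrame with a `pd.read_csv(...)` call on your unSampler file.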

Before uploading your file to the calculator, you can adjust the threshold for determining statistical significance — the default is set at 95%. The last step is simply uploading your file.

where to upload your file to use the revenue significance calculator

The results you get are fast, reliable and easy to understand.

example of revenue significance calculator results

A/B Test Results You Can Trust

While it takes a little bit of effort in the beginning to properly measure revenue per visitor, once it’s set you can easily analyze this KPI for future tests. Further, by using the free online revenue significance calculator, you can trust that the correct method of analysis was applied.

Your team can rely on test performance results to make those important business decisions.

Please share your comments or let us know if you have questions regarding this process or the calculator.

  • Martin

    Good stuff!
    How would you handle a situation where there are not the same number of users in control and variation?

    • Roopa Carpenter

      Hi Martin! The RPV Calculator utilizes the Wilcoxon (Mann-Whitney U) test to compare the means, and so it does not require that your control and test sample sizes be exactly the same. However, generally speaking, it’s a good practice to have similar sample sizes for both groups. Thanks!


  • Nice! We just got a prototype calculator up ourselves for internal needs, but you have more features and polish already available. 🙂
    I was searching for a Mann-Whitney calculator, but since your page is SEO optimized for less specific (and probably more accessible) search terms, I missed it in my initial search.

    What’s your take on the difficulty of reaching significance on the revenue per user metric compared to, for example, transaction CR? (I fully realize the different implications of the different metrics, but testers tend to gravitate to tests that have a higher chance of “winning”.)

    • Roopa Carpenter

      Hi Patrik! I feel that it’s important to have goals or metrics that affect the bottom line for a company. Our philosophy is that testing for the win is not as important as making the right business decision. Oftentimes the right business decision is tied to revenue, and that’s why we recommend having it as a primary KPI. However, other metrics, such as transaction CR, should be considered as well. Thank you!

  • Silvia Bordogna

    Very interesting solution to test revenues!

    Just a question:
    you said that the distribution of RPV should be skewed, with a modal value of 0. Why doesn’t the sample dataset provided in the CSV have any 0 values?

    I’ve tried to implement this test and my RPV (daily) is never 0. Maybe I should calculate RPV grouped by a different time unit? What did you mean in your article?


    • Roopa Carpenter

      Hi Silvia! You are correct that in real RPV data the modal value will be zero; that spike at zero, together with a long tail of high spenders, is what makes the distribution right skewed. The sample data in the CSV, however, is only meant for demonstration purposes regarding visualization and calculations and is not intended to mirror actual RPV data that you will encounter. I hope this helps!

  • Yann

    That’s really great, thanks so much for doing this.
    I just wish the calculator wasn’t limited to 5mb files as my files are generally between 10-20mb, so I can’t use it.

  • paulmkoch

    Hi Roopa, thanks for the tool!

    I tried to quickly estimate how much longer we’d need to run a test, assuming our RPV numbers stay the same. I copied and pasted the identical columns of our actual RPV data to double the sample size, then re-uploaded to see how much the confidence level improved. But, the P value doesn’t change. I can add multiple copies of our live data to the columns, and the P value always stays the same. I’d have expected the higher sample size to increase the confidence level, even if each version’s total RPV stayed the same. Do you know why that’s not happening?


    • Roopa Carpenter

      Hi Paul! You are right that the higher sample size should increase the confidence level even if each variation’s total RPV stays the same. We re-tested the calculator using the same method you outlined to double the sample size, and the confidence level (P value) does change with the higher sample size. I would encourage you to try it again and please let us know if you continue to see this happening! Thank you!

  • Ioana

    Hi! Thanks for this article, sounds great. Can you please clarify the structure of the file that we upload to the calculator? Assuming we have just one variation, would it be a csv file with a column of all revenue values each user has spent in the control, and then another similar column for the variation? Of course, that will include values of 0 for users who haven’t spent anything.


    • Roopa Carpenter

      Hi Ioana! The calculator is meant to compare revenue data for two or more variations and so if you only had one variation, you would need a second column (with a value of 0) for users who haven’t spent anything. Thank you!

      • Ioana

        Hi Roopa, thank you for your reply. So I have a control group and a variation group. The file should have just 2 columns (one for control and one for variation) with the revenue associated with each client ID in each of the two groups, right?

        [I was confused by the headers in the sample dataset which read “control revenue per user” and “learn more revenue per user”, but clearly the metric we need to use in the columns isn’t “Revenue per user” but rather “Client ID Revenue”, right?]

        Thanks a lot

        • Roopa Carpenter

          Hi Ioana! Yes, the columns would contain revenue associated with each client ID. Thank you!

  • Saurabh Sinha

    Nice! I hope I’m not late to the discussion, just had a couple of follow up questions.

    As you mentioned, revenue data is heavily skewed, and the rightmost data is usually most important. A very common occurrence is that the revenue of most customers on the control and variation paths is quite similar, but the top-end users (maybe the top 5 percentile) have very different revenues. Is this method able to identify this difference as significant, and can it isolate the population that is impacted?

    Also, how do we quantify the magnitude of the impact?

    • Jack Dwyer

      Hey Saurabh, Thanks for the question. The calculator has the ability to compare two data samples and determine whether or not the two samples are significantly different. If you wanted to just compare the top 5%, you could manually isolate the data and feed it into the calculator, or you could, as you said, just use all of the data because if most of the data is the same but there is a significant difference at the more extreme end, the calculator will signal a significant difference in the two samples.

      The calculator is not designed to isolate or partition any data, only tell you the difference between two sets. Because it provides metrics on your two columns of data, you can see for yourself which sample is the larger of the two given the RPV the calculator provides after the calculation. Presumably you would know which sample is which before feeding it into the calculator, so you could combine your knowledge of the data with the calculator metrics to make any further conclusion on the data.

      Hope that helps,

  • Joe Minas

    Hi there! If I want to compare spend per user, most of the values are going to be 0. How does this impact the ranking, when the majority of the rows are identical? Also, is there a maximum sample size that this method works with?

    • Jack Dwyer

      Hey Joe, The testing approach we use is designed to accommodate ties (in your case, a lot of zeroes). With the ranking, the calculator will give all of the zeroes in both samples the same ranking, so neither is factored any more or less than another zero. The beauty of this test is that it makes these kind of accommodations while still factoring in every data point. There is no upper limit on sample size, just the 5MB limit on the calculator.

  • Jack Dwyer

    Hey Georgi,
    Thanks for the comment! To touch on your first point, that’s a good recommendation; I’ll see about getting something like that added! For part 2 of your comment, our team of data scientists tested many different datasets using a whole array of methodologies, and we found that for data containing a lot of zeros and ties, the MWW test often picked up on differences that something like the t-test did not detect. As for why we use RPV, as Roopa mentioned in the post itself, some of the aggregated metrics fail to capture blind spots in the data, so by using all of the data, that concern is ameliorated.

    Moving on to why we use MWW, we wanted to go with a test that most probably detected any differences, however minute. We’re principally concerned with the rigor of our testing, and when we weighed our options, parametric vs. non-parametric, we found that the non-parametric test exercised a lot more scrutiny (gave us significance much later) than the t-test. While we do understand that statistical significance isn’t the only thing to consider when comparing two data sets, we know that many people rely on it to make their decision. With something like a t-test yielding significance much sooner, our concern is that people may be moving forward with decisions under the assumption that their two datasets are significantly different, when in fact they may not be. This is our main motivation behind using the MWW. At the end of the day, we encourage teams to research the methodology behind the calculations they use for their primary metrics so they can make an informed decision on what will work best for their data.
    Thanks again for the comment!

    • Hey Jack,

      Thanks for the response. I’m a bit confused by it, since it seems that the first part contradicts the second. If the MWW test is more sensitive than the t-test, doesn’t that make it more prone to type I errors? If it is not the difference in the average revenue per user that it is sensitive to, what is it that it detects, and why is it useful in an A/B test?

      When you say that a t-test yields significance “much sooner”, does it suggest a higher type I error than the guarantee? If so, maybe it’s due to peeking (optional stopping). If not, then I see this as a positive: the t-test is more powerful while maintaining the type I error guarantee, so not only is it not an issue, but it is a significant plus many companies are looking for in their statistical solutions.

      Any elucidation on the above would be appreciated. If you can share simulation codes or results, that would be even better.


  • Allie

    I have a couple additional questions regarding the analysis that can be performed using this method.

    1. According to my understanding of the Mann-Whitney U test, it tells you whether two distributions are significantly different from each other. I would also like to know directionally whether one distribution’s median/mean RPV is higher than the other’s. According to Wikipedia’s explanation of this test, a comparison of means is only feasible when the data points (revenue in this case) are continuous. How is this assumption to be met when in a business AB test there are only a few price points that are possible to choose from, making revenue a discontinuous variable?

    2. Is there a simple way to determine the sample size necessary for this test to yield significant results, especially when most people who visit the site during the test do not convert to paying?

    • Jack Dwyer

      Hey Allie, thanks for the comment! I hope I’m interpreting your questions correctly with my response.

      1. Our test will provide summary statistics (not the typical mean/median, because it’s non-parametric) that will provide insight into which distribution is larger. As for continuity, we are making the assumption that revenue is continuous, not only because it can span from $0.01 to very large rational numbers (it doesn’t need to contain every number, but it could), but also because as the number of data points increases, this assumption will hold.

      2. The test is agnostic to sample size. As a general rule, the smaller the sample, the less definitive the result (similar to the parametric case). In the case of revenue, all of our $0.00 values are still data points, so you may have more data than you think.

      I hope I interpreted everything correctly. It can certainly be a confusing topic, so I appreciate you reaching out!

  • Michael C. Yu

    Hello! Really appreciate having an article going into such details. I have one question, how do you determine the sample size before running the test? For test with the main metric being XXX per users? I searched for this topic but couldn’t find resources about this one. Thanks!

  • Kelvin Ye

    Great Post! Thanks for sharing.

    I ran the tool with our testing data. RPV is almost the same, within less than a .5% difference, but the test result is statistically significant! Does it make sense? How do I tell whether Test or Control is better in this case?

    • Jack Dwyer

      Without seeing the data, it’ll be tough for me to answer. I can say that the test does output values for both test.revenue and control.revenue, so whichever is greater there is the better variant.

      Moving on to your concern, if the data doesn’t have a wide spread, this may happen. I suspect you have low variance within your samples, which is yielding the significance output even with such a small discrepancy. Again, it’ll be tough to tell without having your data, but I suspect this is the case.

      Hope that helps!

  • Joe

    Is average revenue per user = (total revenue / total count of users)? If so, how can you use average revenue per user in a t-test if your test distribution is uneven (i.e. an 80/20 split)?

  • Marcus Oliveira

    Very interesting. I have a question, how can I estimate the right sample size using the Wilcoxon (Mann-Whitney U) ?

Roopa Carpenter
About the Author

Roopa Carpenter is Director, Optimization at Blast. With several years of experience, she drives testing and personalization strategy, implementation, and results analysis for various clients, with a focus on helping improve customer experience. Roopa oversees all optimization-related account activity, identifies user experience (UX) opportunities, creates testing roadmaps, and utilizes a data-driven approach to impact customer purchase behavior and bottom-line metrics. Connect with Roopa on LinkedIn. Roopa Carpenter has written on the Web Analytics Blog.
