Sunday, February 5, 2012

Randomization Distribution | Modernizing An Old Device for CM/DM and Multi-Channel Optimization Generally.

Rampant misunderstandings about multi-intervention studies in care/disease management (CM/DM), other human studies (e.g. sales, education) and non-manufacturing settings generally are:
  1. Differences among test units (e.g. nurses, retail stores, students) will invalidate the study.
  2. Adverse selection could contaminate results (e.g. recently hospitalized; new stores, gap students).
  3. Influences like regression-to-the-mean, HCC-score, store-trends, size etc. will affect findings.
  4. Members (patients) should be randomized to nurses (care managers); ditto students/classes.
  5. A 2nd study or RCT would increase confidence or “validate” the findings.
  6. The false-alarm rate will increase as more things are tested.

Fisher [1] has not been well understood by mathematicians since 1926, and this matters more today in healthcare and in complex multi-channel settings generally. Using randomization together with a couple of other devices, we fix all of the above (simply, for users) and remove man-made constraints that have held back progress. We use the CM example first, then close with the multi-channel sales example to illustrate for all industries:

#1 is completely solved by the simple trick of control-charting pre-study results across nurses, over a time window equal to the planned (randomized!) study period, and finding homogeneity (akin to stability in manufacturing). A dry-run pre-study analysis (like a Heckman-Hotz econometric test) is also essential, or patterns among nurse admit rates can still give spurious findings. This dual homogeneity check has been controversial among PhD statisticians and academics, without reason. Users, more correctly, have no trouble with it.
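As a minimal sketch of the control-charting step, the snippet below computes individuals-chart (XmR) limits across nurses and lists any nurse outside them. The choice of an individuals chart, the function names and the admit rates are all assumptions made for illustration; the post does not say which chart type was used, and the moving range depends on the (arbitrary) order of the nurses.

```python
from statistics import mean

def individuals_chart_limits(values):
    """Return (lower, center, upper) limits for an individuals (XmR) chart."""
    center = mean(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    # 2.66 = 3 / 1.128, the standard constant for moving ranges of two points.
    half_width = 2.66 * mean(moving_ranges)
    return center - half_width, center, center + half_width

def units_outside_limits(values):
    """Return the positions of units whose value falls outside the limits."""
    lower, _, upper = individuals_chart_limits(values)
    return [i for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical pre-study admit rates per 1,000 members, one value per nurse.
pre_study_rates = [310, 295, 330, 305, 288, 342, 315, 299, 321, 308,
                   290, 335, 312, 301, 326, 297, 318, 304, 329, 311]

flagged = units_outside_limits(pre_study_rates)
# An empty list means the chart shows no sign of heterogeneity among nurses.
print("Nurses outside limits:", flagged)
```

An empty result is read here as homogeneity; any flagged nurse would be investigated before the randomized study starts.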

#2 to #4 are solved by correct randomization of nurses to the study. #4 is popular among mathematicians but usually wrong, and in that case it would not improve anything or make money. A closed cohort design will reassure everyone more but isn’t necessary and cannot always be done (e.g. transitional care). Of course, long-term validation must use a closed cohort or a propensity-scored analog, not an open cohort.
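Randomizing nurses to the study can be sketched as drawing nurses at random and attaching each one to a run (row) of the design. The function name, nurse labels and seed below are hypothetical, and the fixed seed is used only so the allocation can be reproduced and audited.

```python
import random

def randomize_units_to_runs(units, n_runs, seed=None):
    """Randomly pick n_runs units and map run index -> unit."""
    units = list(units)
    if len(units) < n_runs:
        raise ValueError("need at least as many units as design runs")
    rng = random.Random(seed)
    # sample() draws without replacement and returns the units in random order.
    chosen = rng.sample(units, n_runs)
    return dict(enumerate(chosen))

# Hypothetical nurse identifiers for a 20-run design.
nurses = [f"nurse_{i:02d}" for i in range(1, 21)]
assignment = randomize_units_to_runs(nurses, n_runs=20, seed=2012)

for run, nurse in assignment.items():
    print(f"run {run:2d} -> {nurse}")
```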

#5 is a little like sending a rowing boat out to see if it is safe for the ocean liner that just sailed through. Confidence remains about the same after similar results, and studying new things for a shorter time trumps replicating, even with one-factor-at-a-time testing [2]. Of course, if a 2nd study is conducted, even randomization would not allow it to be done on the same sample! A fresh random sample is called for, just like in manufacturing.

#6 goes away at ~20+ interventions: false alarms are a problem for small studies devoid of scientific context.

Multi-channel optimization (e.g. sales in stores, online and stimulated through media, and call centers; or care by nurses, pharmacists, automated systems and house-calls to the same population) is solved by simultaneous statistical designs, provided randomization is correctly deployed and the channel tests are set up with a clever new device that’s easy for users (mutual orthogonality). Education cases to improve learning/careers, and then all industries, follow the same model.
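To illustrate what mutual orthogonality means in a simultaneous design, the sketch below builds a standard 20-run, 19-column two-level array (a Plackett-Burman style construction from quadratic residues mod 19) and checks that every pair of columns is balanced against each other. This is a textbook construction chosen for illustration, not necessarily the device the post refers to; all names are hypothetical.

```python
def paley_design_20():
    """Build a 20-run x 19-column array of +1/-1 levels."""
    q = 19
    residues = {(k * k) % q for k in range(1, q)}
    # +1 at position 0 and at the quadratic residues mod 19, else -1.
    first_row = [1 if (i == 0 or i in residues) else -1 for i in range(q)]
    # 19 cyclic shifts of the first row, plus one final run of all -1.
    rows = [first_row[-s:] + first_row[:-s] for s in range(q)]
    rows.append([-1] * q)
    return rows

def is_mutually_orthogonal(design):
    """True if each column is balanced and every column pair has zero dot product."""
    n_cols = len(design[0])
    cols = [[row[j] for row in design] for j in range(n_cols)]
    balanced = all(sum(col) == 0 for col in cols)
    orthogonal = all(
        sum(a * b for a, b in zip(cols[j], cols[k])) == 0
        for j in range(n_cols) for k in range(j + 1, n_cols)
    )
    return balanced and orthogonal

design = paley_design_20()
print(len(design), "runs x", len(design[0]), "interventions")
print("mutually orthogonal:", is_mutually_orthogonal(design))
```

Because each column splits the 20 runs 10 vs. 10 and is orthogonal to every other column, the effect of one intervention can be estimated by a simple average difference without being confounded with the other main effects.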

Optional Note for Professional Statisticians: In a 20-run design, a contrast is simply the calculation that finds the effect of one intervention by averaging the 10 tested runs vs. the other 10 (the counterfactual). Comparison with all possible 10 vs. 10 contrasts gives a yardstick to see if the effect is “real” or essentially due to chance. The histogram on the home page is of 10,000 random contrasts out of more than 200,000 calculated. There are 20C10 = 184,756 contrasts in total; the extras were run so that each contrast was more likely to be included. The histogram of contrasts is from untransformed response data.
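The yardstick described above can be sketched by enumerating every 10 vs. 10 split exactly, rather than sampling splits at random as was done for the histogram. The response values and the treated runs below are invented for illustration (they are not the CM/DM data), and the function names are hypothetical.

```python
from itertools import combinations

def all_contrasts(responses, n_treated):
    """Contrast (treated mean minus control mean) for every possible split."""
    total = sum(responses)
    n_control = len(responses) - n_treated
    contrasts = []
    for treated in combinations(range(len(responses)), n_treated):
        treated_sum = sum(responses[i] for i in treated)
        contrasts.append(treated_sum / n_treated
                         - (total - treated_sum) / n_control)
    return contrasts

def two_sided_p_value(observed, contrasts, tol=1e-9):
    """Share of splits whose contrast is at least as extreme as the observed one."""
    extreme = sum(1 for c in contrasts if abs(c) >= abs(observed) - tol)
    return extreme / len(contrasts)

# Hypothetical responses for 20 runs (e.g. admits per 1,000 per year).
responses = [412, 655, 388, 301, 598, 420, 367, 612, 350, 440,
             690, 376, 455, 580, 702, 398, 634, 671, 625, 660]
# Hypothetical runs where the intervention was applied.
treated_runs = [0, 2, 3, 5, 6, 8, 9, 11, 12, 15]

treated_mean = sum(responses[i] for i in treated_runs) / len(treated_runs)
control_runs = [i for i in range(len(responses)) if i not in treated_runs]
control_mean = sum(responses[i] for i in control_runs) / len(control_runs)
observed = treated_mean - control_mean

contrasts = all_contrasts(responses, n_treated=10)
p_value = two_sided_p_value(observed, contrasts)
print("splits enumerated:", len(contrasts))
print("observed contrast:", observed)
print("randomization p-value:", p_value)
```

No distributional assumption enters the calculation: the p-value is just a count over the splits that randomization could have produced.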

The average of the 200,000+ contrasts is -0.07337, with a maximum of 419.59 and a minimum of -410.97.

The true contrast is at -261.39, from a CM/DM case measuring hospitalization rate per thousand people per year.

The histogram shows that a normal approximation is close, as expected. However, the calculations are all distribution-free. No original assumption of normality (a “bell-shaped curve”) is needed. The histogram indicates visually how well the set of all possible contrasts approximates a normal distribution. The true p-value (from this randomization distribution) is 0.01 vs. a normal approximation (from linear model software) of 0.0033.

Of course, the actual contrast will only be the most extreme of all randomized contrasts (p-value equals 1/184,756 ≈ 0.0000054) if the treated and counterfactual groups have no overlap in the raw data (as driven by effect size).
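The arithmetic behind that smallest attainable p-value can be checked directly. The snippet below counts the 10 vs. 10 splits and adds a small, hypothetical helper for the no-overlap condition; the p-value is read here as one-sided, which is an interpretation rather than something the post states.

```python
from math import comb

# Number of ways to choose which 10 of the 20 runs are "treated".
n_splits = comb(20, 10)
# Smallest attainable one-sided p-value: only the observed split is as extreme.
min_p_value = 1 / n_splits

def groups_do_not_overlap(treated, control):
    """True if every treated value lies strictly on one side of every control value."""
    return max(treated) < min(control) or min(treated) > max(control)

print("splits:", n_splits)
print("minimum p-value:", min_p_value)
```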

Further insights based on the above reveal why randomization of nurses to treatment combinations is correct but randomization of members to nurses usually is not, as it often cannot be managed that way. They also show that modeling nuisance variables (such as prior admit rates, HCC-score, selection criteria etc.) will approximate the same answers as the correct analysis relying on randomization directly. (This back-end analysis, first suggested by Neyman in the 1930s, is not needed. Were it to differ, one might look for multicollinearity problems or otherwise check the model.) Of course, running such a model to see if randomization “worked” is folly. It will always “work” provided the device is used correctly, which it tends not to be. Further consideration also reveals why a single test unit per combination in the study design is ample (in 20+ intervention designs) and replication is a waste. Finally, a common mistake is to try for n>30 per combination whereas n=1 is usually ample.

On misunderstanding #1 (homepage), a trick of analyzing the change in admit rate by nurse (since the prior period) can be popular. But this is only valid if that prior is significant, and then a covariance analysis should be used, since the “change” adds noise and can cause errors.

Power calculations are performed in the usual RCT way (and yield an identical sample-size requirement) but miss the point that they will be pessimistic, as variation usually reduces during large studies. Our earliest case found the standard deviation to be about 1/3 of the prior. The far larger issue is that single-intervention studies (excluding, say, 19 that could have been included with the same resources) have zero power for all the untested things.
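A rough sketch of why a pre-study power calculation turns pessimistic when the standard deviation shrinks: with the common two-group normal-approximation formula, the required number of units scales with the square of the standard deviation, so an SD of 1/3 of the prior cuts the requirement about ninefold. The formula choice, the function name and the numbers are assumptions for illustration, not the calculation used in the case described.

```python
from math import ceil
from statistics import NormalDist

def units_per_group(sd, effect, alpha=0.05, power=0.80):
    """Two-group sample size: n = 2 * ((z_alpha/2 + z_beta) * sd / effect)^2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / effect) ** 2)

# Hypothetical values: prior SD of 300 and a target effect of 300 admits per 1,000.
n_with_prior_sd = units_per_group(sd=300, effect=300)
# Same effect, but with the SD reduced to one third of the prior.
n_with_reduced_sd = units_per_group(sd=100, effect=300)

print("units per group, prior SD:", n_with_prior_sd)
print("units per group, SD at 1/3:", n_with_reduced_sd)
```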

It has not escaped our attention that performance often improves from day 1 of large studies, rendering them attractive to businesses especially if in urgent need of step-change. Also that large studies stop any accidental roulette with customers, members, students etc.


1. Fisher, R.A. (1926). The Arrangement of Field Experiments. J. Min. Agric. G. Br., 33: 503-513.

2. Box, G.E.P. (1966). A Simple System of Evolutionary Operation Subject to Empirical Feedback. Technometrics, Vol. 8, No. 1.