Randomization Distribution | Modernizing An Old Device for CM/DM and Multi-Channel Optimization Generally.
Rampant misunderstandings about multi-intervention studies in care/disease management (CM/DM), other human studies (e.g. sales, education) and non-manufacturing generally are:
- Differences among test units (e.g. nurses, retail stores, students) will invalidate the study.
- Adverse selection could contaminate results (e.g. recently hospitalized; new stores, gap students).
- Influences like regression-to-the-mean, HCC-score, store-trends, size etc. will affect findings.
- Members (patients) should be randomized to nurses (care managers); ditto students/classes.
- A 2nd study or RCT would increase confidence or “validate”.
- False-alarm rate will increase with the more things tested.
Since 1926, Fisher's randomization argument [1] has not been well understood,
even by mathematicians. This matters more today in healthcare and in complex
multi-channel settings generally. Using randomization together with a couple
of other devices, we fix all of the above (simply, for users), free of the
man-made constraints that have held back progress. We use the CM example
throughout, then close with the multi-channel sales example to illustrate for
all industries. The misunderstandings are numbered #1 to #6 in the order listed:
#1 is completely solved by the simple trick of control-charting pre-study
results across nurses over a time window equal to the planned (randomized!)
study period and finding homogeneity (akin to stability in manufacturing). A
dry-run pre-study analysis (like a Heckman-Hotz econometric test) is also
essential, or patterns among nurse admit rates can still give spurious
findings. This dual homogeneity check has been controversial among PhD
statisticians and academics, without reason. Users, more correctly, have no
trouble with it.
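As a rough illustration of the control-chart step, here is a minimal sketch that checks pre-study admit rates across nurses for homogeneity. The text does not say which chart type is used; a u-chart (Poisson counts with unequal exposure) is assumed here, and the nurse counts, exposures and the function name `u_chart` are all hypothetical.

```python
import math

def u_chart(admits, exposure):
    """Return (center, rows). Each row is (rate, lower, upper, in_control)
    using 3-sigma Poisson limits for a rate with unequal exposure."""
    center = sum(admits) / sum(exposure)
    rows = []
    for a, n in zip(admits, exposure):
        half_width = 3 * math.sqrt(center / n)
        lower = max(0.0, center - half_width)
        upper = center + half_width
        rate = a / n
        rows.append((rate, lower, upper, lower <= rate <= upper))
    return center, rows

# Hypothetical pre-study window: admits per nurse, and exposure in
# thousands of member-years, so each rate is "admits per thousand per year".
admits = [52, 47, 61, 55, 49, 58, 95, 50]
exposure = [0.20, 0.19, 0.22, 0.21, 0.20, 0.21, 0.20, 0.19]

center, rows = u_chart(admits, exposure)
flagged = [i for i, row in enumerate(rows) if not row[3]]
print(f"center line: {center:.1f} per thousand per year")
print("nurses outside the limits (0-based index):", flagged)
```

If no nurse falls outside the limits, the pre-study window looks homogeneous in this sense. A flagged nurse is a prompt to investigate before the study starts, not a finding in itself.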
#2 to #4 are solved by correct randomization of nurses to the study.
Randomizing members to nurses (#4) is popular among mathematicians but usually
wrong, and would then not improve anything or make money. A closed cohort
design will reassure everyone more but isn’t necessary and cannot always be
done (e.g. transitional care). Of course, long-term validation must use a
closed cohort or a propensity-scored analog, not an open cohort.
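A minimal sketch of what randomizing nurses to the study can look like in practice: the nurses are shuffled and dealt out to the runs (treatment combinations) of the design, one nurse per run. The nurse IDs, the function name `assign_units_to_runs` and the fixed seed are hypothetical; the seed is there only so the assignment can be reproduced and audited.

```python
import random

def assign_units_to_runs(units, n_runs, seed=None):
    """Randomly assign one test unit to each run of the design.

    Returns a dict mapping run number (1..n_runs) to a unit.
    """
    if len(units) < n_runs:
        raise ValueError("need at least one unit per run")
    rng = random.Random(seed)
    shuffled = list(units)   # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    return {run: unit for run, unit in enumerate(shuffled[:n_runs], start=1)}

# Hypothetical example: 20 nurses assigned to the 20 runs of a design.
nurses = [f"nurse_{i:02d}" for i in range(1, 21)]
assignment = assign_units_to_runs(nurses, n_runs=20, seed=2024)
for run, nurse in assignment.items():
    print(run, nurse)
```

The point of the sketch is that the random step applies to the test units (nurses), not to the members they serve.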
#5 is a little like sending a rowing boat out to see whether it was safe to
sail the ocean liner that has just sailed through. Confidence remains about
the same after similar results, and studying new things for a shorter time
trumps replicating, even with one-factor-at-a-time testing [2]. Of course, if
a 2nd study is conducted, even randomization would not allow it to be done on
the same sample! A fresh random sample is called for, just as in manufacturing.
#6 goes away at ~20+ interventions: false alarms are a problem
for small studies devoid of scientific context.
Multi-channel optimization (e.g. sales in stores, online, stimulated through
media, and via call centers; or care by nurses, pharmacists, automated systems
and house-calls to the same population) is solved by simultaneous statistical
designs, provided randomization is correctly deployed and the channel tests
are set up with a clever new device that’s easy for users
(mutual-orthogonality). Education cases to improve learning/careers, and then
all other industries, follow the same model.
Optional Note for Professional Statisticians: In a 20-run
design, a contrast is simply the calculation to find the effect of one
intervention, by averaging the 10 tested runs vs. the other 10
(counterfactual). Comparison to all possible 10 vs. 10 contrasts gives a
yardstick to see if the effect is “real” or essentially by chance. The
histogram on the home page is of 10,000 random contrasts out of more than
200,000 calculated. There are 20C10 = 184,756 contrasts in total, and the extra
draws were run to make it more likely that every contrast was included. The
histogram of contrasts is from untransformed response data.
The average of the 200,000+ contrasts is -0.07337 with
maximum at 419.59 and minimum at -410.97.
The true contrast is at -261.39, from a CM/DM case measuring
hospitalization rate per thousand people per year.
The histogram shows that a normal approximation is close, as
expected. However, the calculations are all distribution-free. No original
assumption of normality (a “bell-shaped curve”) is needed. The histogram
indicates visually how well the set of all possible contrasts approximates a
normal distribution. The true p-value (from this randomization distribution)
is 0.01 vs. a normal approximation (from linear model software) at 0.0033.
Of course, the actual contrast will only be the absolute
largest among all randomized contrasts (p-value equals 1/184,756 = 0.0000054) if
the treatment and counterfactual have no overlap in the raw data (as driven by
effect size).
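The randomization distribution described in this note can be sketched in a few lines. This is a minimal illustration, not the calculation behind the histogram: it enumerates every 10 vs. 10 split exactly rather than sampling, and the 20 responses are hypothetical, not the CM/DM case data. The names `randomization_distribution` and `exact_p_value` are ours. The two-sided option counts mirror-image splits as equally extreme, which is a choice made for this sketch; the one-sided option counts only splits at least as extreme in the observed direction.

```python
from itertools import combinations

def randomization_distribution(y, n_treated):
    """Treated-minus-control mean difference for every possible split."""
    total = sum(y)
    n_control = len(y) - n_treated
    contrasts = []
    for idx in combinations(range(len(y)), n_treated):
        s = sum(y[i] for i in idx)
        contrasts.append(s / n_treated - (total - s) / n_control)
    return contrasts

def exact_p_value(contrasts, observed, two_sided=True, tol=1e-9):
    """Share of contrasts at least as extreme as the observed one."""
    if two_sided:
        extreme = sum(1 for c in contrasts if abs(c) >= abs(observed) - tol)
    elif observed < 0:
        extreme = sum(1 for c in contrasts if c <= observed + tol)
    else:
        extreme = sum(1 for c in contrasts if c >= observed - tol)
    return extreme / len(contrasts)

# Hypothetical responses for 20 runs (admits per thousand per year).
# The first 10 runs had the intervention; the last 10 did not.
y = [310, 280, 450, 390, 260, 330, 420, 300, 350, 270,
     560, 610, 480, 700, 520, 590, 430, 640, 550, 660]

contrasts = randomization_distribution(y, n_treated=10)
observed = sum(y[:10]) / 10 - sum(y[10:]) / 10

print(len(contrasts))  # 184756, i.e. 20C10
print("observed contrast:", observed)
print("two-sided p:", exact_p_value(contrasts, observed))
print("one-sided p:", exact_p_value(contrasts, observed, two_sided=False))
```

No distributional assumption enters anywhere: the yardstick is built entirely from the observed responses and the set of splits that randomization could have produced. With full enumeration the contrasts average exactly zero, since every split has a mirror image of opposite sign; a sampled version will land close to zero instead.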
Further insights based on the above reveal why randomization of nurses to
treatment combinations is correct, but randomization of members to nurses
usually is not, as care often cannot be managed that way. They also reveal
that modeling nuisance variables (such as prior admit rates, HCC-score,
selection criteria etc.) will approximate the same answers as the correct
analysis relying on randomization directly. (This back-end analysis, first
suggested by Neyman in the 1930s, is not needed. Were it to differ, one might
look for multicollinearity problems or otherwise check the model.) Of course,
running such a model to see if randomization “worked” is folly. It will
always “work” provided the device is used correctly, which it tends not to
be. Further consideration also reveals why a single test unit per combination
in the study design is ample (in 20+ intervention designs) and replication is
a waste. Finally, a common mistake is to try for n>30 per combination,
whereas n=1 is usually ample.
On misunderstanding #1 (homepage), a popular trick is to analyze the change
in admit rate by nurse since the prior period. But this is only valid if the
prior is a significant predictor, and in that case a covariance analysis
should be used, since the “change” adds noise and can cause errors.
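To see why the “change” adds noise, the sketch below compares three variances for the same data: the study-period rate on its own, the change from the prior period, and the residual left after a covariance adjustment (a least-squares regression of the study-period rate on the prior). The data and the function name `noise_comparison` are hypothetical. When the prior carries no information, the change score simply adds the prior's variance to the response; the covariance-adjusted residual is never larger than either alternative.

```python
from statistics import mean, pvariance

def noise_comparison(prior, post):
    """Variance of post, of (post - prior), and of the residual after
    regressing post on prior (covariance adjustment)."""
    m_prior, m_post = mean(prior), mean(post)
    var_prior = pvariance(prior)
    var_post = pvariance(post)
    cov = mean([(a - m_prior) * (b - m_post) for a, b in zip(prior, post)])
    slope = cov / var_prior
    var_change = pvariance([b - a for a, b in zip(prior, post)])
    var_adjusted = var_post - slope * cov  # residual variance of the fit
    return {"post": var_post, "change": var_change,
            "adjusted": var_adjusted, "slope": slope}

# Hypothetical admit rates per nurse: prior period, then study period.
prior = [300, 320, 280, 310, 290, 305, 315, 295]
post = [250, 240, 262, 247, 259, 244, 251, 256]

for name, value in noise_comparison(prior, post).items():
    print(f"{name}: {value:.2f}")
```

A fitted slope near zero says the prior is not informative and should be left out; a slope far from 1 says the change score (which forces the slope to 1) is the wrong adjustment.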
Power calculations are performed in the usual RCT way (and
yield an identical sample-size requirement) but miss the point that they will
be pessimistic, as variation usually reduces during large studies. In our
earliest case, the standard deviation was about 1/3 of the prior. The far
larger issue is that single-intervention studies (excluding, say, 19 that
could have been included with the same resources) have zero power for all the
untested things.
It has not escaped our attention that performance often
improves from day 1 of large studies, rendering them attractive to businesses,
especially those in urgent need of step-change. Large studies also stop any
accidental roulette with customers, members, students etc.
REFERENCES:
1. Fisher, R.A. (1926). The Arrangement of Field Experiments. J. Min. Agric. G. Br., 33: 503-513.
2. Box, G.E.P. (1966). A Simple System of Evolutionary Operation Subject to Empirical Feedback. Technometrics, 8(1).