The following question was sent to me; I thought it useful enough for a full elaboration:

Submitted on 2014/05/12 at 8:23 am

Dr. Fleishman,

I am so happy I found your site. I have been trying to decide how best to analyze the results of a pilot study I conducted (ABA design; N = 4) examining an intervention targeted at people who stutter. During the first phase (AB), a dependent t-test was performed. I am pasting the results of these preliminary analyses below:

Results of the SSI-4 indicated that participants reduced their percent syllables stuttered from M = 7.1 (sd = 1.7) to M = 5.1 (sd = 1.7). Although this did not reach statistical significance, there was a large effect size (d = 1.18). Findings from the Burns Anxiety Inventory revealed a large decrease in the number of items checked off as causing anxious feelings, anxious thoughts, and physical symptoms of anxiety. Anxious thoughts were significantly higher before yoga treatment (M = .70; sd = .45) than after yoga (M = .27; sd = .32), as indicated by a significant paired t-test, t(3) = 5.58, p = .011; d = 1.10 (large effect). On the OASES, participants indicated positive changes in the general perception of their impairment, with improved reactions to their stuttering, reduced difficulties speaking in daily situations, and improved satisfaction with their quality of life related to communication. Overall perceptions of speakers’ experiences of stuttering were significantly more negative before yoga treatment (M = 2.70; sd = .22) than after yoga (M = 2.32; sd = .17), as indicated by a significant paired t-test, t(3) = 8.01, p = .004, d = 1.93 (large effect).

I am now analyzing the results, including the follow-up measures. After reading the previous posts, I realize that I should be focusing on reporting descriptives, rather than trying to figure out the appropriate statistical test to use. I just read the results of a study similar in design to mine, with a sample size of 3. In that study the authors used a technique called “Split Middle Method of Trend Estimation”. I have never heard of this technique. Could you explain how to perform such an analysis? Would you recommend using this type of analysis?

I am trying to get the results of this small pilot study published, but am worried about just reporting descriptives. Is there a bias toward not publishing studies that are purely descriptive in nature?

Heather

I believe you are asking two questions: 1) information about the split-middle method, and 2) publication of descriptive studies.

1) I had never heard of the split-middle test. I did a quick search on the web and located “http://physther.org/content/62/4/445.full.pdf”, which gives a complete description of how to do the calculation (pages 448-449). It is an approach which examines the trend of the data by dividing the observations into two halves based on date. It then computes the median of each half and plots a line between the two medians. This approach uses the ordinal information in the data. Ordinal information is not bad: it is certainly far better than nominal, though slightly less powerful than interval. When you have a small N, however, you need all the information you can get. One issue with a small N is that a single outlier could vastly affect your conclusions. Did you see any? If not, then nonparametrics aren't necessary. With regard to your study, it does not appear that you are looking at time (date), so I'm not sure this approach is applicable for your study.
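For readers who want to try it, here is a minimal sketch of the split-middle calculation as described above. The data are hypothetical, and the convention of dropping the middle observation for odd-length series is one of several used in the literature:

```python
from statistics import median

def split_middle_trend(y):
    """Split-middle trend line: split the series in half by time order,
    take the median time and median value of each half, and draw the
    line through those two median points."""
    n = len(y)
    half = n // 2  # for odd n, this drops the middle observation
    first_t, first_y = list(range(half)), y[:half]
    second_t, second_y = list(range(n - half, n)), y[n - half:]
    x1, y1 = median(first_t), median(first_y)
    x2, y2 = median(second_t), median(second_y)
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept  # trend line: y = slope * t + intercept

# Hypothetical session-by-session measurements
slope, intercept = split_middle_trend([2.0, 4.0, 6.0, 8.0])
print(slope, intercept)  # 2.0 2.0, i.e. an upward trend of 2.0 per session
```

Because the line runs through medians rather than means, a single outlier moves it far less than an ordinary least-squares fit would.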

Reply by Heather: I did not see any outliers as per examination of the boxplots. I think there is a common misconception regarding sample size and parametric analyses, to which I had fallen prey: automatically run nonparametric analyses on studies with small sample sizes.

2) Your study. As I've written in my blog, I'm a strong advocate of effect sizes. Yes, with N = 4, I would focus on descriptive results. Are these the only three key parameters? Or were these the best three out of ten, or twenty, or one hundred parameters?

Reply by Heather: Yes, these were the three best results. We utilized 3 measures but within those three measures we looked at individual section scores in addition to total score.

Reply by Allen: Then the alpha level has much less meaning. Your reply didn't indicate the number of sub-scales (p-values) you tested. If there were 10 per scale (subscales plus total), then 3 × 10 = 30 p-values. With 30 p-values, the likelihood of finding at least one fortuitously statistically significant result is 1.0 – (1.0 – 0.05)^30 = 79%. That is, with completely random data and 30 tests, you would find at least one statistically significant result about four times out of five. The same applies to the detectable difference: out of all the scales and sub-scales, one must be the biggest.
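That familywise calculation can be reproduced in a couple of lines (the 30 tests at alpha = 0.05 come from the reply above; plug in your actual count):

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive among n independent tests,
    each run at the given per-test alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

print(round(familywise_error_rate(30), 2))  # 0.79, i.e. about 4 times out of 5
```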

This is THE reason statisticians rigorously adhere to the protocol’s key parameter/hypothesis. By stating a priori what we are looking at we can meaningfully see if we hit the target.

Fortunately, two of your three parameters are significant. I also strongly believe in confidence intervals. Looking at your results, for the weakest parameter, the first (Percent of Syllables Stuttered), you found a mean difference of 2.0 (sd = 1.7), for an effect size of 1.18. A 95% CI (I believe) is -0.70 to +4.70. As you noted, since the lower end is minus 0.7, you cannot discount that the intervention's magnitude could be zero, or even deleterious (negative 0.7 points, or an effect size of -0.4). However, your best estimate of the true effect, the mean, is that the intervention improved scores by 2.0 points (effect size of 1.2). Finally, the improvement could be as large as 4.7 points (effect size of 2.8). The effect size of -0.4 is a moderate negative effect, but 1.2 is quite large, and the potential benefit of d = 2.8 is huge. My conclusion from these observations is that the intervention is indeed potentially useful. What you did was likely quite positive, especially for these 4 subjects.
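That 95% CI can be checked from the summary statistics alone; a sketch using scipy, with the mean difference, sd, and N taken from the results quoted above (n = 4, so 3 degrees of freedom):

```python
from math import sqrt
from scipy.stats import t

def paired_ci(mean_diff, sd_diff, n, conf=0.95):
    """CI for a mean difference, using the t distribution with df = n - 1."""
    se = sd_diff / sqrt(n)                  # standard error of the mean difference
    crit = t.ppf((1 + conf) / 2, df=n - 1)  # two-sided critical value (3.18 for df = 3)
    return mean_diff - crit * se, mean_diff + crit * se

lo, hi = paired_ci(mean_diff=2.0, sd_diff=1.7, n=4)
print(round(lo, 1), round(hi, 1))  # -0.7 4.7
```

Note how wide the interval is: with only 3 degrees of freedom, the critical t value is 3.18 rather than the familiar 1.96.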

I am very concerned that the effect size of d = 1.18 was not statistically significant, while the effect size for anxious thoughts (d = 1.10) was. If both used N = 4 and a paired t-test, then there is a MAJOR ERROR somewhere. You CANNOT have a smaller effect size (1.1) be significant while a larger effect size (1.2) is not, in the same parallel analysis.

Reply by Heather: I recalculated the effect sizes and got 0.63 (not 1.18) for the nonsignificant test and 2.79 (not 1.10) for the test that reached statistical significance.

Reply by Allen: Although it wasn't significant, the 0.63 effect size is still a large treatment effect. That is, the intervention shifted the responses almost two-thirds of a standard deviation, a very respectable change.

I am assuming you calculated the effect size correctly and were not reporting the t-test value; for example, that you divided by the standard deviation, not the standard error, of the difference in means. My only comment here is that an effect size of 2.79 is very, very huge: the two distributions essentially do not overlap. I work primarily in drug studies, where effect sizes of 0.3 are common and an effect size of 1.0 is very rarely seen, at least in double-blind, randomized trials. My gut says that an effect size of 2.79 is not a realistic treatment effect. Perhaps, for example, you measured anxiety levels minutes after your yoga treatment, rather than the next day or week.
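To make that sd-versus-se distinction concrete: for paired data, Cohen's d divides the mean difference by the standard deviation of the differences, while the t statistic divides by the standard error, so confusing them inflates "d" by a factor of the square root of n. A sketch using the originally quoted summary statistics for the first parameter (mean difference 2.0, sd 1.7, n = 4):

```python
from math import sqrt

mean_diff, sd_diff, n = 2.0, 1.7, 4   # summary stats quoted for the first parameter

d = mean_diff / sd_diff               # Cohen's d for paired data: divide by the sd
se = sd_diff / sqrt(n)                # standard error of the mean difference
t_stat = mean_diff / se               # paired t statistic: equals d * sqrt(n)

print(round(d, 2), round(t_stat, 2))  # 1.18 2.35 -- reporting t as "d" would double it here
```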

One problem with this study is that there may be alternative explanations for the results; a p < 0.05 cannot rule them out. Were the subjects or raters blinded, etc.? Consider, for example, the 'good-bye' effect. Such threats could easily invalidate all of your conclusions. One of my favorite books is a 71-page gem by Campbell and Stanley, titled Experimental and Quasi-Experimental Designs for Research (1963). See their page 8, '2. One-Group Pretest-Posttest Design'. They would say your design is delinquent in that it is potentially invalidated by a) history, b) maturation, c) testing, d) instrumentation, e) selection-by-maturation (etc.) interactions, f) testing-by-intervention interaction, g) selection-by-intervention interaction, and other potential flaws. When my wife was in graduate school doing her master's thesis, she ran an intervention study. All it may have proved is that her subjects liked her and wanted to help (give her positive results). It sufficed in that it earned her a master's degree.

Reply by Heather: I agree with you. This study was definitely not a randomized, controlled, double-blind study.

However, let me just focus on the results, not the design. You have three parameters which apparently give strong, parallel results on the favorability of your intervention. To finally answer your last question: yes, journals hate to publish results which do not have that magic ‘p < 0.05’. As noted above, the first parameter indicates that the intervention could have a moderate negative effect (-0.4). [They would actually say (focus on) the possibility that the results could be zero – ignorant fools.] Where does that leave you?

- Well, you could submit the results. Two of your three parameters are statistically significant. That might suffice, assuming that you examined only these three parameters and they were all key parameters. The intervention is likely to be clinically useful. But with N = 4, the results would appear tenuous at best to reviewers. You did say this was a pilot study.
- Or you could increase your sample size. A quick power analysis indicates that you would need only 8 subjects with your current design for 80% power and a paired t-test. Alternatively, if you used a two-parallel-group randomized trial, the N would be 26 (13 per group).
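Those sample sizes can be reproduced with a noncentral-t power calculation. A sketch using scipy, assuming two-sided alpha = 0.05, 80% power, and the originally reported d = 1.18 (the exact inputs behind the quick analysis above are my guess):

```python
from math import sqrt
from scipy.stats import t, nct

def power_paired(n, d, alpha=0.05):
    """Power of a two-sided paired t-test; noncentrality = d * sqrt(n)."""
    crit = t.ppf(1 - alpha / 2, df=n - 1)
    ncp = d * sqrt(n)
    return nct.sf(crit, n - 1, ncp) + nct.cdf(-crit, n - 1, ncp)

def power_two_group(n_per_group, d, alpha=0.05):
    """Power of a two-sided two-sample t-test; noncentrality = d * sqrt(n/2)."""
    df = 2 * n_per_group - 2
    crit = t.ppf(1 - alpha / 2, df=df)
    ncp = d * sqrt(n_per_group / 2)
    return nct.sf(crit, df, ncp) + nct.cdf(-crit, df, ncp)

def smallest_n(power_fn, d, target=0.80):
    """Smallest n reaching the target power."""
    n = 2
    while power_fn(n, d) < target:
        n += 1
    return n

print(smallest_n(power_paired, 1.18))     # 8 subjects for the paired design
print(smallest_n(power_two_group, 1.18))  # 13 per group for a parallel-group design
```

The paired design needs fewer subjects because each person serves as their own control, removing between-subject variability from the comparison.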

My best suggestion is to allow the pilot study to suggest a larger and potentially better study and not to treat a pilot study like the final study. Of course, you could do both. Attempt to publish, while completing the full study.

Comment by Heather: I still, however, would like to get it published as this intervention shows promise and warrants a larger, more controlled study.

Comment by Allen: Best of luck to you.

In regards to the earlier reply (that with 30 p-values, the likelihood of finding at least one fortuitously statistically significant result is 79%):

Question: I was not running these analyses simultaneously. Is a Bonferroni correction still necessary?
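Whatever the answer for non-simultaneous analyses, the mechanics of the Bonferroni correction itself are simple: divide alpha by the number of tests in the family, or equivalently multiply each p-value by that count. A sketch, assuming the hypothetical family of 30 tests from the reply above:

```python
n_tests = 30   # hypothetical count from the reply above
alpha = 0.05

# Option 1: test each p-value against a stricter per-test alpha
alpha_adjusted = alpha / n_tests
print(round(alpha_adjusted, 5))  # 0.00167

# Option 2: inflate each p-value instead (capped at 1.0)
p_values = [0.011, 0.004]  # the two significant results reported above
adjusted = [round(min(1.0, p * n_tests), 2) for p in p_values]
print(adjusted)  # [0.33, 0.12] -- if the family really were 30 tests, neither survives
```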