Statistics and Methods Guidelines for Authors and Reviewers

Seamus Donnelly, Australian National University,
Language Development Research statistics consultant


Authors should follow the reporting standards set out in Journal Article Reporting Standards for Quantitative Research in Psychology: The APA Publications and Communications Board Task Force Report: https://www.apa.org/pubs/journals/releases/amp-amp0000191.pdf

Authors should report all parameter estimates, including random effects, from all statistical models reported in the paper. Authors should also report standard errors or confidence intervals for fixed effects. These can be in supplementary materials (or on the Open Science Framework) if necessary. Even if random effects aren’t of substantive interest, reporting them is helpful as doing so (a) clarifies model specification to readers, (b) indicates the extent to which experimental effects vary across participants and items, and (c) can provide information about model non-convergence or misspecification. Additionally, if authors choose to use Bayesian methods, they should report their priors, as well as the number of chains, the number of iterations and convergence statistics for the chains (R-hat values and effective sample sizes).
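To make the R-hat statistic concrete, the following is a minimal Python sketch of the classic (non-split) Gelman-Rubin calculation for a single parameter. The function name and the simulated chains are illustrative assumptions; current Bayesian software typically reports a more refined rank-normalized, split-chain version, so values from this sketch may differ slightly from those printed by such packages.

```python
import random
import statistics


def gelman_rubin_rhat(chains):
    """Classic (non-split) R-hat for one parameter.

    `chains` is a list of equally long lists of posterior draws.
    """
    n = len(chains[0])
    if len(chains) < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    chain_means = [statistics.fmean(c) for c in chains]
    # W: average of the within-chain variances
    within = statistics.fmean([statistics.variance(c) for c in chains])
    # B: n times the variance of the chain means
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


# Simulated draws (hypothetical): four chains that agree, and four that do not.
rng = random.Random(1)
mixed = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
stuck = [[rng.gauss(shift, 1) for _ in range(1000)] for shift in (0, 0, 3, 3)]
print("well-mixed chains:", round(gelman_rubin_rhat(mixed), 3))
print("chains in different regions:", round(gelman_rubin_rhat(stuck), 3))
```

Values close to 1 suggest that the chains agree with one another; values clearly above 1 suggest that they have not converged to the same distribution.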

Authors should also test and report on the plausibility of a model’s assumptions. All statistical models make assumptions. Authors should be explicit about the assumptions of their models and should assess the plausibility of these assumptions by examining residuals, fitted values and influence statistics (for a good overview, see Fox, 2016, Applied Regression Analysis and Generalized Linear Models, and the associated R package car). A detailed discussion is not required in the main text of the paper, but relevant diagnostics should at a minimum be mentioned in a footnote. A model’s assumptions need not be satisfied perfectly, but acknowledgement of mild departures from normality of residuals, linearity and homoskedasticity can be informative to readers and reviewers. In cases where the data markedly violate an assumption of a pre-registered model and the authors choose a more suitable model or transformation, the authors should report results from both the pre-registered and the more appropriate model.

Pre-registration of all statistical analyses is preferred, ideally in the form of code for freely available analysis software (e.g., R), ideally tested on simulated data (e.g., using the simr package). Authors are also encouraged to pre-register their sample size, exclusion criteria, experimental methods and stimuli (https://osf.io/8uz2g/). The journal does not provide a repository for pre-registration materials, analysis code or simulated data; authors should instead use a public repository such as the Open Science Framework (https://osf.io/), linked to in the manuscript. For research that could be classified as “medical research involving human subjects”, authors must comply with the World Medical Association Declaration of Helsinki (https://www.wma.net/policies-post/wma-declaration-of-helsinki-ethical-principles-for-medical-research-involving-human-subjects/), which states that “Every research study involving human subjects must be registered in a publicly accessible database before recruitment of the first subject”.

For non-medical research, pre-registration of statistical analyses (or any part of the study design) is not mandatory. However, where analyses are not pre-registered, the journal will expect to see a particularly clear justification for the analysis choices made, and evidence of the extent to which the reported results hold under alternative justifiable analyses. If authors subsequently discover an error in a pre-registered analysis, they should instead/additionally report a corrected analysis, alongside a clear explanation of the reasons for the departure from the pre-registered analysis. In particular, if the assumptions of the pre-registered model are badly violated and the authors consider a transformation/more appropriate model, they should report the results of both the pre-registered and the additional model.

Exploratory/unplanned analyses (i.e., analyses conceived after having seen the data) are encouraged. However, whether or not a pre-registration was submitted, authors must clearly differentiate analyses that were planned before seeing the data from exploratory/unplanned analyses. That is, authors must never hypothesize after results are known (http://goodsciencebadscience.nl/?p=347).

Sample size. Ideally, authors should either (a) pre-register a sample size (ideally calculated on the basis of data from previous studies and/or simulated data; e.g., https://www.journalofcognition.org/articles/10.5334/joc.10/) or (b) use a sequential testing procedure, which can be done in both frequentist (Lakens, 2020, https://psyarxiv.com/9yegd/) and Bayesian analysis frameworks (https://link.springer.com/article/10.3758/s13423-017-1230-y). Note that, for mixed models, power will be affected by the n at each level of sampling (e.g., participants, items and trials), so the number of trials should be justified as well.
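As an illustration of simulation-based sample-size planning, the sketch below estimates power for a deliberately simple two-group design by repeatedly simulating data and counting how often the test is significant. The effect size, the group sizes and the large-sample z approximation are all assumptions made to keep the example within the Python standard library. A real power analysis for a mixed model would simulate from the planned model, including its random effects and the number of trials per participant.

```python
import random
from statistics import NormalDist, fmean, variance


def two_sided_p(sample_a, sample_b):
    """Large-sample z approximation to a Welch-type two-group test."""
    se = (variance(sample_a) / len(sample_a)
          + variance(sample_b) / len(sample_b)) ** 0.5
    z = (fmean(sample_a) - fmean(sample_b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulated_power(n_per_group, effect_size, n_sims=2000, alpha=0.05, seed=0):
    """Proportion of simulated experiments with p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        if two_sided_p(treated, control) < alpha:
            hits += 1
    return hits / n_sims


# Hypothetical standardized effect of 0.5 at three candidate sample sizes.
for n in (20, 50, 100):
    print(n, "per group -> estimated power", simulated_power(n, effect_size=0.5))
```

Setting `effect_size=0.0` gives a quick check that the simulated false-positive rate is close to the nominal alpha.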

Data-peeking (i.e., running analyses, then testing more participants in an attempt to reach a criterion for statistical significance) is expressly prohibited (https://neuroneurotic.net/2016/08/25/realistic-data-peeking-isnt-as-bad-as-you-thought-its-worse/), unless it is acknowledged and corrected for (e.g., https://www.sciencedirect.com/science/article/pii/S0163638318300894).
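The cost of uncorrected data-peeking can be seen in a small simulation. The sketch below generates data with no true effect and compares a single test at the final sample size with testing after each batch of participants and stopping at the first significant result. The batch sizes and the z approximation are illustrative assumptions.

```python
import random
from statistics import NormalDist, fmean, stdev


def one_sample_p(values):
    """Two-sided p value for mean = 0 (large-sample z approximation)."""
    z = fmean(values) / (stdev(values) / len(values) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(peek_at, n_sims=2000, alpha=0.05, seed=0):
    """Share of null datasets declared significant at any look in `peek_at`."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        data = []
        for n in peek_at:
            # Test more participants until the next planned look.
            while len(data) < n:
                data.append(rng.gauss(0.0, 1.0))
            if one_sample_p(data) < alpha:
                false_positives += 1
                break  # stop collecting as soon as p < alpha
    return false_positives / n_sims


print("one look at n = 100:", false_positive_rate([100]))
print("looks at n = 20, 40, 60, 80, 100:",
      false_positive_rate([20, 40, 60, 80, 100]))
```

With a single look the false-positive rate should sit near the nominal 5%, whereas repeated looks without correction push it noticeably higher. This is the problem that sequential testing procedures are designed to correct.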

Model Specification. Both frequentist and Bayesian analysis frameworks are acceptable. Whichever framework is chosen, authors should avoid dichotomous interpretations (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4877414/) that focus solely on the distinction between findings that are “significant” and “non-significant” at p<0.05, not least because, in some circumstances, a frequentist p value just under 0.05 can equate to stronger evidence for the null hypothesis than for the effect under a Bayesian analysis (http://daniellakens.blogspot.com/2014/09/bayes-factors-and-p-values-for.html). Rather, analyses should focus on estimates of effect size and credible/confidence intervals, and not solely on whether these intervals include zero.

Likelihood function. Where regression and related techniques are used (e.g., generalized linear models, mixed/multilevel models, SEM, etc.), authors should justify their choice of likelihood function given the nature of their dependent variable. See Gelman and Hill (2007, Data Analysis Using Regression and Multilevel/Hierarchical Models) or McElreath (2016, Statistical Rethinking) for examples of common likelihood functions in social science data. As a starting point, consider this list of sensible likelihood functions and transformations for common dependent variables in psychology. These suggestions are not authoritative, and other transformations and likelihood functions can certainly be used if justified.

                                                                       

| Data type | Possible likelihood function | Possible transformation |
| --- | --- | --- |
| Symmetrical data | Normal distribution | |
| Symmetrical data with outliers | t distribution | |
| Proportions/percentages | Beta distribution (see Smithson and Verkuilen, 2006, https://psycnet.apa.org/record/2006-03820-004) | Logit transformation |
| Binary data, aggregated binary data (e.g., correct vs incorrect; number of correct trials out of all trials) | Binomial distribution | Logit transformation |
| Continuous data with a floor (e.g., reaction times, durations) | Log-normal distribution (see Rouder and Province, http://pcl.missouri.edu/node/148) or Gamma distribution (see Lo and Andrews, 2015, https://www.frontiersin.org/articles/10.3389/fpsyg.2015.01171/full) | Log transformation, inverse transformation |
| Count data with a floor (e.g., number of times a participant used a particular construction) | Poisson distribution, negative binomial distribution (see Coxe, West and Aiken, 2009, https://pubmed.ncbi.nlm.nih.gov/19205933/) | Log transformation (plus some correction for 0s) |
| Ordinal data (e.g., Likert scales) | Ordinal regression models (see Bürkner & Vuorre, 2019, for examples, https://journals.sagepub.com/doi/10.1177/2515245918823199) | |

   
Confounding variables. Authors should justify the inclusion of each confounding variable. In general, it is appropriate to control for variables that affect the independent variable, the dependent variable or both, unless the relationship between the confounding variable and the dependent variable is mediated entirely by the independent variable. It is generally not appropriate to control for variables that are affected by the independent variable (potential mediators), as doing so will bias the coefficient of the independent variable. See VanderWeele (2019) for a detailed discussion (https://link.springer.com/article/10.1007/s10654-019-00494-6). Moreover, authors should be appropriately cautious about claims to have fully controlled for potentially confounding variables, since this is possible only when these variables are measured essentially perfectly. See Westfall and Yarkoni (2016) for a more detailed discussion (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152719).

Random Effects Structures. For experimental research, researchers should ideally estimate all random effects implied by the design (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3881361/). However, in complex experimental designs and observational studies, such a strategy will often result in an empirically unidentifiable model. In such cases, the variance/covariance matrix may be degenerate even if the program does not issue a warning. We therefore suggest that authors consider the strategies in Bates, Kliegl, Vasishth & Baayen (2015) for simplifying complex mixed models (https://arxiv.org/abs/1506.04967). Alternatively, Bayesian methods may provide sufficient regularization for these contexts (https://arxiv.org/abs/1701.04858).

Model Testing. Because the degrees of freedom of a linear mixed model are unknown, calculating p values for these models is difficult. Options for calculating p values for fixed effects, from worst to best, are (1) likelihood ratio tests, (2) parametric bootstrapping, and (3) the Kenward-Roger or Satterthwaite approximation (using REML), at least according to https://link.springer.com/article/10.3758/s13428-016-0809-y. However, as there is no consensus amongst statisticians on the solution to this problem, often the best approach is to check several of these approximations and note any substantial divergences.

Null effects. Authors should not take absence of evidence as evidence of absence. Rather, claims of theoretically meaningful null effects should be supported with a Bayes Factor analysis (https://www.frontiersin.org/articles/10.3389/fpsyg.2014.00781/full; http://pcl.missouri.edu/bayesfactor) or frequentist equivalence testing (http://daniellakens.blogspot.com/2016/12/tost-equivalence-testing-r-package.html). This is true for naturalistic as well as experimental data. For example, a claim that naturalistic data show a lack of morpho-syntactic productivity should be supported by evidence that the data are in principle sufficiently dense to detect productivity, for example by using adult speech as a control (e.g., https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0119613). If Bayes Factors are used, authors should clearly state and justify their choice of priors for the null and alternative hypotheses.
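The logic of frequentist equivalence testing (two one-sided tests, TOST) can be sketched as follows. The function name, the equivalence bounds and the data are hypothetical, and a z approximation stands in for the t distribution that dedicated packages would use, so this is an illustration of the idea, not a substitute for those tools.

```python
from statistics import NormalDist, fmean, variance


def tost_p_value(sample_a, sample_b, lower_bound, upper_bound):
    """TOST p value: the larger of the two one-sided p values.

    A small value supports the claim that the mean difference lies
    inside (lower_bound, upper_bound), given in raw units.
    """
    diff = fmean(sample_a) - fmean(sample_b)
    se = (variance(sample_a) / len(sample_a)
          + variance(sample_b) / len(sample_b)) ** 0.5
    norm = NormalDist()
    p_above_lower = 1 - norm.cdf((diff - lower_bound) / se)  # H0: diff <= lower
    p_below_upper = norm.cdf((diff - upper_bound) / se)      # H0: diff >= upper
    return max(p_above_lower, p_below_upper)


# Hypothetical scores for two groups with identical means.
group_a = [i % 10 for i in range(200)]
group_b = list(group_a)
print("bounds of +/- 1.0:", tost_p_value(group_a, group_b, -1.0, 1.0))
print("bounds of +/- 0.1:", tost_p_value(group_a, group_b, -0.1, 0.1))
```

The same observed difference of zero supports equivalence for the wide bounds but not for the narrow ones, which is why the equivalence bounds need to be justified in advance on theoretical grounds.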

Matching. Similarly, authors should not use inferential tests to compare participant groups or stimuli on baseline characteristics, and take the absence of a significant difference as evidence of successful matching (https://arxiv.org/abs/1602.04565). Rather, such measures should be included as covariates in the statistical model. If including such variables in the model is not possible, authors should report effect sizes for the differences in the matched variables across conditions, as these will be more informative as to the magnitude of the difference in the matched variable than p values.
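As a simple example of reporting an effect size for a matching variable, the sketch below computes Cohen's d with a pooled standard deviation. The ages are invented for illustration.

```python
from statistics import fmean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (fmean(group_a) - fmean(group_b)) / pooled_var ** 0.5


# Hypothetical ages in months for two small "matched" groups.
ages_group_1 = [24, 26, 25, 27, 23]
ages_group_2 = [25, 27, 26, 28, 24]
print("Cohen's d for age:", round(cohens_d(ages_group_1, ages_group_2), 2))
```

With five children per group, a one-month difference of this kind corresponds to a t statistic of only -1, which is far from conventional significance, yet the standardized difference is more than half a standard deviation. Reporting d makes the size of the mismatch visible in a way that the p value does not.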

Interpreting interactions. Authors must cite the relevant test when making claims about interactions. For example, finding a significant effect in group 1 but not group 2 does not mean that the effect is different across the two groups; such a claim should only be made after testing the relevant interaction (http://www.stat.columbia.edu/~gelman/research/published/signif4.pdf). Moreover, if an interaction is significant, and authors wish to examine the effect within each group separately, they should not fit separate models to each group. These effects can be tested in frequentist statistics by doing post hoc tests of interactions (for example using the phia package), and in Bayesian statistics by computing the relevant quantities from the posterior distribution (for example using the hypothesis function in brms).
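The arithmetic behind this warning can be shown with a pair of made-up estimates. The sketch below assumes two independent effect estimates with known standard errors and tests their difference with a z test. In a real analysis the difference would be estimated as an interaction term within a single model.

```python
from statistics import NormalDist


def two_sided_p(estimate, std_error):
    """Two-sided p value from a z statistic."""
    return 2 * (1 - NormalDist().cdf(abs(estimate / std_error)))


def difference_test(est_1, se_1, est_2, se_2):
    """z test for the difference between two independent estimates."""
    diff = est_1 - est_2
    se_diff = (se_1 ** 2 + se_2 ** 2) ** 0.5
    return diff, se_diff, two_sided_p(diff, se_diff)


# Hypothetical effects (with standard errors) in two groups.
p_group_1 = two_sided_p(25, 10)   # "significant" on its own
p_group_2 = two_sided_p(10, 10)   # "non-significant" on its own
diff, se_diff, p_diff = difference_test(25, 10, 10, 10)
print("group 1 p:", round(p_group_1, 3))
print("group 2 p:", round(p_group_2, 3))
print("difference:", diff, "SE:", round(se_diff, 1), "p:", round(p_diff, 3))
```

Here one effect is significant and the other is not, yet the difference between them is itself not significant, so these data would not license a claim that the effect differs across groups.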

Dichotomization. Authors should not use median splits, or otherwise dichotomize continuous predictors (e.g., age, vocabulary), unless there is some compelling theoretical reason to do so (e.g., above/below cutoff on a diagnostic test) (https://scholar.google.co.uk/scholar?hl=en&as_sdt=0%2C5&q=dichotomize+continuous+predictors&btnG=).
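A short simulation illustrates what is lost through dichotomization. The sketch below generates a continuous predictor and a linearly related outcome, then compares the correlation obtained with the continuous predictor against that obtained after a median split. The sample size and the slope are arbitrary assumptions.

```python
import random
from statistics import fmean, median


def pearson_r(x, y):
    """Pearson correlation computed from sums of squares."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)
predictor = [rng.gauss(0, 1) for _ in range(5000)]         # e.g., vocabulary score
outcome = [0.5 * p + rng.gauss(0, 1) for p in predictor]   # linearly related outcome

cut = median(predictor)
split = [1.0 if p > cut else 0.0 for p in predictor]       # median split: high vs low

r_continuous = pearson_r(predictor, outcome)
r_split = pearson_r(split, outcome)
print("continuous predictor r:", round(r_continuous, 3))
print("median-split predictor r:", round(r_split, 3))
```

For a normally distributed predictor, a median split is expected to shrink the correlation to roughly 80% of its original size, which amounts to discarding a substantial share of the sample's information.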

Treatment of outliers. Authors must report and justify their criteria for defining and removing outliers. Keep in mind that outliers are always defined relative to some model; that is, outliers are data points that deviate from their predicted values. Therefore, one approach to defining outliers is to look at standardized residuals, which quantify, in standard deviations, how far a given data point is from its predicted value.
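For a simple linear regression with one predictor, standardized residuals can be computed by hand as in the sketch below. The data and the cutoff of 2 are illustrative assumptions; for mixed or generalized models, the residual functions of the fitting software should be used instead.

```python
from statistics import fmean


def standardized_residuals(x, y):
    """Internally studentized residuals from a least-squares line y = a + b*x."""
    n = len(x)
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual standard deviation with n - 2 degrees of freedom
    sigma = (sum(e ** 2 for e in residuals) / (n - 2)) ** 0.5
    # Leverage of each point in a one-predictor regression
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [e / (sigma * (1 - h) ** 0.5) for e, h in zip(residuals, leverage)]


# Hypothetical data: a roughly linear trend with one aberrant point at x = 6.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 18.0, 14.1, 15.9, 18.2, 19.8]
z_resid = standardized_residuals(x, y)
flagged = [i for i, r in enumerate(z_resid) if abs(r) > 2]  # cutoff is an assumption
print("indices with |standardized residual| > 2:", flagged)
```

Whatever cutoff is chosen, it should be stated in the paper (ideally in the pre-registration), along with the number of data points it removes.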