Multiple imputation for life-course sequence data
Report posted on 2014-01-28, 13:51, authored by Brendan Halpin
As holistic analysis of life-course sequences becomes more common, using optimal matching (OM) and other approaches, the problem of missing data becomes more serious. Longitudinal data is prone to missingness in ways that cross-sectional data is not. Existing solutions (e.g., coding for gaps) are not satisfactory, and deletion of gappy sequences causes bias. Multiple imputation seems promising, but standard implementations are not adapted for sequence data. I propose and demonstrate a Stata implementation of a chained multiple imputation procedure that “heals” gaps from both ends, taking account of the longitudinal nature of the measured information, and also constraining the imputations to respect this longitudinality. Using the sequence data alone, without auxiliary individual-level information, the procedure generates stable imputations with good characteristics. Using additional information about the structure of data collection (which relates to mechanisms of missingness) gives better prediction models, but imputations that differ only subtly. Many sequence analysts proceed by cluster analysis of the matrix of pairwise OM distances between sequences. As a non-inferential procedure, this does not benefit from “Rubin’s Rules” for multiple imputation in averaging across estimations. I explore ways of clustering with multiply imputed sequences that allow us to assess the variability due to imputation. I compare the results with an existing approach that codes gaps with a special missing value that is maximally different from all other states, and show that imputation performs better. In an example data set drawn from BHPS work-life histories, imputation of short internal gaps ( 12 months) increases the available sample size by approximately 25 percent. Moreover, the gappy sequences have a distinctly different distribution, with higher numbers of transitions, so deletion of gappy sequences distorts the sample badly.
For typical longitudinal data sets, we can expect missingness to be related to the amount of instability in the career, so proceeding without imputation will cause serious bias.
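To make the idea of "healing" gaps from both ends concrete, the following is a minimal Python sketch, not the Stata implementation described in the report. It only locates short internal gaps in a monthly state sequence and lists the order in which their positions would be visited when working inward from both edges; the imputation model itself is not shown. The function names `internal_gaps` and `healing_order`, the use of `None` for a missing month, the strict left-right alternation, and the inclusive 12-month cutoff are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def internal_gaps(seq: List[Optional[str]], max_len: int = 12) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, inclusive, of internal runs of None.

    A run counts as internal only if an observed state lies on both sides.
    Runs longer than max_len are skipped (inclusive cutoff is an assumption).
    """
    gaps = []
    i, n = 0, len(seq)
    while i < n:
        if seq[i] is None:
            j = i
            while j < n and seq[j] is None:
                j += 1
            # Keep the run only if it is bounded by observed states and short enough.
            if i > 0 and j < n and (j - i) <= max_len:
                gaps.append((i, j - 1))
            i = j
        else:
            i += 1
    return gaps


def healing_order(start: int, end: int) -> List[int]:
    """List gap positions alternating between the left and right edge, moving inward."""
    order = []
    lo, hi = start, end
    while lo <= hi:
        order.append(lo)  # next position adjacent to the observed/filled left side
        lo += 1
        if lo <= hi:
            order.append(hi)  # next position adjacent to the observed/filled right side
            hi -= 1
    return order


# Hypothetical sequence: E = employed, U = unemployed, None = missing month.
example = ["E", "E", None, None, None, "U", "U", None, "E"]
for start, end in internal_gaps(example):
    print((start, end), healing_order(start, end))
```

In this ordering each position is visited only after its outer neighbour on the same side is observed or already filled, which is what would let a prediction model condition on both the preceding and the following states.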