Q-learning is one of the most commonly used methods for optimizing dynamic treatment regimes (DTRs) in multi-stage decision making. Right-censored survival outcomes pose a significant challenge to Q-learning because it relies on parametric models for counterfactual estimation, which are subject to misspecification and sensitive to missing covariates. In this paper, we propose an imputation-based Q-learning (IQ-learning) in which flexible nonparametric or semiparametric models are employed to estimate the optimal treatment rule at each stage, and weighted hot-deck multiple imputation (MI) and direct-draw MI are then used to predict optimal potential survival times. Missing data are handled using inverse probability weighting and MI, and the non-random treatment assignment among the observed is accounted for using a propensity-score approach. We investigate the performance of IQ-learning via extensive simulations and show that it is more robust to model misspecification than existing Q-learning methods, that, unlike parametric models, it imputes only plausible potential survival times, and that it provides more flexibility in the shape of the baseline hazard. Using IQ-learning, we developed an optimal DTR for leukemia treatment based on the randomized trial with observational follow-up that motivated this study.
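For orientation, the standard parametric Q-learning that the abstract contrasts with IQ-learning can be sketched as a two-stage backward induction. This is a minimal illustration only, not the proposed IQ-learning method: it assumes a fully observed (uncensored) continuous outcome, randomized binary treatments, and linear working models, and it involves no imputation, inverse probability weighting, or propensity scores. All function names, the simulated data-generating model, and the covariate names (`x1`, `x2`) are hypothetical.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for the linear working model."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def design1(x1, a1):
    # Stage-1 Q-function: main effects plus treatment-by-covariate interaction.
    return np.column_stack([np.ones_like(x1), x1, a1, a1 * x1])

def design2(x1, a1, x2, a2):
    # Stage-2 Q-function: full stage-1 history plus stage-2 terms.
    return np.column_stack([np.ones_like(x1), x1, a1, a1 * x1, x2, a2, a2 * x2])

def fit_q_learning(x1, a1, x2, a2, y):
    """Backward induction: fit stage 2 first, then stage 1 on a pseudo-outcome."""
    beta2 = ols(design2(x1, a1, x2, a2), y)
    # Predicted outcome under each stage-2 treatment; keep the better one.
    q2_a0 = design2(x1, a1, x2, np.zeros_like(x2)) @ beta2
    q2_a1 = design2(x1, a1, x2, np.ones_like(x2)) @ beta2
    pseudo = np.maximum(q2_a0, q2_a1)
    beta1 = ols(design1(x1, a1), pseudo)
    return beta1, beta2

def rule1(beta1, x1):
    """Stage-1 rule: treat when the estimated treatment contrast is positive."""
    return (beta1[2] + beta1[3] * x1 > 0).astype(int)

def rule2(beta2, x2):
    """Stage-2 rule: treat when the estimated treatment contrast is positive."""
    return (beta2[5] + beta2[6] * x2 > 0).astype(int)

def simulate(n, seed=0):
    """Hypothetical toy data: treating is beneficial when the covariate is positive."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    a1 = rng.integers(0, 2, size=n).astype(float)  # randomized stage-1 treatment
    x2 = 0.5 * x1 + rng.normal(size=n)
    a2 = rng.integers(0, 2, size=n).astype(float)  # randomized stage-2 treatment
    y = a1 * x1 + a2 * x2 + rng.normal(size=n)     # larger outcome is better
    return x1, a1, x2, a2, y

if __name__ == "__main__":
    x1, a1, x2, a2, y = simulate(5000)
    beta1, beta2 = fit_q_learning(x1, a1, x2, a2, y)
    print("stage-1 rule at x1 = -1, 1:", rule1(beta1, np.array([-1.0, 1.0])))
    print("stage-2 rule at x2 = -1, 1:", rule2(beta2, np.array([-1.0, 1.0])))
```

With right-censored survival times, the outcome `y` above is not observed for every subject, and a misspecified working model can distort the stage-2 pseudo-outcome. These are the gaps the abstract says IQ-learning addresses through flexible models and multiple imputation.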
               