Water distribution pipes convey clean drinking water to billions of end-users around the globe. Unexpected pipe breaks can lead to several challenges, including reduced fire-fighting capability and contamination. Recent studies have developed machine learning models to predict pipe break status. Because pipes that have never experienced a break generally outnumber pipes that have broken, pipe break datasets are inherently imbalanced. For this reason, different approaches to data preparation might yield different model performance. This study will explore the impact of different data preparation strategies on the performance of pipe break status prediction for a case study water distribution system. The system has 14,436 pipes and 6,381 breaks recorded between 1956 and 2019. Because XGBoost has been shown to perform well on similar prediction tasks, it will be applied herein to all model variations. Four areas of variation will be explored: (1) break types, (2) sampling, (3) break period, and (4) splitting. Firstly, datasets with first breaks, most recent breaks, and all breaks will be compared. Secondly, no sampling, simple random sampling, and stratified sampling will be compared. Thirdly, break status will be aggregated into 1-year, 2-year, 5-year, or 10-year periods. Fourthly, random splits of data, i.e., 70/30, 80/20, and 90/10 for training and cross-validation/testing, as well as splits by time period, will be compared. In all cases, the last 10 years of data will be excluded from training and cross-validation to ensure comparability of test results. All models will be evaluated with F1 scores and precision-recall curves. Results will provide practical insights into best practices for data preparation.
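
As an illustration only, two of the preparation steps named above (aggregating break status into fixed-length periods and holding out the final years for testing) might be sketched as follows. The record layout (`Break`), the function names, the toy data, and the rule of discarding windows that straddle the holdout boundary are all assumptions made for this sketch, not details taken from the study. F1 is computed by hand so the example needs no external libraries.

```python
from collections import namedtuple

# Hypothetical record layout; the study's actual schema is not specified.
Break = namedtuple("Break", ["pipe_id", "year"])


def label_periods(pipe_ids, breaks, start, end, period):
    """Label each (pipe, window) pair 1 if the pipe broke in that window, else 0.

    Windows are consecutive blocks of `period` years beginning at `start`.
    """
    broken = {
        (b.pipe_id, start + ((b.year - start) // period) * period)
        for b in breaks
        if start <= b.year <= end
    }
    labels = {}
    for pid in pipe_ids:
        for window_start in range(start, end + 1, period):
            labels[(pid, window_start)] = 1 if (pid, window_start) in broken else 0
    return labels


def temporal_split(labels, holdout_start, period):
    """Keep windows ending before `holdout_start` for training and windows
    starting at or after it for testing. Windows that straddle the boundary
    are dropped (an assumption of this sketch, chosen to avoid leakage)."""
    train = {k: v for k, v in labels.items() if k[1] + period - 1 < holdout_start}
    test = {k: v for k, v in labels.items() if k[1] >= holdout_start}
    return train, test


def f1_score(y_true, y_pred):
    """F1 for the positive (break) class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration: three pipes, three recorded breaks.
pipe_ids = ["A", "B", "C"]
breaks = [Break("A", 2003), Break("A", 2012), Break("B", 2016)]

labels = label_periods(pipe_ids, breaks, start=2000, end=2019, period=5)
train, test = temporal_split(labels, holdout_start=2010, period=5)
```

Even in this toy case most pipe-period labels are zero, which is the imbalance the sampling comparison is meant to address; that imbalance is also why F1 on the break class, rather than accuracy, is the more informative score.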