- Research
- Open access
Random forest algorithm for predicting tobacco use and identifying determinants among pregnant women in 26 sub-Saharan African countries: a 2024 analysis
BMC Public Health volume 25, Article number: 1506 (2025)
Abstract
Introduction
Tobacco use during pregnancy is a significant public health concern, associated with adverse maternal and neonatal outcomes. Despite its critical importance, comprehensive data on tobacco use among pregnant women in sub-Saharan Africa is limited. Leveraging machine learning approaches allows us to better understand these constraints and predict tobacco use among pregnant women, providing actionable insights for policy and intervention.
Objective
This study aimed to predict tobacco use and identify its determinants among pregnant women in 26 SSA countries using a machine learning algorithm.
Methods
Using data from the Demographic and Health Surveys (2016–2023) across 26 SSA countries, we analyzed responses from 33,705 pregnant women. The Random Forest classifier, complemented by SHAP for feature interpretability, was employed for prediction and analysis. Data preprocessing included K-nearest neighbor imputation for missing values, SMOTE for handling class imbalance, and Recursive Feature Elimination for feature selection. Model performance was evaluated using metrics such as accuracy, recall, F1 score, and AUC-ROC.
Results
The Random Forest model demonstrated robust performance, achieving an AUC-ROC of 98%, recall of 94%, and F1 score of 93%. Key predictors identified included maternal literacy, maternal education, wealth index, distance to healthcare facilities, and place of residence. Pregnant women with lower educational attainment, residing in rural areas, and from lower wealth quintiles were more likely to use tobacco.
Conclusion and recommendations
This study utilized a Random Forest machine learning algorithm to identify key predictors of tobacco use among pregnant women across 26 Sub-Saharan African countries. Significant factors included maternal literacy, education, wealth index, and healthcare access, highlighting systemic inequities contributing to tobacco dependency during pregnancy. These findings advocate for policies addressing educational disparities, economic inequalities, and barriers to healthcare access to reduce tobacco use and improve maternal and neonatal outcomes. Future research should incorporate longitudinal data to enhance predictive accuracy and inform policy development.
Introduction
According to the World Health Organization (WHO), tobacco smoking is one of the most prevalent and significant health problems throughout the globe, posing a high risk for the development of chronic diseases [1]. Globally, an estimated 1.3 billion individuals are smokers, with more than 80% residing in low- and middle-income countries (LMICs) [2]. Worldwide, tobacco use results in 8 million deaths each year, with more than 5% of these mortalities constituted by adult females [3]. The smoking rate in sub-Saharan African countries is rising rapidly, with projections showing it could reach 208 million individuals by 2030 [4].
In a 2020 study involving 204 countries, the overall smoking prevalence for women aged 15 years and older was 6.5%, while in sub-Saharan Africa, the prevalence among females was found to be less than 5% [5]. This is particularly alarming because women in their reproductive years encounter heightened risks from smoking, which negatively impacts fertility, pregnancy outcomes, and the development of the fetus and child [6, 7]. Despite the critical importance of the gestational period for child development, 52.9% of female smokers worldwide continue their smoking habit during pregnancy [8].
During the early stages of pregnancy, especially when the major organs are forming, the fetus is exceedingly susceptible to substance-related harm [9]. Hence, the multitude of chemicals present in a cigarette, along with nicotine, can easily cross the placenta, potentially causing harmful events to the developing fetus [7]. Smoking during pregnancy increases the risk of growth restriction [10], low birth weight [11], obesity and chronic diseases [12], stillbirth and congenital anomalies [13], sudden infant death syndrome [14], and also raises the risk of hypoxia, respiratory, and neuronal diseases in infants [15].
Despite the detrimental effects of smoking during pregnancy being well-documented, the number of female smokers remains substantial and rising, especially among young adults in their reproductive years [16]. The global prevalence of smoking during pregnancy is estimated to range from 1.7 to 4.5% [17]. In 2018, the prevalence of smoking during pregnancy was notably higher in high-income countries, with rates reaching 8.2% in the United States [18] and 11.4% in Australia [19]. However, in LMICs, the pooled prevalence of tobacco use among pregnant women was 2.6% [20]. Prevalence was comparatively high in Nepal (8.4%), India (8.0%), and Cambodia (6.7%) [20]. In Africa, the prevalence was estimated at 0.8% [17].
Evidence shows that maternal tobacco use during pregnancy has significantly decreased in higher-income countries [21]. Even though smoking rates are rising rapidly, information on the overall magnitude and determinants of tobacco use among pregnant women in sub-Saharan Africa remains limited [22]. However, previous studies identified various risk factors for the high prevalence of smoking among pregnant women, including lower socioeconomic levels, low education, and occupational status such as unemployment [16, 23,24,25]. Multiparity, poor antenatal care, being a young or single mother, and unexpected or unwanted pregnancies have also been reported to be associated with smoking during pregnancy [16, 18, 26].
Several studies have been conducted to determine the prevalence of tobacco use among pregnant women and identify its associated factors using classical statistical methods. However, no study has been conducted in Sub-Saharan Africa using machine learning approaches [16, 18, 23,24,25,26,27]. Machine learning (ML) was chosen over traditional statistical methods for predictive modeling due to its ability to overcome key limitations of conventional approaches. Classical statistical techniques, such as linear or logistic regression, prioritize parameter estimation, hypothesis testing, and inference under strict assumptions, including linearity, normality, independence, and homoscedasticity [28, 29]. These models excel in explaining relationships between variables when theoretical assumptions are met but struggle with predictive accuracy in real-world data that often violate these conditions. For instance, misspecification of variable interactions or non-linear relationships in regression models can lead to biased estimates and unreliable conclusions, limiting their utility in uncovering hidden patterns [29, 30].
In contrast, ML algorithms (e.g., random forests, gradient-boosted trees) are inherently designed to optimize predictive performance by learning complex, non-linear relationships directly from data without relying on rigid a priori assumptions [28, 31, 32]. This flexibility enables ML to handle high-dimensional datasets, automatically detect interactions, and adapt to evolving data structures. These capabilities are particularly critical in public health research, where factors influencing behaviors like tobacco use may involve intricate socio-cultural, economic, and psychological interdependencies. This approach enables algorithms to reveal hidden knowledge and patterns that may not be obvious based on prior assumptions [33, 34].
Furthermore, while conventional methods emphasize confidence intervals and p-values to quantify uncertainty, ML embraces uncertainty through iterative learning, cross-validation, and ensemble techniques that enhance robustness [35, 36]. ML’s focus on prediction facilitates the identification of determinants via feature importance metrics, which quantify variable contributions to outcomes without presuming linearity [37]. This is advantageous in settings like sub-Saharan Africa, where determinants of tobacco use among pregnant women may involve non-linear thresholds or context-specific interactions that regression models could overlook. Additionally, ML models integrate seamlessly with digital health systems, enabling real-time deployment and scalability, both of which are key for translating research insights into interventions [38]. By transcending the constraints of assumption-driven statistics, ML not only bridges the gap between theoretical research and practical application but also empowers researchers to discover actionable insights in complex, heterogeneous populations [39]. This study leverages these strengths to address the evidence gap on tobacco use by developing a predictive model and identifying its determinants, tailored to the dynamic realities of sub-Saharan African healthcare contexts, thereby advancing precision public health strategies.
Method
Study period and setting
The study used data from recent surveys conducted in 26 SSA countries between 2016 and 2023 G.C. These countries are Angola, Benin, Burkina Faso, Burundi, Cameroon, Côte d’Ivoire, Ethiopia, Gabon, Ghana, Gambia, Guinea, Kenya, Liberia, Madagascar, Malawi, Mali, Mauritania, Mozambique, Nigeria, Rwanda, Senegal, Sierra Leone, South Africa, Tanzania, Uganda, and Zambia. For this analysis, the study focused on pregnant women by appending individual records from each country and identifying those currently pregnant. This process resulted in a total weighted sample of 33,705 pregnant women. The data used in the study is publicly available and can be accessed at https://dhsprogram.com/data/available-datasets.cfm.
Dependent variable (Target variable)
In this study, the outcome variable was tobacco use. Women in the DHS dataset were asked about their use of tobacco and other related substances. The study relied on pregnant women’s self-reported use of various tobacco products, including cigarettes, pipes, chewing tobacco, snuff taken by nose, snuff taken by mouth, cigars/cheroots/cigarillos, water pipes, and other country-specific tobacco products. Pregnant women who reported using any of these products were classified as ‘tobacco users’; otherwise, they were classified as ‘non-users’ [40].
Independent variables (Features)
The independent variables for this study were selected based on existing literature on factors influencing tobacco use. They included maternal literacy, maternal education status, wealth index, mobile phone ownership, distance from health facility, place of residence, bank account ownership, respondent occupation, marital status, presence of electricity, number of living children, sex of household head, media exposure, and internet use [20, 40,41,42,43].
Data management and analysis
The data was appended and weighted using SPSS version 27 and Microsoft Excel 2019. Python 3.12 was used for further analysis. Several Python libraries were used in this study to support various stages of the analysis. Pandas and NumPy were used to manipulate data and compute numerical results. Matplotlib and Seaborn were used for visualization. The Scikit-Learn package was used for accessing and developing machine learning models, including tasks such as splitting data and evaluating model performance. For model development and feature selection, the Random Forest classifier was selected as the primary algorithm. The classifier was selected for its ability to handle complex, high-dimensional datasets often encountered in healthcare applications. Unlike simpler linear models, Random Forest can capture complex relationships and intricate interactions between predictors without requiring strict parametric assumptions [44]. This makes it well-suited for datasets with noise and non-standard distributions. By leveraging an ensemble of decision trees trained on random subsets of data, Random Forest reduces overfitting and enhances generalization, ensuring more stable and reliable predictions [45]. Its built-in feature importance mechanism also provides valuable insights into the most influential variables, aiding interpretability and decision-making [46]. Compared to other ensemble methods like Gradient Boosting or XGBoost, Random Forest is particularly advantageous when the primary goal is variance reduction rather than maximizing predictive accuracy on potentially overfit training data [47]. Its bagging approach aggregates predictions from multiple trees, which is especially effective in mitigating the impact of outliers or noise [48].
Random forest classifier
Random Forest is an ensemble learning method that combines multiple decision trees to improve classification performance by aggregating the predictions of many trees. This technique reduces overfitting and variance compared to a single decision tree. It uses bagging (Bootstrap Aggregating), where each tree is trained on a random subset of the data, selected with replacement. The trees are further diversified by considering only a random subset of features at each split, which helps reduce correlation between trees. The n_estimators parameter controls the number of trees in the forest, with more trees generally improving performance but increasing computational time. Starting with a value of 100 is common, and it can be increased based on the needs of the model. The max_depth parameter limits the depth of each tree; deeper trees capture more complex patterns but may overfit. The max_features parameter determines the number of features considered for splitting at each node, with a common value being ‘sqrt’ (the square root of the total number of features). The criterion parameter measures the quality of the split, where ‘gini’ and ‘entropy’ are commonly used, with Gini impurity being slightly faster. The min_samples_split parameter specifies the minimum number of samples required to split a node; higher values help prevent overfitting but may reduce model complexity. A starting value of 2 is typical, with adjustments as needed. Similarly, min_samples_leaf sets the minimum number of samples required to be at a leaf node, starting with 1 and adjusted based on the data. The bootstrap parameter determines whether to use bootstrap sampling for tree construction, which can reduce model variance and improve generalization but may increase bias. Random Forests are also relatively robust to overfitting due to the averaging of multiple trees, and some implementations can handle missing data using surrogate splits.
Additionally, Random Forest provides feature importance, allowing for insights into which features contribute most to the model’s predictions [48,49,50,51,52]. Fig. 1 demonstrates the Random Forest classification workflow, where an instance is classified by multiple decision trees. Each tree provides an independent prediction, and the final class is determined through majority voting across the trees (Fig. 1; Source: GeeksforGeeks: https://www.geeksforgeeks.org).
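The bagging and majority-voting workflow described above can be illustrated with a deliberately small sketch. It is not the scikit-learn implementation used in the study: each "tree" here is a one-split stump, the random feature subset has size 1, and the function names (fit_stump, fit_forest, predict_forest) and the toy data are assumptions introduced only for illustration.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Fit a one-split 'tree' on a single feature by minimising training errors."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        for left_label in (0, 1):
            errors = sum(
                (left_label if row[feature] <= threshold else 1 - left_label) != y
                for row, y in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label)
    return feature, best[1], best[2]


def predict_stump(stump, row):
    """Predict 0 or 1 for one row from a fitted stump."""
    feature, threshold, left_label = stump
    return left_label if row[feature] <= threshold else 1 - left_label


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Bagging: each stump sees a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))           # random feature subset (size 1)
        forest.append(
            fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        )
    return forest


def predict_forest(forest, row):
    """Final class = majority vote over the individual tree predictions."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two numeric features and a binary label.
rows = [(0, 0), (1, 1), (2, 2), (8, 8), (9, 9), (10, 10)]
labels = [0, 0, 0, 1, 1, 1]
forest = fit_forest(rows, labels)
print(predict_forest(forest, (0, 0)), predict_forest(forest, (10, 10)))
```

A full Random Forest grows deeper trees and redraws the feature subset at every split; the sketch keeps only the two ideas named in the text, bootstrap sampling and voting.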
Data refinement and feature processing
Missing values were addressed by K-nearest neighbor imputation. Given the imbalance in the outcome variable, with the majority of women being non-users of tobacco, the Synthetic Minority Oversampling Technique (SMOTE) was used. One-hot encoding was employed to convert categorical variables into numerical values. Furthermore, normalization and scaling were used to ensure that features were on a comparable scale, which is vital to improving model training and evaluation. To assess multicollinearity and understand the relationships between key variables, we employed a correlation analysis (Fig. 2). Additionally, Recursive Feature Elimination (RFE) was used to reduce dimensionality by selecting only the most important predictors for model training. This method reduced the dataset’s complexity, increasing model efficiency and interpretability [53].
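The core idea behind SMOTE can be sketched in a few lines: a synthetic minority record is placed at a random point on the line segment between a real minority record and one of its nearest minority neighbours. The function below is a simplified illustration under that assumption, not the library routine used in the study; smote_sketch, its parameters, and the toy points are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen base point (Euclidean distance)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical scaled feature vectors for the minority class (tobacco users).
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (2.0, 2.0)]
new_points = smote_sketch(minority, n_new=6)
```

Because every synthetic point is an interpolation, it stays inside the range spanned by the real minority records. Oversampling of this kind is usually applied to training data only, so that evaluation data contain no synthetic records; the paper does not state at which stage it was applied.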
Data segmentation
The study used train-and-test split and 10-fold cross-validation to ensure robust model evaluation and performance assessment. The dataset was divided into training and testing sets, with the former used to build and train the model and the latter used to evaluate its performance on previously unseen data. Furthermore, 10-fold cross-validation was used to validate the model’s performance by dividing the data into 10 folds. The model was trained and tested 10 times, with each test set being a separate fold and the remaining folds serving as the training set. This method reduces overfitting and yields a more reliable estimate of the model’s generalization performance [54]. The dataset, comprising 33,705 records, was divided into 10 subsets of approximately equal size (~ 3,370 records per fold). In each iteration, one subset (10%) was used as the testing dataset (~ 3,370 records), while the remaining 90% (~ 30,335 records) were used as the training dataset. This process was repeated 10 times, ensuring that each record was included in the testing dataset exactly once and in the training dataset nine times.
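The fold arithmetic above can be checked with a short sketch. It assumes contiguous, unshuffled folds in which the first few folds absorb the remainder; the paper does not say whether the folds were shuffled or stratified, and kfold_splits is a hypothetical helper rather than the function used in the study.

```python
def kfold_splits(n_samples, n_splits=10):
    """Yield (train_indices, test_indices) with every record tested exactly once."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)  # spread the remainder over early folds
        test = range(start, start + size)
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Train and test sizes per fold for the 33,705 records in the study.
sizes = [(len(train), len(test)) for train, test in kfold_splits(33_705)]
for fold, (n_train, n_test) in enumerate(sizes, start=1):
    print(f"fold {fold}: train={n_train}, test={n_test}")
```

Since 33,705 is not divisible by 10, the folds cannot all be identical in size: under this convention five folds hold 3,371 test records and five hold 3,370, which matches the approximate figures quoted in the text.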
Model fitting and optimization
The random forest classifier was trained on the preprocessed and balanced dataset. Cross-validation ensured that each model was validated on different subsets of the training data, allowing for an accurate evaluation of performance. We applied Random Forest and optimized the following hyperparameters: n_estimators = 200, max_depth = 15, max_features = ‘sqrt’, criterion = ‘gini’, min_samples_split = 10, min_samples_leaf = 7, and bootstrap = True. These values were selected to strike a balance between model performance and computational efficiency. The chosen settings helped reduce overfitting, improved generalization, and ensured that the model was both robust and effective in capturing complex patterns. The predictions for the classifier were generated on the test set, and a custom threshold of 0.5 was applied to the predicted probabilities to classify the target variable [55]. We compared the findings with logistic regression to evaluate the performance of our model and validate its predictive power in comparison to a traditional approach. The results showed an accuracy of 89%, an AUC of 93%, a precision of 87%, a recall of 91%, and an F1 score of approximately 88.9%, highlighting the model’s strong performance across various evaluation criteria.
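The reported configuration and the thresholding step can be written down as a small sketch. The dictionary keys follow the parameter names given in the text; build_model, apply_threshold, the random_state argument, and the use of "greater than or equal to" at the threshold are assumptions for illustration, since the paper does not publish its code.

```python
# Hyperparameters exactly as reported in the text.
RF_PARAMS = {
    "n_estimators": 200,
    "max_depth": 15,
    "max_features": "sqrt",
    "criterion": "gini",
    "min_samples_split": 10,
    "min_samples_leaf": 7,
    "bootstrap": True,
}


def build_model(params=RF_PARAMS, random_state=0):
    """Create the classifier; random_state is an assumption added for reproducibility."""
    # Imported lazily so the rest of the sketch runs without scikit-learn installed.
    from sklearn.ensemble import RandomForestClassifier

    return RandomForestClassifier(random_state=random_state, **params)


def apply_threshold(probabilities, threshold=0.5):
    """Map predicted probabilities of tobacco use to class labels (1 = user)."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical predicted probabilities for three women.
print(apply_threshold([0.20, 0.50, 0.91]))
```

In use, the probabilities would come from the positive-class column of the fitted model's predict_proba output on the test set. Lowering the threshold trades precision for recall, which may matter when missing a tobacco user is costlier than a false alarm.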
Performance metrics
Model performance was evaluated using discrimination metrics. These metrics, including accuracy, precision, recall, F1 score, and Area Under the Receiver Operating Characteristic Curve (AUC-ROC), were used to assess the model’s capacity to distinguish between positive and negative instances.
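For reference, the threshold-based metrics listed above follow directly from the four cells of a confusion matrix. The sketch below uses made-up counts, not the study's confusion matrix, and the helper names are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # share of predicted users who are true users
    recall = tp / (tp + fn)     # share of true users the model finds
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts chosen only to illustrate the formulas.
metrics = classification_metrics(tp=94, fp=9, fn=6, tn=91)
print({name: round(value, 3) for name, value in metrics.items()})
print(round(f1_from(0.87, 0.91), 4))
```

The second call shows that a precision of 0.87 and a recall of 0.91 imply an F1 of roughly 0.89, in line with the approximately 88.9% quoted earlier in the text.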
Model explanation and feature selection
The link between the predictors and the outcome variable was assessed using Random Forest’s built-in feature importance method and the SHapley Additive exPlanations (SHAP) feature importance approach [56]. The SHAP method was used to assess the impact of each feature on model predictions. SHAP analysis uses a game theory framework to offer a global or local interpretation and explanation for any machine learning model’s prediction [56]. Feature selection was performed to identify the most important features influencing tobacco use predictions. SHAP was chosen because it provides clear and interpretable insights into how each feature contributes to model decisions, which is crucial in healthcare applications where interpretability is important.
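The game-theoretic idea behind SHAP can be made concrete with an exact Shapley computation on a toy model. This is a sketch under strong assumptions: features outside a coalition are replaced by a single baseline value, whereas SHAP for tree ensembles averages over background data with a faster algorithm. The model toy_predict, its coefficients, and the three binary features are invented for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values by enumerating every coalition of features."""
    n = len(x)

    def value(coalition):
        # Features in the coalition take their real value; the rest take the baseline.
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_predict(z):
    """Hypothetical risk score with an interaction between features 0 and 2."""
    return 0.1 + 0.3 * z[0] + 0.2 * z[1] + 0.4 * z[0] * z[2]


# Hypothetical binary features: [low literacy, rural residence, poor household].
x, baseline = [1, 1, 1], [0, 0, 0]
phi = shapley_values(toy_predict, x, baseline)
print([round(p, 3) for p in phi])
```

The contributions sum to the difference between the prediction for x and the prediction for the baseline, and the interaction term is split equally between the two features involved. Enumeration is exponential in the number of features, so it is only practical for small illustrations like this one.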
Results
Socio-demographic characteristics of study participants
A total of 33,705 (weighted) study respondents were included in this study. Of these participants, 669 (2.07%) used tobacco during pregnancy. The majority of participants, 22,612 (67%), were from rural areas, while 11,093 (32.91%) resided in urban areas. Regarding educational status, 12,241 (36.32%) participants had no formal education, 10,742 (31.87%) had completed primary education, and 10,722 (31.81%) had secondary education or higher. The study also revealed that 27,432 (81.39%) of the household heads were male, while 6,273 (18.61%) were female. In terms of literacy, 17,003 (50.45%) participants could not read at all, while 16,702 (49.55%) could read and write. Regarding marital status, 29,873 (88.63%) participants were married, 3,462 (10.27%) were single, 238 (0.71%) were widowed, and 132 (0.39%) were divorced. Mobile phone ownership was relatively high, with 17,903 (53.12%) participants owning a mobile phone, while 15,802 (46.88%) did not. Only a small percentage of participants, 691 (2.05%), used the internet, while the majority, 33,014 (97.95%), did not. In terms of wealth index, 15,774 (46.8%) participants were classified as poor, 11,191 (33.2%) were middle class, and 6,740 (20%) were rich. Finally, 21,768 (64.58%) respondents were employed, while 11,937 (35.42%) were not employed (Table 1).
Machine learning analysis
The Random Forest classifier performed well in predicting tobacco usage among pregnant women. The model achieved an accuracy of 0.92, which means it correctly classified 92% of all instances. The Area Under the Curve (AUC) was 98%, demonstrating the model’s nearly flawless ability to differentiate between tobacco users and non-users. Furthermore, the precision was 91%, which means that 91% of the cases categorized as tobacco users were correct, indicating a low false positive rate. The recall was 94%, demonstrating the model’s capacity to accurately identify 94% of actual tobacco users with few false negatives. The F1 score (the harmonic mean of precision and recall) was 0.93, indicating a good balance between these criteria. Furthermore, the accuracy of the model was reevaluated by 10-fold cross-validation because it provides a more rigorous assessment of model performance. When evaluated using cross-validation, the classifier achieved an accuracy of 90%, precision of 88%, recall of 92%, and F1 score of 90%, comparable to the results observed with the train-test split (Fig. 3).
In this study, the performance of a Random Forest classifier for predicting tobacco use among pregnant women was also assessed using the Receiver Operating Characteristic (ROC) curve. The ROC curve visualizes the trade-off between sensitivity (true positive rate) and specificity (true negative rate) as the classification threshold varies. An area under the curve (AUC) of 98% signifies excellent discriminative ability. This high AUC suggests that the Random Forest classifier effectively distinguishes between pregnant women who use tobacco and those who do not across a wide range of decision thresholds, indicating strong predictive power and potential clinical utility in identifying women who may benefit from targeted interventions to reduce tobacco use during pregnancy (Fig. 4).
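One way to read an AUC value is as the probability that a randomly chosen tobacco user receives a higher predicted score than a randomly chosen non-user. The sketch below computes AUC from that pairwise definition; auc_by_ranking and the four example scores are illustrative assumptions, not the study's data or code.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, y_true) if y == 1]
    negatives = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = tobacco user) and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
print(auc_by_ranking(y_true, scores))  # 3 of the 4 pairs are ordered correctly -> 0.75
```

The pairwise form makes clear that AUC depends only on the ranking of scores, not on any single decision threshold.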
Additionally, a Precision-Recall curve was used to evaluate the performance of the classification model. The curve illustrates the relationship between precision and recall as the classification threshold varies. The initial high precision at low recall indicates that the model is effective at identifying positive cases with high confidence. However, as recall increases, precision tends to decrease, suggesting a trade-off between capturing more true positives and maintaining a high level of accuracy in the positive predictions [57] (Fig. 5).
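The trade-off described above can be traced by sweeping the threshold over the observed scores and recording precision and recall at each step. This is a minimal sketch with hypothetical labels and scores; precision_recall_points is an illustrative helper and assumes at least one positive label.

```python
def precision_recall_points(y_true, scores):
    """Return (threshold, precision, recall) for each distinct score, highest first."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(p and y == 1 for p, y in zip(predicted, y_true))
        fp = sum(p and y == 0 for p, y in zip(predicted, y_true))
        fn = sum((not p) and y == 1 for p, y in zip(predicted, y_true))
        points.append((threshold, tp / (tp + fp), tp / (tp + fn)))
    return points


# Hypothetical labels (1 = tobacco user) and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
points = precision_recall_points(y_true, scores)
for threshold, precision, recall in points:
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

At the strictest threshold the only predicted user is a true user (precision 1.0, recall 0.5); at the loosest threshold every user is found but half of the predictions are wrong (precision 0.5, recall 1.0).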
Important feature selection
In this study, we employed two ways to determine the most important predictors of tobacco use among pregnant women: Random Forest’s built-in feature importance and SHAP values, both applied with the Random Forest classifier. Using these methodologies enabled us to cross-validate the significance of predictors while also improving our understanding of their influence and interpretability, strengthening the finding of the study.
According to Random Forest’s built-in feature importance, wealth index, distance from health facilities, and mothers’ education status rank high in importance, suggesting that socioeconomic disparities and limitations in healthcare access significantly impact tobacco use behavior during pregnancy. Factors like marital status, mobile phone ownership, and access to banking services also play a role. Conversely, features like internet use and media exposure appear to have less influence on the model’s predictions (Fig. 6).
To improve interpretability, we used SHAP values with the Random Forest classifier to understand how each predictor contributes to the model’s individual predictions. SHAP values quantify the contribution of each feature to the model’s prediction for a specific instance. By averaging these values across all instances, we obtained a measure of the feature’s overall impact on the model’s output. The SHAP value bar graph highlights the relative importance of features in the model’s predictions, with literacy and education status of mothers having the greatest influence (+0.45 each), followed by wealth index (+0.42) and mobile phone ownership (+0.34). Moderate contributors include factors like distance from health facility, place of residency, and having a bank account (+0.29 each), while features like media exposure (+0.08) and internet use (+0.07) have the least impact (Fig. 7).
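The averaging step can be sketched as follows. Global SHAP importance is commonly computed as the mean of the absolute per-instance values, because signed contributions can cancel out; whether the study averaged absolute or signed values is not stated, so that choice is an assumption here. The SHAP matrix and feature names below are invented for illustration and are not the study's results.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by the mean absolute SHAP value across instances."""
    n = len(shap_rows)
    scores = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-instance SHAP values (one row per woman, one column per feature).
feature_names = ["literacy", "wealth_index", "internet_use"]
shap_rows = [
    [0.4, -0.2, 0.0],
    [-0.6, 0.2, 0.1],
    [0.5, -0.5, -0.2],
]
ranking = mean_abs_shap(shap_rows, feature_names)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy matrix the signed mean for wealth_index would be about -0.17, which understates its influence compared with the mean absolute value of 0.30; this is why bar plots of global importance typically use absolute values.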
The SHAP summary plot offers a detailed view of how each feature influences the model’s predictions, with each dot representing the contribution of a specific feature value to a single prediction. Red dots indicate that the feature increases the model’s prediction, while blue dots signify a decrease in the prediction. The plot highlighted key factors that influenced tobacco use among pregnant women. Lower literacy levels and low educational attainment were strongly associated with tobacco use, while higher literacy and education reduced the likelihood of tobacco use. Women from lower wealth quintiles and those who did not own a mobile phone were more likely to use tobacco. Similarly, greater distance from health facilities and rural residency were linked to higher tobacco use. Women without a bank account or those not engaged in formal work were also more likely to use tobacco. Being single and living in female-headed households were further associated with higher tobacco use compared to married women and male-headed households. Additionally, lack of access to electricity, limited media exposure, and no internet use were significant predictors of tobacco use. Women with fewer children were also more likely to use tobacco. These findings underlined the critical role of socio-demographic, economic, and accessibility factors in shaping tobacco use behaviors among pregnant women, suggesting that targeted interventions addressing these determinants could help reduce tobacco use in this population (Fig. 8).
Discussion
This study used machine learning models to predict tobacco use and identify its determinants among pregnant women in 26 Sub-Saharan African countries. The Random Forest classifier performed well in predicting tobacco usage among pregnant women. The model achieved an accuracy of 0.92, an AUC of 0.98, a precision of 0.91, a recall of 0.94, and an F1 score of 0.93. Tobacco use is a significant public health problem and should be considered when evaluating lifestyle determinants of health. In this study, we observed that 2.1% of pregnant women used tobacco. This finding is congruent with previous research conducted in sub-Saharan African countries, where a similar prevalence of 2.0% was reported [58]. These consistent results indicate a stable trend in tobacco use among pregnant women in this region, and this study draws on the latest datasets, which reflect current socioeconomic conditions. However, this observed result was slightly lower than previously published research from LMICs with a prevalence of 2.6% [59]. This disparity might be due to contemporary women’s belief in equal social, political, and economic rights and opportunities for all, which could influence lifestyle choices and health-related behaviors, including tobacco use [60]. Moreover, the rising prevalence of smokeless tobacco products in the population could lead to increased tobacco use among women [61].
This study aimed to develop a predictive model for tobacco use among pregnant women and to identify associated factors using machine learning techniques. While risk quantification, such as odds ratios, is often provided in regression-based studies, the focus here was on feature importance and interpretability using SHAP values. This approach allows for a nuanced understanding of how individual factors contribute to predictions without relying on traditional risk metrics. SHAP analysis revealed that factors such as literacy, education status of mothers, wealth index, mobile phone ownership, distance from health facility, place of residency, and having a bank account were the most influential predictors of tobacco use.
In the current study, the wealth index significantly influenced the likelihood of tobacco use among pregnant women in sub-Saharan African countries. Notably, women with lower wealth indexes had the highest rates of cigarette use during pregnancy, consistent with previous studies [58, 62, 63]. This similarity could be linked to a low wealth index resulting in financial stress and limited access to healthcare resources contributing to higher smoking rates [63]. Women with a lower wealth index would consider tobacco as a coping mechanism due to increased stress [64]. Besides, limited healthcare access hinders quitting efforts, exacerbating the problem.
In this study, women’s educational level and literacy status were identified as significant risk factors for tobacco use during pregnancy. Notably, pregnant women who smoked tobacco had lower education levels, such as primary education and high school, consistent with previous studies [16, 24, 25]. This concordance of results might be due to women with lower education levels being less aware of the harmful consequences of tobacco use and the various smoking cessation programs that can assist them in adopting healthier behaviors during pregnancy [65]. This lack of health literacy may also restrict their access to healthcare services and heighten the likelihood of using smoking as a coping mechanism during stressful times [64, 66].
This study also found that the geographical location and residence of pregnant women were strongly associated with increasing smoking rates. Particularly, women residing in rural areas and being distant from health facilities were associated with increased tobacco use, aligning with previous findings [67,68,69]. This congruence of results could be related to women residing in rural areas often facing a shortage of healthcare providers and resources, which increases the difficulty for pregnant women seeking support to quit smoking [70]. Furthermore, lower socioeconomic status and cultural norms that are more accepting of smoking can also contribute to higher tobacco use among these women [71].
In addition, this study found that socioeconomic inequalities are associated with higher tobacco use among women. Notably, limited access to electricity could contribute to higher tobacco use due to increased stress and fewer opportunities for health education and cessation programs, consistent with earlier studies [67, 72]. Furthermore, limited media exposure and low mobile phone ownership were linked to higher tobacco use, aligning with findings from previous studies [17, 73]. This outcome could be attributed to media exposure’s role in reshaping perceptions, preventing the initiation of smoking, and supporting quitting [74]. As a result, media significantly influences attitudes toward tobacco use and prevention.
Single women and those in female-headed households exhibited higher tobacco use. Marital status and household dynamics can influence stress levels, social support, and economic stability, all of which affect smoking behaviors [75,76,77]. The American College of Obstetricians and Gynecologists reports that women aged 20–24 years and those with a high school education or less are more likely to smoke during pregnancy, indicating that younger, possibly single women are at higher risk [78].
Unemployed women were more likely to use tobacco during pregnancy. Employment provides not only financial stability but also structured routines and social interactions that can discourage smoking [79, 80]. Conversely, unemployment may lead to increased stress and idle time, potentially contributing to higher tobacco usage [81,82,83].
In this study, women with fewer children were more likely to use tobacco. This finding may be related to increased health awareness and motivation to quit smoking among women with more children, who may have greater concerns about secondhand smoke exposure and setting a positive example [84].
Strengths and limitations of the study
This study used a large dataset of 33,705 women from 26 Sub-Saharan African countries, which increases the generalizability of its findings. The machine learning approach can detect complex, non-linear relationships that standard statistical methods often miss, and SHAP analysis was used to assess the relative importance of each predictor, yielding actionable insights. Furthermore, the study helps close the research-practice gap by focusing on practical solutions for real-world health care services. However, the use of self-reported data may introduce response bias, and the cross-sectional design restricts causal inference. Variations in survey years may impact the importance of contraceptive coverage in certain nations, whilst the absence of local factors may restrict the applicability of the results to specific demographic segments. We did not compare machine learning models with conventional regression methods, as that was not the study’s objective. The study also focused on identifying determinants through SHAP analysis and did not quantify the risk associated with each feature. Additionally, we aggregated data from 26 SSA countries but did not perform country-specific analyses, and variations in survey quality and cultural differences were not controlled for. Country-specific policies may also influence the results, as national policies can significantly affect survey outcomes.
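To illustrate what per-feature risk quantification could look like, the following minimal sketch computes an unadjusted odds ratio with a Wald 95% confidence interval from a 2×2 table. It is not part of the study's analysis: the function name `odds_ratio_ci` and the cell counts are hypothetical, and a real analysis would adjust for confounders and for the DHS survey design.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio and Wald confidence interval for a 2x2 table.

    a: exposed, tobacco users      b: exposed, non-users
    c: unexposed, tobacco users    d: unexposed, non-users
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Wald interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts for illustration only, NOT taken from the study:
# "exposed" could stand for, e.g., women with no formal education.
or_, lo, hi = odds_ratio_ci(a=60, b=940, c=30, d=970)
print(f"OR = {or_:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

An interval that excludes 1 would suggest an association in the unadjusted data. SHAP values, by contrast, rank each feature's contribution to the model's predictions and are not directly interpretable as odds ratios.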
Conclusion and recommendations
This study employed a Random Forest machine learning approach to identify socioeconomic and healthcare-related determinants of tobacco use among pregnant women across 26 Sub-Saharan African countries. Key predictors, including maternal literacy, education, wealth index, and healthcare access, underscore systemic inequities driving tobacco dependency during pregnancy. The model’s interpretability highlights actionable intervention points, demonstrating the potential of machine learning to inform targeted public health strategies. These findings advocate for policies addressing education gaps, economic disparities, and healthcare access barriers, which are critical for reducing tobacco use and improving maternal and neonatal outcomes in the region. Using this information, policymakers and healthcare practitioners can design interventions to minimize tobacco use among pregnant women and thereby improve newborn and maternal health. Future research could compare machine learning models with conventional regression models, particularly for exploring urban-rural differences in smoking prevalence. Analyzing inter-country differences across SSA countries could provide deeper insights, and validating our findings with conventional models would strengthen the results. Addressing data quality and cultural differences would enhance the findings, and providing the analysis code in a publicly accessible repository would improve reproducibility.
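As a rough illustration of the model comparison suggested for future research, the sketch below computes AUC-ROC for two sets of predicted probabilities on the same held-out labels. The labels and scores are hypothetical, as are the names `auc_roc`, `rf_scores`, and `lr_scores`. The function uses the pairwise definition of AUC, and scikit-learn's `roc_auc_score` should give the same value for binary labels.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the share of (positive, negative) pairs in which the
    positive case receives the higher score; tied pairs count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical held-out labels (1 = tobacco use) and predicted
# probabilities from two models; NOT data or results from the study.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
rf_scores = [0.90, 0.20, 0.70, 0.40, 0.10, 0.80, 0.30, 0.75]  # random forest
lr_scores = [0.80, 0.30, 0.50, 0.60, 0.20, 0.70, 0.40, 0.55]  # logistic regression

auc_rf = auc_roc(y_true, rf_scores)
auc_lr = auc_roc(y_true, lr_scores)
print(f"Random forest AUC: {auc_rf:.3f}; logistic regression AUC: {auc_lr:.3f}")
```

A fair comparison would evaluate both models on identical cross-validation folds and report the uncertainty of the AUC difference, not a single point estimate.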
Data availability
The datasets analyzed in the current study are available in the public domain through the Measure DHS website (http://www.measuredhs.com).
References
Banks E, Joshy G, Weber MF, Liu B, Grenfell R, Egger S, et al. Tobacco smoking and all-cause mortality in a large Australian cohort study: findings from a mature epidemic with current low smoking prevalence. BMC Med. 2015;13(1):38.
World Health Organization. Tobacco. WHO; 2023. Available from: https://www.who.int/news-room/fact-sheets/detail/tobacco
Jha P, MacLennan M, Chaloupka F, Yurekli A, Ramasundarahettige C, Palipudi K, et al. Global Hazards of Tobacco and the Benefits of Smoking Cessation and Tobacco Taxes. 2015. pp. 175–93.
Mbongwe B, Tapera R. Tobacco: a looming epidemic in Sub-Saharan African countries. In: Mhaka-Mutepfa M, editor. Substance use and misuse in sub-Saharan Africa: trends, intervention, and policy. Cham: Springer International Publishing; 2021. pp. 63–78.
Dai X, Gakidou E, Lopez A. Evolution of the global smoking epidemic over the past half century: strengthening the evidence base for policy action. (1468–3318 (Electronic)).
Abraham M, Alramadhan S, Iniguez C, Duijts L, Jaddoe VW, Den Dekker HT et al. A systematic review of maternal smoking during pregnancy and fetal measurements with meta-analysis. (1932–6203 (Electronic)).
McDonnell B, Regan C. Smoking in pregnancy: pathophysiology of harm and current evidence for monitoring and cessation. The Obstetrician & Gynaecologist. 2019;21.
Lange S, Probst C, Rehm J, Popova S. National, regional, and global prevalence of smoking during pregnancy in the general population: a systematic review and meta-analysis. (2214-109X (Electronic)).
Míguez M, Pereira B, Pinto T, Figueiredo B. Continued tobacco consumption during pregnancy and women’s depression and anxiety symptoms. Int J Public Health. 2019;64:1355–65.
Polakowski LL, Akinbami LJ, Mendola P. Prenatal smoking cessation and the risk of delivering preterm and small-for-gestational-age newborns. (0029-7844 (Print)).
Wang X, Zuckerman B, Pearson C, Kaufman G, Chen C, Wang G, Niu T, et al. Maternal cigarette smoking, metabolic gene polymorphism, and infant birth weight. (0098-7484 (Print)).
Ino T. Maternal smoking during pregnancy and offspring obesity: meta-analysis. (1442-200X (Electronic)).
Leonardi-Bee J, Britton J, Venn A. Secondhand smoke and adverse fetal outcomes in nonsmoking pregnant women: a meta-analysis. (1098–4275 (Electronic)).
Ioakeimidis N, Vlachopoulos C, Katsi V, Tousoulis D. Smoking cessation strategies in pregnancy: current concepts and controversies. (2241–5955 (Electronic)).
Gajewska E, Malak R, Mojs E, Samborski W. [Cigarette smoking–threat from first days of life]. Przegla̧d Lekarski. 2008;65:709–11.
Širvinskienė G, Žemaitienė N, Jusienė R, Šmigelskas K, Veryga A, Markūnienė E. Smoking during pregnancy in association with maternal emotional well-being. Medicina. 2016;52(2):132–8.
Aychiluhm SB, Mare KU, Dagnew B, Seid AA, Melaku MS, Sabo KG, et al. Determinants of tobacco use among pregnant women in sub-Saharan Africa. A multilevel mixed-effect logistic regression model. PLoS ONE. 2024;19(5):e0297021.
Curtin SC, Matthews TJ. Smoking prevalence and cessation before and during pregnancy: data from the birth certificate, 2014. (1551–8922 (Print)).
Australian Institute of Health and Welfare. Australia’s mothers and babies 2014: in brief. Canberra: AIHW; 2016. https://www.aihw.gov.au/reports/mothers-babies/australias-mothers-babies-2014-in-brief.
Caleyachetty R, Tait CA, Kengne AP, Corvalan C, Uauy R, Echouffo-Tcheugui JB. Tobacco use in pregnant women: analysis of data from demographic and health surveys from 54 low-income and middle-income countries. Lancet Global Health. 2014;2(9):e513–20.
Vellios N, Ross H, Perucic AM. Trends in cigarette demand and supply in Africa. (1932–6203 (Electronic)).
Méndez D, Alshanqeety O, Warner KE. The potential impact of smoking control policies on future global smoking trends. (1468–3318 (Electronic)).
Houston-Ludlam AN, Bucholz KK, Grant JD, Waldron M, Madden PAF, Heath AC. The interaction of sociodemographic risk factors and measures of nicotine dependence in predicting maternal smoking during pregnancy. (1879-0046 (Electronic)).
de Wolff MG, Backhausen MG, Iversen ML, Bendix JM, Rom AL, Hegaard HK. Prevalence and predictors of maternal smoking prior to and during pregnancy in a regional Danish population: a cross-sectional study. Reproductive Health. 2019;16(1):82.
Erlingsdottir A, Sigurdsson EL, Jonsson JS, Kristjansdottir H, Sigurdsson JA. Smoking during pregnancy: childbirth and health study in primary care in Iceland. (1502–7724 (Electronic)).
Graham H, Hawkins SS, Law C. Lifecourse influences on women’s smoking before, during and after pregnancy. Soc Sci Med. 2010;70(4):582–7.
Verma P, Pandey P, Thakur A. Prevalence and determinants of tobacco consumption among pregnant women of three central Indian districts. Trop J Obstet Gynecol. 2017;34:99.
Breiman L. Statistical modeling: the two cultures. Qual Control Appl Stat. 2003;48(1):81–2.
Hosmer DW Jr, Lemeshow S, Sturdivant RX. Applied logistic regression: Wiley; 2013.
Bizzego A, Gabrieli G, Bornstein MH, Deater-Deckard K, Lansford JE, Bradley RH, et al. Predictors of contemporary under-5 child mortality in low-and middle-income countries: A machine learning approach. Int J Environ Res Public Health. 2021;18(3):1315.
Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. Citeseer; 2009.
Ij H. Statistics versus machine learning. Nat Methods. 2018;15(4):233.
Dhar V. Data science and prediction. Commun ACM. 2013;56(12):64–73.
Sarker IH. Machine learning: algorithms, real-world applications and research directions. SN Comput Sci. 2021;2(3):160.
Strobl C, Boulesteix A-L, Kneib T, Augustin T, Zeileis A. Conditional variable importance for random forests. BMC Bioinformatics. 2008;9:1–11.
Shalev-Shwartz S, Ben-David S. Understanding machine learning: from theory to algorithms. Cambridge University Press; 2014.
Obermeyer Z, Emanuel EJ. Predicting the future—big data, machine learning, and clinical medicine. N Engl J Med. 2016;375(13):1216–9.
Rajkomar A, Dean J, Kohane I. Machine learning in medicine. N Engl J Med. 2019;380(14):1347–58.
Zhang A, Xing L, Zou J, Wu JC. Shifting machine learning for healthcare from development to deployment and from models to data. Nat Biomedical Eng. 2022;6(12):1330–45.
Aychiluhm SB, Mare KU, Dagnew B, Seid AA, Melaku MS, Sabo KG, et al. Determinants of tobacco use among pregnant women in sub-Saharan Africa. A multilevel mixed-effect logistic regression model. PLoS ONE. 2024;19(5):e0297021.
Minyihun A, Tessema ZT. Determinants of access to health care among women in East African countries: A multilevel analysis of recent demographic and health surveys from 2008 to 2017. Risk Manage Healthc Policy. 2020;13:1803–13.
Pampel F. Tobacco use in sub-Sahara Africa: estimates from the demographic health surveys. Soc Sci Med. 2008;66(8):1772–83.
Yaya S, Uthman OA, Adjiwanou V, Bishwajit G. Exposure to tobacco use in pregnancy and its determinants among sub-Saharan Africa women: analysis of pooled cross-sectional surveys. J Maternal-Fetal Neonatal Med. 2020;33(9):1517–25.
Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
Cutler DR, Edwards TC Jr, Beard KH, Cutler A, Hess KT, Gibson J, et al. Random forests for classification in ecology. Ecology. 2007;88(11):2783–92.
Lundberg S. A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874. 2017.
Chen JH, Asch SM. Machine learning and prediction in medicine—beyond the peak of inflated expectations. N Engl J Med. 2017;376(26):2507.
Parmar A, Katariya R, Patel V, editors. A review on random forest: An ensemble classifier. International conference on intelligent data communication technologies and internet of things (ICICI) 2018; 2019: Springer.
Liaw A. Classification and regression by randomForest. R news. 2002.
James G. An introduction to statistical learning. springer; 2013.
Rimal Y, Sharma N, Alsadoon A. The accuracy of machine learning models relies on hyperparameter tuning: student result classification using random forest, randomized search, grid search, bayesian, genetic, and optuna algorithms. Multimedia Tools Appl. 2024;83(30):74349–64.
Hossain R, Timmer D. Machine learning model optimization with hyper parameter tuning approach. Glob J Comput Sci Technol D Neural Artif Intell. 2021;21(2):31.
Kuhn M, Johnson K. An introduction to feature selection. Appl Predictive Model. 2013:487–519.
Brownlee J. Statistical methods for machine learning: discover how to transform data into knowledge with Python. Machine Learning Mastery; 2018.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
Bifarin OO. Interpretable machine learning with tree-based Shapley additive explanations: application to metabolomics datasets for binary classification. PLoS ONE. 2023;18(5):e0284315.
Mohr F, van Rijn JN. Learning curves for decision making in supervised machine learning: A survey. Mach Learn. 2024:1–55.
Yaya S, Uthman OA, Adjiwanou V, Bishwajit G. Exposure to tobacco use in pregnancy and its determinants among sub-Saharan Africa women: analysis of pooled cross-sectional surveys. J Maternal-Fetal Neonatal Med. 2020;33(9):1517–25.
Caleyachetty R, Tait CA, Kengne AP, Corvalan C, Uauy R, Echouffo-Tcheugui JB. Tobacco use in pregnant women: analysis of data from demographic and health surveys from 54 low-income and middle-income countries. Lancet Global Health. 2014;2(9):e513–20.
Woods NF, Lentz M, Mitchell E. The new woman: health-promoting and health-damaging behaviors. (0739–9332 (Print)).
Maziak W, Ward KD, Afifi Soweid RA, Eissenberg T. Tobacco smoking using a waterpipe: a re-emerging strain in a global epidemic. (1468–3318 (Electronic)).
Palipudi K, Rizwan SA, Sinha DN, Andes LJ, Amarchand R, Krishnan A, Asma S. Prevalence and sociodemographic determinants of tobacco use in four countries of the World Health Organization: South-East Asia region: findings from the Global Adult Tobacco Survey. (1998–4774 (Electronic)).
Abdeta T, Hunduma G. Tobacco use among reproductive age women in Ethiopia: evidence from the National health survey. Subst Abuse Rehabilitation. 2021;12:1–10.
Businelle MS, Kendzor DE, Reitzel LR, Costello TJ, Cofta-Woerpel L, Li Y, Mazas CA. Mechanisms linking socioeconomic status to smoking cessation: a structural equation modeling approach. (1930–7810 (Electronic)).
Maralani V. Understanding the links between education and smoking. (1096–0317 (Electronic)).
Raghupathi V, Raghupathi W. The influence of education on health: an empirical assessment of OECD countries for the period 1995–2015. (0778–7367 (Print)).
Pilehvari A, Chipoletti A, Krukowski R, Little M. Unveiling socioeconomic disparities in maternal smoking during pregnancy: a comprehensive analysis of rural and Appalachian areas in Virginia utilizing the multi-dimensional YOST index. BMC Pregnancy Childbirth. 2024;24(1):828.
Nighbor T, Doogan NJ, Roberts ME, Cepeda-Benito A, Kurti AN, Priest JS et al. Smoking prevalence and trends among a U.S. national sample of women of reproductive age in rural versus urban settings. (1932–6203 (Electronic)).
Tong VT, Dietz PM, Morrow B, D’Angelo DV, Farr SL, Rockhill KM, England LJ. Trends in smoking before, during, and after pregnancy–Pregnancy Risk Assessment Monitoring System, United States, 40 sites, 2000–2010. (1545–8636 (Electronic)).
Pilehvari A, You W, Krukowski RA, Little MA. Examining Smoking Prevalence Disparities in Virginia Counties by Rurality, Appalachian Status, and Social Vulnerability, 2011–2019. (1541-0048 (Electronic)).
Unger JB, Cruz T, Shakib S, Mock J, Shields A, Baezconde-Garbanati L, Palmer P, et al. Exploring the cultural context of tobacco use: a transdisciplinary framework. (1462–2203 (Print)).
Hosseinpoor AR, Parker LA, Tursan d’Espaignet E, Chatterji S. Socioeconomic inequality in smoking in low-income and middle-income countries: results from the World Health Survey. (1932–6203 (Electronic)).
Pierce JP, Gilpin EA. News media coverage of smoking and health is associated with changes in population rates of smoking cessation but not initiation. (0964–4563 (Print)).
Achia TN. Tobacco use and mass media utilization in sub-Saharan Africa. (1932–6203 (Electronic)).
Abdeta T, Hunduma G. Tobacco use among reproductive age women in Ethiopia: evidence from the National health survey. Subst Abuse Rehabilitation. 2021;12:1–10.
Ramsey MW Jr., Chen-Sankey JC, Reese-Smith J, Choi K. Association between marital status and cigarette smoking: variation by race and ethnicity. Prev Med. 2019;119:48–51.
Cho H-J, Khang Y-H, Jun H-J, Kawachi I. Marital status and smoking in Korea: the influence of gender and age. Soc Sci Med. 2008;66:609–19.
Azagba S, Manzione L, Shan L, King J. Trends in smoking during pregnancy by socioeconomic characteristics in the United States, 2010–2017. BMC Pregnancy Childbirth. 2020;20(1):52.
Sreeramareddy CT, Pradhan PM, Sin S. Prevalence, distribution, and social determinants of tobacco use in 30 sub-Saharan African countries. BMC Med. 2014;12:1–13.
Guliani H, Gamtessa S, Çule M. Factors affecting tobacco smoking in Ethiopia: evidence from the demographic and health surveys. BMC Public Health. 2019;19:1–17.
Zuelke AE, Luck T, Schroeter ML, Witte AV, Hinz A, Engel C, et al. The association between unemployment and depression–Results from the population-based LIFE-adult-study. J Affect Disord. 2018;235:399–406.
Amiri S. Unemployment associated with major depression disorder and depressive symptoms: a systematic review and meta-analysis. Int J Occup Saf Ergon. 2022;28(4):2080–92.
Feather NT, Barber JG. Depressive reactions and unemployment. J Abnorm Psychol. 1983;92(2):185.
Jarvis M. The association between having children, family size and smoking cessation in adults. Addiction. 1996;91(3):427–34.
Funding
This study did not receive any funding.
Author information
Contributions
EAT and ATK developed the concept for the study. EAT, ATK, and EYW reviewed the literature. EAT conducted the data analysis. ATZ, TEZ, FGA, GYH, ATK, and EAT discussed the findings. All authors proofread the manuscript for spelling and grammar, and all approved the final version for submission.
Ethics declarations
Ethics approval and consent to participate
This study used secondary data analysis, hence no direct participation from individuals was required. A consent letter for data access was obtained from a major health and demographic survey via a web-based request submitted to http://www.dhsprogram.com. This study used exclusively de-identified information, ensuring full compliance with ethical standards for participant privacy and confidentiality.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Taye, E.A., Woubet, E.Y., Hailie, G.Y. et al. Random forest algorithm for predicting tobacco use and identifying determinants among pregnant women in 26 sub-Saharan African countries: a 2024 analysis. BMC Public Health 25, 1506 (2025). https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12889-025-22794-1
DOI: https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12889-025-22794-1