https://doi.org/10.29013/ESR-20-1.2-13-22
Rui Yang, Emory University E-mail: yr159123@hotmail.com
AMERICANS' WORRY ABOUT FINANCING RETIREMENT: COMPARING THREE PREDICTIVE MODELS
Abstract. This paper aims to build a predictive model of Americans' financial worry over retirement on the basis of demographic factors and subjective financial, physical, and mental health conditions. A cross-sectional, nationally representative data set from the National Health Interview Survey 2018 was used. Missing values in the data set were first flagged with dummy indicator features and then replaced using mean value imputation. Three machine learning models, random forest, logistic regression, and multilayer perceptron, were built, and their respective performances were compared. All three algorithms reported fairly similar results, with approximately a 0.9 true positive rate (TPR), a 0.3 false positive rate (FPR), and a 0.88 area under the curve (AUC). We also found that financial condition is the most important factor relating to people's financial worry. Accordingly, policy-makers should put more weight on this factor when designing specific policies or deciding an individual's eligibility to receive necessary assistance.
Despite recent rebounds in the economy, Americans today have become increasingly worried about financing their retirement since the Great Recession. A survey conducted by the Pew Research Center documented this concern (Fry [18]). The same research has also shown that the proportion of people who worry about financing their retirement varies greatly among different age and income groups, with middle-aged and middle- to low-income individuals being the most worried.
While the subject of financial preparation for retirement has been extensively explored, most research has focused only on how different factors are associated with people's financial anxiety. Research by Owen and Wu in 2007 established the relationship between psychological factors as well as marital status and people's degree of concern about the adequacy of their retirement plans. Their study showed that, compared to other groups, singles and those who have experienced negative events are more worried. Such studies, however, remain on the macro-level. There is a noticeable lack of studies that actually connect factors, such as age or marital status, to build a model that gives predictions on an individual level. This paper aims to fill this gap by developing, validating, and comparing several models based on these factors and, at the same time, finding the most significant predictor relating to people's financial worry after retirement. A more accurate predictive model might help policymakers tailor specific policies on the micro-level to address such worries more effectively.
Data Set Description
The National Health Interview Survey (hereinafter referred to as NHIS) is a nationally representative cross-sectional household study that monitors trends of illness and the progress of current health objectives. The survey was initiated in 1957 by the Centers for Disease Control and Prevention and has been conducted annually since then, with an approximate 70% response rate (National Center for Health Statistics [9]). Data from the survey conducted in 2018 were used in this study.
A list of the 28 indicator features used in the model can be found in Table 1.

Table 1. Description of the 28 features

Group            Code      Description
Demographic      REGION    Region
                 RACERPI2  Race group
                 AGE_P     Age
                 R_MARITL  Marital status
Employment       YRSWRKPA  Number of years on the job
                 WRKLYR4   Whether had a job or business at any time in the past 12 months
Physical health  HYPEV     Ever been told having hypertension
                 CHDEV     Ever been told having coronary heart disease
                 CANEV     Ever been told having cancer
                 LIVEV     Ever been told having a chronic liver condition
                 AMIGR     Had severe headache/migraine in the past 3 months
                 AINTIL2W  Had stomach problem with vomiting/diarrhea in the past 2 weeks
                 DIBEV1    Ever been told having diabetes
                 ARTH1     Ever been told having arthritis
                 SMKSTAT2  Had smoked at least 100 cigarettes
Financial        ASISTLV   How worried about not being able to maintain the standard of living
                 ASINBILL  ... pay monthly bills
                 ASIHCST   ... pay rent/mortgage/housing costs
                 ASICCMP   ... make credit card payments
                 ASICNHC   ... afford medical costs of healthcare
Mental health    ASISLEEP  Hours of sleep on average
                 ASISLPFL  Number of times having trouble falling asleep in the past week
                 ASISAD    How often did you feel so sad that nothing cheers you up in the past month
                 ASINERV   How often did you feel nervous in the past month
                 ASIRSTLS  How often did you feel restless in the past month
                 ASIHOPLS  How often did you feel hopeless in the past month
                 ASIWTHLS  How often did you feel worthless in the past month
Data Pre-processing
Currently, the U.S. full retirement age is 66 years and 2 months (U.S. Social Security Administration [19]). For the purpose of this study, sample points with an age greater than 67 years were excluded from the data set. The final sample size is N = 19,090. The outcome variable is coded as "ASIRETR", corresponding to the survey question "How worried are you right now about not having enough money for retirement?" The effective responses consist of four levels, indicating that the respondent is "very," "moderately," "not too," or "not at all" worried about their financial condition after retirement. This variable was recoded into a binary variable in which the first three levels were combined and coded as 1, while the last level was coded as 0. In this way, one indicates that a respondent has financial worry over retirement, while zero indicates that the respondent does not.
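To make the recode concrete, here is a minimal pandas sketch. The authors performed pre-processing in R; the file path and the numeric level codes 1-4 assumed for "ASIRETR" are illustrative, not taken from the paper.

```python
import pandas as pd

# Load the NHIS 2018 sample adult file (path is hypothetical).
df = pd.read_csv("nhis_2018_sample_adult.csv")

# Exclude respondents older than the full retirement age of 67.
df = df[df["AGE_P"] <= 67]

# Collapse the four worry levels into a binary outcome:
# assumed codes 1-3 ("very", "moderately", "not too" worried) -> 1,
# code 4 (not worried at all) -> 0; any other code becomes NaN.
df["ASIRETR_BIN"] = df["ASIRETR"].map({1: 1, 2: 1, 3: 1, 4: 0})
```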
The explanatory variable representing race (coded as "RACERPI2") was recoded into three groups: white, black/African American, and other, where the third group is a combination of other races as well as multiple races.
A nominal variable is a kind of categorical variable whose levels are simply labels and thus carry no notion of order. For example, in the variable "REGION", "Northeast" is encoded as 1 and "Midwest" is encoded as 2. Even though we want these two levels to be weighted equally, feeding the feature directly into the model is usually problematic, as most algorithms will mistakenly assume that Midwest is greater than Northeast, and the results produced this way may not be optimal. One way to solve this is one-hot encoding (Raschka [16]). The idea behind this approach is to create a dummy feature for each unique value of a nominal variable. Here, for each region in the "REGION" variable, a new binary feature was created whose values indicate whether a sample belongs to that particular region.
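As noted in the Environment section, the dummy features were created in R with the dummies package; a pandas sketch of the same idea follows, using made-up region labels rather than the NHIS numeric codes.

```python
import pandas as pd

# A toy nominal feature; NHIS actually stores numeric region codes.
region = pd.Series(["Northeast", "Midwest", "South", "West", "Midwest"],
                   name="REGION")

# One binary column per unique region; no ordering is implied.
dummies = pd.get_dummies(region, prefix="REGION")
print(dummies)
```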
Missing values
It is common in real-world applications that samples contain missing values for various reasons. In the NHIS data set, a missing value is introduced when the respondent either refused to answer the question or did not know the answer. Most machine learning algorithms become problematic when missing values are present in the data set. A convenient yet defective approach is to simply remove any sample that contains a missing value. However, depending on how many values are missing, we may end up removing too many sample points, which introduces a significant new selection bias, and, at the same time, we risk losing valuable information that the classifier needs to learn the model parameters. An alternative approach is mean value imputation, where each missing value is replaced by the mean of that feature.
As can be seen in (Figure 1), the frequency of missing values is relatively low in both classes: most features have less than 5% missing values, with only one exception that contains about 7%. As a result, using mean value imputation is unlikely to have a significant impact on the overall reliability of the data set.
In this research, two steps were taken to treat the missing values. First, a dichotomous dummy feature was created alongside each feature, using one to indicate sample points containing a missing value in the corresponding feature and zero otherwise. Then the mean imputation method was used to fill in the blanks. This two-step method has the advantage of not only keeping all the original information, but also making the whole data set usable by nearly all algorithms.
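For reference, scikit-learn can express a comparable two-step treatment in one call; the sketch below is illustrative rather than the authors' R workflow, and note that add_indicator only creates an indicator column for features that actually contain missing values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with one missing entry per column.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# strategy="mean" fills each blank with its column mean, and
# add_indicator=True appends a binary column flagging imputed rows.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
print(imputer.fit_transform(X))
# [[1.  2.  0.  0. ]
#  [4.  3.  1.  0. ]
#  [7.  2.5 0.  1. ]]
```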
Figure 1. The distribution of missing values for each class
Standardization
Some machine learning algorithms, such as neural networks, require a specific technique called feature standardization for better training speed and accuracy. Feature standardization transforms different features into comparable scales and ensures all features weigh equally in the training process.
For each feature, its mean $\bar{x}$ and standard deviation $s$ were first computed. Then each data point $x_i$ of that feature was replaced by $y_i$, calculated as:

$$y_i = \frac{x_i - \bar{x}}{s}$$
After standardization, every feature has a mean of zero and a unit standard deviation.
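In scikit-learn, this transformation is provided by StandardScaler; a minimal sketch follows, with the usual caveat (an addition here, not stated in the paper) that the scaler should be fitted on the training set only and then reused on the test set to avoid information leakage.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0],
                    [2.0, 300.0],
                    [3.0, 400.0]])

# fit learns each column's mean and standard deviation;
# transform applies y = (x - mean) / std column by column.
scaler = StandardScaler()
print(scaler.fit_transform(X_train))
```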
Splitting up the data
For most machine learning algorithms, the data set typically needs to be randomly split into a training set and a test set. The training set is fed into the model to learn the parameters, while the test set is held untouched and later used to give an unbiased estimate of how well the trained model generalizes to unseen data. In this study, the training set contains 70% of the data and the test set the remaining 30%.
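With scikit-learn, the split is a single call. The sketch below uses stand-in data, and the stratify option, which preserves the class ratio in both subsets, is an assumption rather than something the paper specifies.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # stand-in feature matrix
y = rng.integers(0, 2, size=100)     # stand-in binary labels

# 70% training / 30% test, as in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)
```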
Machine Learning Models
Logistic regression
Logistic regression is one of the most well-known algorithms for classification that performs very well on linearly separable classes. In logistic regression, each feature has its specific weight wi. The net input
z is calculated as a linear combination of input and feature weights, which is derived by:
$$z = \mathbf{w}^{T}\mathbf{x} = w_{0} + w_{1}x_{1} + \dots + w_{m}x_{m},$$

where $m$ is the number of features. Given $z$ in the entire real number range, it can then be transformed into the probability that a sample belongs to a certain class through the logistic function:

$$\phi(z) = \frac{1}{1 + e^{-z}}$$
The goal of logistic regression is to find the optimal weights that maximize the likelihood of the overall classification, which is equivalent to minimizing the cost function over the entire data set, given below:

$$C(\mathbf{w}) = -\sum_{i}\left[y^{(i)}\log\left(\phi(z^{(i)})\right) + \left(1 - y^{(i)}\right)\log\left(1 - \phi(z^{(i)})\right)\right]$$
A typical problem during the training process is overfitting, which occurs when the model is more complex than the data warrant. Overfitting can be identified when the model performs much better on the training set than on the test set. A possible way to reduce this high variance is to introduce regularization into the model: an extra term is added to the cost function that penalizes extreme parameter weights. A common form is L2 regularization, which can be written as follows:
$$L2 = \frac{\lambda}{2}\lVert\mathbf{w}\rVert^{2} = \frac{\lambda}{2}\sum_{i=1}^{m} w_{i}^{2}$$
This paper used L2 regularization and 5-fold cross-validation in the model. The regularization parameter C was set to 100, which was found via grid search. The model was built using the scikit-learn package with all other options left at their defaults. Raschka's book provides more details about logistic regression and regularization [15].
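A sketch of this setup follows, reusing X_train and y_train from the splitting sketch above; the candidate grid for C is an assumption, since the paper reports only the selected value of 100.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# L2-penalized logistic regression; C is the inverse regularization
# strength, chosen by 5-fold cross-validated grid search.
search = GridSearchCV(
    LogisticRegression(penalty="l2", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10, 100, 1000]},
    cv=5,
    scoring="roc_auc")
search.fit(X_train, y_train)
print(search.best_params_)  # the paper reports C = 100
```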
Artificial neural network
An artificial neural network is a computational model inspired by biological neural networks. Unlike linear models such as logistic regression, neural networks are able to capture non-linear relations within data, which makes the algorithm stand out from linear models when the relationship between the input and output is highly complicated.
A multilayer network is an artificial neural network that consists of one input layer, one or multiple hidden layers, and one output layer. The input layer is the first layer, the output layer is the last layer, and any layers between them are hidden layers. The data are passed into the input layer, processed by the hidden layers, and finally transformed into predicted labels in the output layer.
An activation function in neural networks brings in the much-desired non-linearity that enables the model to capture almost any relationship. The three most common activation functions are the logistic function, the hyperbolic tangent function, and the rectified linear unit (ReLU), the last of which is given below:

$$f(x) = \max(0, x)$$
This paper used ReLU as the activation function for the neural network. One advantage of ReLU over the other two functions in this model is its reduced computational cost (Arora et al. [2]). In this model, the network has two hidden layers, with 16 nodes in the first layer and 8 nodes in the second layer. The network was trained via a technique called back propagation. More details about this technique and neural networks in general can be found in Raschka's book [17]. As with logistic regression, this paper used L2 regularization in the training process, and the regularization parameter α was set to 0.0572, which was found via grid search.
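The configuration described above maps directly onto scikit-learn's MLPClassifier; in this sketch, max_iter and random_state are assumptions, and X_train and y_train are reused from the splitting sketch.

```python
from sklearn.neural_network import MLPClassifier

# Two hidden layers with 16 and 8 nodes, ReLU activation, and the
# reported L2 penalty alpha = 0.0572; other settings are defaults.
mlp = MLPClassifier(hidden_layer_sizes=(16, 8),
                    activation="relu",
                    alpha=0.0572,
                    max_iter=500,
                    random_state=42)
mlp.fit(X_train, y_train)
```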
Random forest
Random forest is an ensemble learner in which multiple weak learners (decision trees) are trained independently on random samples of the training data drawn with replacement. A single decision tree usually has a high chance of overfitting the data as it grows deeper. However, this high variance can be reduced by combining many uncorrelated decision trees trained on the same data set. In general, a random forest outperforms a single decision tree and achieves a better balance between variance and bias. A more detailed description of random forest can be found in Breiman's article [3].
The random forest in this model was built using scikit-learn with bootstrap samples, 100 decision trees, and a minimum of 50 samples per leaf. The cross-entropy function given below was applied as the cost function to maximize the information gain for each decision tree:

$$I_{H}(t) = -\left[p\log_{2}p + (1 - p)\log_{2}(1 - p)\right],$$

where $p$ is the proportion of the samples that belong to class zero at a particular node $t$ (Witten et al. [24]). Unlike some other random forest models that use a majority vote, the predicted label of each sample point in this model is decided by averaging the probabilistic predictions of the decision trees involved.
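This configuration corresponds to the scikit-learn sketch below; random_state is an assumption, and scikit-learn's predict and predict_proba indeed average the trees' probabilistic predictions rather than taking a majority vote.

```python
from sklearn.ensemble import RandomForestClassifier

# 100 bootstrap-sampled trees, at least 50 samples per leaf, and the
# entropy (cross-entropy) split criterion described above.
forest = RandomForestClassifier(n_estimators=100,
                                min_samples_leaf=50,
                                criterion="entropy",
                                bootstrap=True,
                                random_state=42)
forest.fit(X_train, y_train)
```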
Environment
The data pre-processing was mainly conducted in R 3.6.1 (R Core Team [14]). The missing value visualization was produced using the package ggplot2 (Wickham [21]), the data cleaning was done with the packages dplyr (Wickham et al. [22]) and tidyr (Wickham and Henry [23]), and dummy features were created using dummies (Brown [4]). The data set read-in, partition, model training, and model validation were done using scikit-learn (Pedregosa et al. [13]), SciPy (Virtanen et al. [20]), NumPy (Oliphant [12]), and pandas (McKinney [7]). Other graphs were produced using Matplotlib (Hunter [5]).
Model Validation
The most common metrics for measuring the performance of binary classification models are the confusion matrix, the receiver operating characteristic (ROC) curve, and the area under the curve (AUC). A confusion matrix is simply a matrix that lays out the counts of true positive, true negative, false positive, and false negative predictions of a classifier. Figure 2 displays the meaning of these terms.
Figure 2. An illustration of the confusion matrix
ROC graphs are useful for comparing the performance of different models. The x-axis of a ROC graph is the false positive rate, and the y-axis is the true positive rate. When giving a prediction for a particular test case, most machine learning algorithms return a probability rather than a binary label; the classification is made by setting a decision threshold to dichotomize the result. As the decision threshold shifts from 0 to 100%, the false positive rate and the true positive rate change accordingly, and connecting those points yields the ROC curve. The diagonal of a ROC graph can be interpreted as random guessing, while a curve that falls below the diagonal performs worse than guessing. A curve reaching the top-left corner, with a true positive rate of 100%, represents a perfect classifier that gives correct predictions under any decision threshold. In general, a classifier performs better the closer its ROC curve is to the top-left corner.
It might be hard to identify which algorithm performs better by looking directly at ROC curves, especially when one curve is not totally enclosed in another. AUC overcomes this by finding the area under the ROC curve. A theoretically perfect classifier will have an AUC of 1, while a classifier that guesses randomly will have a value of 0.5. Generally, a higher value indicates better classification performance.
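For one of the fitted models, these metrics can be computed as follows; the sketch reuses forest, X_test, and y_test from the earlier examples.

```python
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

# Predicted probability of the positive class on the held-out test set.
proba = forest.predict_proba(X_test)[:, 1]

# roc_curve sweeps the decision threshold; roc_auc_score summarizes
# the resulting curve as a single area value.
fpr, tpr, thresholds = roc_curve(y_test, proba)
print(roc_auc_score(y_test, proba))

# Confusion matrix at the default 0.5 threshold.
print(confusion_matrix(y_test, forest.predict(X_test)))
```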
Results and Discussion
The confusion matrices of the three algorithms can be found in (Figure 3). All three algorithms have about a 90% true positive rate and a 30% false positive rate. In the diagnosis of worry about financing retirement, we are more concerned about providing potential financial and mental assistance to those who are truly anxious; the models correctly identify 90% of those people. At the same time, it is also important to reduce the potential waste of resources on those who are incorrectly identified as positive samples. In contrast to the true positive rate, the false positive rate reflects the fraction of negative samples that are incorrectly labeled positive.
Figure 3. Normalized confusion matrices: random forest (left), logistic regression (middle), multilayer perceptron (right)
Table 2 shows a comparison of the three algorithms used in this study. Each algorithm was run 10 times, and the values shown are the mean and standard deviation of the AUC scores. As can be seen from the table as well as Figure 4, all algorithms report fairly similar results, with the multilayer perceptron scoring slightly better than the other two. However, considering the influence of different partitions of the data set as well as other random disturbances in the training process, this difference can safely be ignored. In effect, all three algorithms demonstrate the same performance on this data set.
Table 2. AUC results generated from the three classification algorithms

Algorithm               Mean    Std
Random Forest           0.884   0.0032
Logistic Regression     0.883   0.0034
Multilayer Perceptron   0.886   0.0042
Figure 4. The mean ROC curves for three classifiers trained on the data set
When a model has high training accuracy but low test accuracy, the model is said to have high variance; when both training and test accuracy are low, the model suffers from high bias. Random forest is a fairly robust algorithm that is unlikely to overfit the data as long as the number of decision trees is large enough. To further diagnose variance and bias issues within the other two models, and to see whether increasing the number of training samples would help, a learning curve was plotted for each algorithm (Figure 5).
Figure 5. Learning curves of two models: logistic regression (left), multilayer perceptron (right)
For both models, the testing score comes closer to the training score as the sample size increases from 1,000 to 2,500. Both then remain at approximately the same accuracy and stop improving as the sample grows further. Consequently, collecting more data would be of little help for further improving classification accuracy, but introducing new features to build a more complex model might be. In addition, as a linear model, regular logistic regression may perform poorly at capturing non-linear relationships within the data. To remedy this, future research could include higher-order combinations of the original features in the model in order to achieve a better trade-off between bias and variance.
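Such curves can be produced with scikit-learn's learning_curve helper; in this sketch, the size grid and the scoring metric are assumptions, and mlp, X_train, and y_train come from the earlier sketches.

```python
import numpy as np
from sklearn.model_selection import learning_curve

# Train on progressively larger fractions of the training set and
# score each subset with 5-fold cross-validation.
sizes, train_scores, test_scores = learning_curve(
    mlp, X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 8),
    cv=5,
    scoring="roc_auc")
print(sizes)
print(train_scores.mean(axis=1))
print(test_scores.mean(axis=1))
```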
When building a predictive model for situations such as worry about financing retirement, most researchers care not only about the performance of their model, but also whether the model is able to provide a way that enables human users to interpret the results. However, most machine learning
algorithms do not offer a straightforward explanation of the relationships they learn. In neural networks, for example, an input is passed through many layers of transformations involving thousands or even millions of mathematical operations, which makes the algorithm difficult to interpret. Random forest does better in this respect, as it can calculate feature importance via a technique called mean decrease in impurity (MDI). Louppe et al. [6] provide a description of the technical details of this technique as used in the scikit-learn package. Figure 6 shows the ten most important features and their respective importances in the random forest.
It is worth noting that the importances of all features add up to 1. Referring to Table 1, it is easy to see that the top five features all belong to the financial group, which accounts for about 0.85 of the total importance. Based on this result, the financial group is the most informative group and plays the most essential role in this model.
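Extracting and ranking the MDI importances from a fitted scikit-learn forest is straightforward; in this sketch the feature names are hypothetical placeholders rather than the actual NHIS variable names.

```python
import numpy as np

# Hypothetical column names in training order; in practice these come
# from the pre-processed data frame.
feature_names = np.array([f"feature_{i}" for i in range(X_train.shape[1])])

# MDI importances of the fitted forest; they sum to 1 across features.
importances = forest.feature_importances_
for i in np.argsort(importances)[::-1][:10]:
    print(feature_names[i], round(float(importances[i]), 3))
```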
Figure 6. Ten most important features in random forest
Therefore, when proposing policies to address financial worry after retirement in a more specific way, legislators should focus more on improving people's current financial condition or the overall economy rather than tailoring specific policies to different races, ages, or other demographic groups. The results also demonstrate that mental health, even though some of its factors appear among the top ten features, plays a minor role in people's thinking and should thus be less emphasized.
Conclusion
The intention of this study was to build a predictive model with the best performance and to investigate the factors most related to Americans' worry about financing retirement. Three different models were built, and all of them achieved similarly strong performance. The study finds that current financial condition is the key group involved in this process.
One limitation of the study is that it only establishes the importance of financial factors in predicting worry but does not quantitatively characterize their relationships; a potential direction would be to analyze how they are correlated. In addition, the mean value imputation method used in this paper to fill in missing values is a very simple and rough approach: all missing values in a feature are replaced by the same value. This method does not consider potential feature correlations and is likely to reduce the data variance. Future studies could use more advanced imputation methods such as k-nearest neighbors (kNN), which replaces a missing element with the mean of that feature over the k nearest neighbors of the particular sample, to obtain better performance.
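For illustration, scikit-learn's KNNImputer implements this idea; the sketch below uses a toy matrix and k = 2.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# Each missing entry is replaced by the mean of that feature over the
# k nearest rows (k = 2 here), using a distance measure that ignores
# missing coordinates.
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```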
References: