
Splitting on categorical predictors in random forests

Marvin N. Wright

1Leibniz Institute for Prevention Research and Epidemiology—BIPS, Bremen, Germany

Inke R. König

2Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany

Academic Editor: Andrew Gray

Received 2022 Jun 26; Accepted 2022 Dec 22.

Supplementary Materials

Supplemental Data 1: Supplementary material.

DOI: 10.7717/peerj.6339/supp-1

Supplemental Information 2: Replication material for simulation and real data study.

DOI: 10.7717/peerj.6339/supp-2

Data Availability Statement

The following data was supplied regarding data availability:

R scripts to replicate the simulation and real data study are available in the Supplemental File. Some raw data was provided to us by the National Arthritis Foundation. We received permission to re-use the raw data for the purpose of this project, but not to share the original data.

Abstract

One reason for the widespread success of random forests (RFs) is their ability to analyze most datasets without preprocessing. For example, in contrast to many other statistical methods and machine learning approaches, no recoding such as dummy coding is required to handle ordinal and nominal predictors. The standard approach for nominal predictors is to consider all 2^(k−1) − 1 2-partitions of the k predictor categories. However, this exponential relationship produces a large number of potential splits to be evaluated, increasing computational complexity and restricting the possible number of categories in most implementations. For binary classification and regression, it was shown that ordering the predictor categories in each split leads to exactly the same splits as the standard approach. This reduces computational complexity because only k − 1 splits have to be considered for a nominal predictor with k categories. For multiclass classification and survival prediction, no ordering method producing equivalent splits exists. We therefore propose to use a heuristic which orders the categories according to the first principal component of the weighted covariance matrix in multiclass classification and by log-rank scores in survival prediction. This ordering of categories can be done either in every split or a priori, that is, only once before growing the forest. With this approach, the nominal predictor can be treated as ordinal in the entire RF procedure, speeding up the computation and avoiding category limits. We compare the proposed methods with the standard approach, dummy coding and simply ignoring the nominal nature of the predictors in several simulation settings and on real data in terms of prediction performance and computational efficiency.
We show that ordering the categories a priori is at least as good as the standard approach of considering all 2-partitions in all datasets considered, while being computationally faster. We recommend using this approach as the default in RFs.

Keywords: Random forest, Categorical predictors, Classification, Survival analysis

Introduction

Random forests (RF; Breiman, 2001) are a popular machine learning method, successfully used in many application areas such as economics (Varian, 2014), spatial predictions (Hengl et al., 2017; Schratz et al., 2018) or genomics (Goldstein, Polley & Briggs, 2011). In part, their success is due to their ease of use: RF often works well without much fine tuning or preprocessing, and categorical predictors can be included without recoding. Ordered categorical (ordinal) predictors such as customer satisfaction or the Likert scale can simply be treated the same way as numerical predictors.

For unordered categorical (nominal) predictors such as geographical regions, colors, medication types or cell types, the standard approach is to consider all 2-partitions of the k predictor categories. Each of these 2^(k−1) − 1 partitions is tried for splitting, and the best partition is selected. This approach has two major drawbacks: First, for predictors with many categories, the exponential relationship produces a large number of potential splits to evaluate, increasing computational complexity. For instance, the predictor country code with all 28 EU countries as categories would produce 2^(28−1) − 1 ≈ 1.34 × 10^8 possible split points. Second, each category has to be assigned to the left or right child node in every node and thus these partition splits cannot be saved as a single split point, as is done for ordinal or continuous predictors. A popular option is to use the bit representation of a single integer to save these child node assignments, for example, by setting a bit to 0 for the left child node and to 1 for the right child node. However, this approach limits the possible number of categories to 32 or 64, depending on the implementation.

As a solution for binary classification and regression, it was suggested to order the predictor levels in each split. Specifically, it was shown that ordering the categories by the proportion of 1's in a binary outcome and treating these ordered categories as ordinal leads to exactly the same splits in the optimization (Fisher, 1958; Breiman et al., 1984; Ripley, 1996). The same holds for ordering by increasing mean of continuous outcomes (Breiman et al., 1984). This approach reduces computational complexity, because only k − 1 splits have to be considered for a nominal predictor with k categories.
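The ordering trick can be sketched as follows (an illustrative Python sketch, not the R code used in our implementations; the function name is ours):

```python
from collections import defaultdict

def order_categories_by_mean(x, y):
    """Order the categories of a nominal predictor x by the mean of the
    outcome y (the proportion of 1's for a 0/1-coded binary outcome).

    Returns the categories sorted by increasing outcome mean."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, out in zip(x, y):
        sums[cat] += out
        counts[cat] += 1
    return sorted(sums, key=lambda c: sums[c] / counts[c])
```

The sorted categories can then be recoded by their rank and treated as ordinal, so that only k − 1 ordered split points remain to be evaluated.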

An extension to ordering predictor levels at every split is to order them only once before growing the forest. With this approach, the split point reduces to a single number, as for ordinal or continuous predictors. In addition, the categories have to be sorted only once, thus speeding up the computation. For multiclass classification and survival prediction, the simplifications of Breiman et al. (1984), Ripley (1996) and Fisher (1958) do not apply and no fast method which selects exactly the same splits exists. Several methods have been proposed to reduce the computational burden. Nádas et al. (1991) and Mehta, Agrawal & Rissanen (1996) propose heuristics which iteratively find a good split, but the splits have to be saved as node assignments. This leads to limits in the number of levels and to memory issues (see above), and the methods cannot be used to order categories a priori. Loh & Vanichsetakul (1988) suggest transforming the categorical variable to a dummy matrix and projecting the columns of this matrix onto their largest discriminant coordinate (CRIMCOORD) (Gnanadesikan, 2011). Coppersmith, Hong & Hosking (1999) propose to order the categories according to the first principal component of the weighted covariance matrix. We use the approach of Coppersmith, Hong & Hosking (1999) for multiclass classification because it can be applied a priori to the entire dataset and is computationally faster than the approach of Loh & Vanichsetakul (1988). For survival outcomes, we propose to order by the log-rank score, as defined by Hothorn & Lausen (2003).

We compare the two ordering schemes and the proposed methods for multiclass classification and survival prediction with the standard approach in simulations and on real data in terms of prediction performance and computational efficiency. As a benchmark, we also include dummy coding as it is used in generalized linear models. We show that the choice of the splitting algorithm for nominal predictors has a great impact on prediction performance and runtime. Compared to the standard approach, the performance can be improved while simultaneously reducing computational time.

Methods

Random forests

Random forests (Breiman, 2001) are a machine learning method based on collections of classification, regression or survival trees. In each split within a tree, the impurity reduction is maximized. This reduction is usually measured by the Gini index (Breiman et al., 1984) for classification, by the sum of squares (Ishwaran, 2015) for regression and by the log-rank statistic (Ishwaran et al., 2008) or Harrell's C-index (Schmid, Wright & Ziegler, 2016) for survival outcomes. To create diverse trees, each tree is grown on a subsample or bootstrap sample of the data. A further random element is introduced by randomly drawing splitting candidate variables at each split. The algorithm grows trees to form the ensemble and aggregates the predictions of the single trees to generate an ensemble prediction. The aggregation procedure is specific to the outcome type: For classification a majority vote is used, for regression an arithmetic mean and for survival an ensemble cumulative hazard estimator (Ishwaran et al., 2008). For details we refer to the original literature (Breiman, 2001) and to recent reviews (Biau & Scornet, 2016; Boulesteix et al., 2012; Ziegler & König, 2014).

The most important parameters of RFs are the number of trees, the number of variables randomly selected as splitting candidates (the so-called mtry value) and the tree size (Probst, Wright & Boulesteix, 2018b). The tree size can be specified by setting a lower bound for the terminal node size or an upper bound for the tree depth. For classification outcomes, no average has to be computed in the terminal nodes and consequently tree growing typically continues until only one class remains in a node. Cross validation can be used to tune these parameters in order to optimize prediction accuracy without overfitting. Model comparisons should use nested resampling for unbiased results (Bischl et al., 2012). According to Probst, Bischl & Boulesteix (2018a), mtry is the parameter with the highest tunability, that is, it is most important to tune.

Methods to handle nominal predictors

Standard approach: consider all 2-partitions

The standard approach to handle nominal predictors in RFs is to consider all 2-partitions of the k predictor categories. Each category can either be assigned to the left or the right child node. The order of the two child nodes does not matter for further splitting and thus the number of possible partitions is given by the Stirling partition number S(k, 2) = 2^(k−1) − 1. Each of these partitions is tried for splitting, and the best partition is selected, see Fig. 1A for a simple example. We refer to this method as "Partition."
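The count S(k, 2) = 2^(k−1) − 1 can be verified by enumerating the partitions directly (an illustrative Python sketch, not part of any reference implementation; fixing the first category in the left child avoids counting mirrored partitions twice):

```python
from itertools import combinations

def all_two_partitions(categories):
    """Enumerate all 2-partitions of a list of category labels.

    The first category is always kept in the left child node, so each
    unordered partition is generated exactly once; the trivial partition
    with an empty right child is skipped."""
    first, rest = categories[0], categories[1:]
    partitions = []
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            left = {first, *extra}
            right = set(categories) - left
            if right:  # skip the partition that leaves one child empty
                partitions.append((left, right))
    return partitions
```

For k = 4 categories this yields 2^3 − 1 = 7 partitions; for k = 28, as in the EU country example above, already 2^27 − 1 ≈ 1.34 × 10^8.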

Figure 1 (peerj-07-6339-g001.jpg): Illustration of nominal predictor methods.

The task in this example is to separate even and odd digits in a single split. The Partition method (A) tries all seven possible partitions and selects the correct one. In Ignore (B), all three possible linear splits are tried but the correct partition is impossible. The Order method (C) orders the categories before splitting and is able to find the correct split with three trials.

A memory efficient way to save these splits is to save the node assignment as an integer bit representation. For each category, one bit is set to 0 or 1, corresponding to an assignment to the left or right child node, respectively. Because there are 2^(k−1) − 1 possible 2-partitions, only 64 categories can be assigned with a 64 bit integer. In some implementations such as randomForest (Liaw & Wiener, 2018) and ranger (Wright & Ziegler, 2017), a double is used to save the split points, inducing a limit of 53 levels, so that categorical predictors with more than 53 categories cannot be handled. Alternatively, all levels assigned to one child have to be saved in a split, removing the level limit but requiring more memory.
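The bit representation can be sketched as follows (illustrative Python with unbounded integers; actual implementations store the mask in a fixed-width integer or in the mantissa of a double, which is exactly where the 32-, 53- or 64-category limits come from):

```python
def encode_split(categories, left_child):
    """Encode a 2-partition as an integer bit mask: bit i is 1 if
    category i goes to the right child node, 0 for the left."""
    mask = 0
    for i, cat in enumerate(categories):
        if cat not in left_child:
            mask |= 1 << i
    return mask

def goes_left(mask, category_index):
    """Route an observation to the left child if its category's bit is 0."""
    return (mask >> category_index) & 1 == 0
```

A single number thus encodes the full child node assignment, but only for as many categories as there are usable bits.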

Ordering categories

As explained above, it was shown for continuous outcomes that ordering the predictor levels in each split by increasing mean of the outcome and treating the ordered categories as ordinal leads to exactly the same splits in the optimization as the partitioning approach (Breiman et al., 1984; Fisher, 1958). For binary outcomes, splitting by the Gini impurity is equivalent to coding the outcome as 0 and 1 and applying the weighted variance splitting used for regression outcomes (Ishwaran, 2015). Consequently, the same equivalence of splits as for continuous outcomes holds (Ripley, 1996; Breiman et al., 1984).

For multiclass classification, no sorting algorithm leading to equivalent splits is known and we propose to use the approximation of Coppersmith, Hong & Hosking (1999). To sort the nominal predictor x with the categories 1,…,k by the multiclass outcome y with the classes 1,…,c, the method works as follows:

  • Step 1 Compute the k × c contingency table N between the predictor x and the outcome y, where k is the number of categories of x and c the number of classes of y.

  • Step 2 Convert the contingency table N to the class probability matrix P, where each row p_α, α ∈ 1,…,k consists of the elements

    p_{γα} = (Σ_{i=1}^{n} 𝟙(x_i = α, y_i = γ)) / (Σ_{i=1}^{n} 𝟙(x_i = α)),  α ∈ 1,…,k, γ ∈ 1,…,c,

    representing the relative class frequencies of predictor category α.

  • Step three Compute the weighted covariance matrix

    Σ = (1 / (n − 1)) Σ_{α=1}^{k} n_α (p_α − p̄)(p_α − p̄)^T,

    where n_α = Σ_{i=1}^{n} 𝟙(x_i = α) is the number of samples with category α, n is the total number of samples and p̄ is the vector of mean class probabilities per outcome class.

  • Step 4 Compute the first principal component v of Σ and the principal component scores S_α of the predictor categories by S_α = v · p_α.

  • Step 5 Sort the categories by their principal component scores S_α.
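Steps 1–5 can be sketched as follows (illustrative Python with NumPy for the eigendecomposition; the R implementations in simpleRF and ranger are the reference, and note that the sign of the principal component, and hence the direction of the resulting order, is arbitrary):

```python
import numpy as np

def order_by_pca(x, y):
    """Order the categories of nominal predictor x by the first principal
    component of the weighted covariance matrix of the per-category class
    probabilities (Coppersmith, Hong & Hosking, 1999)."""
    cats = sorted(set(x))
    classes = sorted(set(y))
    # Step 1: k x c contingency table
    N = np.zeros((len(cats), len(classes)))
    for xi, yi in zip(x, y):
        N[cats.index(xi), classes.index(yi)] += 1
    # Step 2: class probability matrix, rows sum to one
    P = N / N.sum(axis=1, keepdims=True)
    # Step 3: weighted covariance matrix of the rows of P
    n_alpha = N.sum(axis=1)
    n = n_alpha.sum()
    p_bar = (n_alpha @ P) / n  # weighted mean class probabilities
    D = P - p_bar
    Sigma = (D.T * n_alpha) @ D / (n - 1)
    # Step 4: first principal component and category scores
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    v = eigvecs[:, np.argmax(eigvals)]
    scores = P @ v
    # Step 5: sort categories by their principal component scores
    return [cats[i] for i in np.argsort(scores)]
```

Because the sign of v is arbitrary, the order may come out reversed; for splitting this does not matter, since an ordered split point separates the same two groups either way.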

As in the case of multiclass classification, no lossless alternative to partitioning is known for survival outcomes. To order categories by their average survival time, censoring has to be taken into account. A simple approach would be to use the median survival, as estimated by the Kaplan–Meier method. However, in case of high censoring proportions, more than half of the observations may still be alive at the end of the study, so that the median survival will not be available. As an alternative, we propose to sort by the mean of the log-rank scores (Hothorn & Lausen, 2003) of each category. The log-rank score for an observation i is defined as

a_i(Z, δ) = δ_i − Σ_{j=1}^{γ_i(Z)} δ_j / (N − γ_j(Z) + 1),

where Z and δ are the vectors of survival times and censoring indicators, respectively, N is the total number of observations and γ_i(Z) is the number of observations which experienced an event or were censored before or at time Z_i (Hothorn & Lausen, 2003).
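A sketch of the score computation (illustrative Python reflecting our reading of the formula, with γ computed over all observations so that tied times share the same value; the R code in the Supplemental File is the reference):

```python
def logrank_scores(time, status):
    """Log-rank scores of Hothorn & Lausen (2003):
    a_i = delta_i - sum over the gamma_i earliest observations j of
    delta_j / (N - gamma_j + 1), where gamma_i is the number of
    observations with an event or censoring before or at time Z_i."""
    n = len(time)
    gamma = [sum(1 for z in time if z <= time[i]) for i in range(n)]
    order = sorted(range(n), key=lambda i: time[i])
    scores = []
    for i in range(n):
        acc = sum(status[j] / (n - gamma[j] + 1) for j in order[:gamma[i]])
        scores.append(status[i] - acc)
    return scores
```

To order the categories of a nominal predictor, the mean score of the observations in each category is computed and the categories are sorted by this mean.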

The equivalence of the partitioning and ordering approaches for continuous and binary outcomes holds if the categories are re-ordered in each split in the tree. An alternative is to order the categories a priori, that is, once on the entire dataset before the analysis. This approach has several advantages: First, the categories have to be sorted only once, resulting in faster computation. Second, this approach does not suffer from the category limit problem—the level order only needs to be saved once for each nominal predictor. Third, since the categories are treated as ordinal in the splitting procedure, categories not present in the training data in a node can nevertheless be assigned to a child node and thus the "absent levels problem" (Au, 2017) is avoided. Finally, the approach can be easily used with any RF implementation because no changes in the splitting algorithm are required.

In the following, we denote ordering in each split or a priori as "Order (split)" and "Order (once)," respectively. We implemented both ordering schemes in the simpleRF package (Wright, 2018). The package uses only plain R to allow easy modification of the RF splitting algorithm, which is required for the Order (split) method. We also implemented the Order (once) method in the runtime-optimized ranger package (Wright & Ziegler, 2017).

Other alternatives

Another alternative is to use dummy or one-hot encoding. In one-hot encoding, a nominal predictor with k categories is replaced by k binary predictors, each corresponding to one category. For each observation, one of these k binary predictors is 1, all others are set to 0. The same information contained in the k binary predictors can be saved in k − 1 binary predictors by setting all converted predictors to 0 for one category (the reference category). Usually, the former is denoted one-hot encoding and the latter dummy coding. The advantages of these approaches are their simplicity and that they can be used with any algorithm able to handle numeric data. However, the size of the dataset increases substantially, in particular for many categorical variables or for many categories. We denote this method as "Dummy."
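Both encodings fit in a few lines (an illustrative Python sketch; in practice library functions such as pandas' get_dummies would be used):

```python
def one_hot(x, drop_first=False):
    """One-hot encode a nominal predictor: k binary columns, one per
    category. With drop_first=True the first (reference) category is
    dropped, giving the k - 1 columns of dummy coding."""
    cats = sorted(set(x))
    if drop_first:
        cats = cats[1:]  # the dropped category becomes the reference
    return [[1 if xi == c else 0 for c in cats] for xi in x]
```

A single predictor with 50 categories thus adds 50 (or 49) columns, which illustrates why the dataset size grows quickly with many categories.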

Finally, the simplest approach is to ignore the nominal nature of the predictors. They are assumed to be ordinal in the order they are encoded or as they appear in the data. For example, in R, factor variables are sorted alphabetically by default. We refer to this method as "Ignore." An example is shown in Fig. 1B.

Methods available in other packages

The package randomForest (Liaw & Wiener, 2018) orders the levels of a nominal predictor in every split for regression and binary classification but considers all 2-partitions for multiclass classification. The randomForestSRC package (Ishwaran & Kogalur, 2017) considers all 2-partitions for all outcome types. In h2o (H2O.ai, 2017), several options to handle nominal predictors are available, including one-hot encoding and ordering a priori. However, no proper ordering method is implemented for multiclass classification. By default, the categories are grouped to fewer categories (in lexical ordering) in each split until the number of categories is below a threshold, in which case all 2-partitions are considered. Many other implementations, including xgboost (Chen et al., 2018) and the Python implementation scikit-learn (Pedregosa et al., 2011), currently accept only numeric data. However, for scikit-learn it has been discussed whether support for nominal data should be added (https://github.com/scikit-learn/scikit-learn/pull/4899).

Simulation studies

We performed simulation studies to compare the methods to handle nominal predictors regarding prediction performance and computational efficiency. Three types of nominal predictors were combined with four types of outcome, resulting in twelve simulation scenarios (Table 1).

Table 1

Simulation scenarios.

Name | Categories | Predictor type
Digit | 10 | Digits between 0 and 9, separate odd and even digits
SNP | 3 | Single nucleotide polymorphisms with heterozygous effect
Group | 10 | Two groups of observations with different predictor effects

The first type of nominal predictors were digits. For each of the p predictors, n digits between 0 and 9 were simulated. For regression, the outcome was the difference between the number of odd digits and the number of even digits, for binary classification the outcome coded whether there were more odd than even digits. For multiclass classification, one outcome class was used for each combination of the number of odd and even digits, corresponding to 16 classes. For survival outcomes, an exponentially distributed random variable was used for the survival time, where the parameter λ was set to the difference between the number of odd digits and the number of even digits (as in regression). A fraction of 30% of the training observations was randomly selected to be censored. The time until censoring for these observations was drawn from a uniform distribution between 0 and the true survival time. The validation data was not censored.
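For the regression case, the data-generating process can be sketched as follows (illustrative Python reflecting our reading of the scenario; parameter names are ours and the R replication scripts in the Supplemental File are the reference):

```python
import random

def simulate_digit_regression(n, p, seed=0):
    """Digit scenario, regression outcome: each of p predictors is a
    digit 0-9, and the outcome is the number of odd digits minus the
    number of even digits across the p predictors."""
    rng = random.Random(seed)
    X = [[rng.randint(0, 9) for _ in range(p)] for _ in range(n)]
    y = [sum(1 if d % 2 else -1 for d in row) for row in X]
    return X, y
```

Because the outcome depends only on the odd/even status of each digit, a split method must be able to separate the odd from the even categories, which is exactly what Fig. 1 illustrates.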

The second type of nominal predictors were single nucleotide polymorphisms (SNPs), that is, variations in single nucleotides on the genome. SNPs are usually coded with the number of minor alleles, which can be 0, 1, or 2. We simulated a heterozygous effect, that is, an effect if exactly one minor allele is present. For regression, we simulated the outcome with a simple linear model, for binary classification with a logit model. For multiclass classification, one outcome class was used for each combination of heterozygous effects, corresponding to 16 classes. For survival outcomes, the same procedure as in the Digit data was used, with the parameter λ set to the linear predictor used in the regression case. We used p = 4 predictors per dataset, a minor allele frequency of 25% and an effect size of 3.

The third type of nominal predictors was specifically designed without a latent ordering to make a priori ordering difficult. Each of the p predictors was simulated with 10 equally frequent categories. The categories were simulated to have different linear effects in two groups of observations. All n observations were randomly assigned to one of two groups and in both groups each category was assigned a random effect size between 1 and 10. For regression, the outcome was the sum of the predictor effects. For binary classification, the outcome coded whether this sum was larger than the median of all observations. For multiclass classification, each predictor was assigned to be high if the effect size was 5 or higher, or low for values below 5. Each combination of high and low values corresponded to one outcome class, with a total of 16 classes. For survival outcomes, the same procedure as for the other predictor types was used, where the parameter λ was set to the sum of the predictor effects.

We applied all five methods to handle nominal predictors on each simulation scenario. For each dataset, we simulated n training observations and n validation observations. We used nested resampling for performance evaluation: 100 training and validation datasets were simulated and on each training dataset, we tuned the mtry value, that is, the number of variables available for splitting in a node, with a 5-fold cross validation. Ten values evenly distributed between 1 and the number of predictors were used for the dummy coding and mtry ∈ {1,2,3,4} for all other approaches. The number of trees was set to 50. All other parameters were set to the default values. In each simulation replicate, the tuned model was then applied to the validation data to evaluate prediction performance. For regression, the mean squared error (MSE) was used to measure prediction performance, for classification the proportion of misclassifications and for survival the integrated Brier score (Graf et al., 1999). The simulations were repeated with n ∈ {50,100,200}. In each simulation scenario, the alternative methods were statistically compared to the Partition method with a two-sided paired t-test over the simulation replications. Differences with two-sided p < 0.05 were considered statistically significant.
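The tuning protocol follows the usual nested-resampling pattern, which can be sketched generically (illustrative Python with user-supplied placeholder fit and score functions; the actual study used the simpleRF package in R):

```python
import random

def nested_cv(data, mtry_grid, fit, score, k_outer=5, k_inner=5, seed=0):
    """Skeleton of nested resampling: tune mtry in an inner cross
    validation, then evaluate the tuned model on the held-out outer fold.
    fit(train, mtry) returns a model; score(model, test) returns a value
    to maximize. Both are supplied by the caller."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    folds = [idx[i::k_outer] for i in range(k_outer)]
    outer_scores = []
    for f in range(k_outer):
        test = [data[i] for i in folds[f]]
        train = [data[i] for g in range(k_outer) if g != f for i in folds[g]]
        # inner CV: pick the mtry with the best mean inner score
        best = max(mtry_grid,
                   key=lambda m: _inner_score(train, m, fit, score, k_inner))
        outer_scores.append(score(fit(train, best), test))
    return sum(outer_scores) / len(outer_scores)

def _inner_score(train, mtry, fit, score, k):
    folds = [train[i::k] for i in range(k)]
    scores = []
    for f in range(k):
        val = folds[f]
        tr = [x for g in range(k) if g != f for x in folds[g]]
        scores.append(score(fit(tr, mtry), val))
    return sum(scores) / len(scores)
```

The key point is that the outer folds are never seen during tuning, so the outer performance estimate is not optimistically biased (Bischl et al., 2012).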

To study the effect of censoring in the survival case, we performed an additional simulation study. All simulation settings were unchanged, except that the censoring proportion was varied between 0 (no censoring) and 0.9 (90% censored observations) in steps of 0.1. Further, we conducted two simulations to understand the impact of the nominal predictor methods on the splitting procedure on multiclass data. In the first simulation, we compared the size of trees grown with the Ignore and the Order (once) methods, measured as the number of nodes in a tree. In the second simulation, we investigated the "absent levels problem" (Au, 2017), that is, that after several splits in a tree some categories are not available anymore and with most splitting methods no meaningful split is possible. For this, we calculated the number of available categories for a given tree size.

In all simulations, we measured runtime on a 64-bit Linux platform with two 8-core Intel Xeon E5649 2.53 GHz CPUs. All simulations were performed with the simpleRF package (Wright, 2018).

Real data analysis

To compare prediction performance and computational efficiency of the approaches, we analyzed six real datasets. The datasets are summarized in Table 2. The datasets Tic-tac-toe, Splice, MPG, and Servo are standard benchmark datasets from the UCI machine learning repository (Lichman, 2013). These datasets were obtained from OpenML (Vanschoren et al., 2014).

Table 2

Real datasets analyzed in the comparison of prediction functioning and runtime.

Name | Type | Obs. | Vars. | Cat.
Tic-tac-toe | Binary classification | 958 | 9 | 3
Splice | Multiclass classification (3 classes) | 3,190 | 60 | 4–6
MPG | Multiclass classification (7 classes) | 234 | 10 | 3–38
Servo | Regression | 167 | 4 | 4–5
RA SNPs | Binary classification | 500 | 501 | 3–4
AIDS | Survival | 478 | 4 | 2–8

The Tic-tac-toe dataset contains all possible board configurations at the end of a tic-tac-toe game. The classification task is to predict whether the starting player wins the game. The nine predictor variables each code one position of the board, which can be blank or taken by either player. The objective in the Splice dataset (Noordewier, Towell & Shavlik, 1991) is to recognize and classify DNA sequences containing a boundary between exon and intron or vice versa. The outcome variable has the possible classes ei, ie, or n, corresponding to the boundaries between exon and intron, intron and exon, or neither. Each predictor variable represents a position in a 60-residue DNA sequence and is coded by the four nucleobases. The MPG dataset contains data on cars from the years 1999 to 2008. The objective is to predict the class of a car, based on the other attributes such as manufacturer, transmission type, or fuel consumption. It includes nominal as well as numerical data. The Servo dataset is obtained from a simulation of a servo system (Quinlan, 1992). The outcome is the time required for the system to respond to a step change (rise time), measured as a continuous variable. The four predictor variables are two gain settings and two choices of mechanical linkage. In the RA SNPs dataset, the task is to predict rheumatoid arthritis affection status based on genetic data. The analyzed dataset is part of that used in the Genetic Analysis Workshop 16 (Amos et al., 2009). It includes 500 randomly selected observations (250 cases and 250 controls) with genotypes of 500 randomly selected SNPs from the HLA region on chromosome 6, which is known to be associated with rheumatoid arthritis (Amos et al., 2009). The AIDS dataset (Venables & Ripley, 2002) contains survival information from patients diagnosed with AIDS in Australia before July 1, 1991.
Predictors are the patient's sex and age (in years), the state of origin (grouped into four categories) and the transmission category (eight categories). To balance the categories and reduce computational time, we randomly sampled 100 observations from the transmission category hs (male homosexual or bisexual contact) and included all observations from the other transmission categories, resulting in 478 observations.

We analyzed all datasets with the five methods to handle nominal predictors. We used a 10/5-fold nested cross validation (repeated 50 times) for performance evaluation. In the inner 5-fold cross validation, the mtry value was tuned (10 values evenly distributed between 1 and the number of predictors, or all possible values if there were fewer than 10 predictors). In the outer 10-fold cross validation, the prediction performance was evaluated with the MSE for regression, with the proportion of misclassifications for classification and with the integrated Brier score for survival. The number of trees was set to 50. All other parameters were set to the default values. On each dataset, the alternative methods were statistically compared to the Partition method with a two-sided corrected paired t-test for repeated cross validation (Nadeau & Bengio, 2003). Differences with two-sided p < 0.05 were considered statistically significant. As in the simulation study, runtime was measured with the simpleRF package (Wright, 2018) on a 64-bit Linux platform with two 8-core Intel Xeon E5649 2.53 GHz CPUs.

Results

Simulation studies

The simulation results for a regression outcome and a sample size of n = 100 are shown in the first row of Fig. 2, the respective runtimes in Table 3 and Fig. S3 and the results of statistical testing in Table S1. On the Digit data, Order (once) performed best, followed by Order (split) and Partition. The Dummy and Ignore methods performed worse, with a larger distance. In terms of runtime, Order (once) and Ignore were fastest, followed by Order (split) and Dummy, while the Partition method was slowest. On the SNP data, the differences in prediction performance were smaller and Dummy performed as well as Order (once). Partition and Dummy were faster on the SNP data, as expected because of the lower number of categories. Here, Order (split) was the slowest method. On the Group data, Order (once), Order (split) and Partition performed almost equally, Dummy and Ignore slightly worse. The runtime ranking was similar to the Digit data, except that Partition was faster.

Figure 2 (peerj-07-6339-g002.jpg): Simulation results for n = 100.

The vertical panels correspond to the predictor type (Digit, SNP, or Group), the horizontal panels to the outcome type (regression, binary classification, multiclass classification, or survival). Each boxplot represents the prediction error for one method to handle nominal predictor variables, applied to one simulation scenario (A–L). The prediction error is measured by the mean squared error for regression, by the proportion of misclassifications for classification and by the integrated Brier score for survival.

Table 3

Median runtimes of simulations (in seconds).

Predictor type | Outcome type | Partition | Order (split) | Order (once) | Dummy | Ignore
Digit | Regression | 15.89 | 6.22 | 4.69 | 6.60 | 4.93
Digit | Binary class. | 18.23 | 4.27 | 3.80 | 5.92 | 5.62
Digit | Multiclass class. | 28.47 | 11.09 | 5.77 | 13.19 | 9.84
Digit | Survival | 660.43 | 74.85 | 61.88 | 9.24 | 51.45
SNP | Regression | 4.54 | 6.01 | 4.72 | 4.13 | 4.48
SNP | Binary class. | 3.33 | 4.36 | 3.55 | 2.89 | 3.44
SNP | Multiclass class. | 4.27 | 7.14 | 4.03 | 3.98 | 5.64
SNP | Survival | 9.52 | 9.47 | 9.09 | 9.13 | 8.78
Group | Regression | 7.28 | 4.62 | 3.72 | 7.72 | 3.81
Group | Binary class. | 10.69 | 4.71 | 5.23 | 7.95 | 6.30
Group | Multiclass class. | 14.77 | 13.55 | 9.00 | 14.13 | 9.47
Group | Survival | 188.61 | 27.14 | 24.86 | 9.14 | 24.71

The second row of Fig. 2 presents the simulation results for a binary classification outcome. On the Digit data, we observed a similar pattern of prediction accuracy as in the regression case. The Partition method was slowest to compute. Again, Order (once) was faster than Ignore. On the SNP data, all methods performed about equally. The differences in runtime were also minor, with Dummy being the fastest method. On the Group data, all methods except Ignore performed about equally. The runtimes were comparable to the Digit data.

The simulation results for a multiclass classification outcome are shown in the third row of Fig. 2. On the Digit data, the ranks of the methods were the same as in the regression case. However, the performance of the methods differed substantially: the Ignore method misclassified a median of 83% of the observations, the Order (once) method only 3%. The runtimes and their differences increased, compared to regression and binary classification, but their order was the same. On the Group data, Partition, Order (split) and Order (once) performed about equally, Dummy and Ignore worse. The Order (once) method was considerably faster than the other methods.

The last row of Fig. 2 shows the results for a survival outcome. Here, the Dummy method performed best on the Digit and Group data and as well as the other methods on the SNP data. The other methods performed about equally. The runtimes and their differences increased considerably, and the Dummy method was the fastest method on the Digit and Group data.

As expected, all results of Order (split) were not significantly different from those of Partition, except for the Digit data with a survival outcome. Order (once) performed significantly differently from Partition on the Digit data with all outcome types, on the SNP data with continuous and multiclass outcomes, and on the Group data with a survival outcome. The results of Dummy were not significantly different on a binary classification outcome with SNP and Group data, but significantly different in all other settings. Ignore performed statistically differently from Partition in all cases.

The results of the simulations with n ∈ {50, 200} are shown in Figs. S1–S2 and Tables S2–S3. Compared to the results with n = 100, the ranks of prediction performance remained the same. The differences increased with n = 200 and decreased with n = 50.

The results of the simulation study with varying censoring proportions are shown in Fig. 3. As expected, the prediction error increased with higher censoring proportions. On the Digit and Group data, the Dummy approach performed best for higher censoring proportions. All other methods performed about equally. No appreciable difference was observed on the SNP data.

[Figure 3: peerj-07-6339-g003.jpg]

Survival simulation results with varying censoring proportion.

The horizontal panels correspond to the predictor type (A, Digit; B, SNP; C, Group). Each line represents the prediction error for one method to handle nominal predictor variables. The prediction error is measured by the integrated Brier score.

The results of the tree size and "absent level" simulations are shown in Figs. 4A and 4B, respectively. The left plot shows that trees grown with the Order (once) method are on average about half the size of trees grown with the Ignore method. This explains why Order (once) is computationally faster than Ignore in all multiclass simulations, despite the computational effort of ordering the categories of every variable. The right plot shows that the number of categories available in nodes drops quickly with tree depth: for a depth of 10 or larger, on average only two or three categories are available. With the Partition and Order (split) methods, no meaningful splits can be found for the absent categories, which reduces the prediction performance, as shown in Fig. 2.

[Figure 4: peerj-07-6339-g004.jpg]

Simulation results of the tree size (A) and "absent level" (B) simulations.

In (A), the number of nodes of 50 trees is shown for an RF grown with the Ignore and Order (once) methods. In (B), the number of available categories in nodes at different tree depths, averaged over 500 trees, is shown.

Real data analysis

Figure 5 and Table 4 show the results of the real data performance comparison. Boxplots of the runtime and results of statistical testing are shown in Fig. S4 and Table S4, respectively. On the Tic-tac-toe dataset, the Partition, Order (split), Order (once), and Dummy methods performed about equally, the Ignore method worse. However, the dataset was easily separable, since all methods achieved low prediction errors. In terms of runtime, Order (once) and Dummy were fastest, followed by Partition and Ignore, while Order (split) was slowest. Comparable results were observed on the Splice data, except for Dummy, which performed slightly worse. The Order (once) method was fastest, followed by Partition, Dummy, and Ignore. Again, Order (split) was slowest. On the MPG data, no results could be obtained for the Partition method because the category limit was reached. The methods Order (split) and Order (once) performed best, followed by Dummy and Ignore. Here, Order (once) was about two times faster than Order (split) and Ignore and three times faster than Dummy. On the Servo data, the methods Partition, Order (split), and Order (once) performed about equally, Dummy and Ignore worse. The runtime differences were small, except for Order (split), which was slowest. A similar pattern was observed on the RA SNPs data: the Partition, Order (split), and Order (once) methods performed about equally, Dummy and Ignore worse. On this dataset, Order (once), Ignore, and Partition were fastest, Dummy and Order (split) slowest. On the AIDS data, all methods performed about equally in terms of prediction accuracy. Here, Dummy was the fastest method and Partition the slowest.
Statistically significant differences from the Partition method after the correction of Nadeau & Bengio (2003) were observed for the Ignore method on the Splice, MPG, and Servo datasets, as well as for the Dummy method on the Servo dataset.
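The correction of Nadeau & Bengio (2003) inflates the naive variance of repeated-resampling error differences to account for the overlap between training sets. A minimal sketch of the corrected resampled t-statistic (an illustrative Python helper; the name `corrected_t` is ours, not from the paper's R scripts):

```python
import math
from statistics import mean, variance

def corrected_t(diffs, n_train, n_test):
    """Corrected resampled t-statistic of Nadeau & Bengio (2003).

    diffs: per-repetition differences in prediction error between two
    methods. The naive variance estimate is too small because resampled
    training sets overlap; the correction multiplies it by
    (1/J + n_test/n_train), where J is the number of repetitions.
    """
    j = len(diffs)
    d_bar = mean(diffs)
    var_d = variance(diffs)  # sample variance of the differences
    se = math.sqrt((1.0 / j + n_test / n_train) * var_d)
    return d_bar / se
```

Because the corrected standard error is larger than the naive one, the corrected statistic is smaller in magnitude, making the test more conservative.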

[Figure 5: peerj-07-6339-g005.jpg]

Real data results.

The panels correspond to the datasets. Each boxplot represents the prediction error of one method to handle nominal predictor variables, applied to one dataset (A–F). The prediction error is measured with a 10/5-fold nested cross-validation (repeated 50 times), by the mean squared error for regression and by the proportion of misclassifications for classification. No results could be obtained for the Partition method on the MPG dataset because the category limit was reached.

Table 4

Median runtimes of real data analyses (in seconds).

Dataset Nominal predictor method
Partition Order (split) Order (once) Dummy Ignore
Tic-tac-toe 19.46 26.82 18.65 19.37 21.60
Splice 105.40 241.44 90.62 107.71 112.48
MPG NA* 8.89 5.14 17.02 10.55
Servo 6.56 9.70 5.38 5.98 5.51
RA SNPs 188.54 270.06 170.26 285.17 185.60
AIDS 67.61 49.64 47.58 16.49 47.08

Discussion

We compared five approaches to handle nominal predictors in RFs. The approach of ordering the categories once before the analysis of the dataset (Order (once)) performed best or on par with the other approaches in all simulations and on all real data examples, except for survival outcomes. The approach of ignoring the unordered nature of these predictors performed worst in all settings. The differences were generally large in regression and multiclass classification and smaller in binary classification and survival. On most datasets, the Order (once) and Dummy methods were computationally fastest. The new methods for multiclass classification and survival improved the prediction accuracy considerably.

Notably, Order (once) performed considerably better than Order (split) and Partition on the simulated Digit data with a multiclass outcome. This might be due to the fact that for this task it is only relevant whether a digit is even or odd, and thus a latent order of the categories exists. This order still holds if the categories are not present in a node in the training data, and meaningful splits remain possible. In this case, the latent ordering cannot be found by considering all two-partitions because the assignment of the empty levels does not change the impurity measure (Au, 2017). The same problem applies to the approach which sorts the levels by the outcome in each split.
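For a continuous or binary outcome, ordering the levels once by their mean outcome can be sketched as follows (an illustrative Python sketch, not the paper's ranger implementation; multiclass and survival outcomes instead use the principal-component and log-rank orderings described in the abstract):

```python
from collections import defaultdict

def order_once(categories, y):
    """Recode a nominal predictor as ordinal by ranking its levels by
    mean outcome, computed once on the full training data. A split on
    the recoded variable then needs only k - 1 candidate cut points
    instead of 2^(k-1) - 1 two-partitions of k categories.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, out in zip(categories, y):
        sums[cat] += out
        counts[cat] += 1
    # sort levels by mean outcome and map each level to its rank
    ranked = sorted(sums, key=lambda c: sums[c] / counts[c])
    rank = {c: i for i, c in enumerate(ranked)}
    return [rank[c] for c in categories]
```

After this recoding, the predictor can be treated as ordinal throughout the entire RF procedure.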

A related problem often faced in real data is new categories in prediction. For example, a person in the prediction data might come from a geographical region which was not present in the training data, or a rare genetic variant might appear for the first time in the prediction data. If the levels are entirely new, most packages raise an error that there are new categories which were not available in the training data. It is intuitively clear that no meaningful splitting decision can be made for observations with new categories at a node splitting on this variable, regardless of the nominal predictor method. However, prediction should still be possible if other variables are available. One simple approach would be to assign all observations with new categories to one node, another would be to assign them randomly. One might argue that the former is the more sensible approach because these observations are kept together and can be split by another variable later on. Nevertheless, Au (2017) showed that random assignment is generally more robust. Finally, if options to handle missing predictor values are available, new categories could be set to missing and handled appropriately.
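The random-assignment rule for unseen categories can be sketched as a small routing helper (the function `route` and its arguments are hypothetical illustrations, not taken from any package's API):

```python
import random

def route(category, left_set, right_set, rng):
    """Route an observation through a categorical split.

    Categories seen during training go to the child they were assigned
    to; categories unseen during training are sent to a random child,
    which Au (2017) found to be generally more robust than always
    sending them to one fixed child.
    """
    if category in left_set:
        return "left"
    if category in right_set:
        return "right"
    return rng.choice(["left", "right"])  # new category: random child
```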

Further research is required on how to handle nominal predictors in random survival forests. We proposed to order by log-rank scores (Hothorn & Lausen, 2003) once or in each split separately and achieved results on par with the partitioning approach. However, in two out of three simulation scenarios this was outperformed by dummy coding. No difference could be observed in the third simulation scenario and in real data.

Our results show that even for SNP data with just three categories, the splitting algorithm for unordered categories plays an important role. The simulation results indicate that dummy coding might be an alternative for SNP data. However, the results from the real data analysis show that the increase in the number of variables due to the dummy coding leads to reduced prediction performance and longer runtimes, and consequently the ordering method is preferred. The real data results also show that ignoring the unordered nature of the predictors harms the prediction accuracy on real data, which presumably consists of SNPs with a variety of genetic models or with no effect on the outcome at all.
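The variable-count growth under dummy coding is easy to see for a SNP with the three genotype levels {0, 1, 2}: each original variable becomes k − 1 = 2 indicators, doubling the number of variables. A minimal sketch (illustrative helper, not from the paper's replication scripts):

```python
def dummy_code(column, levels=None):
    """Dummy code one nominal column, dropping the first level as the
    reference category. Each value becomes a row of k - 1 indicators,
    so p SNP variables with 3 genotypes expand to 2p dummy variables.
    """
    if levels is None:
        levels = sorted(set(column))
    return [[1 if v == lev else 0 for lev in levels[1:]] for v in column]
```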

Conclusions

We have shown that ordering predictor categories a priori is a fast approach to handle nominal predictors in RFs which performs well on different kinds of datasets. In comparison to the standard approach, runtimes can be reduced substantially. Furthermore, in the case of a latent ordering, the prediction performance can be improved. We recommend using this approach as the default when analyzing data with RFs, except for survival outcomes, where we recommend dummy coding.

Acknowledgments

The authors are grateful to Ronja Foraita for valuable discussions.

Funding Statement

This work is based on data that was gathered with the support of grants from the National Institutes of Health (NO1-AR-2-2263 and RO1-AR-44422) and the National Arthritis Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Additional Information and Declarations

Competing Interests

The authors declare that they have no competing interests.

Author Contributions

Marvin N. Wright conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

Inke R. König conceived and designed the experiments, authored or reviewed drafts of the paper, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

R scripts to replicate the simulation and real data study are available in the Supplemental File. Some raw data was provided to us by the National Arthritis Foundation. We received permission to re-use the raw data for the purpose of this project, but not to share the original data.

References

Amos et al. (2009) Amos CI, Chen W, Seldin MF, Remmers EF, Taylor KE, Criswell LA, Lee AT, Plenge RM, Kastner DL, Gregersen PK. Data for genetic analysis workshop 16 problem 1, association analysis of rheumatoid arthritis data. BMC Proceedings. 2009;3(Suppl 7):S2. doi: 10.1186/1753-6561-3-s7-s2. [PMC free article] [PubMed] [CrossRef] [Google Scholar]

Au (2017) Au TC. Random forests, decision trees, and categorical predictors: the "absent levels" problem. Journal of Machine Learning Research. 2017;19(45):1–30. [Google Scholar]

Biau & Scornet (2016) Biau G, Scornet E. A random forest guided tour. Test. 2016;25(2):197–227. doi: 10.1007/s11749-016-0481-7. [CrossRef] [Google Scholar]

Bischl et al. (2012) Bischl B, Mersmann O, Trautmann H, Weihs C. Resampling methods for meta-model validation with recommendations for evolutionary computation. Evolutionary Computation. 2012;20(2):249–275. doi: 10.1162/evco_a_00069. [PubMed] [CrossRef] [Google Scholar]

Boulesteix et al. (2012) Boulesteix A-L, Janitza S, Kruppa J, König IR. Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2012;2(6):493–507. doi: 10.1002/widm.1072. [CrossRef] [Google Scholar]

Breiman (2001) Breiman L. Random forests. Machine Learning. 2001;45(1):5–32. [Google Scholar]

Breiman et al. (1984) Breiman L, Friedman J, Olshen RA, Stone CJ. Classification and Regression Trees. Boca Raton: CRC Press; 1984. [Google Scholar]

Chen et al. (2018) Chen T, He T, Benesty M, Khotilovich V, Tang Y. xgboost: extreme gradient boosting. R package Version 0.6.4.1. https://CRAN.R-project.org/package=xgboost 2018

Coppersmith, Hong & Hosking (1999) Coppersmith D, Hong SJ, Hosking JR. Partitioning nominal attributes in decision trees. Data Mining and Knowledge Discovery. 1999;3(2):197–217. doi: 10.1023/a:1009869804967. [CrossRef] [Google Scholar]

Fisher (1958) Fisher WD. On grouping for maximum homogeneity. Journal of the American Statistical Association. 1958;53(284):789–798. doi: 10.1080/01621459.1958.10501479. [CrossRef] [Google Scholar]

Gnanadesikan (2011) Gnanadesikan R. Methods for statistical data analysis of multivariate observations. Vol. 321. New York: John Wiley & Sons; 2011. [Google Scholar]

Goldstein, Polley & Briggs (2011) Goldstein BA, Polley EC, Briggs FBS. Random forests for genetic association studies. Statistical Applications in Genetics and Molecular Biology. 2011;10(1):32. doi: 10.2202/1544-6115.1691. [PMC free article] [PubMed] [CrossRef] [Google Scholar]

Graf et al. (1999) Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of prognostic classification schemes for survival data. Statistics in Medicine. 1999;18(17–18):2529–2545. doi: 10.1002/(sici)1097-0258(19990915/30)18:17/18<2529::aid-sim274>3.0.co;2-5. [PubMed] [CrossRef] [Google Scholar]

Hengl et al. (2017) Hengl T, De Jesus JM, Heuvelink GBM, Gonzalez MR, Kilibarda M, Blagotić A, Shangguan W, Wright MN, Geng X, Bauer-Marschallinger B, Guevara MA, Vargas R, MacMillan RA, Batjes NH, Leenaars JGB, Ribeiro E, Wheeler I, Mantel S, Kempen B. SoilGrids250m: global gridded soil information based on machine learning. PLOS ONE. 2017;12(2):e0169748. doi: 10.1371/journal.pone.0169748. [PMC free article] [PubMed] [CrossRef] [Google Scholar]

Hothorn & Lausen (2003) Hothorn T, Lausen B. On the exact distribution of maximally selected rank statistics. Computational Statistics & Data Analysis. 2003;43(2):121–137. doi: 10.1016/s0167-9473(02)00225-6. [CrossRef] [Google Scholar]

Ishwaran (2015) Ishwaran H. The effect of splitting on random forests. Machine Learning. 2015;99(1):75–118. doi: 10.1007/s10994-014-5451-2. [PMC free article] [PubMed] [CrossRef] [Google Scholar]

Ishwaran et al. (2008) Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Annals of Applied Statistics. 2008;2(3):841–860. [Google Scholar]

Lichman (2013) Lichman M. University of California, Irvine, School of Information and Computer Sciences; 2013. UCI machine learning repository. [Google Scholar]

Loh & Vanichsetakul (1988) Loh W-Y, Vanichsetakul N. Tree-structured classification via generalized discriminant analysis. Journal of the American Statistical Association. 1988;83(403):715–725. doi: 10.1080/01621459.1988.10478652. [CrossRef] [Google Scholar]

Mehta, Agrawal & Rissanen (1996) Mehta M, Agrawal R, Rissanen J. SLIQ: a fast scalable classifier for data mining. International Conference on Extending Database Technology; Avignon: Springer; 1996. pp. 18–32. [Google Scholar]

Nádas et al. (1991) Nádas A, Nahamoo D, Picheny MA, Powell J. An iterative "flip-flop" approximation of the most informative split in the construction of decision trees. International Conference on Acoustics, Speech, and Signal Processing; Toronto: IEEE; 1991. pp. 565–568. [Google Scholar]

Nadeau & Bengio (2003) Nadeau C, Bengio Y. Inference for the generalization error. Machine Learning. 2003;52(3):239–281. [Google Scholar]

Noordewier, Towell & Shavlik (1991) Noordewier MO, Towell GG, Shavlik JW. Training knowledge-based neural networks to recognize genes in DNA sequences. Advances in Neural Information Processing Systems; Denver, CO, USA. 1991. pp. 530–536. [Google Scholar]

Pedregosa et al. (2011) Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: machine learning in Python. Journal of Machine Learning Research. 2011;12:2825–2830. [Google Scholar]

Probst, Bischl & Boulesteix (2018a) Probst P, Bischl B, Boulesteix A-L. Tunability: importance of hyperparameters of machine learning algorithms. 2018a. arXiv:1802.09596

Probst, Wright & Boulesteix (2018b) Probst P, Wright M, Boulesteix A-L. Hyperparameters and tuning strategies for random forest. 2018b. arXiv:1804.03515

Quinlan (1992) Quinlan JR. Learning with continuous classes. 5th Australian Joint Conference on Artificial Intelligence; Singapore. 1992. pp. 343–348. [Google Scholar]

Ripley (1996) Ripley BD. Pattern recognition and neural networks. Cambridge: Cambridge University Press; 1996. [Google Scholar]

Schmid, Wright & Ziegler (2016) Schmid M, Wright MN, Ziegler A. On the use of Harrell's C for clinical risk prediction via random survival forests. Expert Systems with Applications. 2016;63:450–459. doi: 10.1016/j.eswa.2016.07.018. [CrossRef] [Google Scholar]

Schratz et al. (2018) Schratz P, Muenchow J, Iturritxa E, Richter J, Brenning A. Performance evaluation and hyperparameter tuning of statistical and machine-learning models using spatial data. 2018. arXiv:1803.11266

Vanschoren et al. (2014) Vanschoren J, Van Rijn JN, Bischl B, Torgo L. OpenML: networked science in machine learning. SIGKDD Explorations. 2014;15(2):49–60. [Google Scholar]

Varian (2014) Varian HR. Big data: new tricks for econometrics. Journal of Economic Perspectives. 2014;28(2):3–28. doi: 10.1257/jep.28.2.3. [CrossRef] [Google Scholar]

Venables & Ripley (2002) Venables WN, Ripley BD. Modern applied statistics with S. Fourth Edition. New York: Springer; 2002. [Google Scholar]

Wright & Ziegler (2017) Wright MN, Ziegler A. ranger: a fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software. 2017;77(1):1–17. doi: 10.18637/jss.v077.i01. [CrossRef] [Google Scholar]

Ziegler & König (2014) Ziegler A, König IR. Mining data with random forests: current options for real-world applications. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2014;4(1):55–63. doi: 10.1002/widm.1114. [CrossRef] [Google Scholar]

Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6368971/
