Predicting Halo Occupation And Galaxy Assembly Bias With Machine Learning | Monthly Notices Of The Royal Astronomical Society

Xiaoju Xu and Saurabh Kumar are authors contributed equally to the work and share co-first authorship.

Xiaoju Xu, Saurabh Kumar, Idit Zehavi, Sergio Contreras, Predicting halo occupation and galaxy assembly bias with machine learning, Monthly Notices of the Royal Astronomical Society, Volume 507, Issue 4, November 2021, Pages 4879–4899, https://doi.org/10.1093/mnras/stab2464

Understanding the impact of halo properties beyond halo mass on the clustering of galaxies (namely galaxy assembly bias) remains a challenge for contemporary models of galaxy clustering. We explore the use of machine learning to predict the halo occupations and recover galaxy clustering and assembly bias in a semi-analytic galaxy formation model. For stellar mass selected samples, we train a random forest algorithm on the number of central and satellite galaxies in each dark matter halo. With the predicted occupations, we create mock galaxy catalogues and measure the clustering and assembly bias. Using a range of halo and environment properties, we find that the machine learning predictions of the occupancy variations with secondary properties, galaxy clustering, and assembly bias are all in excellent agreement with those of our target galaxy formation model. Internal halo properties are most important for the central galaxies prediction, while environment plays a critical role for the satellites. Our machine learning models are all provided in a usable format. We demonstrate that machine learning is a powerful tool for modelling the galaxy–halo connection, and can be used to create realistic mock galaxy catalogues which accurately recover the expected occupancy variations, galaxy clustering, and galaxy assembly bias, imperative for cosmological analyses of upcoming surveys.

The advent of large galaxy surveys has transformed the study of large-scale structure, allowing high-precision measurements of galaxy clustering statistics. Imaging and spectroscopic surveys such as the Sloan Digital Sky Survey (SDSS; York et al. 2000), the Dark Energy Survey (DES; Abbott et al. 2016), the Dark Energy Spectroscopic Instrument (DESI; DESI Collaboration 2016), and the upcoming Legacy Survey of Space and Time (LSST; LSST Collaboration 2009; Ivezić et al. 2019) provide extraordinary opportunities to utilize such clustering measurements to study both galaxy formation and cosmology. However, it is difficult to model these directly since they depend on complex baryonic processes that are not fully understood. In the standard framework of ΛCDM cosmology, galaxies form and evolve in dark matter haloes (White & Rees 1978), and therefore galaxy clustering can be modelled through halo clustering and galaxy–halo connection.

The formation and evolution of the dark matter haloes are dominated by gravity and their abundance and clustering can be well predicted by analytic models (Press & Schechter 1974; Bond et al. 1991; Mo & White 1996; Sheth & Tormen 1999; Paranjape, Sheth & Desjacques 2013) and by using high-resolution cosmological numerical simulations (Springel et al. 2005; Prada et al. 2012; Villaescusa-Navarro et al. 2019; Wang et al. 2020). Numerical N-body simulations track the evolution of dark matter particles under the influence of gravity and are able to accurately reproduce non-linear clustering on small scales. Haloes or subhaloes can be identified (Springel et al. 2001b; Behroozi, Wechsler & Wu 2013) and merger tree can then be constructed by linking the haloes or subhaloes to their progenitors and descendants at each snapshot in the simulation.

A useful approach for incorporating the predictions of galaxy formation physics is with semi-analytic modelling (SAM), in which the simulated dark matter haloes are populated with galaxies and evolved according to specified prescriptions for gas cooling, galaxy formation, feedback processes, and merging (De Lucia & Blaizot 2007; Guo et al. 2011, 2013; Croton et al. 2016; Stevens, Croton & Mutch 2016; Cora et al. 2018). Such models have been successful in reproducing several measured properties of galaxy populations and have become a popular method to explore the galaxy–halo connection. An alternative approach to model galaxy formation is provided by cosmological hydrodynamic simulations (Schaye et al. 2015; Nelson et al. 2019), which simulate both the dark matter particles and the stellar and gas components. The baryonic processes are tracked by a combination of fluid equations and subgrid prescriptions. Cosmological hydrodynamical simulations are starting to play a major role in studying galaxy formation, but are computationally expensive for the large volumes involved.

Empirical models such as halo occupation distribution (HOD) modelling (Berlind & Weinberg 2002; Cooray & Sheth 2002; Zheng et al. 2005; Zehavi et al. 2005, 2011) and subhalo abundance matching (SHAM; Conroy, Wechsler & Kravtsov 2006; Behroozi, Conroy & Wechsler 2010; Reddick et al. 2013; Chaves-Montero et al. 2016; Guo et al. 2016; Contreras, Angulo & Zennaro 2020) are also used to model galaxy clustering by characterizing the relation between galaxies and their host haloes. In the HOD approach, one fits or utilizes a model for the halo occupation function, the average number of central and satellite galaxies in the host halo as a function of the halo mass. In contrast, the SHAM methodology connects galaxies to dark matter (sub)haloes using a monotonic relation between the galaxy’s luminosity (or stellar mass) and the subhalo mass (or maximum circular velocity). Compared to SAM and hydrodynamic simulations, HOD and SHAM are practical and faster ways to generate realistic galaxy mock catalogues, increasingly important for the planning and analysis of galaxy surveys.

In the standard HOD or SHAM approaches, the galaxy content only depends on the halo or subhalo mass (or related mass indicators). However, halo clustering has been shown to depend on secondary halo properties or more generally on the assembly history or large-scale environment of the haloes, a phenomenon termed (halo) assembly bias (Sheth & Tormen 2004; Gao, Springel & White 2005; Wechsler et al. 2006; Gao & White 2007; Paranjape, Hahn & Sheth 2018; Ramakrishnan et al. 2019). The dependences on these secondary parameters manifest themselves in different ways and are not trivially described (Mao, Zentner & Wechsler 2018; Salcedo et al. 2018; Xu & Zheng 2018; Han et al. 2019). Halo assembly bias might impact large-scale galaxy clustering as well, if the formation of galaxy is correlated to that of the host halo, an effect commonly referred to as galaxy assembly bias (GAB hereafter; e.g. Croton, Gao & White 2007; Zu et al. 2008; Chaves-Montero et al. 2016; Contreras et al. 2019; Xu & Zheng 2020; Xu, Zehavi & Contreras 2021). In such a case, the halo occupation by galaxies will no longer depend solely on halo mass, but will vary with these secondary halo and environmental properties. These expected occupancy variations (hereafter OVs) have recently been studied in SAM and hydrodynamical simulations (Artale et al. 2018; Zehavi et al. 2018, 2019; Bose et al. 2019; Xu et al. 2021).

If the GAB is significant in the real Universe, neglecting it would have direct implications for interpreting galaxy clustering and the inferred galaxy–halo connection and cosmological constraints (Zentner, Hearin & van den Bosch 2014; McEwen & Weinberg 2018; Lange et al. 2019; McCarthy, Zheng & Guo 2019). Some extensions to include environment or other halo properties have been suggested (e.g. Hearin et al. 2016; McEwen & Weinberg 2018; Contreras, Angulo & Zennaro 2021; Xu et al. 2021). However, given the complexities involved, it is very hard to develop a scheme which will simultaneously incorporate the OV of all relevant halo properties. Moreover, as demonstrated in Xu et al. ( 2021), each halo property on its own contributes only a small fraction of the GAB signal such that a mix of multiple properties will likely be required. This makes first principles predictions for assembly bias challenging. Alternative approaches to predict galaxy properties based on halo assembly history have been proposed (Moster, Naab & White 2018; Behroozi et al. 2019); however, the full galaxy–halo connection could be high-dimensional and non-linear, which is difficult to capture by these models.

Machine learning (ML) provides a potentially powerful approach to study the galaxy–halo connection, inferring intricate relations from the complex multidimensional data in order to accurately connect the galaxies to the dark matter haloes. In recent years, ML techniques have become a versatile tool with a range of applications in large-scale structure and cosmology (Lucie-Smith et al. 2018; Aragon-Calvo 2019; Berger & Stein 2019; Arjona & Nesseris 2020; de Oliveira et al. 2020; Ntampaka et al. 2020). It is also helpful for processing observational data and performing classification (De La Calleja & Fuentes 2004; Sánchez et al. 2014; Tanaka et al. 2018; Cheng et al. 2020; Wu & Peek 2020; Mucesh et al. 2021; Zhou et al. 2021). In the context of halo modelling, ML can be implemented to predict galaxy properties based on input halo information (Xu et al. 2013; Kamdar, Turk & Brunner 2016a, b; Agarwal, Davé & Bassett 2018; Wadekar et al. 2020; Lovell et al. 2021; Moews et al. 2021), and also applied in the reverse sense, predicting halo properties based on galaxy information (Armitage, Kay & Barnes 2019; Calderon & Berlind 2019). More specifically, Xu et al. ( 2013) make a first attempt to predict the number of galaxies given the halo’s properties that can be utilized to create mock catalogues, matching the large-scale correlation function to |$5{{\ \rm per\ cent}}\!-\!10{{\ \rm per\ cent}}$|⁠ . Agarwal et al. ( 2018) predict central galaxy properties based on halo properties and environment and find that the average relations of these properties with halo mass are accurately recovered. In Kamdar et al. ( 2016a, b), several galaxy properties such as gas mass, stellar mass, star formation rate, and colour are predicted based on subhalo information. Recently, Lovell et al. ( 2021) also present a study reproducing several galaxy properties based on subhalo properties in the EAGLE set of hydrodynamic simulations (Schaye et al. 2015).

In this paper, we aim to train an ML model to learn the relation between halo properties and the occupation numbers of galaxies from a galaxy formation simulation. This invariably includes the complex set of effects related to GAB (such as the preferential occupation of galaxies in early-formed haloes as one example). We utilize here random forest (RF) classification and regression, one of the most effective ML models for predictive analytics (Breiman 2001). RF is an ensemble supervised learning method that works by combining decisions from a sequence of base models (decision trees). We use for this purpose stellar mass selected galaxy samples from the Guo et al. ( 2011) SAM applied to the Millennium Run Simulation (Springel et al. 2005). The input is the halo catalogue including an exhaustive set of halo properties and environment measures and the output will be the occupation numbers of central and satellite galaxies. The RF model will then be used to create mock galaxy catalogues and compared to the true levels of galaxy clustering and large-scale GAB.

We begin with an RF model that uses all internal and environmental halo properties as input and find an excellent agreement between the predicted HOD, galaxy clustering, and GAB and those measured in the SAM. The RF also provides feature importance, which enables us to select the top properties for predicting occupations. Interestingly, the environment properties are found to be important for the satellites occupation, but not for central one. We find that using only the top four input features can still recover the full level of GAB. We perform additional tests, where we build RF models based on only mass and environment, and alternatively, using the internal halo properties alone.

This methodology can be applied to other galaxy formation models as well, and serve as the basis for an efficient way to populate galaxies in dark matter only simulations, capturing the pertinent information of the galaxy–halo relation and recovering the right level of galaxy clustering including the detailed effects of assembly bias. Additionally, evaluating the relative feature importance can provide valuable insight regarding the contributors to assembly bias and the importance of halo and environmental properties to galaxy formation and evolution. Compared to other related ML works which predict the stellar mass of central galaxies (e.g. Xu et al. 2013; Wadekar et al. 2020; C. Cuesta, in preparation), our work utilizes the occupation numbers, more directly probing assembly bias, and allows to naturally incorporate both central and satellite galaxies. In contrast to Xu et al. ( 2021) which evaluated the individual contributions to GAB and produced mock catalogues that recover the full level of GAB and OV with respect to specific environment measures, here we use the full ensemble of properties and are able to reproduce the OV with multiple properties simultaneously. This latter property allows for more realistic and complete mock catalogues, which may be important for certain cosmological applications.

The paper is organized as follows. In Section 2, we briefly describe the N-body simulation, the halo and environmental properties, and the SAM galaxy formation model. Section 3 provides an introduction to the RF algorithm and the performance measures used to evaluate our models. In Section 4, we present our results for the halo occupation, galaxy clustering, and GAB with different combinations of halo and environmental properties. We conclude in Section 5. Appendix A provides technical details about the RF decision trees while Appendix B and Appendix C present further results of our analysis.

We use in this work the dark matter halo sample from the Millennium N-body simulation (Springel et al. 2005). The simulation was run using the gadget-2 code (Springel, Yoshida & White 2001aa), and adopts the first-year WMAP ΛCDM cosmology (Spergel et al. 2003), corresponding to the following cosmological parameters: Ωm = 0.25, Ωb = 0.045, h = 0.73, σ8 = 0.9, and ns = 1. The simulation is in a periodic box with a length of 500 |$h^{-1}\, {\rm Mpc}$| on a side, with 21603 total number of dark matter particles of mass |$8.6\times 10^8 \, h^{-1}\, {\rm M}_\odot$|⁠ . The simulation outputs 64 snapshots spanning |$z$| = 127 to |$z$| = 0. At each redshift, the distinct haloes are identified by a friends-of-friends (FoF) group finding algorithm (Davis et al. 1985), and the subhaloes are identified by the subfind algorithm (Springel et al. 2001b). Finally, a halo merger tree is constructed by linking each subhalo to its progenitor and descendant (Springel et al. 2005).

We utilize a set of internal halo properties as well as environmental measures similar to those used in Xu et al. ( 2021), as the input features for the RF models. These halo properties characterize halo structure and assembly history, and the environmental ones measure the density and tidal field at the position of the halo. We list and define all properties used in Table 1. The halo properties are separated into two categories. The first one is properties that can be obtained from the information from a single snapshot, here the one corresponding to |$z$| = 0, such as Mvir, Vmax, halo concentration c defined as Vmax/Vvir, and specific angular momentum j. The second category of halo properties pertains to the assembly history of the haloes and can be calculated from the merger tree. These include Vpeak, a0.5, a0.8, avpeak, the mass accretion rate |$\dot{M}$|⁠ , |$\dot{M} /M$|⁠ , |$z$| first, |$z$| last, and Nmerge. The environmental properties we use are the mass densities on different smoothing scales, δ1.25, δ2.5, δ5, δ10, and the tidal anisotropy α1,5 (Xu et al. 2021).

Halo properties and environmental measures used as input features for the RF models. The top part correspond to properties obtained directly from the |$z$| = 0 snapshot in the Millennium data base. The middle part are properties computed using the merger tree of the simulation, and the bottom part corresponds to the environmental properties.

We use the galaxy sample corresponding to the Guo et al. ( 2011) galaxy formation SAM implemented on the Millennium simulation. It models the main physical processes involved in galaxy formation in a cosmological context. These processes include reionization, gas cooling, star formation, angular momentum evolution, black hole growth, galaxy merger and disruption, and AGN and supernova feedback. The Guo et al. ( 2011) is a version of L-galaxies, the SAM code of the Munich group (De Lucia, Kauffmann & White 2004; Croton et al. 2006; Guo et al. 2013; Henriques et al. 2015, 2020), and uses the subhalo merger tree of the simulation to trace and evolve the galaxies through cosmic time. The prescription parameters in the model are tuned to luminosity, colour, abundance, and clustering of observed galaxies. The Guo et al. ( 2011) SAM model is widely used in literature (e.g. Wang et al. 2013; Lu, Yang & Shen 2015; Lin et al. 2016; Zehavi et al. 2018; Xu et al. 2021), and it is publicly available at the Millennium data base. 1

When constructing our galaxy samples, we first apply a halo mass cut of |$10^{10.7} \, h^{-1}\, {\rm M}_\odot$|⁠ , below which the number of dark matter particles is too low to reliably host galaxies. We define stellar mass selected samples with different number densities. For our main analysis we focus on a sample with a stellar mass threshold of |$1.42\times 10^{10} \, h^{-1}\, {\rm M}_\odot$|⁠ , corresponding to a number density of |$n=0.01 \, h^{3}\, {\rm Mpc}^{-3}$|⁠ . This sample includes a total of 745 027 central galaxies and 505 784 satellite galaxies. For some of our analysis, we use two additional samples with stellar mass thresholds of |$3.88\times 10^{10} \, h^{-1}\, {\rm M}_\odot$| and |$0.185\times 10^{10} \, h^{-1}\, {\rm M}_\odot$|⁠ , corresponding to |$n=0.00316 \, h^{3}\, {\rm Mpc}^{-3}$| and |$n=0.0316 \, h^{3}\, {\rm Mpc}^{-3}$|⁠ , respectively. These three samples are approximately evenly spaced in logarithmic number density and follow the choices made in Zehavi et al. ( 2018) and Xu et al. ( 2021). While the results presented in this paper are limited to the Guo et al. ( 2011) SAM at z = 0, the developed methodology can be applied to any SAM sample and redshift.

We first briefly discuss the choice of the ML model. Linear regression and classification models are the simplest ML models to learn the relation between the input features and the output. However, linear models are limited since even the simplest non-linear transformation (e.g. a polynomial) can lead to a large increase in the number of features, thereby slowing down the learning process. Support vector machines (SVM) are powerful ML algorithms which can transform the input features into higher dimensions without explicitly transforming the features (Aizerman, Braverman & Rozonoer 1964; Boser, Guyon & Vapnik 1992). However, they suffer from increased training time complexity with the size of training data. In contrast, ensemble methods such as RF (Breiman 2001) are suitable for our purpose of learning the relation between halo properties and halo occupation because of their ability of dealing with large and high-dimensional data sets.

The RF algorithm combines the output of multiple randomly created decision trees to generate the final output. It uses bootstrap aggregation to create random subsets of the training data with replacement on which the decision trees are trained. The decision tree is a tree-like structure in which each internal node represents a ‘test’ of an attribute, each branch represents the outcome, and each terminal node or leaf represents the output (the decision taken after computing all attributes). A more detailed description of decision trees is included in Appendix A. Combining a large number of decision trees, the prediction of RF is the class that is predicted by the majority of the decision trees in the case of RF classification. For RF regression, the prediction is the average prediction from all decision trees. Thus, for our purpose here, training the RF on a subset of the Millennium halo catalogues and the corresponding SAM galaxy occupations allows to take into account all the halo properties and predict whether a given halo has a central galaxy or not (classification) and the expected number of satellite galaxies (regression).

The main advantage of decision trees is that they perform well with non-linear problems and are computationally cheap since the decision trees can be trained in parallel. One of the major concerns about decision trees is that they can be unstable due to the hierarchical nature of trees: a small change in the training set can result in a difference in the root split which is propagated down to subsequent splits. However, this is mitigated in RF by averaging the predictions over many uncorrelated trees. Decision trees also tend to be strong learners, meaning that individual trees tend to overfit the data. Overfitting is addressed by aggregating the results over many high-variance and low-bias trees. Another important feature of the RF algorithm is that it provides the relative feature importance, i.e. the contribution of each input property in making the predictions which we will examine in Section 4. For a more rigorous discussion of the RF algorithm, we refer the reader to chapters 9 and 15 of Hastie, Tibshirani & Friedman ( 2001) and chapters 6 and 7 of Géron ( 2017).

The RF model includes several ‘hyperparameters’ which characterize the ensemble of decision trees. In this work we focus on three of them, the total number of the trees in RF, the maximum depth of each tree, and the minimum number of samples in the leaf node of the tree. As common in ML analyses, we optimize the performance of the RF algorithm by doing a grid search over these parameters and finding the best-fitting values. The grid search is performed over 80 per cent of the full halo catalogue in the simulation, using the so-called 4-fold cross-validation technique (see e.g. chapter 5 of James et al. 2013). For each choice of hyperparameters, this data is split into four subsets; three are used for training and the remaining one is used for validation and obtaining the ‘performance scores’. This is repeated four times so that each of the four subsets is used for validation, and the performance scores are averaged. This process is repeated for each choice on the hyperparameters grid, resulting in the grid point with the highest score.

For classification, a useful way to evaluate its performance is to look at the confusion matrix. To illustrate this, we show in Fig. 1 the confusion matrix trained using the |$n=0.01 \, h^{3}\, {\rm Mpc}^{-3}$| galaxy sample, using all halo and environmental features. Each row represents the RF predicted class (0 or 1), whereas each column represents the true class in the SAM (0 or 1). In our case, 1 refers to haloes containing a central galaxy and 0 otherwise. Haloes containing central galaxies and predicted as such are referred to as true positives (TP), whereas those predicted as 0 are referred to as false negatives (FN). Haloes without a central galaxy and predicted as such are referred to as true negatives (TN), while those predicted as 1 are false positives (FP). A perfect classifier would have only TN and TP and zero off-diagonal values. The confusion matrix shows the fraction of haloes in each category. We see that, in our case, the fractions of TP and FN are 0.91 and 0.09, respectively, where the predictions are normalized by the total number of haloes containing a central galaxy. The fractions of TN and FN are 0.98 and 0.02, respectively, normalized in this case by the total number of haloes not containing a central galaxy.

Confusion matrix for central galaxy predictions for the |$n=0.01 \, h^{3}\, {\rm Mpc}^{-3}$| galaxy sample, with all the halo internal and environmental properties used as input. The predictions are obtained from the full sample, with the rows corresponding to the ML predicted values and the columns showing the values in the SAM (see the text).

We utilize the Python package sklearn for performing all grid searches and RF training. We use 80 per cent of the full halo catalogue in the Millennium simulation as the training set. For each application, we first set the RF hyperparameters to those that give the highest scores in the grid search. We then proceed to train the RF classification and regression models to predict the number of central and satellite galaxies in each halo. In practice, when estimating the clustering and GAB, we average the predictions of 10 training sets (each containing 80 per cent of the total haloes) drawn randomly out of 90 per cent of the full catalogue. This allows to reduce the sensitivity to the specifics of the training set (though the sets clearly still have a large overlap). The remainder 10 per cent of the haloes are left as an independent test set, not used for either the training or cross-validation.

In this section, we present the results of our RF models. For the main analysis described here, we use the stellar mass selected |$n=0.01 \, h^{3}\, {\rm Mpc}^{-3}$| sample as mentioned in Section 2.2. The direct predictions output of the ML model are the numbers of central and satellite galaxies in each halo. We comprehensively compare them with the ‘true’ distribution of the SAM galaxy sample in multiple ways. We first directly compare the galaxy numbers on a halo-by-halo basis. We then compare the halo occupation functions, namely the average number of galaxies as a function of halo mass, as well as the variations in these halo occupation functions with secondary properties (referred here as the OV; e.g. Zehavi et al. 2018). We then proceed to populate the halo sample with the predicted number of galaxies to create a mock galaxy catalogue based on the ML predictions. We calculate the clustering of the ML galaxy sample and compare to that of the SAM sample. Finally, we examine and compare the impact of GAB on the large-scale clustering signal. We describe all these in detail below. We show the results using the full halo catalogue of the Millennium simulation, which includes the training sets used to build the ML model and the smaller (10 per cent of the haloes) test sample. We have repeated our main analysis using only the test sample, finding similar results to the ones shown here.

Here we present the ML results when using all available features, namely all the internal halo properties and environmental measures specified in Table 1. The accuracy of the ML predictions for hosting a central galaxy with stellar mass larger than our sample’s threshold in the individual haloes has already been presented in Fig. 1. Again, we find that for haloes which host a central galaxy above the stellar mass threshold in the SAM, 91 per cent of them are predicted to host a central galaxy by our ML model. For haloes that do not host a central galaxy, 98 per cent of them are accurately predicted as such in our model. The difference in the relative values likely simply reflects the larger number of haloes with no central galaxy for this stellar mass threshold such that the number of misclassified haloes is roughly comparable. Note that we do not expect the ML algorithm to provide an accurate prediction for every single halo due to the stochasticity involved, for example in the scatter between stellar mass and halo mass (and such a case would indicate extreme overfitting in the least). We view this agreement as very good.

The ‘raw’ predicted numbers of satellite galaxies from the RF regression model are not required to have an integer value a priori. We assign it to the nearest integers following a Bernoulli distribution with this mean. In practice, this amounts to assigning, e.g. 4.3 satellites to 3 with a 70 per cent probability or to 4 with 30 per cent probability. The relation between these discrete (integer) predictions for the number of satellites and the SAM number of satellites in each halo is presented in Fig. 2. Each point represents the satellite occupation in a single halo, showing the scatter of the RF predictions along the y-axis. The darker blue shaded area denotes the standard deviation of the predictions at fixed SAM values. For comparison, the grey shaded area indicates the standard deviation of a simple Poisson distribution, as is often assumed in HOD modelling (the shaded area appears to increase at low numbers, just due to the log scale plotted). We find that the RF error is slightly smaller than the Poisson error. This may not indicate overfitting, since the RF performance score of the test sample is comparable to that of the training sample. Instead, it likely reflects the accuracy of the RF predictions due to the inclusion of several halo properties in addition to halo mass. Clearly, significant scatter still remains; a perfect prediction reproducing the correct number of galaxies for each halo will have no scatter on this plot (shown as the diagonal grey line). Though not shown here, for clarity, we also perform a linear fit of the points to examine any bias in the predictions. For a fully unbiased prediction, the slope of the linear fit would be one. However, we find a slope of 0.96, which indicates a slight underprediction. This is likely caused by the lower ML prediction relative to the SAM at the largest occupation numbers (high halo mass). This underprediction is also found in Xu et al. ( 2013) and is considered a result of the small number of the most massive haloes in the simulation. Since the level of the underprediction is low, it should not impact the results in this paper.

Comparison between the RF predicted number of satellite in each halo and the actual number from the SAM. The blue dots show these values for each individual halo, for the ML model applied to the |$n=0.01 \, h^{3}\, {\rm Mpc}^{-3}$| galaxy sample, using all halo features. The darker blue shaded region shows the standard deviation of ML predictions for fixed number of SAM galaxies. The diagonal grey line indicates the idealized case where the numbers are identical, and the grey shaded region indicates the standard deviation of a Poisson distribution often assumed in HOD models.

Moving away from the comparisons on an individual halo basis, we now shift to comparing the central and satellite galaxy numbers averaged in mass bins, namely the halo occupation functions commonly used in the HOD framework. The top panel of Fig. 3 compares the halo occupation function corresponding to the ML predictions (blue) with that of the SAM (black) for the |$n=0.01 \, h^{3}\, {\rm Mpc}^{-3}$| galaxy sample. We find that the predictions are in excellent agreement with the halo occupation of the SAM galaxies, as can be seen from the indistinguishable lines.

Top: the halo occupation function for the SAM |$n=0.01 \, h^{3}\, {\rm Mpc}^{-3}$| sample (black) and ML prediction (blue) using all the halo and environmental properties. The individual contributions from central and satellite galaxies are shown as dotted and dashed lines, respectively. Bottom: the galaxy two-point autocorrelation function of the ML prediction (blue) compared to the SAM (black). The small difference on small scales is due to the galaxy profile in the SAM slightly deviating from the NFW profile assumed for the ML prediction.

With the predicted number of central and satellite galaxies in each halo, we populate the haloes and create a mock galaxy catalogue to measure the clustering. For each halo, we place the central galaxy at the halo center and populate satellites with an NFW profile (Navarro, Frenk & White 1996), going out to twice the virial radius. The bottom panel of Fig. 3 shows the resulting two-point autocorrelation function relative to that measured from the SAM. Again, we find excellent agreement between the ML predictions and the SAM. On small scales, the prediction deviates from the SAM since an NFW profile is adopted in the mock catalogue, which is slightly different from the radial distribution of the SAM satellites (e.g. Jiménez et al. 2019). Since we are focused here on modelling GAB, we will only show our predicted clustering results on large scales (larger than |$\sim 7 \, h^{-1}\, {\rm Mpc}$|⁠ ) from here on.

The OVs in the predicted halo occupation functions when using all halo and environmental properties as input features. Each panel corresponds to a different secondary property, c, a0.5, δ1.25, and α0.3,1.25, as labelled. In all panels, red and blue lines represent the SAM occupations in the 10 per cent of haloes with the highest and lowest values, respectively, of the secondary properties in fixed mass bins. Pink and cyan lines show the corresponding cases for the ML predictions. The numbers of centrals, satellites, and all galaxies are shown by dotted, dashed, and solid lines, respectively.

The OVs shown in Fig. 4 generally follow the trends already examined in detail in previous works (Zehavi et al. 2018; Contreras et al. 2019; Xu et al. 2021). For example older haloes (higher formation time, smaller a0.5 values) tend to start occupying central galaxies at lower halo masses. In contrast, such haloes host, on average, less satellites than later forming haloes. The striking result in this work is the excellent agreement between the ML predictions and the SAM ones for all secondary properties. That implies that the RF algorithm is able to accurately learn and reproduce the different secondary trends. Note that while α1,5 is one of the input features, α0.3,1.25 is not, and while they may be correlated to some extent, they play different roles in GAB. Xu et al. ( 2021) show that α1,5 accounts for a small fraction of GAB, whereas α0.3,1.25 captures the full effect on galaxy clustering. The tidal anisotropy parameter α0.3,1.25 is also partially correlated with δ1.25, but include additional information on the tidal shear. So it is interesting to see that the OV dependence on α0.3,1.25 can be well reproduced by the ML algorithm without serving as input for it. More generally, since GAB is a result of halo assembly bias combined with the OV, and the individual OVs are accurately reproduced, we expect that the GAB signal can be well recovered as well.

The GAB signature is usually measured as the ratio between the correlation function of the galaxy sample and that of a shuffled sample, created by randomly reassigning the galaxies among haloes of the same mass (Croton et al. 2007). The shuffling process effectively removes the connection of the galaxies to the assembly history of the haloes and eliminates the dependence on any secondary property other than halo mass (i.e. it erases all OVs). Comparison between the clustering of the shuffled sample and the original thus reveals the overall effect of GAB, typically seen as an increased clustering amplitude on large scales. Following standard practice (Croton et al. 2007; Zehavi et al. 2018; Contreras et al. 2019; Xu et al. 2021), we shuffle the central galaxies and then move the satellites together with their associated central galaxy. This results in the shuffled sample having the same clustering as the original sample on small (one-halo) scales.

These results are examined in detail in Fig. 5, showing the different large-scale clustering measurements separately for the central galaxies only on the left-hand side and for the full (central and satellite galaxies) sample on the right. We already saw in Fig. 3 that the overall clustering of the ML mock sample is highly consistent with that of the SAM on large scales. This is presented more clearly in the top panels of Fig. 5, where the black line shows the ratio of the ML predicted clustering to that of the SAM. The shaded regions hereafter indicate the uncertainty associated with the 10 different training sets (see Section 3.2). In both cases, we see that the SAM clustering is accurately reproduced. Our results are a vast improvement compared to Xu et al. ( 2013), who recover the amplitude of galaxy clustering to 5 per cent–10 per cent using the halo occupations as well. We reproduce the clustering to subper cent precision, perhaps due to both using a larger training sample and including also environmental properties. The latter is in line with recent studies that demonstrate the important role of environment in accurately capturing the level of galaxy clustering (Hadzhiyska et al. 2020; Xu et al. 2021). We then proceed to examine the results of the shuffled samples. We shuffle each of the SAM sample and the ML mock sample in bins of fixed halo mass, as described above. The ratios of the shuffled ML predicted clustering to that of the shuffled SAM clustering are presented as the red lines in the top panels of Fig. 5. Once again, these ratios are extremely close to unity, indicating an excellent agreement between the shuffled ML clustering and the shuffled SAM clustering.

Comparison of the measured correlation functions and GAB of the SAM and the ML predicted mock catalogue when using all features. The left-hand side shows these clustering results for central galaxies only, while the right-hand side shows the same for all (central and satellite) galaxies corresponding to the |$n=0.01 \, h^{3}\, {\rm Mpc}^{-3}$| sample. For both these cases, the top panels show the ratios of the correlation function of the ML-predicted mock catalogues relative to that of the SAM. Ratios of the original (unshuffled) correlation functions are shown in black, while the ratios of the shuffled samples of each are shown in red. The bottom panels (on both sides), show the measured GAB signal, namely the ratio of the original correlation function to that of the shuffled sample. Here, the SAM GAB measurement is shown in black, while the ML GAB is shown in blue. The shaded areas, in all panels, indicate the error bar measured from 10 different shuffled samples of the SAM galaxies and the 10 different realizations of the RF model.

We examine directly the GAB signature in the bottom panels of Fig. 5. Namely, we present ratios of the large-scales correlation function of the original sample to that of the shuffled sample, ξ/ξshuffled. Black lines represent this ratio, i.e. the GAB signal in the SAM, while the blue lines represent the ML-predicted GAB signal. The error bar on the SAM measurement is the scatter from 10 different shuffled samples, while the error bar on the ML predictions arises from the 10 different training sets (each with its own shuffled sample). Again, this is shown for the central galaxies only on the left-hand side and for the full samples, including satellites on the right. These ratios have already been studied with this specific SAM sample (Zehavi et al. 2018, 2019; Xu et al. 2021). The roughly 15 per cent increase of clustering in the original SAM sample versus the shuffled one arises from the differentiated occupation of haloes with galaxies according to secondary halo properties which exhibit halo assembly bias. For example, galaxies tend to preferentially occupy older haloes which exhibit stronger clustering, resulting in an increased large-scale galaxy clustering (GAB). We note, again, that the excess clustering shown here is the overall combined effect from all secondary properties.

The remarkable result clearly shown in the bottom panels of Fig. 5 is the excellent agreement between the GAB signal measured by the ML-predicted sample and that of the original SAM galaxy sample. This is exhibited by the nearly perfect agreement between the blue and black lines in each panel, for central galaxies only (left) and for the full sample (right). The RF model applied trained on the individual halo occupations is thus able to accurately reproduce the GAB effect in the large-scale galaxy clustering. Together with the recovered OVs, we see that the ML model is highly successful in reproducing all aspects of the complex phenomena of assembly bias.

Prediction results for RF models with different input features. The first two columns indicate the input features for the central and satellite galaxies. The centrals-only cases are indicated by a ‘−’ in the second (satellite) column. The performance scores F1 and R2 for the centrals and satellites are listed in the third and fourth columns, respectively. The next column shows the recovered fraction of the correlation function, 〈ξML/ξSAM〉, averaged over scales of |$9 \!-\! 30 \, h^{-1}\, {\rm Mpc}$|⁠ . We do not include a separate column for this property measured for the shuffled samples, since its accuracy is 1.00 (within the significance quoted) for all cases shown. The final column represents the accuracy of recovering the GAB signal using the fAB measure. The main predictions are all based on the galaxy sample of number density |$n=0.01 \, h^{3}\, {\rm Mpc}^{-3}$| and are listed in top 10 lines. The predictions with all features for two other number densities |$n=0.00316 \, h^{3}\, {\rm Mpc}^{-3}$| and |$n=0.0316 \, h^{3}\, {\rm Mpc}^{-3}$| are listed at the bottom.

The above results show that the RF models are capable of accurately reproducing galaxy clustering and GAB. However, the number of input features is large, which increases the complexity and running time of RF models. In this section, we aim to build simpler RF models with fewer input features that can achieve the same purpose. In addition to the prediction of galaxy numbers per halo, the RF algorithm also provides an estimate of the relative importance of the input features (i.e. all the secondary halo and environmental properties). It is evaluated based on the contribution of the input features to the construction of the RF decision trees. We show the top 10 properties ranked by feature importance in the left-hand side panels of Figs 6 and 7, for the central galaxies and satellites predictions, respectively.

Left: relative feature importance for the top 10 features of the RF predictions for central galaxies. Right: the correlation matrix of these top 10 features. The numbers shown are Pearson correlation coefficients between each pair of features.

Relative feature importance and correlation matrix for the top 10 features of the RF predictions for satellite galaxies.

For the central galaxies, we find that Vmax, the haloes’ maximum circular velocity, is the most important feature followed by |$z$| last, Vpeak, and a0.5. Vmax can be considered as a halo mass indicator (e.g. Zehavi et al. 2019), and the other properties characterize the formation history of a halo. Vpeak, the peak value of Vmax over the history of the halo, is a special case among them since it highly correlates with Vmax (with a 0.99 Pearson correlation coefficient, as noted in the right-hand panel of Fig. 6). We perform a simple test that runs the RF prediction inputting the same feature twice (for example the halo mass), to mimic the situation of two highly correlated features). We find that it tends to maintain the importance of one feature and lower the importance of the other one. So it is likely that the roles of Vmax and Vpeak are comparable for the central galaxies prediction. Given the extreme correlation between the two, once Vmax is utilized, Vpeak does not really add any new information and thus it is not necessary to keep them both.

The importance of Vmax is consistent with the finding by Zehavi et al. ( 2019) that Vmax or Vpeak better correlates with the central galaxies occupation than Mvir in the SAM sample such that using the former reduces significantly the central galaxies OV with other secondary properties and the related trends in the stellar mass–halo mass relation. Xu & Zheng ( 2020) reach a similar conclusion with the Illustris simulation, namely that the stellar mass of central galaxies in fixed Vpeak bins exhibits a weaker dependence on halo age or concentration than that in Mvir bins. This is not surprising since Vmax or Vpeak contains more internal structure information than Mvir alone, and in particular is also related to the concentration. Recently, Lovell et al. ( 2021) provide an ML approach to predict several galaxy properties from subhalo properties based on hydrodynamic simulations, also finding that Vmax is the most important property for the prediction.

The next two properties in order of feature importance are |$z$| last and a0.5. Both are specific epochs in the formation history of the host halo. The halo formation time, a0.5, is defined as the scale factor at the time when the host halo first reached half of its present mass, so it is indicative of the halo age and is widely explored in assembly bias studies. At fixed halo mass, early-formed haloes (smaller a0.5) tend to host more massive central galaxies than late-formed haloes (larger a0.5), and thus are more likely to host central galaxies above a given stellar mass threshold (Zehavi et al. 2018). The other parameter, |$z$| last, is the redshift of the last major merger of the host halo. It is another important epoch in the mass assembly history that could relate to the formation of the central galaxy. So it is reasonable that it is important for the central galaxies ML prediction. Interestingly, we find that no environmental properties appear in the top 10 features for central galaxies. This may be supported by the fact that the OV with environment is much smaller than with internal halo properties like age (Zehavi et al. 2018), as also demonstrated in Fig. 4. However, recent studies have shown that environment is the most informative property for describing GAB (Hadzhiyska et al. 2020; Xu et al. 2021). We will provide tests in the following subsections investigating the importance and necessity of the environment for reproducing the central galaxies and full GAB.

The left-hand panel of Fig. 7 shows the feature importance for the satellites prediction. Halo mass, Mvir, is the most important feature followed by the environment features δ2.5, δ1.25, and δ5. As expected, these three environmental measures are strongly correlated with each other, as can be seen in the right-hand panel of Fig. 7. In contrast to the central galaxies prediction, we note that here the environment is more important than secondary internal halo properties for predicting the number of satellites. Halo concentration is next and Vmax and Vpeak follow but with lower importance, which is again consistent with Zehavi et al. ( 2019), who showed that using Vmax (or Vpeak) is detrimental to encapsulating the satellites OV relative to using Mvir. These differences of feature importance between the central galaxies and satellites occupations highlight again the complexities of assembly bias. They imply that the formation and evolution of central and satellite galaxies may follow different paths and are impacted by different internal or environmental halo properties, and it is reasonable to model them separately with ML.

While the RF model estimates the input features importance, we should keep in mind that the features are correlated with each other. We take this into consideration when attempting to select fewer features for a less complex model. To illustrate that, in the right-hand panel of Figs 6 and 7, we plot the correlation matrix which shows the Pearson correlation coefficients between each pair of the top 10 features included in the left-hand panels. A correlation coefficient of 1 (shown by dark blue) indicates a positive maximal one-to-one correlation between the two properties, and a correlation coefficient of −1 (shown by dark orange) indicates a maximal anticorrelation. A correlation coefficient close to 0 indicates no correlation, with the two properties largely independent of each other. Values between 0 and 1 (−1) represent then a positive (negative) correlation with scatter, and the scatter is smaller for larger absolute values indicating a tighter correlation. In selecting a subset of top features, it is more effective to select a few such features that are important and yet less correlated with each other in order to represent most of the information. For central galaxies, since Vmax and Vpeak are tightly correlated, we select Vmax, |$z$| last, a0.5, and Mvir as the top features. For the satellite galaxies, we select Mvir, δ2.5, δ1.25, and concentration c as the top features. We show in the next section the RF prediction results with the selected top four features.

In this section, we predict the number of central and satellite with the top four features selected separately for central galaxies and satellites in Section 4.2. We first perform new grid searches for the two sets of top features to tune the RF classification and regression models for centrals and satellites, respectively. The F1 and R2 scores are listed in the third and fourth lines of Table 2, which are very similar to those from the all features models. Fig. 8 presents the ML predicted OVs compared to those from the SAM. Similar to the OV prediction with all features shown in Fig. 4, the considered OVs are all accurately reproduced. This is even more impressive in this case, when using only four features for each centrals or satellites. It is worth noting that other than Mvir which is common to both no secondary property is present in both the centrals and satellites top features. Thus in all panels of Fig. 8, showing the OV with c, a0.5, δ1.25, and α0.3,1.25, these properties are not involved in all predictions. We therefore conclude that the top four features for centrals and satellites are highly efficient in capturing the information needed for reproducing the halo occupation numbers.

Similar to Fig. 4, the predicted OV with c, a0.5, δ1.25, and α0.3,1.25, but now when using only the top four features in the RF algorithm. The four features for the central galaxies are Vmax, alastmerg, a0.5, and Mvir. The four features for the satellite galaxies are Mvir, δ2.5, δ1.25, and c.

With the predicted occupations from the top features, we again populate the haloes to create a mock galaxy catalogue and measure galaxy clustering and the GAB signal. The results are presented in Fig. 9 and summarized in Table 2. For the centrals-only sample, the predicted original clustering, shuffled clustering, and the GAB are highly consistent with those of the SAM (left-hand panels), with recovered fractions of 1.00, 1.00, and 0.97, respectively. These results are very similar to those from the prediction using all features, and the RF classification with the top four features works equally well as the one with all features. It is worth noting again that the top four features for central galaxies are all halo internal properties without explicitly including environment. This seems to imply that environment measures are not necessary for recovering the centrals GAB. However, other works have shown that environment is crucial for capturing GAB (Hadzhiyska et al. 2020; Xu et al. 2021; C. Cuesta, in preparation). To gain more insight on the role of environment in recovering GAB, we examine in Section 4.4 obtaining ML predictions based on only mass and environment, and in Section 4.5 the predictions based solely on internal halo properties.

Similar to Fig. 5, the predicted galaxy clustering and GAB measurement for centrals only (left) and all (central and satellite) galaxies (right), now obtained using only the top four features for central galaxies and satellites in the ML.

The right-hand side panels of Fig. 9 provide the predicted clustering and GAB for all galaxies including satellites. The satellite occupation is predicted with the top four features specific for satellites selected in Section 4.2 (which are different than the top four features for centrals) and include environmental properties. The recovered original clustering, shuffled clustering, and GAB fraction are all in excellent agreement with the SAM measurements (with recovered fractions of 1.00 for all). These fractions are in fact slightly higher than those for the all features model, but we consider them to be equally good due to the randomness associated with the prediction, populating galaxies, and shuffling. Combining the results from the centrals and satellites predictions, we find that the ML mock with only the top four features for each can well capture the galaxy–halo connection in the SAM and reproduce the expected galaxy clustering and GAB.

In Sections 4.2 and 4.3, we saw that environmental properties are listed in the top features for the satellite galaxies occupation. However, they are not included in the top 10 features for the central galaxies prediction, and the top four features for centrals (without environment) can well reproduce the centrals GAB. This seems to suggest that environment is not necessary for a recovery of the centrals GAB. We clarify that the internal halo properties (e.g. age a0.5) are surely dependent on environment to some degree since they produce assembly bias, but it is of interest to know whether an environmental measure is needed to be explicitly included. Traditional (non-ML) analyses show that environment is the most informative property for GAB, more significant than any other single secondary property in either SAM or hydrodynamic galaxy samples (Hadzhiyska et al. 2020, 2021; Xu et al. 2021). In particular, Xu et al. ( 2021) demonstrated that δ1.25 can capture the full level of GAB in the SAM. To further examine the role of environment in GAB, we repeat our analysis but now only use the halo mass Mvir and δ1.25 as input features to the RF algorithm models.

The OVs predicted by the ML models based on Mvir and δ1.25 are shown in Appendix C1. We find that the models are less successful in reproducing the OVs compared to the models with all features and the top four features. The OV dependence on δ1.25 is recovered as expected, as well as the ones for α0.3,1.25 to a large extent. However, the variations with halo properties such as concentration and age are poorly recovered, especially for the satellites. We note that these results are in agreement with those by Xu et al. ( 2021). While they were able to mimic the full level of GAB with only halo mass and δ1.25, they were similarly unable to recover the OVs with other secondary properties. In our ML analysis, the weaker recovery is also reflected by the somewhat lower F1 and R2 performance scores of the RF models in this case (lines 5–6 in Table 2). These scores reflect the halo-by-halo prediction accuracy such that a lower value will lead to less accurate recovery of the OVs.

We then populate haloes with the predicted occupations and measure galaxy clustering and GAB. The results for these are shown in Fig. 10 and summarized in Table 2 as well. For both centrals-only and all galaxies, the predicted shuffled clustering is in perfect agreement with the SAM results (the red solid lines in the top panels), indicating that the halo mass dependence of clustering is reproduced. However, for the original (unshuffled, including assembly bias) SAM clustering the ML recovery for both these cases is slightly lower (the black solid lines in the top panels). It is still reasonably good with a recovery fraction of 0.99, but stands out in contrast to the excellent agreement of the predictions with all features and top features explored earlier. This leads to a reduced ability to recover the GAB signal, denoted by the solid blue lines in the bottom panels of Fig. 10. These correspond to fAB values of 0.86 and 0.92 for the centrals-only GAB and the all-galaxies one, respectively. This result is consistent with Xu et al. ( 2021), who show that shuffling galaxies among haloes with fixed mass and δ1.25 (which can also be considered as populating haloes according to only mass and δ1.25) reproduces ∼90 per cent of the full GAB signal. The performance of the RF models based on only mass and environment is also similar to that of the modified HOD model provided by Xu et al. ( 2021), while in the latter the GAB parameters are tunable to reproduce the full effect.

Similar to Fig. 9, the predicted galaxy clustering and GAB measurement for centrals only (left) and all galaxies (right), using only halo mass and environment δ1.25 for both centrals and satellites.

Our analysis suggests that mass and environment are efficient in capturing most of the GAB signal and are useful for reproducing galaxy clustering within 1 per cent if halo internal properties are unavailable. Combined with the results from Section 4.3, we find that the central GAB can be recovered with either a few internal halo properties or the environment. The former achieves the purpose by capturing most of the assembly bias effects in halo occupation, whereas the latter achieves this by ‘mimicking’ the effect on the clustering. The satellites assembly bias effects can be largely recovered by environment alone, but including information on internal properties improves the OV. Would internal properties alone be able to reproduce both the centrals and satellites GAB? Is the environment required for reproducing the full GAB? We answer these questions in Section 4.5 by testing the RF models using now only the internal halo properties.

In this section, we explore the performance of the ML predictions when using only internal halo properties, commonly associated with halo assembly bias, rather than environment measures directly. We include all the halo properties listed in lines 1–13 of Table 1. In contrast to the previous case with halo mass and environment, the models with internal properties accurately recover the OV with concentration and a0.5 accurately, as shown in Fig. C2 in Appendix C. The OV with δ1.25 and α0.3,1.5 are partially recovered, with the centrals OV well reproduced but smaller OV for the satellite galaxies. As before, we proceed to create mock galaxy catalogues with the ML predicted occupations to study the impact on clustering and GAB.

The clustering and GAB of the RF mock are shown in Fig. 11. For the central galaxies only (left-hand side), we find that the original clustering, shuffled clustering, and GAB are all well reproduced at sub per cent accuracy. These results are similar to those with only the top four properties shown in Section 4.3, which for the central galaxies were comprised of only internal properties (Vmax, |$z$| last, a0.5, and Mvir). These top properties appear to include most of the information needed to reproduce the centrals clustering and GAB such that now including all internal properties does not change the results. The situation for the satellites, however, is different since environment measures have a prominent role in the top features. Consequently, we find that when adding the satellite galaxies, the clustering and GAB are not well with only the internal properties. The recovered clustering is lower than that of the SAM by 3 per cent, and only 70 per cent of the GAB is reproduced.

Similar to Fig. 5, the predicted galaxy clustering and GAB for centrals only (left) and all galaxies (right) when using all internal properties (and no environment measures) for the ML predictions.

We conclude that while the environment is not necessary for centrals clustering, it is required for an accurate representation of the satellites clustering. This is consistent with the feature importance provided by the RF models. In summary, secondary halo properties include enough information to recover in full the centrals OV, clustering and GAB, but the environment is needed for accurately predicting the satellites OV and the full level of clustering and GAB, and cannot be replaced with internal properties alone.

The main purpose of this paper is to explore the possibility of creating realistic mock galaxy catalogues from halo catalogues of N-body simulations using ML to capture the detailed galaxy–halo connection. However, for some low-resolution N-body simulations, the halo merger tree which follows the haloes’ evolution is unavailable. In such cases, one will not be able to obtain halo properties that rely on the merger tree, such as a0.5, Vpeak, and |$z$| last. The only available properties will be single-epoch properties typically obtained from the final snapshot of the simulation. These include Mvir, Vmax, concentration c, angular momentum j, and the environment measures. In this section we test the performance of ML models based on these single-epoch properties. We include, for both centrals and satellites, the above four internal halo properties and δ1.25.

We find that the OVs in this case are mostly well reproduced as shown in Appendix C. The OV with c and δ1.25 are particularly well reproduced, as expected, since they are part of the input features. The only notable deviation is for the centrals OV with a0.5, where the ML prediction is slightly smaller than in the SAM. The predicted clustering and GAB signal are shown in Fig. 12. For the centrals-only prediction, both galaxy clustering and GAB are extremely well reproduced. Adding the satellites, the SAM clustering is recovered to within 1 per cent and the GAB is recovered to within 5 per cent. These are better than the ML with only internal properties or Mvir and δ1.25 alone, and slightly worse than the models with all features or the top four features. We suspect that including additional available (single-epoch) environment measures such as δ2.5 would have improved this result.

Similar to Fig. 11, the predicted galaxy clustering and GAB for centrals only (left) and all galaxies (right), now using only the following single-epoch (i.e. not involving the halo merger tree) features Mvir, Vmax, concentration c, angular momentum j, and δ1.25 for both centrals and satellites.

Overall, the analysis illustrates that when the halo formation history is not available (for example, in low-resolution N-body simulations), ML models can still reproduce the clustering and GAB to reasonable accuracy. Using ML to predict the halo occupation and populate haloes with galaxies accordingly thus provide a viable practical approach to creating realistic mock galaxy catalogues, even in such cases.

In this paper, we describe an ML approach to predict the number of galaxies above a stellar mass threshold in dark matter haloes using halo and environment properties as input. We use the halo catalogue from the Millennium simulation and the galaxy sample from the Guo et al. ( 2011) SAM model to train and test our ML method. We use RF classification and regression for the central galaxies and satellites, respectively, and adopt commonly used F1 and R2 scores to evaluate the performance of the models. We test different combinations of input properties. For each set of the input properties, we tune the hyperparameters of the RF models to maximize the performance scores. With the predicted number of central and satellite galaxies in each halo, we then populate the Millennium simulation haloes to create a mock galaxy catalogue and measure the galaxy clustering and GAB signal to compare with those of the SAM.

We start by using all the available internal and environmental halo properties, listed in Table 1, as input features. The predicted HOD and OVs are consistent with those measured from the SAM. The ML mock catalogue matches well the galaxy clustering, shuffled sample clustering, and GAB as that of the original SAM sample. The clustering is recovered to subper cent accuracy and GAB is recovered at the two per cent level. Our results show that ML is capable of capturing the complex high-dimensional relations between halo properties and the galaxy occupation in the SAM model and reproduce the expected galaxy clustering accurately, including the intricate effects of assembly bias.

The RF models also provide an estimate of the relative importance of the different features. We find that Vmax is the most important feature for central galaxies, followed by formation history (internal) properties and halo mass. Environmental properties are not included in the top 10 features. On the other hand, the satellite galaxies prediction relies the most on halo mass and environmental properties. We construct simpler RF models with the top four halo properties for the centrals and satellite galaxies separately, based on the feature importance and correlation matrix between them. We select Vmax, |$z$| last, a0.5, and Mvir for central galaxies prediction and Mvir, δ1.25, δ2.5, and concentration c for the satellites prediction. The OVs, clustering, and GAB are again well reproduced. This demonstrates that the ML methodology is powerful enough such that, with only a few halo properties, it can achieve similar performance as when using all the available information.

We perform two additional tests to further explore the role of the environment in reproducing galaxy clustering and GAB. We first use only halo mass and one environmental property (δ1.25) as input for the RF models. With the ML-constructed mock, we still recover the SAM (original and shuffled) galaxy clustering to within 1 per cent and about 92 per cent of the full GAB signal (Fig. 10). We conclude that δ1.25 along with the host halo mass is enough to reproduce GAB to |$\sim 10{{\ \rm per\ cent}}$| accuracy. This is in agreement with previous works (Hadzhiyska et al. 2020; Contreras et al. 2021; Xu et al. 2021) that showed that using environmental properties can realistically incorporate assembly bias into empirical models, such as the HOD or SHAM. However, these methods do not recover the full OV for halo properties with inherent halo assembly bias, such as concentration or age (see Appendix C for more details). This puts a limitation on such approaches when using statistics that need a more detailed modelling of the galaxy–halo connection (like galaxy lensing). Other approaches that add assembly bias to mock catalogues using a single secondary property like the halo concentration will also necessarily fail to reproduce the galaxy–halo connection since such properties are not able to capture on their own the full GAB of a semi-analytic galaxy sample (Croton et al. 2007; Xu et al. 2021). To our knowledge, the approach presented in this paper is the most efficient model capable of populating galaxies in N-Body simulations while taking into account the correlations between the halo occupation and the secondary halo properties, and recovering a realistic GAB signal.

The second test employs all secondary assembly bias properties as input, excluding the environment. The clustering and GAB for central galaxies alone are recovered at subper cent accuracy at the same level as those with all or top four properties. However, after adding satellite galaxies, the predicted ML mock catalogue only recovers about 70 per cent of the GAB signal. This clearly indicates that internal properties alone are not able to fully capture the relation between the satellite occupations and the host haloes. Perhaps further information can be introduced by including additional internal properties not included in this work; however, using readily available environment measures seems the more practical approach here. Combining the results from the two tests, we find that both internal properties and environmental properties can reproduce the centrals clustering and GAB, but that environment is necessary for reproducing the full clustering and GAB. Furthermore, environment alone (together with halo mass) goes a long way toward mimicking the correct level of assembly bias; however, including assembly bias properties is needed to recover the OV with such properties and reproduce GAB to per cent level accuracy.

Finally, to explore a potential application of our ML method in cases where the halo merger tree might not be available in low-resolution N-body simulations, we limit the input properties to single-epoch ones, which can be obtained from the present-day simulation. We therefore use Mvir, Vmax, concentration, angular momentum, and δ1.25 as input for the RF models. The OVs in this case are reasonably reproduced, galaxy clustering is matched at subper cent level, and the GAB signal is recovered to 5 per cent. An improvement in the GAB level may be reached if including additional environment parameters. Utilizing such a model can be a practical approach for populating large dark matter-only simulations, like the Millennium XXL Simulation (Angulo et al. 2021) and others, where the resolution of the halo merger trees is insufficient for use in a SAM. Instead, one can train and fine-tune an ML model on a smaller volume high-resolution galaxy formation simulation. Once the model is determined, it is straight forward to apply it to the larger simulation to create mock galaxy catalogues with all the required attributes.

Overall, our results demonstrate the ability of ML to successfully capture the high-dimensional relationship between the halo occupation and multiple halo properties. Our tests here are with a SAM, but we expect similar performance when matching hydrodynamical simulations, which we leave for future work. As just mentioned, it is particularly advantageous to learn these relations from existing SAM or hydrodynamic galaxy samples in order to create realistic mock galaxy catalogues with haloes in larger cosmological volumes. This has the advantage of reproducing the detailed galaxy–halo connection of state-of-the-art galaxy formation models, which might be computationally prohibitive otherwise. Additionally, with the single-epoch test, we show that ML can also be used to reproduce galaxy clustering and assembly bias in low-resolution N-body simulations for emulators, which are becoming benchmarks for cosmological studies. In this work, we focus on predicting the occupation of galaxies in halo for stellar mass selected samples, but it can be extended to other types of galaxy samples, for example star formation rate selected samples and colour selected sample, which are also frequently used in observations as well as galaxy samples at higher redshifts. We leave these as well for future studies.

Different studies in the literature have focused on predicting galaxy properties from haloes with ML techniques. Xu et al. ( 2013) predict the number of galaxies based on six halo properties and reproduce the galaxy clustering to a 5 per cent–10 per cent, which is similar to our internal properties predictions without using environment. Our extended work now reaches subper cent accuracy. Other works based on ML techniques predict properties of central galaxies such as stellar mass, star formation rate, and gas mass to mimic galaxy formation in hydrodynamic simulations (e.g. Kamdar et al. 2016b; Agarwal et al. 2018; Wadekar et al. 2020). In contrast, our study using the occupation number more directly probes galaxy clustering and assembly bias, and allows to naturally predict both central and satellite galaxies.

With the predicted number of galaxies from the RF models, when populating haloes, we assume the galaxy positions trace the dark matter. However, it is also possible to predict the galaxy positions with ML models directly. For example, Zhang et al. ( 2019) use deep learning to predict 3D galaxy distribution from the dark matter distribution in the Illustris simulation. It may also be possible to predict the radial distribution bias and velocity bias of galaxies compared to the dark matter for individual haloes, if these are used as target for learning in addition to galaxy numbers.

For the purpose of modelling the halo occupation, our work can be considered as an ML alternative to the HOD approach in some applications. The standard HOD framework models the number of galaxies in a halo as a function of only halo mass. Different extensions of the HOD (e.g. Hearin et al. 2016; Xu et al. 2021; Yuan et al. 2021) include an additional dependence on one or two secondary halo properties, but the galaxy–halo relations obtained are still limited. With ML-based methods, the non-linear dependence of the halo occupation on multiple halo and environment properties can be maximally reproduced without assuming an analytic relation between them or fixing the parameters. Similarly, compared to empirical SHAM models, ML methods can capture and reproduce more complex multivariate dependencies between the galaxy and halo properties. This advantage makes ML a powerful approach for studying the galaxy–halo connection and for creating realistic mock galaxy catalogues, which will be useful for upcoming large galaxy surveys.

We clarify, however, that we are not suggesting here that the HOD approach may be completely replaced by ML and it is still essential in many applications. For example, HODs are clearly needed when aiming to constrain the galaxy–halo connection directly from observed clustering measurements or create mock catalogues mimicking the observed clustering since ML can only learn and reproduce known galaxy–halo connections. Similarly, if one wishes to produce sets of mock catalogues with varied HOD models capable of reproducing any clustering signal, then the HOD approach (and different extensions thereof) provides a practical and flexible way of doing so. In contrast, as already discussed, ML are especially powerful if one wishes to create many realizations of a specific galaxy formation model, such as a given SAM or a hydrodynamical simulation (for example for testing algorithms, forecasting, and generating covariance matrices). In such cases, one can train the ML once and then efficiently run it on dark matter-only simulations to accurately reproduce the target galaxy population, including the nuanced effects of GAB. In such applications, the ML approach is superior to the standard HOD in terms of recovering the detailed galaxy–halo connection with only a moderate additional computational effort.

This work was made possible by the efforts of Gerard Lemson and colleagues at the German Astronomical Virtual Observatory in setting up the Millennium Simulation database in Garching. We thank the anonymous referee for their useful comments that helped the clarity of the paper. XX, SK, and IZ acknowledge support by National Science Foundation grant AST-1612085. SC acknowledges the support of the ‘Juan de la Cierva Formación’ fellowship (FJCI-2017-33816).

The data underlying this article are available in GitHub at https://github.com/xiaojux2020/RFmodels.

http://gavo.mpa-garching.mpg.de/Millennium/

Agarwal S. , Davé R. , Bassett B. A. , 2018, MNRAS, 478, 3410 10.1093/mnras/sty1169

Aizerman M. A. , Braverman E. M. , Rozonoer L. I. , 1964, Autom. Remote Control, 25, 1175

Angulo R. E. , Springel V. , White S. D. M. , Jenkins A. , Baugh C. M. , Frenk C. S. , 2012, MNRAS, 426, 2046 10.1111/j.1365-2966.2012.21830.x

Aragon-Calvo M. A. , 2019, MNRAS, 484, 5771 10.1093/mnras/stz393

Arjona R. , Nesseris S. , 2020, Phys. Rev. D, 101, 123525 10.1103/PhysRevD.101.123525

Armitage T. J. , Kay S. T. , Barnes D. J. , 2019, MNRAS, 484, 1526 10.1093/mnras/stz039

Artale M. C. , Zehavi I. , Contreras S. , Norberg P. , 2018, MNRAS, 480, 3978 10.1093/mnras/sty2110

Behroozi P. S. , Conroy C. , Wechsler R. H. , 2010, ApJ, 717, 379 10.1088/0004-637X/717/1/379

Behroozi P. S. , Wechsler R. H. , Wu H.-Y. , 2013, ApJ, 762, 109 10.1088/0004-637X/762/2/109

Behroozi P. , Wechsler R. H. , Hearin A. P. , Conroy C. , 2019, MNRAS, 488, 3143 10.1093/mnras/stz1182

Berger P. , Stein G. , 2019, MNRAS, 482, 2861 10.1093/mnras/sty2949

Berlind A. A. , Weinberg D. H. , 2002, ApJ, 575, 587 10.1086/341469

Bond J. R. , Cole S. , Efstathiou G. , Kaiser N. , 1991, ApJ, 379, 440 10.1086/170520

Bose S. , Eisenstein D. J. , Hernquist L. , Pillepich A. , Nelson P. , Marinacci F. , Volker S. , Vogelsberger M. , 2019, MNRAS, 490, 5693 10.1093/mnras/stz2546

Boser B. E. , Guyon I. M. , Vapnik V. N. , 1992, 5th Annual ACM Workshop on COLT. ACM Press, Pittsburgh, PA, p. 144

Calderon V. F. , Berlind A. A. , 2019, MNRAS, 490, 2367 10.1093/mnras/stz2775

Chaves-Montero J. , Angulo R. E. , Schaye J. , Schaller M. , Crain R. A. , Furlong M. , Theuns T. , 2016, MNRAS, 460, 3100 10.1093/mnras/stw1225

Cheng T. Y. et al. , 2020, MNRAS, 493, 4209 10.1093/mnras/staa501

Conroy C. , Wechsler R. H. , Kravtsov A. V. , 2006, ApJ, 647, 201 10.1086/503602

Contreras S. , Zehavi I. , Padilla N. , Baugh C. M. , Jimenez E. , Lacerna I. , 2019, MNRAS, 484, 1133 10.1093/mnras/stz018

Contreras S. , Angulo R. , Zennaro M. , 2020, preprint (arXiv:2012.06596)

Contreras S. , Angulo R. , Zennaro M. , 2021, MNRAS, 504, 5205 10.1093/mnras/stab1170

Cooray A. , Sheth R. , 2002, Phys. Rep., 372, 1 10.1016/S0370-1573(02)00276-4

Cora S. A. et al. , 2018, MNRAS, 479, 2 10.1093/mnras/sty1131

Croton D. J. et al. , 2006, MNRAS, 365, 11 10.1111/j.1365-2966.2005.09675.x

Croton D. J. , Gao L. , White S. D. M. , 2007, MNRAS, 374, 1303 10.1111/j.1365-2966.2006.11230.x

Croton D. J. et al. , 2016, ApJ, 222, 22

Dark Energy Survey Collaboration , 2016, MNRAS, 470, 1270 10.1093/mnras/stw641

Davis M. , Efstathiou G. , Frenk C. S. , White S. D. M. , 1985, ApJ, 292, 371 10.1086/163168

De La Calleja J. , Fuentes O. , 2004, MNRAS, 349, 87 10.1111/j.1365-2966.2004.07442.x

De Lucia G. , Blaizot J. , 2007, MNRAS, 375, 2 10.1111/j.1365-2966.2006.11287.x

De Lucia G. , Kauffmann G. , White S. D. M. , 2004, MNRAS, 349, 1101 10.1111/j.1365-2966.2004.07584.x

de Oliveira R. A. , Li Y. , Villaescusa-Navarro F. , Ho S. , Spergel D. N. , 2020, preprint (arXiv:2012.00240)

Gao L. , White S. D. M. , 2007, MNRAS, 377, L5 10.1111/j.1745-3933.2007.00292.x

Gao L. , Springel V. , White S. D. M. , 2005, MNRAS, 363, L66 10.1111/j.1745-3933.2005.00084.x

Géron A. , 2017, Hands-on Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O’Reilly Media, Sebastopol, CA

Guo Q. et al. , 2011, MNRAS, 413, 101 10.1111/j.1365-2966.2010.18114.x

Guo Q. , White S. , Angulo R. E. , Henriques B. , Lemson G. , Boylan-Kolchin M. , Thomas P. , Short C. , 2013, MNRAS, 428, 1351 10.1093/mnras/sts115

Guo H. et al. , 2016, MNRAS, 459, 3040 10.1093/mnras/stw845

Hadzhiyska B. , Bose S. , Eisenstein D. , Hernquist L. , Spergel D. N. , 2020, MNRAS, 493, 5506 10.1093/mnras/staa623

Hadzhiyska B. , Bose S. , Eisenstein D. , Hernquist L. , 2021, MNRAS, 501, 1603 10.1093/mnras/staa3776

Han J. , Li Y. , Jing Y. P. , Nishimichi T. , Wang W. , Jiang C. , 2019, MNRAS, 482, 1900 10.1093/mnras/sty2822

Hastie T. , Tibshirani R. , Friedman J. , 2001, The Elements of Statistical Learning. Springer New York Inc, New York

Hearin A. P. , Zentner A. R. , van den Bosch F. C. , Campbell D. , Tollerud E. , 2016, MNRAS, 460, 2552 10.1093/mnras/stw840

Henriques B. M. B. , White S. D. M. , Thomas P. A. , Angulo R. , Guo Q. , Lemson G. , Springel V. , Overzier R. , 2015, MNRAS, 451, 2663 10.1093/mnras/stv705

Henriques B. M. B. , Yates R. M. , Fu J. , Guo Q. , Kauffmann G. , Srisawat C. , Thomas P. A. , White S. D. M. , 2020, MNRAS, 491, 5795 10.1093/mnras/stz3233

Ivezić Z. et al. , 2019, ApJ, 873, 111 10.3847/1538-4357/ab042c

James G. , Witten D. , Hastie T. , Tibshirani R. , 2013, An Introduction to Statistical Learning. Springer New York Inc, New York

Jiménez E. , Contreras S. , Padilla N. , Zehavi I. , Baugh C. M. , Gonzalez-Perez V. , 2019, MNRAS, 490, 3532 10.1093/mnras/stz2790

Kamdar H. M. , Turk M. J. , Brunner R. J. , 2016a, MNRAS, 455, 642 10.1093/mnras/stv2310

Kamdar H. M. , Turk M. J. , Brunner R. J. , 2016b, MNRAS, 457, 1162 10.1093/mnras/stv2981

Lange J. U. , van den Bosch F. C. , Zentner A. R. , Wang K. , Hearin A. P. , Guo H. , 2019, MNRAS, 490, 1870 10.1093/mnras/stz2664

Lin Y.-T. , Mandelbaum R. , Huang Y.-H. , Huang H.-J. , Dalal N. , Diemer B. , Jian H.-Y. , Kravtsov A. , 2016, ApJ, 819, 119 10.3847/0004-637X/819/2/119

Lovell C. C. , Wilkins S. M. , Thomas P. A. , Schaller M. , Baugh C. M. , Fabbian G. , Bahé Y. , 2021, preprint (arXiv:2106.04980)

LSST Science Collaborations , 2009, preprint (arXiv:0912.0201)

Lu Y. , Yang X. , Shen S. , 2015, ApJ, 804, 55 10.1088/0004-637X/804/1/55

Lucie-Smith L. , Peiris H. V. , Pontzen A. , Lochner M. , 2018, MNRAS, 479, 3405 10.1093/mnras/sty1719

Mao Y. , Zentner A. R. , Wechsler R. H. , 2018, MNRAS, 474, 5143 10.1093/mnras/stx3111

McCarthy K. S. , Zheng Z. , Guo H. , 2019, MNRAS, 487, 2424 10.1093/mnras/stz1461

McEwen J. E. , Weinberg D. H. , 2018, MNRAS, 477, 4348 10.1093/mnras/sty882

Mo H. J. , White S. D. M. , 1996, MNRAS, 282, 347 10.1093/mnras/282.2.347

Moews B. , Davé R. , Mitra S. , Hassan S. , Cui W. , 2021, MNRAS, 504, 4024 10.1093/mnras/stab1120

Moster B. P. , Naab T. , White S. D. M. , 2018, MNRAS, 477, 1822 10.1093/mnras/sty655

Mucesh S. et al. , 2021, MNRAS, 502, 2770 10.1093/mnras/stab164

Navarro J. F. , Frenk C. S. , White S. D. M. , 1996, ApJ, 462, 563 10.1086/177173

Nelson D. et al. , 2019, MNRAS, 490, 3234 10.1093/mnras/stz2306

Ntampaka M. , Eisenstein D. J. , Yuan S. , Garrison L. H. , 2020, ApJ, 889, 151 10.3847/1538-4357/ab5f5e

Paranjape A. , Sheth R. K. , Desjacques V. , 2013, MNRAS, 431, 1503 10.1093/mnras/stt267

Paranjape A. , Hahn O. , Sheth R. K. , 2018, MNRAS, 476, 3631 10.1093/mnras/sty496

Prada F. , Klypin A. A. , Cuesta A. J. , Betancort-Rijo J. E. , Primack J. , 2012, MNRAS, 423, 3018 10.1111/j.1365-2966.2012.21007.x

Press W. H. , Schechter P. , 1974, ApJ, 187, 425 10.1086/152650

Ramakrishnan S. , Paranjape A. , Hahn O. , Sheth R. K. , 2019, MNRAS, 489, 2977 10.1093/mnras/stz2344

Reddick R. M. , Wechsler R. H. , Tinker J. L. , Behroozi P. S. , 2013, ApJ, 771, 30 10.1088/0004-637X/771/1/30

Salcedo A. N. , Maller A. H. , Berlind A. A. , Sinha M. , McBride C. K. , Behroozi P. S. , Wechsler R. H. , Weinberg D. H. , 2018, MNRAS, 475, 4411 10.1093/mnras/sty109

Sánchez C. et al. , 2014, MNRAS, 445, 1482 10.1093/mnras/stu1836

Schaye J. et al. , 2015, MNRAS, 446, 521 10.1093/mnras/stu2058

Sheth R. K. , Tormen G. , 1999, MNRAS, 308, 119 10.1046/j.1365-8711.1999.02692.x

Sheth R. K. , Tormen G. , 2004, MNRAS, 350, 1385 10.1111/j.1365-2966.2004.07733.x

Spergel D. N. et al. , 2003, ApJS, 148, 175 10.1086/377226

Springel V. , Yoshida N. , White S. D. M. , 2001a, New Astron., 6, 79 10.1016/S1384-1076(01)00042-2

Springel V. , White S. D. M. , Tormen G. , Kauffmann G. , 2001b, MNRAS, 328, 726 10.1046/j.1365-8711.2001.04912.x

Springel V. et al. , 2005, Nature, 435, 629 10.1038/nature03597

Stevens A. R. , Croton D. J. , Mutch S. J. , 2016, MNRAS, 461, 859 10.1093/mnras/stw1332

Tanaka M. et al. , 2018, PASJ, 70, S9 10.1093/pasj/psx077

Villaescusa-Navarro F. et al. , 2019, ApJS, 250, 2 10.3847/1538-4365/ab9d82

Wadekar D. , Villaescusa-Navarro F. , Ho S. , Perreault-Levasseur L. , 2020, preprint (arXiv:2012.00111)

Wang L. , Weinmann S. M. , De Lucia G. , Yang X. , 2013, MNRAS, 433, 515 10.1093/mnras/stt743

Wang J. , Bose S. , Frenk C. S. , Gao L. , Jenkins A. , Springel V. , White S. D. M. , 2020, Nature, 585, 39 10.1038/s41586-020-2642-9

Wechsler R. H. , Zentner A. R. , Bullock J. S. , Kravstov A. V. , Allgood B. , 2006, ApJ, 652, 71 10.1086/507120

White S. D. M. , Rees M. J. , 1978, MNRAS, 183, 341 10.1093/mnras/183.3.341

Wu J. F. , Peek J. E. G. , 2020, preprint (arXiv:2009.12318)

Xu X. , Zheng Z. , 2018, MNRAS, 479, 1579 10.1093/mnras/sty1547

Xu X. , Zheng Z. , 2020, MNRAS, 492, 2739 10.1093/mnras/staa009

Xu X. , Ho S. , Trac H. , Schneider J. , Poczos B. , Ntampaka M. , 2013, ApJ, 772, 147 10.1088/0004-637X/772/2/147

Xu X. , Zehavi I. , Contreras S. , 2021, MNRAS, 502, 3242 10.1093/mnras/stab100

York D. G. et al. , 2000, AJ, 120, 1579 10.1086/301513

Yuan S. , Hadzhiyska B. , Bose S. , Eisenstein D. J. , Guo H. , 2021, MNRAS, 502, 3582 10.1093/mnras/stab235

Zehavi I. et al. , 2005, ApJ, 630, 1 10.1086/431891

Zehavi I. et al. , 2011, ApJ, 736, 59 10.1088/0004-637X/736/1/59

Zehavi I. , Contreras S. , Padilla N. , Smith N. J. , Baugh C. M. , Norberg P. , 2018, ApJ, 853, 84 10.3847/1538-4357/aaa54a

Zehavi I. , Kerby S. E. , Contreras S. , Jiménez E. , Padilla N. , Baugh C. M. , 2019, ApJ, 887, 17 10.3847/1538-4357/ab4d4d

Zentner A. R. , Hearin A. P. , van den Bosch C. F. , 2014, MNRAS, 443, 3044 10.1093/mnras/stu1383

Zhang X. , Wang Y. , Zhang W. , Sun Y. , He S. , Contardo G. , Villaescusa-Navarro F. , Ho S. , 2019, preprint (arXiv:1902.05965)

Zheng Z. et al. , 2005, ApJ, 633, 791 10.1086/466510

Zhou R. et al. , 2021, MNRAS, 501, 3309 10.1093/mnras/staa3764

Zu Y. , Zheng Z. , Zhu G. , Ying Y. P. , 2008, ApJ, 686, 41 10.1086/591071

A key element of the RF machine-learning methodology is decision trees. The most popular decision tree algorithm is known as the Classification And Regression Tree (CART). In CART, a decision tree starts with a root node containing all input instances in the training set. It then splits the root node into two (known as binary splits) based on an input feature and its threshold (e.g. |$M_{\rm vir} \le 10^{14}\, h^{-1}\, {\rm M}_\odot$|⁠ ) that minimizes the CART cost function. The tree continues to split the nodes into two until it reaches the allowed maximum tree depth or the minimum allowed number of instances in a leaf node, where a leaf is a node that cannot be split any further. The maximum tree depth and minimum instances in a leaf serve as regularization hyperparameters which limit the growth of a tree to prevent overfitting. For classification, the output of the leaf is the class with the maximum number of training instances in it. For regression, the output is the mean of the ground ‘truth’ values in the leaf. After constructing the decision tree, predictions for new input data can be made starting from the root node and following the branching splitting nodes. Finally, it ends at a leaf from which the output can be obtained.

To further investigate the ability of RF models to reproduce the GAB, we perform a similar analysis to that presented in Section 4.1, using all features available for the ML prediction, for two additional stellar mass selected galaxy samples with |$n=0.00316 \, h^{3}\, {\rm Mpc}^{-3}$| and |$n=0.0316 \, h^{3}\, {\rm Mpc}^{-3}$|⁠ . These correspond to stellar mass thresholds of |$3.88\times 10^{10} \, h^{-1}\, {\rm M}_\odot$| and |$1.85 \times 10^{9} \, h^{-1}\, {\rm M}_\odot$|⁠ , respectively. The clustering results are shown in Figs B1 and B2, and are also included in Table 2.

Similar to Fig. 5, the ML predicted galaxy clustering and GAB with all features for the galaxy sample with number density |$n=0.00316 \, h^{3}\, {\rm Mpc}^{-3}$|⁠ .

The same as in Fig. B1 but for galaxy number density of |$n=0.0316 \, h^{3}\, {\rm Mpc}^{-3}$|⁠ .

For the lowest number density sample, the results contain a higher level of noise, due to the smaller sample size. The F1 and R2 performance scores are correspondingly worse than for our default |$n=0.01 \, h^{3}\, {\rm Mpc}^{-3}$| sample as well as the predicted clustering, especially on very large scales. This leads to a recovery of about 83 per cent of the GAB obtained for the central galaxies and 96 per cent recovery of the GAB for all (central and satellites) galaxies. However, as can be seen in Fig. B1, the larger uncertainties on these measurements imply a smaller level of discrepancy than a naive interpretation of these numbers. Furthermore, the SAM GAB measurements show an uncharacteristic scale-dependent behaviour on the largest scales, which the ML predictions do not recover. This apparent scale dependence is likely just noise (Xu et al. 2021) such that the agreement is probably better than it seems.

On the other hand, for the sample with the highest number density, the sample size is larger accordingly, so that the measurement uncertainties and performance scores are better. The predicted clustering and GAB are all very close to 100 per cent in this case as expected. We note that this sample includes also less massive galaxies resulting in slightly larger amount of GAB. We conclude that the accuracy of the ML predictions is fairly robust to the GAB level and not specific to the default |$n=0.01 \, h^{3}\, {\rm Mpc}^{-3}$| sample, but is somewhat sensitive to the level of noise as reflected by the size of the galaxy sample.

In this appendix, we provide the predicted OVs for the RF models in Sections 4.4 and 4.5. Fig. C1 shows the predicted OVs by the RF models when using only halo mass and δ1.25 as input. Comparing to the SAM results, the OV with δ1.25 is accurately recovered as expected, as well the OV with α0.3,1.25 to a large extent since α0.3,1.25 is correlated with δ1.25. However, the predicted OV with either concentration or a0.5 is not reproduced. For the centrals OV the trend with these properties is still there but to a much lesser degree than that of the SAM, indicated by the smaller difference of centrals occupations between upper and lower 10 per cent of the concentration and a0.5. The satellites OV with these two internal properties is entirely missing, with identical satellite occupations for the upper and lower 10 per cent of the haloes. This may seem surprising initially since for the satellite galaxies Mvir and δ1.25 are two of the top four features which are able to recover the OV and clustering well. However, this arises due to the lack of any internal halo property other than mass as input for the RF models (while in the top features the halo concentration is included). Similar results for the OV were also obtained by Xu et al. ( 2021) when using Mvir and δ1.25. Despite the failure in recovering the OV with internal properties, the ML prediction based on mass and δ1.25 still reproduces the roughly correct level of clustering as in the SAM and a large fraction (0.92) of the GAB signal, as shown in Section 4.4.

Similar to Figs 4 and 8, the predicted OV with c, a0.5, δ1.25, and α0.3,1.25, but now using only Mvir and δ1.25 as inputs for the RF algorithm for both centrals and satellites.

Fig. C2 shows the predicted OVs of the RF models using all internal halo properties and no environmental measures (Section 4.5). In this case, we are able to reproduce quite well the OV dependences on all properties. Both the centrals and satellites OV with the internal properties c and a0.5 are recovered remarkably well, essentially by construction since these properties are included in the training. The centrals OV with the environmental properties α0.3,1.25 and δ1.25 are also well recovered. However, the predicted satellites OV with these properties is smaller than that of the SAM. These results are consistent with the feature importance discussed in Section 4.2, where the environmental measures are among the top features for the satellite galaxies but not for the centrals. As a result, when excluding the environmental measures, the centrals-only clustering and GAB are fully recovered, but only 70 per cent of the GAB signal is reproduced when using both centrals and satellites. This indicates that the environment is important for reproducing the satellites OV and GAB, and cannot be replaced with the impact of the internal halo properties. We conclude that with only internal properties, the centrals GAB can be well reproduced, but the environment is important for reproducing the full GAB of all galaxies.

Similar to Fig. C1, the predicted OV now with all internal halo properties as input features.

Similar to Fig. C2, the predicted OV here using only single-epoch properties.

Finally, for completeness, we also show in Fig. C3 the OV for the single-epoch properties discussed in Section 4.6. Since the concentration and δ1.25 are included in the input features for the RF models, their OVs are well recovered. The OV with α0.3,1.25 is also well reproduced, again likely due to the correlation with δ1.25. The OV with age is accurately obtained for the satellites, but for the central galaxies it slightly deviates from the measurement in the SAM, producing a smaller OV. This may not be too surprising as a0.5 is not included in the input features, since it requires the halo merger tree to be computed. However, the concentration (which is included as a feature in this analysis) is correlated with a0.5, and as such carries with it some (incomplete) information on age as well. The resulting clustering and GAB are reasonably well reproduced.

Xiaoju Xu and Saurabh Kumar are authors contributed equally to the work and share co-first authorship.

Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide

This PDF is available to Subscribers Only

For full access to this pdf, sign in to an existing account, or purchase an annual subscription.

100% Payment Protection

Fast Delivery,Within 7 Days

To Provide Satisfactory Service

Predicting halo occupation and galaxy assembly bias with machine learning | Monthly Notices of the Royal Astronomical Society | Oxford Academic

Featured Products

News & Blog

Close Navigation

Close Navigation

Eaton introduces pipeline strainer range at Aquatech

Covid’s return forces firms to rework plans | Mint

linkedin

linkedin

Listen

Cubic Boron Nitride Micron Powder Market Research Report – Global Forecast Till 2030 – Indian Defence News

Cubic Boron Nitride Micron Powder Market Research Report – Global Forecast Till 2030 – Indian Defence News

Webb’s Mid-Infrared Spectroscopy Will Reveal Molecules, Elements – James Webb Space Telescope