Random regression test-day model for the analysis of dairy cattle production data in South Africa : Creating the framework

E.F. Dzomba, K.A. Nephawe, A.N. Maiwashe, S.W.P. Cloete, M. Chimonyo, C.B. Banga, C.J.C. Muller and K. Dzama Department of Animal Science, University of Stellenbosch, Private Bag X1, Matieland 7602, South Africa Limpopo Department of Agriculture, 69 Biccard Street, Private Bag X9487, Polokwane 0700, South Africa Agricultural Research Council, Animal Breeding and Genetics, Private Bag X2, Irene 0062, South Africa Institute for Animal Production, Private Bag X1, Elsenburg 7607, South Africa Discipline of Animal & Poultry Science, University of KwaZulu Natal, Private Bag X01, Scottsville, Pietermaritzburg 3209, South Africa


Introduction
Genetic evaluation of dairy sires and cows has evolved immensely over the years.From the initial stages when simple dam-daughter comparisons were made, rapid advances in computer hardware and improvements in computing algorithms have made it possible to implement modern methods for analysis.Several countries are now using best linear unbiased prediction (BLUP) under animal models for national genetic evaluations based either on lactation yields or test-day yields.
In South Africa, estimates of breeding values (BV) for production traits and somatic cell scores of dairy cattle are based on test-day (TD) yields of milk, protein, fat, as well as somatic cell count.Within the National Dairy Animal Improvement Scheme of South Africa, daily yields of milk, protein and fat percentages are recorded every five weeks.These recordings are subsequently used directly in genetic evaluations using a fixed regression test-day model (Mostert et al., 2006b) instead of yields aggregated over 305-days of lactation.A test-day model (TDM) is a statistical procedure which considers all genetic and environmental effects directly on a test-day basis (Swalve, 1995).Data for test-day production of dairy cows provide an example of repeated measures or longitudinal data, the essential feature of which is the presence of correlations between tests on the same animal.It is important to explore the potential of any statistical and computing technique which allows a direct and more efficient utilization of all available test-day records for genetic evaluation of dairy cattle.
Use of the TDM approach allows a more detailed statistical model to be developed, which accounts for environmental variation specific to individual TD yields and genetic effects associated with individual animals.It offers the opportunity to directly account for short-term environmental factors specific to individual yields such as gestation period.The TDM also overcomes the need to predict 305-day yields or for projection of incomplete lactations.Furthermore, the TDM allows for precise definition of the contemporary group (CG).With the TD approach, definition of CG including test-month improves the properties of the statistical model.Solutions emanating from such CG effects can be utilized to improve herd management.
Within the TDM approach, the genetic component of the lactation curve can be modelled by fitting regression coefficients for each animal, commonly referred to as random regression (RR) coefficients (Schaeffer & Dekkers, 1994).The additive genetic solutions can be extracted from the BV estimates for the RR coefficients (Jamrozik et al., 1997).It becomes possible to genetically rank animals for each TD yield by estimating a BV of each animal for each TD yield.The estimated BV is given as a product of the RR coefficients and the days in milk (DIM) dependent covariates.Monitoring of management of individual herds and of individual cows within a herd is also an added advantage through the simple comparison between actual and expected production.
For the South African Holstein and Jersey cow populations, Mostert et al. (2004;2006a) reported genetic correlations between TD milk yields of different lactations to differ from one.This study led to the implementation of a fixed regression TD model, but recommended the use of RR functions in the genetic evaluation of South African dairy cattle.A random regression TDM approach was first implemented in Canada (Schaeffer et al., 2000) in 1999 and several countries that are members of Interbull have since adopted various forms of the methodology, including Belgium, Germany, the Netherlands, Italy, Finland, Denmark and Sweden (Interbull, 2009).Interbull is an international non-profit making organization responsible for promotion, development and standardization of genetic evaluation of dairy cattle.There are currently 27 countries, including South Africa, participating in Interbull evaluations.South Africa participates in international genetic evaluations of dairy cattle conducted by Interbull for which it has been a member since 1999.
The purpose of this review is to describe the random regression methodology in dairy cattle genetic evaluation and explore how a framework of their adoption for TD data analysis in South Africa can be built.

Methods for genetic evaluation of TD records
Interest has grown in changing the data used in genetic evaluation of dairy cattle from combined 305-day mature equivalent lactation yields to individual TD yields.The 305-day mature equivalent adjusts the current production record for a cow to what she would be producing after three years in lactation or greater as a mature cow.The current method for genetic evaluation uses several daily measurements usually taken once a month (test-day) on an individual cow over the course of the lactation.The idea of using TD measurements in genetic evaluation has been a subject for research for a long time (Searle, 1961;Meyer et al., 1989;Stanton et al., 1992;Ptak & Schaeffer, 1993;Meyer & Hill, 1997;Wiggans & Goddard, 1997).
Data from the milk recording scheme are often analyzed by regarding TD records from a cow in single or multiple trait analyses or as repeated measurements of the same trait along a lactation curve, potentially applying some correction for DIM or age at recording.Various methods have been used to analyze TD records which represent longitudinal data (Swalve, 1995;2000;Misztal et al., 2000;Schaeffer et al., 2000;Jensen, 2001).Most of these methods can be regarded as being derived from a model in which the traits have a patterned covariance matrix, but these methods vary in assumptions about the structure of the covariance matrix (White et al., 1999).
Firstly, in single trait analysis with a repeatability model, constant genetic variance over DIM and a genetic correlation of one between TD records taken at different DIM, is assumed (Ptak & Schaeffer, 1993).
Secondly, multivariate analysis treats each TD record at different DIM as a different trait (Meyer et al., 1989;Pander et al., 1992).Swalve (1995) observed that some authors arbitrarily divided the DIM range into intervals (early, mid and late lactation) that represent individual, but correlated traits and treated the measurements of these different intervals as different traits.The approach has major drawbacks that include inadequate use of information provided at test-days, hence fails to account for constraints imposed on the covariance structure.
Thirdly, lactation curves have been fitted at a phenotypic level and the parameters of the curve have subsequently been analyzed as new traits (Stanton et al., 1992).However, this approach results in failure to fully account for the systematic environmental effects (VanRaden, 1997).
Fourth, as a way of improving the current model for dairy cattle data analysis, the random regression approach has been proposed for South Africa (Mostert, 2007) and is already being applied in some countries' dairy cattle genetic evaluations (Hammami et al., 2008).

The random regression approach
The additive genetic values (estimated breeding values, EBV) of animals are usually obtained from mixed model analyses.For the trait under consideration, a linear regression of observations on indicator variables is performed.Animals' additive genetic effects are fitted as random effects.Because functions of time, such as DIM, can be readily modelled in the mixed model framework (Henderson, 1982), trajectories (e.g.lactation curve) can be described.The covariables are usually nonlinear functions such as polynomials or splines relating time to the traits e.g.milk, fat or protein yield.Fitting sets of RR coefficients for each individual random factor (e.g.additive genetic and permanent environmental effects) produces the estimates of the corresponding trajectories.This in short, describes the RR model.
For the evaluation of TD records, the RR test-day animal model is considered the most appealing statistically.It is often used to fit the RR coefficients in a linear model to obtain genetic parameters and breeding values.There are two approaches to the RR model (RRM): RR on lactation curve functions (e.g. the Wilmink's function) or RR on polynomials or splines.The number of parameters that can be fitted to describe a lactation curve is flexible with the RR where a lactation curve function is used.Jamrozik & Schaeffer (2002) found that the TDM with Legendre polynomials outperformed the TDM with a lactation curve function, considering the same number of parameters in terms of statistics on the goodness of fit.

History of random regression models in dairy cattle genetic evaluation
The general concept of using RR for analysis of covariance in an animal breeding context was suggested by Henderson (1982).Kirkpatrick & Heckman (1989) and Kirkpatrick et al. (1990;1994) introduced the infinite-dimensional model for traits measured repeatedly per individual, and suggested to model genetic covariances of trajectories through covariance functions.However, initial applications of the RRM were in genetic evaluation of dairy cows, using records from individual test-days to model the lactation curve (Schaeffer & Dekkers, 1994;Jamrozik et al., 1997).Since then, the RRM has become a standard for analyses of repeated measured records from animal breeding schemes.Other areas of animal breeding that have already utilized RRM include conformation traits (Uribe et al., 2000), body condition scores (Berry et al., 2003a), feed intake (Veerkamp & Thompson, 1999); growth in pigs (Lorenzo Bermejo, 2003), sheep (Lewis & Brotherstone, 2002) and beef cattle (Nephawe, 2004;Meyer, 2005a); and litter size in pigs (Lukovic et al., 2004).The RRM has also been used for analysis of survival data (Veerkamp et al., 2001) and for assessing genotype by environment interactions using a continuum of an environmental parameter as covariance functions in reaction norms (Strandberg et al., 2000;Calus & Veerkamp, 2003;Berry et al., 2003b;Shariati et al., 2007).

Differences between random regression and fixed regression test -day models
The fixed regression TDM in current use for dairy cattle genetic evaluation in South Africa uses an animal model with test-day records that includes Wilmink's (1987) covariables to describe the general shape of the lactation curve within fixed subclasses for age and season of calving (Mostert et al., 2006b).Contemporary groups include cows tested on the same day within a herd (herd-test date, HTD) which reduces residual variation substantially more than would herd-year-season of calving groups (Ptak & Schaeffer, 1993).Further, the model assumes a standard fixed lactation curve for all cows in the same ageseason subclass, and the estimated additive genetic effects of animals reflect differences in the height of these curves.Thus, differences in lactation persistency are ignored.Correlations between yields at different days in milk are assumed to be the same regardless of time elapsing between test-day measures.The assumption that the variances are homogenous throughout the lactation is difficult to justify.Studies on heterogeneity of variance have been conducted in South Africa.Specifically, it had been discovered that older sire proofs were much higher than for younger bulls with progeny still active in the herds (Mostert et al., 2006a).As a result, the SA fixed regression test-day model incorporates a fixed calving year effect to account for this.However, failure to pre-adjust for heterogeneous variance in test-day models often inflates genetic variances resulting in biased estimated breeding values and lowers their accuracy (Strabel et al., 2006) This is likely due to a set of nonspecified factors in the model equation (e.g.days open, pregnancy status, characteristics of the dry period, body condition at calving, etc.) that make the temporary measurement errors larger and highly variable at the beginning and at the end of the lactation (Lopez-Romero et al., 2003).The reasons for pre-adjustment for heterogeneous variance due to DIM and parity in the South African fixed regression model are twofold; firstly, it is meant to correct the bias due to residual variances being higher in the beginning and end of lactation than in mid-lactation and secondly, it corrects for first lactations having higher residual variances compared to second and third lactations (Mostert et al., 2006a).
A simplified scalar version of the fixed regression model would be: where HTD is the fixed herd TD effect, a is the random additive genetic effect of the cow, p is the random permanent environmental effect associated with each cow and e is the random residual (Swalve, 2000;Jensen, 2001).The lactation curve is modelled using the regression parameters b i , and x i are the corresponding time (days in milk) covariates.
An extension of the fixed regression TDM to a RRM would be desirable in several ways.It will allow for the inclusion of random regression coefficients for the lactation curve for each cow (Henderson, 1982).The lactation curve for an individual cow could be viewed as two sets of regressions on DIM.Fixed regressions for all cows belonging to the same subclass of age-season of calving describe the general shape for that cow, and the random regressions for a cow describe the deviations from the fixed regressions, which allow cows to have differently shaped lactation curves.
A random regressions test-day model (RR-TDM) is an extension of the TDM with fixed regressions.The basic structure of RRM is similar in most applications.The shape of the lactation curve is assumed to be influenced by random genetic and permanent environmental effects.As such, genetic and permanent environmental correlations between yields at different DIM can take values less than one.An added advantage is that the model can accommodate heterogeneous additive genetic and permanent environmental variances during lactation, the degree of which varies according to the regression functions chosen to model the trajectory of lactation.The covariates used in the regression part of TDM are usually functions of the day in lactation when the measurements were made.
In simplified scalar form, the model is: , where y is an observation on an animal belonging to a certain fixed factor grouping at a certain time, HTD the herd-test date effect is independent of the time scale for the observations, ∑b i x i is a linear or nonlinear function or functions that account for the phenotypic trajectory of the average observations across all animals (it accounts for different lactation curve shapes for groups of animals defined by years of birth, parity number, and age and season of calving within parities, for example), a j is the additive genetic effect corresponding to regression coefficient j, x j are the corresponding time covariates, and similarly for the permanent environmental effect subscripted by k, m1 and m2 denote the order of the regression function, e is a random residual effect with mean zero and with possibly different variances for each time or functions of time (Swalve, 2000;Jensen, 2001).The different subscripts indicate that the covariates in different parts of the model are not necessarily the same.When compared with the fixed regression TDM, this corresponds to using regressions to model the additive genetic and the permanent environmental effects.In principle, the covariates x i can be any covariate but are usually relatively simple functions fitted on DIM such as polynomials, orthogonal polynomials (e.g.Legendre polynomials), splines or the parameters of lactation functions proposed by Wood (1967), Ali & Schaeffer (1987) and Wilmink (1987)

Choice of basis functions
Theoretically, any function can be used in RRM as a basis function (Swalve, 2000;Meyer, 2005b).Legendre polynomials are the most common, because the correlations between parameters are lower than with other functions (Kirkpatrick et al., 1990;1994;Van der Werf, 1997).Orthogonal polynomials are able to model lactation curves for a range of covariance structures, but they also have undesirable properties (Misztal, 2006).Fit at the extremes of the trajectories may be poor especially for high orders of fit (Meyer, 2005b) and there may be problems of convergence for large data sets.Several alternatives have been proposed and these include fractional polynomials and linear and B-splines.Fractional polynomials use roots and logs and were advocated for by Robert-Granie´ et al. (2002).Splines are curves constructed from piecewise lower degree polynomials which are joined smoothly at selected points (knots).Splines are readily fitted within the mixed model analyses (Verbyla et al., 1999;Ruppert et al., 2003).White et al. (1999) used cubic splines, while Torres & Quaas (2001) used B-splines with 10 knots in separate RR analyses of test-day records of dairy cows.Too many knots would increase model complexity, while too few knots would reduce accuracy in estimates (Meyer, 2005b).It is important to compare RR models with South African data using lactation curve functions, orthogonal polynomials and splines.

Advantages of random regression models
Advantages of RR test-day models over other approaches of evaluating test-day records are now widely acknowledged (Bohmanova et al., 2008;Hammami et al., 2008): 1.This type of model provides a continuous treatment of observation over time and is able to incorporate heterogeneous variances and covariances among measures along time (including days that were not sampled) with a potentially reduced number of parameters compared with the multiple trait approach (Schaeffer & Dekkers, 1994;Lidaeur et al., 2003).
2. Every record contributes information at the value of the control variable at which it is measured.Arbitrary or inappropriate corrections for the differences in the control variable are therefore rendered useless (Van der Werf, 1997).3.With regards to estimation of variance components, random regression models facilitate parsimonious description of changing and potentially complex covariance structures, thereby utilizing the data more efficiently and generating breeding values of higher accuracies (Jamrozik & Schaeffer, 1997;Meyer, 1998).4. Because the lactation curve is allowed to differ for each cow, this facilitates accounting for the variability in persistency and makes possible the prediction of evaluations for persistency, thereby providing additional information for selection (Jamrozik et al., 1998;Swalve & Gengler, 1999;Lin & Togashi, 2005). 5.The RRM also allows a cow to be evaluated on the basis of any number of TD records during lactation.Related to this, as only eight to 10 TD yields per cow per lactation may be collected, this could result in lower costs of recording (Schaeffer et al., 2000).However, there are issues of accuracy associated with this.EBVs based on one test tend to be of low accuracy.A number of countries require a minimum of three test-day records per lactation for inclusion in genetic evaluation.6.The RRM for TD yields can account more precisely for environmental factors that could affect cows differently during lactation (Schaeffer & Dekkers, 1994).7. Due to emphasis on more yield information, a RRM results in top animals which are less related and hence results in reduced rates of inbreeding compared to lactation models (Mrode & Coffey, 2008).
While being conceptually appealing, practical applications of random regression models in animal breeding have been plagued by problems associated with large numbers of parameters to be estimated, poor polynomial approximation and therefore the necessity of analysing much larger sets of data, implausible estimates at the extremes of trajectories, and associated high computational requirements (Swalve, 2000;Jensen, 2001;Schaeffer, 2004;Meyer, 2005b;Misztal, 2006).

Partitioning variance with random regression model
The first estimates of variance components for test-day milk yields obtained by RRM were published by Jamrozik & Schaeffer (1997).The RRM were used for modelling genetic effects only.Meyer & Hill (1997) and Meyer (1998) demonstrated the use of covariance functions to model additive genetic and permanent environmental effects in random regression TDMs.The covariance function describes the covariance structure of an infinite-dimension character, such as test-day milk yields, as a function of time.The covariance function is equivalent to a RRM if the same functions are used (Meyer & Hill, 1997;Van der Werf et al., 1998).The equivalence of the RRM with the covariance function is useful when analyzing data observed at many time periods, because the number of regression coefficients determines the number of covariances to be estimated for each source of variation in a RRM.In a univariate RRM, k regression coefficients result in k(k+1)/2 covariance estimates.The covariance function is used to reduce the rank of the covariance matrix from n, the number of traits, to k, the number of functions, when starting from a multiple trait approach (Meyer & Fitzpatrick, 2005).
Standard mixed-model-based variance component procedures (i.e.Restricted maximum Likelihood: REML or Bayesian methods based on Markov chain Monte Carlo methodology: MCMC) can be used to estimate covariance functions directly from the data (Jensen, 2001).High computational demands limit the size of the datasets and the nature of the models that can be analyzed using REML, but algorithms for multivariate analyses via AIREML are readily adapted to the estimation of covariances among random regression coefficients (Meyer & Kirkpatrick, 2005).Sorensen & Gianola (2002) noted that Bayesian estimation is now standard for quantitative genetic analyses.Particularly popular are schemes that sample from fully conditional posterior distributions of the parameters of interest.These are computationally easy to implement.Jamrozik (2004) discussed implementation issues of Markov chain Monte Carlo methods for random regression analyses.

Modelling environmental effects in the random regression model
Milk production is influenced by exactly the same environmental factors whether a TDM or lactation model is used in genetic evaluation.However, for a TDM, the stage of lactation is an important consideration, because of the curvilinear relationship that exists between the stage of lactation and milk production (Swalve, 1995;2000).The TDMs often use types of covariates or mathematical functions, in a regression, to account for stage of lactation.Meyer (2005a) and Meyer & Kirkpatrick (2005) noted that the resultant lactation curve parameters can be considered as examples of 'function-valued traits' implying that mathematical functions are in use.
The adoption of TDM over the lactation model replaced the use of herd-year-season (HYS) with herdtest-date (HTD).The HTD accounts for the effects of herd and the year and the season of production whereas HYS effect is commonly used to account for the effects of the individual herd, the year, and the season of calving and the interactions among them.With a TDM, further effects that can be fitted in the analysis include age at calving, parity and pregnancy (Swalve, 1995).
The random regression TDM can account for many environmental factors that could affect cows differently during the lactation (Schaeffer & Deckers, 1994).The lactation curve is split into two parts: a fixed part (average lactation curve) and a random animal specific part (deviation from the average curve).To account for the variability within lactation stage, an appropriate sub-model is fitted on stage of lactation, nested within parts of the model that account for environmental effects.There are profound differences in the manner in which environmental variation is accounted for with RRM in respect to definition of subgroups for fixed regression on the stage of lactation (Zavadilova et al., 2005).Frequently used factors are season of calving and/or classes of age at calving (Reents et al., 1998;Strabel & Misztal, 1999;Lidauer et al., 2000;Schaeffer et al., 2000).Other models used include the effects of days carried calf (Lidauer et al., 2000).For South Africa, it is important to investigate how best the information collected when testing herds can be used in genetic analysis to account for the environmental variation.Mostert et al. (2006b) defined a fixed regression TD-model which passed the necessary trend validation tests required by Interbull to ensure that the model sufficiently accounts for all environmental effects.Such studies can also attempt to recommend inclusion of valuable variables that the current milk recording system ignores or encourage inclusion of some traits such as fertility measures in the routine genetic evaluations.The SA Dairy Animal Improvement Scheme records artificial insemination information.Unfortunately, the participants of the Scheme are still reluctant to participate.

Persistency of lactation
Dairy breeders focus on modelling the individual genetic curves of the cows and estimating genetic parameters of the lactation curves to select for lactation yields or persistency (Shanks et al., 1981;Danell, 1982;Ferris et al., 1985;Gengler, 1996;Jamrozik & Schaeffer, 1997).Although the definition of persistency varies, generally it refers to the rate of decline in production after peak milk yield production has been reached (Swalve & Gengler, 1999).High persistency is associated with a slow rate of decline in production whereas low persistency is associated with a rapid rate of decline.Persistent cows are more desirable because they are more efficient in roughage usage, suffer less metabolic stress due to high peak yield and are thus more disease-resistant (Solkner & Fuchs, 1987).Genetic modification of the lactation curves are concerned with the artificial redistribution of total lactation responses among different stages of the lactation (Lin & Togashi, 2005).In a recent study, Mostert et al. (2008) laid out the framework for inclusion of persistency of lactation in genetic evaluation of South African dairy cattle based on the Canadian Persistency Index.As a result, persistency of production has been implemented in routine genetic evaluations thereby highlighting the economic importance of persistency.
In describing the persistency of milk production during lactation, the choice of a parameter that gives a correct description of the shape of a lactation curve is important.It is therefore important to develop an evaluation method in which genetic differences in persistency can be evaluated on a routine basis.
A key issue in genetic evaluation of persistency is trait definition.Gengler (1995;1996) identified three types of measures of persistency which are: measures based on ratios of yields, measures based on variation of yields and measures developed out of functions that describe lactation yields.There is, however, no clear consensus on how best to mathematically model persistency.The procedure most widely used to measure lactation persistency nowadays is based on the by-product of the random regression test day model.Druet et al. (2005) showed that the first and second eigenvectors of the estimated genetic covariance matrix in a random regression model may serve as proxies for yield and persistency.Use of these eigenvectors in random regression test-day models is computationally advantageous but there is still no clear biological interpretation of the eigenvectors.

Conclusion
Attempts to improve the accuracy of estimated breeding values, reduce the generation interval and boost response to selection for dairy cattle and the quest to provide more comprehensive management information to dairy farmers are stimulating interest in advancing the conceptual framework of the TDM.The RRM approach probably wields the potential to realize these benefits from the South African dairy cattle genetic evaluation programme.Replacing the current TDM with a RRM requires research to demonstrate the benefits.Currently research should be focused on defining the RRM to be implemented, investigating the environmental effects to be included in the model and estimating the covariance structure among observations and genetic parameters for traits to be included in the breeding programme for dairy cattle in South Africa.These are the requisite steps towards adoption of a RRM framework for analysis of dairy TD records.

Table 1
.The results of the genetic evaluations for the South African dairy herd have had a fixed regression model defined by the following parameter sizes shown in Table1(Personal communication: B. Mostert, 2010, ARC, Private Bag X02, Irene 0062, South Africa).Using a random regression model would probably increase the number of dairy cattle evaluated thereby improving the accuracy of estimating their proofs.Size of parameters included in the fixed regression model used in genetic evaluation of South African Holstein cattle from 2007 to 2010 The South African Journal of Animal Science is available online at http://www.sasas.co.za/sajas.asp277