Gene-set enrichment analysis of selective sweeps reveals phenotypic traits in Nguni cattle

Adaptation of animals to different environments is typically associated with structural and functional genomic variations. High throughput SNP genotyping and next-generation sequencing (NGS) have made it possible to study positive selection footprints and adaptation traits. Nguni is a small frame-size breed, mostly horned, and well known for being adapted to diverse South African environmental conditions. This study used previously identified selective sweeps to perform functional analysis of genes related to phenotypic characteristics in Nguni. Two hundred and sixty-four candidate selective sweeps were used for gene-set enrichment analysis in molecular functional categories (KEGG pathways) using the database for annotation, visualization, and integrated discovery (DAVID). In total, 107 genes were identified across all the chromosomes with 74 genes associated with eight phenotype queries, including fat content, milk production, walking ability, heat tolerance, meat production, reproduction, and bone and muscle development. Gene CRHR 2 was associated with meat quality (juiciness and flavour). The IRAK3 gene was associated with decreased body size, feed intake and fatness in cattle , and CARD15 with disease resistance. Gene annotation using phenotype queries identified four genes ( SPI , YWHAZ , RGS4, and RGS5 ) that were associated with myometrial relaxation in cattle. Genes such as NOD2 and IL21R were associated with inflammatory bowel diseases in cattle, whereas CPLS gene was associated with fat content. These genes are important to the phenotypic and adaptive characteristics present in South African Nguni cattle and hold potential for selection for traits of economic importance. ______________________________________________________________________________________ genes were examined for enrichment in molecular functional categories using KEGG with selected phenotype queries such as fat content, milk production, walking ability, heat tolerance, meat production, reproduction, and bone and muscle development. The queries were chosen as desirable traits of adaptation in Nguni cattle. The gene list obtained DAVID.


Introduction
South Africa is the home of five indigenous cattle breeds, namely Afrikaner, Bonsmara, Drakensberger, Nguni, and Tuli. These breeds have played an important role in the history of the country as well as socio-cultural roles in many African societies (Ramsey et al., 2000;Mwai et al., 2015;Mapiye et al., 2019). They survive under harsh local conditions and have lower susceptibility to disease and parasites. In a world threatened by climate change, indigenous/local breeds are sources of irreplaceable genetic materials. They can survive unfavourable conditions posed by drought, extreme heat, and tropical diseases (Hoffmann, 2013). For example, Nguni is well known for its adaptive features, hardiness, and excellent resistance to internal and external parasites with natural immunity to tick-borne diseases (Marufu et al., 2013;Mapholi et al., 2015). It is a small-framed breed compared with most of the other large-framed beef breeds in South Africa, mostly horned, with a variety of vibrant red and black colour patterns, which make the breed unique and easily distinguishable from other breeds (Bester et al., 2003, Tada et al., 2013. Furthermore, Nguni is a multi-purpose breed used for meat, milk, leather, and socio-cultural functions such as dowries, communal feasts, and religious sacrifices (Bettencourt et al., 2015).
Natural and artificial selection played a major role in the phenotypic diversity of domesticated animals, which resulted in the formation of diverse livestock breeds adapted to a wide variety of environments and with special characteristics. The phenotypic variation in these animals has aided mapping of causal genes and the mutations underlying a variety of traits (Casas & Kehrli, 2016;Müller et al., 2017;Jivanji et al., 2019). The genotype-phenotype map can increase the understanding of how changes in the genome can bring about alterations in both quantitative and discrete traits (Wright, 2015). Phenotypes that have been linked to domestication include meat and milk production, fertility, coat colour, decreased fearfulness, social motivation, and mild temperament (Qanbari et al., 2014). Selection that affected these phenotypes has left detectable signatures/sweeps within the genome of modern cattle (Bovine HapMap Consortium, 2009;Qanbari et al., 2014). These sweeps are beneficial mutations that arise and rapidly increase in frequency in the population, and can be detected as reduced local variability, deviations in the site frequency spectrum (SFS), increased linkage disequilibrium and extended haplotype structure (Stella et al., 2010;Carneiro et al., 2011;Qanbari et al., 2014). As a result, these sweeps can be used to scan for genes involved in recent adaptation.
Identification of selection sweeps is currently one of the areas of interest in evolutionary genetics since it can provide insights into the evolutionary processes involved in shaping the genomes of animals as well as functional information about genes and genomic regions (Gouveia et al., 2014). Recent genomic technologies have made it possible to study the genomes of large animals such as cattle to detect selection footprints and understand the genetic architecture of local adaptation, phenotypic variability, and gene function (Ramey et al., 2013;Huber et al., 2014;Naval-Sanchez, 2020). Although techniques for variant detection are now becoming routine, the key question remains for the functions of detected variants, genes, gene products, as well as their interactions and regulation in the genome (Gasperskaja & Kueinskas, 2017).
Growing evidence suggests that livestock breeding could benefit from a deep understanding of the regulatory mechanisms between the genotype and phenotype relationships. The location of these variants can result in phenotype differences, including the level of DNA methylation, transcription factor binding sites, alternative splicing, and protein translation and modification, since the function of variants depends on their location (Zhao et al., 2020). More novel functional variants are being confirmed with bioinformatics analysis, but most of them lack underlying molecular mechanisms (Crouch & Bodmer, 2020). Gene ontology captures statements of how genes function, where in the cell they function, and what biological processes dictate the outcome (Thomas, 2017). Gene-set enrichment and pathway-based analyses have been used to investigate the polygenic background of complex traits, such as leucosis, bull fertility, and meat quality (Abdalla et al., 2016). These enrichment analyses direct the focus from a single gene to a group-based analysis. Findings of these analyses can contribute to an improved understanding of the genetic and biological architecture of complex traits (Visscher et al., 2017).
Few studies have identified genomic regions and variants in selective sweeps and copy number variant (CNV) regions in South African indigenous cattle using bovine SNP array (Makina et al., 2015;Wang et al., 2015) and whole genome sequencing (Zwane et al., 2019). These studies provide the foundation for understanding genetic variation between the indigenous breeds and identifying genes of economic importance. The occurrence of selection has changed patterns of variation such that each form of selection causes specific changes in selected loci and in neutral loci linked to them (Kreitman, 2000;Gouveia et al., 2014). Locating these changes may enhance our understanding of the variations associated with adaptation in local breeds. The aim of this study was to perform the functional enrichment analysis to identify genes pertaining phenotypic variation in South African Nguni cattle using selective sweeps identified by Zwane et al. (2019).

Materials and Methods
Selective sweeps for Nguni cattle analysed in this study were identified in previous research (Zwane et al., 2019). In the study, 30 animals from Nguni were sequenced at 30x coverage to discover variants using Illumina HiSeq 2500 (Animal ethics approval EC: S4285-15 of University of Pretoria). Selective sweep regions were identified using the Z-transformed pooled heterozygosity (ZHp) scores computed from a 50% overlapping sliding window approach with 150 kb windows of SNP distribution. A total of 264 candidate selective sweep regions (ZHp Z-scores ≤ −4) were confirmed and used for gene enrichment analysis. Animal quantitative trait loci database (QTLdb) was used to retrieve and visualize the QTL information within the selective sweep regions as it provides tools for aligning various genome features to QTLs and enables comparison of QTL results within species and across experiments (Hu et al., 2013).
A database for annotation, visualisation, and integrated discovery (DAVID) v6.8 tool (Huang et al., 2008;Maiorano et al., 2018) was used to identify significant (P < 0.05) gene ontology (GO) terms and KEGG (Kyoto encyclopedia of genes and genomes) pathways using a gene list with significant SNPs based on ZHp Z-scores. The GO database defines biological descriptors to genes based on the features of their encoded products, and is partitioned into three components, that is, biological process, molecular function, and cellular component. The KEGG pathway database contains metabolic and regulatory pathways, representing the actual knowledge on molecular interactions and reaction networks (Azizpour et al., 2020). The gene ID list was converted into gene names and symbols using DAVID for annotation.
The genes were examined for enrichment in molecular functional categories using KEGG pathways (a module in DAVID) with selected phenotype queries such as fat content, milk production, walking ability, heat tolerance, meat production, reproduction, and bone and muscle development. The queries were chosen as desirable traits of adaptation in Nguni cattle. The gene list with enriched functional annotations and their statistical P-values was obtained from DAVID. The P-values were calculated and adjusted to increase the statistical significance (from the default 0.025 to 0.05) using Benjamini-Hochberg method. A P-value of 0.05 was used as a criterion cutoff, whereby gene sets with P-values less than 0.05 were considered significantly enriched (Nguyen et al., 2019;Jafari & Ansari-Pour, 2019). All genes were associated to one or multiple process, function, and component annotations (Berardini et al., 2012). Nearly all of the genes were assigned to a GO term. The fold enrichment was used to select the top-listed gene GO terms associated with identified genes, with significance of P <0.05. The false discovery rate (FDR) was used to determine significance of enrichment or overrepresentation of terms for each annotation (Lewin & Grieve, 2006).

Results and discussion
Bovine functional genomics tools have enhanced the discovery of genes underlying important phenotypic traits and their implementation in livestock selection programmes. This study used previously identified selective sweep regions to perform enrichment analysis to identify genes associated with phenotypes in South African Nguni cattle. Nguni possess desirable phenotypic and adaptive characteristics such as tolerance to heat, drought, diseases, and parasites (Mapholi et al., 2017;Mapiye et al., 2019), ability to reproduce under low input systems (Scholtz & Theunissen, 2010), grazing ability, and good milk and beef quality (Marufu et al., 2011). These characteristics require understanding of the genes involved, which may be important in sustainable production of Nguni and other local breeds in changing environments in view of climate change. Figure 1A indicates the 264 selective sweep regions used in this study, and the heterozygosity levels in all chromosomes are displayed in Figure 1B. The regions with the most significant selective sweeps were identified in chromosome 8 and 19. Genomic regions subjected to high selective pressures can display signatures such as reduced nucleotide diversity, stretches of homozygous loci, shifted site frequency spectrum and reduced recombination rate, and a number of selective sweeps can be identified in these regions (Aslam et al., 2014;Purfield et al., 2017). The lowest heterozygosity in the selective sweep regions was observed in chromosome 8, 19, and 27 ( Figure 1B). Heterozygosity shows the genetic variation in a population. However, studies indicated that a hard/soft sweep would leave a footprint of reduced heterozygosity in the area of the genome subjected to selection (Oleksyk et al., 2008;Burke, 2012;Purfield et al., 2017). Therefore, identification of selective sweeps in regions of low/reduced heterozygosity could aid the discovery of novel genes subjected to selection.
In this study, 13 selective sweeps with low heterozygosity were identified in chromosomes 2, 4, 5, 8, 17, 19, and 27 ( Figure 1B). These regions had the most significant selective sweeps (ZHp Z-scores ≤ -7), showing the presence of possible strong/hard sweeps. Supplementary Table 1 shows the most significant selective sweep regions (ZHp Z-scores ≤ -6) identified in Nguni. Forty-seven (47) putative selective sweeps were identified. However, some regions were unknown with little or no QTL information. Druet et al. (2013) used a hidden Markov model to identify selective sweeps with long stretches of reduced heterozygosity in cattle. They identified genes associated with coat colour and horn development in these regions. Analysis of pooled genome sequences from Djallonke and Sahelian sheep in Ghana revealed co-localization of regions of reduced heterozygosity with candidate genes for disease resistance and adaptation to a tropical environment (Yaro et al., 2019). Strong selective pressure can lead to an increase in the frequencies of favourable alleles, leading to a decreased genetic variation in the genes under selection. Owing to the physical association of the locus with nearby loci (genetic linkage), genetic variants located in the loci may also suffer a decrease in genetic variability (Aramburu et al., 2020). Soft selective sweeps (where more than one allele is targeted by selection) are much harder to detect owing to their resistance to heterozygosity decrease (Sun et al., 2014;Aramburu et al., 2020). Strong selective sweeps are easier to detect since they are determined by the increase in the frequency of a specific allele (in some cases leading to complete fixation). These are therefore useful for detection of candidate genes responsible for important traits in response to selection (Sun et al., 2014;Aramburu et al., 2020). Gene annotation of 264 selective sweeps identified 107 genes located in 29 chromosomes. These genes were previously associated with various functions in cattle, human and mice. Most of the gene associations were conducted in model animals such as mice, humans, and zebra fish rather than in cattle. However, the gene functions are mostly the same because of DNA similarities between humans and cattle (Zimin et al., 2009, Costilla et al., 2020. Previous studies discovered more than 200 homologous blocks of DNA between the two species (human and cattle), which could also enable the mapping of human to cattle sequences (Everts-van der Wind et al., 2005;Zimin et al., 2009;Larkin, 2011). Table 1 shows the top 10 significant genes (P <0.05) identified using DAVID. Genes such as corticotropin-releasing hormone receptor 2 (CRHR 2 ), nucleotide binding oligomerization domain containing 2 (NOD 2 ), tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein zeta (YWHAZ), interleukin 1 receptor associated kinase 3 (IRAK3) and interleukin 21 receptor (IL21R) were high-ranking genes identified in this study. These are protein-coding genes most viable in human, mouse, cattle, and dogs (Luisi et al., 2015). The CRHR 2 gene in chromosome 4 was associated with insulin sensitivity in this analysis. This gene has been associated with juiciness and flavour in cattle (Jiang et al., 2009). Together with CRHR 1 , the CRHR 2 gene is expressed in the bovine adrenal gland and plays a role in the thyroid physiological function through autocrine/paracrine mechanisms (Squillacioti et al., 2012). The presence of this gene in Nguni cattle suggest the same regulatory pattern based on urocortin-like immunoreactivity (UCN-IR) and CRHR 2 -IR that is found in the thyroid follicular and parafollicular cells (Squillacioti et al., 2012). In pigs, CRHR 2 was found in skeletal smooth and cardiac muscles, and in the brain (Choi et al., 2017). This gene has been associated with meat quality on Duroc x Pietrain populations (Casiro et al., 2017).
The NOD 2 gene , also known as caspase recruitment domain-containing protein 15 (CARD15), is a pattern recognition receptor for bacterial lipopolysaccharides, which release inflammatory mediators that recognize the muramyl dipeptide and activate NF-kb, which controls transcription of tumour necrosis factorα, interferon-γ, interleukin-1β and interleukin-12 (Wang et al., 2015). The CARD15 protein contributes to immune response of cells and plays an important role in antibody production.
CARD15 gene variations are associated with diseases such as bovine tuberculosis, Crohn's disease (inflammatory bowel disease), and paratuberculosis (Pinedo et al., 2009;Küpper et al., 2014;Wang et al., 2015). The gene's relationship to diseases and its conservation between species suggest that it may have a conserved role in bovine disease resistance. Comparative analysis of the CARD15 gene between the bovine, mouse and human revealed a high level of conservation of this gene in sequence, genomic structure and protein domains between species (Taylor et al., 2006).
The CARD15 gene polymorphisms were associated with susceptibility to tuberculosis in Chinese Holstein cattle and haplotype markers such as TGGACA and CAGACA were identified (Wang et al., 2015). This showed that genetic markers for this gene can be identified in local breeds and used in breeding animals with high resistance to bovine tuberculosis, as was demonstrated in the Chinese Holstein cattle (Wang et al., 2015).
Parasitic diseases are major challenges to cattle production in South Africa (Katikati & Fourie, 2019). With no effective vaccination programmes available against most parasitic diseases, significant morbidity and mortality in the developing countries remain limitations to effective sustainable prediction. Based on human studies, the IL21R gene is highly expressed in parasitized organs of infected individuals and in murine models of parasitic diseases (Solaymani-Mohammadi et al., 2019). Nguni cattle are known to be less susceptible to parasite diseases that affect livestock in South Africa, such as ticks and tick-borne diseases (Mapholi et al., 2017). The presence of IL21R gene in Nguni could assist in the enhancement of their innate immunity. IL21R is expressed at high levels on intestinal epithelial cells and stomal fibroblasts in inflammatory bowel diseases. IL21 induces macrophage inflammatory protein-3 alpha (MIP-3alpha), a T-cell chemoattractant in epithelial cells, and has been signalling between epithelial and immune cells in the gut (Caruso et al., 2007).
The IRAK3 and specific protein 1 (SP1) genes have been associated with decreased body size in this analysis, as shown in Table 1. IRAK3 is a cytoplasmic homeostatic mediator of inflammatory responses, which is potentially useful as a prognostic marker in inflammation (Freihat et al., 2019). It inhibits signalling cascades downstream of myddosome complexes associated with toll-like receptors and contains a death domain that interacts with other IRAK family members involved with tumour necrosis factor receptor associated factor 6 (TRAF6) (Freihat et al., 2019). The IRAK3 was identified within large selective sweep spanning 430 kb on Bos taurus chromosome 5 (BTA5) (Naval-Sanchez et al., 2020), the largest region under selection, which also displayed the largest difference between indicine and taurine cattle. The SP1 transcriptional factor has a putative binding site in cattle and has shown numerous polymorphisms in the promoter region of the bovine leptin gene. Polymorphisms in SP1 gene have been associated with production traits such as feed intake and fatness in cattle (Adamowicz et al., 2006). The presence of these genes in Nguni could be associated with their thin covering of fat compared with other South African beef breeds (Chulayo & Muchenje, 2016). Nguni can fatten to high body mass on natural grazing, producing good quality carcasses, an even distribution of fat, and excellent marbling (Maciel et al., 2013;Katikati, 2017). The regulator of G-protein signalling 5 (RGS5) gene was associated with decreased body fat, which is an observed characteristic in Nguni cattle (Tada et al., 2013). These genes have been identified in various studies in cattle and have biological functions related to milk production, walking ability, bone and muscle development, heat tolerance, reproduction, and fat (Zimin et al., 2009;Underwood et al., 2015;Weldenegodguad et al., 2019;Leal-Gutiérrez et al., 2019;Rebl et al., 2019;Kopke et al., 2020).
Enrichment analysis of genes was performed using phenotype queries such as milk production, walking ability, bone and muscle development, heat tolerance, reproduction and fat. Figure 2 shows the top 10 genes (P < 0.05) that were associated with the phenotype queries as identified in Nguni. Genes were mostly associated with walking ability, fat digestion, bone and muscle development, reproduction and heat tolerance. Some of these genes were identified as highly significant (P >0.025) in Table 1, showing their significant molecular functions. These genes were previously associated with phenotypes in Bos taurus (Weldenegodguad et al., 2019) and species such as mice, chicken and other cattle breeds. Genes such as NOD2 and IL21R were associated with inflammatory bowel disease (IBD), which has been reported to be common in all ruminants, and pasteurized milk could be the potential source of IBD in humans and in dairy cows. It is characterized by numerous acid-fast Mycobacterium avium paratuberculosis (MAP) (McNees et al., 2016), a real threat to both cattle and human beings, and can survive in contaminated pastures for a year, remain viable after milk pasteurization, and resists the ripening process of cheese (Fawzy et al., 2013).
The colipase (CLPS) gene identified in this study was associated with fat digestion in Nguni cattle. It consists of enterostatin, which has a biological activity as a satiety signal, and is known to suppress intake of a high fat diet in rat (Lin et al., 2007). In pigs, the CLPS gene was located in chromosome 7 and considered a candidate gene for QTL, which affects carcass fatness (Jankowiak et al., 2008). Figure 2A illustrated that four genes were associated with myometrial relaxation and contraction pathways in Nguni cattle, namely SP1, YWHAZ, RGS4, and RGS5. These genes were associated with fat digestion, bone and muscle development, reproduction, and walking ability in this study, which confirmed earlier results. This was in line with the observable phenotypes in Nguni cattle such as walking ability (Matjuda, 2012). The RGS5 gene was previously associated with ruminal vascularity and lean growth (Kern et al., 2017). In humans, this gene is transcribed in myometrial muscle associated with contractile activity and muscle relaxation during pregnancy, and acts on the myometrium to regulate muscle contraction. This was in line with the studies by Gunaje et al. (2011) andDaniel et al. (2016), which showed that the regulators of G protein signalling genes (RGS4 and RGS4) were critical mediators of vascular smooth muscle contraction and potentially arterial remodelling. The RGS5 gene has also been associated with heat tolerance in cattle (Stronen et al., 2019). More information on gene molecular functions and biological processes is given in Supplementary Table 2.
In Table 2 the top GO terms (P <0.05) with molecular functions in Nguni cattle are summarized. The GO enrichment analysis revealed three molecular functions associated with kinase activity (GO:0004674~protein serine/threonine kinase activity, GO:0004672~protein kinase activity, and GO:0016301~kinase activity). The protein kinase activity has been reported to be negatively associated with intramuscular fat content in longissimus dorsi muscle of beef cattle (Underwood et al., 2008). This could explain the lacking of marbling in meat produced by Nguni cattle (Chulayo et al., 2016). Two GO molecular processes related to activatory and regulatory activity (GO:0005096~GTPase activator activity; GO:0008047~enzyme activator activity; GO:0030695~GTPase regulator activity; GO:0060589~nucleosidetriphosphatase regulator activity, respectively) were also identified.

Figure 2
Top 10 genes that were highly associated with cattle phenotypes in Nguni based on their enrichment scores The GTPase activator activity (GO:0005096) is a family of regulatory proteins that can bind to activate G-proteins and stimulate their essential GTPase activity. The G-protein is involved in important cellular processes and physiological functions such as senses of light, taste and smell, neurotransmission, metabolism, endocrine and exocrine secretion (Syrovatkina et al., 2016;Wang, 2018). The phosphotransferase activity orthology related term GO:0016773 (alcohol groupacceptor) was also enriched. This term has been enriched significantly for bovine digital dermatitism, which is a foot disease related to several pains and discomfort and other problems that are prevalent in cattle kept indoors and to lameness in dairy cows (Kopke et al., 2020, Salano et al., 2016. Table 2 also shows that many of the significantly enriched terms were related to gene kinase and protein kinase activity and transcription. The GO terms were significantly enriched for different genome functions in Nguni cattle. Among the highest ranking molecular functions was kinase activity, in which genes such as IRAK3, PRKAA1, YWHAZ and RGS5 were identified. Here the importance of IRAK3 gene in immune response to infections is emphasized (Rebl et al., 2019), which could explain Nguni cattle being mostly adapted and resitant to a wide range of diseases (Mapholi et al., 2017), and PRKAA1 gene as a tumor supressor in mice (Vara-Ciruelos et al., 2019). McKay et al. (2018) reported conserved DNA mythylation levels and patterns in PRKAA1 and PRKAB1 genes (also known as 5' AMP-activated protein kinase (AMPK) genes) at various methylation sites in cattle breeds including bison, and highlighted the need to maintain the functioning of these genes as vital for biological functions. YWHAZ has been identified as the most stable gene in bovine neutrophils (Crookenden et al., 2017) and RGS5 has been associated with feed efficiency in cattle (Kern et al., 2017). These genes are most significant in South African Nguni cattle.

Conclusion
Gene-set enrichment and functional annotation were used successfully to associate selective sweeps with phenotypic variation in Nguni cattle. Genes identified in this study were highly associated with molecular functions that may underlie phenotypic and adaptive characteristics in Nguni cattle. These include genes that were associated with body size, feed intake, meat quality, heat tolerance, and disease resistance. These results provide the basis for understanding genetic mechanisms involved in economic traits and adaptation to local conditions. Further studies are needed on signalling pathways and expression of these genes. Similar studies are needed for other indigenous breeds as efforts to characterize them genetically for improvement and enhanced food production in the era of climate change.