SEQUENCE ANALYSIS OF MATURASE K (MATK): A CHLOROPLAST-ENCODING GENE IN SOME SELECTED PULSES

The application of sequence data has proved very informative for characterizing crop species and resolving their phylogenetic relationships. This study used bioinformatics tools to characterize the matK gene in some selected legumes, with the pigeon pea [Cajanus cajan (L.) Millsp.] matK sequence as the query sequence. Nucleotide and amino acid sequences of the matK gene of 10 legumes were retrieved from the NCBI database and analysed for homology, physicochemical properties, motifs, GC content and phylogenetic relationships. The nucleotide and amino acid sequence lengths of the gene differed among the selected legumes: nucleotide length varied between 631 and 1580 bp, while amino acid sequences varied between 21 and 509 residues. The P. tetragonolobus and C. cajan matK sequences had a percentage identity of 88%, while V. sativa had the lowest percentage identity (70%). The G. tomentella and P. tetragonolobus matK sequences shared the same percentage similarity (91%) with C. cajan, while V. sativa had the lowest (78%). The motifs predicted were the tyrosine kinase phosphorylation site, N-myristoylation site, N-glycosylation site, protein kinase phosphorylation site, casein kinase II phosphorylation site and cAMP- and cGMP-dependent protein kinase phosphorylation site. However, the microbodies C-terminal targeting site was predicted only in the amino acid sequences of the matK gene of P. sativum and C. cajan. Phylogenetically, two major clades were revealed, with the P. sativum, V. sativa and C. arietinum matK gene sequences in clade A and those of P. tetragonolobus, C. cajan, G. tomentella, P. vulgaris, V. unguiculata, V. angularis and V. radiata in clade B. Clade A diverged from the ancestral legume approximately 39 MYA, while the legume sequences in clade B diverged from the ancestor about 57 MYA. The GC content of the matK nucleotide sequence was highest in V. sativa (31.37%), with the range in the selected legumes varying between 27.29% and 31.37%. The secondary structure of the amino acid sequences of the matK gene in the selected legumes comprised alpha helix (34.14%-41.27%), extended strand (11.56%-20.99%) and random coil (39.48%-51.76%). The major domain architectures found in the amino acid sequences were of the single and double types. Although the maturase K gene sequences in the selected legumes differ in length, physicochemical properties, GC content and motifs, the results of this study revealed that the C. cajan matK gene sequence is closely related to that of P. tetragonolobus but distant from those of V. unguiculata and P. vulgaris.


INTRODUCTION
The recent upsurge in the application of molecular sequence data to systematic and evolutionary questions has contributed significantly to the effective classification of both plants and animals. Many chloroplast, mitochondrial and nuclear genes have been used to study sequence variation and evolutionary trends at the genus level (Clark et al., 1995; Hsiao et al., 1999). Previously, sequences of the rbcL gene were the most frequently analysed by researchers seeking to understand plant systematics beyond the family level (Donoghue et al., 1992; Chase et al., 1993; Duval et al., 1993). However, the maturase K (matK) gene, formerly known as orfK, has emerged as a gene of interest in plant molecular systematics and evolution because of its rapid evolution at both the nucleotide and the corresponding amino acid levels (Johnson and Soltis, 1995; Liang and Hilu, 1996; Miller et al., 2006).
Because of its rapid rates of substitution, the occasional presence of frame-shift indels and a few cases of premature stop codons, it has been suggested that matK may not be functional in some plants (Kores et al., 2000; Whitten et al., 2000; Kugita et al., 2003; Hidulgo et al., 2004; Jankowiak et al., 2004). It has, however, been observed that the RNA transcripts of trnK, trnA, trnI, rps12, rpl2 and atpF require MATK for intron excision (Jenkins et al., 1997; Vogel et al., 1999). The tRNA or protein products of these genes are required for normal chloroplast function, including photosynthesis, which implies that MATK has an essential function in the chloroplast as a post-transcriptional splicing factor (Michelle et al., 2007).
Phylogenetic analysis of a data set composed of matK, rbcL and trnT-F sequences from basal angiosperms has been reported to show that matK contributes more parsimony-informative characters, and significantly more phylogenetic structure on average per parsimony-informative site, than the highly conserved chloroplast gene rbcL. The chloroplast matK gene has important features that underscore its usefulness in plant molecular systematics and evolution, chief among them its fast evolutionary rate. According to Johnson and Soltis (1994) and Olmstead and Palmer (1994), the rate of nucleotide substitution in matK is three times higher than that of the gene for the large subunit of Rubisco (rbcL), and its amino acid substitution rate is six-fold higher, which marks it as a fast-evolving gene. This property of matK also provides a high phylogenetic signal for resolving evolutionary relationships among plants at all taxonomic levels (Soltis and Soltis, 1998). Maturase K is a chloroplast-encoded gene nested between the 5' and 3' exons of trnK (tRNA-lysine) (Sugita et al., 1985) in the large single-copy region of the chloroplast genome (Steane, 2005; Daniell et al., 2006; Turnmel et al., 2006). Maturases are enzymes that catalyse the removal of non-autocatalytic introns from premature RNAs. The importance of the legume family cannot be overemphasized, especially in mitigating protein deficiency in the rural population, which makes up more than 60% of the entire population in most sub-Saharan African countries, including Nigeria. This study aimed to use bioinformatics tools to characterize the matK gene in some selected legumes.

Retrieval of nucleotides and amino acid sequences
The nucleotide and amino acid sequences of maturase K (matK) of C. arietinum (chickpea), V. unguiculata (cowpea), C. cajan (pigeon pea), P. vulgaris (common bean), Pisum sativum (garden pea), Psophocarpus tetragonolobus (winged bean), Vicia sativa (tare vetch), Vigna radiata (mung bean), Vigna angularis (adzuki bean) and Glycine tomentella were downloaded in FASTA format from the GenBank database of the National Center for Biotechnology Information (NCBI, USA). The accession numbers, gene names, sequence lengths and crop names of the retrieved sequences were recorded. Pairwise and multiple sequence alignments of all retrieved sequences were carried out using MEGA 6 software as modified by Thompson et al. (2014).
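The retrieval step yields FASTA files whose record lengths are reported later in the paper. As a minimal sketch of how such lengths could be tabulated, the following reads FASTA text using only the Python standard library. The two records shown are hypothetical placeholders, not the sequences used in this study, and the function name is an assumption made for illustration.

```python
from io import StringIO


def read_fasta(handle):
    """Parse FASTA text into a dict mapping each header line to its sequence."""
    records = {}
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; drop the leading ">" from the header.
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them later.
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


# Hypothetical two-record input; real input would be the file saved from NCBI.
example = StringIO(
    ">seq1 hypothetical matK fragment\nATGGAA\nGAATTT\n"
    ">seq2 hypothetical matK fragment\nATGAAA\n"
)
records = read_fasta(example)
lengths = {header.split()[0]: len(seq) for header, seq in records.items()}
print(lengths)
```

The same function works on an open file object, so a downloaded file could be passed in place of the in-memory example.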

Determination of percent identity and similarity (homology)
Percentage identity and similarity among the nucleotide and amino acid sequences of the retrieved maturase K (matK) genes of the selected pulses were determined using the option for comparing more than two sequences in the Basic Local Alignment Search Tool (BLAST). The nucleotide and amino acid sequences of C. cajan were used as the query sequences, because C. cajan is reported to be tolerant of abiotic stress and highly adaptable.
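To make the two measures concrete, the sketch below computes percent identity and percent similarity for one pair of already aligned protein sequences. It is an illustration under stated assumptions, not the BLAST procedure: BLAST scores similarity with a substitution matrix, whereas here "similar" means membership in the same simplified residue group, and percentages are taken over non-gap columns only. The group definitions and the example sequences are hypothetical.

```python
# Simplified groups of chemically similar amino acids. This grouping is an
# assumption for illustration; BLAST uses a substitution matrix instead.
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "NQ", "ST", "AG"]


def identity_and_similarity(query, subject):
    """Return (% identity, % similarity) over aligned, non-gap columns."""
    if len(query) != len(subject):
        raise ValueError("sequences must be aligned to equal length")
    # Keep only columns where neither sequence has a gap character.
    columns = [(q, s) for q, s in zip(query, subject) if q != "-" and s != "-"]
    identical = sum(q == s for q, s in columns)
    similar = sum(
        q == s or any(q in group and s in group for group in SIMILAR_GROUPS)
        for q, s in columns
    )
    n = len(columns)
    return 100.0 * identical / n, 100.0 * similar / n


# Hypothetical aligned fragments, with "-" marking an alignment gap.
ident, sim = identity_and_similarity("MEEL-KRY", "MEDLQKKF")
print(f"identity = {ident:.2f}%, similarity = {sim:.2f}%")
```

Because similarity counts identical columns as well as conservative replacements, it is always at least as large as identity, which matches the pattern reported in the results (for example 88% identity against 91% similarity).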

Determination of physico-chemical properties of amino acid sequences of matK gene of selected pulses
The physico-chemical properties of the MatK proteins of the 10 leguminous species were determined using the Expert Protein Analysis System (ExPASy) (www.EXPASY.org). ProtParam was selected from the protein characterization and function options under the tools menu, and the FASTA-formatted amino acid sequences were submitted for analysis of sequence and physicochemical properties. The properties analysed with ProtParam were theoretical pI, molecular weight, number of amino acid residues, amino acid and atomic composition, instability index, extinction coefficient, aliphatic index and hydropathicity.
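Three of the listed properties follow from simple published formulas and can be sketched directly. The code below computes the grand average of hydropathicity (GRAVY) from the Kyte-Doolittle scale, the aliphatic index, and the counts of negatively (Asp + Glu) and positively (Arg + Lys) charged residues. It is a minimal sketch, not the ProtParam implementation; the function names and the short example peptide are assumptions made for illustration.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """Grand average of hydropathicity: mean Kyte-Doolittle value per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(seq):
    """Aliphatic index: X(Ala) + 2.9*X(Val) + 3.9*(X(Ile) + X(Leu)), X in mole %."""
    def mole_percent(aa):
        return 100.0 * seq.count(aa) / len(seq)

    return (
        mole_percent("A")
        + 2.9 * mole_percent("V")
        + 3.9 * (mole_percent("I") + mole_percent("L"))
    )


def charged_residues(seq):
    """Return (negatively charged, positively charged) residue counts."""
    negative = seq.count("D") + seq.count("E")
    positive = seq.count("R") + seq.count("K")
    return negative, positive


# Hypothetical six-residue peptide, used only to exercise the functions.
peptide = "MKLVAE"
print(round(gravy(peptide), 3), round(aliphatic_index(peptide), 1), charged_residues(peptide))
```

A positive GRAVY value indicates a relatively hydrophobic sequence and a negative value a relatively hydrophilic one, which is the criterion applied in the discussion.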

Determination of predicted protein motifs and structures for MatK genes
The motifs in the amino acid sequences of the MatK proteins of the selected pulses were predicted using the ScanProsite tool (http://prosite.expasy.org/scanprosite/). FASTA-formatted protein sequences were scanned at high sensitivity.
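Short PROSITE patterns of the kind reported here can be expressed as regular expressions, which shows why the same motif may occur at different positions in different sequences. The sketch below scans a protein sequence for four such patterns. It is an illustration, not the ScanProsite engine; reading the paper's "protein kinase phosphorylation site" as the protein kinase C pattern is an assumption, and the test sequence is hypothetical.

```python
import re

# PROSITE-style patterns written as regular expressions.
# Treating "protein kinase phosphorylation site" as the protein kinase C
# pattern is an assumption made for this sketch.
PATTERNS = {
    "N-glycosylation site": r"N[^P][ST][^P]",
    "Casein kinase II phosphorylation site": r"[ST].{2}[DE]",
    "Protein kinase C phosphorylation site": r"[ST].[RK]",
    "cAMP- and cGMP-dependent protein kinase phosphorylation site": r"[RK]{2}.[ST]",
}


def scan_motifs(seq):
    """Return {motif name: [(1-based start position, matched residues), ...]}."""
    hits = {}
    for name, pattern in PATTERNS.items():
        # A lookahead with a capture group reports overlapping matches too.
        regex = re.compile(f"(?=({pattern}))")
        hits[name] = [(m.start() + 1, m.group(1)) for m in regex.finditer(seq)]
    return hits


# Hypothetical 13-residue sequence used only to exercise the patterns.
hits = scan_motifs("MNGSAKRKSTLEE")
for name, found in hits.items():
    print(name, found)
```

Patterns this short match frequently by chance, so hits of this kind are best read as candidate sites rather than confirmed modifications.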

Prediction of secondary and tertiary protein structures
Secondary structure was predicted using the GOR IV software as modified by Garnier et al. (2015). The predicted tertiary (3D) structure of the MatK proteins was obtained by submitting the amino acid sequences earlier retrieved from the NCBI database to the Phyre2 software (http://www.sbg.bio.ic.ac.uk/phyre2/html/page.cgi?id=index) as modified by Kelley and Sternberg (2009).

Determination of start and end codons, and GC content of matK genes of the selected legume species
The start and end codons of the matK genes of the selected leguminous species (putative region) were determined using the GENSCAN software (http://genes.mit.edu/GENSCAN.html) as modified by Burge (2011). GENSCAN was also used to determine the guanine-cytosine (G-C) content of the nucleotide sequence of each leguminous species.
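G-C content is simply the percentage of G and C bases among the unambiguous bases of a sequence. As a minimal sketch of that calculation (not GENSCAN's own routine), the function below ignores gaps and ambiguity codes such as N; that handling and the example sequences are assumptions made for illustration.

```python
def gc_content(seq):
    """Percentage of G and C among the A, C, G and T bases of a sequence."""
    # Ignore gaps and ambiguity codes so they do not dilute the percentage.
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(base in "GC" for base in counted)
    return 100.0 * gc / len(counted)


# Hypothetical fragments, used only to exercise the function.
print(gc_content("ATGCGCAT"))
print(gc_content("atgnnn-at"))
```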

Determination of domain architecture of amino acid sequences of MatK gene in the selected legumes
The domain architecture of the amino acid sequences of the matK gene in the selected legumes was determined using the ExPASy ScanProsite tool online (http://prosite.expasy.org/scanprosite/), in which the query amino acid sequences are scanned for domain architecture.

Determination of phylogenetic and evolutionary history of matK genes
The phylogenetic analysis and evolutionary history were determined using the Molecular Evolutionary Genetics Analysis (MEGA 6) software. The phylogenetic tree of the selected legumes was constructed with the maximum likelihood option from the MEGA-aligned nucleotide sequences retrieved from the NCBI database. The evolutionary history was traced using the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) based on the Jones-Taylor-Thornton (JTT) matrix-based model. The reliability of the inferred phylogenetic tree was evaluated by bootstrap analysis with 1000 replications. The time of divergence of the matK genes of the legumes was estimated from the percentage of nucleotide substitutions per site.
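The UPGMA step and the conversion of substitutions per site into a divergence time can be sketched in a few lines. The code below is an illustration under stated assumptions, not the MEGA 6 procedure: it uses the uncorrected proportion of differing sites (p-distance) in place of a model-based distance, the substitution rate passed to divergence_time is a hypothetical placeholder, and the three aligned sequences are invented for the example.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences, gaps skipped."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def upgma(aligned):
    """Cluster aligned sequences by UPGMA.

    aligned maps names to aligned sequences. Returns (tree, root_height), where
    tree is a nested tuple of names and root_height is in substitutions per site.
    """
    size = {name: 1 for name in aligned}  # cluster -> number of leaves
    dist = {
        frozenset((a, b)): p_distance(aligned[a], aligned[b])
        for a, b in combinations(aligned, 2)
    }
    height = 0.0
    while len(size) > 1:
        # Join the closest pair; the new node sits at half their distance.
        pair = min(dist, key=dist.get)
        a, b = tuple(pair)
        height = dist.pop(pair) / 2
        merged = (a, b)
        for c in size:
            if c == a or c == b:
                continue
            d_ac = dist.pop(frozenset((a, c)))
            d_bc = dist.pop(frozenset((b, c)))
            # Size-weighted mean keeps every leaf equally weighted.
            dist[frozenset((merged, c))] = (
                size[a] * d_ac + size[b] * d_bc
            ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size)), height


def divergence_time(node_height, rate_per_site_per_year):
    """Molecular-clock estimate: node height divided by the per-lineage rate."""
    return node_height / rate_per_site_per_year


# Hypothetical aligned sequences and a hypothetical rate of 1e-9 per site per year.
tree, root_height = upgma({"A": "AAAAAAAA", "B": "AAAAAAAT", "C": "TTTTAAAT"})
years = divergence_time(root_height, 1e-9)
print(tree, root_height, years)
```

UPGMA assumes a constant rate across lineages, which is what allows node heights to be read as times; the estimate is only as reliable as the calibration rate supplied.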

Retrieval of nucleotide and amino acid sequences
The nucleotide sequence lengths of the matK gene ranged from 631 to 1580 bp, while the amino acid sequence lengths ranged from 21 to 509 residues. The nucleotide sequences of the matK genes of P. tetragonolobus, G. tomentella, C. arietinum and V. sativa were the longest, while the P. sativum sequence was the shortest (641 bp). The same trend was observed for the amino acid sequence lengths, which may stem from the fact that the matK gene sequences of the legumes with longer lengths have been completely sequenced while the others have only partial CDS.

Determination of percentage identity and similarity for amino acid and nucleotide sequences
Taking C. cajan as the standard, the highest percentage identity was observed in the matK gene of P. tetragonolobus (88%), while the least identical species was V. sativa (70%) (Table 3). The highest similarity in the matK genes was shared by G. tomentella and P. tetragonolobus (91%). The lowest similarity was observed in V. sativa, which showed 78% similarity with C. cajan. For the nucleotide sequences, with C. cajan as the reference crop, P. tetragonolobus, G. tomentella, P. vulgaris and V. angularis had sequence identity greater than 90%, while C. arietinum, V. sativa and P. sativum had identity with C. cajan greater than 80% (Tables 2 & 3).

Physicochemical properties
The number of amino acid residues in the MatK protein ranged from 199 to 509, with G. tomentella, V. angularis, C. arietinum and P. sativum having more than 500 residues while V. unguiculata had the fewest (199 residues). The molecular weights for G. tomentella, V. angularis, C. arietinum and P. sativum were greater than 60000 Daltons, while that of V. unguiculata was the lowest (24302.14 Daltons). The theoretical pI was greater than 9.00 for all amino acid sequences of the matK gene in the selected legumes. The total numbers of negatively and positively charged residues, the total number of atoms and the extinction coefficient of the amino acid sequences of the matK gene in G. tomentella, V. angularis, C. arietinum and P. sativum were higher than in the other legumes investigated. However, this trend was not followed for the instability index and aliphatic index, as these properties were high for P. tetragonolobus (80.52; 97.46), P. vulgaris, V. unguiculata, V. sativa and C. cajan.

Motifs in amino acid sequence of matK gene in selected legumes
Analysis of motifs in the amino acid sequences of the matK gene in the 10 legumes investigated revealed 6 motifs in G. tomentella, 6 in P. tetragonolobus, 3 in P. vulgaris, 5 in V. angularis, 6 in V. unguiculata, 5 in C. arietinum, 6 in V. radiata, 5 in V. sativa, 8 in P. sativum and 7 in C. cajan. The most common motifs found were the tyrosine kinase phosphorylation site, N-myristoylation site, N-glycosylation site, protein kinase phosphorylation site, casein kinase II phosphorylation site and cAMP- and cGMP-dependent protein kinase phosphorylation site. In contrast, the microbodies C-terminal targeting site motif was observed only in the amino acid sequences of the matK gene of P. sativum and C. cajan (Table 5). As the table shows, the positions of these motifs vary among the legumes even though the motifs themselves are the same.

Phylogenetic relationship and relative time of evolution of matK gene sequence in 10 legumes selected
The matK gene sequences from the 10 legumes analysed formed two clades. Clade A contained the sequences of P. sativum, V. sativa and C. arietinum, while clade B contained P. tetragonolobus, C. cajan, G. tomentella, P. vulgaris, V. unguiculata, V. angularis and V. radiata. Clade B was further divided into two sub-clades, with P. tetragonolobus, C. cajan and G. tomentella in sub-clade I and P. vulgaris, V. unguiculata, V. angularis and V. radiata in sub-clade II (Fig. 1). The evolutionary history of the matK gene sequences revealed the same 2 clades as the phylogenetic tree of the gene. Clade A diverged from the ancestral legume approximately 39 MYA, while the legume sequences in clade B diverged from the ancestor about 57 MYA. C. cajan and P. tetragonolobus diverged about 35 MYA, probably from G. tomentella.

G-C contents and other parameters of the nucleotide sequences of matK gene in the selected 10 legumes
G-C content analysis revealed that V. sativa had the highest G-C content (31.37%), followed by P. sativum (30.71%). G-C content ranged from 27.29% to 31.37%. Additionally, the poly A (+) tail was absent in the nucleotide sequences of P. tetragonolobus, V. angularis, V. unguiculata, V. radiata and C. cajan. However, the poly A (-) tail was present in all the nucleotide sequences of the matK gene in the 10 legumes, although at varying positions. No initial or terminal exons, peptides or coding sequences (CDS) were predicted.

Secondary structure of amino acid sequences of matK gene in selected legumes
Analysis of the secondary structures of the amino acid sequences of the matK gene showed that random coil covered a larger region of the sequence than either alpha helix or extended strand (Table 7). Alpha helix ranged from 31.64% (P. sativum) to 41.27% (P. tetragonolobus), while extended strand ranged from 11.56% (V. radiata) to 20.99% (G. tomentella). Random coil ranged from 39.48% (P. tetragonolobus) to 51.76% (V. radiata).

Domain architecture of amino acid sequence of matK gene in selected legumes
Here we report only two types of domain architecture in the amino acid sequences of the matK gene in the selected legumes, namely single and double domain (Table 8). However, this result is constrained by the fact that the coding sequences of the matK gene in some of the selected legumes have not been completely sequenced, which implies that those with partial CDS and a single domain architecture might have more domains than are reported in this paper.

DISCUSSION
Data mined from sequenced genes have been pivotal in molecular systematic studies. Analyses of the DNA sequences of various species provide valuable information about their taxonomy, gene make-up and utilization. Genomic regions vary considerably in their potential phylogenetic informativeness and in their contribution to resolving a given set of taxa over a specified time (Hilu et al., 2014). There are two schools of thought regarding the use of rapidly evolving as against slowly evolving regions of the genome. According to Graham and Olmstead (2000) and Wang et al. (2009), rapidly evolving regions are better used for shallow evolutionary histories and slowly evolving regions for deeper epochs. Their argument is premised on the fact that multiple hits accumulated over an extended time scale could be significant enough to conceal phylogenetic signal and elevate homoplasy, with saturation reaching levels that can negatively affect tree structure (Graybeal, 1994; Wenzel and Siddal, 1999; Klopfstein et al., 2010; Townsend et al., 2012). The accumulation of multiple hits in rapidly evolving regions can obscure potential synapomorphies and result in long-branch attraction (Townsend, 2007; Magallon and Sanderson, 2002). The opposing school of thought holds that rapidly evolving, less constrained genomic regions are effective in deep-level phylogenetics (Yang, 1998; Hilu et al., 2008; Hilu and Liang, 1997; Worberg et al., 2007). According to Hilu et al. (2014), phylogenetic signal from the rapidly evolving and unconstrained matK provides by far the most structure and accuracy, whereas slowly evolving, constrained and unconstrained genes display decreasing degrees of informativeness and tree structure. This agrees with the earlier position that the matK gene is very informative in plant systematics owing to its high phylogenetic signal when compared with other genes such as rbcL.
We report that the nucleotide sequences of P. tetragonolobus, G. tomentella, C. arietinum and V. sativa had similar lengths (>1500 bp), while those of C. cajan, P. vulgaris and V. unguiculata also had similar lengths (>700 bp). The same trend was observed for the amino acid sequence lengths of these legumes. It should be mentioned that the latter group was only partially sequenced. Variations within a family of related nucleic acid and protein sequences provide an invaluable source of information on evolution. Variations in sequence length among organisms have been attributed to indel mutations accumulated during evolution. This might suggest that legumes with similar nucleotide and amino acid sequence lengths evolved at about the same time, or diverged from their ancestral root at almost the same time. The other possibility is that, although they are all legumes, they belong to different genera, which may be connected with earlier indel mutations creating evolutionary divergence as well as variation in the sequence lengths of nucleotides and amino acids.
According to Stone et al. (2010), organisms with high percentage sequence similarity in their genes have a similar pattern of evolution and differentiation. Sequence similarity implies that the two sequences share a common evolutionary ancestor, that is, that they are homologs, but it should be noted that homologous sequences do not always share significant sequence similarity. From our results, P. tetragonolobus and G. tomentella share more than 90% amino acid sequence similarity with C. cajan, implying close relatedness. P. vulgaris, V. angularis, V. unguiculata, V. radiata and C. arietinum share more than 80% similarity, while V. sativa and P. sativum share more than 70% sequence similarity with C. cajan. According to Kajita et al. (2001), if two sequences have sequence identity greater than 70%, they have a probability of about 90% or more of sharing the same biological processes and functions. We report nucleotide and amino acid sequence identity greater than 70%, except for the P. sativum sequence, which had 70%.
This notwithstanding, the result suggests that the matK genes found in the legumes analysed may perform similar functions and undergo the same processes. It may be recalled that Kores et al. (2000), Kugita et al. (2003) and Jankowiak et al. (2004) had earlier feared that the matK gene may not be functional in some plants because of the rare presence of indels as well as premature stop codons. This was countered by Michelle et al. (2007), who observed that matK is involved in post-transcriptional splicing in the chloroplast. However, the present analysis cannot establish whether these genes, despite a high percentage identity that implies similar functionality, actually function at comparable levels.
Proteins in the same family share more than 30% amino acid sequence similarity and, as a result, some structural characteristics (Wojciechowski et al., 2004). This suggests that the matK genes in the respective legumes share very similar structural features owing to the high percentage similarity in their amino acid sequences, their genera notwithstanding. Sequence similarity of approximately 70% may suggest homology, shared functionality and very high conservation in the matK gene.
The expected value (E-value) assesses the significance of a single pairwise alignment and is related to the p-value. The lower the E-value, the less likely it is that the database match is a result of random chance, and thus the more significant the match. An E-value less than 1e-50 (E<1e-50) indicates that the match is the result of a homologous relationship. It can therefore be affirmed that the nucleotide and amino acid sequences compared for identity and similarity were homologs, which indicates a strong evolutionary relationship.
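The relation between the two statistics can be written down directly. Under the usual BLAST statistics, where the number of chance hits is treated as Poisson-distributed, the probability of seeing at least one chance hit with a given E-value is p = 1 - exp(-E). The sketch below evaluates that relation; the function name is an assumption made for illustration.

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit: p = 1 - exp(-E)."""
    # expm1 keeps precision when E is extremely small, e.g. 1e-50.
    return -math.expm1(-e_value)


for e in (10.0, 1.0, 1e-3, 1e-50):
    print(e, p_value_from_e_value(e))
```

For small values the two quantities are practically equal, so an E-value below 1e-50 corresponds to a p-value of the same order.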
Our results on the physicochemical properties of the amino acid sequences of the matK gene in the respective legumes showed that the greater the number of amino acid residues, the higher the molecular weight, the numbers of positively and negatively charged residues, the number of atoms and the extinction coefficient. Positively charged residues outnumbered negatively charged residues, which implies that maturase K is an extracellular rather than an intracellular protein (Andrade et al., 1998). Guruprasad et al. (1990) observed that an instability index above 40 implies that a protein is unstable in vitro. Except for the matK amino acid sequence of C. cajan, which had an instability index of 38.18, the matK sequences of the other legumes analysed had instability indices above 40. Nikhil et al. (2009) reported that the instability index is a function of the abundance of cysteine available for disulphide bond formation in the protein molecule. It follows from that report that, except for the MatK protein of C. cajan, the MatK proteins have low cysteine content for disulphide bond formation. Proteins can be either hydrophobic or hydrophilic. According to Kyte and Doolittle (1982), a grand average hydropathicity (GRAVY) value greater than zero indicates a relatively hydrophobic protein. Our present report suggests that matK protein is relatively hydrophobic (-0.155-0.021).
Sequence motifs are short recurring patterns in DNA that are presumed to have a biological function. Usually, they indicate sequence-specific binding sites for proteins such as nucleases and transcription factors. Almost the same motifs were predicted in the amino acid sequences of the matK gene in the 10 legumes analysed, except for the microbodies C-terminal targeting site found only in the matK gene of P. sativum and C. cajan. However, the differences in the positions the motifs occupy indicate that the sequence-specific binding sites for proteins differ, which might create functional differences. It should be underscored that some of the positions that bind specifically to these motifs are involved in important processes at the RNA level, including ribosome binding, mRNA processing and termination of transcription. These proteins have varying initiation and termination sites.
Phylogenetically, we observed that the matK genes of the different legumes clustered into clades according to the percentage identity and similarity of their sequences. The implication is that the greater the sequence homology, the higher the probability that sequences fall in the same clade. This is evident in sub-clade I of clade B, comprising G. tomentella, C. cajan and P. tetragonolobus (Figure 1a), and in the percentage identity and similarity of the nucleotide and amino acid sequences (Tables 2 & 3) of the matK gene. This might suggest that the matK genes, though derived from a common ancestor, diverged evolutionarily, probably because of indel mutations, while retaining sequence homology and possibly similar functional and structural characteristics. Although these legumes belong to different genera, their gene sequences showed high homology and were therefore clustered together. Wojciechowski et al. (2004) reported that Fabaceae is generally monophyletic, meaning that it forms a clade containing an ancestral species and all its descendants. However, some groups within it are paraphyletic. Analyses of rbcL (Kajita et al., 2001), trnL intron (Bruneau et al., 2001; Herendeen et al., 2003) and matK sequences support the monophyly of the leguminous family. This suggests that the sequences of the genes used might be highly homologous. Although our present result could not trace the ancestral parent, it does not negate the earlier positions on the monophyletic concept. Fabaceae have been reported to have started their diversification approximately 60 MYA, while the most important clades diverged some 50 MYA. Lavin et al. (2005) reported that the ages of the main Caesalpinioideae clades have been estimated at between 56 and 34 MYA, while the basal group of the Mimosoideae was put at 44±2.6 MYA. Using the matK gene sequences from the various legumes, clade A (Figure 1b) diversified from the ancestral root approximately 39 MYA, while the clade B legumes diverged about 57 MYA.
The report of Bruneau et al. does not imply that matK gene sequences must have diverged at the same time. Rather, the family, being monophyletic, would have had an ancestor (which may now be extinct) carrying this gene, and several subsequent mutations would have caused the variations observed in the different legumes analysed.
According to Smarda et al. (2014), the hypotheses of several authors on the biological impact of GC content variation in microbial and vertebrate organisms notwithstanding, the biological significance of GC content diversity in plants remains unclear because of a lack of sufficiently robust genomic data. GC content showed a quadratic relationship with genome size, with the decrease in GC content in larger genomes possibly being a consequence of the higher biochemical cost of GC base synthesis. GC-rich DNA has been associated with tolerance of cell freezing and desiccation. Genomic adaptations associated with changing GC content might therefore have played a significant role in the evolution of plants. Base composition is a fundamental property of genomes and has a strong influence on gene function and regulation (Li and Du, 2014). Among higher organisms, GC content was lower in dicot plants and highest in monocot plants (Li and Du, 2014). The GC content of monocots varied between 33.6% and 48.9% (Smarda et al., 2014). The GC content of the nucleotide sequences of the matK gene in the selected legumes ranged from 27.29% to 31.37%. This confirms the earlier report of Li and Du (2014). The poly(A) tail is a common modification of eukaryotic mRNA and plays many fundamental roles in mRNA stability (Mangus et al., 2003; Coller and Parker, 2004). It is a long chain of adenine nucleotides added to the 3' end of mRNA during RNA processing, and it increases the stability of the molecule.
The secondary structure of the amino acid sequences of the matK gene in the selected legumes comprised alpha helix, extended (beta) strand and random coil. A region of secondary structure that is not an alpha helix, a β-sheet or a recognisable turn is commonly known as a coil (Mount, 2004). The alpha helix is the most abundant helical conformation found in globular proteins, accounting for 32-38% of all residues. Regions richer in alanine (A), glutamic acid (E), leucine (L) and methionine (M) and poorer in proline (P), glycine (G), tyrosine (Y) and serine (S) (AELM>PGYS) tend to form an alpha helix. This might mean that the helical regions of the amino acid sequences of the matK gene in the legumes were higher in AELM and lower in PGYS. Additionally, our result on the secondary structure of the amino acid sequences suggests that the percentage of unrecognisable regions (random coil) was higher than that of each of the recognisable regions (alpha helix and extended strand). Protein domains are the structural, functional and evolutionary units of a protein (Zhang et al., 2012). Usually, proteins with similar architectures are close homologs, while different proteins possess distinct domain architectures. This might imply that the sequences composed of more than one domain arose by rearrangement, duplication, insertion, deletion, fusion and fission of domains (Teichmann et al., 1998; Gough, 2005; Kummerfeld and Teichmann, 2005; Fong et al., 2007; Ekman et al., 2007). This is premised on the fact that single domain architectures are more often than not created de novo (Fong et al., 2007). It should be noted that the partial sequencing of some matK gene sequences of the selected legumes might conceal information that could be important in resolving the differences among the legume sequences.

CONCLUSION
As expected, differences were observed in the matK gene sequences of the selected legumes across the parameters analysed. However, our results revealed a degree of identity and similarity among the sequences, especially between the C. cajan and P. tetragonolobus sequences. This might imply that the gene performs the same functions in these legumes.