Functional and Structural Characterization of SARS-CoV-2 Spike Protein: An In Silico Study

Background: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the cause of the global outbreak of coronavirus disease 2019 (COVID-19), which has been declared a pandemic by the WHO. SARS-CoV-2 encodes four major structural proteins, among which the spike protein has been a main target for vaccine studies. This in silico study aimed to investigate some physicochemical, functional, immunological, and structural features of the spike protein using several bioinformatics tools. Methods: We retrieved all SARS-CoV-2 spike protein sequences from different countries registered in NCBI GenBank. CLC Sequence Viewer was employed to translate and align the sequences, and several programs were utilized to predict B-cell epitopes. Modification sites such as phosphorylation, glycosylation, and disulfide bonds were predicted. Secondary and tertiary structures of all sequences were further computed. Results: Several mutations were identified, of which only one (D614G) had a high prevalence. The mutations did not affect the predicted B-cell epitopes or the physicochemical properties of the spike protein. Seven disulfide bonds were identified, and several N-linked glycosylation and phosphorylation sites were predicted. The results also indicated that the spike protein is a non-allergen. Conclusion: Our findings provide a detailed characterization of the spike protein, which can be valuable for future studies on SARS-CoV-2 infection and the design of new vaccines.


INTRODUCTION
Coronaviridae is a family of enveloped, positive-sense single-stranded RNA viruses (ssRNA+) comprising coronaviruses in birds, bafiniviruses in fishes, and corona- and toroviruses in mammals (1). At the end of 2019, a series of pneumonia cases were reported from the Hubei Province of China with clinical presentations closely resembling viral pneumonia (2). The causative virus and the resulting disease are currently called severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and coronavirus disease 2019 (COVID-19), respectively (3).
There is limited information concerning the pathogenesis of COVID-19, and evidence has shown that the main mechanism is similar to that of SARS-CoV and MERS-CoV (4). The spike (S) protein of coronaviruses mediates viral entry into target cells. This entry depends on the binding of the surface unit (S1) of the S protein to a cellular receptor, angiotensin-converting enzyme 2 (ACE2). SARS-S and SARS-2-S share a high amino acid homology (>70%) (5). The interaction between viral proteins and cell membrane receptors is a critical step in virus pathogenesis (6). The virus probably passes through the major passages of the upper respiratory tract, especially the nasal and laryngeal mucosa (7). The main target of the virus is the lungs, which it reaches through the respiratory tract, but the virus can also attack and enter other organs that express the type 2 transmembrane serine protease (TMPRSS2) and the ACE2 receptor protein. Infection of host cells consequently causes an excess release of proinflammatory cytokines, resulting in a cytokine storm (8).
COVID-19 patients exhibit various symptoms that cannot be easily distinguished from those of other respiratory diseases. Based on the severity of symptoms, the disease is classified as mild, moderate, severe, or critical (9). These symptoms, which may appear within a week after exposure to the virus, mainly include fever, cough, shortness of breath, chills, headache, muscle pain, and loss of taste or smell (10). The main reported complications associated with COVID-19 were pneumonia, heart injury, liver and kidney failure, and superinfections (11). Recent estimates showed that approximately half of the people who died of COVID-19 had an underlying disease, of which hypertension (46%) had the highest occurrence, followed by diabetes (26%), cardiovascular disease (21%), malignancy (11%), chronic obstructive pulmonary disease (COPD) (8%), kidney disease (7%), and liver disease (3%) (12).
To date, no specific antiviral treatment is recommended for COVID-19, and no vaccine is currently available (13). The current appropriate treatments include oxygen therapy (which is the major intervention), administration of antibiotics to prevent bacterial co-infections, fluid management, and supportive use of traditional medicine (14,15). Other strategies that have been used include antivirals (Lopinavir, Ritonavir, Ribavirin, Favipiravir (T-705), Remdesivir, Oseltamivir, Chloroquine, and Interferon) and convalescent plasma (16). However, treatment effectiveness still varies greatly, so future studies on SARS-CoV-2 genome organization can help design and develop effective antiviral drugs or inhibition approaches.
Over the past decades, bioinformatics has emerged as a powerful tool for analyzing bacterial and viral genomes, predicting the structure and function of proteins, and designing new vaccines (17,18). Due to the global health emergency declared for COVID-19 and the importance of any effort to control the outbreak, the present in silico study aimed to investigate some physicochemical, functional, immunological, and structural features of spike protein using several bioinformatics tools.

RESULTS
Amino acid changes: Analysis showed that the spike protein was highly conserved, and only one high-frequency mutation (D614G) was detected in comparison with the reference sequence. Table 2 summarizes all mutations identified in the spike protein. In addition, the phylogenetic tree is illustrated in Figure 1. The phylogenetic analysis showed two main clusters: the upper one contained eight sequences from Spain, the USA, and South Africa, and the second one included the other sequences and the reference sequence. Two sequences from Iran and one sequence from the USA were very close to the reference sequence. Interestingly, almost all sequences from East Asia (China, Japan, and South Korea) were close to each other, and the majority of USA sequences were located in the upper cluster.
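As an illustration of how substitutions such as D614G can be read from an alignment, the following minimal Python sketch compares aligned protein sequences against a reference and tallies how often each substitution occurs. It is not the CLC Sequence Viewer workflow used in this study; the function names `list_substitutions` and `substitution_prevalence` and the toy sequences are hypothetical.

```python
from collections import Counter


def list_substitutions(reference: str, query: str, gap: str = "-") -> list[str]:
    """Return substitutions (e.g. 'D614G') between two aligned sequences.

    Both strings must come from the same alignment (equal length).
    Positions are numbered along the ungapped reference, starting at 1.
    """
    if len(reference) != len(query):
        raise ValueError("aligned sequences must have equal length")
    substitutions = []
    ref_pos = 0
    for ref_aa, query_aa in zip(reference, query):
        if ref_aa == gap:
            continue  # insertion relative to the reference: not numbered
        ref_pos += 1
        if query_aa != gap and query_aa != ref_aa:
            substitutions.append(f"{ref_aa}{ref_pos}{query_aa}")
    return substitutions


def substitution_prevalence(reference: str, queries: list[str]) -> dict[str, float]:
    """Fraction of query sequences that carry each substitution."""
    counts = Counter()
    for query in queries:
        counts.update(set(list_substitutions(reference, query)))
    return {sub: n / len(queries) for sub, n in counts.items()}


if __name__ == "__main__":
    # Toy aligned fragments (hypothetical, not real spike sequences).
    reference = "MKDLV"
    queries = ["MKGLV", "MKDLV", "MKGLI", "MKDLV"]
    print(substitution_prevalence(reference, queries))
```

Numbering along the ungapped reference keeps substitution labels comparable across sequences even when the alignment contains insertions.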
ProtParam analysis: ProtParam analysis indicated that the spike protein is an acidic peptide owing to its high percentage of acidic amino acids (theoretical pI: 6.2). The instability index, an estimate of the stability of a protein in a test tube, was 33.01, which is below the threshold of 40 and classified the spike as a stable peptide. The aliphatic index, a positive factor for the increased thermostability of proteins, was 84.67, which indicated that this peptide was thermostable. GRAVY is a hydropathy index in which more positive scores indicate greater hydrophobicity; the slightly negative score obtained (-0.079) therefore indicated that the peptide was hydrophilic.
Postmodification and disulfide bond results: Table 3 shows the predicted post-translational modification and disulfide sites. Based on our results, the spike was highly phosphorylated, and four conserved positions were further suggested. Glycosylation prediction by two online tools identified seven positions (61, 74, 234, 282, 616, 709, and 1195). Prediction of possible disulfide bonds by Dianna and Scratch identified several cysteines.
Secondary and tertiary structure prediction: The secondary structure results using SOPMA showed that random coil was the major structure (43.9%), followed by alpha helix (29.3%), extended strand (23.3%), and beta turn (3.5%). Table 4 presents the quality assessment results of the refined models suggested by 3Drefine. Figure 2 illustrates the tertiary structure of spike protein.
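Two of the ProtParam indices reported in this section follow simple published formulas and can be sketched directly. GRAVY is the mean Kyte-Doolittle hydropathy value over all residues, and the aliphatic index is computed from the mole percentages of Ala, Val, Ile, and Leu with the weights 2.9 and 3.9. The Python sketch below is an independent illustration, not ProtParam's own code; the function names and the short test peptides are hypothetical, and it is not meant to reproduce the values reported for the spike protein.

```python
# Kyte-Doolittle hydropathy scale (positive values = hydrophobic residues).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def _validated(sequence: str) -> str:
    """Upper-case the sequence and reject empty input or unknown residues."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence is empty")
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"unsupported residues: {sorted(unknown)}")
    return seq


def gravy(sequence: str) -> float:
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    seq = _validated(sequence)
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(sequence: str) -> float:
    """Aliphatic index: X(Ala) + 2.9*X(Val) + 3.9*(X(Ile) + X(Leu)),
    where X is the mole percentage of each residue."""
    seq = _validated(sequence)
    weighted = (
        seq.count("A")
        + 2.9 * seq.count("V")
        + 3.9 * (seq.count("I") + seq.count("L"))
    )
    return 100.0 * weighted / len(seq)


if __name__ == "__main__":
    peptide = "AVILDE"  # hypothetical test peptide
    print(f"GRAVY: {gravy(peptide):.3f}")
    print(f"Aliphatic index: {aliphatic_index(peptide):.2f}")
```

A negative GRAVY value points to an overall hydrophilic sequence, and a higher aliphatic index reflects a larger relative volume of aliphatic side chains.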

DISCUSSION
The results of the present study showed that the spike protein was highly conserved, and a high-prevalence mutation was detected at only one site (D614G). Similar to our findings, in a previous report the substitution at amino acid 614 (D614G) was the most prevalent mutation (25%) (20). Previous studies suggested a region, KRSFIEDLLFNKV, as a potential Achilles' heel for controlling the life cycle of SARS-CoV-2. This region is exposed and is required for proteolytic activation cleavage (21,22). In addition, it is a well-conserved region located on the surface of the virus. Similar to previous investigations, our findings showed that KRSFIEDLLFNKV was completely conserved among all selected sequences from all regions. Interestingly, prediction of post-translational modification sites indicated that this region was phosphorylated. It was further predicted as a B-cell epitope, supporting its importance as a possible candidate for designing new vaccines.
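A conservation check of this kind can be sketched in a few lines of Python: each sequence is stripped of alignment gaps and searched for the exact KRSFIEDLLFNKV motif. This is only an illustration under stated assumptions, not the procedure used in the study; the function names and the toy sequences (with made-up flanking residues) are hypothetical.

```python
MOTIF = "KRSFIEDLLFNKV"  # region discussed in the text


def motif_status(sequences: dict[str, str], motif: str = MOTIF) -> dict[str, bool]:
    """Report, per sequence, whether the exact motif is present."""
    status = {}
    for name, seq in sequences.items():
        ungapped = seq.replace("-", "").upper()  # drop alignment gaps
        status[name] = motif in ungapped
    return status


def fully_conserved(sequences: dict[str, str], motif: str = MOTIF) -> bool:
    """True when every sequence contains the exact motif."""
    return all(motif_status(sequences, motif).values())


if __name__ == "__main__":
    # Hypothetical fragments; the second carries a made-up K-to-R change.
    toy = {
        "seq_1": "GG" + MOTIF + "GG",
        "seq_2": "GGKRSFIEDLLFNRVGG",
    }
    print(motif_status(toy))
    print(fully_conserved(toy))
```

An exact-match search is deliberately strict: any single substitution inside the motif is reported as a loss of conservation.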
The spike protein contains a receptor binding domain (RBD) positioned between amino acids 331 and 524. Mutations in this region may critically affect virus entry and attachment to the ACE2 receptor (23). We detected a substitution in this region in only one sequence, indicating that this domain is highly conserved and could be a new target for inhibiting virus attachment.
In contrast to our results, Banerjee et al. identified four mutations (at positions 348, 476, 483, and 520) with very low prevalence (20). The difference between the two studies in the number of mutations might be ascribed to the different sets of sequences and study methods.
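Deciding whether a reported substitution falls inside the RBD is a simple range check on its position. The short Python sketch below parses labels of the form D614G and tests them against the 331-524 boundaries given in the text; the helper names are hypothetical, and the boundaries are taken from this article rather than from an external annotation.

```python
import re

RBD_START, RBD_END = 331, 524  # RBD boundaries as stated in the text

# One-letter residue, position, one-letter residue (e.g. "D614G").
_SUBSTITUTION = re.compile(r"^[A-Z](\d+)[A-Z]$")


def substitution_position(label: str) -> int:
    """Extract the residue position from a label such as 'D614G'."""
    match = _SUBSTITUTION.match(label.strip().upper())
    if match is None:
        raise ValueError(f"not a substitution label: {label!r}")
    return int(match.group(1))


def in_rbd(position: int) -> bool:
    """True when the position lies within the RBD (inclusive bounds)."""
    return RBD_START <= position <= RBD_END


if __name__ == "__main__":
    # Positions mentioned in the text, plus the D614G substitution.
    positions = [348, 476, 483, 520, substitution_position("D614G")]
    print([p for p in positions if in_rbd(p)])
```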
Korber et al. focused on the D614G substitution as an urgent concern, proposing that this mutation began spreading in Europe in early February 2020 (24). Although they were not able to define the origin of this mutation, there are hypotheses of a Chinese or European origin. The potential impacts suggested for this mutation are increased viral transmission, infected spike, enhanced receptor binding, and ADE (antibody-dependent enhancement) antibody elicitation (24). In agreement with Korber's study, our results indicated the spread of the D614G substitution. Moreover, almost all sequences from North America (USA) and three sequences from Europe (Spain) harbored this mutation. Interestingly, this mutation was not detected in the sequences from China and other East Asian countries (Japan and South Korea).
Our analysis described the spike protein as acidic, thermostable, and hydrophilic. However, because the spike requires some post-translational modification processes, it seems that yeast and mammalian cells can better express this protein. Consistent with our ProtParam prediction, Walls et al. and Ou et al. used different cell lines to express the spike protein, which showed its stability in mammalian cells (25,26). Likewise, Zhang et al. expressed the spike protein in Escherichia coli; they confirmed that E. coli was an appropriate host for the expression of the spike (27).
Phosphorylation prediction showed four completely conserved sites among the selected sequences. Previous studies suggested some functions for protein phosphorylation in coronaviruses. Petit et al. proposed that phosphorylation is vital in the retention of spike protein at cell surfaces (28). Furthermore, Davidson et al. stated that the phosphorylation sites on the spike glycoprotein might be necessary for assembling the trimer (29). Therefore, blocking the phosphorylation process could be an effective approach to disturb spike protein function.
Fung et al. defined the vital role of glycosylation in the antigenicity, fusogenicity, and immunomodulatory activities of the spike protein (30). Glycosylation prediction by NetNGlyc and Nglyde determined seven positions. Of these, all except position 74, which showed a substitution in a sequence from Brazil, were highly conserved and seemingly vital to spike protein function. Shajahan et al. and Watanabe et al., using high-resolution mass spectrometry, revealed 22 glycosylation sites for spike protein (31,32). The seven positions identified in our findings were consistent with those studies. Similar to this study, Kumar et al. [...]
It has been proposed that disulfide bonds are required for proper folding and trimerization of coronavirus spike protein (30). Dianna and Scratch results showed numerous positions for disulfide bonds that were completely conserved in all analyzed sequences. Dianna uses a support vector machine (SVM) with a degree-2 polynomial kernel for the spectrum representation, and Scratch works based on 2D recurrent neural networks, support vector machines, graph matching, and regression algorithms. Both online tools are well known and were previously employed in numerous studies to define disulfide bonds. Ibrahim et al. made use of combined molecular docking and structural bioinformatics; they detected 13 disulfide bonds in four distinct regions and suggested that these regions were involved in cell attachment (34).
[...] (35-37). It was shown that neutralizing antibody responses to the spike protein began by week two and, in most patients, developed by week three. Immunoinformatics analysis of spike protein by certain online databases suggested four regions, which supported the potential of this protein for inducing the humoral immune system. Interestingly, no mutation was detected in these regions; hence, they could be proper regions for the production of new vaccines. In addition, estimation of allergenic characteristics indicated that spike protein is unlikely to provoke allergenic reactions in humans.
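For the N-glycosylation prediction discussed in this section, a useful first pass is to scan for the N-X-S/T sequon, where X is any residue except proline. The Python sketch below does only that; the sequon is a necessary but not sufficient condition for glycosylation, and this is not the scoring used by NetNGlyc or any other predictor. The function name and the toy sequence are hypothetical.

```python
import re

# Asn followed by any residue except Pro, then Ser or Thr.
# The lookahead keeps matches from consuming residues, so
# overlapping sequons are all reported.
_SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(sequence: str) -> list[int]:
    """Return 1-based positions of Asn residues in N-X-S/T sequons.

    Expects an ungapped protein sequence in one-letter code.
    """
    seq = sequence.upper()
    return [match.start() + 1 for match in _SEQUON.finditer(seq)]


if __name__ == "__main__":
    toy = "MNGTANPSNAS"  # hypothetical sequence, not a spike fragment
    print(find_sequons(toy))  # N2 and N9 qualify; N6 is followed by Pro
```

Positions returned by such a scan could then be compared with the sites reported by a dedicated predictor or by mass spectrometry.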
Ahmed et al. used bioinformatics approaches to define B-cell epitopes in different proteins of SARS-CoV-2 (38). They were able to define 23 B-cell epitopes for spike protein; our prediction, on the other hand, showed new regions, a difference possibly attributable to the different sequences used in the two studies.
Moreover, through bioinformatics analysis and machine learning, Grifoni et al. and Fast et al. analyzed spike protein to define its immunological properties (39,40). Compared with our prediction, two sites (674-687 and 807-816) were similar.
As a major limitation of this study, information on the COVID-19 crisis is constantly changing, and the number of new sequences in online databases is updated daily; therefore, our study may not present a comprehensive view of spike protein. However, as a preliminary study, our results provide insight for further work.
In summary, the results of the present study provided a comprehensive understanding of spike protein, which can be used for further studies. This protein is a highly capable epitope on the SARS-CoV-2 surface, with several features appropriate for a vaccine construct. Other features of spike protein could be employed to express this protein, and post-translational modification sites could be utilized as new targets for SARS-CoV-2 inhibitors. Although it is not easy to forecast any realistic scenario, mutations in spike protein suggest potential impacts on the pathogenesis of the virus in the near future.