Citation
Protein phosphorylation site prediction - a deep learning approach

Material Information

Title:
Protein phosphorylation site prediction - a deep learning approach
Creator:
Mahdi, Mohammed Abed
Place of Publication:
Denver, CO
Publisher:
University of Colorado Denver
Publication Date:
Language:
English

Thesis/Dissertation Information

Degree:
Master's (Master of Science)
Degree Grantor:
University of Colorado Denver
Degree Divisions:
Department of Computer Science and Engineering, CU Denver
Degree Disciplines:
Computer science
Committee Chair:
Biswas, Ashis Kumer
Committee Members:
Alaghband, Gita
Chlebus, Bogdan S.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
Copyright Mohammed Abed Mahdi. Permission granted to University of Colorado Denver to digitize and display this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.


Full Text

PAGE 1

PROTEIN PHOSPHORYLATION SITE PREDICTION - A DEEP LEARNING APPROACH
by
MOHAMMED ABED MAHDI
B.S., Thi-Qar University, 2007

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Master of Computer Science
Computer Science and Engineering Program
2018

PAGE 2

This thesis for Computer Science degree by Mohammed Abed Mahdi has been approved for the Computer Science and Engineering Program by
Advisor: Ashis Kumer Biswas
Committee Members: Gita Alaghband, Bogdan S. Chlebus

PAGE 3

Mahdi, Mohammed Abed
PROTEIN PHOSPHORYLATION SITE PREDICTION - A DEEP LEARNING APPROACH
Thesis directed by Assistant Professor Ashis Kumer Biswas.

ABSTRACT

For decades, protein phosphorylation has fascinated biological researchers and computer scientists. The problem of predicting phosphorylation sites in proteins is simple to state, but accurate prediction is hard. Machine learning and deep learning algorithms, both exact and heuristic, running on today's high-performance computer systems have been applied in the field of biological research to make correct predictions. In the past decade, these algorithms have benefited in silico phosphorylation site prediction systems. Such methods are now prevalent and give very accurate predictions of the phosphorylation sites of a protein. The problem requires a very high level of classification performance to separate phosphorylated from non-phosphorylated groups of protein sequences. In addition to the primary information of the protein sequences, the kinase enzymes, which are responsible for activating the phosphorylation group during the growth of the protein, have been used as the main features, because they are very significant data for classifying phosphorylation sites. However, only 30% of the actual phosphorylation sites contain kinase annotations, which means that more than 70% of the protein dataset is eliminated in the design of the existing kinase-specific prediction systems. In this thesis, a generalized prediction system is proposed that needs no kinase information and is based solely on Position Specific Scoring Matrices (PSSM), which represent evolutionary information of proteins and are calculated from the multiple sequence alignment of proteins against a non-redundant set of protein sequences. Besides, we have collected many other features from the protein and created a database to build a very robust classification system that will help to make a highly accurate prediction. With this information, a convolutional neural network (CNN) is implemented to extract the features and make a prediction. Moreover, a Hybrid CNN Support Vector Machines (SVM) classifier is trained on the features extracted by the CNN to devise the generalized prediction system. After the implementation of each algorithm, an evaluation on an independent dataset was performed for each algorithm individually. We found that the proposed algorithms show a good performance compared to existing predictors.

The form and content of this abstract are approved. I recommend its publication.
Approved: Ashis Kumer Biswas

PAGE 4

I dedicate this work to:
My family, both present and future, especially my parents.
The one who is absent from the eyes of the world, but he lives in the deepest recesses of my heart, guides me to spiritual things that feed me with deep feelings like love, caring, kindness, deep knowledge, learning the virtue of patience, and connection, just as the sun hidden behind the clouds provides us with light and warmth.

PAGE 5

ACKNOWLEDGEMENTS

First of all, I would like to express my sincere gratitude to my sponsor, the Higher Committee of Education Development in Iraq (HCED), for giving me this wonderful opportunity to finish my master's degree in the United States.

Also, I would like to thank my advisor, Dr. Ashis Biswas, for giving me the opportunity to grow both academically and professionally as a research scientist in his lab. When I started graduate studies, I had little exposure to the fields of bioinformatics, biochemistry, and medicine. Through Dr. Biswas' guidance and mentorship, I have had the opportunity to work on a number of exciting projects which have developed my professional interests.

I would also like to thank Professor Bogdan Chlebus for all the personal and professional advice he provided me with, as well as the friendly conversations that led to ideas for conducting quality research.

Special thanks to Professor Gita Alaghband, whose wise advice helped me find the motivation and strength to set forth on this academic adventure.

I am grateful to my family for their love, understanding, and sacrifice during my years away from home; especially my parents for being a light in the darkness as always. I would not be where I am without their prayers and their unwavering support.

My appreciation also goes to my friends and colleagues at the Department of Computer Science and Engineering for their continuous support and encouragement, particularly during stressful periods of research. I would like to especially thank Hawkar Oagaz, who was always eager to help when I needed assistance.

I am very appreciative of and blessed by all my host families here, the amazing Jewish family and the awesome Christian families, for their hospitality and for being there every time I needed them.

And last but not least, I would also like to extend my thanks to my friends back home. Also, to my old self and my soulmate who will always be alive and remembered as my inspirational source, "you may now rest in the peace." I completed my mission!

PAGE 6

TABLE OF CONTENTS

CHAPTER
I. Introduction .... 1
I.1 Background .... 1
I.2 Amino Acids .... 2
I.3 Protein Definition .... 4
I.3.1 Structure of the Protein .... 4
I.3.2 Protein Phosphorylation .... 5
I.3.3 The Formal Definition of The Problem .... 7
I.4 Literature Reviews .... 9
I.4.1 PhosPred-RF: A Novel Sequence-Based Predictor for Phosphorylation Sites Using Sequential Information Only .... 10
I.4.2 PhosphoPredict: A bioinformatics tool for prediction of human kinase-specific phosphorylation substrates and sites by integrating heterogeneous feature selection .... 10
I.4.3 RF-Phos: A Novel General Phosphorylation Site Prediction Tool Based on Random Forest .... 10
I.4.4 Prediction of phosphorylation sites based on Krawtchouk image moments .... 11
I.4.5 PhosContext2vec: a distributed representation of residue-level sequence contexts and its application to general and kinase-specific phosphorylation site prediction .... 11
I.4.6 MusiteDeep: a deep-learning framework for general and kinase-specific phosphorylation site prediction .... 12
I.5 Our contribution .... 12
II. PROTEIN PHOSPHORYLATION CLASSIFICATION AND DEEP LEARNING ALGORITHMS .... 14
II.0.1 Extracting Evolutionary information of Phosphoproteins by using PSI-BLAST .... 14
II.0.2 Deep learning .... 20
II.0.3 Summary .... 22
III. SUMMARY OF THE PROPOSED PREDICTION METHOD .... 23
III.1 Summary of the Proposed Prediction System .... 23

PAGE 7

III.2 Materials .... 23
III.2.1 Evolutionary Profile as Foundation for System Design .... 23
III.2.2 Convolutional neural network (CNN) .... 24
III.2.3 Hybrid CNN Support Vector Machines (SVM) classifier .... 24
III.3 Methodology .... 25
III.3.1 Datasets .... 25
III.3.2 Database .... 26
III.3.3 Positive Dataset Preparation .... 29
III.3.4 Negative Dataset Preparation .... 31
III.3.5 Training System Design .... 31
III.3.6 Testing the Proposed Method .... 32
III.3.7 Evaluating the Proposed Prediction System .... 33
III.4 Testing the system .... 35
III.5 Summary of the Proposed Prediction System .... 36
IV. RESULTS .... 37
IV.1 Dataset B .... 37
IV.1.1 Convolutional Neural Networks (CNN) Model .... 38
IV.1.2 Hybrid CNN Support Vector Machines (SVM) classifier .... 47
IV.2 Dataset C .... 56
IV.2.1 Convolutional Neural Networks (CNN) Model .... 57
IV.2.2 Hybrid CNN Support Vector Machines (SVM) classifier .... 60
IV.3 Details .... 64
IV.3.1 Details .... 64
IV.4 Effect of Window Size .... 64
IV.5 The Elapsed Time Results .... 66
IV.6 Comparison with the existing methods .... 67

PAGE 8

V. DISCUSSION .... 69
V.1 Summary of Thesis .... 69
V.2 Conclusion and Future Work .... 70
REFERENCES .... 72
APPENDIX
A. TABLE AND ABBREVIATION .... 77
A.1 Twenty Amino Acids .... 77
A.2 Abbreviations .... 77

PAGE 9

LIST OF TABLES

TABLE
III.1 Hyper Parameters of the CNN .... 24
III.2 Dataset Phospho.ELM version 9 .... 26
III.3 Dataset B .... 26
IV.1 CNN1050 .... 40
IV.2 CNN2050 .... 42
IV.3 CNN10100 .... 44
IV.4 CNN20100 .... 46
IV.5 Hybrid CNN SVM classifier 1050 .... 49
IV.6 Hybrid CNN SVM classifier 2050 .... 51
IV.7 Hybrid CNN SVM classifier 10100 .... 53
IV.8 Hybrid CNN SVM classifier 20100 .... 55
IV.9 The mean of all the measurements in the CNN classifier 2050 .... 58
IV.10 The standard deviation of all the assessments in the CNN classifier 2050 .... 59
IV.11 The mean of all the measurements in Hybrid CNN Support Vector Machines (SVM) classifier 2050 .... 61
IV.12 The standard deviation of all the assessments in the CNN classifier 2050 .... 63
IV.13 The comparison of the proposed system with some of the existing systems regarding prediction scores .... 67
A.1 A list of twenty kinds of amino acids that support the human body; each of them has its functions. The combination of the twenty amino acids makes up any protein .... 77

PAGE 10

LIST OF FIGURES

FIGURE
I.1 The comprehensive structure of alpha-amino acid, adapted from [1], where the carboxyl group is on the right and the amino group on the left side .... 3
I.2 The chain of amino acids that made up the polypeptide, which is a protein (the primary structure of the protein), adapted from [2] .... 4
I.3 The four standard structures of the protein, from left to right: Primary Structure, Secondary Structure, Tertiary Structure, and Quaternary Structure, respectively, adapted from [3] .... 6
I.4 Phosphorylation and dephosphorylation rules by protein kinases and protein phosphatases, adapted from [4] .... 7
I.5 The illustration of the objective function of the proposed system .... 8
II.1 Graphical diagram of the original BLAST algorithm, adapted from [5] .... 15
II.2 A flowchart explaining the steps of the PSI-BLAST program that generate a Position Specific Scoring Matrix (PSSM) profile .... 16
II.3 A flowchart illustrating the main steps of building the Convolutional neural network (CNN) .... 21
III.1 A PSSM profile of the protein A0AVK6. It can be seen that the resultant profile is a 31 × 20 matrix. The arrows in the figure mean that the conservedness of the Tyrosine residue in positions 521 and 524 is different. The P in the top left indicates the position, and the S is the primary sequence of this fragment of the protein .... 30
III.2 Flowchart explaining the detailed system of the proposed prediction model while testing .... 36
IV.1 The architecture of the CNN used in training the model with a window size of 31 × 20. The figure also explains in detail the input and output parameters for each layer in the CNN .... 38
IV.2 The architecture of the SVM used in training the model along with the CNN that has an input of 31 × 20 for the window size. The figure also explains in detail the input and output parameters for each layer in the CNN model, which is flattened at the end to be the input of the SVM .... 47
IV.3 Effects of window size on Phosphorylated Serine, Threonine, and Tyrosine sites prediction performance in both the CNN and Hybrid CNN SVM classifier .... 56
IV.4 Effects of window size on Phosphorylated Serine, Threonine, and Tyrosine sites prediction performance in both the CNN and Hybrid CNN SVM classifier in terms of the accuracy assessment .... 64
IV.5 The mean of the baseline errors in percentages for all the residues of protein phosphorylation sites in both models, the CNN and Hybrid CNN SVM. The window size of the PSSM profile is 17, 19, 21, 23, 25, 27, 29, 31 for the three residues S, T, Y of the protein .... 65

PAGE 11

IV.6 The mean of Elapsed Time results for all window sizes with a batch size of 50 and 20 epochs, using both models, the CNN and the hybrid CNN SVM .... 66

PAGE 12

CHAPTER I
Introduction

I.1 Background

Proteins exist in every biological system in the human body. The evolution of a protein during its journey gives it the strength and the capability to function and achieve certain goals in these systems. The first step in the journey of a protein starts in the DNA, which is transcribed to RNA and lastly translated to protein. However, this is not the whole story of how a protein is established. Throughout its journey, the protein passes through a series of transformations and changes in shape and function. Eventually, these changes lead to the final protein, which is essential to the life of the cell. The human body continuously generates many signals that are transferred between living cells.

Proteins are responsible for this kind of signaling, but how do proteins transmit these signals? The main operation in the protein is phosphorylation, which is a significant mechanism by which signals are transferred in both prokaryotic and eukaryotic organisms [6][7]. Phosphorylation in the protein molecule was first addressed in 1955 by Fischer and Krebs. It was discovered that approximately 30% of proteins in the human body have the ability to be phosphorylated [8]. They found evidence in a glycogen-degrading enzyme from rabbit skeletal muscle [9]. Kinase and phosphatase are the two main enzymes in the biological network that cause phosphorylation and reverse phosphorylation. They work in opposite ways: kinase is involved in the phosphorylation process, while phosphatase is associated with the dephosphorylation operation. Reversible phosphorylation is responsible for controlling most of the cellular processes of the cell, and these modifications affect at least 30% of proteins [10][11]. A protein can be phosphorylated at a single site or at multiple sites of its amino acid sequence, and the two cases have a different impact on the function of the protein, and thus of the entire cell. Protein phosphorylation mainly occurs on serine, threonine, and tyrosine residues and is caused by protein kinases [12]. On the other hand, reversible phosphorylation can change the architecture of numerous enzymes and receptors, which activates them ("turn on the state") or deactivates them ("turn off the state"). Although phosphorylation occurs mainly on serine, threonine, and tyrosine residues in eukaryotic proteins, it can also happen in prokaryotic proteins, especially on the basic amino acid residues histidine, arginine, or lysine [7]. A hydrophobic portion of a protein can be turned into a polar and highly hydrophilic one by adding a phosphate (-PO4) group to the polar alkyl radical (-R) group.

PAGE 13

Despite the rapid increase in the number of databases developed and released in the last decade for the identification of protein phosphorylation sites, a vast number of phosphorylation sites have not been annotated. Experiments conducted with 2-D gel electrophoresis indicate that 30% to 50% of the proteins in a eukaryotic cell undergo phosphorylation [13]. Therefore, making precise predictions of phosphorylation sites in eukaryotic proteins will help us better understand cellular events and molecular interactions. In the past decade, researchers have investigated phosphorylation sites with both experimental and computational methods in order to study this fundamental phenomenon comprehensively. Although in vivo and in vitro techniques identify phosphorylation sites with high accuracy, they suffer from several drawbacks: they are time-consuming and expensive, and they are limited by the enzymatic reactions involved [14][15]. Therefore, these methods are not suitable for processing thousands of proteins and detecting their phosphorylation sites in a short amount of time. Consequently, the research community has looked for an alternative approach that overcomes the disadvantages of the complicated experimental methods. Researchers came up with the idea of computational (in silico) prediction, which is useful and achieves a high level of prediction performance.

In summary, to better understand the biological behavior of cells in the human body and how cells are affected by phosphorylation, we have to identify phosphorylated proteins and their phosphorylation sites. Including kinase information and any other protein dependency in the design of the model makes the performance of an in silico prediction system better and more promising. Understanding the structure and the operation of cellular networks of protein kinase substrates is significant for building a remarkable prediction system [16]. Therefore, the motivation behind this thesis is to design and develop a novel in silico prediction system that can identify phosphorylation sites from any given protein sequence and that can also overcome kinase restrictions and other dependencies that may appear. In this chapter, brief background information is provided about proteins, their structures, and amino acids. The chapter also presents the definition of the problem and highlights the challenges. Finally, various recent research papers related to the thesis are surveyed.

I.2 Amino Acids

Amino acids are the primary building blocks of proteins. The human body uses amino acids to produce proteins, which are significant components of life. Amino acids are the monomers of polypeptides, the basic units that shape a protein. More than eighty amino acids are found in the cells of organisms, and twenty of them are the standard building blocks of proteins [17]. The reason

PAGE 14

why they are known as amino acids is that they have an amino group (-NH2) on one side and a carboxylic acid group (-COOH) on the other side. The gene that encodes a protein determines which specific amino acids the protein contains and the order of these amino acids. The chemical characteristics of the amino acids determine the biological energy and function of the protein. The general structure of each of the twenty amino acids contains a hydrogen atom (-H), an amino group (-NH2), a carboxylic group (-COOH), plus the alkyl radical group (-R), which identifies the type of the amino acid and can be of any length [17][18]. If the -R group, for instance, consists of only an H, the amino acid is called glycine and represents the simplest form of the amino acids, while the most complex form is known as tryptophan. The amino and carboxylic acid groups are linked to the same carbon, which is called the alpha carbon. The general formula of the amino acids is shown in Figure I.1.

Figure I.1: The comprehensive structure of alpha-amino acid, adapted from [1], where the carboxyl group is on the right and the amino group on the left side.

Amino acids join to form a short polymer chain called a peptide, or a long chain called a polypeptide or a protein [18]. These polymers are linear and unbranched. Translation is the primary process for making a protein: a ribosome adds amino acids one at a time to the growing chain [19]. The amino acids are added in the order read from the genetic code of an mRNA template, which is an RNA copy of one of the organism's genes. The standard genetic code specifies twenty amino acids, which are known as proteinogenic or standard amino acids. As stated above, amino acids play different roles in metabolism, which makes them essential to life. Besides being the building blocks of proteins, amino acids are important in other areas of biology, including as components of co-enzymes, as in S-adenosylmethionine, or as precursors for the biosynthesis of

PAGE 15

molecules such as heme. In short, amino acids are involved in every topic in biochemistry, including the nutrition of living cells [20].

I.3 Protein Definition

Proteins are organic compounds made of a linear chain of amino acids; they are also known as polypeptides. Peptide bonds link the amino acids together in a polymer chain; each bond joins the carboxylic acid group of one amino acid to the amino group of the next.

Proteins are incredibly important because they play a vital role in performing necessary functions in every cell of a living organism. A dipeptide is formed when two amino acids are combined. A chain of three to ten amino acids is called an oligopeptide. A chain of more than ten amino acids is known as a polypeptide. Lastly, if 50 or more amino acids come together in the same form, the chain is called a protein [21].

A gene defines a protein by specifying its series of amino acids through the genetic code. The residues of a protein can be modified chemically by Post-Translational Modification (PTM), which can occur during or after synthesis. This process can affect the chemical and physical properties of the protein, which may include folding, localization within the cell, function, stability, activation and deactivation, and distribution. Proteins are substantial components of an organism and are involved in practically every process in a cell, like other biological macromolecules such as polysaccharides and nucleic acids.

Figure I.2: The chain of amino acids that made up the polypeptide, which is a protein (the primary structure of the protein), adapted from [2].

PAGE 16

I.3.1 Structure of the Protein

The structure of a protein is significant for determining its fundamental and ultimate functions in the organism. Protein structure is classified into four hierarchical levels.

1. Primary Structure:
The primary structure is the first level in this design, and it is indicated as 1 because the sequence of amino acids in the protein is just a one-dimensional array. It runs from the N terminal to the end of the chain, which is the C terminal [18].

2. Secondary Structure:
The secondary structure describes the local arrangement of the amino acids in the protein. It results from folding parts of a protein molecule into sheets or bends, thus forming different shapes of proteins. This folding is caused by hydrogen bonds between amino groups in the sequence, which carry a positive charge, and keto groups, which are partially negative [18].
Four general kinds of secondary structure exist, depending on the type of the protein:
(a) α-helix
(b) β-strand
(c) turns
(d) coils
Some other kinds exist theoretically but very rarely appear in proteins.

3. Tertiary Structure:
A single protein chain can fold into a three-dimensional structure, which is known as the tertiary structure of the protein. It is driven by the compacting and twisting of the secondary structure on itself. The hydrophobic and ionic interactions between the R groups of the amino acids determine the tertiary structure by bringing these amino acids together in space [18].

4. Quaternary Structure:
The quaternary structure forms a single functional protein by combining several polypeptides. It can consist of several identical or non-identical protein subunits (polypeptides), which are called protomers. These subunits could be monomers, dimers, trimers, or any number [18].

PAGE 17

Figure I.3: The four standard structures of the protein, from left to right: Primary Structure, Secondary Structure, Tertiary Structure, and Quaternary Structure, respectively, adapted from [3].

I.3.2 Protein Phosphorylation

Protein phosphorylation is one of the most significant post-translational modifications in eukaryotes. By modulating protein function via the addition of a negatively charged phosphate group to a serine (Ser, S), threonine (Thr, T), or tyrosine (Tyr, Y) residue, phosphorylation regulates many cellular processes, including signal transduction, gene expression, cell cycle progression, cytoskeletal regulation, and apoptosis [22]. Protein phosphorylation is a Post-Translational Modification (PTM) that takes place on the polar R group of the protein. It occurs mainly on a serine, threonine, or tyrosine residue through the addition of a phosphate group (-PO4) by a protein kinase enzyme.

It was known as early as the 19th century that phosphates could be linked to proteins. The concept of protein phosphorylation, established by Edmond Fischer and Edwin Krebs, first appeared from the observation of a dual need for ATP and a converting enzyme, known as phosphorylase kinase, in the in vitro conversion of phosphorylase b to phosphorylase a [23][24]. Surprisingly, the PR enzyme, which is responsible for the reverse conversion of phosphorylase a to b, had been discovered earlier, in 1943, by Cori and Green [24]. The early examples of phosphoproteins were mostly found in milk and egg, and supplying phosphorus as a nutrient was considered to be their biological role.

PAGE 18

In the 1950s, new concepts of "phosphoproteins" began to arise as central regulators of cellular life. As we mentioned above, the starting point of this evolution occurred in 1954, when phosphorylation was described as a biological reaction that conveys a phosphate onto another protein [25]. Since then, many phosphorylation events have been described in eukaryotic cells, and they have been connected with different signaling and regulation events. The operation of phosphorylation at a protein site is further explained in Figure I.4. A liver enzyme was found to catalyze the phosphorylation of casein and later became known as a protein kinase. A significant contribution was made back then by Fischer and Krebs [16] and by Wosilait and Sutherland [26], who showed that an enzyme involved in the metabolism of glycogen was regulated by adding or removing a phosphate. At that time, this was the logical explanation for the control of enzyme activity by reversible phosphorylation. Later, it proved to be a valid idea, and it now applies to practically every aspect of cell biology.

Figure I.4: Phosphorylation and dephosphorylation rules by protein kinases and protein phosphatases, adapted from [4].

I.3.3 The Formal Definition of The Problem

The basic idea of the proposed prediction system is to receive a sequence of amino acids from the primary structure of a protein, which contains a series of serine, threonine, and tyrosine residues, individually

PAGE 19

or as a group. After receiving these amino acids, the system performs the prediction, which produces a sequence of amino acids with the same length as the input sequence but with annotations on every one of these three phosphorylation-worthy amino acids. The annotation is a negative or a positive sign on each serine, threonine, and tyrosine individually, where a '+' symbol indicates that the amino acid has been phosphorylated, while a '-' symbol signals that it has not been phosphorylated according to our prediction algorithm. The proposed system in this thesis can be viewed in Figure I.5.

Figure I.5: The illustration of the objective function of the proposed system.

I.3.3.1 The importance of determining phosphorylation sites in proteins

Protein phosphorylation is an important means of understanding protein-protein interactions in some biological circumstances, which adjust intracellular signaling pathways, the regulation of cell growth, and differentiation [27][28]. For instance, the glomerular podocyte protein nephrin1 is considered a critical protein of renal cells; Src Fyn can phosphorylate this protein by interacting with Grb2 [29].

Protein phosphorylation can regulate signal transduction, since it can shift the subcellular translocation of the phosphorylated protein by the same mechanism [29].

Phosphorylation at Ser350 of the serine/threonine kinase DRAK2 (death-associated protein (DAP) kinase-related apoptosis-inducing kinase 2) switches the kinase from the cytoplasm to the nucleus, which can prompt apoptosis in T and B cells [30][29].

PAGE 20

Phosphorylation of proteins is a very significant process for the energy reactions required in biology; thus it is mainly involved in producing and recycling ATP [31][29].

Phosphorylation of neurofilament proteins is a sensitive process because features of their structures are essential to the neuronal cytoskeleton. Abnormal phosphorylation of neurofilaments [32] can cause diseases and neurodegenerative conditions such as motor neuron disease, Parkinson's disease, and dementia. Therefore, to understand the role of neurofilament proteins in neurodegenerative diseases, it is highly significant to determine the endogenous phosphorylation of these sites.

It is estimated that one-third of human proteins are phosphorylated, and half of them are involved in cancer or other diseases [33]; therefore, it is important to design biomedical drugs based on a deep understanding of the mechanisms of phosphorylation dynamics, which requires identifying phosphorylation sites on protein substrates [34].

I.3.3.2 Phosphorylation site prediction by using in vivo or in vitro methods

There are several methods that depend on an in vivo or in vitro approach to determine the phosphorylation sites of a protein.

Phosphopeptide mapping of 32P-labeled proteins and peptides: When working with a specific protein, candidate phosphorylation sites can be proposed from sequence motifs by educated guessing. These residues can then be switched to non-phosphorylatable residues, and the resulting protein analyzed for the loss of a 32P-labeled phosphopeptide following digestion and two-dimensional (2-D) thin-layer chromatography [35][36]. This approach is heavily used but very labor-intensive, and it can fail to give conclusive results if the phosphorylated sites of the protein are unconventional.

The second method is mass spectrometry (MS), which has been widely applied to determine phosphorylation sites following enzymatic digestion of the protein.

PAGE 21

I.4 Literature Reviews

I.4.1 PhosPred-RF: A Novel Sequence-Based Predictor for Phosphorylation Sites Using Sequential Information Only

Many recent methods suffer from two types of limitation: they do not help with protein homology, and computing their features is time-consuming, which probably limits the use of the predictors in practical applications. This paper presents a simple, powerful, and fast feature representation algorithm, which sufficiently explores the sequential data from multiple perspectives based only on primary sequences and successfully captures the differences between true phosphorylation sites and non-phosphorylation sites. The authors use sequential information to generate discriminative feature representations from multiple perspectives and propose a random forest-based predictor named PhosPred-RF. Additionally, their feature analysis shows that the SkipF features and the TOBF features have more discriminative power for predicting phosphorylation sites than the other sequence-based features [37].

I.4.2 PhosphoPredict: A bioinformatics tool for prediction of human kinase-specific phosphorylation substrates and sites by integrating heterogeneous feature selection

The authors of this paper present PhosphoPredict, a novel bioinformatics tool that combines protein sequence and functional features to predict kinase-specific substrates and their associated phosphorylation sites for 12 human kinases and kinase families, including ATM, CDKs, GSK-3, MAPKs, PKA, PKB, PKC, and SRC. PhosphoPredict treats phosphorylation site prediction as a binary classification problem and uses an RF-based machine-learning approach to solve it. To elucidate critical determinants, the method extracts the most important and relevant features for each kinase family, which are the most informative for making the prediction. Benchmarking experiments based on both five-fold cross-validation and independent tests indicated that the performance of PhosphoPredict is competitive with that of several other prediction tools, including KinasePhos, PPSP, GPS, and Musite [38]. The authors found that phosphorylation site prediction can be improved by combining sequence features of the protein with functional features. In brief, this paper is useful for biological researchers because it provides an analysis tool for identifying the essential elements of kinase family substrates and for phosphorylation site prediction [39].

PAGE 22

I.4.3 RF-Phos: A Novel General Phosphorylation Site Prediction Tool Based on Random Forest
The authors of this paper developed a general phosphorylation site prediction method, named RF-Phos 2.0, which uses Random Forest (RF) to integrate various sequence- and structure-based attributes to identify phosphorylation sites in proteins given only the primary amino acid sequence as input. They report that the RF algorithm allowed them to calculate the relative importance of each feature, revealing that Shannon entropy (H), relative entropy (RE), quasi-sequence order (QSO), sequence order coupling number (SOCN), and composition, transition, and distribution (CTD) are some of the most critical features for phosphorylation site prediction with their method. They do not use position-specific scoring matrices (PSSMs) to extract features because of the time and computational cost involved [40].
I.4.4 Prediction of phosphorylation sites based on Krawtchouk image moments
This paper introduces a new scheme named KMPhos (Krawtchouk image moments) for the theoretical prediction of phosphorylation sites. First, the authors create a numerical matrix from the sequence of the protein fragment. The matrix is obtained by replacing each residue of the protein sequence with the chemical descriptors of the amino acid, which describe the chemical state of the protein fragment, thereby producing a two-dimensional grayscale image. After the Krawtchouk image moments are calculated, a support vector machine is used to build the prediction model. The authors used 10-fold cross-validation on the training set to test the system and measure its accuracy. They claim that their method reaches a high accuracy on the independent test compared with other methods.
I.4.5 PhosContext2vec: a distributed representation of residue-level sequence contexts and its application to general and kinase-specific phosphorylation site prediction
In this paper, the authors introduce a new sequence representation to address the prediction of both kinase-specific and general phosphorylation sites. They use support vector machines to classify the data and make the prediction. Their method aims to improve classification performance by combining a new sequence context representation with other informative residue-level features. The distributed representation is generated automatically by a pre-trained feature extraction model, which incorporates patterns from all available cases in protein sequence databases. The contextual feature vector is a one-dimensional, real-valued vector of constant size; it is therefore convenient to use together with other feature vectors. A property prediction of

PAGE 23

any level of protein residue that relies on the extraction of local sequence contexts can use the contextual feature vector. The authors employed contextual feature vectors based on the distributed representation to predict both general and kinase-specific phosphorylation sites. The PhosContext2vec approach achieved a very good performance for predicting Y phosphorylation sites and for kinase families such as AGC/PKC and CMGC/CDK. In addition, their model shows promising performance for S and T phosphorylation sites and for the AGC/PKA and TK/SRC kinase families [41].
I.4.6 MusiteDeep: a deep-learning framework for general and kinase-specific phosphorylation site prediction
The authors of this paper used deep learning tools for the first time to predict the phosphorylation sites of a protein. They use a deep learning framework for both general and kinase-specific phosphorylation prediction. This method is a new version of the approach they presented in 2010 [38]. Their design differs from existing phosphorylation site prediction techniques in that it does not rely on extracting features of the protein and instead predicts directly from the raw protein sequences. They use a multi-layer convolutional neural network (CNN) but do not include a pooling layer in the architecture. They performed five-fold cross-validation to evaluate their method, and their approach shows better performance than the deep-learning architectures it was compared against [42].
I.5 Our contribution
The main objective of this thesis is to design a new in silico phosphorylation prediction system. The specific objectives of the thesis are:
• Designing a new prediction system that overcomes the dependency on kinase annotations of proteins by incorporating only the information of a PSSM and other features of proteins. Most existing systems rely on kinase-specific information, which is a disadvantage when predicting sites in proteins without kinase annotations. These systems show bias toward phosphorylation sites with a particular type of kinase annotation. The proposed method considers only the evolutionary information (the PSSM profile) of the phosphoproteins rather than any kinase-specific groupings. The PSSM profile, generated for every protein sequence by PSI-BLAST (Position-Specific Iterated Basic Local Alignment Search Tool) of NCBI, describes the likelihood of a particular residue substitution at a specific position based on evolutionary information [43][44], and it provides more comprehensive information about proteins than a single sequence.

PAGE 24

• Designing an in silico phosphorylation prediction system that predicts phosphorylation sites accurately from given protein sequences in a less costly, less laborious, and less time-consuming manner.
• Developing a database of all known and unknown phosphorylation sites of the proteins, which includes the PDB information for each protein. This database will be useful for statistical analysis and computational learning in further research.
• Validating and examining in depth the hypothesis on the importance of evolutionary information in designing the novel prediction system, using multiple deep learning algorithms.
• Comparing the proposed method with some of the popular existing kinase-specific prediction systems.

PAGE 25

CHAPTER II
PROTEIN PHOSPHORYLATION CLASSIFICATION AND DEEP LEARNING ALGORITHMS
The similar molecular mechanisms observed across thoroughly studied organisms suggest that all organisms on Earth share a common ancestor, so all species are linked to each other. These observations reveal evolutionary relations between a particular mechanism of an organism and its ancestral counterparts [45]. For phosphorylated proteins, such relations are a valuable source of information for separating non-phosphorylated from phosphorylated proteins, by matching proteins against this evolutionary information, known as evolutionary profiles. This chapter discusses how evolutionary information can be derived from protein sequences and which algorithm is best suited to extract it for classifying non-phosphorylated and phosphorylated proteins. It also explains in detail the algorithms used in the thesis.
II.0.1 Extracting Evolutionary Information of Phosphoproteins by using PSI-BLAST
This section explains the PSI-BLAST program, which applies the Position Specific Scoring Matrix profile and iteration to improve sensitivity in finding homology, especially for remotely related protein sequences.
II.0.1.1 BLAST Algorithm
BLAST is a search algorithm that finds a match between the query sequence and the database and then extends the match in both directions. The output of the search consists of the database sequences that share highly related regions with the query, along with their flanking non-matching regions, and a scoring scheme that describes the degree of relatedness between the query and each database hit. The three phases of BLASTP (the protein-protein BLAST algorithm) are described below:
Phase 1: BLAST compiles a preliminary list of pairwise alignments, named word pairs.
Phase 2: The algorithm then scans the database for word pairs that meet a threshold score of t.
Phase 3: BLAST extends the word pairs to find those that surpass the cutoff score; only these hits are reported to the user. These scores are calculated by scoring

PAGE 26

matrices such as BLOSUM62, a Blocks Substitution Matrix (BLOSUM) [46] that is publicly available for scoring protein sequence alignments, together with a fixed gap penalty.
Figure II.1: Graphical diagram of the original BLAST algorithm, adapted from [5]

PAGE 27

II.0.1.2 PSI-BLAST algorithm
This program can be used to find proteins in a large database that are distantly related to the query. Initially, a list of all closely related proteins is produced. These proteins are then combined into a general profile sequence that summarizes the essential features present in these sequences. The protein database is then queried with the profile, which results in a larger group of proteins. This larger group is used to create another profile, and the procedure is iterated.
By including related proteins in the search, PSI-BLAST is much more sensitive than the standard protein-protein BLAST (BLASTP) in picking up distant evolutionary relationships. Conservation patterns such as the Position Specific Scoring Matrix (PSSM), identified by aligning the related sequences, help in recognizing similarities, and this ability is improved further by iterating the search procedure. Position-Specific Iterated BLAST (PSI-BLAST) [43] was developed for this purpose and offers the advantages of simplicity, speed, and automation.
Figure II.2: A flowchart explaining the steps of the PSI-BLAST program that generate a Position Specific Scoring Matrix (PSSM) profile.
Step 1: A standard BLAST search is performed against the database using a substitution matrix (e.g., BLOSUM62).

PAGE 28

Step 2: A PSSM is built automatically from the multiple alignment of the hits of the original BLAST search, or from the hits of the last round of homology searching. Highly conserved positions receive high scores, while weakly conserved positions receive low scores.
Step 3: The latest PSSM replaces the initial matrix (e.g., BLOSUM62) or the PSSM of the previous round in the next BLAST search.
Step 4: Repeat step 2 and step 3; the newly found sequences are included when building the new PSSM.
Position-Specific Iterated BLAST (PSI-BLAST) builds on the features of BLAST 2.0: the profile, or Position Specific Scoring Matrix (PSSM), is formed automatically from a multiple alignment of the highest-scoring hits of the initial BLAST search. The PSSM is formed by calculating position-specific scores for every position in this alignment. The profile is used to perform the second (and each subsequent) BLAST search, and the results of every iteration are used to refine the profile [5].
II.0.1.3 Construction of PSSM from PSI-BLAST
PSI-BLAST builds a PSSM in every round of the iteration. The PSSM is based on the residue frequencies at each position of the multiple sequence alignment (MSA). The score depends on which residues are present at that position of the alignment. When a residue is highly conserved at a position, it is assigned a relatively high positive score, and the other residues are assigned strongly negative scores. At weakly conserved positions, all residues receive scores close to zero. The power of PSSMs comes from two sources [43]. The first is the improved estimation of the probabilities with which amino acids occur at the different positions of the pattern, which leads to a more sensitive scoring system. The second is the relatively precise definition of the essential motifs.
For proteins, a query sequence of length L and a 20 × 20 substitution matrix are replaced by a PSSM of dimension L × 20.
The query sequence serves as the master sequence for building the multiple sequence alignment from the last round of the iteration, and all alignments with an E-value below the threshold are retained.
The E-value is defined as the expected number of non-homologous sequences with a score greater than or equal to a score $x$ in a database of $n$ sequences:
\[ E(x, n) = n \cdot P(S \ge x) \tag{II.1} \]

PAGE 29

If the E-value is 0.01, for instance, then the expected number of random hits with a score $S \ge x$ is 0.01, which means that such a hit is expected by chance only once in 100 independent database searches. If the E-value of a hit is 5, then five chance hits with $S \ge x$ are expected in a single database search, which indicates that the hit is not significant.
E-values are expected values: they indicate how often an outcome as good as the one observed could be reached by chance alone. This value has no biological meaning on its own. The logic behind its use in biology is that an outcome too good to arise by chance is attributed to a real biological relationship. The lower the E-value, the better.
The Extreme Value Distribution (EVD) is the basis for assessing the scores obtained from searching a database with a query sequence. Raw alignment scores are converted through the EVD into the E-value statistic, which accounts for the database size, the amino acid composition, and the scoring scheme.
With parameter values $\mu = 0$ and $\lambda = 1$, the extreme value distribution gives $y = 1 - e^{-e^{-x}}$ for the probability of a score of at least $x$, where $\mu$ is the characteristic value and $\lambda$ is the decay constant. A database search selects the best alignment, that is, the optimal alignment of two sequences. An important feature of the EVD is that its right tail falls off more slowly than its left tail. As a result, compared with the normal distribution, an alignment has to score further above the expected mean value to become a significant hit.
Following the EVD, the probability that a score $S$ is at least a given value $x$ can be approximated as:
\[ P(S \ge x) = 1 - e^{-e^{-\lambda (x - \mu)}} \tag{II.2} \]
where $\mu = \ln(Kmn) / \lambda$, and $K$ is a constant that can be calculated from the background amino acid distribution and the scoring matrix [43][44].
Substituting this expression for $\mu$, the probability for the raw alignment score $S$ becomes:
\[ P(S \ge x) = 1 - e^{-Kmn\, e^{-\lambda x}} \tag{II.3} \]
In practice, the probability $P(S \ge x)$ is computed with the approximation $1 - e^{-e^{-x}} \approx e^{-x}$, which is accurate for large values of $x$. This simplifies the equation for $P(S \ge x)$:
\[ P(S \ge x) \approx e^{-\lambda (x - \mu)} = Kmn\, e^{-\lambda x} \tag{II.4} \]
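Equations II.1, II.3 and II.4 can be checked numerically with a short Python sketch. This is only an illustration under stated assumptions: the function names are hypothetical, the parameters K, m, n and lam are treated simply as the factors that appear in Equation II.3, and any values passed in are placeholders rather than constants fitted to a real scoring matrix.

```python
import math

def tail_probability(x, K, m, n, lam):
    """Exact EVD tail, Equation II.3: 1 - exp(-K*m*n*exp(-lam*x))."""
    # -expm1(-y) equals 1 - exp(-y) and stays accurate when y is tiny.
    return -math.expm1(-K * m * n * math.exp(-lam * x))

def tail_probability_approx(x, K, m, n, lam):
    """Large-score approximation, Equation II.4: K*m*n*exp(-lam*x)."""
    return K * m * n * math.exp(-lam * x)

def e_value(probability, n_sequences):
    """Equation II.1: expected number of chance hits in n_sequences."""
    return n_sequences * probability
```

For high scores the two probability functions agree closely, which is why the simpler form of Equation II.4 is sufficient when ranking significant hits.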

PAGE 30

So, the larger the score for a given threshold value $x$, the lower the probability, and hence the E-value, should be.
The observed frequency of each residue at each position is compared with the frequency at which that residue is expected in a random sequence. The score is determined from the ratio of the observed to the expected frequency. More specifically, the logarithm of this ratio is taken and referred to as the log-likelihood ratio:
\[ S_{ij} = \log_2 \frac{Q_{ij}}{P_i} \tag{II.5} \]
where $S_{ij}$ is the score in bits for residue $i$ at position $j$, $Q_{ij}$ is the estimated probability of finding residue $i$ at position $j$, and $P_i$ is the background probability of residue $i$ in a random sequence.
II.0.1.4 The Importance of using PSSM profiles as Features for Phosphorylation Classification
Protein phosphorylation is a post-translational modification (PTM) that can strongly affect the function of the proteins in which it occurs. Three types of amino acid can be affected by this change: tyrosine, serine, and threonine residues. PSI-BLAST provides valuable information about the evolution of a protein across generations, from its origin to the present. The evolutionary profile of protein sites describes the inherited tendency of the descendants of a protein to be phosphorylated. Thus, the PSSM profiles of proteins can hold critical information for determining the patterns of each of the three amino acids (S, T, Y) across generations, which separate non-phosphorylation from phosphorylation sites. If a particular ancestor protein has a conserved pattern surrounding its phosphorylation sites, and the descendants of this protein have a similar conserved pattern, then it is possible to predict that the descendant proteins will be phosphorylated at those sites. The same observation also applies to non-phosphorylation sites. If the patterns of phosphorylation and non-phosphorylation sites can be detected by a deep learning approach or any machine learning algorithm, they become important information for the prediction system.
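The log-likelihood score of Equation II.5, on which these profiles are built, can be sketched for a single alignment column as follows. This is a simplified illustration, not the PSI-BLAST implementation: the function name is hypothetical, the probabilities Q_ij are estimated with plain add-one pseudocounts (an assumption made here for brevity), and the background probabilities P_i are supplied by the caller.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def log_odds_column(column, background, pseudocount=1.0):
    """Score every amino acid for one alignment column (Equation II.5).

    column      -- string with the residues observed at this position
    background  -- dict mapping each amino acid to its probability P_i
    pseudocount -- smoothing constant (assumption, not PSI-BLAST's scheme)
    """
    # Ignore gaps or other symbols that are not standard amino acids.
    residues = [aa for aa in column if aa in AMINO_ACIDS]
    counts = Counter(residues)
    total = len(residues) + pseudocount * len(AMINO_ACIDS)
    scores = {}
    for aa in AMINO_ACIDS:
        q = (counts.get(aa, 0) + pseudocount) / total  # estimate of Q_ij
        scores[aa] = math.log2(q / background[aa])     # S_ij in bits
    return scores
```

With a uniform background, a column dominated by serine yields a positive score for S and negative scores for residues that never appear, which mirrors the conserved-versus-unconserved behaviour described in this section.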
PSI-BLAST produces the PSSM profile from the frequencies of each residue at a specific position of the multiple sequence alignment against a non-redundant set of all proteins. The PSSM profile captures distant evolutionary relationships among proteins. In the PSSM profile, high scores are given to highly conserved positions, while weakly conserved

PAGE 31

positions may get scores near zero or negative. Therefore, such patterns of conserved positions can be readily identified by using the PSSM profile together with deep learning algorithms.
For example, PSI-BLAST can be run on the 27-residue fragment KTEPMRRSVSEAALTQPEGPLGTDSLK of protein P16386 against a non-redundant protein database with an E-value of 0.001. The PSSM profile created after three iterations is shown in Figure 4.6.
If PSSM values close to the phosphorylation sites of a protein, within a specific window, are found to exist in the profile of another protein, then it is highly likely that phosphorylation occurs in the second protein as well. The same technique can be applied to identify the non-phosphorylation sites of proteins.
II.0.2 Deep learning
Deep learning is a recent advanced approach that arose from machine learning techniques. It is inspired by the neural systems of the human brain and refers to artificial neural networks built from units called perceptrons [47]. The human brain has billions of neurons, which are its crucial processing units; each of them can receive information from and send information to thousands of other neurons. For many years, the number of layers in a neural network was limited by the architecture of the network: as layers were added and the network became deeper, training ran into several issues and became unstable. The hardware revolution of recent years, together with large datasets publicly available on the internet, has made deep learning practical in many fields. Computer vision is the most significant field in which deep learning has been applied successfully, and it has benefited from one type of network in particular, the convolutional neural network (CNN, ConvNet). A CNN tends to learn patterns similar or close to those of human vision by exploiting the properties of the first layer of the network [48]. The use of neural networks increased dramatically in 2012, when Krizhevsky et al. [49] won the ImageNet Large Scale Visual Recognition Competition (ILSVRC) [50] by using a neural network for image classification. Since
that time, neural networks have become a widely used and primary tool in many fields.
The Convolutional Neural Network (CNN) is an advanced type of deep neural network that uses convolution filters in place of purely linear computation. A CNN is made up of perceptrons, which are modeled on the neurons of the human brain. Each perceptron in the network has learnable weights and biases. A dot product is taken between the inputs and their corresponding weights, and the output results from applying a non-linear function to the dot product. The output of a neuron in one layer is fed into a neuron of another layer. This process continues from one

PAGE 32

layer to another, and the weights and biases of each layer are learned during this procedure. Each layer of a neural network applies a specific function to the output of the previous layer. The computation is done by the hidden layers that lie between the input and output layers of the network. The main job of the hidden layers is to apply convolution and pooling operations and to modify the weights and biases of the network based on the inputs. The convolution is computed over two dimensions to produce an output called a feature map [51]. The CNN has three main concepts: local receptive fields, shared weights, and pooling [52][53].
Convolutional Neural Networks (CNN) mainly consist of the following types of layers, as shown in Figure II.3:
Figure II.3: A flowchart illustrating the main steps of building the Convolutional neural network (CNN).
(i) Input Layer: This is the first layer of the CNN. It holds the input matrix of the network, such as an image given as raw pixel values, and is the entrance to all other segments of the system.
(ii) Convolutional Layer: This layer calculates the activations of the perceptrons that are connected to receptive fields of the previous layer. The parameters of a convolutional layer include:
(a) Number of outputs: the output items, which are the input for the next layer.
(b) Kernel size: controls the spatially local region of the input volume covered by the filter.
(c) Stride: the pixel shift of the sliding window. For instance, if the stride is one, the filter moves one pixel at a time. Similarly, the filter shifts two pixels when the stride is two, and so on.

PAGE 33

(d) Padding: this handles the case where the filter does not fit the input image perfectly. There are two options: either pad the picture with zeros, or keep only the valid part of the image by dropping the part of the input image that the filter does not fit.
(iii) Pooling Layer: It is principally used to reduce the spatial size of the input when the image is large. The layer divides every feature map into a set of rectangles that hold the significant information and applies the maximum, minimum, average, sum, or another pooling function to each of them. For instance, average pooling takes the average of all elements of each rectangle of the rectified feature map.
(iv) Fully-connected Layer (FC layer): This is the last stage of the convolutional neural network and usually consists of several fully connected layers. Neurons in these layers are fully connected to all activations of the previous layer [54]. The input of this layer is a vector obtained by flattening the feature map matrix of the convolution layer, which concatenates all the features of the input image into a single input vector. This vector is fed into a fully connected layer whose weights are initially random. The features are combined to build the model of the network, and then an activation function such as softmax or sigmoid is applied to classify the outputs as the final results of the model [54].
II.0.3 Summary
This chapter summarized the concepts of extracting evolutionary information of proteins with sequence alignment algorithms. It also explained how PSSM profiles are obtained from the PSI-BLAST program, which is robust for querying a massive protein database and scoring the resulting alignment for each amino acid. Additionally, the chapter gave a brief introduction to deep learning and convolutional neural networks (CNN).
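The convolution, stride, padding and pooling parameters introduced in this chapter can be illustrated with a small pure-Python sketch. It is a teaching aid written under simplifying assumptions (a single channel, a square kernel, nested lists instead of tensors), and all function names are hypothetical; a real system would rely on a deep learning library instead.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output length along one axis: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

def zero_pad(image, padding):
    """Surround a 2-D list with `padding` rows and columns of zeros."""
    width = len(image[0]) + 2 * padding
    blank = [[0.0] * width for _ in range(padding)]
    body = [[0.0] * padding + list(row) + [0.0] * padding for row in image]
    return blank + body + [row[:] for row in blank]

def conv2d(image, kernel, stride=1, padding=0):
    """Slide a square kernel over the image (no kernel flip, as in CNNs)."""
    image = zero_pad(image, padding)
    k = len(kernel)
    out_h = conv_output_size(len(image), k, stride)
    out_w = conv_output_size(len(image[0]), k, stride)
    return [[sum(image[i * stride + a][j * stride + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(out_w)]
            for i in range(out_h)]

def relu(feature_map):
    """Element-wise rectified linear activation."""
    return [[max(v, 0.0) for v in row] for row in feature_map]

def max_pool(feature_map, size=2, stride=2):
    """Keep the maximum of each size x size rectangle."""
    out_h = conv_output_size(len(feature_map), size, stride)
    out_w = conv_output_size(len(feature_map[0]), size, stride)
    return [[max(feature_map[i * stride + a][j * stride + b]
                 for a in range(size) for b in range(size))
             for j in range(out_w)]
            for i in range(out_h)]
```

With a 3 × 3 kernel, stride 1 and padding 1, the feature map keeps the size of its input, and a following 2 × 2 pooling with stride 2 roughly halves each dimension.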

PAGE 34

CHAPTER III
SUMMARY OF THE PROPOSED PREDICTION METHOD
III.1 Summary of the Proposed Prediction System
This chapter summarizes the whole design of the proposed prediction system, including the design of a database for the entire system using phpMyAdmin, and how the performance of the system was assessed.
III.2 Materials
Three core materials are used in the proposed prediction system: the evolutionary information of proteins, the convolutional neural network (CNN), and the Support Vector Machine (SVM). The evolutionary information of proteins forms the classification training data for the machine. The CNN extracts the essential features of the two-dimensional PSSM array and builds a higher-dimensional hyperplane that separates the two classes, in order to determine the phosphorylated and non-phosphorylated sites in a given protein sequence. The Support Vector Machine (SVM) is applied after the features have been extracted by the CNN.
III.2.1 Evolutionary Profile as Foundation for System Design
The proposed system incorporates the evolutionary feature of phosphorylation sites known as the position-specific scoring matrix (PSSM). When a multiple sequence alignment of the phosphorylated proteins is performed against the nr (non-redundant) dataset of proteins, a score is generated for each of the twenty amino acids at each position of the target protein. These scores represent the evolutionary conservation among the members of the protein's lineage and offspring. The PSSM profile of the protein presents this information as a matrix. PSSM scores across a particular window around the phosphorylated residues tend to be similar for proteins of the same evolutionary lineage. Consequently, the scores obtained from the PSSM profiles of phosphorylated proteins are a beneficial and promising source of classification data for the proposed prediction system [15].
The PSSM profiles of phosphorylated and non-phosphorylated proteins were generated using both the PSI-BLAST database and a web server named 3DCONS-DB [55]. The PSI-BLAST program is used to create the PSSM profile for each protein accession number in UniProtKB [56]. The 3DCONS-DB, on the other hand, is used to retrieve the PSSM profiles for all Protein Data Bank codes (PDB ids) of the protein [57]. This web server was chosen over other techniques for creating PSSM profiles because it covers the whole protein. The 3DCONS-DB is a database that computes the PSSM profile over the entire sequence of each protein collected from the Protein Data Bank web server. The main difference between this method and other existing methods is that this database covers both the region and non-region parts of the protein. Moreover, it is much faster

PAGE 35

to retrieve PSSM profiles than other methods, and it is an invaluable resource for avoiding the recalculation of PSSM profiles for PDB entries.
III.2.2 Convolutional neural network (CNN)
The proposed system uses the TensorFlow library, which is the main library for the convolutional neural network (CNN), along with other libraries in Python 3.7 to predict phosphorylation sites. The hyperparameters used in every CNN architecture for the phosphorylation site prediction of the proteins are listed in Table III.1.

Layer                          Hyperparameter          Value
Multiscale CNN Conv Layer 1    Size of the filter      3 × 3
                               Number of the filters   20
                               Stride                  1
                               Padding                 1
                               Activation Function     ReLU
Max Pooling                    Size                    2 × 2
                               Stride                  2
                               Activation Function     SoftMax

Table III.1: Hyperparameters of the CNN

PAGE 36

III.2.3 Hybrid CNN Support Vector Machines (SVM) classifier
In this model, the CNN works as a trainable feature extractor for the PSSM profile of the protein, and the Support Vector Machine (SVM) acts as a recognizer of the protein's features that separates the phosphorylation sites from the non-phosphorylation sites. This hybrid model automatically extracts features from the last layer of the CNN and makes the prediction.
III.3 Methodology
This section describes the approaches devised in this study to complete the design of the proposed prediction system, including selecting a valuable dataset and designing the database that holds the whole system.
III.3.1 Datasets
Phospho.ELM version 9 is the main publicly available source of the dataset used in this study [58]. This version of Phospho.ELM, released in September 2010, is the latest one. It is a real dataset from high-throughput laboratory experiments. Phospho.ELM contains 8,718 substrate proteins in total, which include 31,754 serine, 7,449 threonine, and 3,370 tyrosine instances. Phospho.ELM was then divided into two subsets: a positive dataset (phosphorylation sites), referred to as the A+ dataset, and a negative dataset (non-phosphorylation sites), referred to as the A− dataset. The A+ dataset contains the protein entries together with the annotations of their phosphorylation sites. These annotations were extracted for the phosphorylated serine, threonine, and tyrosine sites. The objective of this thesis is to classify the most frequently occurring phosphorylated sites, which are serine, threonine, and tyrosine residues, so histidine was excluded from these experiments. The A− dataset represents the non-phosphorylation sites of the proteins; it is obtained by extracting all serine, tyrosine, and threonine residues of the original Phospho.ELM dataset that do not exist in the A+ dataset.
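The rule used to derive the negative set from the positive one can be sketched in a few lines of Python. The function name, the 1-based position convention, and the toy sequence in the usage note are assumptions made for illustration; they are not taken from Phospho.ELM.

```python
TARGET_RESIDUES = {"S", "T", "Y"}

def split_sites(sequence, phospho_positions):
    """Split the S/T/Y residues of one protein into positive and negative sites.

    sequence          -- amino acid sequence as a string
    phospho_positions -- set of 1-based positions annotated as phosphorylated
    Returns two lists of (position, residue) tuples.
    """
    positive, negative = [], []
    for position, residue in enumerate(sequence, start=1):
        if residue not in TARGET_RESIDUES:
            continue  # only serine, threonine and tyrosine are considered
        if position in phospho_positions:
            positive.append((position, residue))
        else:
            negative.append((position, residue))
    return positive, negative
```

For the toy sequence "MSTAYKS" with annotated positions {2, 5}, the serine at position 2 and the tyrosine at position 5 go to the positive set, and the threonine at position 3 and the serine at position 7 go to the negative set.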
The second dataset was collected randomly but carefully from the original Phospho.ELM version 9 dataset. This dataset, referred to as the B dataset, contains 350 complete protein accessions of UniProtKB [56], which is a searchable web server of genes and proteins. UniProtKB is a collaboration between organizations around the world: the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein Information Resource (PIR). The dataset was chosen carefully so that its proteins contain both phosphorylation sites and non-phosphorylation sites. For evaluation purposes, the dataset was selected to be accurate and valuable for training and testing the model. The dataset has 7,265 serine, 4,129 threonine, and 2,351

PAGE 37

tyrosine residues. The B dataset is divided into two datasets: a positive dataset, B+, where phosphorylation occurs, and a negative dataset, B−, where phosphorylation does not take place according to the source dataset Phospho.ELM. The B+ dataset contains 2,051 positive sites, including 1,421 serine, 380 threonine, and 250 tyrosine residues. The B− dataset has 5,844 serine, 3,749 threonine, and 2,101 tyrosine residues, 11,694 in total. The B dataset is summarized in Table III.3.

Residue    Number
S          31,754
T          7,449
Y          3,370

Table III.2: Dataset Phospho.ELM version 9

Residue    Positive    Negative
S          1,421       5,844
T          380         3,749
Y          250         2,101

Table III.3: Dataset B

III.3.2 Database
Relational databases have been in use for a long time; they changed the way data is represented and stored. This study therefore started by storing the dataset in a relational database, in order to learn about the data and how to use it in three-dimensional form for analysis purposes. The database stores many features of each protein and maps the accession ID of the protein to the Protein Data

PAGE 38

Bank (PDB) to extract the PDB ID. For this purpose, a MySQL relational database management system was used through phpMyAdmin [59]. The initial aim was to create a protein data application and represent the data in the phpMyAdmin server. After several attempts, it turned out that this data application was not appropriate for the project and did not serve the goal of analyzing big data. The data application was therefore changed to the Protein Data Bank Tree application, which is a three-dimensional database, as mentioned above. This application was used because it can be represented in the relational database management system and can be useful for further studies in the future. The database was created from scratch by the author for this thesis. The design of the database for the protein system, named PPREDV2, consists of multiple tables and can be expanded depending on the purpose of the study and the research.
The original database has the following tables:
1. Sesquence_table: It has three columns, which represent each entity in the table:
id: a unique number in each table.
accession: represents the id of each protein.
seq_path: this field represents the relative path of the protein sequence, which is stored as a FASTA file in a directory on the computer.
2. phospho_position: It contains the information about the phosphorylation sites of the protein for each residue (Serine, Tyrosine, and Threonine) with its position. This table has four fields:
id: a unique number for this table.
accession: represents the id, that is, the name of each protein in the dataset.
position: shows the location in the protein sequence of each residue (S, Y, T) that has been phosphorylated.
AA: represents the amino acid letter (S, Y, or T), standing for Serine, Tyrosine, or Threonine in the protein sequence.
3. nonphospho_position: It contains the information about the non-phosphorylation sites of the protein for each amino acid (Serine, Tyrosine, and Threonine) with its position. Like the previous table, it has four fields:
id: a unique number for this table.

PAGE 39

accession: identifies the id of each protein in the dataset.
position: shows the location in the protein sequence of each residue (S, Y, T) that is not phosphorylated.
AA: represents the amino acid letter (S, Y, or T), standing for Serine, Tyrosine, or Threonine in the protein sequence.
4. pdb_accession: It contains the information that maps the accession id of the protein in the dataset to the PDB ID, which is the ID of the protein in the Protein Data Bank. This table has three fields:
id: a unique number for this table.
accession: represents the id of each protein in the dataset.
pdb_id: identifies the protein code, which is the 4-character unique identifier of every entry in the Protein Data Bank. Each protein accession id has one or multiple PDB ids, which represent the structure of the protein in the Protein Data Bank.
5. structure: This table contains the secondary structure of the protein, which represents the three-dimensional form of local segments of proteins. This table has five fields:
id: a unique identity for this table.
pdb_id: represents the name of the protein in the Protein Data Bank. PDB identifiers usually consist of a 4-character alphanumeric identifier and the protein chain. For instance, 102L is a PDB identifier.
letter_code: represents a chain of the protein for each PDB identifier; each PDB entry has one or multiple chains depending on the structure.
seq_path: this field represents the relative path of the sequence of this PDB identifier.
struct_path: this field represents the relative path, in the directory on the computer, of the files that contain the secondary structure sequence of this PDB identifier.
6. PSSM: This table contains the information representing the position-specific scoring matrix. This table has three fields:
id: a unique key in this table.


   pdb_id: the name of the protein in the Protein Data Bank. PDB identifiers usually consist of the 4-character identifier.
   pssm_path: the relative path, in the directory on the computer, of the PSSM files, which were retrieved from the 3D-CONS web server. These files contain the PSSM profiles for each PDB id in the Protein Data Bank along with their chains.

7. PSSM_pid: This table contains the information about the position-specific scoring matrix for each of the 350 protein accessions of the B dataset. This table has three fields, which are:
   id: a unique key in this table.
   accession: the accession number of the protein in UniProtKB. The accession is the single alphanumeric identifier of each protein in the UniProtKB central database.
   pssm_path: the relative path, in the directory on the computer, of the PSSM files, which were retrieved by querying the PSI-BLAST database. These files contain the PSSM profiles for each of the 350 protein accessions of the B dataset.

III.3.3 Positive Dataset Preparation

PSSM profiles of all the proteins of the A+ dataset were generated using the 3D-CONS web server. The A+ dataset consists of all PDB IDs along with a chain letter. The Protein Data Bank (PDB) is the worldwide repository for all atomic-resolution protein structures and is accessed by hundreds of thousands of different visitors and researchers across the world per year. The Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB) stores files that describe the three-dimensional structure of proteins and other macromolecules along with additional features and attributes. Each of the four structure levels of a protein molecule (primary, secondary, tertiary, and quaternary) is represented in the PDB as a three-dimensional spatial arrangement. Each protein accession in the A dataset can have one or multiple PDB codes, depending on the complexity of the protein structure and the significant diversity in the types of structures in the PDB. Many PDB ids belong to the same protein structure or its homologs, and many of them come from the same protein fold. Every protein accession has one or multiple PDB ids, as mentioned above. These codes represent the sub-families of the original one, because each protein molecule can consist of more than one sub-protein in its structure. Also, each PDB code has one or multiple chains that bond or connect with other PDB entries in the structure.
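The one-to-many relationships described here (one accession to several PDB ids, one PDB id to several chains) can be illustrated with a small query. The following is a minimal sketch, assuming an in-memory SQLite database as a stand-in for the MySQL server used in the thesis. The table and column names follow the descriptions of pdb_accession and structure; the function names, the accessions X00001 and X00002, the PDB ids 1ABC and 2XYZ, and all file paths are hypothetical placeholders, not data from the thesis.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two of the tables described in the text.

    SQLite is an assumption made for this sketch; the rows are placeholders.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE pdb_accession (
            id INTEGER PRIMARY KEY,
            accession TEXT,
            pdb_id TEXT
        );
        CREATE TABLE structure (
            id INTEGER PRIMARY KEY,
            pdb_id TEXT,
            letter_code TEXT,
            seq_path TEXT,
            struct_path TEXT
        );
        """
    )
    conn.executemany(
        "INSERT INTO pdb_accession (accession, pdb_id) VALUES (?, ?)",
        [("X00001", "102L"), ("X00001", "1ABC"), ("X00002", "2XYZ")],
    )
    conn.executemany(
        "INSERT INTO structure (pdb_id, letter_code, seq_path, struct_path) "
        "VALUES (?, ?, ?, ?)",
        [
            ("102L", "A", "seq/102L_A.fasta", "struct/102L_A.ss"),
            ("1ABC", "A", "seq/1ABC_A.fasta", "struct/1ABC_A.ss"),
            ("1ABC", "B", "seq/1ABC_B.fasta", "struct/1ABC_B.ss"),
            ("2XYZ", "A", "seq/2XYZ_A.fasta", "struct/2XYZ_A.ss"),
        ],
    )
    conn.commit()
    return conn


def chains_for_accession(conn, accession):
    """Return every (pdb_id, letter_code) pair linked to one protein accession."""
    query = (
        "SELECT p.pdb_id, s.letter_code "
        "FROM pdb_accession AS p "
        "JOIN structure AS s ON s.pdb_id = p.pdb_id "
        "WHERE p.accession = ? "
        "ORDER BY p.pdb_id, s.letter_code"
    )
    return conn.execute(query, (accession,)).fetchall()
```

Calling `chains_for_accession(build_demo_db(), "X00001")` lists each chain of each PDB entry mapped to that accession, which is the set of PDB-id and chain pairs for which a PSSM profile would be looked up.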


PSSM profiles of all the proteins of the B dataset were generated using a PSI-BLAST search against the non-redundant (nr) database of protein sequences at NCBI. First, the non-redundant database of PSI-BLAST, which is 241 GB in size, was downloaded to the lab's computer. The PSSM matrix was produced by the blastpgp program of the PSI-BLAST package with three iterations of searching at a cutoff E-value of 0.001 for the inclusion of sequences in the next iteration.

The command used to generate the PSSM profile of the protein with accession A0AVK6, for instance, is given below:

    blastpgp -d nr -i "A0AVK6.seq" -j 3 -h 0.001 -Q "A0AVK6.pssm"

Here, the A0AVK6.seq file contains the primary sequence of the protein A0AVK6 as a FASTA file. The option -j 3 runs the blastpgp program for three iterations. The option -h 0.001 restricts the inclusion of unrelated sequences with a cutoff E-value of 0.001. The -Q option stores the resulting PSSM profile in a file named A0AVK6.pssm, which can be placed in any folder on the computer by giving the full path of that file in the command. Figure III.1 shows the profile of the protein "A0AVK6" with a PSSM window size of 31 × 21.

Figure III.1: A PSSM profile of the protein A0AVK6. It can be seen that the resulting profile is a 31 × 20 matrix. The arrows in the figure indicate that the conservation of the Tyrosine residues at positions 521 and 524 is different. The P in the top left indicates the position, and S is the primary sequence of this fragment of the protein.
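Because the same command has to be issued for every accession of the B dataset, it can help to assemble it programmatically. The following is a minimal sketch, assuming one FASTA file per accession named `<accession>.seq`; the function name and the directory arguments are hypothetical, and only the blastpgp options quoted in the text (-d, -i, -j, -h, -Q) are used.

```python
from pathlib import Path


def blastpgp_command(accession, seq_dir, pssm_dir,
                     database="nr", iterations=3, e_value=0.001):
    """Build the blastpgp argument list for one protein accession.

    The directory layout is an assumption of this sketch; the options mirror
    the command shown in the text.
    """
    seq_file = Path(seq_dir) / f"{accession}.seq"
    pssm_file = Path(pssm_dir) / f"{accession}.pssm"
    return [
        "blastpgp",
        "-d", database,              # database to search
        "-i", seq_file.as_posix(),   # query sequence in FASTA format
        "-j", str(iterations),       # number of PSI-BLAST iterations
        "-h", str(e_value),          # E-value cutoff for inclusion
        "-Q", pssm_file.as_posix(),  # output file for the PSSM profile
    ]
```

With the legacy BLAST package installed, the returned list could be passed to `subprocess.run(command, check=True)` once per accession.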


The PSSMs generated for both datasets A and B contain the probability of occurrence of each type of amino acid residue at each position, along with insertions or deletions. The evolutionary information for each amino acid is described in an L × 20 matrix, where L is the length of the given protein sequence and 20 represents the twenty amino acids.

III.3.4 Negative Dataset Preparation

Most of the known phosphorylation datasets do not have any experimental annotations for negative amino acids. This means there is no real knowledge of which of the three amino acids Serine, Threonine, and Tyrosine are not phosphorylated at a specific location of a particular protein; this is considered the most significant problem when compiling datasets for designing a machine learning model. Prior knowledge of the non-phosphorylated sites of proteins would be advantageous when designing the prediction system. However, it is impossible to assure and prove that a specific site of a protein is negative under all circumstances. In this thesis, the non-annotated sites of both the A and B datasets were considered negative sites if they satisfied some hypothetical criteria:

1. First: All non-annotated sites of the three amino acids S, T, Y were considered not phosphorylated, and the proposed model was designed based on this assumption. Therefore, it is more reliable to investigate all amino acids, because there is no annotated residue missing from the dataset.
2. Second: A non-annotated residue was considered a negative site if it is not within a certain distance of any phosphorylation-annotated residue of a protein sequence.

III.3.5 Training System Design

From the PSSM profiles of the proteins of the A and B datasets, both negative (A− and B−) and positive (A+ and B+) instances were prepared for each of the three phosphorylated residues Serine, Threonine, and Tyrosine. Tables 5-1 and 5-2 below show the number of training samples of each residue for each label, positive or negative.
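The two negative-site criteria of Section III.3.4 can be expressed as one filter over a sequence. The following is a minimal sketch under stated assumptions: positions are taken to be 1-based, and the distance threshold `min_distance` is a hypothetical parameter, since the text only speaks of "a certain distance". With `min_distance=0` the function reduces to the first criterion (every non-annotated S, T, or Y is negative).

```python
def candidate_negative_sites(sequence, positive_positions, min_distance=0):
    """Return (position, residue) pairs treated as negative sites.

    A residue qualifies if it is S, T, or Y, is not annotated as
    phosphorylated, and lies more than min_distance positions away from
    every annotated site. Positions are 1-based (an assumption).
    """
    positives = set(positive_positions)
    negatives = []
    for position, residue in enumerate(sequence, start=1):
        if residue not in "STY" or position in positives:
            continue
        if all(abs(position - p) > min_distance for p in positives):
            negatives.append((position, residue))
    return negatives
```

For the toy sequence `"ASTYKSAT"` with position 3 annotated, the first criterion keeps positions 2, 4, 6, and 8, while `min_distance=2` keeps only positions 6 and 8.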
It is obvious that the number of negative samples in both tables is much larger than the number of positive samples. If the proposed system were trained with this imbalance, it would mostly predict negative results, which would create bias in the system.

To overcome this problem, the positive and negative datasets had to be made equal in size. This study performs its experiments with positive and negative datasets of equal size during training. Therefore, the thesis proposes two ways to deal with unbalanced datasets and obtain the same number of samples in the positive and negative datasets, to avoid bias in the system:

1. First: The positive dataset for each of the three residues S, T, Y was multiplied so that it became equal in size to the corresponding negative dataset. For instance, the positive dataset for Serine residues is multiplied by a certain number to match the size of the negative Serine dataset when the number of training samples is chosen, and similarly for the rest of the datasets.
2. Second: The samples of the negative datasets are reduced until they equal the number of instances of the positive datasets for each of the residue datasets S, Y, T. In this case, the negative dataset size becomes equal to the size of the positive dataset. For example, the negative dataset of Serine residues is reduced to the same number of samples as the positive dataset.

The final training set was prepared by merging the negative and positive datasets for each residue S, T, Y after equalizing the size of both datasets. The training dataset is thus formed for the deep learning algorithms to build the proposed model, train the system, and predict the results.

III.3.6 Testing the Proposed Method

To evaluate the performance of the proposed system on the two datasets A and B, the model was tested by randomly splitting the merged training dataset into a training and a testing dataset. The merged training dataset was shuffled so that the data is distributed evenly and covers all varieties of data. Then it was divided into two sets: 80% of the original dataset was used for training and the remaining 20% for testing the model. Therefore, the validation split is configured as follows:

Training Dataset: Each time, 80% of the samples are randomly selected from the original training set. This procedure is repeated five times for every epoch and batch size chosen for the model.
Test Dataset: This is the remaining 20% of the merged dataset, which was shuffled randomly in each of the five repetitions, and it is used to evaluate and test the model. The model never sees this data during training, and the data is never used in deciding the hyperparameters; therefore, it gives a realistic picture of how the model performs.
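The balancing strategies of Section III.3.5 and the repeated 80/20 split of Section III.3.6 can be sketched together. The following is a minimal illustration using only the standard library; the function names and the fixed seed are assumptions, the samples are treated as opaque objects, and the negative set is assumed to be at least as large as the positive set, as in the datasets described above.

```python
import random


def oversample_positives(positives, negatives, rng):
    """First strategy: repeat positives until they match the negatives in number."""
    repeats, remainder = divmod(len(negatives), len(positives))
    grown = positives * repeats + rng.sample(positives, remainder)
    return grown, negatives


def undersample_negatives(positives, negatives, rng):
    """Second strategy: keep a random subset of negatives of the same size as the positives."""
    return positives, rng.sample(negatives, len(positives))


def repeated_splits(positives, negatives, repeats=5, train_fraction=0.8, seed=0):
    """Yield (train, test) pairs; the merged data is reshuffled for every repeat."""
    rng = random.Random(seed)
    merged = [(x, 1) for x in positives] + [(x, 0) for x in negatives]
    for _ in range(repeats):
        shuffled = merged[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        yield shuffled[:cut], shuffled[cut:]
```

With 500 positive and 500 negative samples, each of the five repeats yields 800 training and 200 test samples; the reported score would then be the average over the five test sets.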


The splitting process was repeated five times, as mentioned above, in such a way that each time the training and testing datasets were reshuffled randomly. Reshuffling is highly recommended to prevent the deep learning model from detecting any pattern in the order in which the samples are presented. Therefore, this technique prevents bias in training the neural network and the support vector machine of the proposed model. The final performance values are obtained by averaging the performance of all five results for every chosen epoch and batch size. It should be noted that in each of the five training phases, the deep learning algorithm produces a knowledge base, which is used later by the system to predict the sites of the protein.

III.3.7 Evaluating the Proposed Prediction System

Most of the existing prediction systems are evaluated by measuring the accuracy (Ac), sensitivity (Sn), specificity (Sp), precision (Prec), recall (Rec), F1-score, and Matthews Correlation Coefficient (MCC). The results of the system were assessed for each of the experiments in this thesis using these classification evaluation methods.

Suppose some sites are to be examined for the existence of phosphorylation. Some of them are truly phosphorylated, and the prediction system reports them as positive; this is called a True Positive (TP) prediction. Some of them are phosphorylated, but the test says they are negative; these are called False Negatives (FN). On the other hand, some sites are not phosphorylated, and the test result of the model agrees; these are called True Negatives (TN). Finally, if the actual sites have negative labels (non-phosphorylated) and the test states that they are positive, they are called False Positives (FP). The true positives, false negatives, true negatives, and false positives together add up to 100% of the test set.
Accuracy (Ac): The accuracy is the ratio of correct results, both true positives and true negatives, over the total number of samples. The accuracy is obtained by the following equation:

\[ Ac = \frac{TP + TN}{TP + FP + TN + FN} \times 100\% \tag{III.1} \]

Sensitivity (Sn): Sensitivity, or True Positive Rate (TPR), is the proportion of actual positive samples that are predicted correctly. Precisely, it is the ratio of phosphorylation sites that tested positive to all the positively phosphorylated sites.


The formula for this measure is:

\[ Sn = \frac{TP}{TP + FN} \times 100\% \tag{III.2} \]

The higher the sensitivity, the fewer positive phosphorylation sites go unpredicted.

Specificity (Sp): Specificity, or True Negative Rate (TNR), is the fraction of sites that have been predicted negative over the whole number of negative samples:

\[ Sp = \frac{TN}{TN + FP} \times 100\% \tag{III.3} \]

It can be seen from the equation above that the fewer negative sites are labeled as positive, the higher the specificity.

Precision (Prec, or simply P): Precision is the fraction of true positive samples among all sites that tested phosphorylation positive.

\[ Prec = \frac{TP}{TP + FP} \tag{III.4} \]

Recall (Rec, or R): Recall is the proportion of correctly predicted positive samples of phosphorylation sites over the total number of positive samples.

\[ Rec = \frac{TP}{TP + FN} \tag{III.5} \]

F1-score: It is the harmonic mean of the precision and the recall, which takes into account both false positive and false negative predictions. Therefore, it is a useful and recommended measurement for the model when the dataset is widely distributed and shuffled every time. The F1-score is measured by the following equation:

\[ F1\text{-}score = \frac{2 \times Prec \times Rec}{Prec + Rec} \tag{III.6} \]
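The measures defined in Equations III.1 to III.6 can be computed directly from the four confusion counts. The following is a minimal sketch; the function name and the returned dictionary keys are hypothetical, the values are returned as fractions rather than percentages, and a zero denominator is mapped to 0.0 as an assumption. The MCC line uses the standard definition of the Matthews Correlation Coefficient, since the text names the measure without stating its formula.

```python
import math


def _ratio(numerator, denominator):
    """Divide, returning 0.0 for an empty denominator (an assumption of this sketch)."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Compute the evaluation measures of Section III.3.7 from confusion counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)  # identical to sensitivity
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": recall,
        "specificity": _ratio(tn, tn + fp),
        "precision": precision,
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }
```

For example, with TP = 80, FP = 20, TN = 70, and FN = 30, the accuracy is 150/200 = 0.75 and the precision is 80/100 = 0.8.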


Support: It is the total number of predicted samples of phosphorylation sites for each class, and it is used to compare the scores of the proposed model.

Confusion matrix: A confusion matrix is a table that represents the performance of a classification model on a set of test data for which the true values are known.

III.4 Testing the system

As shown in the workflow diagram in Figure III.2, sequences of the A and B datasets were given to the 3D-CONS web server and PSI-BLAST's blastpgp program, respectively, to generate PSSM profiles, which are a valuable representation of the evolutionary information of the proteins. Then the CNN and the Hybrid CNN Support Vector Machines (SVM) classifiers were prepared from the PSSM profiles. There were two sets of instances for each of the datasets A and B, representing the positive and negative phosphorylation sites of the proteins. The positive and negative samples of the A instance sets were balanced regarding the number of instances, and both were merged to pass to the model. Similarly, for the B dataset, which holds both negative and positive samples of 350 complete proteins, the S, T, Y phosphorylation and non-phosphorylation sites were extracted and merged for every sub-dataset to feed the model. Then five-time random validation was performed on the final merged training sets using the CNN model and the hybrid CNN SVM model. The separate model files (knowledge base) for each of the three phosphorylated residues S, T, and Y and for each of the nine different window sizes 15, 17, 19, 21, 23, 25, 27, 29, and 31 were collected for later prediction and testing purposes. For each window size, the residue (S, T, or Y) was centered, and the same number of positions was taken above and below the residue. Besides, the batch size and the number of epochs were selected to be (50, 100) and (10, 20), respectively. Note that the batch size is the number of samples propagated through the network at once, while an epoch is a single forward pass and backward pass of all the training samples. The mean and the standard deviation were taken over the five executions of the model. For instance, the model was run five times for each window size, epoch, and batch size, and the validation results are reported in the results section.
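The windowing step described in this section (center the S, T, or Y residue and take the same number of rows above and below it) can be sketched as follows. This is an illustration under stated assumptions: the PSSM is held as a list of rows with one row per sequence position, the center index is 0-based, and positions that fall outside the sequence are zero-padded, which is a choice of this sketch because the text does not say how sequence ends are handled.

```python
def pssm_window(pssm, center, window_size):
    """Cut a window_size x n_columns window from a PSSM, centered on one residue.

    pssm is a list of rows (one per sequence position); center is 0-based.
    Rows outside the sequence are filled with zeros (an assumption).
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so the residue can be centered")
    half = window_size // 2
    n_columns = len(pssm[0])
    window = []
    for row_index in range(center - half, center + half + 1):
        if 0 <= row_index < len(pssm):
            window.append(list(pssm[row_index]))
        else:
            window.append([0] * n_columns)
    return window
```

For a window size of 31 this takes 15 rows on each side of the residue, giving the 31 × 20 input used for the largest window.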


Figure III.2: Flowchart explaining the detailed system of the proposed prediction model during testing.

III.5 Summary of the Proposed Prediction System

This chapter reviewed the design of the complete prediction system proposed in this thesis. The separate models of this system and the purpose of each of the algorithms were considered. Also, the different assessment parameters used to evaluate the system were introduced. The workflow diagram was presented to specify how the system was trained and tested.


CHAPTER IV
RESULTS

The purpose of this thesis was to design a model of protein phosphorylation sites that is able to predict sites accurately. The imbalance between the positive and negative sites in both datasets was a major concern in designing a suitable method because of bias and overfitting issues. The first problem faced in this thesis was reducing the number of negative sites. After a few initial experiments and a literature review, the better choice was found to be an equal ratio in the merged dataset when training the model. The reason is that if the model is trained on an unbalanced dataset, it favors the class that has the larger number of samples, which leads to mispredictions most of the time. The number of negative samples was reduced before merging them with the actual instances, as described in the methodology section. The next part of the thesis shows the experimental results of how window size and the parameters of the algorithms affect the performance of the prediction system.

The first experiment was the baseline experiment of the thesis. It was done on subsets of the A dataset, which has the PDB ids of the proteins. We collected 50000 samples of both negatives and positives for each of the three residues S, T, Y. The training dataset was the merge of the positive and negative samples. It should be noted that we did not record these results in the thesis because we wanted to focus on the B dataset and a C dataset, which holds the corresponding matches of all the PDB IDs for dataset B. After that, the B and C datasets were trained and tested individually with the same number of samples, and the results of the two datasets were compared. Finally, an independent dataset was tested on the best model to compare the performance of the model with other existing models.

This section shows the results for each model in every experiment. Then, the mean and the standard deviation were taken over the overall scores of the models for the predictions of all the site samples in the B and C datasets.
IV.1 Dataset B

The B dataset, obtained from Phospho.ELM ver. 9.0, contains 350 complete proteins. The number of positive sites annotated by Phospho.ELM and the number of negative sites annotated by our assumption for each of the three residues S, T, and Y are shown in Table III.3. There is a PSSM file for each site labeled in both the B+ and B− sub-datasets. The PSSM profiles of the proteins of the B dataset give the training samples for both the CNN and the hybrid CNN SVM. The ratio of negative to positive sites is large and makes the training dataset unbalanced, so it can bias the model training toward predicting most of the unknown sites as negative. To solve this issue, the number of positive instances must be made equal to the number of negative sites for the samples chosen in the training dataset. Four separate experiments were performed with the merged training dataset. The dataset contains an equal ratio of positive and negative instances and was used for training and testing the CNN and the hybrid CNN SVM with the combinations of batch size (50, 100) and number of epochs (10, 20).

IV.1.1 Convolutional Neural Networks (CNN) Model

The model was first trained and tested on the CNN, which has different architectures based on the window size of the PSSM profiles, the batch size, and the number of epochs. However, the rest of the parameters in the architecture of the CNN are the same as shown in Table III.1. The architecture of the CNN is explained in Figure IV.1 below.

Figure IV.1: The architecture of the CNN used in training the model with a window size of 31 × 20. The figure also explains in detail the input and output parameters for each layer in the CNN.
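The experimental design described in this section (two epoch settings, two batch sizes, three residues, nine window sizes, five repeated runs each) can be enumerated explicitly. The following is a minimal sketch; the constant and function names are hypothetical, and the training call itself is left out because the CNN architecture of Table III.1 is not reproduced here.

```python
from itertools import product
from statistics import mean, stdev

RESIDUES = ("S", "T", "Y")
WINDOW_SIZES = tuple(range(15, 32, 2))  # 15, 17, ..., 31
EPOCHS = (10, 20)
BATCH_SIZES = (50, 100)


def experiment_grid():
    """List every (epochs, batch size, residue, window size) configuration."""
    return [
        {"epochs": e, "batch_size": b, "residue": r, "window": w}
        for e, b, r, w in product(EPOCHS, BATCH_SIZES, RESIDUES, WINDOW_SIZES)
    ]


def summarize(scores):
    """Mean and sample standard deviation over the repeated runs of one configuration."""
    return mean(scores), stdev(scores)
```

Each configuration would be trained and tested five times on reshuffled splits, and `summarize` applied to the five scores of every metric gives one cell of the result tables.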


IV.1.1.1 Experiment 1 (Epochs: 10 & Batch_size: 50)

In the first experiment, the training dataset was prepared to have an equal number of positive and negative samples from the B dataset. The training dataset had 1000 instances in total, half of them positive samples and the other half negative instances. Then, five-split random validation was applied to this modified, merged training dataset. Five separate random splits were run on the instance set for nine different window sizes (15, 17, 19, 21, 23, 25, 27, 29, and 31) for each of the three residues S, T, and Y.

The accuracy (Ac), precision (Prec), recall (Rec), F1-score, and confusion matrix were calculated to evaluate the predicted values of the model. Table IV.1 shows the mean results of the five random cross-selections using this type of training dataset.


                         Window Size
Residue  Metric          15      17      19      21      23      25      27      29      31
S        Accuracy        65.75%  68.10%  70.50%  70.42%  71.63%  69.50%  66.50%  70.70%  71.50%
         Precision       70.56%  70.00%  69.98%  70.54%  70.05%  70.63%  73.12%  71.26%  70.54%
         Recall          66.18%  68.04%  70.97%  71.06%  72.04%  70.02%  66.02%  71.02%  71.82%
         F1-score        64.98%  67.54%  70.34%  69.96%  71.16%  69.02%  64.78%  70.52%  71.08%
         Baseline error  34.25%  31.90%  29.50%  29.58%  28.38%  30.50%  33.50%  29.30%  28.50%
         Execution Time  8.75    9.40    6.88    7.17    10.75   8.67    12.00   11.40   12.57
T        Accuracy        68.67%  72.00%  71.63%  69.25%  71.75%  73.75%  73.00%  72.67%  73.25%
         Precision       68.88%  72.36%  72.13%  71.03%  72.06%  74.26%  75.21%  73.51%  73.98%
         Recall          68.68%  71.83%  71.68%  69.42%  71.70%  73.77%  73.00%  72.61%  73.39%
         F1-score        68.58%  71.77%  71.47%  68.64%  71.62%  73.61%  72.35%  72.38%  73.07%
         Baseline error  31.33%  28.00%  28.38%  30.75%  28.25%  26.25%  27.00%  27.33%  26.75%
         Execution Time  4.67    5.00    5.75    5.00    5.00    5.00    6.00    5.50    6.25
Y        Accuracy        68.50%  70.25%  69.88%  70.80%  72.00%  72.40%  72.40%  72.75%  73.17%
         Precision       69.80%  70.43%  70.49%  71.18%  72.55%  74.89%  73.87%  73.82%  73.82%
         Recall          68.65%  70.30%  69.90%  70.76%  72.05%  72.72%  72.26%  72.97%  72.97%
         F1-score        68.04%  70.21%  69.64%  70.60%  71.83%  71.84%  71.88%  72.52%  72.52%
         Baseline error  31.50%  29.75%  30.13%  29.20%  28.00%  27.60%  27.60%  27.25%  26.25%
         Execution Time  5.20    5.50    6.00    6.20    6.25    7.00    7.20    7.33    7.82

Table IV.1: CNN, 10 epochs, batch size 50


IV.1.1.2 Experiment 2 (Epochs: 20 & Batch_size: 50)

In the second experiment, the number of epochs was doubled to 20 with the same batch size of 50. The model was trained with the same B dataset as in the previous experiment, with equal numbers of positive and negative samples. The PSSM profile window sizes were the odd numbers from 15 to 31 for each of the three residues S, T, Y. The results of this experiment can be seen in Table IV.2.


                         Window Size
Residue  Metric          15      17      19      21      23      25      27      29      31
S        Accuracy        70.40%  72.17%  72.00%  76.50%  71.29%  74.17%  74.13%  73.71%  73.69%
         Precision       71.98%  73.08%  73.38%  76.82%  72.79%  74.90%  76.20%  75.52%  74.86%
         Recall          70.89%  72.37%  72.41%  76.54%  71.30%  74.39%  74.57%  74.06%  73.79%
         F1-score        70.16%  71.95%  71.77%  76.42%  70.73%  74.03%  73.75%  73.41%  73.44%
         Baseline error  29.60%  27.83%  28.00%  23.50%  28.71%  25.83%  25.88%  26.29%  26.31%
         Execution Time  8.20    7.17    7.71    10.00   8.57    11.83   4.00    5.00    4.50
T        Accuracy        70.70%  72.25%  73.60%  74.88%  75.00%  76.13%  77.40%  76.25%  78.20%
         Precision       71.43%  72.41%  73.89%  75.23%  76.22%  76.36%  78.31%  76.62%  79.44%
         Recall          70.96%  72.22%  73.71%  74.76%  75.06%  76.17%  77.44%  76.19%  78.21%
         F1-score        70.57%  72.17%  73.54%  74.72%  74.72%  76.08%  77.22%  76.11%  77.96%
         Baseline error  29.30%  27.75%  26.40%  25.13%  25.00%  23.88%  22.60%  23.75%  21.80%
         Execution Time  2.80    5.75    6.20    5.00    5.00    5.00    5.60    4.83    5.80
Y        Accuracy        71.38%  71.60%  73.70%  73.75%  77.00%  76.13%  76.00%  74.83%  75.42%
         Precision       72.54%  71.85%  74.39%  74.29%  77.18%  76.17%  76.45%  75.26%  76.17%
         Recall          71.51%  71.78%  73.67%  73.79%  77.06%  76.15%  76.04%  74.96%  75.63%
         F1-score        71.05%  71.59%  73.46%  73.60%  76.98%  76.12%  75.88%  74.76%  75.30%
         Baseline error  28.63%  28.40%  26.30%  26.25%  23.00%  23.88%  24.00%  25.17%  24.58%
         Execution Time  5.50    6.00    5.80    5.75    7.00    6.50    7.00    7.33    4.00

Table IV.2: CNN, 20 epochs, batch size 50


As in the first experiment, the code was executed five times for every residue with the different window sizes of the PSSM profile. Then, the mean and the standard deviation were taken for analysis.

IV.1.1.3 Experiment 3 (Epochs: 10 & Batch_size: 100)

In this experiment, the CNN was trained with different parameters: 10 epochs and a batch size of 100. First, the merged training dataset B was split into a training dataset and a testing dataset. The training dataset was passed to the CNN with ten epochs and a batch size of 100. The testing dataset was then used to assess the performance of the model. This process was repeated for every window size (15, 17, 19, 21, 23, 25, 27, 29, and 31) of the PSSM profile for every residue Serine, Tyrosine, and Threonine. Table IV.3 shows the average results of the five-random cross-validation using the B training dataset.


                         Window Size
Residue  Metric          15      17      19      21      23      25      27      29      31
S        Accuracy        67.75%  68.13%  66.58%  68.00%  68.60%  68.00%  69.42%  68.30%  71.60%
         Precision       68.81%  69.31%  68.33%  70.00%  69.97%  68.68%  71.88%  69.01%  73.66%
         Recall          67.78%  68.46%  66.77%  68.42%  68.92%  68.16%  69.84%  68.56%  72.03%
         F1-score        67.30%  67.76%  65.89%  67.46%  68.22%  67.79%  68.81%  68.14%  71.15%
         Baseline error  32.25%  31.88%  33.42%  32.00%  31.40%  32.00%  30.58%  31.70%  28.40%
         Execution Time  3.25    3.00    3.83    3.80    4.00    5.00    4.67    5.80    6.20
T        Accuracy        67.83%  68.63%  68.63%  69.50%  69.67%  70.75%  71.50%  70.13%  70.83%
         Precision       69.10%  69.78%  69.34%  71.94%  71.00%  72.94%  72.08%  70.73%  72.91%
         Recall          67.52%  68.87%  68.77%  69.61%  69.95%  70.94%  71.60%  70.11%  71.09%
         F1-score        66.99%  68.30%  68.37%  68.70%  69.35%  70.14%  71.37%  69.89%  70.30%
         Baseline error  32.17%  31.38%  31.38%  30.50%  30.33%  29.25%  28.50%  29.88%  29.17%
         Execution Time  3.00    3.25    3.50    3.80    5.00    5.50    5.00    5.75    6.33
Y        Accuracy        67.88%  66.13%  67.25%  69.38%  68.25%  69.40%  71.25%  69.58%  67.40%
         Precision       71.90%  68.39%  68.81%  70.99%  70.28%  69.89%  73.65%  70.20%  71.13%
         Recall          68.02%  65.79%  67.34%  69.32%  68.09%  69.16%  71.18%  69.29%  67.96%
         F1-score        66.51%  64.80%  66.65%  68.77%  67.31%  68.96%  70.41%  69.09%  66.13%
         Baseline error  32.13%  33.88%  32.75%  30.63%  31.75%  30.60%  28.75%  30.42%  32.60%
         Execution Time  3.00    4.00    3.75    4.00    4.00    4.40    5.75    6.00    4.20

Table IV.3: CNN, 10 epochs, batch size 100


Accuracy, precision, F1-score, recall, baseline error, and the elapsed time are shown for each of the nine different window sizes for each of the three residues. The percentages are the mean values of each measurement that appears in the table; the execution time for training and testing the model is given in seconds.

IV.1.1.4 Experiment 4 (Epochs: 20 & Batch_size: 100)

The merged dataset in this experiment was first split into the training dataset and the testing dataset. Then, the model was trained with a different number of epochs, 20, and the same batch size. The various window sizes were included as in the previous experiments. The performance measurements were recorded for each of the three residues involved in this study. Table IV.4 below shows the results of all the assessments used for this model.


                         Window Size
Residue  Metric          15      17      19      21      23      25      27      29      31
S        Accuracy        68.38%  69.30%  68.38%  68.50%  68.63%  69.79%  69.88%  71.40%  71.10%
         Precision       69.00%  70.14%  68.70%  70.66%  69.46%  71.76%  71.35%  73.36%  71.48%
         Recall          68.64%  69.60%  68.51%  69.08%  68.96%  70.17%  70.36%  71.91%  70.99%
         F1-score        68.27%  69.11%  68.27%  67.97%  68.50%  69.31%  69.65%  71.00%  70.86%
         Baseline error  31.63%  30.70%  31.63%  31.50%  31.38%  30.21%  30.13%  28.60%  28.90%
         Execution Time  3.75    3.40    4.00    4.00    4.25    4.43    5.25    7.20    6.40
T        Accuracy        70.63%  71.75%  73.00%  72.00%  73.25%  74.88%  74.75%  74.33%  70.30%
         Precision       72.28%  72.37%  73.23%  73.03%  73.45%  75.60%  74.98%  75.08%  70.84%
         Recall          70.77%  71.84%  72.97%  72.21%  73.31%  74.97%  74.84%  74.34%  70.46%
         F1-score        70.16%  71.53%  72.89%  71.79%  73.22%  74.71%  74.73%  74.09%  70.17%
         Baseline error  29.38%  28.25%  27.00%  28.00%  26.75%  25.13%  25.25%  25.67%  29.70%
         Execution Time  3.25    4.17    4.50    5.00    5.75    5.75    5.75    4.67    413.60
Y        Accuracy        69.40%  70.80%  71.75%  71.25%  73.88%  74.10%  73.90%  74.58%  75.00%
         Precision       69.43%  71.49%  72.17%  71.79%  74.46%  74.39%  73.91%  75.10%  75.14%
         Recall          69.33%  70.92%  71.83%  71.26%  73.90%  74.07%  73.79%  74.77%  75.01%
         F1-score        69.30%  70.57%  71.64%  71.05%  73.73%  73.96%  73.80%  74.52%  74.96%
         Baseline error  30.60%  29.20%  28.25%  28.75%  26.13%  25.90%  26.10%  25.42%  25.00%
         Execution Time  3.40    4.00    4.25    3.75    4.25    4.40    4.60    6.50    4.00

Table IV.4: CNN, 20 epochs, batch size 100


IV.1.2 Hybrid CNN Support Vector Machines (SVM) classifier

In this algorithm, the SVM is used along with the CNN to build a model that predicts the phosphorylation sites of the three residues S, T, Y of a protein. First, the CNN is applied to train the model and adjust the weights of the parameters in the network. After the model has been trained and all the settings in the CNN have been adjusted, the SVM is applied to the values extracted from the CNN network as one long vector, before they go to the last layer where the Softmax function is applied. The CNN is thus used to extract the essential features from the many features in the dataset, and a classification algorithm such as the SVM is then applied. This model uses the same dataset as the first algorithm, which is the B dataset. The B 4 dataset has 1000 merged samples from the positive and negative datasets B+ and B−, 500 from the negative dataset and the other 500 from the positive dataset, for each of the three residues S, T, and Y.

Five-random cross-validation was chosen, as mentioned before, to evaluate the system using different measurements such as accuracy, precision, F1-score, recall, baseline error, and the execution time for every round of execution. Then, the average of all measurements was taken as the final measure for the chosen parameters. The splitting and shuffling of the training and testing datasets were used to tune and test the settings of each experiment. Figure IV.2 below explains in detail the parameters used in this model and the number of inputs and outputs of each layer in the CNN, which is flattened to be the input of the SVM.

Figure IV.2: The architecture of the SVM used in training the model along with the CNN, which has an input of 31 × 20 for the window size. The figure also explains in detail the input and output parameters for each layer in the CNN model, which is flattened into a vector at the end to be the input of the SVM.


IV.1.2.1 Experiment 1 (Epochs: 10 & Batch_size: 50)

The SVM (Support Vector Machines) was applied to the last layer of the CNN, precisely the fully connected layer. In this experiment, the CNN used ten epochs and a batch size of 50 to prepare the model for the SVM. The SVM was applied to the last layer of the CNN by extracting the features from it as a vector. The model was run on the different window sizes of the PSSM profiles, as shown in Table IV.5.


                         Window Size
Residue  Metric          15      17      19      21      23      25      27      29      31
S        Accuracy        63.30%  64.50%  65.54%  62.80%  63.80%  63.00%  65.91%  63.00%  65.25%
         Precision       70.04%  70.83%  71.40%  69.75%  69.83%  69.68%  72.22%  71.51%  72.05%
         Recall          64.24%  65.12%  66.39%  63.79%  64.63%  63.99%  66.64%  64.11%  65.92%
         F1-score        60.81%  62.02%  63.57%  59.97%  60.98%  60.48%  63.57%  59.62%  62.62%
         Baseline error  36.70%  35.50%  34.46%  37.20%  36.20%  37.00%  34.09%  37.00%  34.75%
         Execution Time  8.0     8.8     5.2     6.4     10.7    11.9    6.5     12.6    12.9
T        Accuracy        67.13%  66.42%  64.70%  68.07%  61.50%  68.38%  68.56%  64.00%  58.75%
         Precision       68.07%  71.31%  68.96%  69.41%  70.67%  70.28%  69.34%  69.20%  67.24%
         Recall          67.30%  66.81%  65.51%  68.22%  62.55%  68.68%  68.70%  64.42%  59.46%
         F1-score        66.83%  64.24%  62.92%  67.66%  57.05%  67.86%  68.31%  61.88%  52.31%
         Baseline error  32.88%  33.58%  35.30%  31.93%  38.50%  31.63%  31.44%  36.00%  41.25%
         Execution Time  1.5     2.5     2.2     2.5     3.3     3.9     3.6     4.7     5.1
Y        Accuracy        67.80%  69.25%  68.38%  66.40%  69.50%  70.60%  69.60%  70.50%  69.75%
         Precision       70.63%  69.35%  68.42%  70.43%  69.58%  70.75%  69.65%  70.61%  69.82%
         Recall          68.25%  69.32%  68.41%  67.25%  69.56%  70.74%  69.66%  70.54%  69.76%
         F1-score        66.60%  69.25%  68.37%  64.95%  69.50%  70.59%  69.58%  70.45%  69.72%
         Baseline error  32.20%  30.75%  31.63%  33.60%  30.50%  29.40%  30.40%  29.50%  30.25%
         Execution Time  1.6     2.1     2.5     2.9     3.3     3.8     4.2     4.8     2.9

Table IV.5: Hybrid CNN SVM classifier, 10 epochs, batch size 50


IV.1.2.2 Experiment 2 (Epochs: 20 & Batch_size: 50)

In this test, the CNN was run with different parameters to evaluate the SVM model. Twenty epochs and a batch size of 50 were used for the CNN network to build the model and prepare the features for the SVM.


                         Window Size
Residue  Metric          15      17      19      21      23      25      27      29      31
S        Accuracy        65.60%  66.35%  66.10%  66.67%  67.35%  68.42%  66.90%  68.64%  68.05%
         Precision       69.51%  70.35%  70.63%  70.78%  72.43%  72.71%  72.26%  72.98%  72.64%
         Recall          66.33%  67.06%  66.87%  67.21%  68.08%  69.01%  67.72%  69.28%  68.59%
         F1-score        64.23%  65.03%  64.63%  65.27%  65.76%  67.12%  65.32%  67.48%  66.67%
         Baseline error  34.40%  33.65%  33.90%  33.33%  32.65%  31.58%  33.10%  31.36%  31.95%
         Execution Time  8.2     5.4     6.0     10.3    6.7     11.6    2.4     3.5     3.2
T        Accuracy        67.13%  66.42%  64.70%  68.07%  61.50%  68.38%  68.56%  64.00%  58.75%
         Precision       68.07%  71.31%  68.96%  69.41%  70.67%  70.28%  69.34%  69.20%  67.24%
         Recall          67.30%  66.81%  65.51%  68.22%  62.55%  68.68%  68.70%  64.42%  59.46%
         F1-score        66.83%  64.24%  62.92%  67.66%  57.05%  67.86%  68.31%  61.88%  52.31%
         Baseline error  32.88%  33.58%  35.30%  31.93%  38.50%  31.63%  31.44%  36.00%  41.25%
         Execution Time  1.5     2.5     2.2     2.5     3.3     3.9     3.6     4.7     5.1
Y        Accuracy        69.75%  69.50%  68.90%  69.50%  71.00%  72.50%  70.00%  71.17%  70.75%
         Precision       70.00%  69.61%  69.17%  69.62%  71.05%  72.59%  70.06%  71.82%  70.88%
         Recall          69.84%  69.61%  69.09%  69.56%  71.02%  72.54%  70.04%  71.25%  70.75%
         F1-score        69.70%  69.50%  68.88%  69.48%  70.99%  72.49%  69.97%  70.99%  70.68%
         Baseline error  30.25%  30.50%  31.10%  30.50%  29.00%  27.50%  30.00%  28.83%  29.25%
         Execution Time  1.8     2.1     2.6     2.9     3.4     3.8     4.5     5.0     2.9

Table IV.6: Hybrid CNN SVM classifier, 20 epochs, batch size 50


Table IV.6 above shows the results of using the SVM along with the CNN in this experiment. The mean value is reported for all the measurements over five random selections of the training and testing datasets. The table gives the results as percentages, together with the execution time of the algorithm.

IV.1.2.3 Experiment 3 (Epochs: 10 & Batch_size: 100)

The parameters used in the CNN are 10 epochs and a batch size of 100. The SVM was applied on top of the CNN to classify the three residues S, T, Y into phosphorylation and non-phosphorylation sites of the protein. The mean values of all the measurement metrics are shown in Table IV.7:


                         Window Size
Residue  Metric          15      17      19      21      23      25      27      29      31
S        Accuracy        63.83%  63.06%  62.50%  63.65%  60.94%  63.50%  62.00%  59.90%  60.08%
         Precision       70.90%  69.80%  69.49%  70.16%  68.89%  71.92%  70.57%  72.69%  70.96%
         Recall          64.59%  63.80%  63.20%  64.34%  61.83%  64.44%  63.06%  61.24%  61.17%
         F1-score        60.92%  60.23%  59.40%  60.88%  57.24%  60.01%  58.34%  54.23%  54.28%
         Baseline error  36.17%  36.94%  37.50%  36.35%  39.06%  36.50%  38.00%  40.10%  39.92%
         Execution Time  1.8     1.9     2.5     3.0     3.4     4.2     4.3     5.3     6.1
T        Accuracy        67.13%  66.42%  64.70%  68.07%  61.50%  68.38%  68.56%  64.00%  58.75%
         Precision       68.07%  71.31%  68.96%  69.41%  70.67%  70.28%  69.34%  69.20%  67.24%
         Recall          67.30%  66.81%  65.51%  68.22%  62.55%  68.68%  68.70%  64.42%  59.46%
         F1-score        66.83%  64.24%  62.92%  67.66%  57.05%  67.86%  68.31%  61.88%  52.31%
         Baseline error  32.88%  33.58%  35.30%  31.93%  38.50%  31.63%  31.44%  36.00%  41.25%
         Execution Time  1.5     2.5     2.2     2.5     3.3     3.9     3.6     4.7     5.1
Y        Accuracy        67.50%  64.13%  65.75%  66.25%  68.63%  65.70%  65.63%  61.92%  60.30%
         Precision       68.11%  68.02%  67.34%  68.25%  70.07%  68.28%  70.94%  68.93%  70.39%
         Recall          67.66%  64.56%  65.97%  66.49%  68.81%  66.28%  66.14%  62.97%  61.56%
         F1-score        67.29%  62.60%  65.22%  65.56%  68.01%  64.82%  63.39%  57.80%  55.22%
         Baseline error  32.50%  35.88%  34.25%  33.75%  31.38%  34.30%  34.38%  38.08%  39.70%
         Execution Time  1.6     2.0     2.5     2.8     3.2     3.7     4.2     4.7     2.8

Table IV.7: Hybrid CNN SVM classifier, 10 epochs, batch size 100


IV.1.2.4 Experiment 4 Epochs: 20 & Batch_size: 100

In this last experiment, the CNN was trained with a different number of epochs and the same batch size as before, namely 20 and 100 respectively. The Accuracy, Precision, F1-score, Recall, Baseline error, and execution time were averaged for each of the three residues of the protein. The window sizes of the PSSM profile were 15, 17, 19, 21, 23, 25, 27, 29, and 31, as in the previous experiments. The results are shown below in Table IV.8.


                          Window Size
Residue  Metrics         15      17      19      21      23      25      27      29      31
S        Accuracy        64.40%  64.67%  65.00%  65.10%  62.90%  62.94%  65.08%  65.10%  62.00%
         Precision       70.56%  70.00%  69.98%  70.54%  70.05%  70.63%  73.12%  71.26%  70.54%
         Recall          65.31%  65.27%  65.83%  65.99%  63.89%  63.74%  65.90%  66.01%  63.15%
         F1-score        62.29%  62.67%  63.20%  63.24%  59.98%  59.67%  62.02%  62.97%  58.41%
         Baseline error  35.60%  35.33%  35.00%  34.90%  37.10%  37.06%  34.92%  34.90%  38.00%
         Execution Time  5.7     5.8     6.8     7.3     8.1     8.4     10.2    12.7    12.4
T        Accuracy        68.30%  67.25%  70.00%  70.75%  70.50%  71.00%  69.88%  70.50%  70.30%
         Precision       68.94%  68.00%  70.13%  70.93%  70.63%  71.13%  69.97%  71.19%  70.84%
         Recall          68.62%  67.45%  70.09%  70.82%  70.61%  71.07%  69.92%  70.63%  70.46%
         F1-score        68.23%  67.07%  69.98%  70.72%  70.48%  70.99%  69.85%  70.34%  70.17%
         Baseline error  31.70%  32.75%  30.00%  29.25%  29.50%  29.00%  30.13%  29.50%  29.70%
         Execution Time  1.5     1.9     2.4     3.0     3.5     4.1     4.4     4.8     4.1
Y        Accuracy        68.00%  68.60%  67.88%  69.88%  69.75%  70.60%  69.30%  71.00%  70.00%
         Precision       68.57%  69.07%  68.42%  70.27%  70.03%  71.33%  70.25%  71.15%  69.99%
         Recall          68.29%  68.85%  68.02%  69.95%  69.86%  70.77%  69.72%  71.08%  69.99%
         F1-score        67.92%  68.54%  67.72%  69.75%  69.70%  70.33%  69.14%  70.97%  69.99%
         Baseline error  32.00%  31.40%  32.13%  30.13%  30.25%  29.40%  30.70%  29.00%  30.00%
         Execution Time  1.7     2.2     2.6     2.9     3.3     3.8     4.2     4.9     2.9

Table IV.8: Hybrid CNN SVM classifier (epochs: 20, batch size: 100)
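The measurements reported in the tables above can all be derived from a confusion matrix. The following is a minimal sketch, not the thesis code: it assumes binary labels (1 for phosphorylated, 0 for non-phosphorylated) and assumes that Precision, Recall, and F1-score are macro-averaged over the two classes, which this section does not state. In Tables IV.6 to IV.8 the baseline error equals 100% minus the accuracy up to rounding (for example 65.60% and 34.40%), so the sketch computes it that way. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix metrics for labels in {0, 1} (1 = phosphorylated)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)

    def prf(hit, false_alarm, miss):
        # Precision, recall and F1 for one class, guarding against 0/0.
        precision = hit / (hit + false_alarm) if hit + false_alarm else 0.0
        recall = hit / (hit + miss) if hit + miss else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        return precision, recall, f1

    positive = prf(tp, fp, fn)   # phosphorylated class
    negative = prf(tn, fn, fp)   # non-phosphorylated class
    # Assumption: unweighted (macro) average over the two classes.
    precision, recall, f1 = ((a + b) / 2 for a, b in zip(positive, negative))
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "baseline_error": 1.0 - accuracy,
        "confusion": [[tn, fp], [fn, tp]],
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Macro-averaging treats both classes equally, which matters little on the balanced datasets used here but would change the numbers on an unbalanced one.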


IV.1.2.5 Optimum Choice of the Parameters of the CNN

From the experimental results of the first algorithm, it was evident that the optimal parameters of the CNN were a batch size of 50 and 20 epochs. These parameters often give good results when training and testing the model for all the window sizes of the PSSM profile. All the assessment measures are better than with the other parameter values, as can be seen in the figure. Therefore, the proposed prediction system of this study will run the next algorithm with the parameters used in experiments number two and six on the previous dataset.

Figure IV.3: Effects of window size on Phosphorylated Serine, Threonine, and Tyrosine sites prediction performance in both the CNN and Hybrid CNN SVM classifier.

IV.2 Dataset C

In this section, the experiments were carried out on just a subset of the proteins in dataset B. The B dataset has 350 abundant proteins; we extracted the PDB ids of these proteins and applied the same algorithms that were used on the B dataset. This dataset, called the C dataset, contains all the PDB ids of the 350 proteins in dataset B that exist in the main PDB dataset A. The dataset C has two subsets, and each one has three kinds of residues: serine, threonine, and tyrosine. The experiments in this section are performed by choosing random sub-datasets from the C dataset. There are three such sub-datasets, and each contains 2000 samples of one residue: 1000 phosphorylation sites and 1000 non-phosphorylation sites.

Since the parameters of experiments 2 and 6 performed well in terms of accuracy in both the CNN and the hybrid CNN SVM classifiers, this section considers only these parameters when applying the same two algorithms as in the previous section. In the next experiments, the training dataset was prepared to have an equal proportion of positive and negative samples from the C dataset. The total size of the training dataset was 2000 instances, where half of them are positive samples and the other half are negative instances. The merged dataset was split into a training dataset with an 80 percent proportion and a testing dataset with 20 percent. Then, random validation over five splits was applied to the merged dataset. The window sizes of the PSSM profiles were 15, 17, 19, 21, 23, 25, 27, 29, and 31. Accuracy, Precision, F1-score, Recall, Baseline error, and the execution time were measured in every round to evaluate the model. The mean and the standard deviation of all the performance measurements were calculated over each set of five random executions of the algorithm.

IV.2.1 Convolutional Neural Networks CNN Model

IV.2.1.1 Experiment 1 Epochs: 20 & Batch_size: 50

The model was trained using only the CNN, with 20 epochs and a batch size of 50. The C dataset was used with 2000 samples, both positive and negative. The CNN was trained on the training dataset and its performance was tested on the testing dataset. The accuracy (Ac), precision (prec), recall (rec), F1-score, and confusion matrix were calculated to evaluate the predicted values of the model. Table IV.9 and Table IV.10 below show the mean and the standard deviation of the results over five random cross selections using this type of training dataset.


                          Window Size
Residue  Metrics         13      15      17      19      21      23      25      27      29      31
S        Accuracy        83.50%  84.50%  84.31%  86.06%  87.06%  86.33%  87.25%  88.50%  87.50%  88.70%
         Precision       83.73%  84.54%  84.91%  86.41%  87.23%  86.38%  87.47%  88.62%  88.31%  88.90%
         Recall          83.42%  84.49%  84.47%  86.23%  87.04%  86.35%  87.28%  88.58%  87.41%  88.78%
         F1-score        83.42%  84.46%  84.26%  86.04%  87.02%  86.30%  87.21%  88.48%  87.40%  88.67%
         Baseline error  16.50%  15.50%  15.69%  13.94%  12.94%  13.67%  12.75%  11.50%  12.50%  11.30%
         Execution Time  8.75    9.25    9.50    9.00    8.25    9.00    8.75    9.00    4.00    9.60
T        Accuracy        90.00%  92.38%  92.44%  93.25%  94.31%  94.25%  93.69%  94.19%  93.94%  94.44%
         Precision       89.98%  92.45%  92.48%  93.26%  94.34%  94.25%  93.71%  94.18%  93.95%  94.44%
         Recall          90.02%  92.35%  92.42%  93.31%  94.36%  94.24%  93.79%  94.24%  93.98%  94.48%
         F1-score        89.99%  92.35%  92.42%  93.24%  94.30%  94.24%  93.69%  94.18%  93.93%  94.43%
         Baseline error  10.00%  7.63%   7.56%   6.75%   5.69%   5.75%   6.31%   5.81%   6.06%   5.56%
         Execution Time  3.54    8.25    8.50    7.75    7.00    6.60    6.75    7.50    7.50    8.50
Y        Accuracy        90.00%  92.38%  92.44%  89.00%  94.31%  90.69%  93.69%  94.19%  93.94%  94.44%
         Precision       88.25%  88.32%  89.69%  90.68%  89.86%  90.69%  90.09%  90.97%  90.80%  91.45%
         Recall          88.14%  88.30%  89.56%  90.69%  89.73%  90.72%  90.05%  91.01%  90.80%  91.48%
         F1-score        88.10%  88.29%  89.52%  90.67%  89.67%  90.67%  90.05%  90.90%  90.68%  91.37%
         Baseline error  9.88%   7.69%   8.45%   9.31%   5.30%   9.31%   6.92%   5.08%   6.31%   5.63%
         Execution Time  8.25    8.25    9.00    8.75    7.00    7.25    7.33    7.33    7.75    7.75

Table IV.9: The mean of all the measurements in the CNN classifier (epochs: 20, batch size: 50)


                          Window Size
Residue  Metrics         13      15      17      19      21      23      25      27      29      31
S        Accuracy        0.008   0.0234  0.009   0.006   0.006   0.010   0.011   0.013   0.015   0.02
         Precision       0.007   0.023   0.006   0.007   0.004   0.010   0.007   0.013   0.009   0.021
         Recall          0.007   0.023   0.008   0.005   0.005   0.010   0.012   0.013   0.015   0.022
         F1-score        0.008   0.023   0.009   0.006   0.006   0.011   0.011   0.013   0.015   0.023
         Baseline error  0.008   0.023   0.009   0.006   0.006   0.010   0.011   0.013   0.015   0.023
         Execution Time  0.004   0.004   0.005   0.007   0.004   0.008   0.008   0.007   0       0.008
T        Accuracy        0.009   0.010   0.008   0.004   0.006   0.006   0.006   0.007   0.002   0.005
         Precision       0.006   0.008   0.007   0.004   0.006   0.006   0.006   0.008   0.003   0.005
         Recall          0.011   0.0104  0.008   0.005   0.006   0.006   0.005   0.007   0.002   0.005
         F1-score        0.010   0.010   0.008   0.004   0.006   0.006   0.006   0.007   0.002   0.005
         Baseline error  0.009   0.010   0.008   0.004   0.006   0.006   0.006   0.007   0.002   0.005
         Execution Time  0       0.004   0.011   0.008   0.007   0.005   0.004   0.005   0.005   0.011
Y        Accuracy        0.017   0.016   0.012   0.017   0.016   0.013   0.008   0.021   0.013   0.013
         Precision       0.018   0.017   0.013   0.018   0.017   0.014   0.009   0.022   0.013   0.013
         Recall          0.016   0.016   0.011   0.017   0.015   0.013   0.009   0.020   0.012   0.012
         F1-score        0.017   0.016   0.012   0.018   0.016   0.014   0.009   0.021   0.013   0.013
         Baseline error  0.017   0.016   0.012   0.018   0.016   0.014   0.009   0.021   0.013   0.013
         Execution Time  0.004   0.008   0.009   0.011   0.006   0.008   0.005   0.005   0.004   0.004

Table IV.10: The standard deviation of all the assessments in the CNN classifier (epochs: 20, batch size: 50)
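The evaluation protocol behind the mean and standard deviation tables (a balanced sample of positive and negative sites, an 80/20 split, five random repetitions) can be sketched as follows. This is an illustration under stated assumptions rather than the thesis implementation: the helper names `balanced_sample` and `repeated_holdout` are hypothetical, the sample standard deviation is assumed (the thesis does not say whether the sample or population form was used), and a simple threshold rule stands in for the CNN.

```python
import random
import statistics


def balanced_sample(positives, negatives, n_per_class, rng):
    """Draw the same number of positive (label 1) and negative (label 0) items."""
    data = [(x, 1) for x in rng.sample(positives, n_per_class)]
    data += [(x, 0) for x in rng.sample(negatives, n_per_class)]
    rng.shuffle(data)
    return data


def repeated_holdout(data, train_and_score, n_rounds=5, test_fraction=0.2, seed=0):
    """Shuffle, split 80/20, score; repeat n_rounds times and summarize."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = data[:]
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        test, train = shuffled[:n_test], shuffled[n_test:]
        scores.append(train_and_score(train, test))
    # Assumption: sample standard deviation over the rounds.
    return statistics.mean(scores), statistics.stdev(scores)


def threshold_classifier(train, test):
    """Hypothetical stand-in for the CNN: threshold halfway between class means."""
    pos_mean = statistics.mean(x for x, y in train if y == 1)
    neg_mean = statistics.mean(x for x, y in train if y == 0)
    threshold = (pos_mean + neg_mean) / 2
    correct = sum(1 for x, y in test if (1 if x > threshold else 0) == y)
    return correct / len(test)


# Toy one-dimensional "features"; real inputs would be PSSM windows.
positives = [1.0 + 0.01 * i for i in range(50)]
negatives = [-1.0 - 0.01 * i for i in range(50)]
data = balanced_sample(positives, negatives, 20, random.Random(1))
mean_acc, std_acc = repeated_holdout(data, threshold_classifier)
print(mean_acc, std_acc)
```

Swapping `threshold_classifier` for a function that trains the real model and returns any of the reported metrics would reproduce one cell of the mean table and the matching cell of the standard deviation table.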


IV.2.2 Hybrid CNN Support Vector Machines SVM classifier

In this algorithm, the SVM is used along with the CNN to build a model that predicts the phosphorylation sites of a protein. First, the CNN is trained, which adjusts the weights of the parameters in the network. After the model was trained and prepared for the SVM, the SVM was applied to the values extracted from the last layer of the CNN, before the Softmax function is applied. The CNN was thus employed to extract the important features from the PSSM profile of one of the three residues, and a classification algorithm such as the SVM was applied to them. This model uses the same dataset as the first algorithm, which is the C dataset. The C dataset has 2000 samples merged from the positive and negative datasets C+ and C−, with 1000 from the negative dataset and the other 1000 from the positive dataset, for one of the three residues S, T, and Y.

Five-times cross validation was applied to evaluate the system using various measurements such as Accuracy, Precision, F1-score, Recall, Baseline error, and the execution time for every round of the execution. Then, the average of all measurements was taken as the final measure for the chosen parameters. The splitting and shuffling of the training and testing datasets were used to tune and test the parameters of each experiment.

IV.2.2.1 Experiment 1 Epochs: 20 & Batch_size: 50

In this experiment, the number of epochs was the same as in the previous experiment, which is 20, and the batch size was 50. The model was trained with the same C dataset as the previous one, with equal numbers of positive and negative samples. The window sizes of the PSSM profile files were the odd numbers from 13 to 31 for each of the three residues S, T, Y. The results of this experiment are shown in Table IV.11.


                          Window Size
Residue  Metrics         13         15         17         19         21         23         25         27         29         31
S        Accuracy        89.00%     93.94%     93.69%     94.19%     81.00%     94.31%     92.38%     90.00%     80.35%     91.95%
         Precision       78.42%     80.43%     80.99%     80.48%     82.16%     80.44%     82.20%     80.78%     81.65%     82.47%
         Recall          78.10%     79.81%     80.52%     79.32%     81.34%     79.08%     81.15%     79.12%     80.53%     81.83%
         F1-score        77.82%     79.52%     80.20%     78.91%     80.91%     78.40%     80.69%     78.49%     80.17%     81.56%
         Baseline error  11.00%     15.50%     15.69%     13.94%     12.94%     13.67%     7.63%      10.00%     19.65%     8.05%
         Execution Time  12.125     13.225     14.1225    14.1075    13.7575    15.21      15.585     16.28      10.688     18.376
T        Accuracy        90.50%     90.56%     89.50%     90.38%     89.19%     90.20%     90.00%     89.19%     89.06%     90.13%
         Precision       90.50%     90.53%     89.60%     90.36%     89.22%     90.22%     90.01%     89.31%     89.25%     90.18%
         Recall          90.46%     90.59%     89.53%     90.38%     89.20%     90.21%     89.97%     89.14%     88.99%     90.09%
         F1-score        90.48%     90.55%     89.48%     90.36%     89.17%     90.18%     89.97%     89.15%     89.01%     90.09%
         Baseline error  10.00%     7.63%      7.56%      6.75%      5.69%      5.75%      6.31%      5.81%      6.06%      5.56%
         Execution Time  4.56       11.5075    12.3875    12.1325    11.7275    11.956     13.3525    14.09      14.525     16.1875
Y        Accuracy        85.38%     85.38%     87.00%     86.94%     87.65%     88.19%     87.33%     87.92%     88.56%     88.44%
         Precision       85.38%     85.52%     87.00%     86.93%     87.66%     88.18%     87.33%     87.92%     88.62%     88.52%
         Recall          85.41%     85.42%     87.02%     86.94%     87.58%     88.13%     87.25%     87.83%     88.46%     88.34%
         F1-score        85.36%     85.35%     86.98%     86.91%     87.60%     88.15%     87.28%     87.86%     88.51%     88.38%
         Baseline error  14.63%     14.63%     13.00%     13.06%     12.35%     11.81%     12.67%     12.08%     11.44%     11.56%
         Execution Time  11.1125    11.69      13.07      13.3875    12.226     13.00      13.64      14.279967  15.43      17.8925

Table IV.11: The mean of all the measurements in Hybrid CNN Support Vector Machines SVM classifier (epochs: 20, batch size: 50)
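The hybrid scheme of Section IV.2.2 hands the activations of the last CNN layer (taken before Softmax) to an SVM. The sketch below illustrates only that second stage, under loud assumptions: the list `cnn_features` is a hypothetical stand-in for feature vectors already extracted by a trained CNN, and a linear SVM trained by stochastic subgradient descent on the regularized hinge loss is swapped in for illustration, since this section does not state which kernel or solver the thesis used.

```python
import random


def train_linear_svm(features, labels, epochs=50, lr=0.01, reg=0.01, seed=0):
    """Fit (w, b) on the L2-regularized hinge loss; labels must be -1 or +1."""
    rng = random.Random(seed)
    w = [0.0] * len(features[0])
    b = 0.0
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = features[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # Gradient step on the regularizer shrinks the weights.
            w = [wj * (1.0 - lr * reg) for wj in w]
            if margin < 1.0:
                # Point is inside the margin: step along the hinge subgradient.
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def svm_predict(w, b, x):
    """Return +1 (phosphorylated) or -1 (non-phosphorylated)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical stand-in for last-layer CNN activations of eight sites.
cnn_features = [[2.0, 2.0], [3.0, 1.0], [1.0, 3.0], [2.0, 3.0],
                [-2.0, -2.0], [-3.0, -1.0], [-1.0, -3.0], [-2.0, -3.0]]
site_labels = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(cnn_features, site_labels)
predictions = [svm_predict(w, b, x) for x in cnn_features]
print(predictions)
```

In a real pipeline the feature vectors would come from a forward pass of the trained network truncated before its Softmax layer, and the SVM would be fitted on the training split only.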


As in the first experiment, we calculated the mean and the standard deviation over the five executions of the code for every residue with different window sizes of the PSSM profile.

The standard deviation of all the performance measures was taken to evaluate the results and to see whether they deviate between runs. The results of this evaluation are given in Table IV.12 below.


                          Window Size
Residue  Metrics         13        15        17        19        21        23        25        27        29        31
S        Accuracy        0.012437  0.026955  0.008478  0.014197  0.021723  0.022016  0.012793  0.031869  0.031448  0.018
         Precision       0.008757  0.022959  0.004201  0.006227  0.013297  0.013255  0.006199  0.022301  0.026893  0.014023
         Recall          0.009796  0.02506   0.004914  0.01662   0.01619   0.015984  0.007839  0.027452  0.028811  0.013898
         F1-score        0.012483  0.027298  0.008511  0.01727   0.021879  0.022189  0.012713  0.032743  0.031941  0.017692
         Baseline error  0.007906  0.023385  0.009079  0.006219  0.005694  0.010475  0.010897  0.013346  0.017068  0.02255
         Execution Time  0.067268  0.065     0.09909   0.126763  0.143592  0.120277  0.230814  0.102713  1.672398  0.205387
T        Accuracy        0.011402  0.012038  0.009843  0.01293   0.010807  0.008124  0.012374  0.002724  0.004801  0.002795
         Precision       0.011451  0.012475  0.008422  0.013502  0.010827  0.008655  0.012459  0.002759  0.005268  0.003088
         Recall          0.010434  0.011856  0.009878  0.012718  0.010636  0.007943  0.012715  0.002571  0.006152  0.003495
         F1-score        0.011425  0.012201  0.009862  0.013116  0.010936  0.008206  0.012565  0.00252   0.005377  0.003148
         Baseline error  0.00886   0.010078  0.007578  0.003953  0.005962  0.005701  0.005962  0.007369  0.002073  0.004801
         Execution Time  0.037736  0.042647  0.078859  0.157063  0.085257  0.232775  0.159902  0.118954  0.180347  0.138812
Y        Accuracy        0.016724  0.019804  0.013134  0.016618  0.022282  0.015039  0.012304  0.016499  0.01874   0.017533
         Precision       0.01736   0.02048   0.013868  0.01718   0.02274   0.015382  0.012338  0.016887  0.018147  0.017267
         Recall          0.016136  0.018705  0.012596  0.016551  0.022932  0.015739  0.013529  0.017517  0.020334  0.019204
         F1-score        0.016948  0.019874  0.013223  0.016925  0.022839  0.015579  0.013037  0.017143  0.019577  0.01839
         Baseline error  0.016817  0.016044  0.011554  0.017354  0.016386  0.01339   0.00825   0.02085   0.013273  0.013405
         Execution Time  0.042647  0.10247   0.096954  0.133112  0.213691  0.105594  0.08165   0.055578  0.139284  0.04603

Table IV.12: The standard deviation of all the assessments in the CNN classifier (epochs: 20, batch size: 50)
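Every experiment above varies the PSSM window size, that is, the number of PSSM rows centered on a candidate S, T, or Y residue that form one training instance. The following is a minimal sketch of how such a window might be cut, assuming a PSSM stored as a list of rows with 20 scores each and assuming zero padding where the window runs past either end of the sequence; the padding rule and the function names are illustrative assumptions, not taken from the thesis.

```python
def candidate_sites(sequence, residues="STY"):
    """Indices of residues that could be phosphorylation sites."""
    return [i for i, aa in enumerate(sequence) if aa in residues]


def pssm_window(pssm, center, window_size, pad_value=0.0):
    """Return window_size PSSM rows centered on `center`.

    Rows that would fall outside the protein are filled with pad_value
    (an assumption; other padding schemes are possible).
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so the site is centered")
    half = window_size // 2
    n_cols = len(pssm[0])
    rows = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            rows.append(list(pssm[pos]))
        else:
            rows.append([pad_value] * n_cols)
    return rows


# Toy protein of five residues; row i of the toy PSSM holds the value i.
sequence = "MSTAY"
pssm = [[float(i)] * 20 for i in range(len(sequence))]

sites = candidate_sites(sequence)
window = pssm_window(pssm, center=4, window_size=3)
first_column = [row[0] for row in window]
features_at_31 = 31 * len(pssm[0])
print(sites, first_column, features_at_31)
```

With 20 columns per row, a window of 13 rows gives 260 values per instance and a window of 31 rows gives 620, which is why larger windows carry more features and cost more time to train on.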


IV.3 Details

IV.3.1 Details

IV.4 Effect of Window Size

The random cross-validation performance on dataset B for different window sizes for each of the three residues is further illustrated in Figure IV.4. The analysis reveals that if more features are included in a single instance of training data, which means increasing the window size, the prediction accuracy of the CNN and the Hybrid CNN Support Vector Machines SVM models tends to increase, although there is some drop in accuracy, especially when the Hybrid CNN Support Vector Machines SVM classifier is applied.

Figure IV.4: Effects of window size on Phosphorylated Serine, Threonine, and Tyrosine sites prediction performance in both the CNN and Hybrid CNN SVM classifier in terms of the accuracy assessment.

However, if the window size of the PSSM is increased, which means more features are added to train the model and make a prediction, the model becomes more stable in its prediction of the unknown residue. The computational complexity and the time required for the hybrid CNN SVM training and prediction also increase as the window size grows. So in this thesis, window size 31 was considered to be the maximum threshold.

From Figure IV.4 above, it can be observed that the mean accuracy of the CNN algorithm is better than that of the hybrid CNN SVM classifier. Precisely, the mean accuracy of the CNN model is between 73.69% and 78.20%, its best accuracy. On the other hand, the mean accuracy of the hybrid CNN SVM classifier is in the range of 64.00% to 70.75%, the latter being the best accuracy of this model when the PSSM window size is 31. Also, the best accuracy among all three residues S, T, Y in both models is that of the T residue in the CNN system, where it reached 78.20%. Surprisingly, the worst accuracy among all residues in both algorithms is also that of the T residue, in the hybrid CNN SVM, at nearly 58%. The residue T in the CNN is more stable and performs better compared with the same residue in the hybrid CNN SVM classifier, which tended to be unstable and behaves differently at almost every window size of the PSSM profile. Why the T residue shows such a pronounced performance drop in the hybrid CNN SVM model is still an open question.

Another factor that explains more clearly the relationship between the CNN and the hybrid CNN SVM is the baseline error, illustrated in Figure IV.5.

Figure IV.5: The mean of the baseline errors in percentages for all the residues of protein phosphorylation sites in both models, the CNN and Hybrid CNN SVM. The window size of the PSSM profile is 15, 17, 19, 21, 23, 25, 27, 29, 31 for the three residues S, T, Y of the protein.

Most of the time, Tyrosine and Threonine residues have fewer errors in the CNN model than the corresponding residues in the hybrid CNN SVM. Looking closely, it is evident that the best window size for the threonine residues in the hybrid CNN SVM was 27, while the worst window size for it was 31, where the baseline error was 41.5%.

Taking each residue one by one and checking the best window size for it in both models, it is clear from Figures IV.4 and IV.5 that Serine in the CNN performed best when the window size is 21, while 25 is the best choice for serine in the hybrid CNN SVM classifier. On the other hand, the best window size for Threonine in the CNN is 31, with a baseline error of 21.80%, while 21, 25, and 27 are considered fair choices for Threonine in the second model. Finally, 21 is considered the optimal window size for Tyrosine in the CNN model, while in the hybrid CNN SVM classifier a window size of 25 is the best selection for tyrosine, with a baseline error of nearly 27%. Overall, the window size 25 is the optimal option for both models, regardless of which of the two is used.

IV.5 The Elapsed Time Results

The execution time is the time taken to run the model on both the training dataset and the testing dataset until the performance is measured. The execution times follow the pattern we expect: the CNN model tended to be faster than the hybrid CNN SVM, as shown in Figure IV.6. The hybrid CNN SVM is slower because the CNN must first finish training and pass its features to the SVM, which means the hybrid CNN SVM classifier incurs two costs: the time of the CNN, and the time of the SVM to train on the CNN's features and test the model.

Figure IV.6: The mean of Elapsed Time results for all window sizes with a batch size of 50 and 20 epochs, using both models, the CNN and the hybrid CNN SVM.
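The two-part cost of the hybrid classifier can be made concrete with a small timing sketch. It assumes wall-clock timing with `time.perf_counter`; the functions `train_cnn` and `train_svm` are hypothetical stand-ins that only simulate work, and the thesis does not state how its execution times were actually measured.

```python
import time


def timed(stage, *args):
    """Run stage(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = stage(*args)
    return result, time.perf_counter() - start


# Hypothetical stand-ins: real code would train the CNN, then fit the
# SVM on the features the CNN produced.
def train_cnn(samples):
    return [sum(sample) for sample in samples]


def train_svm(features):
    return sorted(features)


samples = [[0.1 * i, 0.2 * i] for i in range(1000)]
features, cnn_seconds = timed(train_cnn, samples)
_, svm_seconds = timed(train_svm, features)

# The hybrid model pays for both stages; the plain CNN only for the first.
hybrid_seconds = cnn_seconds + svm_seconds
print(f"CNN only: {cnn_seconds:.6f} s, hybrid: {hybrid_seconds:.6f} s")
```

Because the second stage cannot start until the first has finished, the hybrid time is always at least the CNN time, which matches the pattern described for Figure IV.6.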


The time degradation related to the raw PSSM profile, in both the CNN and the hybrid CNN SVM, is not severe. This can be inferred from Figure IV.6.

Depending on the model, the Serine residue will likely run much slower than Threonine and Tyrosine, and the last one is mostly slower than Threonine in the hybrid CNN SVM model. The same observation applies to the CNN model. The fastest execution time among all residues in both models is 2.80 s for Threonine in the CNN model with window size 15, which makes sense because it is the smallest window.

As previously discussed, these facts will likely not affect the development of a prediction system for the phosphorylation sites of the protein, because laboratory techniques are slower, time consuming, and expensive.

IV.6 Comparison with the existing methods

To compare the performance of the proposed system with other existing systems, we examined the best model obtained from the experiments, which is the CNN with the following parameters: a batch size of 50, 20 epochs, and a window size of 31 × 20. We randomly chose 50 protein entries from the 297 proteins in an article [60]. The article evaluated five in silico prediction methods, which are Scansite 2.0 [61], NetPhosK [62], DISPHOS [63], KinasePhos [64], and PPSP [65], with a benchmark dataset. The collection of 50 proteins is called the D dataset; it was selected randomly, but deliberately so that the chosen proteins have many phosphorylated sites. The D dataset has 68, 79 and 70 positive serine, threonine and tyrosine phosphorylated sites respectively on those 50 proteins. We created an independent dataset, which contains 140 sites for each one of the three residues S, T, Y, where the S dataset has 60 positive and 80 negative locations, while both the T and Y datasets have an equal number of positive and negative residues.

Method's Name         S out of 140  T out of 140  Y out of 140
ppred [15]            72            78            82
ams [66]              85            54            64
disphos [63]          73            85            87
gps [67]              74            81            78
kinasephos [68]       75            89            92
netphos [62]          60            74            82
phosida [69]          82            89            87
ppsp [34]             74            82            77
Scansite 2.0 [61]     67            92            96
The Proposed System   99            102           104

Table IV.13: The comparison of the proposed system with some of the existing systems regarding prediction scores.


Table IV.13 above shows the results of the comparison between the proposed system and other prediction methods, which provide evidence that the proposed model performs better than all the other methods in the table. The proposed method accurately predicts 99, 102 and 104 phosphorylated sites of serine, threonine, and tyrosine respectively, out of the 140 annotated sites for each residue in the independent dataset.
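The scores in Table IV.13 can be turned into percentages and margins with a few lines of arithmetic. The sketch below copies the scores from the table; the variable names are illustrative, and the percentages are simply each score divided by the 140 sites per residue.

```python
TOTAL_SITES = 140
RESIDUES = ("S", "T", "Y")

# Scores copied from Table IV.13, in the order (S, T, Y).
scores = {
    "ppred": (72, 78, 82),
    "ams": (85, 54, 64),
    "disphos": (73, 85, 87),
    "gps": (74, 81, 78),
    "kinasephos": (75, 89, 92),
    "netphos": (60, 74, 82),
    "phosida": (82, 89, 87),
    "ppsp": (74, 82, 77),
    "Scansite 2.0": (67, 92, 96),
    "The Proposed System": (99, 102, 104),
}

proposed = scores["The Proposed System"]
existing = {name: s for name, s in scores.items() if name != "The Proposed System"}

# Share of the 140 sites that the proposed system scores, per residue.
proposed_percent = {
    residue: round(100 * proposed[i] / TOTAL_SITES, 1)
    for i, residue in enumerate(RESIDUES)
}

# Best existing method per residue and the margin of the proposed system.
best_existing = {
    residue: max(existing, key=lambda name: existing[name][i])
    for i, residue in enumerate(RESIDUES)
}
margin = {
    residue: proposed[i] - existing[best_existing[residue]][i]
    for i, residue in enumerate(RESIDUES)
}

print(proposed_percent)
print(best_existing)
print(margin)
```

On these numbers the proposed system scores 70.7%, 72.9%, and 74.3% of the S, T, and Y sites, and leads the best existing method for each residue by 14, 10, and 8 sites respectively.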


CHAPTER V

DISCUSSION

V.1 Summary of Thesis

Most of the existing phosphorylation site prediction systems use kinase-specific information of phosphorylated sites. In that circumstance, it is required to filter out those proteins without kinase annotations from the phosphorylation-positive dataset found to date from Phospho.ELM. It can be seen from the present update of this dataset that only 30% of the actual phosphorylation sites contain kinase notes, which means more than 70% of the dataset is eliminated in the design of the existing kinase-specific prediction systems. This significant truncation ignores some essential properties of phosphorylation sites, such as the evolutionary conservation that would be useful in classifying phosphorylation sites. The evolutionary information of proteins, called the PSSM profiles of the protein and usually obtained through a multiple sequence alignment against an extensive database that contains a large number of proteins, encapsulates position-specific amino acid conservation in both phosphorylated and non-phosphorylated sites. Thus, this evolutionary conservation information played a significant role in identifying the sites in the proposed system. This information is widely and successfully used in the prediction of protein-protein interaction sites [70], the prediction of DNA binding sites in proteins [71], the secondary structure, or even finding motifs [72].

The proposal in this thesis overcomes the limitation of kinase-specific prediction methods by developing a prediction system that was trained only with the evolutionary information of phosphorylated proteins. This time, the system purposely ignored the kinase-specific information of the phosphorylation sites to demonstrate the importance of the evolutionary profiles alone for predicting phosphorylation sites. The prediction results show that the proposed system can classify phosphorylated and non-phosphorylated sites from given primary sequences of a protein, in either the protein accession or the PDB form, accurately enough to be used with any existing system. The results provide evidence that the evolutionary information of proteins can be used in both machine learning and deep learning algorithms to classify the phosphorylated and non-phosphorylated sites.

All serine, threonine, and tyrosine residues which were not labeled as phosphorylated were considered negative sites in designing the system. The number of negative sites was larger than the number of positive sites in both datasets. Therefore, it is possible that the number of negative sites adds bias to the measurement of prediction accuracy. If the system was trained with a different ratio of sites, where the number of negative sites is larger than the number of positive sites, then the results show that the prediction system labels most of the sites as negative. So to achieve a good prediction accuracy, an equal ratio of positive and negative sites in the training dataset was required.

In this study, separate experimental results show that the performance of the model on the C dataset achieves higher accuracy than on the B dataset. Also, the CNN model performed much better than the hybrid CNN-SVM in terms of Accuracy, Precision, F1 score, Recall, and Baseline error. Furthermore, the numbers of sites for serine, threonine, and tyrosine were not equal either. So three separate prediction modules were built in the proposed system for identifying the three phosphorylated residues S, T, and Y.

V.2 Conclusion and Future Work

Ultimately the goal of this thesis was to gain a deeper understanding of the concept of the PSSM profile and of how phosphorylation sites of the protein occur, in order to build a robust system to predict the protein phosphorylation sites. To understand these, it is required to have a firm grasp of the underlying architecture of deep learning and machine learning algorithms.

We first identified the underlying drawbacks of using only kinase information to build a prediction system for phosphorylation sites of a protein. A database was developed that contains the essential features of the protein. Then, a phosphorylation site prediction system with a novel approach was introduced that incorporated evolutionary information of proteins of both phosphorylated and non-phosphorylated classes. Both Convolutional Neural Network CNN and Hybrid CNN Support Vector Machines SVM classifiers were used to predict the phosphorylation sites of the protein. The system shows better prediction performance than some of the existing non-generalized prediction systems. Also, the comparison was made on both datasets, using either the protein accession or the PDB form as the input of the model. The model shows that using the PSSM profile of the PDB id of the protein instead of the PSSM of the whole protein gives much better performance and reaches high accuracy. The results also indicate the significance of using the CNN algorithm to classify the evolutionary information of both phosphorylated and non-phosphorylated proteins for the proposed system.

There is considerable scope to improve the design presented in this thesis and also to use a similar model for other research. Future research directions are listed below:

- More information can be included: if the evolutionary information of proteins were used in conjunction with other features of the phosphorylated sites of the protein to train separate machine learning or deep learning techniques, with rules defined to select the conclusion that the majority of the systems provide, a more robust prediction system could be developed.
- Another area of future work would be to modify the structure of the CNN and tune more of its parameters to reach higher prediction accuracy.
- Using other machine learning approaches, such as Hidden Markov Models with profiling, and comparing the prediction performance of the proposed system across different machine learning approaches would reveal which machine learning method is better for classifying phosphorylation sites.
- Web service development: a web service can be developed so that users from around the world can submit any protein sequence of interest for phosphorylation site annotation.
- The results on PDB are surprising, which makes it interesting to further explore the impact of using PDB IDs instead of the protein accession to make the prediction.
- Improving prediction by designing parallel code to extract the PSSM profiles quickly from the PSI-BLAST program.
- Applying Non-Negative Matrix Factorization to the PSSM profile of the protein along with the other features that we have collected in the database.


REFERENCES

[1] A. Lehninger, D. Nelson, and M. Cox, "Bioenergetics and metabolism, principle of biochemistry," 2nd Preprint, CBS Publisher and Distribution, pp. 313, 1987.
[2] N. H. G. R. I. NHGRI. Primary structure of a protein. [Online]. Available: https://commons.wikimedia.org/wiki/File:Protein-primary-structure.png
[3] M. M. Gromiha, Protein bioinformatics: from sequence to function. Academic Press, 2010.
[4] T. S. C. Q. David Secko. Addition of a phosphate to an amino acid. [Online]. Available: http://www.scq.ubc.ca/protein-phosphorylation-a-global-regulator-of-cellular-activity/
[5] J. Pevsner, Bioinformatics and functional genomics. John Wiley & Sons, 2015.
[6] S. Klumpp and J. Krieglstein, "Phosphorylation and dephosphorylation of histidine residues in proteins," European journal of biochemistry, vol. 269, no. 4, pp. 1067, 2002.
[7] A. J. Cozzone, "Protein phosphorylation in prokaryotes," Annual Reviews in Microbiology, vol. 42, no. 1, pp. 97, 1988.
[8] P. Cohen, "The role of protein phosphorylation in human health and disease." European journal of biochemistry, vol. 268, no. 19, pp. 5001, 2001.
[9] E. H. Fischer and E. G. Krebs, "Conversion of phosphorylase b to phosphorylase a in muscle extracts," Journal of Biological Chemistry, vol. 216, no. 1, pp. 121, 1955.
[10] S. B. Ficarro, M. L. McCleland, P. T. Stukenberg, D. J. Burke, M. M. Ross, J. Shabanowitz, D. F. Hunt, and F. M. White, "Phosphoproteome analysis by mass spectrometry and its application to saccharomyces cerevisiae," Nature biotechnology, vol. 20, no. 3, p. 301, 2002.
[11] A. Krupa, G. Preethi, and N. Srinivasan, "Structural modes of stabilization of permissive phosphorylation sites in protein kinases: distinct strategies in ser/thr and tyr kinases," Journal of molecular biology, vol. 339, no. 5, pp. 1025, 2004.
[12] J. Hu, Y.-F. Zhao, and Y.-M. Li, "The effects of reversible phosphorylation on peptide and protein local structure," Phosphorus, Sulfur, and Silicon, vol. 183, no. 2-3, pp. 249, 2008.
[13] L. A. Pinna and M. Ruzzene, "How do protein kinases recognize their substrates?" Biochimica et Biophysica Acta (BBA)-Molecular Cell Research, vol. 1314, no. 3, pp. 191, 1996.
[14] Y. Xue, F. Zhou, M. Zhu, K. Ahmed, G. Chen, and X. Yao, "Gps: a comprehensive www server for phosphorylation sites prediction," Nucleic acids research, vol. 33, no. suppl_2, pp. W184-W187, 2005.
[15] A. K. Biswas, N. Noman, and A. R. Sikder, "Machine learning approach to predict protein phosphorylation sites by incorporating evolutionary information," BMC bioinformatics, vol. 11, no. 1, p. 273, 2010.
[16] E. G. Krebs and J. A. Beavo, "Phosphorylation-dephosphorylation of enzymes," Annual review of biochemistry, vol. 48, no. 1, pp. 923, 1979.
[17] R. E. Bryant, E. A. Adelberg, and P. T. Magee, "Properties of an altered rna polymerase ii activity from an α-amanitin-resistant mouse cell line," Biochemistry, vol. 16, no. 19, pp. 4237, 1977.
[18] E. Buxbaum, Fundamentals of protein structure and function. Springer, 2007, vol. 31.
[19] T. Li, P. Du, and N. Xu, "Identifying human kinase-specific protein phosphorylation sites by integrating heterogeneous information from various sources," PloS one, vol. 5, no. 11, p. e15411, 2010.


[20] A. Ensminger, M. Ensminger, J. Konlande, and J. Robson, "Food & nutrition encyclopedia. crc, press," Inc., Boca Raton, FL, 1994.
[21] J. S. White and D. C. White, Proteins, Peptides and Amino Acids SourceBook. Springer Science & Business Media, 2002.
[22] F. Vazquez, S. Ramaswamy, N. Nakamura, and W. R. Sellers, "Phosphorylation of the pten tail regulates protein stability and function," Molecular and cellular biology, vol. 20, no. 14, pp. 5010-5018, 2000.
[23] E. G. Krebs and E. H. Fischer, "The phosphorylase b to a converting enzyme of rabbit skeletal muscle," Biochimica et biophysica acta, vol. 20, pp. 150, 1956.
[24] Y. Shi, "Serine/threonine phosphatases: mechanism through structure," Cell, vol. 139, no. 3, pp. 468, 2009.
[25] P. Cohen, "The origins of protein phosphorylation," Nature cell biology, vol. 4, no. 5, p. E127, 2002.
[26] E. W. Sutherland and W. D. Wosilait, "Inactivation and activation of liver phosphorylase," Nature, vol. 175, no. 4447, p. 169, 1955.
[27] M. A. Lawlor and D. R. Alessi, "Pkb/akt: a key mediator of cell proliferation, survival and insulin responses?" Journal of cell science, vol. 114, no. 16, pp. 2903, 2001.
[28] T. Hunter, "The croonian lecture 1997. the phosphorylation of proteins on tyrosine: its role in cell growth and disease," Philosophical Transactions of the Royal Society of London B: Biological Sciences, vol. 353, no. 1368, pp. 583, 1998.
[29] F. Ardito, M. Giuliani, D. Perrone, G. Troiano, and L. Lo Muzio, "The crucial role of protein phosphorylation in cell signaling and its use as targeted therapy," International journal of molecular medicine, vol. 40, no. 2, pp. 271, 2017.
[30] H. Kuwahara, M. Nishizaki, and H. Kanazawa, "Nuclear localization signal and phosphorylation of serine350 specify intracellular localization of drak2," Journal of biochemistry, vol. 143, no. 3, pp. 349, 2007.
[31] O. Rosen and J. Erlichman, "Reversible autophosphorylation of a cyclic 3':5'-amp-dependent protein kinase from bovine cardiac muscle." Journal of Biological Chemistry, vol. 250, no. 19, pp. 7788, 1975.
[32] J. C. Betts, W. P. Blackstock, M. A. Ward, and B. H. Anderton, "Identification of phosphorylation sites on neurofilament proteins by nanoelectrospray mass spectrometry," Journal of Biological Chemistry, vol. 272, no. 20, pp. 12922-12927, 1997.
[33] G. Manning, D. B. Whyte, R. Martinez, T. Hunter, and S. Sudarsanam, "The protein kinase complement of the human genome," Science, vol. 298, no. 5600, pp. 1912, 2002.
[34] Y. Xue, A. Li, L. Wang, H. Feng, and X. Yao, "Ppsp: prediction of pk-specific phosphorylation site with bayesian decision theory," BMC bioinformatics, vol. 7, no. 1, p. 163, 2006.
[35] L. Steinke and R. G. Cook, "Identification of phosphorylation sites by edman degradation," in Protein Sequencing Protocols. Springer, 2003, pp. 301.
[36] X. Zhang, C. J. Herring, P. R. Romano, J. Szczepanowska, H. Brzeska, A. G. Hinnebusch, and J. Qin, "Identification of phosphorylation sites in proteins separated by polyacrylamide gel electrophoresis," Analytical chemistry, vol. 70, no. 10, pp. 2050, 1998.
[37] L. Wei, P. Xing, J. Tang, and Q. Zou, "Phospred-rf: a novel sequence-based predictor for phosphorylation sites using sequential information only," IEEE transactions on nanobioscience, vol. 16, no. 4, pp. 2407, 2017.


[38] J. Gao, J. J. Thelen, A. K. Dunker, and D. Xu, "Musite: a tool for global prediction of general and kinase-specific phosphorylation sites," Molecular & Cellular Proteomics, pp. mcpM110, 2010.
[39] J. Song, H. Wang, J. Wang, A. Leier, T. Marquez-Lago, B. Yang, Z. Zhang, T. Akutsu, G. I. Webb, and R. J. Daly, "Phosphopredict: A bioinformatics tool for prediction of human kinase-specific phosphorylation substrates and sites by integrating heterogeneous feature selection," Scientific Reports, vol. 7, no. 1, p. 6862, 2017.
[40] H. D. Ismail, A. Jones, J. H. Kim, R. H. Newman, and D. B. Kc, "Rf-phos: a novel general phosphorylation site prediction tool based on random forest," BioMed research international, vol. 2016, 2016.
[41] Y. Xu, J. Song, C. Wilson, and J. C. Whisstock, "Phoscontext2vec: a distributed representation of residue-level sequence contexts and its application to general and kinase-specific phosphorylation site prediction," Scientific reports, vol. 8, 2018.
[42] D. Wang, S. Zeng, C. Xu, W. Qiu, Y. Liang, T. Joshi, and D. Xu, "Musitedeep: a deep-learning framework for general and kinase-specific phosphorylation site prediction," Bioinformatics, vol. 33, no. 24, pp. 3909-3916, 2017.
[43] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, "Gapped blast and psi-blast: a new generation of protein database search programs," Nucleic acids research, vol. 25, no. 17, pp. 3389, 1997.
[44] S. F. Altschul, J. C. Wootton, E. M. Gertz, R. Agarwala, A. Morgulis, A. A. Schäffer, and Y.-K. Yu, "Protein database searches using compositionally adjusted substitution matrices," The FEBS journal, vol. 272, no. 20, pp. 5101, 2005.
[45] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison, Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge university press, 1998.
[46] S. Henikoff and J. G. Henikoff, "Amino acid substitution matrices from protein blocks," Proceedings of the National Academy of Sciences, vol. 89, no. 22, pp. 10915-10919, 1992.
[47] J. J. Hopfield, "Artificial neural networks," IEEE Circuits and Devices Magazine, vol. 4, no. 5, pp. 3, 1988.
[48] J. G. Daugman, "Complete discrete 2-d gabor transforms by neural networks for image analysis and compression," IEEE Transactions on acoustics, speech, and signal processing, vol. 36, no. 7, pp. 1169, 1988.
[49] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in neural information processing systems, 2012, pp. 1097.
[50] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "Imagenet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211, 2015.
[51] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning. MIT press Cambridge, 2016, vol. 1.
[52] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1.
[53] M. A. Nielsen, Neural networks and deep learning. Determination press USA, 2015, vol. 25.
[54] F.-F. Li, A. Karpathy, and J. Johnson, "Cs231n: Convolutional neural networks for visual recognition," University Lecture, 2015.

PAGE 86

[55]R.Sanchez-Garcia,C.O.S.Sorzano,J.M.Carazo,andJ.Segura,dcons-db:Adatabaseof position-specicscoringmatricesinproteinstructures, Molecules ,vol.22,no.12,p.2230,2017. [56]R.Apweiler,A.Bairoch,C.H.Wu,W.C.Barker,B.Boeckmann,S.Ferro,E.Gasteiger,H.Huang, R.Lopez,M.Magrane etal. ,Uniprot:theuniversalproteinknowledgebase, Nucleicacidsresearch , vol.32,no.suppl_1,pp.D115D119,2004. [57]H.M.Berman,T.Battistuz,T.N.Bhat,W.F.Bluhm,P.E.Bourne,K.Burkhardt,Z.Feng, G.L.Gilliland,L.Iype,S.Jain etal. ,Theproteindatabank, ActaCrystallographicaSectionD: BiologicalCrystallography ,vol.58,no.6,pp.899,2002. [58]H.Dinkel,C.Chica,A.Via,C.M.Gould,L.J.Jensen,T.J.Gibson,andF.Diella,Phospho.elm: adatabaseofphosphorylationsitesupdate2011, Nucleicacidsresearch ,vol.39,no.suppl_1, pp.D261D267,2010. [59]phpMyAdmincontributors.phpmyadminbringingmysqltotheweb.[Online].Available: https://www.phpmyadmin.net/ [60]A.R.SikderandA.Y.Zomaya,Analysisofproteinphosphorylationsitepredictorswithanindependentdataset, Internationaljournalofbioinformaticsresearchandapplications ,vol.5,no.1, pp.20,2009. [61]J.C.Obenauer,L.C.Cantley,andM.B.Yae,Scansite2.0:Proteome-widepredictionofcell signalinginteractionsusingshortsequencemotifs, Nucleicacidsresearch ,vol.31,no.13,pp. 3635,2003. [62]N.Blom,T.Sicheritz-Pontn,R.Gupta,S.Gammeltoft,andS.Brunak,Predictionofposttranslationalglycosylationandphosphorylationofproteinsfromtheaminoacidsequence, Proteomics ,vol.4,no.6,pp.1633,2004. [63]L.M.Iakoucheva,P.Radivojac,C.J.Brown,T.R.O'Connor,J.G.Sikes,Z.Obradovic,andA.K. Dunker,Theimportanceofintrinsicdisorderforproteinphosphorylation, Nucleicacidsresearch , vol.32,no.3,pp.1037,2004. [64]S.R.Eddy,Prolehiddenmarkovmodels. BioinformaticsOxford,England ,vol.14,no.9,pp. 755,1998. [65]F.Diella,S.Cameron,C.Gemnd,R.Linding,A.Via,B.Kuster,T.Sicheritz-Pontn,N.Blom, andT.J.Gibson,Phospho.elm:adatabaseofexperimentallyveriedphosphorylationsitesin eukaryoticproteins, BMCbioinformatics ,vol.5,no.1,p.79,2004. 
[66]S.BasuandD.Plewczynski,Ams3.0:predictionofpost-translationalmodications, BMCbioinformatics ,vol.11,no.1,p.210,2010. [67]Y.Xue,J.Ren,X.Gao,C.Jin,L.Wen,andX.Yao,Gps2.0,atooltopredictkinase-specic phosphorylationsitesinhierarchy, Molecular&cellularproteomics ,vol.7,no.9,pp.1598, 2008. [68]Y.-H.Wong,T.-Y.Lee,H.-K.Liang,C.-M.Huang,T.-Y.Wang,Y.-H.Yang,C.-H.Chu,H.-D. Huang,M.-T.Ko,andJ.-K.Hwang,Kinasephos2.0:awebserverforidentifyingproteinkinasespecicphosphorylationsitesbasedonsequencesandcouplingpatterns, Nucleicacidsresearch , vol.35,no.suppl_2,pp.W588W594,2007. [69]F.Gnad,S.Ren,J.Cox,J.V.Olsen,B.Macek,M.Oroshi,andM.Mann,Phosidaphosphorylationsitedatabase:management,structuralandevolutionaryinvestigation,andpredictionof phosphosites, Genomebiology ,vol.8,no.11,p.R250,2007. [70]M.Kakuta,S.Nakamura,andK.Shimizu,Predictionofprotein-proteininteractionsitesusing onlysequenceinformationandusingbothsequenceandstructuralinformation, Informationand MediaTechnologies ,vol.3,no.2,pp.351,2008. 75

PAGE 87

[71]S.AhmadandA.Sarai,Pssm-basedpredictionofdnabindingsitesinproteins, BMCbioinformatics ,vol.6,no.1,p.33,2005. [72]J.Zhou,H.Wang,Z.Zhao,R.Xu,andQ.Lu,Cnnh_pss:protein8-classsecondarystructure predictionbyconvolutionalneuralnetworkwithhighway, BMCbioinformatics ,vol.19,no.4, p.60,2018. [73]J.-T.Du,Y.-M.Li,W.Wei,G.-S.Wu,Y.-F.Zhao,K.Kanazawa,T.Nemoto,andH.Nakanishi, Low-barrierhydrogenbondbetweenphosphateandtheamidegroupinphosphopeptide, Journal oftheAmericanChemicalSociety ,vol.127,no.47,pp.16350351,2005. [74]M.AudagnottoandM.DalPeraro,Proteinpost-translationalmodications:Insilicoprediction toolsandmolecularmodeling, Computationalandstructuralbiotechnologyjournal ,vol.15,pp. 307,2017. [75]S.Marsland, Machinelearning:analgorithmicperspective .ChapmanandHall/CRC,2011. [76]X.Wang,M.L.Xu,B.Q.Li,H.L.Zhai,J.J.Liu,andS.Y.Li,Predictionofphosphorylationsites basedonkrawtchoukimagemoments, Proteins:Structure,Function,andBioinformatics ,vol.85, no.12,pp.22312238,2017. [77]X.Xu,A.Li,andM.Wang,Predictionofhumandisease-associatedphosphorylationsiteswith combinedfeatureselectionapproachandsupportvectormachine, IETsystemsbiology ,vol.9,no.4, pp.155,2015. [78]Q.Chen,Y.Wang,B.Chen,C.Zhang,L.Wang,andJ.Li,Usingpropensityscorestopredictthe kinasesofunannotatedphosphopeptides, Knowledge-BasedSystems ,vol.135,pp.60,2017. [79]Y.Dou,B.Yao,andC.Zhang,Predictionofproteinphosphorylationsitesbyintegratingsecondary structureinformationandotherone-dimensionalstructuralproperties,in PredictionofProtein SecondaryStructure .Springer,2017,pp.265. [80]H.Mohabatkar,P.Rabiei,andM.Alamdaran,Newachievementsinbioinformaticspredictionof posttranslationalmodicationofproteins. Currenttopicsinmedicinalchemistry ,vol.17,no.21, pp.2381,2017. [81]B.Narins, TheGaleEncyclopediaofNursingandAlliedHealth .Gale,CengageLearning,2013. [82]S.F.Altschul,W.Gish,W.Miller,E.W.Myers,andD.J.Lipman,Basiclocalalignmentsearch tool, Journalofmolecularbiology ,vol.215,no.3,pp.403,1990. 
[83]H.-D.Huang,T.-Y.Lee,S.-W.Tzeng,andJ.-T.Horng,Kinasephos:awebtoolforidentifying proteinkinase-specicphosphorylationsites, Nucleicacidsresearch ,vol.33,no.suppl_2,pp. W226W229,2005. 76

PAGE 88

APPENDIX A

TABLE AND ABBREVIATIONS

A.1 Twenty Amino Acids

Amino acid       Three-letter code   One-letter code
Alanine          Ala                 A
Arginine         Arg                 R
Asparagine       Asn                 N
Aspartic acid    Asp                 D
Cysteine         Cys                 C
Glutamic acid    Glu                 E
Glutamine        Gln                 Q
Glycine          Gly                 G
Histidine        His                 H
Isoleucine       Ile                 I
Leucine          Leu                 L
Lysine           Lys                 K
Methionine       Met                 M
Phenylalanine    Phe                 F
Proline          Pro                 P
Serine           Ser                 S
Threonine        Thr                 T
Tryptophan       Trp                 W
Tyrosine         Tyr                 Y
Valine           Val                 V

Table A.1: The twenty amino acids used by the human body, each with its own function. Every protein is built from a combination of these twenty amino acids.

A.2 Abbreviations

ADP          Adenosine Diphosphate.

ATP          Adenosine Triphosphate.
BLAST        Basic Local Alignment Search Tool.
BLOSUM       BLOcks of Amino Acid SUbstitution Matrix.
BLASTP       Protein-protein BLAST.
CNN          Convolutional Neural Network.
cAMP         Cyclic Adenosine Monophosphate.
DNA          Deoxyribonucleic Acid.
EBI          European Bioinformatics Institute.
HSP          High-Scoring Segment Pair.
mRNA         Messenger RNA.
MSA          Multiple Sequence Alignment.
PSSM         Position-Specific Scoring Matrix.
PSI-BLAST    Position-Specific Iterated BLAST.
PTM          Post-Translational Modification.
PDB          Protein Data Bank.
RNA          Ribonucleic Acid.
RCSB PDB     The Research Collaboratory for Structural Bioinformatics Protein Data Bank.
S            Serine.
SVM          Support Vector Machines.
SIB          Swiss Institute of Bioinformatics.
PIR          Protein Information Resource.
T            Threonine.
Y            Tyrosine.
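
The codes in Table A.1 and the residue abbreviations S, T and Y can be read as a small lookup structure. The following Python sketch is illustrative only and is not taken from the thesis implementation: the names THREE_TO_ONE, CANDIDATE_RESIDUES, to_one_letter and candidate_sites are hypothetical, the mapping lists the twenty standard residues, and treating every S, T or Y as a candidate site is an assumption made for the example.

```python
# Illustrative sketch only; all names here are hypothetical, not from the thesis.

# Three-letter to one-letter codes for the twenty standard amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Glu": "E", "Gln": "Q", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Assumption: candidate residues are the ones abbreviated S, T and Y
# (serine, threonine, tyrosine) in the abbreviation list.
CANDIDATE_RESIDUES = frozenset("STY")


def to_one_letter(three_letter_codes):
    """Convert a list such as ["Met", "Ser"] into the string "MS"."""
    return "".join(THREE_TO_ONE[code] for code in three_letter_codes)


def candidate_sites(sequence):
    """Return (1-based position, residue) pairs for every S, T or Y."""
    return [
        (position, residue)
        for position, residue in enumerate(sequence, start=1)
        if residue in CANDIDATE_RESIDUES
    ]


if __name__ == "__main__":
    peptide = to_one_letter(["Met", "Ser", "Ala", "Thr", "Gly", "Tyr"])
    print(peptide)                   # MSATGY
    print(candidate_sites(peptide))  # [(2, 'S'), (4, 'T'), (6, 'Y')]
```

Positions are reported 1-based because residue numbering in protein sequences conventionally starts at 1. An unknown three-letter code raises a KeyError rather than being skipped silently.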