Citation
Using deep learning on EHR data to predict diabetes

Material Information

Title:
Using deep learning on EHR data to predict diabetes
Creator:
Garske, Thomas
Place of Publication:
Denver, CO
Publisher:
University of Colorado Denver
Publication Date:
2018

Thesis/Dissertation Information

Degree:
Master's (Master of Science)
Degree Grantor:
University of Colorado Denver
Degree Divisions:
Department of Electrical Engineering, CU Denver
Degree Disciplines:
Electrical engineering
Committee Chair:
Connors, Dan
Committee Members:
Liu, Chao

Record Information

Source Institution:
Auraria Library
Holding Location:
Auraria Library
Rights Management:
Copyright Thomas Garske. Permission granted to University of Colorado Denver to digitize and display this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.

Full Text

PAGE 1

USING DEEP LEARNING ON EHR DATA TO PREDICT DIABETES
by
THOMAS GARSKE
B.S., Electrical Engineering, University of St. Thomas, 2010
B.A., Physics, University of St. Thomas, 2010

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Master of Science
Electrical Engineering
2018

PAGE 2

This thesis for the Master of Science degree by Thomas Garske has been approved for the Electrical Engineering Program by
Dan Connors, Advisor
Chao Liu
Tim Lei
September 10, 2018

PAGE 3

Garske, Thomas (M.S., Electrical Engineering Program)
Using Deep Learning on EHR Data to Predict Diabetes
Thesis directed by Professor Dan Connors

ABSTRACT

Type 2 Diabetes Mellitus (T2DM) is a leading cause of death and disability in the United States with high prevalence, growing incidence, and serious adverse outcomes. A major challenge of addressing chronic health conditions such as T2DM is that diagnosis and treatment are often not established until after the disease has progressed and patients are already suffering. Traditional strategies to predict future risk of diabetes have generally used demographic and clinical data from prospective cohort studies and statistical models and risk scores. Such strategies appear to perform moderately well, but, in general, have limited success when applied to different patient populations [13].

Advancements in machine learning in medicine have recently made headlines in the area of disease prediction and care prescription. Modeling and predictive techniques work best when varieties of health data can be compiled and trained on. The generation of electronic record keeping has greatly impacted the ability for new machine learning techniques to be applied to healthcare data. This ability was accelerated by the Affordable Care Act, which incentivized the proper use and adoption of electronic health records (EHRs) [17]. Combining machine learning techniques with EHR data provides a unique opportunity to improve care management at the population level by predicting disease onset and progression. When care management can assess a patient's current status, predict future trends, and assign the most prescriptive actions, patient outcomes can be improved while simultaneously lowering healthcare costs.

Deep Learning is a subfield of the broader field of machine learning. Most deep learning methods learn through an artificial neural network, otherwise known as a Deep Neural Network (DNN). Neural networks are made up of several layers of connected processors called neurons. In deep learning, each layer transforms its input into a slightly more abstract

PAGE 4

representation, and learning is about accurately assigning credit across many such layers [28]. Other deep learning models to forecast diabetes onset in future years were largely limited by small population sizes and reached limited performance, such as 73% accuracy [24].

The objective of this thesis is to create a diabetes machine learning dataset from EHR data, and develop an optimized deep learning model to identify patients at risk of receiving a new diagnosis of T2DM during the following 24 month time period for a general population. A diabetes prediction model was successfully developed using a DNNClassifier from TensorFlow and optimized by developing a framework for parameter search over different optimization functions, activation functions, sampling methods, and scaling methods. The resulting diabetes machine learning dataset contains a feature set of 22 attributes for 149,050 patients and related class labels for a diabetes diagnosis within 6, 12, 18, and 24 month time periods. The best model for prediction of diabetes had 83% true positive and 88% true negative accuracy scores for prediction within 24 months. Early identification of the risk of T2DM in these patients provides opportunities for care intervention to promote better long term outcomes.

The form and content of this abstract are approved. I recommend its publication.
Approved: Dan Connors

PAGE 5

TABLE OF CONTENTS

CHAPTER
I. INTRODUCTION
II. BACKGROUND
    Type 2 Diabetes Mellitus
        T2DM Complications and Risk Factors
    Machine Learning
        Supervised, Semi-supervised, and Unsupervised Learning
        Deep Learning
    Prediction Model Optimization Parameters
        Optimization Functions
        Activation Functions
        Sampling Methods
        Scaling Methods
    Machine Learning Dataset
III. APPROACH
    Running the experiments
IV. EXPERIMENTAL RESULTS
    Machine Learning Diabetes Dataset
    Diabetes Prediction Model
        Model Validation
        Model Feature Breakdown
    Model Optimization Parameter Search
V. CONCLUSION
REFERENCES

PAGE 6

CHAPTER I
INTRODUCTION

Type 2 Diabetes Mellitus (T2DM) is a serious public health issue in the United States. Incidence of T2DM is growing at an unprecedented rate, and previous estimates of the growth of the disease have fallen drastically short. In 2005, it was estimated that the global number of adults with T2DM would grow to 380 million by 2025 [15]; however, that prediction had been surpassed before 2015 [8]. It is now estimated that there are 425 million adults living with diabetes globally. The U.S. prevalence of diabetes among adults who are 20 years or older is now over 30 million.8%, with an additional estimated 11.5 million adults with undiagnosed diabetes [9]. Diabetes is the sixth overall leading cause of death in the United States with nearly 80,000 deaths in 2015 [5]. Often, T2DM is asymptomatic in early stages of the disease, and can remain undiagnosed for years. Diabetes management is most successful in reducing the burden of diabetes and related complications through early diagnosis and intervention [2].

Advancements in machine learning in medicine have recently made headlines in the area of disease prediction and care prescription. Modeling and predictive techniques work best when varieties of health data can be compiled and trained on. The generation of electronic record keeping has greatly impacted the ability for new machine learning techniques to be applied to healthcare data. This ability was accelerated by the Affordable Care Act, which incentivized the proper use and adoption of electronic health records (EHRs) [17]. The medical records in EHRs are often extensive and contain detailed clinical information from outpatient visits, including data for allergies, appointments, medications, lab results, patient demographics, diagnoses, vaccines, and vitals. Machine learning tools and their optimized deployment are valuable in analyzing large datasets and can be applied to patient EHR data, providing a unique opportunity to improve care management at the population level through insight into disease onset and progression within various patient cohorts. When care management can assess a patient's current status, predict future trends, and assign the

PAGE 7

most prescriptive actions, patient outcomes can be improved while simultaneously lowering healthcare costs.

Traditional strategies to predict future risk of diabetes have generally used demographic and clinical data from prospective cohort studies, and statistical models or risk scores. In general, such strategies appear to perform well but have limited success when applied to different patient populations [13]. Prior diabetes prediction studies have demonstrated feasibility in using EHR data for prediction of T2DM in patients, but the results were limited by the number of patients available for training [19]. With access to Allscripts' de-identified data lake containing over 50 million patient lives, this research has the unique opportunity to create a diabetes machine learning dataset with a large and diverse patient cohort.

The work of this thesis was to evaluate an EHR based deep learning model to predict T2DM diagnosis in a general population. The objective is to deliver a diabetes risk forecasting model using EHRs for prediction of T2DM to allow early intervention and treatment. This work expands diabetes prediction models by increasing the number of patients and utilizing a much broader patient dataset than was previously available to researchers. The contributions of this work include:

1. Developing a validated deep learning model to predict the timeline of a new diagnosis of T2DM within the following 6, 12, 18, and 24 month time periods from 12 months of a patient's clinical EHR history. The model was developed by training a deep learning artificial neural network using statistical learning on features selected through domain knowledge of common risk factors for T2DM.

2. Building a framework for development and evaluation of optimized classifiers using a parameter search over selected optimization functions, activation functions, scaling methods, and sampling methods. This framework was made to be general purpose and can be used on any deep neural net binary classification problem. The thesis also reports insights into the parameters explored by this framework. For example, for the dataset generated in this research, the effects

PAGE 8

of dataset imbalance are apparent when comparing sampling methods. Additionally, some optimization functions are more likely to get stuck on an incorrect solution.

3. Generating a diabetes machine learning dataset that includes relevant EHR data across a general distribution of the United States. This expands the potential usability of this prediction model as a clinical tool by increasing the number of patients and utilizing a much broader patient dataset than was previously available to researchers.

This thesis is organized as follows: Chapter II covers the background of T2DM, EHR data, development of machine learning models, and the optimization of those models. Chapter III examines the approach to the diabetes machine learning dataset creation, prediction model development, and model parameter optimization. The experimental results section, Chapter IV, shows details and specifics on the resulting machine learning dataset, performance data for the prediction of various diagnosis timelines, validation of those prediction models, and the results of the model parameter optimization search. Finally, Chapter V will summarize these results and discuss potential further experiments.

PAGE 9

CHAPTER II
BACKGROUND

Type 2 Diabetes Mellitus

Diabetes is a disease that impairs the body's ability to respond to insulin, a hormone needed to allow glucose to enter cells, resulting in higher levels of glucose in the blood. While Type 2 Diabetes Mellitus (T2DM) is the most common form of diabetes, there are other chronic diabetic conditions including prediabetes, Type 1 Diabetes Mellitus (T1DM), and gestational diabetes [22]. In prediabetes and T2DM, the pancreas is increasingly unable to produce enough insulin as cells gradually become resistant to the effects of insulin. T1DM is a condition in which the body cannot effectively regulate blood sugar as the pancreas produces little to no insulin. While the cause is unknown, T1DM is believed to be genetic. Some women develop gestational diabetes during pregnancy, but it usually resolves after delivery. Individuals who have had gestational diabetes are at higher risk of developing T2DM, and sometimes gestational diabetes diagnosed during pregnancy is actually T2DM [4].

T2DM Complications and Risk Factors

Long-term complications of T2DM develop gradually, but those complications can be disabling or life-threatening. Potential major complications include cardiovascular disease, neuropathy, nephropathy, retinopathy, foot or leg amputation, bacterial and fungal infections, hearing impairment, dementia, and depression. The risk of complications increases with poor blood sugar control and the length of time with T2DM [21]. It is not fully understood what makes some individuals more susceptible to developing T2DM. However, evidence indicates that certain factors increase the risk of developing the disease, including:

Obesity. High amounts of body fat can cause the endoplasmic reticulum (ER), a system of cellular membranes, to suppress the signals of insulin receptors and lead to insulin resistance in cells [23].

Family history. Individuals with first-degree relatives who have T2DM are at approximately 3.5 times as much risk of T2DM compared to the general population [25].

PAGE 10

Multiple large scale population studies have demonstrated the existence of genetic susceptibility variants linked to T2DM risk [31, 27, 33].

Race. Certain racial and ethnic backgrounds have been shown to contribute to variations in glycemic control and overall risk of developing T2DM for both adult and youth populations [11, 26].

Age. While incidence of T2DM in youth is rising at an alarming rate, diabetes is more prevalent in adult and aging populations. Much of this can be attributed to compounding years of poor diet and insufficient physical activity leading to chronic insulin resistance [12].

Gender. The risk of developing prediabetes and T2DM is increased for women who were previously diagnosed with gestational diabetes [4]. Women who have given birth to babies weighing outside of the normal weight distribution (both low weight and high weight) exhibit a higher risk of developing T2DM [10]. Having polycystic ovary syndrome increases the risk of T2DM [18].

High blood pressure. High blood pressure, with readings over 140/90 millimeters of mercury (mmHg), is highly common among individuals with T2DM.

Abnormal cholesterol. Abnormal cholesterol and triglyceride levels often accompany T2DM. Risk increases with high levels of triglycerides and low levels of high-density lipoprotein (HDL) [3].

Machine Learning

Machine learning is the process of equipping computers with the ability to learn without explicit programming through computer programs that can change or adapt when exposed to new data. There are four general categories of machine learning.

Classification, predicting a category or label. Classification involves training on observations to create a model that will attempt to predict whether an observation belongs

PAGE 11

to a given category. For example, classifying an image as a "cat" or "dog". A model would train on labeled images of dogs and cats, and predict the similarity to a dog or a cat for a previously unseen photo.

Regression, predicting a value. Regression problems try to predict the value of a continuous variable based on previous information. Regression is similar to classification; however, the goal is to find a real value rather than the category of an observation.

Clustering, finding the hidden structure of data. Clustering is a learning method that attempts to predict where groups of similar observations are located. This method attempts to group a set of objects by similarity so that objects in one group are more similar to each other than to objects in another group.

Dimensionality reduction, obtaining a set of principal variables. Dimensionality reduction is the process of reducing dimensionality by obtaining a set of principal variables. Removing data that is uninformative can allow a model to find more general classification regions and yield better performance [32].

Each of these categories can be described by three fundamental components: representation, which is how the data is structured; evaluation, or scoring of hypotheses; and optimization, which is the process of generating models. All machine learning methods are a combination of these elements, and these components represent the framework to understand any machine learning algorithm [14].

Supervised, Semi-supervised, and Unsupervised Learning

In supervised learning, it is important that labels are available to the algorithm during the training of the function so the labels of actual values can be compared to the predicted labels to evaluate how the trained model is performing. Labeling is often a time intensive and manual process.

Techniques that do not require labeled data, such as clustering, are known as unsupervised learning methods. Evaluating the performance of unsupervised learning models is

PAGE 12

more complicated than for supervised learning models because there are no true labels to test against.

Semi-supervised learning techniques overlap between supervised and unsupervised learning methods and could include both labeled and unlabeled observations. For example, clustering can be used to group the data by similarity. Labels can then be assigned to the unlabeled observations by combining information about the labels and information about the clusters. This process would provide more labeled information for a supervised learning method [29].

Deep Learning

Deep Learning is a subfield of the broader field of machine learning in which most methods learn through a Deep Neural Network (DNN). Neural networks are made up of several layers of connected processors called neurons. The output of an active neuron is determined by its activation function. Input neurons are activated by the input feature set, while hidden layer neurons are activated through weighted connections to previously active neurons. In deep learning, each layer transforms its input into a more abstract representation, and the goal of the neural net is to accurately transform the data across many such layers [28].

TensorFlow is an open source library developed by researchers from the Google Brain team for high performance computation. TensorFlow provides support for both machine learning and deep learning, and is used across many scientific domains [1]. The model in this study was built using TensorFlow's Deep Neural Network Classifier (DNNClassifier), a pre-built deep learning estimator.

Prediction Model Optimization Parameters

Optimization Functions

Optimization algorithms are what the model uses to minimize or maximize an Error function: a function used in computing the target predictions. This study evaluated the following optimization functions as a part of the DNNClassifier parameter optimization search

PAGE 13

framework: AdagradOptimizer, GradientDescentOptimizer, AdadeltaOptimizer, MomentumOptimizer, AdamOptimizer, ProximalGradientDescentOptimizer, ProximalAdagradOptimizer, and RMSPropOptimizer.

Adaptive optimizer functions can change the learning rate, a hyper-parameter that controls how much to adjust the weights of the network with respect to the loss gradient, while training. Momentum allows the optimization function to remember the change of the previous step for each iteration, and produce the next update through a combination of the gradient and that last change. Proximal optimizer functions use a generalized form of projection to consider proximity of the features when adjusting weights.

Activation Functions

In neural networks, the activation function of a node, sometimes referred to as the transfer function, returns the output value of that node given a set of inputs. In the simplest case this function determines whether the output of a node is a 1 or a 0; the functions evaluated here return continuous values. Activation functions evaluated in this thesis: Rectified Linear Unit (Relu), the default setting for the DNNClassifier and the most commonly used function; Exponential Linear Unit (elu); Scaled Exponential Linear Unit (Selu); Softplus, a smooth approximation of Relu; tanh; and softsign, a smooth approximation of tanh.

Sampling Methods

Oversampling and undersampling are roughly equivalent and opposite techniques in data analysis that adjust the distribution of classes for a dataset. Combined samplers perform a combination of under- and over-sampling, offering the best performance, but take a longer time to process the data. The sampling methods evaluated in this project, from the scikit-learn-compatible imbalanced-learn library, are: RandomOverSampler, which randomly duplicates members from the under-represented class; RandomUnderSampler, which randomly drops members from the over-represented class; and Synthetic Minority Over-sampling Technique (SMOTE), which creates new records that represent minority groups in the under-represented class and attempts to fill in gaps [6].
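The three sampling ideas above can be illustrated with a minimal sketch in plain Python. This is not the imbalanced-learn implementation used in the thesis: the function names `random_over_sample`, `random_under_sample`, and `smote_like` are hypothetical, the toy data is invented, and `smote_like` is a simplified stand-in that interpolates toward the single nearest minority neighbor rather than one of k neighbors.

```python
import random
from collections import Counter


def random_over_sample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def random_under_sample(rows, labels, seed=0):
    """Randomly keep only as many rows per class as the smallest class has."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


def smote_like(minority_rows, n_new, seed=0):
    """Simplified SMOTE-style synthesis: interpolate between a minority row
    and its nearest other minority row (real SMOTE picks among k neighbors)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority_rows)
        b = min((r for r in minority_rows if r is not a),
                key=lambda r: sum((x - y) ** 2 for x, y in zip(a, r)))
        t = rng.random()
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


# Invented toy data: 8 non-diabetic (0) and 2 diabetic (1) observations.
rows = [[float(i)] for i in range(10)]
labels = [0] * 8 + [1] * 2

over_rows, over_labels = random_over_sample(rows, labels)
under_rows, under_labels = random_under_sample(rows, labels)
minority = [r for r, y in zip(rows, labels) if y == 1]
synthetic = smote_like(minority, n_new=6)

print(Counter(over_labels), Counter(under_labels), len(synthetic))
```

Oversampling keeps every original observation at the cost of repeated rows, undersampling discards majority observations, and the SMOTE-style step adds new points that lie between existing minority observations.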

PAGE 14

Figure II.1: Illustration of explored sampling methods.

Figure II.1 shows how the different sampling methods modify an example dataset.

Scaling Methods

Scalers are used to normalize the data to reduce the range of feature values and improve performance. The scaling methods used in the parameter optimization are: MinMaxScaler, which rescales each feature to a fixed range (0 to 1 by default in scikit-learn); StandardScaler, which centers each feature on its calculated mean and scales it to unit variance; RobustScaler, which works like StandardScaler but centers on the median and scales by the interquartile range so that outliers have less influence; and MaxAbsScaler, which scales each feature by its maximum absolute value.

Machine Learning Dataset

In healthcare, an Electronic Health Record (EHR) is broadly defined as longitudinal data that represents the full health record of medical services provided to a patient [7]. As EHRs gain traction and the applications become more advanced, the quality of the data improves. For this thesis, the source EHR system for the dataset was a combination of Allscripts Touchworks EHR, a cloud-based solution for midsize to large practices, and

PAGE 15

Allscripts Professional EHR, a solution for small to midsize practices. Both of these EHRs are designed for the Outpatient (a patient who receives medical treatment without being admitted to a hospital) setting, and do not contain Inpatient (a patient who stays in a hospital while under treatment) information.

The Allscripts clinical data warehouse contains de-identified Protected Health Information (PHI) for over 50 million patients over a 10-year period. Table II.1 shows the tables available in the Allscripts clinical data warehouse.

Table II.1: Allscripts' clinical data warehouse tables.

Table Name | Description
Allergies | Allergy information per patient
Appointments | Patient appointments scheduled and attended
Encounters | Any time a record is updated or accessed
Medications | Medication prescriptions and related problems (e.g. back pain)
Orders | Orders submitted for lab results
Patients | Demographic information per patient
Problems | Diagnosis codes (e.g. ICD9 and ICD10 codes for diabetes)
Providers | Healthcare provider (physicians and hospitals) information
Results | Results for labs (HbA1c, cholesterol levels, blood work, etc.)
Vaccines | Vaccine information for patients
Vitals | Vitals records for patients (blood pressure, BMI, etc.)

PAGE 16

CHAPTER III
APPROACH

The Python and SQL scripts to build out the diabetes machine learning dataset, run the parameter search experiments, and validate the diabetes prediction models are in a Jupyter Notebook: an interactive document format for publishing code, results, and explanations [16]. This experiment requires a local virtual environment; a bash setup script creates the environment, installs the pip modules, and starts the Jupyter notebook server. The first step of the Jupyter notebook imports the relevant Python modules and IPython extensions and sets the experimental parameters.

The first step in building a classifier is to perform feature engineering, the process of using algorithms or domain knowledge to create features, on the dataset. Deep Neural Networks (DNNs) can only operate on float values because every node performs multiplication and addition operations. As a result, even if the source data is categorical by nature, a machine learning model represents all features as numbers. If a parameter is categorical, it can be converted to indicator column features through generation of dummy variables. Continuous parameters can be normalized and scaled, or simply binned to create indicator columns.

Figure III.1 shows the process for creating the machine learning dataset and building a diabetes prediction classifier. To create the machine learning dataset, the code connects to the Allscripts' SQL Data Warehouse using the latest Open Database Connectivity (ODBC) driver for SQL Server. Tables containing the serialized data were built for each of the selected features from the patient demographics, problem diagnoses, vitals, and lab results tables. This data was aggregated and joined into a flat row-per-patient table and exported to an external file using a pyodbc connector. A data preparation class loads the data file, excludes any unknown or null values, and outputs X and Y tensors and the resulting datasets.

DNNClassifiers expect data to be in X and Y tensors of floats, where X represents the feature set of predictors and Y represents the target class information to be predicted. After it has been split into X and Y tensors, the data is further divided into Train, Test

PAGE 17

Figure III.1: Building a diabetes machine learning dataset and classifier.

and Validation datasets. A common approach is to use 70% of the dataset for training and reserve the remainder for testing and validation. It is critical to split the dataset before fitting scalers, samplers, or classifiers to ensure that the model does not train on any calculated values of mean or standard deviation from the test data. At this stage the dataset is sampled to address any class imbalance, and scaled to improve the performance of training the model.

The DNNClassifier is built using the selected model parameters, and trained on the training dataset. A validation monitor looks to maximize the precision-recall value and minimize the loss during training by evaluating model performance on the validation dataset. Once those values stop changing significantly, the monitor stops the model from training further. This validation monitor is critical in maximizing the efficiency of the parameter optimization search framework. The model is then evaluated using the testing dataset by

PAGE 18

comparing the predicted labels to the actual labels through a variety of methods including accuracy, precision-recall, and the confusion matrix.

Running the experiments

A helper class was developed to iterate through all selected sampling methods, scaling methods, optimizer functions, and activation functions. A function builds, trains, and evaluates a diabetes prediction model for each configuration. As each model is evaluated, a function captures the resulting accuracy and performance metrics and appends the results to a Pandas dataframe in Python. The dataframe is serialized to a text file as each result is appended. This setup allows the experiments to be paused and resumed, a beneficial feature as the full search can take days to complete.

These results are plotted using matplotlib to display relevant trends and to allow conclusions to be drawn from the different aspects of the parameter optimization search. After the best inputs have been found, the script uses the suggested parameters to train and optimize a diabetes prediction model. This selected model is evaluated further by comparing the results with a traditional diabetes risk score method.
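The pause-and-resume behavior of the search can be sketched as follows. This is an illustration under stated assumptions, not the thesis code: it uses the standard `csv` module in place of a Pandas dataframe, `run_search` and `fake_evaluate` are hypothetical names, the short parameter labels follow the spellings in Table IV.2 where they appear there and are guesses otherwise, and the fourth sampling option (`"none"`) is assumed only because four options per sampler and scaler, eight optimizers, and six activation functions give the 768 configurations reported in Chapter IV.

```python
import csv
import itertools
import os
import tempfile

# Parameter labels; "none" as a sampler is an assumption (see lead-in).
SAMPLERS = ["none", "naiveover", "naiveunder", "smote"]
SCALERS = ["minmax", "standard", "robust", "maxabs"]
OPTIMIZERS = ["adagrad", "graddesc", "adadelta", "momentum",
              "adam", "proxgraddesc", "proxadagrad", "rmsprop"]
ACTIVATIONS = ["relu", "elu", "selu", "softplus", "tanh", "softsign"]
FIELDS = ["sampler", "scaler", "optimizer", "activation", "accuracy"]


def all_configurations():
    """Every combination of sampler, scaler, optimizer, and activation."""
    return list(itertools.product(SAMPLERS, SCALERS, OPTIMIZERS, ACTIVATIONS))


def completed(path):
    """Configurations already recorded in the results file, if it exists."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="") as f:
        return {(r["sampler"], r["scaler"], r["optimizer"], r["activation"])
                for r in csv.DictReader(f)}


def run_search(path, evaluate, limit=None):
    """Evaluate each remaining configuration, appending one row per result."""
    done = completed(path)
    write_header = not os.path.exists(path)
    ran = 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        for config in all_configurations():
            if config in done:
                continue  # skip work finished in an earlier session
            if limit is not None and ran >= limit:
                break
            row = dict(zip(FIELDS, config))
            row["accuracy"] = evaluate(*config)
            writer.writerow(row)
            f.flush()  # persist each result so an interruption loses nothing
            ran += 1
    return ran


def fake_evaluate(sampler, scaler, optimizer, activation):
    """Placeholder for building, training, and scoring a model."""
    return 0.5


results_path = os.path.join(tempfile.mkdtemp(), "results.csv")
first = run_search(results_path, fake_evaluate, limit=100)  # simulated interruption
second = run_search(results_path, fake_evaluate)            # resumes where it stopped
print(len(all_configurations()), first, second)  # prints: 768 100 668
```

Writing each result as soon as it is available means the results file doubles as the checkpoint, which is what makes a multi-day search safe to interrupt.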

PAGE 19

CHAPTER IV
EXPERIMENTAL RESULTS

Machine Learning Diabetes Dataset

Using the Allscripts' Clinical Data Warehouse with EHR history of over 50 million patients, this project was successful in curating a diabetes prediction machine learning dataset consisting of 22 attributes from encounters gathered from June 2015 through June 2016, and diabetes diagnosis target values from June 2016 through June 2018. The target value, a future diagnosis of diabetes, was determined by the presence of International Classification of Diseases, Ninth Revision (ICD-9) and International Classification of Diseases, Tenth Revision (ICD-10) codes. While ICD-10 codes were implemented in October 2015, the codes were not immediately adopted universally, so any study relying on these codes needs to search for diagnoses in both code sets [30].

Table IV.1 shows the SQL tables in the data warehouse and which features were extracted from each. Any patient who had a prior diagnosis of T2DM, gestational diabetes, or personal history of metabolic disease before June 2016 was excluded. Patients who did not have at least one active encounter in all of the prediction time-frames, and at least one new problem diagnosis in any of the time-frames, were excluded due to inactivity in the EHR system. The diabetes machine learning dataset that was created contains features for 149,050 patients. Figure IV.1 shows the diabetes and non-diabetes imbalance for each of the targets for those patients.

PAGE 20

Table IV.1: Diabetes machine learning dataset dimensions.

Table | Dimension | Feature Type
Patient Demographics | Age | Continuous
Patient Demographics | Gender: Male | Indicator
Patient Demographics | Race: White | Indicator
Patient Demographics | Ethnicity: Hispanic | Indicator
Problem Diagnoses | Abnormal Blood Pressure | Indicator
Problem Diagnoses | Family History of Diabetes | Indicator
Lab Results | Minimum Hemoglobin A1c (HbA1c) | Continuous
Lab Results | Maximum Hemoglobin A1c (HbA1c) | Continuous
Lab Results | Average Hemoglobin A1c (HbA1c) | Continuous
Lab Results | Most Recent Hemoglobin A1c (HbA1c) | Continuous
Vitals | Minimum Body Mass Index (BMI) | Continuous
Vitals | Maximum Body Mass Index (BMI) | Continuous
Vitals | Average Body Mass Index (BMI) | Continuous
Vitals | Most Recent Body Mass Index (BMI) | Continuous
Vitals | Minimum Diastolic Blood Pressure | Continuous
Vitals | Maximum Diastolic Blood Pressure | Continuous
Vitals | Average Diastolic Blood Pressure | Continuous
Vitals | Most Recent Diastolic Blood Pressure | Continuous
Vitals | Minimum Systolic Blood Pressure | Continuous
Vitals | Maximum Systolic Blood Pressure | Continuous
Vitals | Average Systolic Blood Pressure | Continuous
Vitals | Most Recent Systolic Blood Pressure | Continuous
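The minimum, maximum, average, and most recent dimensions in Table IV.1 summarize a patient's longitudinal readings as one flat row. A minimal sketch of that aggregation is shown below. The thesis performs this step in SQL against the data warehouse; the record layout, the `summarize` helper, the feature names, and the sample values here are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Invented longitudinal rows: (patient id, ISO date of reading, BMI value).
vitals = [
    ("p1", "2015-07-01", 31.0),
    ("p1", "2016-01-15", 33.0),
    ("p1", "2015-10-10", 32.0),
    ("p2", "2015-09-05", 24.5),
]


def summarize(records, name):
    """Collapse many dated readings per patient into four continuous features."""
    by_patient = defaultdict(list)
    for patient_id, date, value in records:
        by_patient[patient_id].append((date, value))

    features = {}
    for patient_id, readings in by_patient.items():
        readings.sort()  # ISO-formatted dates sort chronologically as strings
        values = [value for _, value in readings]
        features[patient_id] = {
            f"min_{name}": min(values),
            f"max_{name}": max(values),
            f"avg_{name}": mean(values),
            f"recent_{name}": values[-1],  # last reading after the date sort
        }
    return features


bmi_features = summarize(vitals, "bmi")
print(bmi_features["p1"])
```

Summaries of this kind give every patient the same number of columns regardless of how many encounters were recorded, which is what a fixed-width feature tensor requires.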

PAGE 21

Figure IV.1: Diabetes target label distribution.

PAGE 22

Diabetes Prediction Model

Accuracy is the overall percent of predictions that are correct, and while useful, can be misleading when the dataset is imbalanced. Precision, recall, and confusion matrices are all methods that provide insight into model performance for both positive and negative classification and accuracies. In binary classification, precision is the fraction of subjects predicted positive that are truly positive. Recall is the fraction of truly positive subjects that the model correctly predicted as positive. The precision and recall curve of a model provides insight into how well the model performed in predicting a positive value (precision) over the number of test subjects that had a positive value (recall).

Figure IV.2 demonstrates how to read a confusion matrix and the definitions of a True Positive (TP), a True Negative (TN), a False Positive (FP), and a False Negative (FN) when comparing predicted and true labels.

Figure IV.2: Confusion matrix truth table.

The best model for predicting diabetes would be the model with the best precision, and the highest TP and TN accuracies. Figure IV.3 shows the normalized confusion matrices for predicting diabetes within 6, 12, 18, and 24 months. These graphs show that the model performs reasonably well on predicting onset of diabetes for all targets, and the longer time-frames show increasing improvement in true-negative prediction of the absence of diabetes.
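The metrics defined above can be computed directly from the four confusion matrix counts. The sketch below is illustrative only: the helper names and the toy labels are invented, and it is not the evaluation code used in the thesis.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def metrics(y_true, y_pred):
    """Accuracy, precision, recall (TP rate), and TN rate, guarding empty denominators."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "tn_rate": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Invented imbalanced example: 2 positive and 8 negative subjects.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A model that locks onto the negative class still reaches 80% accuracy.
always_negative = metrics(y_true, [0] * 10)

# A model that finds one of the two positives, with one false alarm.
mixed = metrics(y_true, [1, 0, 1, 0, 0, 0, 0, 0, 0, 0])

print(always_negative)
print(mixed)
```

Both toy models score the same accuracy, but only the second has non-zero precision and recall, which is why accuracy alone is a poor guide on imbalanced targets.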

PAGE 23

Figure IV.3: Confusion matrices for predicting diabetes within target time-frames.

Table IV.2 shows the optimized configuration and the performance metrics of the diabetes prediction model for each of the target time-frames. In this research, the best model for each target was developed using a different combination of parameters.

Table IV.2: Model configurations for best model results.

Target | Model Configuration | Prec. | Acc. | TP | TN
6 months | softsign, graddesc, naiveover, robust | 10% | 65% | 80% | 64%
12 months | elu, momentum, naiveover, robust | 23% | 66% | 82% | 64%
18 months | softplus, proxadagrad, smote, standard | 57% | 74% | 84% | 70%
24 months | elu, momentum, smote, robust | 91% | 86% | 83% | 88%
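One way to read the gap between the precision and TP columns in Table IV.2 is through class prevalence: for fixed true positive and true negative rates, precision falls as positives become rarer. The sketch below works through that arithmetic. It assumes the TP and TN columns are per-class rates from the normalized confusion matrices, and the prevalence values are illustrative inputs, not the class distribution of the thesis dataset (Figure IV.1 reports that).

```python
def precision_from_rates(tp_rate, tn_rate, prevalence):
    """Precision implied by per-class rates and the share of positive subjects.

    predicted positives = true positives + false positives
                        = tp_rate * prevalence + (1 - tn_rate) * (1 - prevalence)
    """
    true_pos = tp_rate * prevalence
    false_pos = (1.0 - tn_rate) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Rates taken from the 6 month row of Table IV.2; prevalences are hypothetical.
rare = precision_from_rates(tp_rate=0.80, tn_rate=0.64, prevalence=0.05)
balanced = precision_from_rates(tp_rate=0.80, tn_rate=0.64, prevalence=0.50)

print(round(rare, 3), round(balanced, 3))  # prints: 0.105 0.69
```

With the same per-class rates, a rare positive class yields far lower precision than a balanced one, so low precision on a short time-frame does not by itself mean the classifier separates the classes poorly.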

PAGE 24

Model Validation

The deep learning diabetes prediction model outperformed traditional methods. This model was validated by comparing results to the American Diabetes Association (ADA) Type 2 Diabetes Risk Test [20]. The ADA Risk Test consists of seven metrics based on age, gender, gestational diabetes, family history, high blood pressure, physical activity, and BMI. The research was able to produce a score using 5 of the 7 metrics. Gestational diabetes was excluded due to low diagnosis rates (< 0.1%), and physical inactivity is not available in EHR datasets. A score of 5 or higher from the above metrics is 'high risk', which is considered a 'prediction' of diabetes diagnosis for the purpose of validating this model. Figure IV.4 shows the normalized confusion matrices for predicting diabetes within 6, 12, 18, and 24 months compared with the results of the American Diabetes Association Risk Test. The deep learning diabetes prediction outperformed the risk test in accurately predicting both true positives and true negatives for all time-frames.

Model Feature Breakdown

One of the criticisms of using deep learning prediction models is the lack of transparency into how the model is fitting to the features. In healthcare, physicians need this insight to effectively implement any machine learning model in a clinical setting. As all of the models were trained on the same patient dataset, the 24 month prediction model was selected for further analysis as it had demonstrated the best accuracy and precision in predicting diabetes. Figure IV.5 shows the breakdown for each indicator column for correct and incorrect predictions. Broadly, these graphs show that the model is fitting to the dataset well. They also provide insight into which features may require further exploration, or development of new models specifically tailored to those features, for improved specificity. Figure IV.6 shows the breakdown for each continuous column for correct and incorrect predictions. The model performed better on the higher and lower ranges for HbA1c values, and on higher diastolic and systolic blood pressure values.

PAGE 25

Figure IV.4: Confusion matrices for predicting diabetes within target time-frames.

PAGE 26

Figure IV.5: Indicator column breakdown for correct and incorrect predictions.

PAGE 27

Figure IV.6: Continuous column distribution for correct and incorrect predictions.

PAGE 28

Model Optimization Parameter Search

The final goal of this thesis is the development of a model parameter optimization framework to identify the optimization function, activation function, scaling method, and sampling method for the best model performance. The framework developed as a result of this research provided a strong basis for model optimization by building, training, and evaluating 768 configurations of the DNNClassifier for each of the 4 diabetes target time-frames. The output of this framework is a list of configurations to explore and further improve through learning rate tuning and hidden layer adjustment. While training, each model monitors the loss function and evaluates model accuracy, stopping training when the model is sufficiently fitted to the data. This validation monitor allows the framework to minimize time spent training each model.

Figure IV.7 and Figure IV.8 show the distribution of accuracy and precision for all configurations generated by the model parameter optimization search for each of the target time-frames. These figures demonstrate how dataset imbalance can affect training. As the target classes become less imbalanced, training is less likely to lock onto one class and classify all observations as diabetic or non-diabetic.

Figure IV.9 shows the results of the scaling method model optimization. This analysis is useful in that it shows that, although scaling itself is critical, the choice among the evaluated scaling methods does not negatively impact the training results.

Figure IV.10 shows the results of the sampling method model optimization. As the 6 and 12 month target classes are highly imbalanced, this figure shows that selecting a sampling method is a critical decision for a successful model. Interestingly, SMOTE sampling performed best on the least imbalanced dataset, and underperformed compared to naive methods on strongly imbalanced targets.

Figure IV.11 shows the results of the optimizer function model optimization. For this dataset, the RMSProp and Adam (Adaptive Moment Estimation) optimizers were likely to overadjust

PAGE 29

Figure IV.7: Range of results for each model over the parameter optimization search.
Figure IV.8: All results for each model over the parameter optimization search.

PAGE 30

Figure IV.9: Effect of scaling method parameter on model performance.

and lock onto one of the target classes for all predictions. This insight is useful in successfully training models for this dataset because it allows the researcher to avoid optimizer functions that do not properly fit to the feature set.

Figure IV.12 shows the results of the activation function model optimization. While there is minor variance between methods, activation function had the least impact on model performance.
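The parameter search described in this section can be sketched as a Cartesian product over the four parameter families, combined with a simple validation monitor that stops training once the loss stops improving. This is only a minimal illustration: the option lists, the `patience` and `min_delta` values, and the function names are assumptions made for the example. The thesis does not enumerate the individual values behind its 768 configurations, so the grid below is deliberately smaller.

```python
from itertools import product

# Hypothetical option lists; the thesis does not list its actual choices.
SEARCH_SPACE = {
    "optimizer": ["sgd", "adagrad", "rmsprop", "adam"],
    "activation": ["relu", "tanh", "sigmoid"],
    "scaling": ["standard", "minmax"],
    "sampling": ["none", "random_over", "random_under", "smote"],
}


def enumerate_configurations(space):
    """Return one dict per combination of parameter values."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*space.values())]


def should_stop(val_losses, patience=3, min_delta=1e-4):
    """Return True when the last `patience` validation losses fail to
    improve on the earlier best loss by at least `min_delta`."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge yet
    best_before = min(val_losses[:-patience])
    best_recent = min(val_losses[-patience:])
    return best_recent > best_before - min_delta


configs = enumerate_configurations(SEARCH_SPACE)
print(len(configs))  # 4 * 3 * 2 * 4 = 96 combinations in this sketch
```

Each configuration dict would then be passed to whatever routine builds and trains a classifier, with `should_stop` consulted after every evaluation step so that poorly performing configurations consume as little training time as possible.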

PAGE 31

Figure IV.10: Effect of sampling method parameter on model performance.

PAGE 32

Figure IV.11: Effect of optimizer function parameter on model performance.

PAGE 33

Figure IV.12: Effect of activation function parameter on model performance.

PAGE 34

CHAPTER V
CONCLUSION

The work of this thesis was to evaluate an EHR-based deep learning model to predict T2DM diagnosis in a general population. The objective is to deliver a diabetes risk forecasting model using EHRs for early prediction of T2DM to allow for early intervention and treatment. The contributions of this work include the creation of an EHR-based machine learning dataset, the development of a deep learning diabetes prediction model, and the development of a framework for building optimized classifiers.

The diabetes machine learning dataset created represents EHR data across a general distribution of the population of the United States. This expands the potential usability of this prediction model as a clinical tool by increasing the number of patients and utilizing a much broader patient dataset than was previously available to researchers. The use of a large clinical dataset to create machine learning datasets provides significant value because of the vast availability of datasets for various populations. There is opportunity for future research to use the same EHR dataset for prediction of other chronic diseases. Further investigation into T2DM prediction should evaluate the feature set creation. While there is over a decade of EHR data available, this study was restricted to 12 months of clinical history per patient. Future studies should explore expanding the feature set to include multiple years of clinical history. Additionally, this study relied on ICD-9 and ICD-10 diagnosis codes for the target labels of diabetes diagnosis. These diagnosis codes are potentially an unreliable label, and future research should validate diabetes diagnosis labels through other means such as HbA1c scores or the presence of diabetes medication prescriptions.

The next contribution was to develop and evaluate a deep learning model to predict the timeline of a new diagnosis of T2DM within the following 6, 12, 18, and 24 month time periods from 12 months of a patient's clinical EHR history. The model was developed through statistical learning, using features selected through domain knowledge of common risk factors for developing T2DM. Validation of this model (see Figure IV.4) demonstrates that the

PAGE 35

diabetes prediction model developed through this research outperforms the ADA Diabetes Risk Test when applied to a general EHR population. The feature exploration of the diabetes prediction model provides insight into how further research might improve accuracy through various feature engineering methods. Based on the results of Figure IV.6, future research may consider binning some of the continuous feature columns to provide improved context to the model. For example, binning BMI values into 'normal', 'overweight', or 'obese' categories could improve model accuracy. Similarly, Figure IV.5 shows that improved context could be achieved by applying separate models to separate feature categories for improved specificity of the dataset.

The work of this thesis included development of a framework for developing optimized classifiers through a parameter search over selected optimization functions, activation functions, scaling methods, and sampling methods. This framework was made to be general purpose and can be used on any deep neural net binary classification problem. As demonstrated in this research, the model optimization framework provides insight into how different model parameters performed. For example, for the dataset generated in this research, the effects of dataset imbalance are apparent when comparing sampling methods. Figure IV.10 shows that, as the 6 and 12 month targets are highly imbalanced, selecting a sampling method is critical for a successful model. Additionally, Figure IV.11 shows that some optimization functions are more likely to get stuck on an incorrect solution. The review of the model optimization framework demonstrates a good foundation to assist future endeavors to develop models for other disease cohorts. Future work to expand the framework should consider including a grid search over the hidden units of the neural network. This search should be either a growth method of starting small and building out or a paring method of starting with a large network and reducing. If the goal is performance, the former should be used; if the goal is to prevent over-training, the latter.

The outcome of building a deep learning model on EHR data demonstrates that there is an application for machine learning in healthcare, and the feasibility of a model's potential

PAGE 36

effectiveness as a prediction tool in a clinical setting. Anything that can be done to identify dangerous diseases early in patients and provide opportunities for care intervention is valuable for population health management and improving healthcare. Early identification of the risk of T2DM in these patients provides opportunities for care intervention to promote better long-term outcomes. With the continued widespread adoption of electronic health records, deep learning prediction models will undoubtedly continue to be a significant area of advancement in healthcare.
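The BMI binning suggested in this chapter as future feature engineering could look roughly like the sketch below. The cut points (18.5, 25, and 30 kg/m^2) are the commonly used adult BMI thresholds and are an assumption here, as is the extra 'underweight' category. The thesis names only the 'normal', 'overweight', and 'obese' bins and does not specify boundaries. The function and column names are hypothetical.

```python
from bisect import bisect_right

# Commonly used adult BMI cut points (kg/m^2); assumed, not taken from the thesis.
BMI_EDGES = [18.5, 25.0, 30.0]
BMI_LABELS = ["underweight", "normal", "overweight", "obese"]


def bin_bmi(bmi):
    """Map a continuous BMI value to a categorical label.

    A value equal to a cut point falls into the higher category.
    """
    return BMI_LABELS[bisect_right(BMI_EDGES, bmi)]


def one_hot_bmi(bmi):
    """Encode the BMI category as 0/1 indicator columns for a model input."""
    category = bin_bmi(bmi)
    return {f"bmi_{label}": int(label == category) for label in BMI_LABELS}


print(bin_bmi(27.3))
print(one_hot_bmi(27.3))
```

Replacing the continuous column with indicator columns like these lets a network treat each clinical category separately, at the cost of discarding variation within a category. Whether that trade improves accuracy on this dataset would need to be tested.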

PAGE 37

REFERENCES

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[2] American Diabetes Association et al. Screening for type 2 diabetes. Diabetes Care, 23:S20, 2000.
[3] John Beilby. Definition of metabolic syndrome: report of the National Heart, Lung, and Blood Institute/American Heart Association conference on scientific issues related to definition. The Clinical Biochemist Reviews, 25:195, 2004.
[4] Leanne Bellamy, Juan-Pablo Casas, Aroon D Hingorani, and David Williams. Type 2 diabetes mellitus after gestational diabetes: a systematic review and meta-analysis. The Lancet, 373:1773–1779, 2009.
[5] CDC. National Center for Health Statistics, Mar 2017.
[6] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
[7] Martin R Cowie, Juuso I Blomster, Lesley H Curtis, Sylvie Duclaux, Ian Ford, Fleur Fritz, Samantha Goldman, Salim Janmohamed, Jörg Kreuzer, Mark Leenay, et al. Electronic health records to facilitate clinical research. Clinical Research in Cardiology, 106:1–9, 2017.
[8] International Diabetes Federation. IDF diabetes atlas - 7th edition.
[9] International Diabetes Federation. IDF diabetes atlas - 8th edition.
[10] Thomas Harder, Elke Rodekamp, Karen Schellong, Joachim W Dudenhausen, and Andreas Plagemann. Birth weight and subsequent risk of type 2 diabetes: a meta-analysis. American Journal of Epidemiology, 165:849–857, 2007.
[11] Maureen I Harris, Richard C Eastman, Catherine C Cowie, Katherine M Flegal, and Mark S Eberhardt. Racial and ethnic differences in glycemic control of adults with type 2 diabetes. Diabetes Care, 22:403–408, 1999.
[12] Frank B Hu, JoAnn E Manson, Meir J Stampfer, Graham Colditz, Simin Liu, Caren G Solomon, and Walter C Willett. Diet, lifestyle, and the risk of type 2 diabetes mellitus in women. New England Journal of Medicine, 345:790–797, 2001.

PAGE 38

[13] Alka M Kanaya, Christina L Wassel Fyr, Nathalie De Rekeneire, Ronald I Shorr, Ann V Schwartz, Bret H Goodpaster, Anne B Newman, Tamara Harris, and Elizabeth Barrett-Connor. Predicting the development of diabetes in older adults: the derivation and validation of a prediction rule. Diabetes Care, 28:404–408, 2005.
[14] John D Kelleher, Brian Mac Namee, and Aoife D'Arcy. Fundamentals of machine learning for predictive data analytics: algorithms, worked examples, and case studies. MIT Press, 2015.
[15] Erwin P Klein Woolthuis, Wim JC de Grauw, Willem HEM van Gerwen, Henk JM van den Hoogen, Eloy H van de Lisdonk, Job FM Metsemakers, and Chris van Weel. Identifying people at risk for undiagnosed type 2 diabetes using the GP's electronic medical record. Family Practice, 24:230–236, 2007.
[16] Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian E Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica B Hamrick, Jason Grout, Sylvain Corlay, et al. Jupyter notebooks - a publishing format for reproducible computational workflows. In ELPUB, pages 87–90, 2016.
[17] Robert Kocher, Ezekiel J Emanuel, and Nancy-Ann M DeParle. The Affordable Care Act and the future of clinical medicine: the opportunities and challenges. Annals of Internal Medicine, 153:536–539, 2010.
[18] Richard S Legro, Allen R Kunselman, William C Dodson, and Andrea Dunaif. Prevalence and predictors of risk for type 2 diabetes mellitus and impaired glucose tolerance in polycystic ovary syndrome: a prospective, controlled study in 254 affected women. The Journal of Clinical Endocrinology & Metabolism, 84:165–169, 1999.
[19] Subramani Mani, Yukun Chen, Tom Elasy, Warren Clayton, and Joshua Denny. Type 2 diabetes risk forecasting from EMR data using machine learning. In AMIA Annual Symposium Proceedings, volume 2012, page 606. American Medical Informatics Association, 2012.
[20] Payal H Marathe, Helen X Gao, and Kelly L Close. American Diabetes Association standards of medical care in diabetes 2017. Journal of Diabetes, 9:320–324, 2017.
[21] David M Nathan. Long-term complications of diabetes mellitus. New England Journal of Medicine, 328:1676–1685, 1993.
[22] Abner Louis Notkins. The causes of diabetes. Scientific American, 241:62–73, 1979.
[23] Umut Özcan, Qiong Cao, Erkan Yilmaz, Ann-Hwee Lee, Neal N Iwakoshi, Esra Özdelen, Gürol Tuncman, Cem Görgün, Laurie H Glimcher, and Gökhan S Hotamisligil. Endoplasmic reticulum stress links obesity, insulin action, and type 2 diabetes. Science, 306:457–461, 2004.
[24] Manaswini Pradhan and Ranjit Kumar Sahu. Predict the onset of diabetes disease using artificial neural network (ANN). International Journal of Computer Science & Emerging Technologies (E-ISSN: 2044-6004), 2, 2011.

PAGE 39

[25] Stephen S Rich. Mapping genes in diabetes: genetic epidemiological perspective. Diabetes, 39:1315–1319, 1990.
[26] Arlan L Rosenbloom, Jennie R Joe, Robert S Young, and William E Winter. Emerging epidemic of type 2 diabetes in youth. Diabetes Care, 22:345–354, 1999.
[27] Richa Saxena, Benjamin F Voight, Valeriya Lyssenko, Noel P Burtt, Paul IW de Bakker, Hong Chen, Jeffrey J Roix, Sekar Kathiresan, Joel N Hirschhorn, Mark J Daly, et al. Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science, 316:1331–1336, 2007.
[28] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
[29] Schouwenaars and Saaibi. Introduction to machine learning.
[30] Stefan Schulz, Albrecht Zaiss, Ralph Brunner, Daniel Spinner, and Rüdiger Klar. Conversion problems concerning automated mapping from ICD-10 to ICD-9. Methods of Information in Medicine, 37:254–259, 1998.
[31] Laura J Scott, Karen L Mohlke, Lori L Bonnycastle, Cristen J Willer, Yun Li, William L Duren, Michael R Erdos, Heather M Stringham, Peter S Chines, Anne U Jackson, et al. A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science, 2007.
[32] Guandong Xu, Yu Zong, and Zhenglu Yang. Applied Data Mining. CRC Press, 2013.
[33] Eleftheria Zeggini, Michael N Weedon, Cecilia M Lindgren, Timothy M Frayling, Katherine S Elliott, Hana Lango, Nicholas J Timpson, John RB Perry, Nigel W Rayner, Rachel M Freathy, et al. Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. Science, 316:1336–1341, 2007.