Citation
Qualitatively robust stochastic analysis of genomic, transducer-based and population data

Material Information

Title:
Qualitatively robust stochastic analysis of genomic, transducer-based and population data
Creator:
LaHeist, Charles F. ( author )
Language:
English
Physical Description:
1 electronic file (71 pages).

Subjects

Subjects / Keywords:
Probabilities -- Data processing ( lcsh )
Algorithms ( lcsh )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Review:
This thesis applies a specific statistical robust methodology to three diverse domains. The pedigree of this technique is drawn from Neyman-Pearson hypothesis testing, with the a priori constraints removed, which is powerful in certain situations where a priori probabilities are unavailable. The first and most remarkable application will be to genomic data, a domain more familiar with variants of Bayesian analysis. The application is perceivably more appropriate as the a priori information typically used is not agreed upon within the microbiology discipline. It is believed that the technique will, as decision theory, identify if a "signal" is present; or whether the data is overwhelmed with "noise". The technique is then applied to a known-noisy transducer-based data set, and a common population set for contrast. It was shown that qualitative robustness is advantageous for the analysis of both the transducer and population data, and had approximately four percent greater detection of outliers on the summary-statistics genomic files despite greater applicability to unprocessed data. It is clear that microarray data, historically described as noisy, should be analyzed with robust strategies, and that qualitatively robust techniques are a clear improvement over the more routine Bayesian analysis.
Thesis:
Thesis (M.S.)--University of Colorado Denver.
Bibliography:
Includes bibliographic references.
System Details:
System requirements: Adobe Reader.
General Note:
Department of Computer Science and Engineering
Statement of Responsibility:
by Charles F. LaHeist.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
911204908 ( OCLC )
ocn911204908

Full Text

PAGE 1

QUALITATIVELY ROBUST STOCHASTIC ANALYSIS OF GENOMIC, TRANSDUCER-BASED AND POPULATION DATA
by
CHARLES F. LAHEIST
Bachelor of Science, Computer Science and Engineering, 2007

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Master of Science
Computer Science
2015

PAGE 2

This thesis for the Master of Science degree by
Charles F. LaHeist
has been approved for the
Department of Computer Science and Engineering
by
Tom Altman, Advisor
Titsa Papantoni, Co-Advisor
Boris Stilman
Tam Vu

April 8, 2015

PAGE 3

LaHeist, Charles F. (M.S., Computer Science)
Qualitatively Robust Stochastic Analysis of Genomic, Transducer-based and Population Data
Thesis directed by Professor Tom Altman

ABSTRACT

This thesis applies a specific statistical robust methodology to three diverse domains. The pedigree of this technique is drawn from Neyman-Pearson hypothesis testing, with the a priori constraints removed, which is powerful in certain situations where a priori probabilities are unavailable. The first and most remarkable application will be to genomic data, a domain more familiar with variants of Bayesian analysis. The application is perceivably more appropriate as the a priori information typically used is not agreed upon within the microbiology discipline. It is believed that the technique will, as decision theory, identify if a "signal" is present; or whether the data is overwhelmed with "noise". The technique is then applied to a known-noisy transducer-based data set, and a common population set for contrast. It was shown that qualitative robustness is advantageous for the analysis of both the transducer and population data, and had approximately four percent greater detection of outliers on the summary-statistics genomic files despite greater applicability to unprocessed data. It is clear that microarray data, historically described as noisy, should be analyzed with robust strategies, and that qualitatively robust techniques are a clear improvement over the more routine Bayesian analysis.

The form and content of this abstract are approved. I recommend its publication.
Approved: Tom Altman

PAGE 4

DEDICATION

This thesis is dedicated to my wife, Oyun-Undram Namnandorj, and my children Lillian and James.

PAGE 5

ACKNOWLEDGMENT

This thesis would not have been possible without the generous time and patience of Professor Titsa Papantoni. I would also like to thank my advisor, Professor Altman, and my committee members Professors Boris Stilman and Tam Vu. A special debt of gratitude is owed Mrs. Hanna Altman, who shepherded this thesis from draft to final with her selfless corrections.

PAGE 6

TABLE OF CONTENTS

Figures  ix
Chapter
1. Introduction  1
   1.1 Objective  1
2. Mathematical Background  2
   2.1 Overview  2
   2.2 The Detection Problem  2
   2.3 Necessary terms and concepts  5
      2.3.1 Parametric versus nonparametric  5
      2.3.2 Composite hypothesis  5
   2.4 Neyman-Pearson  6
3. Qualitative Robustness  11
   3.1 Introduction to concepts  11
      3.1.1 Qualitative robustness, definition  13
   3.2 Hypothesis testing  14
   3.3 Procedure outline for data sources  14
      3.3.1 Procedure preface for genomic data  14
      3.3.2 Procedure  14
4. Data Under Study  16
   4.1 Introduction  16
   4.2 Genomic data  16
      4.2.1 Genomics  18
         4.2.1.1 Genes  18
         4.2.1.2 DNA, RNA, cDNA and mRNA  18
         4.2.1.3 Gene expression  19
         4.2.1.4 Hybridization assays  20

PAGE 7

      4.2.2 Microarrays  20
         4.2.2.1 Bioinformatics  21
         4.2.2.2 Usage (experiment types)  22
         4.2.2.3 Oligonucleotide versus cDNA microarrays  23
         4.2.2.4 A microarray experiment  23
      4.2.3 Known sources of data "noise" and irregularity  26
   4.3 Personal health monitoring device data  27
      4.3.1 Heart rate monitor (HRM)  28
         4.3.1.1 Heart physiology and electrocardiography  29
         4.3.1.2 Signal acquisition  29
      4.3.2 Known sources of data "noise" and irregularity  30
   4.4 Population data  30
      4.4.1 Scoring  30
      4.4.2 Known sources of data "noise" and irregularity  30
5. Development  31
   5.1 Microarray-based genomic data  31
      5.1.1 Introduction  31
      5.1.2 Domain resources  31
         5.1.2.1 Affymetrix library files  31
      5.1.3 MATLAB versions and toolboxes  33
      5.1.4 Experimental data  34
      5.1.5 Development  34
   5.2 Transducer-based HRM data  37
      5.2.1 Introduction  37
      5.2.2 Domain resources  37
      5.2.3 Experimental data  37
      5.2.4 Development  38

PAGE 8

   5.3 Population data  39
      5.3.1 Introduction  39
      5.3.2 Domain resources  39
      5.3.3 Development  39
6. Results  40
   6.1 Genomic data  40
   6.2 Transducer-based data  41
   6.3 Population data  43
7. Conclusion  44
   7.1 Lessons learned  44
   7.2 Future work  45
References  46
Appendix
A. Code  48
   A.1 Example - MATLAB code for HT HG-U133A microarray  48
   A.2 Example - MATLAB code for heart rate monitor data  54
      A.2.1 Driver  54
      A.2.2 FZero solver function  55
   A.3 Example - MATLAB code for population data  55
B. File Formats  58
   B.1 Affymetrix file formats  58
      B.1.0.1 Intensity files (CEL)  58
      B.1.0.2 CMAP experiment list  59
   B.2 HRM file formats  60
   B.3 Population file format  61
      B.3.1 Experimental data  61

PAGE 9

FIGURES

Figure
2.1 A representation of a simple detection problem. [12]  4
2.2 Best-effort recreation of a diagram within [23], page 87. (a) Composite hypothesis testing problem for single-parameter example. (b) Composite hypothesis testing problem.  6
2.3 Simple visual Neyman-Pearson example using normal PDFs, one $\mu = 0$, the other $\mu = \frac{1}{2}$.  7
2.4 Three different versions of a shifted threshold. The top shifts the threshold to the left beyond $H_0$'s mean, resulting in a very small false negative region, but very large false alarm. The middle shifts the threshold to the right to $H_1$'s mean, $\frac{1}{2}$, where the false negative is much larger, but the false alarm is smaller than even the starting case. The bottom version shifts the threshold to the right extreme, eliminating false alarm, but detectability as well.  10
3.1 The effect an outlier can have on a distribution. Solid green represents a common Gaussian distribution while dashed purple is a similar distribution with an outlier near the value '5' (a long-tailed distribution).  12
4.1 Microarray data analysis process; adapted from [1].  24
4.2 A block diagram of cDNA microarray image analysis [25].  27
4.3 A generalized version of the multi-stage amplifier common to a heart rate monitoring chest strap (generalized from the design cited in [24]).  29
5.1 A possible approach to microarray intensity analysis: focus the observation scope to single genes. Different genes are indicated by color in this simplified illustration, with probes scattered physically across the array, as is the case on the actual device.  35

PAGE 10

5.2 Another approach, this time normalizing a probe or probe set across multiple experiments (CEL files). Indicators (a), (b) and (c) all show the same probe in three different microarray experiments, and at various intensity levels due to either variations in the fluorescent marker or the adjustments for image capture.  36
5.3 An example of the data from the set; red is the measured heart rate, while the gray bars flag points where the device lost signal. For this interpretation, linear interpolation was used for signal loss periods.  38
6.1 Outliers identified by applied technique not flagged by the native microarray algorithm. Experiment number along the x-axis, number of outlier probes along y-axis.  41
6.2 Output samples for Heart Rate Monitor data, with thresholds in blue and green.  42
6.3 Output samples for population data, with thresholds in blue and green.  43

PAGE 11

1. Introduction

1.1 Objective

This thesis applies the qualitatively robust stochastic analysis technique outlined in Professor Titsa Papantoni's published work [18, 7, 6, 5, 2, 17, 16, 14] to three domains, one of which is believed a unique application. The pedigree of this technique is drawn from Neyman-Pearson hypothesis testing, with the constraints pertinent to a priori probabilities removed, which is powerful in certain situations where training data is unavailable. The first, and most remarkable application will be to genomic data, a domain more familiar with variants of Bayesian analysis; and the application is perceivably more appropriate as the a priori information typically used is not agreed upon as an industry. It is believed that the technique will, as decision theory, identify if a "signal" is present; or whether the data is overwhelmed with "noise". The technique is then applied to a known-noisy transducer-based data set, and a common population set for contrast.

A background chapter in genomics is provided, to give some understanding of the biological mechanisms behind the data used, and to track through processes that likely introduce noise or irregularities. It is not intended to be exhaustive in identifying these processes, but should categorically include the largest potential sources.

The deliverable intent is to present a method for deciding if a set of genomic experimental data is valuable for its purpose. This can be especially important to fields within medical biology, such as cancer research, that have become highly dependent on information derived from microarrays.

PAGE 12

2. Mathematical Background

2.1 Overview

In this chapter, a progression through concepts in detection theory will lead to the implemented methodology exercised in this thesis, that of qualitative robustness (described in detail in the subsequent chapter). First, the concept of a detection problem is presented, as a subcase within hypothesis testing and essentially a decision problem of matching a given observation to a distribution. The ability to parameterize the distributions that represent the decisions will then be examined, with the concept of composite hypotheses driving toward Neyman-Pearson and the principle of uniformly most powerful tests. The application of Neyman-Pearson-based analysis, rather than Bayesian, represents a departure from the typical techniques used within genomic statistical study [25, 15, 22, 1, 3], and a measure of its applicability is the underlying purpose of this work.

2.2 The Detection Problem

The simplest detection problem is that of a signal, either present or not, amid noise. An example, more commonly relatable, would be that of hearing a companion speak in a room full of other conversations; their voice would be the signal, and the others would be noise (this example also has the added advantage of demonstrating that sometimes both the signal and the noise behave similarly, in this case starting and stopping sporadically). From the analyst's perspective the problem is to discriminate between two possible models for the current condition: either your friend is speaking in the noisy room, or they are not; and during the passage of time this will vacillate. Formalizing, there are then two hypotheses:

- $H_0$, the friend is not speaking, therefore the ears are receiving only background chatter as noise; or
- $H_1$, the friend is speaking, therefore the ears are receiving the friend's voice embedded in the background chatter.

PAGE 13

This is not to say that detection theory is limited to only two hypotheses; there could be many sources of signal and a desire to determine that source. Within the context of our example, eavesdropping on other conversations would add additional hypotheses for other speakers.

Consider then, the construction of a mathematical model around the concept. Using white Gaussian noise $N(\mu, \sigma^2)$, $\omega(t)$ [11] as a model for the background chatter, the scalar '1' to represent the companion's voice being heard, and an observation $x(t)$ (a sample of the conversation), the hypotheses become:

- $H_0$: $x(t) = \omega(t)$ (noise only), or
- $H_1$: $x(t) = 1 + \omega(t)$ (signal with noise).

If the noise $\omega(t)$ is nonexistent or negligibly small, the problem is easily decidable. If the companion is speaking, the observation would be '1', else '0'. Considering significant noise, it would be reasonable to introduce a threshold of $\frac{1}{2}$ for the boundary between the two hypotheses, as it is equidistant from the extremes. Therefore, a decision rule is formulated:

- If $x(t) < \frac{1}{2}$, then only the noise is present; else
- If $x(t) > \frac{1}{2}$, there is both the signal and the noise.

A probability density function (PDF) represents the relative likelihood of a random variable taking a specific value. In Figure 2.1, the solid-line bell shaped curve is representative of the PDF for a Gaussian distribution centered at zero, with the highest probability values being at or near zero. The curve slopes downward for both positive and negative movement from zero, representing the declining probability that an observation from the random variable represented by the density function will take on those values.
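The midpoint decision rule can be exercised with a short simulation. The sketch below is illustrative only: the function names, the sample count, the seed and the noise standard deviation `sigma=0.5` are assumptions introduced here and are not taken from the thesis, whose own code is written in MATLAB. The rule is undefined at exactly 1/2 in the text; the sketch assigns that case to the noise-only decision.

```python
import random


def decide(x, threshold=0.5):
    """Return 1 (signal with noise) if x exceeds the threshold, else 0 (noise only).

    The tie x == threshold is not covered by the rule in the text;
    assigning it to 'noise only' is an assumption of this sketch.
    """
    return 1 if x > threshold else 0


def simulate(n=10000, sigma=0.5, seed=0):
    """Estimate the two error rates of the midpoint rule by simulation.

    sigma is a hypothetical noise standard deviation chosen for illustration.
    """
    rng = random.Random(seed)
    # Noise only is true: every 'signal' decision is a false alarm.
    false_alarms = sum(decide(rng.gauss(0.0, sigma)) for _ in range(n))
    # Signal of level 1 plus noise is true: every 'noise' decision is a miss.
    misses = sum(1 - decide(1.0 + rng.gauss(0.0, sigma)) for _ in range(n))
    return false_alarms / n, misses / n


false_alarm_rate, miss_rate = simulate()
print(f"false alarm rate ~ {false_alarm_rate:.3f}, miss rate ~ {miss_rate:.3f}")
```

Because the threshold sits halfway between the two means and both hypotheses share the same noise, the two estimated error rates should come out close to each other.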

PAGE 14

Figure 2.1: A representation of a simple detection problem. [12]

Examining the entire Figure 2.1, it is clear that the decision rule directly bisects the overlapping probability density functions, which is fairly effective but far from perfect. There is still plenty of the PDF for $x(t) = \omega(t)$ that falls within "Signal and Noise", which could cause a false alarm. This is a recurring theme in detection theory that speaks directly to its utility: should there be a cost associated, or a bias such that the chance of a false positive is minimized? For this example it would be an awkward conversation, but in applications such as radar and sonar, real consequences could come of seeing a target that is not there, or perhaps worse, missing one that was. Also, $x(t)$ is representative of a stochastic process, and the $H_1$ hypothesis is representative of it. The analyst would work with the observed behavior, $x(t)$, which is a realization of the underlying stochastic process (the companion, who at any given moment could be either talking or not, but not both in some time interval). And perhaps finally, it can be inferred correctly that if the signal were more intense than the assumed scalar '1' or the variance of the noise reduced, a more effective decision rule can be designed (signal intensity over noise variance would be the signal-to-noise ratio).
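The closing remark about signal-to-noise ratio can be checked numerically. The following sketch computes the two error probabilities of the midpoint rule in closed form for a few noise levels; the function name and the particular values of `sigma` are assumptions made for illustration, and the ratio printed follows the text's description of signal intensity over noise variance.

```python
from statistics import NormalDist


def error_probabilities(signal=1.0, sigma=1.0):
    """Error probabilities of the midpoint rule for Gaussian noise.

    Returns (P(false alarm), P(miss)) when the threshold is signal / 2.
    """
    threshold = signal / 2.0
    # Noise only: probability that the observation lands above the threshold.
    p_false_alarm = 1.0 - NormalDist(0.0, sigma).cdf(threshold)
    # Signal plus noise: probability that the observation lands below it.
    p_miss = NormalDist(signal, sigma).cdf(threshold)
    return p_false_alarm, p_miss


for sigma in (1.0, 0.5, 0.25):
    pfa, pm = error_probabilities(sigma=sigma)
    snr = 1.0 / sigma ** 2  # signal intensity over noise variance
    print(f"sigma={sigma:.2f}  SNR={snr:5.1f}  P(false alarm)={pfa:.4f}  P(miss)={pm:.4f}")
```

Both error probabilities shrink together as the noise variance falls, which is the sense in which a higher ratio permits a more effective rule.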

PAGE 15

It should now be noted that the model can become more mathematically rigorous in its description. Given that the PDF for our "noise" is well known:

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right),$$
PAGE 16

The impact to the hypothesis testing problem is illustrated in Figure 2.2(b), where the output of the source is $\theta$, a set of parameters. Again, the mapping for all known values of $\theta$ in the parameter space should be known. The caution is that the decision is the output, and is the only purpose for the hypothesis test; any determination of $M$ or $\theta$ is unwanted post-decision.

Figure 2.2: Best-effort recreation of a diagram within [23], page 87. (a) Composite hypothesis testing problem for single-parameter example. (b) Composite hypothesis testing problem.

2.4 Neyman-Pearson

Extending the example from the beginning of this chapter, suppose that a realization is observed and two possible hypotheses exist: one described by a Gaussian distribution of 0-mean and unit variance, and another of the same type, but with 1/2

PAGE 17

mean and unit variance (either $N(0, 1)$ or $N(\frac{1}{2}, 1)$). This essentially sets up the question: from which does the realization originate?

Specifically, the first hypothesis $H_0$ ($\mu = 0$), which is the null or nonemphatic hypothesis, with a parameterized vector of a single element for a single parameter (the two vary only by mean) that describes a Gaussian distribution. The second hypothesis $H_1$ is the alternate or emphatic hypothesis, similarly parameterized for mean and also a Gaussian distribution. An illustration of their overlapping PDFs is Figure 2.3.

Figure 2.3: Simple visual Neyman-Pearson example using normal PDFs, one $\mu = 0$, the other $\mu = \frac{1}{2}$.

With identical distribution types and identical variance it would be quite difficult to determine the acting hypothesis simply from the location of a sample in either one of the two error regions. The same technique as before could be employed, setting a rule set based on a line (this time at 1/4) and defining realizations: if $x(t) > 1/4$, then $H_1$; else $H_0$. 'Error region 1' would be the case where $H_0$ was chosen, but $H_1$ was true (as in the active process), and 'error region 2' is similarly where $H_1$ was chosen, when $H_0$ was true. The notation will be $P(\text{decides } H_0; H_1 \text{ true})$ or simply $P(H_0 \mid H_1)$; and $P(\text{decides } H_1; H_0 \text{ true})$ or $P(H_1 \mid H_0)$. Formally,

$$P(H_1 \mid H_0) = \Pr\left(x(t) > \tfrac{1}{4};\ H_0\right) \quad \text{and} \quad P(H_0 \mid H_1) = \Pr\left(x(t) < \tfrac{1}{4};\ H_1\right).$$
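The two error-region probabilities can be evaluated directly. This is a minimal sketch that assumes the two hypotheses are unit-variance Gaussians with means 0 and 1/2 (as in the caption of Figure 2.3) and a threshold of 1/4; the variable names are introduced here for illustration.

```python
from statistics import NormalDist

# Assumed model: H0 ~ N(0, 1), H1 ~ N(1/2, 1), decision threshold at 1/4.
h0 = NormalDist(mu=0.0, sigma=1.0)
h1 = NormalDist(mu=0.5, sigma=1.0)
threshold = 0.25

# 'Error region 2': H1 is decided although H0 is true (false alarm).
p_h1_given_h0 = 1.0 - h0.cdf(threshold)
# 'Error region 1': H0 is decided although H1 is true (missed detection).
p_h0_given_h1 = h1.cdf(threshold)

print(f"P(H1 | H0) = {p_h1_given_h0:.4f}")
print(f"P(H0 | H1) = {p_h0_given_h1:.4f}")
```

With the threshold midway between the two means the regions are mirror images, so the two probabilities coincide, and each is large because the distributions overlap heavily.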

PAGE 18

'Error region 2' and its associated probability has an important implication, especially given how the problem has been stated. The false alarm probability is that of choosing the alternative hypothesis when the null hypothesis is true. If this were a radar system, this would be the probability of detecting aircraft when there are none. In Figure 2.3, the probability of detecting something present when it is not, is equal to the probability of detecting nothing when there is something there. Taking into consideration how much these regions represent the PDFs of the two hypotheses, additional work needs to be done.

Let us first consider what shifting the threshold will do. In Figure 2.4, the threshold is moved with consequences described. Two things become obvious: it is not possible to simultaneously reduce both error probabilities, and to minimize the false alarm probability eventually becomes counterproductive as it simultaneously diminishes detectability. The Neyman-Pearson Likelihood Ratio states that to maximize the probability of detection for a given false alarm probability $\alpha$, decide $H_1$ if

$$L(x) = \frac{p(t \mid H_1)}{p(t \mid H_0)} > \lambda,$$

where the threshold $\lambda$ is determined by

$$\alpha = \int_{\{t \,:\, L(x) > \lambda\}} p(t \mid H_0)\, dt.$$

Then, being more specific, the probability of detection,

$$P_{\text{detection}} = P(H_1 \mid H_1) = \Pr\{x(t) > \lambda \mid H_1\} = \int_{\lambda}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}(t - 1)^2\right) dt,$$

PAGE 19

is to be maximized, while the false alarm probability,

$$P_{\text{false alarm}} = \alpha = P(H_1 \mid H_0) = \Pr\{x(t) > \lambda \mid H_0\} = \int_{\lambda}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}t^2\right) dt,$$

is to be kept bounded.

The Neyman-Pearson formalization requires a minimal set of assets: a parametrically-known or composite hypothesis (null), a well-known or noncomposite hypothesis (alternate), a realization to which we will associate a likelihood as a match to the alternate hypothesis, a false alarm limit, and the false alarm rate $\alpha$. In the examples used in this section, hypotheses distinguished by simply a shift in location parameter were considered, for the benefit of visual demonstrations. The robust model arises as a special case of nonparametrically described hypotheses.
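For two unit-variance Gaussians that differ only by a positive shift in mean, the likelihood ratio increases with the observation, so the test reduces to comparing the observation with a threshold fixed by the allowed false alarm probability. The sketch below illustrates that special case only; the function name, the symbol `alpha`, and the default alternate mean `mu1=1.0` are assumptions introduced here (pass `mu1=0.5` for the setting of Figure 2.3).

```python
from statistics import NormalDist


def neyman_pearson_threshold(alpha, mu1=1.0):
    """Threshold and detection probability for N(0, 1) versus N(mu1, 1), mu1 > 0.

    alpha is the allowed false alarm probability P(decide H1 | H0 true).
    """
    null = NormalDist(0.0, 1.0)
    alternate = NormalDist(mu1, 1.0)
    # Choose the threshold so that exactly alpha of the null mass lies above it.
    threshold = null.inv_cdf(1.0 - alpha)
    # Detection probability: alternate mass above the same threshold.
    p_detection = 1.0 - alternate.cdf(threshold)
    return threshold, p_detection


for alpha in (0.20, 0.05, 0.01):
    thr, pd = neyman_pearson_threshold(alpha)
    print(f"alpha={alpha:.2f}  threshold={thr:.4f}  P(detection)={pd:.4f}")
```

Tightening the false alarm bound pushes the threshold to the right and lowers the detection probability, which is the trade-off described for the shifted thresholds.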

PAGE 20

Figure 2.4: Three different versions of a shifted threshold. The top shifts the threshold to the left beyond $H_0$'s mean, resulting in a very small false negative region, but very large false alarm. The middle shifts the threshold to the right to $H_1$'s mean, $\frac{1}{2}$, where the false negative is much larger, but the false alarm is smaller than even the starting case. The bottom version shifts the threshold to the right extreme, eliminating false alarm, but detectability as well.

PAGE 21

3. Qualitative Robustness

3.1 Introduction to concepts

Statistical inferences are derived from observations and assumptions. Robust statistics is primarily concerned with the pragmatic rationalizations made about those assumptions, usually for the purpose of simplifying or making problems mathematically convenient [10]. In the earlier example, white Gaussian noise was assumed to be an appropriate model for the background conversations obscuring the ability to hear a companion speak. That assumption led to the use of Gaussian PDFs, and a system model was created for solution. At its essence, the decision problem is:

- an observation assumed derived from an underlying distribution;
- two hypotheses, which themselves represent distributions likely assumed to be something mathematically convenient, such as Gaussian;
- an attempt to match the presumed distribution of the observation to the model distributions of the hypotheses; and
- the decision; the matching hypothesis is "true".

If a value within the observation is extreme and distinct, enough so that it might be considered an outlier, then that value can cause the observation's distribution to become unmatchable within the problem model. For example, in Figure 3.1 a standard Gaussian distribution is compared to one that has become skewed due to a low-probability event (an outlier near the value '5') lengthening its tail. Robustness as a treatment is applied to make the shape of the observation distribution less sensitive to outliers and better fit the assumed models (specifically at the low probability tails).

The most basic example of the property of robustness is the comparison of mean and median. For example, given a set of data $\{2, 3, 4, 3, 1\}$, the mean would be 2.6 and the median 3. If an element drastically unlike the others were introduced into the set, such as 125, the mean becomes 23 while the median remains 3. The median was not

PAGE 22

Figure 3.1: The effect an outlier can have on a distribution. Solid green represents a common Gaussian distribution while dashed purple is a similar distribution with an outlier near the value '5' (a long-tailed distribution).

as susceptible to change as the mean with the addition of the outlier. Whether the value represents fact is not relevant, as only a mathematically-supportable method for dealing with it is needed; our only intention is to prevent low-probability events from dominating.

There are two commonly accepted methods that can be applied [8, 10]:

1. Discretionarily modifying a small subset of the observation. For example, a simple threshold as a validator for membership to the observation set could be applied, such as disallowing values in excess of 106 °F for a human body temperature, known to be life-threatening and unlikely.
2. Marginally modifying the entire observation. Applying the base-10 logarithm to all values in the first set we have $\{0.301, 0.477, 0.602, 0.477, 0\}$ with a mean of 0.371 ($e^x = 1.449$). Adding the element 125 into the set as before produces a mean of 0.659 ($e^x = 1.933$), a relatively small change to the mean.

Robustness is the insensitivity to small deviations from the assumptions, and robust statistics is the statistics of approximate parametric models. "Robust statistics is a body of knowledge, partly formalized into theories of robustness, relating to

PAGE 23

deviations from the idealized assumptions in statistics" [8]. Its purpose is pragmatic: in very few situations does the data set under observation conform to the assumptions applied, as they are made for tractability given computational or theoretical limits [13]. Statistical inferences, therefore, are only in part formed by the data; the assumptions themselves are equally important. In every case, explicit or implicit assumptions are made that are acceptably inexact or false, but act as "convenient rationalizations of an often fuzzy knowledge or belief" [10]. It is applied in situations where there is a desire to identify a structure that best fits most of the data, and the quick, reliable identification of outliers for further analysis, if necessary.

3.1.1 Qualitative robustness, definition

Hampel's definition of qualitative robustness [8] was formalized in [14] and [18]. According to the latter formalization, an estimate corresponding to a function $T$ is qualitatively robust, if $T$ is continuous with respect to the metrics $d_1$ and $d_2$; that is, if, given $\varepsilon > 0$, there exists $\delta > 0$, such that $d_1(F, G) < \delta$ implies $d_2(T(F), T(G)) < \varepsilon$. The choice of the $d_1$ and $d_2$ metrics controls the specific characteristics of the qualitatively robust property: a weak metric $d_1$ together with a strong metric $d_2$ induce a strong qualitatively robust operation $T$. From a given observation, $x_1, \ldots, x_n$, and the empirical distribution function

$F_n(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{x_i \ldots$
PAGE 24

If the observation set converges in distribution to distribution $F$, and if $T$ is such that

$$T(F) = \lim_{n \to \infty} T(F_n),$$

then $T$ is consistent at $F$.

3.2 Hypothesis testing

The justification for the technique applied herein can be generalized to some degree: if given two from a library of stochastic processes (non-parametric, classes of parametric, etc.), each being identifiable and distinct as hypothesis, describing a stochastic process or processes, then the assumption used is that during any given time interval $[0, T]$ just one hypothesis is "true" (active). In Neyman-Pearson formalization, the decision as to which hypothesis is acting is determined by the performance criterion deployed and it is coined the decision rule.

3.3 Procedure outline for data sources

3.3.1 Procedure preface for genomic data

1. Estimate standard deviation from the control data by gene.
   a. estimate mean by taking the median $m$
   b. compute variation $|\ldots - m|$, and compute median
2. For …, compute the median of the control versus treatment difference.

3.3.2 Procedure

1. For given control value …, select …: $|\ldots| = |\ldots|$
2. Select … in (0, 0.5).
3. Use noise standard deviation $\sigma$

PAGE 25

4. Find $d$ from $G(\ldots, d) = 1 - \ldots + \exp(\ldots) \ldots = 1$.
5. If $x$ is the control versus treatment difference, truncate as: $z(x) = d_1$ for $x \ldots d_1$; $z(x) = x$ for $d_1 < \ldots$
…
7. Given $x > 0$, map $z(x)$ to 1 if $z(x) \ldots$, and map $z(x)$ to 0 if $z(x) < \ldots$
8. Given $x < 0$, find $z_1$ as above, as well as …, and map $z(x)$ to $-1$ if $z_1 \ldots$; map $z(x)$ to 0.
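The median-based scale estimate and the truncate-then-threshold steps can be sketched as follows. This is a rough illustration under stated assumptions and not the thesis's procedure: the equation that determines the truncation bound is not reproduced, so the bounds `d1` and `d2` and the decision level `threshold` are hypothetical inputs, the upper bound and the symmetric handling of negative differences are guesses, and all function names are introduced here.

```python
from statistics import median


def robust_location_and_scale(control):
    """Median and median absolute deviation of the control values for one gene."""
    m = median(control)
    mad = median([abs(v - m) for v in control])
    return m, mad


def truncate(x, d1, d2):
    """Clip x to the interval [d1, d2] (hypothetical truncation bounds)."""
    return max(d1, min(x, d2))


def decide(x, d1, d2, threshold):
    """Map a control-versus-treatment difference to 1, 0 or -1.

    Assumption of this sketch: positive differences are compared with
    +threshold and negative differences with -threshold after truncation.
    """
    z = truncate(x, d1, d2)
    if x > 0:
        return 1 if z >= threshold else 0
    if x < 0:
        return -1 if z <= -threshold else 0
    return 0


# The median-based estimates are unchanged by a single gross outlier.
print(robust_location_and_scale([2, 3, 4, 3, 1]))
print(robust_location_and_scale([2, 3, 4, 3, 1, 125]))
print(decide(5.0, -2.0, 2.0, 1.5), decide(0.3, -2.0, 2.0, 1.5), decide(-5.0, -2.0, 2.0, 1.5))
```

Truncation caps the influence any single difference can have on the decision, which is the property the procedure relies on.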

PAGE 26

4. Data Under Study

4.1 Introduction

The primary purpose of this thesis was to test the applicability of a Neyman-Pearson-derived qualitatively robust methodology to a data set typically served by Bayesian derivatives, and further identify any additional benefit for having done so. That data set is genomics, especially as it is used to do cancer and pharmacology research, and is especially popular as devices such as microarrays have revolutionized and somewhat democratized the data available in this field.

In the course of testing on the genomics data set, it also became clear that applicability to other kinds of data should be verified, so two additional types of data sets were added: transducer-based and population data. The transducer-based data would be an in-field application, in the sense that the applied Papantoni method is from electrical engineering communication theory. Population data sets are typically analyzed with simple statistics (mean, standard deviation, etc.).

4.2 Genomic data

The principal focus of this work is genomic, chosen for a number of reasons, first of which is the common interest as genomic research is and has been ubiquitous in scientific writing including more general public news publications. The other reason is the availability of sources and tools in the field, including specialized libraries and procedures. There are free and open source toolsets such as DNAMR (URL: http://www.rci.rutgers.edu/cabrera/DNAMR) for the R statistical language and Bioconductor (URL: http://www.bioconductor.org) for a wide array of languages. Also, there are commercial packages available for languages such as MathWorks® MATLAB® (used in this thesis), SAS® JMP®, Data Description Inc.'s DataDesk®, and MINITAB® [1]. Finally, there are a number of sources of comprehensive data, genomic databases such as what is used in this thesis (Broad Institute's CMAP, URL:

PAGE 27

https://www.broadinstitute.org/cmap/), which offer plenteous opportunity for the technique to be applied, and the conclusion tested.

The practice of genomic research within which this work would be considered a member is exploratory data analysis (EDA) [1]. Principally interested in attempting to identify system structures or patterns within genomic data, EDA would be pattern recognition as induced by detection theory, wherein the identification of anomalous data as outliers is well-documented as necessary. Specifically, a stage prior to pattern recognition is known as the confirmatory data analysis (CDA) phase and is inclusive of descriptive (mean, median, mode, standard deviation, variance, etc.) and inferential methods (confidence, hypothesis testing, p-value, etc.); it is a decision and/or data cleansing phase dominated by prior assumptions [1]. The quality of the data must be evaluated, and this thesis offers a technique based on Papantoni's parameter estimation approach [12], which is specifically designed for that purpose: deciding whether a data set is worth further scrutiny for information.

From a generalized point-of-view, high-dimensional data collection and reporting noise can be introduced and propagated readily into the analysis and may lead to incorrect conclusions. This noise could lead to a misunderstanding of which genes were expressed within a particular experiment, which in turn would cause the researcher to draw incorrect deductions from any coexpression (presumably correct, but could well also be erroneous) and an annotation database. Exhaustive searches using tools such as annotation databases, pathway diagrams, and documented levels of expression based on gene function, represent great cost [1] that should be remediated by early signal detection amid collection and reporting noise. More details of genomic experiment procedures and how noise is introduced in the collection and reporting processes are discussed in subsequent sections in this chapter.

PAGE 28

4.2.1 Genomics

In this section an essential understanding and explanation of the principles of genomics relevant to this thesis are presented. Simply, genomics is "the scientific study of biological processes from the perspective of the whole genome [21]." Structural genomics is the sequencing of the entire genome (as was the famous subject of the Human Genome Project) with the purpose of generating an annotation database (a sequence dictionary of sorts); evolutionary genomics is the comparative analysis of similarities and differences between individuals or species; and functional genomics attempts to relate sequences to function [21].

4.2.1.1 Genes

The aggregate components of a genome are genes, identified sequences of chemical bonds with a significance towards the coding of an organism, that are derived from the cellular deoxyribonucleic acid (DNA), considered analogically the "blueprint" for that organism.

4.2.1.2 DNA, RNA, cDNA and mRNA

DNA is the hereditary material contained within the nucleus of a cell as a component of a chromosome; contains the genetic information necessary for the structure, function and development of an organism; and is capable of replication with a low rate of mutation (a mutation being sequence variation between genes) [21].

Derived from DNA is ribonucleic acid (RNA), representative of the sequence of an "unzipped" DNA strand and embodying the genomic information used in transcription and translation; transcription being the necessary mechanism for cellular reproduction and translation is the means by which proteins are produced.

PAGE 29

Experimentation for gene expression exploits the nature of messenger RNA (mRNA, protein coding RNA) by creating a complementary sequence through reverse transcription to which a fluorescing molecule is added. This modified reverse transcribed version of the sequence is known as complementary DNA or cDNA, and is more stable for use in experimentation. The fluorescent marker is used to label the cDNA for detection in a process known as hybridization [9].

4.2.1.3 Gene expression

The sequences of note in genes within chromosomes within genomes are constructed of what are called base pairs. It is well known that the allegorical rungs of the DNA ladder are nucleobase chemical compounds that are complementary and bind together: A (adenine) pairs with T (thymine) and G (guanine) with C (cytosine). The sequence then is the pattern of these nucleobase chemical compounds along one 'rail', as under translation or transcription the integrity of the pattern is maintained through the pairing. In mRNA adenine's complementary pair is uracil (U), an unmethylated form of thymine [22, 19, 9].

Triplets within mRNA code for one of 20 possible amino acids that make up proteins, or the start or stop codons. The start and stop codons delimit the sequence called, with the codons, an open reading frame (ORF) [1].

The fixed pairing of the nucleobases and the known delimiters make it possible to derive testable sequences. Genes, using mRNA, then either cause the construction of amino acids that in a chain then build known proteins, or generate functional RNA, in a process called gene expression.

A precept of gene expression experimentation is the notion that each cell (nucleated eukaryotic cell, specifically) within an organism contains all the chromosomes within its nucleus to express any gene, and a gene expression experiment, at its most essential, is the determination of the amount of mRNA in a given sample [9].
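The fixed pairing rules and the codon delimiters lend themselves to a small worked example. The sketch below is illustrative only: the function names and the sample sequence are invented here, and the specific start codon (AUG) and stop codons (UAA, UAG, UGA) come from the standard genetic code rather than from the text above.

```python
# Complementary base pairing in DNA: A with T, G with C.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Standard genetic code delimiters (not named in the text above).
START_CODON = "AUG"
STOP_CODONS = {"UAA", "UAG", "UGA"}


def complement_strand(dna):
    """Return the base-by-base complement of a DNA strand."""
    return "".join(DNA_COMPLEMENT[base] for base in dna)


def transcribe(coding_dna):
    """Write a DNA coding strand as mRNA by replacing thymine with uracil."""
    return coding_dna.replace("T", "U")


def first_open_reading_frame(mrna):
    """Return the first start-to-stop stretch read in triplets, or None."""
    start = mrna.find(START_CODON)
    if start < 0:
        return None
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return mrna[start:i + 3]  # include both delimiting codons
    return None


mrna = transcribe("GGATGGCCTAAGG")
print(complement_strand("ATGC"))
print(mrna)
print(first_open_reading_frame(mrna))
```

Reading in steps of three from the start codon is what keeps the frame fixed; a stop triplet that appears out of frame is ignored by this sketch.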

Gene expression experiments typically do not work with whole genes, but instead with subsequences of a few hundred base pairs of a gene called an expressed sequence tag (EST), or even smaller subsequences (up to 50 base pairs) called oligonucleotides. ESTs have an advantage beyond being more manageable in size, as they represent a gene by including only exon, or coding, base pairs rather than intron [1].

4.2.1.4 Hybridization assays

The essential principle of assays is derived from the information so far presented, leading to the conclusion that a strand of mRNA will hybridize with a DNA strand with complementary base pairing. It should be noted, however, that since the hydrogen bonds between adenine and thymine are less stable than those between guanine and cytosine, there can be approximate matches so long as a sufficient number of complementary base pairs exist in the sequence. Annealing conditions regulate hybridization, including inexact pairings, and these near matches have also been used to infer genetic information [21], and are accounted for in the development process to define the cDNA sequences within the assays.

Through a process called "Dideoxynucleotide DNA sequencing" (see [21] for a basic understanding of sequencing), genes and their alleles are identified. The subsets of these genes are then turned into an assay probe: a homogeneous sample of the single-stranded DNA molecules of a known sequence, which has been prepared and labeled with a reporter chemical, typically a radioactive or fluorescent marker [1]. An array of probes is then an assay.

4.2.2 Microarrays

A cell's ability to react to stimuli, build and maintain cellular structures, biosynthesis of macro-molecules and energy production are all dependent on protein synthesis derived from transcription. Each cell contains all the information within its DNA to express all possible proteins for an organism; however, a certain cell's activity, indirectly its "function", is represented by the genes it expresses and to what level; while
both a lung and kidney cell contain DNA encoding for all proteins, they each express different mRNA and/or at different levels, which can be used to uniquely identify their role within the body [25].

A microarray is a construct of "spots" or "probes", each containing millions of strands of the same DNA sequence (oligonucleotides). The strands are constructed using a chemical synthesis process that "prints" the oligonucleotides using photolithography. They are typically about 25 base pairs long; and with other probes can represent gene sequence sites associated with a property or condition, typically 11-20 probes in a probe set. The intention, beyond being able to test longer than 25-mer genes, is to improve the signal-to-noise ratio and reduce false positive detection. The amount of oligonucleotide in each spot corresponds to its capacity, and to the amount of fluorescent marker present, which creates a range of detection for that sequence, as a probabilistic indicator of expression level [9].

An early version of a microarray used experimentally within a source of data utilized both perfect match (PM) and mismatched (MM) probes, which varied by a single base pair. As noted earlier, approximate matches are possible, especially at certain annealing temperatures and base pair types. At the time of their use, it was thought that subtracting the MM from the PM probes would be an effective background subtraction for normalization [9].

4.2.2.1 Bioinformatics

Even with common diverse physical characteristics, such as eye color and blood type, all humans are approximately 99.9% genetically identical. The collection of all the genetic information for a human being has about three billion base pairs, partitioned into 23 pairs of chromosomes (one from each parent, called homologous chromosomes). Each chromosome ranges from 50 to 250 million base pairs in length, and each chromosome contains a subset of the approximately 40,000 genes. Each of these genes varies in length from a few hundred to several thousand base pairs. Within
these genes, variations such as those that code for eye color are known as alleles. Each person carries two alleles of each gene; in some cases they are identical between the chromosomal pair (homozygous), and in others only one version is expressed (heterozygous). Within the genes themselves, only a few sequences code for proteins, called exons. Between the exons are sequences called introns, which have no direct coding function and are excluded in transcription [1].

Given that this large volume of information is almost identical, it logically follows that sequencing the entire genome would be of significant value (such as the much-publicized Human Genome Project). It also imparts value on subsequences available in repositories such as Bioconductor, ArrayExpress, Gene Expression Omnibus [1], and the data under study here, CMAP [20].

4.2.2.2 Usage (experiment types)

Microarray experiments are used in a variety of biological research areas [1], including:

Soft Tissue. Microarray experiments are done on tissue samples from various parts of the body to identify the expression profile as it relates to function. For example, tissue samples from the liver would be examined for the proteins and differing mRNA present.

Developmental Genetics. Similar to above, but testing tissues at various stages of development.

Genetic Diseases. Identification of mutations and expression profiles of diseased tissue.

Complex Diseases. Similar to above, but instead identifying genetic and expression profiles for polymorphisms that make individuals susceptible to certain diseases.

Pharmacological Agents. Identification of genes and expression profiles for exposure to certain chemicals or other alterations to the environment.

Plant Breeding. Identification of genes responsible for specific wanted traits to enable the creation of more desirable plant varieties (for example, higher yield or sweeter fruit).

Environmental Monitoring. Measurement of the changes in expression profiles when genes are exposed to certain stressors, especially contaminants.

The data utilized within the context of this work is a combination of Genetic Diseases, Complex Diseases and Pharmacological Agents, as CMAP is comprised of a series of experiments that represent the effects of varying dosages of pharmacological compounds on known cancerous cells. Regardless of research area classification, all microarray studies are attempts to identify genes and expression profiles between differing conditions by which a hypothesis is tested against observations.

4.2.2.3 Oligonucleotide versus cDNA microarrays

The concern of this thesis is primarily with the analysis of the data resultant from the hybridization process, but there are sources of noise and irregularities that can occur because of the type of microarray used. There are essentially two types of genetic probes: oligonucleotide and cDNA. An oligonucleotide microarray is comprised of synthetically generated sequences of genes, whereas cDNA microarrays utilize direct clones (usually from a cDNA library) of known sequences. The choice of the type of microarray is important to the biological experimenter, but falls outside the scope of this research.

4.2.2.4 A microarray experiment

An overview of the procedural steps for a microarray study is shown in Figure 4.1, and is detailed in this section to a level sufficient to understand both what the data represents and where the sources of uncertainty lie.

1. Preparation. The microarray itself is an array of tiny spots containing either cDNA from a library, or oligonucleotide sequences printed in place by
photolithography or inkjet technology. The sample for testing is prepared by reverse transcribing cDNA from mRNA with the addition of a fluorescing dye for later detection.

Figure 4.1: Microarray data analysis process; adapted from [1].

2. Hybridization. In a special chamber that controls environmental conditions such as temperature and humidity, the fluoro-cDNA is joined with the microarray for an incubation period (typically 16-19 hours). Following incubation, the excess fluoro-cDNA is washed away and the microarray is dried.

3. Scanning. The goal of scanning is to capture the best image between "low copy" genes (those whose mRNAs are not abundant in the test sample) and "high copy" genes. It is important to note that mRNA levels can vary greatly, from nothing to millions of copies, while fluorescent detection is typically in the range of 40,000-65,000 levels of contrast intensity, as this introduces the need for adjustments to the sensitivity of the scanning system (adjustments to the photomultiplier tube gain by an operator).

4. Processing the scanned image.

Gridding. The scanned image is aligned to a grid to identify regions for spot analysis, typically by identifying the centers.

Segmentation. Removal of the background. This is a non-trivial task, because spot shape can vary from crescent or doughnut to filled circle.

Quantification. Average intensity of the spot is determined.

Quality. The quality of the array should be evaluated, especially in cases where there are localized variations in intensity not isolated to individual spots, inconsistent background intensity, or defective spots. The latter case creates situations where the defective spots may be ignored or down-weighted.

Background. The ideal background (non-spot) intensity is zero; however, there can be non-specific binding of the fluorescent labeled sample. The assumption is made that the non-specific binding is uniform and additive, and therefore a measurement is made for the background intensity and it is subtracted from the spot intensities. There is a significant exception, as data used in this thesis uses a microarray type that employs perfect match (PM) / mismatch (MM) technology: in the case of the Affymetrix® GeneChip® Human Genome U133 array, the signal is not directly measured, but instead determined by combining the PM and MM intensities, $S_g$ in

$$S_g = \frac{\sum_{i=1}^{m_g} \left( PM_{gi} - MM_{gi} \right)}{m_g} = \frac{\sum_{i=1}^{m_g} Y_{gi}}{m_g}.$$

Any values greater than three standard deviations from the mean are disregarded.

5. Preprocessing microarray data. Prior to any data analysis, a preprocessing step is applied to scale the data, remove systematic sources of variation and identify discrepancies. Typical sources of systematic bias are: the concentration and amount of DNA placed in the microarrays, arraying equipment such as spotting pins that wear out over time, mRNA preparation, reverse transcription bias, hybridization efficiency, lack of spatial homogeneity of the hybridization on the slide, scanner settings, saturation effects, background fluorescence, linearity
of detection response, ambient conditions, dye bias due to physicochemical properties, labelling efficiencies, and scanning properties of the dyes. Scale-modifying preprocessing transformations are performed, typically utilizing base 2 logarithms, as this makes the intensities less dependent on the magnitude of the values, reduces skewness of distributions, and improves variance estimation [1, 13]. Early microarray experiments had significant variations in intensities, even for microarrays treated identically. Improvements in microarray technology have made the results more consistent, but these variations remain, albeit smaller in magnitude.

4.2.3 Known sources of data "noise" and irregularity

As with any sampling process, the quality and quantity are sources of disparity. The DNA concentrations and their ability to be isolated vary widely by tissue type.

Sources of "noise" within the microarray technology itself can be derived from the process to create either the cDNA or oligonucleotide strands. Oligonucleotide microarrays can have variations in the sequence synthesis due to lithographic inaccuracies (typically print head-related), while cDNA microarrays have the equivalent transcription inaccuracy (it can be increased by the amplification process, post isolation). At the time of hybridization, stringency, or the environmental variables of temperature, salt concentration and pH, can be a source for variation [9]. For example, high stringency at high temperatures and/or low salt concentrations can slow the hybridization process, but help assure correct complementary sequence binding. Low stringency can cause non-specific (non-complementary) binding.

Overall, variations or sources of "noise" can occur at about any step in the process: from sample extraction and amplification, labeling, hybridization, scanning and final-value statistical analysis (like CEL files). There are containment procedures designed to mitigate the noise, such as background correction, normalization, and summary of multiple probes per transcript, as well as other quality control measures [9].
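The PM/MM signal, the three-standard-deviation rule and the base 2 logarithm described in the processing and preprocessing steps can be sketched as follows. This is an illustrative Python sketch, not the vendor algorithm: the function names are hypothetical, the use of the population standard deviation is an assumption, and the text does not state at which stage the three-sigma rule is applied.

```python
import math
import statistics


def probe_set_signal(pm, mm):
    """Average of the PM - MM differences (the Y values) over one probe set."""
    diffs = [p - m for p, m in zip(pm, mm)]
    return sum(diffs) / len(diffs)


def drop_beyond_three_sigma(values):
    """Keep only values within three standard deviations of the mean.

    Population standard deviation is assumed here.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) <= 3 * sd]


def log2_intensities(values):
    """Base 2 logarithm of the positive intensities (non-positive ones are skipped)."""
    return [math.log2(v) for v in values if v > 0]


signal = probe_set_signal([120, 150, 130], [20, 30, 10])
kept = drop_beyond_three_sigma([10] * 20 + [1000])
logged = log2_intensities([8, 1024])
```

In the second call, the single value of 1000 lies more than three standard deviations from the mean of the 21 values and is therefore dropped, which is the behaviour the rule is meant to produce.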

Figure 4.2: A block diagram of cDNA microarray image analysis [25].

4.3 Personal health monitoring device data

Sensors built into cellular phones and personal health monitoring devices such as fitness and sleep trackers, heart rate monitors and pedometers have created opportunities for mobile and online applications to analyze the data and report results. Many of these devices are designed to work wirelessly in environments where there is a high likelihood of interference. Sensors are also being placed in devices that would be considered comfortable for a consumer to wear (like at the wrist or waist) rather than at locations optimal for the detection of the intended physiological signal, a possible configuration mismatch that can cause further distortion in the collected data.

Currently, there are commercially available personal health monitoring devices capable of measuring:

- heart rate
- steps
- distance
- caloric intake
- caloric expenditure ("burning")
- elevation (climbing)
- sleep quality (stillness during sleep)
- skin temperature
- galvanic skin response (sweat, stress response).

Development continues in the market to refine and add new sensors, as well as improve the quality of the information derived from them.

4.3.1 Heart rate monitor (HRM)

Heart rate monitoring has been a popular metric applied to exertion during physical fitness, both as a determination of relative fitness level and to identify stress levels associated with benefits such as weight loss (called "target zones"). At the time of this writing, there are two basic variants of heart rate sensors available:

Chest strap. A sensor detects the voltage change over time, which is caused by the heartbeat, and detected through proximity (skin contact). The device typically wirelessly transmits data continuously to a data logging device more convenient for user reporting, such as a wrist watch, mobile device, or exercise equipment display. This technology is similar to electrocardiogram (ECG/EKG) testing, and is the source of the data for this thesis.

"Strapless". Generally considered less accurate than the chest strap, strapless measurement is typically done with LEDs and is based on detecting the arterial volume maximum over time by the amount of light reflected back to a sensing photodiode.

4.3.1.1 Heart physiology and electrocardiography

The detection is made possible by the amplification of electrical charges on the skin caused by depolarization (during heartbeat) of the heart's electrical conduction system: the variance in charge that causes contraction and relaxation of the heart muscles and results in blood flow. While the medical device uses a greater number of leads to detect and measure the cardiac waveform, the device under study herein requires only two leads, and is interested in the time between maxima in the signal.

4.3.1.2 Signal acquisition

Amplification, conversion and reporting. The cardiac signal is approximately 10 millivolts, and needs to be amplified to within the 3-5 volts range for easy detection. A variation on a theme contained within a patent demonstrates the amplification process, and a simplified circuit diagram in Figure 4.3 would amplify the signal to approximately 3.6 volts.

Figure 4.3: A generalized version of the multi-stage amplifier common to a heart rate monitoring chest strap (generalized from the design cited in [24]).

Common mode rejection ratio. Since only the detection of the maximum and the time between that and the next event are relevant, a differential amplifier as seen in Figure 4.3 could be utilized, exaggerating the difference between baseline and maximum potential.

4.3.2 Known sources of data "noise" and irregularity

The expected sources of noise are variations in galvanic skin response, lead placement due to device movement while in use, other transmitters in the same radio frequency (typically ISM bands: 902-915 MHz or 2.4-2.5 GHz), and non-cardiac muscle movement picked up by the device.

4.4 Population data

The State of Colorado is attempting a layer of transparency by making data sets, documents and other resources available to citizens via a website called the "Colorado Information Marketplace" (data.colorado.gov). A data set, "-2010 CSAP School and District Summary Results", was chosen for application of the technique herein for reference and to demonstrate breadth.

4.4.1 Scoring

Each scoring category, such as "Unsatisfactory Count", "Partially Proficient Count", etc., is an aggregate of students within a school that have scored within that category, for each of two years independently. This would be synonymous with a survey response for a particular answer taken at two different times.

4.4.2 Known sources of data "noise" and irregularity

This particular data set is subject to a wide array of psychological, sociological, linguistic, aptitudinal, intellectual and other human-based sources of noise. While the data set contains both summary and detail records, as a quality check it is likely misleading, as the summaries are likely generated directly from the detail (no enclosure is provided with the data detailing how it was created), so there can be no reasonable expectation of their independence.

5. Development

5.1 Microarray-based genomic data

5.1.1 Introduction

Procedurally, the steps of the analysis for this data set were as follows:

1. Identify the test and control files. The data analysis progressed through the experiments, each composed of at least one control and exactly one treatment data set. A driver was created to iterate through the file pairs (the "control" is referred to as a vehicle, while the "treatment" is a perturbation; see Appendix B).

2. Read the intensities. The data needed to become accessible to the platforms. For MATLAB, native drivers included in the toolboxes allowed access, with driver files provided by the vendor.

3. Perform the described procedure. The procedure is a parameterized code, making it possible to set the values for the probability that an outlier exists within the data and for the probability of false alarm. Selected from the set {0.01, 0.02, 0.03, 0.05, 0.07}, the value of 0.05 was used for both variables to produce the results enclosed.

4. Report on progress. The size and volume of the data to be analyzed required iterative reporting of progress. The intention was to divide work and run analysis in parallel, initially by assigning subsets to processors/machines, and later by more sophisticated means.

5.1.2 Domain resources

5.1.2.1 Affymetrix library files

Each of the microarrays used within the CMAP data has library files intended to communicate either specifics of the probes or the physical characteristics of the instruments themselves. These files are provided in plain text format and require no conversion.

CIF File. The "Chip Information File", also the "master file", describes grid alignment and scanning properties for the microarray. Of particular interest to the focus of this thesis are the scanning properties, as they are used to create the CEL ("intensity" or "Expression Level") files. Outliers, thresholds and other statistical properties from this file will be addressed in other sections.

CDF File. CDF stands for "Chip Description File", and describes the locations of the probe set associated with a particular gene, mRNA or control sequence on the microarray. For example, an entry for 'AFFX-PheX-M_at' would identify the number of cells used for its detection as 40, their specific locations on the microarray, the length of base pairs and the PM and MM base pair difference. This file is largely unnecessary unless the intention is to align the data analysis process to genes, mRNA or control sequences, which certainly is an approach for data quality (cell-by-cell analysis). However, features that may assist in the isolation of the background or other normalizing procedures, such as "housekeeping genes", would be indiscernible since they would be an aggregation of cells.

GIN File. The Genomic Information file contains feature and definition information relevant to a specific identifier. For example, the 'AFFX-PheX-M_at' is identified as a control sequence (rather than a gene or mRNA) for Bacillus subtilis, a bacterium commonly found in the human gastrointestinal tract.

PSI File. The Probe Set Information file is simply a list of all probe sets and their associated names (identification), in probe order.

SIF File. The Sequence Information File lists genomic sequences, used largely for cross-chip matching. An example entry, for an mRNA called '217005_at', is:

GAAAGTGCAAGGAGCCTGAAGGACCTGGCCCCTCATATGATTTG
GCCCATTTAATCCTTGCAAAGAGGGCAAGAACTGTTATTAGACC
CACTTTACAGATAGGGAAACTGAGGGCCCAGAGACACACGGCCC
AAGTGAGAGAAGGTCAGCAAGGGAGTGAGGACGACACCTGGACT
CCATCTCGTGACCAAAATGTTCGTGGCCACTGAAGGGACCCGTCT
CTGGGTGAAG

5.1.3 MATLAB versions and toolboxes

Two different versions of MATLAB were used in the initial development (versions 7 and 8.1), with the following toolboxes:

- Simulink
- Bioinformatics Toolbox
- Communications System Toolbox
- Control System Toolbox
- DSP System Toolbox
- Database Toolbox
- Fixed-Point Designer
- Image Processing Toolbox
- Optimization Toolbox
- Parallel Computing Toolbox
- Phased Array System Toolbox
- Signal Processing Toolbox
- SimBiology
- Simulink Control Design
- Statistics Toolbox
- Symbolic Math Toolbox.

A total of four computers running MATLAB were used, two of them utilizing the Parallel Computing Toolbox, to reduce the time to complete analysis on all files from approximately one week to less than a day.

5.1.4 Experimental data

The CMAP CEL files are provided in seven compressed archives; uncompressed, the total number of available intensity files is 7,056, representing both treatment and control data. The CEL files are in a proprietary binary format (Version IV, Affymetrix CEL format) that is 5.2 megabytes each, but with a tool available from Affymetrix (apt-cel-convert) they can be converted to an ASCII-based file of 13.6 megabytes. The size of the total available files is over 93 gigabytes converted and uncompressed.

5.1.5 Development

Once the probe set values were made available through conversion of the format, decisions needed to be made to guide the scope of the observations. As described in Section 4.2.2.4, there are a number of factors to consider regarding the information content of a CEL file:

- Microarrays' probe sets aggregate to genes, some of them (the so-called "housekeeping genes") considered a control intensity.
- The marking agent is not uniform for all probe sets, which means that although expression might be similar, the detected intensity of the fluorescing agent in the raw image may not be.
- The intensities, as scanned, are based on the operator's subjective ability to develop the best overall image by visual inspection. This can cause inconsistencies in even identically conditioned and treated microarrays, and likely accounts for
the favoring of summary data files like CEL files, rather than the raw images, for sharing with peers.

Collecting probe sets into a single gene as the scope for analysis (evaluating by gene rather than the entire microarray, as shown in Figure 5.1) has the advantage of being more easily stored in memory (the largest from the microarrays under study consists of 69 probe sets), but also seems to be an artificial construct within the actual microarray data. The design concept of having subsequences of a gene spread out over the microarray probes certainly has quality control implications for batches of microarrays, but in the scope of a single microarray all the probe sets are subject to ideally identical conditions and treatment. Even in the cases of "housekeeping genes", the significance of those genes could arguably be biological only.

Figure 5.1: A possible approach to microarray intensity analysis: focus the observation scope to single genes. Different genes are indicated by color in this simplified illustration, with probes scattered physically across the array, as is the case on the actual device.

The variations due to fluorescence being introduced more readily into some base pair sequences than others would seem to indicate an analysis across multiple microarrays (as shown in Figure 5.2) would be appropriate for the best chance at
normalization of a probe set. This, unfortunately, is a more useful observation at the time of experimentation, as at the analysis stage one can only utilize the data made available. With over 22,000 probe sets, it will simply have to be the hope that across the entire microarray there are sufficient elements within the observation to dampen any effect these variations have; the possibility of analysis across sets of microarrays is relegated to future work. The two technological constraints are: the image file can only capture a window of color frequencies; and the window is decided by human eyes, which is similar to the variations in fluorescence between probe sets.

Figure 5.2: Another approach, this time normalizing a probe or probe set across multiple experiments (CEL files). Indicators (a), (b) and (c) all show the same probe in three different microarray experiments, and at various intensity levels due to either variations in the fluorescent marker or the adjustments for image capture.

The scope of an observation was then concluded to be that of a single CEL file for the purposes of this thesis (the data derived from a single experiment on a microarray and its associated control). However, the alternative techniques, including
other various strategies to partition the data to reduce memory footprint, are potential extensions to this work.

5.2 Transducer-based (HRM) data

5.2.1 Introduction

Procedurally, the steps of the analysis for this data set were as follows:

1. Collect data. The data itself was generated by a wireless heart rate monitor band and recorded by a cellular-based mobile application, for a series of runs by the author. The chest strap based heart rate monitor was a Jarv® HRM-10 (FCC-ID: Q7Z12010R1), Bluetooth®-connected to an Apple® iPhone® 5 running PTechHM's HeartWorks.

2. Port the data. The app used to collect the data was not designed to share it. Fortunately, the data format was available as plain text, and was easily interpreted for content.

3. Perform the described procedure.

4. Report. Each of the 93 "runs" was independently reported and compared for consistency.

5.2.2 Domain resources

Although the calculation of heart rate is typically either a rolling average or extrapolated from a duration less than a minute, no special domain resources are necessary. The data, as provided, was tested.

5.2.3 Experimental data

Each workout has two data sets: 'Events' and 'Monitor' (see Section B.2). As shown in Figure 5.3, the stages of the run can be easily interpreted: the warm-up walk, the run itself, and the cooldown period.
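Two operations mentioned for this data set, the rolling average commonly used to compute heart rate and the linear interpolation applied across signal-loss periods for the plot in Figure 5.3, can be sketched as below. This Python sketch is illustrative only: the function names are hypothetical, samples are assumed to be evenly spaced in time, and a lost sample is assumed to be represented by None, which is not necessarily how the 'Monitor' data encodes it.

```python
def interpolate_gaps(samples):
    """Fill None entries linearly between the nearest known neighbours.

    Gaps at the very start or end have no neighbour on one side and are
    left as None.
    """
    filled = list(samples)
    known = [i for i, v in enumerate(samples) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            frac = (i - left) / span
            filled[i] = samples[left] + frac * (samples[right] - samples[left])
    return filled


def rolling_average(values, window):
    """Trailing mean over at most `window` samples (shorter at the start)."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


filled = interpolate_gaps([100, None, None, 130, None])
smoothed = rolling_average([1, 2, 3, 4], 2)
```

The trailing window is one of several reasonable choices; a centered window would smooth equally well but would need samples from the future of each point.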

Figure 5.3: An example of the data from the set; red is the measured heart rate, while the gray bars flag points where the device lost signal. For this interpretation, linear interpolation was used for signal loss periods.

5.2.4 Development

An alternative form of the original procedure (Section 3.3.1) was used as a test for an easier way to derive the thresholds $d, d_0, d_1$:

1. Consider the data array, denoted by $x_1, \ldots, x_n$, the signed differences between control and treatment data (the "control" for the heart monitor data was the resting heart rate, and the "treatment" was the data).

2. Compute $\mu$, the median, of the sequence.

3. Find the median of the sequence $(x_1 - \mu)^2, \ldots, (x_n - \mu)^2$ and express its square root as $\sigma$.

4. Compute the ratio, $c = \mu / \sigma$.

5. Form the function $f(y) = \Phi(y) + \exp\left(\frac{c^2}{2} - cy\right)\Phi(c - y)$, where $\Phi$ is the cumulative distribution function of the zero-mean, unit-variance Gaussian, and find the $y$ value $y_0$ such that $f(y_0) = (1 - \varepsilon)^{-1}$. The function $f(y)$ is monotonically decreasing from infinity to 1 as $y$ increases from $-\infty$ to $\infty$. Values of $\varepsilon$ are chosen from $\{0.01, 0.03, 0.05\}$.

6. The value $c - y_0$ equals $d$.

5.3 Population data

5.3.1 Introduction

Procedurally, the steps of the analysis for this data set were as follows:

1. Cleanse the data. The data was provided with a mixture of detail and summary information, including unusable values (summary and incomplete records were expunged from the file).

2. Port the data. To allow for more efficient and quicker analysis, subject values which are in a closed set were converted to an enumerated value (READING = 10001, WRITING = 10002, ...), and percentages were converted from integer representations to floating point values (for example, 10% to 0.1).

3. Perform the described procedure.

4. Report.

5.3.2 Domain resources

The data set is available from the Colorado Information Marketplace, and requires no additional resources to annotate.

5.3.3 Development

The alternative procedure utilized in the transducer data was also used for the population data, in part because of its small size.
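The alternative threshold procedure of Section 5.2.4 can be sketched in a few lines. This is a Python illustration under stated assumptions, not the thesis implementation (which is in MATLAB): the ratio is taken to be the median divided by the square root of the median squared deviation, the function in step 5 is taken to be built from the standard Gaussian cumulative distribution function with target value 1/(1 - epsilon), and the ratio is assumed positive so that the function decreases monotonically. The bisection solver and all names are hypothetical choices.

```python
import math
import statistics


def std_normal_cdf(z):
    """Cumulative distribution function of the zero-mean, unit-variance Gaussian."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def f(y, c):
    """Assumed form of the step 5 function; returns inf if exp() overflows."""
    try:
        return std_normal_cdf(y) + math.exp(c * c / 2.0 - c * y) * std_normal_cdf(c - y)
    except OverflowError:
        return math.inf


def median_and_scale(x):
    """Steps 2 and 3: median, and square root of the median squared deviation."""
    mu = statistics.median(x)
    sigma = math.sqrt(statistics.median((v - mu) ** 2 for v in x))
    return mu, sigma


def threshold(x, eps):
    """Steps 4 to 6: return (d, y0, c) for outlier probability eps."""
    mu, sigma = median_and_scale(x)
    c = mu / sigma                      # assumed definition of the ratio
    if c <= 0:
        raise ValueError("this sketch assumes a positive ratio c")
    target = 1.0 / (1.0 - eps)
    lo, hi = -1.0, 1.0
    while f(lo, c) <= target:           # widen until f(lo) is above the target
        lo *= 2.0
    while f(hi, c) >= target:           # widen until f(hi) is below the target
        hi *= 2.0
    for _ in range(200):                # bisection on the decreasing function
        mid = 0.5 * (lo + hi)
        if f(mid, c) > target:
            lo = mid
        else:
            hi = mid
    y0 = 0.5 * (lo + hi)
    return c - y0, y0, c


d, y0, c = threshold([1, 2, 3, 4, 5], eps=0.05)
```

Bisection is used here only because the function is monotone and a bracket is easy to find; it is one possible form of the iterative solver, and a table-driven lookup could replace std_normal_cdf where speed matters.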

6. Results

6.1 Genomic data

For each experiment, the proposed procedure outlined in Section 3.3.1 would identify suspected outliers that the applied Affymetrix algorithm missed. The treatment-control pairs (experiments) derived from an HT_HG-U133A microarray showed approximately 4% more potential outliers, while those from the more modern HG-U133A microarray were closer to 0.5%. As a sum of all experiments, for the HT_HG-U133A the Affymetrix method flagged 565,896 probes, whereas the proposed procedure identified 587,896. There was one test file, which obviously should have been deleted, for which the Affymetrix algorithm exceeded the findings of the proposed method, but it was extreme: of the 22,283 probes in the array, 16,278 were flagged (about 73%). The HG-U133A had near equivalent findings, with 411,642 found by Affymetrix and 413,517 by our method.

Figure 6.1 shows a summary of the increased suspected outlier identification by the proposed procedure over the vendor method. Within the progression of the experiments, the HG-U133A shows dramatically fewer detected additional suspected outliers, which could imply a technological improvement from the HT_HG-U133A microarray. It is likely attributable to the different Affymetrix methodologies for determining outliers, as the HT_HG-U133A assumed ranked outliers from the comparison of the test and control, where the HG-U133A was self-sufficient in flagging outliers in each, the test and the control.
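The percentages quoted above follow directly from the reported probe counts, as the short Python check below shows. Only the counts come from the text; the helper name is hypothetical.

```python
def percent_increase(baseline, proposed):
    """Relative increase of `proposed` over `baseline`, in percent."""
    return 100.0 * (proposed - baseline) / baseline


# Counts reported in the text (vendor method vs. proposed procedure).
ht_hg_u133a = percent_increase(565_896, 587_896)   # roughly 4%
hg_u133a = percent_increase(411_642, 413_517)      # roughly 0.5%

# Share of probes flagged in the single extreme test file.
extreme_file = 100.0 * 16_278 / 22_283             # roughly 73%
```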

Figure 6.1: Outliers identified by the applied technique not flagged by the native microarray algorithm. Experiment number along the x-axis, number of outlier probes along the y-axis.

6.2 Transducer-based data

The procedure successfully identified thresholds, as shown in Figure 6.2. Since the nature of the device was to suppress measurements during transmission failures, the data set is resilient to outliers at those events. However, the data itself is erratic, but bounded, and was an excellent demonstration of the ability to apply thresholds identifying the signal of the subject running without suffering low probability events at the warm-up and cooldown stages (including cases where there were multiple warm-up and cooldown periods).

Figure 6.2: Output samples for Heart Rate Monitor data, with thresholds in blue and green.

Figure 6.3: Output samples for population data, with thresholds in blue and green.

6.3 Population data

The population data (samples shown in Figure 6.3), as with the transducer analysis, did not have a comparison purpose; instead, the procedure was evaluated for its applicability. Arguably, the outliers identified were not due to incorrectly reported values, but likely indicate subsets within the data where summary totals were not consistent due to the cleansing process removing incomplete records.

7. Conclusion

This thesis applied a specific statistical robust methodology to three diverse domains to demonstrate an ability to improve the detection of outliers. The pedigree of this technique is drawn from Neyman-Pearson hypothesis testing, with the a priori constraints removed, which is powerful in certain situations where a priori probabilities are unavailable. The first and most remarkable application was to genomic data, a domain more familiar with variants of Bayesian analysis. The application is perceivably more appropriate as the a priori information typically used is not agreed upon within the microbiology discipline. It is believed that the technique will, as decision theory, identify if a "signal" is present, or whether the data is overwhelmed with "noise". The technique was then applied to a known-noisy transducer-based data set, and a common population set for contrast. It was shown that qualitative robustness is advantageous for the analysis of both the transducer and population data, and had approximately four percent greater detection of outliers on the summary-statistics genomic files despite greater applicability to unprocessed data. It is clear that microarray data, historically described as noisy, should be analyzed with robust strategies, and that qualitatively robust techniques are a clear improvement over the more routine Bayesian analysis.

7.1 Lessons learned

There are two groups of items within the category of "lessons learned": data-specific and general to the implementation of the technique. Importantly within the data-specific group is the issue of the incorrect data used for the genomic analysis. The data typically shared within this kind of research (the CEL files) are not unprocessed raw data, but are instead average intensities and standard deviations from that average. The technique would be better applied to the actual image intensities (albeit relative to a somewhat subjective color frequency and temperature range). From the perspective of implementation, the most time-consuming section of processing was in
the solving of the Cumulative Distribution Function (zero mean and unit variance) for a specific value. Three different methods were tried: an explicit solver, an iterative solver, and a table-driven value lookup based on a book of mathematical tables [4]. While the last method outperformed the other two on small-scale tests, the memory requirements of two sets of microarray data and the lookup table exceeded the available computational resources.

Parallelization was intended to be a more significant component, but because qualitatively robust techniques use the median rather than the mean it would require at least two passes for each experiment. For practical reasons, the experiments were associated with job streams, which shortened overall analysis time significantly by employing multiple processors on multiple computers. There could be significant improvement by incorporating a GPGPU system as the data are uniform in size, and it is likely that the table lookup methodology for the function would be used.

7.2 Future work

As described above, the most significant improvement or extension of this work would be to use the raw data file rather than a processed one. This file is not typically published, likely due to size constraints, but it is important to test on data that had not otherwise been manipulated. Extending further, other data sources for raw files should be tested and compared for quality.

An implementation using the stream processing capabilities of a GPGPU system would be a very pragmatic improvement. The analysis of genomic data is difficult mostly because of the volume of data, which can possibly be mitigated with parallel processing. The decision of whether the data being analyzed is valuable (contains information) is critical to the research that would ask bigger questions.

An implementation in the statistical language R would, perhaps, be worth further investigation. However, due to issues related to the Affymetrix proprietary formats, memory usage, and the ability to easily parallelize, MATLAB was chosen.
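The table-driven lookup mentioned under lessons learned can be illustrated with a small Python sketch. It is not the thesis implementation: the grid spacing of 0.01, the cutoff at 4.0, the use of linear interpolation between entries and every name below are assumptions, and the table is generated from the error function here rather than copied from the printed tables in [4].

```python
import math


def std_normal_cdf(z):
    """Direct evaluation of the zero-mean, unit-variance Gaussian CDF."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


STEP = 0.01    # grid spacing of the table (assumption)
Z_MAX = 4.0    # beyond this the CDF is treated as 1 (assumption)
TABLE = [std_normal_cdf(i * STEP) for i in range(int(round(Z_MAX / STEP)) + 1)]


def cdf_from_table(z):
    """Look up the CDF for z, interpolating linearly between table entries.

    Only non-negative arguments are stored; symmetry handles the rest.
    """
    a = abs(z)
    if a >= Z_MAX:
        p = 1.0
    else:
        pos = a / STEP
        i = min(int(pos), len(TABLE) - 2)   # guard against rounding at the edge
        frac = pos - i
        p = TABLE[i] + frac * (TABLE[i + 1] - TABLE[i])
    return p if z >= 0 else 1.0 - p


worst = max(abs(cdf_from_table(k / 100.0) - std_normal_cdf(k / 100.0))
            for k in range(-500, 501, 7))
```

With these settings the table holds 401 values, so the trade-off noted in the text is visible in miniature: a finer grid or wider range lowers the lookup error but increases the memory held alongside the microarray data.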

REFERENCES

[1] D. Amaratunga, J. Cabrera, and Z. Shkedy. Exploration and analysis of DNA microarray and other high-dimensional data, 2014.
[2] Rakesh K. Bansal and Panayota Papantoni-Kazakos. Outlier-resistant algorithms for detecting a change in a stochastic process. Information Theory, IEEE Transactions on, 35:521-535, 1989.
[3] Daniel P. Berrar, Werner Dubitzky, and Martin Granzow. A practical approach to microarray data analysis, 2003.
[4] William H. Beyer. CRC standard probability and statistics: tables and formulae, 1991.
[5] Kailash Birmiwal and P. Papantoni-Kazakos. Outlier resistant prediction for stationary processes, 1994.
[6] Hakan Delic and P. Papantoni-Kazakos. Robust decentralized detection by asymptotically many sensors. Signal Processing, 33:223-233, 1993.
[7] Hakan Delic, Panayota Papantoni-Kazakos, and Dimitri Kazakos. Fundamental structures and asymptotic performance criteria in decentralized binary hypothesis testing. Communications, IEEE Transactions on, 43(1):32-43, 1995.
[8] Frank R. Hampel, Elvezio M. Ronchetti, Peter J. Rousseeuw, and Werner A. Stahel. Robust statistics: The approach based on influence functions, 1986.
[9] G. Hardiman. Microarray innovations: Technology and experimentation, 2009.
[10] Peter J. Huber. Robust statistics, 1981.
[11] Steven M. Kay. Fundamentals of statistical signal processing: Estimation theory, 1993.
[12] Steven M. Kay. Fundamentals of statistical signal processing: Detection theory, 1993; 2013.
[13] Steven M. Kay. Fundamentals of statistical signal processing, volume III: Practical algorithm development, 2013.
[14] Dimitri Kazakos and P. Papantoni-Kazakos. Detection and estimation, 1990.
[15] Geoffrey J. McLachlan, K. A. Do, and C. Ambroise. Analyzing microarray gene expression data, 2004.

[16] Dimitrios A. Pados, Karen W. Halford, Dimitri Kazakos, and Panayota Papantoni-Kazakos. Distributed binary hypothesis testing with feedback. Systems, Man and Cybernetics, IEEE Transactions on, 25:21-42, 1995.
[17] Dimitris A. Pados, P. Papantoni-Kazakos, Demetrios Kazakos, and Achilles G. Koyiantis. On-line threshold learning for Neyman-Pearson distributed detection. Systems, Man and Cybernetics, IEEE Transactions on, 24:1519-1531, 1994.
[18] P. Papantoni-Kazakos. Some aspects of qualitative robustness in time series. In Robust and Nonlinear Time Series Analysis, pages 218-230. Springer, 1984.
[19] Lansing M. Prescott, John P. Harley, and Donald A. Klein. Microbiology, 2005.
[20] Xiaoyan A. Qu and Deepak K. Rajpal. Applications of connectivity map in drug discovery and development. Drug Discovery Today, 17-24:1289-1298, 2012.
[21] Mark Frederick Sanders and John L. Bowman. Genetic analysis: An integrated approach, 2012.
[22] Dov Stekel. Microarray bioinformatics, 2003.
[23] Harry L. Van Trees. Detection, estimation, and modulation theory, 2001.
[24] D. W. Vidrine, J. G. Kisslinger, and J. M. Brown. Heart rate monitor and method, January 2000.
[25] Wei Zhang and Ilya Shmulevich. Computational and statistical approaches to genomics, 2006.


APPENDIX A. Code

A.1 Example - MATLAB code for HT_HG-U133A microarray

code/HT_HG_U133A.m

clc
algorithm_version = 5;
algorithm_revision = 'an';
clc
fileID = fopen('cmap_instances_02.csv');
C = textscan(fileID, '%s %s %s %s %s %s %s %s %s %s %s %s %s %s %s', 'Delimiter', '|');
fclose(fileID);
for index = 1:size(C{1},1)
    the_control = C{10}{index};
    ext_start = -1;
    % If the control starts with a period, it is of the notation:
    % .H07.G08.E09.D10.B11.A12
    % Pick the corresponding letter from the test, if possible pick the
    % first, if not
    % get the first character of the control:
    if C{10}{index}(1) == '.'
        ext_start = find(C{10}{index} == C{9}{index}(length(C{9}{index})-2));
        if isempty(ext_start)
            controlfilename = strcat(C{9}{index}(1:find(C{9}{index}=='.')), C{10}{index}(2:4));
        else
            controlfilename = strcat(C{9}{index}(1:find(C{9}{index}=='.')), C{10}{index}(ext_start:ext_start+2));
        end
    end
    display([C{1}{index} ' Test: ' C{9}{index} ', Control: ' controlfilename ' on microarray ' ...
        C{8}{index} '. ' C{3}{index} ' - ' C{5}{index} ' experiment']);
    affytype = C{8}{index};
    testfilename = C{9}{index};
    celpath = '/CMAP/experiment2/';


    if strcmp(affytype, 'HG-U133A')
        cdfpath = '/CMAP/CD_HG-U133/Full/HG-U133A/LibFiles';
    end
    if strcmp(affytype, 'HT_HG-U133A')
        cdfpath = '/CMAP/HT_HG-U133A_AGCC';
    end
    % symbols and characters used for printing and display
    win_crlf = '\r\n';
    ctheta = char(952);
    csigma = char(963);
    cmean = char(956);
    cepsilon = char(949);
    cmedian = char(1019);
    cdelta = char(916);
    clambda = char(955);
    cgamma = char(947);
    cd0 = strcat('d', char(8320));
    cd1 = strcat('d', char(8321));
    display(['Beginning Version ' num2str(algorithm_version) algorithm_revision ' trial...']);
    resultsfile = fopen(strcat('Version-', num2str(algorithm_version), algorithm_revision, '_', ...
        strrep(datestr(now), ':', '')), 'w+', 'n', 'UTF-8');
    fprintf(resultsfile, ['Beginning Version ' num2str(algorithm_version) algorithm_revision ' trial...' win_crlf]);
    fprintf(resultsfile, ['Symbols used: ' win_crlf]);
    fprintf(resultsfile, [ctheta '=theta, ' csigma '=sigma, ' cmean '=mean, ' cepsilon '=epsilon, ' ...
        cmedian '=median, ' cdelta '=delta' win_crlf]);
    % Within probeset values there are 20 columns, we are interested in 7,
    % Perfect Match Intensity
    PMIntensity = 7;
    OutlierFlag = 17;
    % read in CEL and CDF files, paths are absolute; first control
    ProbeStructure_control = celintensityread(controlfilename, affytype, 'CELPath', celpath, 'CDFPath', cdfpath);
    % then the one under test


    ProbeStructure_test = celintensityread(testfilename, affytype, 'CELPath', celpath, 'CDFPath', cdfpath);
    fprintf(resultsfile, ['Using control file ' controlfilename ' and test file ' testfilename ...
        ', both of type ' affytype '.' win_crlf]);
    % get the CEL structure
    celStruct_control = affyread(strcat(celpath, controlfilename));
    celStruct = affyread(strcat(celpath, testfilename));
    % get the CDF structure
    cdfStruct = affyread(strcat(affytype, '.cdf'), cdfpath);
    microarray_control_intensities = ProbeStructure_control.PMIntensities;
    microarray_test_intensities = ProbeStructure_test.PMIntensities;
    % 1. Consider the data array, denote by x_1, ..., x_n, ... the signed differences
    % between control and treatment data (that is, include both positive and
    % negative responses to treatment)
    both = zeros(size(microarray_control_intensities,1), 1);
    for i = 1:size(microarray_control_intensities,1)
        difference = microarray_test_intensities(i) - microarray_control_intensities(i);
        if difference < 0
            both(i) = microarray_test_intensities(i) - microarray_control_intensities(i);
        end
    end
    % 2. Compute the median, θ, of the above data sequence.
    % 3. Find the median of the sequence (x_1 − θ)^2, ..., (x_n − θ)^2, ... and express
    % its squared root σ
    sigma_m = sqrt(abs(median(both)));
    % 4. Compute the ratio c = θ/σ
    c = median(both)/sigma_m;
    % 5. Form the function f(y) = Φ(y) + exp{c^2/2 − cy} Φ(c − y) and find the
    % y value y_o such that f(y_o) = (1 − ε)^-1. The function f(y) is
    % monotonically decreasing from infinity to 1, as y increases from minus
    % infinity to infinity.


    for target = {both};
        for epsilon = {.05}; % {.01, .03, .05, .07, .1};
            closest_match = 1000;
            closest_match_y = 0;
            min_diff = 1000;
            theta = abs(median(target{1}));
            sigma_m = sqrt(sum((target{1} - median(target{1})*ones(size(target{1},1),1)).^2)/size(target{1},1));
            end_iteration = 200;
            start_iteration = -200;
            iteration_value = 1;
            for y = start_iteration:iteration_value:end_iteration
                fy = (1-epsilon{1})*(normcdf((-y+theta)/sigma_m) + ...
                    exp(theta*y/sigma_m^2 - theta^2/(2*sigma_m^2))*normcdf(y/sigma_m));
                if not(isnan(fy))
                    if abs(1-fy

            display([ctheta ': ' num2str(theta) ', ' csigma ': ' num2str(sigma_m)]);
            fprintf(resultsfile, [ctheta ': ' num2str(theta) ', ' csigma ': ' num2str(sigma_m) win_crlf]);
            % 6. The value c − y_o equals d/σ
            display(['d = ' num2str(c-closest_match_y/sigma_m)]);
            fprintf(resultsfile, ['d = ' num2str(c-closest_match_y/sigma_m) win_crlf]);
            display(' --------------------------------------------------------------------------');
            fprintf(resultsfile, [' --------------------------------------------------------------------------' win_crlf]);
            for probeset = 1:ProbeStructure_control.NumProbeSets
                ps_control = probesetvalues(celStruct_control, cdfStruct, probeset);
                ps_test = probesetvalues(celStruct, cdfStruct, probeset);
                if mod(probeset, ceil(ProbeStructure_control.NumProbeSets/20)) == 0
                    display([num2str(floor(probeset/ProbeStructure_control.NumProbeSets*100)) ...
                        '% complete for this epsilon value ' cepsilon '=' num2str(epsilon{1}) '.']);
                end
                for probe = 1:size(ps_test,1)
                    outlierflag = 'false';
                    test_intensity = ps_test(probe, PMIntensity);
                    cntl_intensity = ps_control(probe, PMIntensity);
                    delta = abs(ps_test(probe, PMIntensity) - ps_control(probe, PMIntensity));
                    if ps_test(probe, OutlierFlag) == 1
                        outlierflag = 'TRUE';
                    end
                    if delta > y_0 || delta

                            num2str(ps_test(probe,PMIntensity)-ps_control(probe,PMIntensity)) '; ' cmean ': ' num2str(mean(both)) '; ' ...
                            cepsilon ': ' num2str(epsilon{1}) '; ' csigma, ': ' num2str(sigma_m) '; ' cmedian ': ' num2str(median(both)) ...
                            '; ' ctheta ': ' num2str(theta) '; ' cd1 ': ' num2str(y_1) '; ' cd0 ': ' num2str(y_0) ...
                            ' Papantoni: TRUE; Affy: ' outlierflag win_crlf]);
                        fprintf(resultsfile, ['Test 1: Is ' ctheta ' > 2d? ' ...
                            regexprep(sprintf('%i', theta > closest_match_y*2), {'1','0'}, {'TRUE','false'}) ...
                            ' d=' num2str(closest_match_y) '; ']);
                        if y_0 > y_1
                            fprintf(resultsfile, ['Test 2: 0 ' char(8804) ' ' cd0 '-' ctheta '/2 = -' cd1 '-' ctheta '/2 which is 0 ' ...
                                char(8804) ' ' num2str(y_0-theta/2) ' = ' num2str(-y_1-theta/2) '? ' ...
                                regexprep(sprintf('%i', 0<=y_0-theta/2 && y_0-theta/2==-y_1-theta/2), {'1','0'}, {'TRUE','false'}) win_crlf]);
                        else
                            fprintf(resultsfile, ['Test 2: 0 ' char(8804) ' ' cd1 '-' ctheta '/2 = -' cd0 '-' ctheta '/2 which is 0 ' ...
                                char(8804) ' ' num2str(y_1-theta/2) ' = ' num2str(-y_0-theta/2) '? ' ...
                                regexprep(sprintf('%i', 0<=y_1-theta/2 && y_1-theta/2==-y_0-theta/2), {'1','0'}, {'TRUE','false'}) win_crlf]);
                        end
                    elseif ps_test(probe, OutlierFlag) == 1
                        fprintf(resultsfile, [' ' num2str(probe) ':' num2str(probeset) ' it: ' num2str(ps_test(probe,PMIntensity)) ...
                            '; ic: ' num2str(ps_control(probe,PMIntensity)) '; ' cdelta, ': ' ...
                            num2str(ps_test(probe,PMIntensity)-ps_control(probe,PMIntensity)) '; ' cmean ': ' num2str(mean(both)) '; ' ...
                            cepsilon ': ' num2str(epsilon{1}) '; ' csigma, ': ' num2str(sigma_m) '; ' cmedian ': ' num2str(median(both)) ...
                            '; ' ctheta ': ' num2str(theta) '; ' cd1 ': ' num2str(y_1) '; ' cd0 ': ' num2str(y_0) ...
                            ' Papantoni: false; Affy: TRUE' outlierflag win_crlf]);
                    end
                end
            end
        end


    end
    display('All results complete.');
    fprintf(resultsfile, [win_crlf 'End of run.' win_crlf]);
    fclose(resultsfile);
end

A.2 Example - MATLAB code for heart rate monitor data

A.2.1 Driver

code/HeartRateMonitor_revA.m

clc
algorithm_version = 2;
algorithm_revision = 'A';
HR_epsilon = .05;
filename = '/Monitor.txt';
where = 'HRM/';
what = '/Monitor.txt';
resting_hr = 60;
thedirectory = dir('HRM/');
for index = 1:length(thedirectory)
    if length(thedirectory(index).name) > 10
        % Read in the plain text data file.
        Monitor_HRM = csvread(strcat(where, thedirectory(index).name, what));
        display(strcat(where, thedirectory(index).name, what));
        % Re-epoch the Mobile-epoch values to the beginning of this session
        for jndex = 1:size(Monitor_HRM,1)
            Monitor_HRM(jndex,1) = Monitor_HRM(jndex,1) - session;
        end
        % Step 1: Consider the data array, denoted by $x_1, \ldots, x_n$, the signed differences between
        % control and treatment.
        HeartRates = Monitor_HRM(:,2) - resting_hr;
        % Step 2: Compute the median, $\theta$, of the sequence.
        HR_theta = median(Monitor_HRM(:,2)); % median(HeartRates);
        % Step 3: Find the median of the sequence $(x_1 - \theta)^2, \ldots, (x_n - \theta)^2$ and express its squared root $\sigma$


        for kndex = 1:size(HeartRates,1)
            HR_sqdiff(kndex) = (HeartRates(kndex) - HR_theta)^2;
        end
        HR_sigma = sqrt(median(HR_sqdiff));
        % Step 4: Compute the ratio, $c = \theta/\sigma$
        c = HR_theta/HR_sigma;
        % Step 5: Form the function, $f(y) = \Phi(y) + \exp\{c^2/2 - cy\}\,\Phi(c - y)$, and find the $y$ value $y_0$ such
        % that $f(y_0) = (1 - \varepsilon)^{-1}$. The function $f(y)$ is monotonically decreasing from $\infty$ to $1$, as $y$
        % increases from $-\infty$ to $\infty$
        y_zero = fzero(@(x) papantoni_simple_d(c, HR_epsilon, x), 0);
        % Step 6: The value $c - y_0$ equals $d/\sigma$
        d_zero = (c - y_zero)*HR_sigma;
        % Step 7: plotting
        plot(Monitor_HRM(:,2), 'red'); hold on;
        plot([0 size(HeartRates,1)], [-d_zero+HR_theta+resting_hr, -d_zero+HR_theta+resting_hr], 'blue'); hold on;
        plot([0 size(HeartRates,1)], [d_zero+resting_hr, d_zero+resting_hr], 'green');
        saveas(gcf, strcat(thedirectory(index).name, '.pdf'));
        close all;
    end
end

A.2.2 FZero solver function

code/papantoni_simple_d.m

function [f_y] = papantoni_simple_d(c, epsln, y)
%PAPANTONI SIMPLE D - This is the simplified version of threshold
% Prof. Titsa Papantoni
% $f(y) = \Phi(y) + \exp\{c^2/2 - cy\}\,\Phi(c - y) - \frac{1}{1 - \varepsilon}$
f_y = normcdf(y) + exp(c^2/2 - c*y)*normcdf(c - y) - 1/(1 - epsln);
end

A.3 Example - MATLAB code for population data

code/Population_data.m

clc


algorithm_version = 1;
algorithm_revision = 'A';
Pop_epsilon = .05;
filename = 'CSAP_School_And_District_Summary_Results_2010_modified.csv';
file_subject = 1;
file_grade = 2;
file_D = 3;
file_C = 4;
file_B = 5;
file_A = 6;
file_U = 7;
file_offset_09 = 0;
file_offset_10 = 5;
reading = 10001;
writing = 10002;
mathematics = 10003;
science = 10004;
text_subjects = {'Reading', 'Writing', 'Mathematics', 'Science'};
text_grades = {'1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th'};
text_grade_letters = {'F', 'F', 'D', 'C', 'B', 'A', 'U'};
% Start grabbing data after the state totals...
Pop_data = csvread(filename, 9, 2);
for subject = reading:science
    % narrow by subject
    subject_matter = Pop_data(any(Pop_data==subject, 2), :);
    for grade = 3:10
        % narrow to the grade level
        subject_matter_grade = subject_matter(any(subject_matter==grade, 2), :);
        for grade_letter = file_D:file_U
            % narrow to the fields of interest
            subject_matter_interest = subject_matter_grade(:, [file_subject, grade_letter+file_offset_09, grade_letter+file_offset_10]);
            % stuff
            % Step 1: Consider the data array, denoted by $x_1, \ldots, x_n$, the signed differences
            % between control and treatment.


            year_diff = subject_matter_interest(:,3) - subject_matter_interest(:,2);
            % Step 2: Compute the median, $\theta$, of the sequence.
            Pop_theta = median(subject_matter_interest(:,3));
            % Step 3: Find the median of the sequence $(x_1 - \theta)^2, \ldots, (x_n - \theta)^2$ and express its squared
            % root $\sigma$
            for index = 1:size(year_diff,1)
                Pop_sqdiff(index) = (year_diff(index) - Pop_theta)^2;
            end
            Pop_sigma = sqrt(median(Pop_sqdiff));
            if Pop_sigma == 0
                break;
            end
            % Step 4: Compute the ratio, $c = \theta/\sigma$
            c = Pop_theta/Pop_sigma;
            % Step 5: Form the function, $f(y) = \Phi(y) + \exp\{c^2/2 - cy\}\,\Phi(c - y)$, and find the $y$ value $y_0$
            % such that $f(y_0) = (1 - \varepsilon)^{-1}$. The function $f(y)$ is monotonically decreasing from $\infty$
            % to $1$, as $y$ increases from $-\infty$ to $\infty$
            y_zero = fzero(@(x) papantoni_simple_d(c, Pop_epsilon, x), 0);
            % Step 6: The value $c - y_0$ equals $d/\sigma$
            d_zero = (c - y_zero)*Pop_sigma;
            % Step 7: plotting
            plot(subject_matter_interest(:,3), 'red'); hold on;
            plot([0 size(year_diff,1)], [-d_zero+Pop_theta+Pop_theta, -d_zero+Pop_theta+Pop_sigma], 'blue'); hold on;
            plot([0 size(year_diff,1)], [d_zero+Pop_theta, d_zero+Pop_theta], 'green');
            savefileas = strcat('Pop/', text_subjects(subject-10000), '-', text_grades(grade), '-', ...
                text_grade_letters(grade_letter), '.pdf');
            saveas(gcf, savefileas{:});
            close all;
        end
    end
end
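Steps 2 through 6, as they appear in the comments of the listings above, can be traced on a small example. The following is a minimal sketch under stated assumptions, not a replacement for the thesis code: the sample `x` is made up, the anonymous function `Phi` (built on `erfc`) stands in for `normcdf` so that no toolbox is needed, and the bracketing interval `[0, c + 10]` passed to `fzero` is an assumption chosen for this sample, where the listings instead start the search from the scalar 0.

```matlab
% Made-up sample of signed differences (not thesis data).
x     = [1 2 3 4 5 6 7 8 9];
epsln = 0.05;                            % same value as HR_epsilon / Pop_epsilon

% Steps 2-4: median, robust scale, and their ratio.
theta = median(x);                       % 5
sigma = sqrt(median((x - theta).^2));    % sqrt(4) = 2
c     = theta / sigma;                   % 2.5

% Standard normal CDF via erfc (assumed stand-in for normcdf).
Phi = @(z) 0.5 * erfc(-z / sqrt(2));

% Step 5: g(y) = f(y) - 1/(1 - epsln), the same quantity that
% papantoni_simple_d returns.
g = @(y) Phi(y) + exp(c^2/2 - c*y) .* Phi(c - y) - 1/(1 - epsln);

% For c > 0, f decreases monotonically from infinity to 1, so g changes
% sign exactly once; the interval below brackets that sign change here.
y_zero = fzero(g, [0, c + 10]);

% Step 6: c - y_zero equals d/sigma.
d_zero = (c - y_zero) * sigma;

fprintf('theta = %g, sigma = %g, c = %g\n', theta, sigma, c);
fprintf('y_zero = %.4f, d_zero = %.4f\n', y_zero, d_zero);
```

Because both the location and the scale are medians, replacing a single value of `x` with an extreme one moves `theta` and `sigma` only slightly, which is the property the method relies on.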


APPENDIX B. File Formats

B.1 Affymetrix file formats

B.1.0.1 Intensity files (CEL)

The output of a microarray experiment, as intensity calculations on pixels within a scanned image (specifically the output described in Section 4.2.2.4 on page 23), is stored in CEL intensity files. There are typically six sections in the file, delimited by the following headers:

[CEL] CEL file format version number.

[HEADER] Identifies the coordinate system applied to the intensity information, including the origin and the number of cells in both the X and Y dimensions. More interestingly, it also identifies the header file from the DAT file (the raw intensities as a scanned image) and the algorithm and parameters used to generate the CEL file.

[INTENSITY] Lists the coordinate of the cell, the mean intensity, the standard deviation and the number of pixels used to compute the intensity from the scanned image (DAT file).

[MASKS] Lists the coordinates of any cells the "user" has marked for omission.

[OUTLIERS] Lists the coordinates of any cells the algorithm (see [HEADER]) has identified as outliers.

[MODIFIED] Lists the coordinates and the new mean value for any cells modified by the "user". This section is not used in the CMAP data set (it contains no data) and is not included in post-version 4 CEL files from Affymetrix.
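The section layout described above can be illustrated by grouping the lines of a text-format file under their bracketed headers. This is a minimal sketch only: the cell array `cel_lines` is a made-up stand-in whose body lines are placeholders rather than real CEL content, and the names `sections` and `current` are hypothetical. The listing in A.1 reads the actual files with `celintensityread` and `affyread` instead.

```matlab
% Made-up stand-in for the lines of a text-format CEL file; the body
% lines are placeholders, not real CEL content.
cel_lines = { '[CEL]', 'placeholder a', ...
              '[HEADER]', 'placeholder b', 'placeholder c', ...
              '[INTENSITY]', 'placeholder d', ...
              '[MASKS]', ...
              '[OUTLIERS]', 'placeholder e', ...
              '[MODIFIED]' };

sections = struct();   % one field per section, holding its body lines
current  = '';         % name of the section currently being filled

for k = 1:numel(cel_lines)
    % A header is a line consisting of a bracketed word, e.g. [HEADER].
    tok = regexp(cel_lines{k}, '^\[(\w+)\]\s*$', 'tokens', 'once');
    if ~isempty(tok)
        current = tok{1};
        sections.(current) = {};
    elseif ~isempty(current)
        body = sections.(current);
        body{end+1} = cel_lines{k};
        sections.(current) = body;
    end
end

% Report how many body lines each section collected.
names = fieldnames(sections);
for k = 1:numel(names)
    fprintf('%-10s %d line(s)\n', names{k}, numel(sections.(names{k})));
end
```

An empty section, such as [MODIFIED] in the CMAP data set, simply ends up with zero body lines.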


B.1.0.2 CMAP experiment list

Available directly from the CMAP site as a companion to the CEL files, the `cmap_instances_02.xls' file contains a list of chemicals and concentrations utilized during the course of the study, including the cell line and specific microarray used.

Instance_id: Instance identification.

Batch_id: Batch identification. Examples for this section will be drawn from `instance_id=1' and `batch_id=1' of the build 02 version of the file provided by CMAP.

CMAP_name: The chemical or treatment under study; for example, metformin.

INN: Whether the chemical name is an International Nonproprietary Name rather than a trade or market name (commonly known as its generic name).

Concentration (M): The concentration under study in this experiment, such as 0.00001 moles/m^3. It is important to note that the same chemical was tried at different concentrations, as certain expression profiles occur at specific concentrations.

Duration (h): Incubation period of the experiment, in hours. All experiments used 6 hour incubation times.

Cell: The cell line used for the experiment, such as MCF7, identified within the file as "human breast epithelial adenocarcinoma cell line derived from pleural effusion (ATCC #HTB-22) cultured in DMEM supplemented with 10% fetal bovine serum and 1% penicillin-streptomycin-glutamine."

Array: Microarray used, HG-U133A.

Perturbation_scan_ID: The scan made of the experiment.


Vehicle_scan_ID: The scan made of a "control" version of the experiment. Control in this sense can be experimental data known not to have an effect in the genes or mRNA of interest, a data set known to encompass sufficient "housekeeping genes", or a more classic control. At the time of the CMAP experiments, the microarray technology was sufficiently expensive to warrant cost-saving measures that are reflected in the interpretation of control.

Scanner: Model of the scanner used.

Vehicle: The solvent used; typically water, saline, DMSO (dimethyl sulfoxide), or ethanol.

Vendor: The vendor providing the chemical or treatment.

Catalog_number: Vendor-specific catalog number.

Catalog_name: Typically the full chemical name, such as 1,1-dimethylbiguanide hydrochloride for metformin.

B.2 HRM file formats

Events.txt: For each record in the file, a time code (from the mobile device's own epoch of January 1st, 2001, in milliseconds) accompanies a message indicating the type of event. The events of interest are:
  - Data logging application started
  - No data recorded from chest strap (signal lost)
  - Reconnected to the chest strap
  - Data logging application commanded to stop listening for the chest strap
  - Data logging application exited.


There are additional superfluous fields, including what appears to be a device ID and time zone information, used by the heart rate monitoring application.

Monitor.txt: For each record in the file, the same type of time code as `Events' accompanies the beats-per-minute measured at that time. The average time between measurements was 29.729 seconds. There is one additional field that appears to indicate a device ID; as in `Events', it is unused.

B.3 Population file format

B.3.1 Experimental data

The data record is in the following format: District Number, District [Name], School Number, School [Name], Subject, Grade Level, [20]09 Total Count, [20]09 Unsatisfactory Count, [20]09 Percent Unsatisfactory, [20]09 Partially Proficient Count, [20]09 Percent Partially Proficient, [20]09 Proficient Count, [20]09 Percent Proficient, [20]09 Advanced Count, [20]09 Percent Advanced, [20]09 Percent Proficient & Advanced, [20]09 Not Scored Count, [20]09 Percent Not Scored, [20]10 Total Count, [20]10 Unsatisfactory Count, [20]10 Percent Unsatisfactory, [20]10 Partially Proficient Count, [20]10 Percent Partially Proficient, [20]10 Proficient Count, [20]10 Percent Proficient, [20]10 Advanced Count, [20]10 Percent Advanced, [20]10 Percent Proficient & Advanced, [20]10 Not Scored Count, [20]10 Percent Not Scored, and Fall [20]09 Percent Free & Reduced Lunch.
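The record layout above has a regular shape: six identifying fields, twelve measures for 2009, the same twelve for 2010, and one trailing field. As a minimal sketch, the field names can be generated and then looked up by name rather than by hard-coded column number. The helper names `per_year`, `prefix`, `field_names`, and `col` are hypothetical, the numeric `record` is made up, and the column positions apply to the format as listed here, not necessarily to the modified CSV file read in A.3.

```matlab
% Twelve per-year measures, in the order given in B.3.1.
per_year = {'Total Count', 'Unsatisfactory Count', 'Percent Unsatisfactory', ...
            'Partially Proficient Count', 'Percent Partially Proficient', ...
            'Proficient Count', 'Percent Proficient', ...
            'Advanced Count', 'Percent Advanced', ...
            'Percent Proficient & Advanced', ...
            'Not Scored Count', 'Percent Not Scored'};

% Prepend a two-digit year label to every per-year measure.
prefix = @(p) cellfun(@(s) [p ' ' s], per_year, 'UniformOutput', false);

field_names = [ {'District Number', 'District Name', 'School Number', ...
                 'School Name', 'Subject', 'Grade Level'}, ...
                prefix('09'), prefix('10'), ...
                {'Fall 09 Percent Free & Reduced Lunch'} ];

% Column position of a field, found by name.
col = @(name) find(strcmp(field_names, name));

% Made-up numeric record with only the two fields of interest filled in.
record = zeros(1, numel(field_names));
record(col('09 Percent Proficient')) = 60;
record(col('10 Percent Proficient')) = 65;

% Year-over-year difference for one measure.
year_diff = record(col('10 Percent Proficient')) - record(col('09 Percent Proficient'));
fprintf('%d fields; year-over-year difference = %g\n', numel(field_names), year_diff);
```

Looking fields up by name keeps the 2009 and 2010 columns paired correctly if the layout is later trimmed or reordered.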