A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images
Marc'Aurelio Ranzato    Yann LeCun
Courant Institute of Mathematical Sciences, New York University, New York, NY 10003
Abstract
We describe an unsupervised learning algorithm for extracting sparse and locally shift-invariant features. We also devise a principled procedure for learning hierarchies of invariant features. Each feature detector is composed of a set of trainable convolutional filters followed by a max-pooling layer over non-overlapping windows, and a point-wise sigmoid non-linearity. A second stage of more invariant features is fed with patches provided by the first-stage feature extractor, and is trained in the same way. The method is used to pre-train the first four layers of a deep convolutional network, which achieves state-of-the-art performance on the MNIST dataset of handwritten digits, with a final test error rate of 0.42%. Preliminary experiments on compression of bitonal document images show very promising results in terms of compression ratio and reconstruction error.
1. Introduction
Most computer vision systems for document analysis, as well as for general image understanding, include a feature extractor as a first pre-processing step. Principal Component Analysis and K-Means are among the best-known techniques used in these vision applications. The first problem we address in this paper is how to learn invariant features. Some methods build invariance from pre-determined features based on local histograms, such as the SIFT descriptors [8]; however, learning such features could make them more adaptive to the particular dataset in use. Other methods [11, 9, 10] achieve invariance to small shifts only indirectly, through a sub-sampling layer which is added after the training phase is complete. The method we propose integrates invariance efficiently into the learning stage itself. Each feature extractor is composed of a set of linear filters that are convolved with the input image, followed by a max-pooling layer which selects the largest values within non-overlapping windows, and by a point-wise sigmoid non-linearity. Sections 2.1 and 2.4 describe the details of the architecture used during training and the learning algorithm which allows us to learn invariant and sparse features.
The second problem we address in this paper is how to learn hierarchies of features in a principled way. Hierarchical feature extractors have already been proposed in the literature, but some of them introduce rather ad hoc learning procedures [4], propose to hard-wire Gabor filters [11, 9], or seem inefficient when dealing with deep multi-layer systems and few labelled samples [7, 6]. Several recent works have shown the advantages (in terms of speed and accuracy) of pre-training each layer of a deep network in unsupervised mode, before tuning the whole system with a gradient-based algorithm [5, 2, 10]. The present work is inspired by these methods. Sec. 2.5 describes how we can learn feature detectors layer by layer, with increasing levels of invariance and complexity, and sec. 3 demonstrates the method for the classification of handwritten numerals and for the compression of bitonal document images.
2. The System
The system we describe is derived from the Energy-Based Model for learning sparse features introduced by Ranzato et al. [10]. We extend that model by devising an architecture which produces features that are not only sparse, but also locally invariant to shifts. Moreover, we propose a hierarchical extension which learns features that correspond to larger and more complex receptive fields in input space.
2.1 Architecture
During training, the architecture of the system is composed of an encoder and a decoder, as shown in fig. 1. The encoder takes image patches Y and computes a prediction of the optimal internal representation Z, namely the code. We postpone the definition of code optimality until the discussion below.
Figure 1. The encoder-decoder architecture for unsupervised learning of features. In an invariant feature extractor, Z represents the invariant part of the code, while the variables U encode the transformation which the invariant features must undergo in order to reconstruct the input.
The decoder produces a reconstruction of Y from the input code Z. The model has an associated energy which is the weighted sum of two terms, the encoder and decoder energies. The decoder energy is the squared Euclidean distance between the output of the decoder and the input Y. The encoder energy is the squared Euclidean distance between the encoder prediction and the optimal code Z. The code Z is optimal when it minimizes the overall energy of the system: it simultaneously minimizes the reconstruction error of Y and is as close as possible to the encoder output for the given set of parameters. The difficulty with invariant feature extraction in such an architecture lies in the reconstruction: can we decode the input from its invariant representation? Unfortunately, we cannot. However, we can always augment the internal representation with the information necessary to reconstruct the input (i.e. the non-invariant part of the code) and use it for decoding. This is what we call the transformation parameters in fig. 1. The encoder extracts both the transformation parameters U and the invariant code Z. The parameters U are copied into the decoder in order to allow the reconstruction. For instance, in our shift-invariant model, Z represents what features are present in the input patch, while U represents where these features appear in the image. Generally, the transformation parameters and the decoding are necessary only during the learning phase, in order to assess whether the code preserves enough information about the input to allow accurate reconstructions. In applications where the goal is to extract invariant features, the transformation parameters and the decoder are discarded after the training phase, because the interest is only in the invariant code Z produced by the encoder.
2.2 Invariance to Shifts
In this section we present an instance of the general model described above. This model allows the extraction of locally shift-invariant features. The encoder is composed of a set of filters that are convolved with the input, and a max-pooling layer. The filter bank performs feature detection and produces a set of feature maps. How the filters are learned is explained in sec. 2.4. The max-pooling layer then operates on each feature map separately, selecting the largest value within each non-overlapping window. The resulting spatially reduced representation constitutes the locally shift-invariant code Z, while the positions of the selected largest values are the transformation parameters U. No matter where a feature appears within the input patch, the code Z stays the same; only the parameters U are affected. The parameters U are copied into the decoder, which is also composed of a set of linear filters. The reconstruction is computed by placing each code value of Z at the proper location in the decoder feature map, using the transformation parameters U obtained in the encoder, and setting all other values in the feature maps to zero. The reconstruction is simply the sum of the decoder basis functions, weighted by the feature map values, at all locations.
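The pooling/unpooling pair at the heart of this architecture can be made concrete with a minimal NumPy sketch; the function names and the flat-index encoding of the positions U are our own illustrative choices, not the paper's:

```python
import numpy as np

def max_pool_with_positions(fmap, n):
    """Max-pool one feature map over non-overlapping n x n windows.
    Returns the locally shift-invariant code Z and, for each window, the
    position of the selected maximum (the transformation parameters U)."""
    h, w = fmap.shape
    ph, pw = h // n, w // n
    z = np.zeros((ph, pw))
    u = np.zeros((ph, pw), dtype=int)
    for i in range(ph):
        for j in range(pw):
            win = fmap[i*n:(i+1)*n, j*n:(j+1)*n]
            u[i, j] = win.argmax()        # where the feature occurred
            z[i, j] = win.flat[u[i, j]]   # what value the feature had
    return z, u

def unpool(z, u, n):
    """Place each code value back at its recorded position, zeros elsewhere.
    Convolving the result with the decoder filters and summing over maps
    gives the reconstruction described in the text."""
    out = np.zeros((z.shape[0] * n, z.shape[1] * n))
    for i in range(z.shape[0]):
        for j in range(z.shape[1]):
            r, c = divmod(u[i, j], n)
            out[i*n + r, j*n + c] = z[i, j]
    return out
```

Shifting a feature within a pooling window changes only u, not z, which is exactly the local shift invariance the model exploits.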
2.3 Sparsity
Sparsity is achieved by inserting a non-linearity in front of the decoder, dubbed the Sparsifying Logistic as in [10]. This non-linearity is an adaptive logistic with a very high threshold which enforces sparsity of its output across training samples. After training the threshold is fixed, and the logistic turns into a standard logistic function (with a large bias). The Sparsifying Logistic transforms the input code vector Z into a sparse code vector Z̄ whose components lie in [0, 1]. Consider the i-th component of the code for the k-th training sample, z_i(k), with i ∈ [1..m] where m is the number of components in the code vector, and let z̄_i(k) be the corresponding output of the Sparsifying Logistic. Given two parameters η ∈ [0, 1] and β > 0, the transformation performed by this non-linearity is:

$$\bar{z}_i(k) = \frac{\eta\, e^{\beta z_i(k)}}{\zeta_i(k)}, \qquad \zeta_i(k) = \eta\, e^{\beta z_i(k)} + (1 - \eta)\, \zeta_i(k-1) \tag{1}$$
This can be seen as a kind of weighted "softmax" function over past values of the code unit. This adaptive logistic can output a large value, i.e. a value close to 1, only if the unit has undergone a long enough quiescent period. The parameter η controls the sparseness of the code by determining the length of the window over which samples are summed up; β controls the gain of the logistic function, with large values yielding quasi-binary outputs.
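As a concrete reading of eq. (1), the following NumPy sketch runs the Sparsifying Logistic over a stream of code vectors; the initialization of ζ is our assumption, as the text does not specify it:

```python
import numpy as np

def sparsifying_logistic(codes, eta=0.01, beta=1.0):
    """Apply eq. (1) along a sequence of code vectors of shape (K, m).
    zeta_i is a running weighted sum of e^{beta z_i}: eta sets the
    effective window length, beta the gain of the logistic."""
    zbar = np.zeros_like(codes, dtype=float)
    zeta = np.ones(codes.shape[1])           # zeta_i(0): assumed initialization
    for k in range(codes.shape[0]):
        num = eta * np.exp(beta * codes[k])
        zeta = num + (1.0 - eta) * zeta      # zeta_i(k)
        zbar[k] = num / zeta                 # output in (0, 1]
    return zbar
```

A unit that has fired recently keeps its ζ large and is pushed toward 0; only after a sufficiently long quiescent period can its output approach 1, which is what enforces sparsity across training samples.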
Recalling the shift-invariant model described earlier, the code Z is shift-invariant, while the transformed code Z̄ is not only shift-invariant but also sparse. This is the code used by the subsequent decoder modules that perform the weighted sum of basis functions.
2.4 Learning Algorithm
Let W_C and W_D be the trainable parameters in the encoder and decoder, respectively. These parameters are the set of filters in the encoder and the set of basis functions in the decoder. The goal of the learning algorithm is to find values of W_C and W_D that minimize the energy of the system over the training dataset. This energy is the weighted sum of the encoder energy E_C and the decoder energy E_D. Denoting the output of the encoder by Enc(Y; W_C) and the output of the decoder by Dec(Z, U; W_D), these quantities are defined as: E_C = ||Z − Enc(Y; W_C)||^2 and E_D = ||Y − Dec(Z, U; W_D)||^2.
Then we need to solve min_{W_C, W_D} min_Z (E_C + α E_D), where α was set to 1 in our experiments. Learning proceeds in an EM-like fashion with the following on-line algorithm:
1. Propagate the input through the encoder to produce a prediction Z0 of the optimal code Z, and copy the transformation parameters U into the decoder.
2. Keeping U and the parameters W_C and W_D fixed, find by gradient descent the code Z* that minimizes the energy E_C + α E_D, starting from the initial value Z0 provided by the encoder.
3. Keeping the code fixed at Z*, update the decoder parameters W_D by one step of gradient descent in the decoder energy.
4. Update the encoder parameters W_C by one step of gradient descent in the encoder energy, where Z* plays the role of target value for the encoder output.

After training, the system converges to a state where minimum-energy codes Z are predicted by the encoder in one shot, without the need for a minimization in code space, and the decoder produces good reconstructions from the encoder output.
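A toy instantiation of the four steps with linear modules, and without the transformation parameters U (which a convolutional version would copy from encoder to decoder), might look as follows; all sizes, learning rates, and the number of code-descent steps are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_code = 64, 16
W_c = 0.1 * rng.standard_normal((d_code, d_in))   # encoder filters
W_d = 0.1 * rng.standard_normal((d_in, d_code))   # decoder basis functions
alpha, lr = 1.0, 0.05

def train_step(y):
    global W_c, W_d
    z = W_c @ y                                   # 1. one-shot prediction Z0
    for _ in range(20):                           # 2. descend E_C + alpha*E_D to Z*
        grad_z = 2 * (z - W_c @ y) + alpha * 2 * W_d.T @ (W_d @ z - y)
        z -= lr * grad_z
    W_d -= lr * 2 * np.outer(W_d @ z - y, z)      # 3. decoder step on E_D
    W_c -= lr * 2 * np.outer(W_c @ y - z, y)      # 4. encoder step on E_C,
                                                  #    with Z* as the target

train_step(rng.standard_normal(d_in))
```

Note how step 2 becomes cheaper as training proceeds: once the encoder predicts near-optimal codes in one shot, the inner minimization converges almost immediately.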
2.5 Hierarchies of Locally-Invariant Features
Once the system is trained, it can be applied to larger images in order to extract locally invariant feature maps. These feature maps can be used to train another machine, which will produce features that are more invariant and more complex than the first-level features. Disregarding the effect of the filter size used in the convolution, let us assume that the input image has size p×q, that the first-level feature extractor pools over N×N neighborhoods, and that the second-level feature extractor pools over M×M neighborhoods. While the output of the first-level feature extractor, of (approximate) size p/N × q/N, is invariant within N×N max-pooling windows, the output of the second-level feature extractor is invariant within MN×MN windows in input space. Moreover, the second-level feature extractor combines many first-level feature maps into each output feature map, thereby increasing its representational power for encoding complex patterns in input space. The connection between each set of input feature maps and a single output feature map is given by a pre-determined table, which in the experiments described in sec. 3.1 is random.
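As a back-of-the-envelope illustration of this geometry (again ignoring the shrinkage due to the convolution kernels), with hypothetical sizes:

```python
p, q = 28, 28                          # assumed input image size
N, M = 2, 2                            # pooling windows of stages 1 and 2
stage1 = (p // N, q // N)              # (14, 14): invariant in 2x2 windows
stage2 = (p // (N * M), q // (N * M))  # (7, 7): invariant in 4x4 input windows
```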
Learning a hierarchy of feature extractors proceeds in sequence. Each level is trained separately and, once trained, provides the input for the next higher-level feature extractor. As a final step, a global relaxation throughout the whole system can be performed in order to fine-tune the parameters. This procedure was originally proposed by Hinton et al. [5] for training deep belief nets.
3. Experiments
We present two applications of the proposed unsupervised method for learning invariant features. In the first example we consider the MNIST dataset [1] of handwritten digits, and use the algorithm to pre-train the first layers of a deep convolutional network. After this unsupervised layer-by-layer training has yielded a hierarchical shift-invariant feature extractor, all the parameters of the network are tuned together in a supervised way, as proposed by Hinton et al. [5]. The second example shows a straightforward application of the algorithm to the compression of bitonal text images, yielding very promising results. In both experiments, η and β in the Sparsifying Logistic are fixed to 0.01 and 1, respectively. Also, a small lasso regularization term on the order of 0.001 is added to the energy loss.
3.1 Classification of the MNIST dataset
In this section we describe the training protocol of a 6-layer convolutional network trained on the MNIST dataset. The training dataset was augmented with elastically distorted digits, as discussed in Simard et al. [12]. For each training sample, 10 distorted replicas were added to the set, making the total number of training digits 660,000. We set aside 5,000 digits from the original training set for validation.
The training protocol is as follows.
Figure 2. (A) The fifty 7×7 filters learned by the sparse and shift-invariant model trained on the MNIST dataset. These filters are used to initialize the first layer of a deep convolutional network. (B) The fifty 7×7 filters in the first layer of the convolutional network after supervised training on the augmented dataset.
Figure 3. The 42 misclassified digits in the MNIST test set. The true label is shown in the upper left corner of each digit.
First, we train the first four layers of the convolutional network unsupervised on the whole original training dataset, using the hierarchical feature extractor described in the previous sections. Second, we train the top two layers supervised on the whole training dataset, using the features provided by the feature extractor. The parameters that give the minimum loss on the validation set provide the initial values for the final supervised training of the whole network, including the first four layers. Testing is done with the parameters that gave the minimum loss on the validation dataset.
In particular, the unsupervised training of the first four layers is done as follows. First, we learn the filters of the convolutional layer with the sparsifying encoder-decoder model described in sec. 2.4, trained on patches randomly extracted from training images. We learn fifty 7×7 filters that capture features that are invariant within a 2×2 neighborhood, since the max-pooling layer operates on 2×2 windows. Once training is complete, the encoder and decoder filters are frozen, and the Sparsifying Logistic is replaced by a tanh sigmoid function with a trainable bias and gain coefficient. The bias and the gain are trained with a few iterations of back-propagation through the encoder-decoder system. The rationale for relaxing the sparsity constraint is to produce representations with richer information content: while the Sparsifying Logistic drives the system to produce good filters, the quasi-binary codes it produces do not carry enough information for the later classification. Then, training images are run through this level to generate patches for the next level in the hierarchy. The second-level feature extractor (i.e., layers 3 and 4 of the convolutional net) has 1,280 filters of size 5×5 that connect 10 randomly chosen input feature maps to each output feature map. There are 128 output feature maps, which are subsequently max-pooled over 2×2 neighborhoods. This second feature extractor is trained in the same way as the first. When the feature extractor is fed with 34×34 padded digits, it produces features of size 128×5×5 that are passed to the top two layers of the convolutional network. These layers form a two-layer neural net with 200 hidden units and 10 output units. The supervised training of the top two layers, as well as of the whole network, is done by error back-propagation using a loss which is the average squared Euclidean distance between the output of the net and the target value for the input digit (a 1-of-N code of the label). The error rate on the test set is 0.42%, very close to the 0.39% record [10] for this dataset. For comparison, training the same network from random initial conditions by supervised back-propagation of gradients yields an error rate of 0.48%.
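As a sanity check on the sizes quoted above, the layer-by-layer shapes of the network can be traced as follows; the arithmetic assumes 'valid' convolutions and non-overlapping max-pooling:

```python
# All constants are taken from the text; only the shape arithmetic is ours.
x      = (1, 34, 34)                    # padded input digit
conv1  = (50, 34 - 7 + 1, 34 - 7 + 1)   # fifty 7x7 filters -> (50, 28, 28)
pool1  = (50, 28 // 2, 28 // 2)         # 2x2 max-pooling   -> (50, 14, 14)
conv2  = (128, 14 - 5 + 1, 14 - 5 + 1)  # 128 maps of 5x5   -> (128, 10, 10)
pool2  = (128, 10 // 2, 10 // 2)        # 2x2 max-pooling   -> (128, 5, 5)
n_feat = 128 * 5 * 5                    # 3200 features into the 200-10 net
```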
3.2 Compression of Text Document Images
In this section we describe how the unsupervised learning algorithm can be applied to the compression of document images. We considered the CCITT black-and-white test image number 5, which contains both text and drawings. At a resolution of 200 dpi, the image has size 2376×1728 pixels. First, we ran a connected component (CC) analysis, which revealed 1,421 components. We considered patches of size 30×30, which cover about 95% of these CCs, and built a dataset of patches with which to train the system. If a CC is bigger than 30×30 pixels, it is split into chunks of 30×30 pixels; if it is smaller, it is centered in a 30×30 patch. In total there were 3,102 patches, some of which are shown in fig. 4. We considered a (single-stage) invariant feature extractor with 256 filters of size 30×30 in both encoder and decoder. Since the input is evenly padded into a 34×34 window, the max-pooling layer operates over a 5×5 neighborhood. Encoding consists of performing the CC analysis, encoding the locations of the patches extracted from the document (whose entropy is 5.16 bits), propagating each patch through the encoder filter bank and the Sparsifying Logistic, thresholding the code in order to make it binary, and finally encoding both the code and the locations of the largest values selected by the max-pooling layer (which amounts to 13.9 bits per patch). Decoding uncompresses the bit stream, uses the decoder filters to reconstruct the codes and the transformation parameters, and places the thresholded reconstruction at the given location in the image.
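A skeleton of the per-patch encoding path might look as follows, reusing the max_pool_with_positions and sparsifying_logistic sketches from sections 2.2 and 2.3; the helper functions, the even 2-pixel padding, and the 0.5 binarization threshold are our assumptions (a 'valid' 30×30 convolution over a 34×34 input yields 5×5 maps, so a 5×5 pooling window leaves exactly one value and one position per map):

```python
import numpy as np

def pad_to_34(patch):
    """Center a 30x30 patch in an evenly padded 34x34 window."""
    out = np.zeros((34, 34))
    out[2:32, 2:32] = patch
    return out

def conv_bank(img, filters):
    """'Valid' correlation with 30x30 filters: a 34x34 input gives 5x5 maps."""
    k = filters.shape[-1]
    n = img.shape[0] - k + 1
    return np.array([[[np.sum(img[r:r+k, c:c+k] * f) for c in range(n)]
                      for r in range(n)]
                     for f in filters])

def encode_patch(patch, enc_filters, eta=0.01, beta=1.0):
    """Produce the binary code and the max locations to be entropy-coded."""
    fmaps = conv_bank(pad_to_34(patch), enc_filters)      # 256 maps, each 5x5
    pooled = [max_pool_with_positions(f, 5) for f in fmaps]
    z = np.array([zi[0, 0] for zi, _ in pooled])          # one value per map
    u = np.array([ui[0, 0] for _, ui in pooled])          # one position per map
    zbar = sparsifying_logistic(z[None, :], eta, beta)[0]
    code = (zbar > 0.5).astype(np.uint8)                  # assumed threshold
    return code, u
```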
Figure 4. LEFT PANELS: Examples of windows that are used for training and compressing the CCITT test page number 5. RIGHT PANELS: Reconstructions provided by the algorithm.
Figure 5. Detail of the CCITT test image number 5 at 200 dpi, compressed and reconstructed by our algorithm.
Ideally, the filters would be learned from a variety of images with different fonts and would be hard-wired in the decoder. Under the assumption that the decoder filters need not be encoded as well, the compression ratio is equal to 95. The percentage of reconstructed pixels whose value has been flipped is 1.35%. An example of patches that are encoded and reconstructed is shown in fig. 4, and fig. 5 shows a detail of the reconstructed image. For comparison, the state-of-the-art method for compressing such images is JB2 [3], which achieves a lossy compression ratio of 55 while producing a visually lossless result on this test image.
4. Conclusions
We have described an unsupervised method for learning hierarchies of shift-invariant features. We have applied this algorithm to pre-train the first four layers of a deep convolutional network, achieving state-of-the-art performance in the classification of handwritten digits from the MNIST dataset. An application to compression using a single-level invariant feature extractor has also been presented and shows promising results.
In this paper we have defined a novel method for learning locally shift-invariant features. Moreover, we have proposed a general and principled method for learning hierarchies of features. In our experiments we have limited ourselves to two levels of features, but one could stack as many such modules as needed in order to achieve the desired level of invariance and complexity. This method is useful for pre-training deep architectures such as convolutional neural networks. It can also be used to learn and extract features from datasets with few labelled examples, and to learn complex invariant features with the desired dimensionality.
Future work will consider other learning algorithms that train multiple levels of feature extractors together, the definition of a more efficient and hierarchical method for compression, the possibility of also learning the parameters of the Sparsifying Logistic which control the sparsity of the representation, and more exhaustive experimentation in order to better understand the conditions under which unsupervised learning is beneficial to supervised learning.
References
[1] http://yann.lecun.com/exdb/mnist/.
[2] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS. MIT Press, 2007.
[3] L. Bottou, P. Haffner, P. Howard, P. Simard, Y. Bengio, and Y. LeCun. High quality document image compression with DjVu. Journal of Electronic Imaging, 7(3):410-425, 1998.
[4] K. Fukushima and S. Miyake. Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 15(6):455-469, 1982.
[5] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-1554, 2006.
[6] F.-J. Huang and Y. LeCun. Large-scale learning with SVM and convolutional nets for generic object categorization. In CVPR. IEEE Press, 2006.
[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998.
[8] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[9] J. Mutch and D. Lowe. Multiclass object recognition with sparse, localized features. In CVPR, 2006.
[10] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In NIPS. MIT Press, 2006.
[11] T. Serre, L. Wolf, and T. Poggio. Object recognition with features inspired by visual cortex. In CVPR, 2005.
[12] P. Simard, D. Steinkraus, and J. Platt. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, 2003.