

A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images

Marc'Aurelio Ranzato, Yann LeCun
Courant Institute of Mathematical Sciences, New York University - New York, NY 10003

Abstract

We describe an unsupervised learning algorithm for extracting sparse and locally shift-invariant features. We also devise a principled procedure for learning hierarchies of invariant features. Each feature detector is composed of a set of trainable convolutional filters followed by a max-pooling layer over non-overlapping windows, and a point-wise sigmoid non-linearity. A second stage of more invariant features is fed with patches provided by the first stage feature extractor, and is trained in the same way. The method is used to pre-train the first four layers of a deep convolutional network which achieves state-of-the-art performance on the MNIST dataset of handwritten digits. The final testing error rate is equal to 0.42%. Preliminary experiments on compression of bitonal document images show very promising results in terms of compression ratio and reconstruction error.

1. Introduction

Most computer vision systems for document analysis, as well as for general image understanding, include a feature extractor as a first pre-processing step. Principal Component Analysis and K-Means are the best-known techniques used in these vision applications. The first problem we address in this paper is how we can learn invariant features. Some methods build invariance from pre-determined features based on local histograms, such as the SIFT descriptors [8]. However, learning these features could make them more adaptive to the particular dataset in use. Other methods [11, 9, 10] achieve invariance to small shifts indirectly, thanks to a sub-sampling layer which is added after the training phase is completed. The method we propose integrates invariance efficiently into the learning stage. Each feature extractor is composed of a set of linear filters that are convolved with the input image, followed by a max-pooling layer which selects the largest values within non-overlapping windows, and by a point-wise sigmoid non-linearity. Sections 2.1 and 2.4 describe the details of the architecture used during training and the learning algorithm which allows us to learn invariant and sparse features.

The second problem we address in the paper is how we can learn hierarchies of features in a principled way. Hierarchical feature extractors have already been proposed in the literature, but some of them introduce rather ad hoc learning procedures [4], propose to hard-wire Gabor filters [11, 9], or seem inefficient when dealing with deep multi-layer systems and few labelled samples [7, 6]. Several recent works have shown the advantages (in terms of speed and accuracy) of pre-training each layer of a deep network in unsupervised mode, before tuning the whole system with a gradient-based algorithm [5, 2, 10]. The present work is inspired by these methods. Sec. 2.5 describes how we can learn feature detectors layer by layer, with increasing levels of invariance and complexity, and sec. 3 demonstrates the method on the classification of handwritten numerals and the compression of bitonal document images.

2. The System

The system we describe is derived from the Energy-Based Model for learning sparse features introduced by Ranzato et al. [10]. We extend that model by devising an architecture which produces features that are not only sparse, but also locally invariant to shifts. Moreover, we propose a hierarchical extension which learns features that correspond to larger and more complex receptive fields in input space.

2.1 Architecture

During training, the architecture of the system is composed of an encoder and a decoder, as shown in fig. 1. The encoder takes image patches Y and computes a prediction of the optimal internal representation Z, namely the code. We postpone the definition of code optimality until the later discussion.

Figure 1. The encoder-decoder architecture for unsupervised learning of features. In an invariant feature extractor, Z represents the invariant part of the code, while the variables U encode the transformation which the invariant features must undergo in order to reconstruct the input.

The decoder produces a reconstruction of Y from the input code Z. The model has an associated energy which is the weighted sum of two terms, the encoder and decoder energies. The decoder energy is the squared Euclidean distance between the output of the decoder and the input Y. The encoder energy is the squared Euclidean distance between the encoder prediction and the optimal code Z. The code Z is optimal when it minimizes the overall energy of the system: it simultaneously minimizes the reconstruction error of Y and is as close as possible to the encoder output for the given set of parameters in the system. The problem with invariant feature extraction in such an architecture concerns the reconstruction: can we decode the input from its invariant representation? Unfortunately, we cannot. However, we can always augment the internal representation with the information necessary to reconstruct the input (i.e. the non-invariant part of the code) and use it for decoding. This is what we call transformation parameters in fig. 1. The encoder extracts both the transformation parameters U and the invariant code Z. The parameters U are copied into the decoder in order to allow the reconstruction. For instance, in our shift-invariant model, Z represents what features are present in the input patch while U represents where these features appear in the image. Generally, the transformation parameters and the decoding are necessary only during the learning phase, in order to assess whether the code preserves enough information from the input for accurate reconstructions. In applications where the goal is to extract invariant features, the transformation parameters and the decoder are discarded after the training phase because the interest is only in the invariant code Z produced by the encoder.

2.2 Invariance to Shifts

In this section we present an instance of the general model described above. This model allows the extraction of locally shift-invariant features. The encoder is composed of a set of filters that are convolved with the input, and a max-pooling layer. The filter bank performs feature detection and produces a set of feature maps. How the filters are learned is explained in sec. 2.4. The following max-pooling layer operates on each feature map separately, and selects the largest values in non-overlapping windows. The resulting spatially reduced representation constitutes the locally shift-invariant code Z, while the positions of the selected largest values are the transformation parameters U. No matter where the features appear in the input patch, the code Z will always be the same and only the parameters U will be affected. The parameters U are copied into the decoder, which is also composed of a set of linear filters. The reconstruction is computed by placing each code value of Z at the proper location in the decoder feature map, using the transformation parameters U obtained in the encoder, and setting all other values in the feature maps to zero. The reconstruction is simply the sum of the decoder basis functions weighted by the feature map values at all locations.
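The max-pooling-with-positions step and the corresponding reconstruction can be sketched as follows. This is only a minimal illustration in PyTorch, not the authors' implementation: the tensor shapes, the pooling size and the filter variables (enc_filters, dec_filters) are assumptions made for the example, and the sparsifying non-linearity of sec. 2.3 is omitted.

```python
import torch
import torch.nn.functional as F

def encode(x, enc_filters, pool=2):
    # x: (1, 1, H, W) input patch; enc_filters: (n_maps, 1, k, k) trainable filters
    maps = F.conv2d(x, enc_filters)                        # filter bank -> feature maps
    z, u = F.max_pool2d(maps, pool, return_indices=True)   # Z: locally shift-invariant code, U: argmax positions
    return z, u, maps.shape[-2:]

def decode(z, u, dec_filters, map_size, pool=2):
    # Place each code value back at its recorded position (zeros elsewhere), then
    # reconstruct as a sum of decoder basis functions weighted by those values.
    sparse_maps = F.max_unpool2d(z, u, pool, output_size=map_size)
    return F.conv_transpose2d(sparse_maps, dec_filters)
```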

2.3 Sparsity

Sparsity is achieved by inserting a non-linearity in front of the decoder, dubbed Sparsifying Logistic as in [10]. This non-linearity is an adaptive logistic with a very high threshold which enforces sparsity of its output across training samples. After training the threshold is fixed, and the logistic turns into a standard logistic function (with a large bias). The Sparsifying Logistic module transforms the code vector Z into a sparse vector Z̄ with positive components in [0, 1]. Let us consider the k-th training sample and the i-th component of the code, z_i(k), with i ∈ [1..m] where m is the number of components in the code vector. Let z̄_i(k) be its corresponding output after the Sparsifying Logistic. Given two parameters η ∈ [0, 1] and β > 0, the transformation performed by this non-linearity is given by:

    z̄_i(k) = η e^{β z_i(k)} / ζ_i(k),   with   ζ_i(k) = η e^{β z_i(k)} + (1 − η) ζ_i(k−1)     (1)

This can be seen as a kind of weighted "softmax" function over past values of the code unit. This adaptive logistic can output a large value, i.e. a value close to 1, only if the unit has undergone a long enough quiescent period. The parameter η controls the sparseness of the code by determining the length of the window over which samples are summed up. β controls the gain of the logistic function, with large values yielding quasi-binary outputs.

Recalling the shift-invariant model described earlier, the code Z is shift-invariant, while the transformed code Z̄ will be not only shift-invariant but also sparse. This is the code that is used by the following decoder modules that perform the weighted sum of basis functions.
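As an illustration of eq. (1), the following is a minimal sketch of the Sparsifying Logistic in Python. The initial value of the running denominator ζ and the default values of η and β (taken from sec. 3) are assumptions made for this example.

```python
import numpy as np

def sparsifying_logistic(codes, eta=0.01, beta=1.0):
    """codes: array of shape (K, m), one code vector z(k) per training sample k."""
    zeta = np.ones(codes.shape[1])             # assumed initial value of the running denominator
    out = np.empty(codes.shape, dtype=float)
    for k, z in enumerate(codes):
        num = eta * np.exp(beta * z)           # eta * exp(beta * z_i(k))
        zeta = num + (1.0 - eta) * zeta        # zeta_i(k) = num + (1 - eta) * zeta_i(k-1)
        out[k] = num / zeta                    # zbar_i(k), in (0, 1]
    return out
```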

2.4 Learning Algorithm

Let W_C and W_D be the trainable parameters in the encoder and decoder, respectively. These parameters are the set of filters in the encoder, and the set of basis functions in the decoder. The goal of the learning algorithm is to find values of W_C and W_D that minimize the energy of the system over the training dataset. This energy is the weighted sum of the encoder energy E_C and the decoder energy E_D. Denoting the output of the encoder by Enc(Y; W_C) and the output of the decoder by Dec(Z, U; W_D), these quantities are defined as E_C = ||Z − Enc(Y; W_C)||^2 and E_D = ||Y − Dec(Z, U; W_D)||^2.

We then need to solve min_{W_C, W_D} min_Z (E_C + α E_D), where α has been set to 1 in our experiments. Learning proceeds in an EM-like fashion with the following on-line algorithm:

1. Propagate the input through the encoder to produce a prediction Z0 of the optimal code Z, and copy the transformation parameters U into the decoder.

2. Keeping U and the parameters W_C and W_D fixed, find by gradient descent the code Z* that minimizes the energy E_C + α E_D, starting from the initial value Z0 provided by the encoder.

3. Keeping the code fixed at Z*, update the decoder parameters W_D by one step of gradient descent on the decoder energy.

4. Update the encoder parameters W_C by one step of gradient descent on the encoder energy, where Z* plays the role of target value for the encoder output.

After training, the system converges to a state where minimum-energy codes Z are predicted by the encoder in one shot, without the need for a minimization in code space, and the decoder produces good reconstructions from the encoder output.
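A compact sketch of one on-line update is given below. It only illustrates the four steps above using PyTorch autograd; the encoder/decoder interfaces, the number of code-descent iterations, and the learning rates are assumptions, not values from the paper.

```python
import torch

def train_step(y, encoder, decoder, opt_enc, opt_dec, alpha=1.0, code_steps=10, code_lr=0.1):
    # 1. one-shot prediction of the code Z0; U comes from the encoder's max-pooling
    with torch.no_grad():
        z0, u = encoder(y)
    # 2. with U, W_C and W_D fixed, descend on E_C + alpha * E_D to find Z*
    z = z0.clone().requires_grad_(True)
    for _ in range(code_steps):
        energy = ((z - z0) ** 2).sum() + alpha * ((y - decoder(z, u)) ** 2).sum()
        (grad,) = torch.autograd.grad(energy, z)
        z = (z - code_lr * grad).detach().requires_grad_(True)
    z_star = z.detach()
    # 3. one gradient step on the decoder energy E_D with the code clamped at Z*
    opt_dec.zero_grad()
    ((y - decoder(z_star, u)) ** 2).sum().backward()
    opt_dec.step()
    # 4. one gradient step on the encoder energy E_C, with Z* as target for the encoder output
    opt_enc.zero_grad()
    z_pred, _ = encoder(y)
    ((z_star - z_pred) ** 2).sum().backward()
    opt_enc.step()
```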

2.5 Hierarchies of Locally-Invariant Features

Once the system is trained, it can be applied to larger images in order to extract locally invariant feature maps. These feature maps can be used to train another machine which will produce features that are more invariant and more complex than the first-level features. Disregarding the effect of the filter size used in the convolution, let us assume that the input image has size p x q, that the first-level feature extractor pools over N x N neighborhoods, and that the second-level feature extractor pools over an M x M neighborhood. While the output of the first-level feature extractor, of (approximate) size p/N x q/N, is invariant within N x N max-pooling windows, the output of the second-level feature extractor is invariant within MN x MN windows in input space. Moreover, the second-level feature extractor combines many first-level feature maps into each output feature map, thereby increasing its representational power for encoding complex patterns in input space. The connection between each set of input feature maps and a single output feature map is given by a pre-determined table, which in the experiments described in sec. 3.1 is random.

Learning a hierarchy of feature extractors proceeds in sequence. Each level is trained separately and, when trained, it provides the input for the next higher-level feature extractor. As a final step, a global relaxation throughout the whole system can be done in order to fine-tune the parameters. This procedure was originally proposed by Hinton et al. [5] for training deep belief nets.
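The random connection table mentioned above can be built as in the small sketch below; the specific numbers (50 first-level maps, 128 second-level maps, fan-in of 10) are those used in sec. 3.1, and the level-by-level training order is summarized in the comments.

```python
import numpy as np

def random_connection_table(n_in_maps, n_out_maps, fan_in, seed=0):
    # Pre-determined (here: random) table assigning 'fan_in' first-level feature maps
    # to each output feature map of the second-level extractor.
    rng = np.random.default_rng(seed)
    return [rng.choice(n_in_maps, size=fan_in, replace=False) for _ in range(n_out_maps)]

# Training proceeds level by level: train level 1 on raw patches, freeze it, run the
# images through it, then train level 2 on the resulting feature maps, e.g.
# table = random_connection_table(n_in_maps=50, n_out_maps=128, fan_in=10)
```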

3. Experiments

We present two applications of the proposed unsupervised method for learning invariant features. In the first example we consider the MNIST dataset [1] of handwritten digits, and we use the algorithm to pre-train the first layers of a deep convolutional network. After this unsupervised layer-by-layer training has yielded a hierarchical shift-invariant feature extractor, all the parameters of the network are tuned together in a supervised way, as proposed by Hinton et al. [5]. The second example shows a straightforward application of this algorithm to the compression of bitonal text images, yielding very promising results. In both experiments, η and β in the Sparsifying Logistic are fixed to 0.01 and 1, respectively. Also, a small lasso regularization term of the order of 0.001 is added to the energy loss.

3.1 Classification of the MNIST Dataset

In this section we describe the training protocol of a 6-layer convolutional network trained on the MNIST dataset. The training set was augmented with elastically distorted digits, as discussed in Simard et al. [12]. For each training sample, 10 distorted replicas were added to the set, bringing the total number of training digits to 660,000. For validation, we set aside 5,000 digits from the original training set.

The training protocol is the following.


Figure 2. (A) The fifty 7x7 filters learned by the sparse and shift-invariant model trained on the MNIST dataset. These filters are used to initialize the first layer of a deep convolutional network. (B) The fifty 7x7 filters in the first layer of the convolutional network after supervised training on the augmented dataset.

Figure 3. The 42 misclassified digits in the MNIST test set. The true label is shown in the upper left corner of each digit.

First, we train the first four layers of the convolutional network unsupervised, on the whole original training set, using the hierarchical feature extractor described in the previous sections. Second, we train the top two layers supervised, on the whole training set, using the features provided by the feature extractor. The parameters that give the minimum loss on the validation set are then used as the initial values for the final supervised training of the whole network, including the first four layers. Testing is done with the parameters that give the minimum loss on the validation set.

In particular, the unsupervised training of the first four layers is done as follows. First, we learn the filters in the convolutional layer with the sparsifying encoder-decoder model described in sec. 2.4, trained on patches randomly extracted from the training images. We learn fifty 7x7 filters that capture features invariant within a 2x2 neighborhood, since the max-pooling layer uses 2x2 windows. Once training is complete, the encoder and decoder filters are frozen, and the sparsifying logistic is replaced by a tanh sigmoid function with a trainable bias and gain coefficient. The bias and the gain are trained with a few iterations of back-propagation through the encoder-decoder system. The rationale for relaxing the sparsity constraint is to produce representations with a richer information content: while the sparsifying logistic drives the system to produce good filters, the quasi-binary codes it produces do not carry enough information for the later classification.

Then, the training images are run through this level to generate patches for the next level in the hierarchy. The second-level feature extractor (i.e. layers 3 and 4 of the convolutional net) has 1,280 filters of size 5x5 that connect 10 randomly chosen input feature maps to each output feature map. There are 128 output feature maps, which are subsequently max-pooled over 2x2 neighborhoods. This second feature extractor is trained in the same way as before. When the feature extractor is fed with 34x34 padded digits, it produces features of size 128x5x5, which are given to the top two layers of the convolutional network. These layers form a two-layer neural net with 200 hidden units and 10 output units. The supervised training of the top two layers, as well as of the whole network, is done by error back-propagation using a loss which is the average squared Euclidean distance between the output of the net and the target value for the input digit (a 1-of-N code of the label). The error rate on the test set is 0.42%, very close to the 0.39% record [10] for this dataset. For comparison, training the same network from random initial conditions by supervised back-propagation of gradients yields an error rate of 0.48%.
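For concreteness, the layer sizes described above can be assembled as in the following sketch. This is not the authors' code: the random 10-of-50 connectivity of the second convolution is replaced by full connectivity, and a plain tanh stands in for the tanh with trainable bias and gain mentioned above.

```python
import torch.nn as nn

# Sketch of the 6-layer network trained on 34x34 padded MNIST digits.
net = nn.Sequential(
    nn.Conv2d(1, 50, kernel_size=7),          # layer 1: 50 7x7 filters -> 50 x 28x28 maps
    nn.MaxPool2d(2), nn.Tanh(),               # layer 2: 2x2 max-pooling -> 50 x 14x14
    nn.Conv2d(50, 128, kernel_size=5),        # layer 3: 5x5 filters, 128 output maps -> 128 x 10x10
    nn.MaxPool2d(2), nn.Tanh(),               # layer 4: 2x2 max-pooling -> 128 x 5x5 features
    nn.Flatten(),
    nn.Linear(128 * 5 * 5, 200), nn.Tanh(),   # layer 5: 200 hidden units
    nn.Linear(200, 10),                       # layer 6: 10 outputs, trained with MSE on a 1-of-N target
)
```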

3.2 Compression of Text Document Images

In this section we describe how we can apply the unsupervised learning algorithm to the compression of document images. We considered the CCITT black-and-white test image number 5, which contains both text and drawings. At a resolution of 200 dpi, the image has size 2376x1728 pixels. First, we ran a connected component (CC) analysis which revealed 1,421 components. We considered patches of size 30x30, which cover about 95% of these CC's, and we built a dataset of patches that we used to train the system. If a CC is bigger than 30x30 pixels, it is split into chunks of 30x30 pixels. If it is smaller, it is centered in a 30x30 patch. In total there were 3,102 patches, some of which are shown in fig. 5. We considered a (single-stage) invariant feature extractor with 256 30x30 filters in both encoder and decoder. Since the input is evenly padded in a 34x34 window, the max-pooling layer operates in a 5x5 neighborhood. Encoding consists of performing the CC analysis, encoding the locations of the patches that are extracted from the document (whose entropy is 5.16 bits), propagating each patch through the encoder filter bank and the Sparsifying Logistic, thresholding the code in order to make it binary, and finally encoding both the code and the locations of the largest values selected by the max-pooling layer (which amounts to 13.9 bits per patch). Decoding has to uncompress the bit stream, use the decoder filters to reconstruct the codes and the transformation parameters, and place the thresholded reconstruction at the given location in the image.

Figure 4. Left panels: examples of windows used for training and compressing CCITT test page number 5. Right panels: reconstructions provided by the algorithm.

Figure 5. Detail of CCITT test image number 5 at 200 dpi, compressed and reconstructed by our algorithm.

Ideally, the filters would be learned from a variety of images with different fonts and would be hard-wired in the decoder. Under the assumption that the decoder filters do not need to be encoded as well, the compression ratio would be equal to 95. The percentage of reconstructed pixels whose value has been flipped is 1.35%. Examples of patches that are encoded and reconstructed are shown in fig. 4, and fig. 5 shows a detail of the reconstructed image. For comparison, the state-of-the-art method for compressing such images is JB2 [3], which achieves a lossy compression ratio of 55 and produces a visually lossless result on this test image.
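A back-of-the-envelope check of that ratio, under the assumption that the 13.9 bits per patch dominate the bit budget (the exact accounting of the patch-location bits is not detailed above), is:

```python
raw_bits = 2376 * 1728                          # bitonal 200 dpi image: one bit per pixel
patches = 3102
bits_per_patch = 13.9                           # thresholded code + max-pooling locations
print(raw_bits / (patches * bits_per_patch))    # ~95, consistent with the reported ratio
```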

4. Conclusions

We have described an unsupervised method for learning hierarchies of shift-invariant features. We have applied this algorithm to pre-train the first four layers of a deep convolutional network, achieving state-of-the-art performance in the classification of handwritten digits from the MNIST dataset. An application to compression using a single-level invariant feature extractor has also been presented and shows promising results.

In this paper we have defined a novel method for learning locally shift-invariant features. Moreover, we have proposed a general and principled method for learning hierarchies of features. In our experiments we have limited ourselves to two layers of features, but one could stack as many such modules as needed in order to achieve the desired level of invariance and complexity. This method is useful to pre-train deep architectures such as convolutional neural networks. It can also be used to learn and extract features from datasets with few labelled examples, and to learn complex invariant features with the desired dimensionality.

Future work will consider other learning algorithms that train multiple levels of feature extractors together, the definition of a more efficient, hierarchical method for compression, the possibility of also learning the parameters of the Sparsifying Logistic that control the sparsity of the representation, and more exhaustive experimentation in order to better understand under which conditions unsupervised learning is beneficial to supervised learning.

References

[1] http://yann.lecun.com/exdb/mnist/.

[2] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS. MIT Press, 2007.

[3] L. Bottou, P. Haffner, P. Howard, P. Simard, Y. Bengio, and Y. LeCun. High quality document image compression with DjVu. Journal of Electronic Imaging, 7(3):410-425, 1998.

[4] K. Fukushima and S. Miyake. Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 15(6):455-469, 1982.

[5] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-1554, 2006.

[6] F.-J. Huang and Y. LeCun. Large-scale learning with SVM and convolutional nets for generic object categorization. In CVPR. IEEE Press, 2006.

[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998.

[8] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.

[9] J. Mutch and D. Lowe. Multiclass object recognition with sparse, localized features. In CVPR, 2006.

[10] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In NIPS. MIT Press, 2006.

[11] T. Serre, L. Wolf, and T. Poggio. Object recognition with features inspired by visual cortex. In CVPR, 2005.

[12] P. Simard, D. Steinkraus, and J. Platt. Best practices for convolutional neural networks. In ICDAR, 2003.
