A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images
Marc'Aurelio Ranzato    Yann LeCun
Courant Institute of Mathematical Sciences, New York University, New York, NY 10003
Abstract
We describe an unsupervised learning algorithm for extracting sparse and locally shift-invariant features. We also devise a principled procedure for learning hierarchies of invariant features. Each feature detector is composed of a set of trainable convolutional filters followed by a max-pooling layer over non-overlapping windows, and a point-wise sigmoid non-linearity. A second stage of more invariant features is fed with patches provided by the first-stage feature extractor, and is trained in the same way. The method is used to pre-train the first four layers of a deep convolutional network, which achieves state-of-the-art performance on the MNIST dataset of handwritten digits, with a final test error rate of 0.42%. Preliminary experiments on compression of bitonal document images show very promising results in terms of compression ratio and reconstruction error.
1. Introduction
Most computer vision systems for document analysis, as well as for general image understanding, include a feature extractor as a first pre-processing step. Principal Component Analysis and K-Means are among the best-known techniques used in these vision applications. The first problem we address in this paper is how to learn invariant features. Some methods build invariance from pre-determined features based on local histograms, such as the SIFT descriptors [8]; however, learning such features could make them more adaptive to the particular dataset in use. Other methods [11, 9, 10] achieve invariance to small shifts only indirectly, through a sub-sampling layer which is added after the training phase is complete. The method we propose integrates invariance efficiently into the learning stage itself. Each feature extractor is composed of a set of linear filters that are convolved with the input image, followed by a max-pooling layer which selects the largest values within non-overlapping windows, and by a point-wise sigmoid non-linearity. Sections 2.1 and 2.4 describe the details of the architecture used during training and the learning algorithm which allows us to learn invariant and sparse features.
The second problem we address in this paper is how to learn hierarchies of features in a principled way. Hierarchical feature extractors have already been proposed in the literature, but some of them introduce rather ad hoc learning procedures [4], propose to hard-wire Gabor filters [11, 9], or seem inefficient when dealing with deep multi-layer systems and few labelled samples [7, 6]. Several recent works have shown the advantages (in terms of speed and accuracy) of pre-training each layer of a deep network in unsupervised mode, before tuning the whole system with a gradient-based algorithm [5, 2, 10]. The present work is inspired by these methods. Sec. 2.5 describes how we can learn feature detectors layer by layer, with increasing levels of invariance and complexity, and sec. 3 demonstrates the method for the classification of handwritten numerals and for the compression of bitonal document images.
2. The System
The system we describe is derived from the Energy-Based Model for learning sparse features introduced by Ranzato et al. [10]. We extend that model by devising an architecture which produces features that are not only sparse, but also locally invariant to shifts. Moreover, we propose a hierarchical extension which learns features that correspond to larger and more complex receptive fields in input space.
2.1 Architecture
During training, the architecture of the system is composed of an encoder and a decoder, as shown in fig. 1. The encoder takes image patches Y and computes a prediction of the optimal internal representation Z, namely the code. We postpone the definition of code optimality until the discussion below.
Figure 1. The encoder-decoder architecture for unsupervised learning of features. In an invariant feature extractor, Z represents the invariant part of the code, while the variables U encode the transformation which the invariant features must undergo in order to reconstruct the input.
The decoder produces a reconstruction of Y from the input code Z. The model has an associated energy which is the weighted sum of two terms, the encoder and decoder energies. The decoder energy is the squared Euclidean distance between the output of the decoder and the input Y. The encoder energy is the squared Euclidean distance between the encoder prediction and the optimal code Z. The code Z is optimal when it minimizes the overall energy of the system: it simultaneously minimizes the reconstruction error of Y and is as close as possible to the encoder output for the given set of parameters. The difficulty with invariant feature extraction in such an architecture lies in the reconstruction: can we decode the input from its invariant representation? Unfortunately, we cannot. However, we can always augment the internal representation with the information necessary to reconstruct the input (i.e. the non-invariant part of the code) and use it for decoding. This is what we call the transformation parameters in fig. 1. The encoder extracts both the transformation parameters U and the invariant code Z. The parameters U are copied into the decoder in order to allow the reconstruction. For instance, in our shift-invariant model, Z represents what features are present in the input patch, while U represents where these features appear in the image. Generally, the transformation parameters and the decoding are necessary only during the learning phase, in order to assess whether the code preserves enough information about the input to allow accurate reconstructions. In applications where the goal is to extract invariant features, the transformation parameters and the decoder are discarded after the training phase, because the interest is only in the invariant code Z produced by the encoder.
2.2 Invariance to Shifts
In this section we present an instance of the general model described above. This model allows the extraction of locally shift-invariant features. The encoder is composed of a set of filters that are convolved with the input, and a max-pooling layer. The filter bank performs feature detection and produces a set of feature maps. How the filters are learned is explained in sec. 2.4. The max-pooling layer then operates on each feature map separately, selecting the largest value within each non-overlapping window. The resulting spatially reduced representation constitutes the locally shift-invariant code Z, while the positions of the selected largest values are the transformation parameters U. No matter where a feature appears within the input patch, the code Z stays the same; only the parameters U are affected. The parameters U are copied into the decoder, which is also composed of a set of linear filters. The reconstruction is computed by placing each code value of Z at the proper location in the decoder feature map, using the transformation parameters U obtained in the encoder, and setting all other values in the feature maps to zero. The reconstruction is simply the sum of the decoder basis functions, weighted by the feature map values, at all locations.
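The pooling/unpooling pair at the heart of this architecture can be made concrete with a minimal NumPy sketch; the function names and the flat-index encoding of the positions U are our own illustrative choices, not the paper's:

```python
import numpy as np

def max_pool_with_positions(fmap, n):
    """Max-pool one feature map over non-overlapping n x n windows.
    Returns the locally shift-invariant code Z and, for each window, the
    position of the selected maximum (the transformation parameters U)."""
    h, w = fmap.shape
    ph, pw = h // n, w // n
    z = np.zeros((ph, pw))
    u = np.zeros((ph, pw), dtype=int)
    for i in range(ph):
        for j in range(pw):
            win = fmap[i*n:(i+1)*n, j*n:(j+1)*n]
            u[i, j] = win.argmax()        # where the feature occurred
            z[i, j] = win.flat[u[i, j]]   # what value the feature had
    return z, u

def unpool(z, u, n):
    """Place each code value back at its recorded position, zeros elsewhere.
    Convolving the result with the decoder filters and summing over maps
    gives the reconstruction described in the text."""
    out = np.zeros((z.shape[0] * n, z.shape[1] * n))
    for i in range(z.shape[0]):
        for j in range(z.shape[1]):
            r, c = divmod(u[i, j], n)
            out[i*n + r, j*n + c] = z[i, j]
    return out
```

Shifting a feature within a pooling window changes only u, not z, which is exactly the local shift invariance the model exploits.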
2.3 Sparsity
Sparsity is achieved by inserting a non-linearity in front of the decoder, dubbed the Sparsifying Logistic as in [10]. This non-linearity is an adaptive logistic with a very high threshold which enforces sparsity of its output across training samples. After training the threshold is fixed, and the logistic turns into a standard logistic function (with a large bias). The Sparsifying Logistic transforms the input code vector Z into a sparse code vector Z̄ whose components lie in [0, 1]. Consider the i-th component of the code for the k-th training sample, z_i(k), with i ∈ [1..m] where m is the number of components in the code vector, and let z̄_i(k) be the corresponding output of the Sparsifying Logistic. Given two parameters η ∈ [0, 1] and β > 0, the transformation performed by this non-linearity is:

$$\bar{z}_i(k) = \frac{\eta\, e^{\beta z_i(k)}}{\zeta_i(k)}, \qquad \zeta_i(k) = \eta\, e^{\beta z_i(k)} + (1 - \eta)\, \zeta_i(k-1) \tag{1}$$
This can be seen as a kind of weighted "softmax" function over past values of the code unit. This adaptive logistic can output a large value, i.e. a value close to 1, only if the unit has undergone a long enough quiescent period. The parameter η controls the sparseness of the code by determining the length of the window over which samples are summed up; β controls the gain of the logistic function, with large values yielding quasi-binary outputs.
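As a concrete reading of eq. (1), the following NumPy sketch runs the Sparsifying Logistic over a stream of code vectors; the initialization of ζ is our assumption, as the text does not specify it:

```python
import numpy as np

def sparsifying_logistic(codes, eta=0.01, beta=1.0):
    """Apply eq. (1) along a sequence of code vectors of shape (K, m).
    zeta_i is a running weighted sum of e^{beta z_i}: eta sets the
    effective window length, beta the gain of the logistic."""
    zbar = np.zeros_like(codes, dtype=float)
    zeta = np.ones(codes.shape[1])           # zeta_i(0): assumed initialization
    for k in range(codes.shape[0]):
        num = eta * np.exp(beta * codes[k])
        zeta = num + (1.0 - eta) * zeta      # zeta_i(k)
        zbar[k] = num / zeta                 # output in (0, 1]
    return zbar
```

A unit that has fired recently keeps its ζ large and is pushed toward 0; only after a sufficiently long quiescent period can its output approach 1, which is what enforces sparsity across training samples.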
Recalling the shift-invariant model described earlier, the code Z is shift-invariant, while the transformed code Z̄ is not only shift-invariant but also sparse. This is the code used by the subsequent decoder modules that perform the weighted sum of basis functions.
2.4 Learning Algorithm
Let W_C and W_D be the trainable parameters in the encoder and decoder, respectively. These parameters are the set of filters in the encoder and the set of basis functions in the decoder. The goal of the learning algorithm is to find values of W_C and W_D that minimize the energy of the system over the training dataset. This energy is the weighted sum of the encoder energy E_C and the decoder energy E_D. Denoting the output of the encoder by Enc(Y; W_C) and the output of the decoder by Dec(Z, U; W_D), these quantities are defined as: E_C = ||Z − Enc(Y; W_C)||^2 and E_D = ||Y − Dec(Z, U; W_D)||^2.
Then we need to solve min_{W_C, W_D} min_Z (E_C + α E_D), where α was set to 1 in our experiments. Learning proceeds in an EM-like fashion with the following on-line algorithm:
1. Propagate the input through the encoder to produce a prediction Z0 of the optimal code Z, and copy the transformation parameters U into the decoder.
2. Keeping U and the parameters W_C and W_D fixed, find by gradient descent the code Z* that minimizes the energy E_C + α E_D, starting from the initial value Z0 provided by the encoder.
3. Keeping the code fixed at Z*, update the decoder parameters W_D by one step of gradient descent in the decoder energy.
4. Update the encoder parameters W_C by one step of gradient descent in the encoder energy, where Z* plays the role of target value for the encoder output.

After training, the system converges to a state where minimum-energy codes Z are predicted by the encoder in one shot, without the need for a minimization in code space, and the decoder produces good reconstructions from the encoder output.
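A toy instantiation of the four steps with linear modules, and without the transformation parameters U (which a convolutional version would copy from encoder to decoder), might look as follows; all sizes, learning rates, and the number of code-descent steps are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_code = 64, 16
W_c = 0.1 * rng.standard_normal((d_code, d_in))   # encoder filters
W_d = 0.1 * rng.standard_normal((d_in, d_code))   # decoder basis functions
alpha, lr = 1.0, 0.05

def train_step(y):
    global W_c, W_d
    z = W_c @ y                                   # 1. one-shot prediction Z0
    for _ in range(20):                           # 2. descend E_C + alpha*E_D to Z*
        grad_z = 2 * (z - W_c @ y) + alpha * 2 * W_d.T @ (W_d @ z - y)
        z -= lr * grad_z
    W_d -= lr * 2 * np.outer(W_d @ z - y, z)      # 3. decoder step on E_D
    W_c -= lr * 2 * np.outer(W_c @ y - z, y)      # 4. encoder step on E_C,
                                                  #    with Z* as the target

train_step(rng.standard_normal(d_in))
```

Note how step 2 becomes cheaper as training proceeds: once the encoder predicts near-optimal codes in one shot, the inner minimization converges almost immediately.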
2.5 Hierarchies of Locally-Invariant Features
Once the system is trained, it can be applied to larger images in order to extract locally invariant feature maps. These feature maps can be used to train another machine, which will produce features that are more invariant and more complex than the first-level features. Disregarding the effect of the filter size used in the convolution, let us assume that the input image has size p×q, that the first-level feature extractor pools over N×N neighborhoods, and that the second-level feature extractor pools over M×M neighborhoods. While the output of the first-level feature extractor, of (approximate) size p/N × q/N, is invariant within N×N max-pooling windows, the output of the second-level feature extractor is invariant within MN×MN windows in input space. Moreover, the second-level feature extractor combines many first-level feature maps into each output feature map, thereby increasing its representational power for encoding complex patterns in input space. The connection between each set of input feature maps and a single output feature map is given by a pre-determined table, which in the experiments described in sec. 3.1 is random.
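As a back-of-the-envelope illustration of this geometry (again ignoring the shrinkage due to the convolution kernels), with hypothetical sizes:

```python
p, q = 28, 28                          # assumed input image size
N, M = 2, 2                            # pooling windows of stages 1 and 2
stage1 = (p // N, q // N)              # (14, 14): invariant in 2x2 windows
stage2 = (p // (N * M), q // (N * M))  # (7, 7): invariant in 4x4 input windows
```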
Learning a hierarchy of feature extractors proceeds in sequence. Each level is trained separately and, once trained, provides the input for the next higher-level feature extractor. As a final step, a global relaxation throughout the whole system can be performed in order to fine-tune the parameters. This procedure was originally proposed by Hinton et al. [5] for training deep belief nets.
3. Experiments
We present two applications of the proposed unsupervised method for learning invariant features. In the first example we consider the MNIST dataset [1] of handwritten digits, and use the algorithm to pre-train the first layers of a deep convolutional network. After this unsupervised layer-by-layer training has yielded a hierarchical shift-invariant feature extractor, all the parameters of the network are tuned together in a supervised way, as proposed by Hinton et al. [5]. The second example shows a straightforward application of the algorithm to the compression of bitonal text images, yielding very promising results. In both experiments, η and β in the Sparsifying Logistic are fixed to 0.01 and 1, respectively. Also, a small lasso regularization term on the order of 0.001 is added to the energy loss.
3.1 Classification of the MNIST dataset
In this section we describe the training protocol of a 6-layer convolutional network trained on the MNIST dataset. The training dataset was augmented with elastically distorted digits, as discussed in Simard et al. [12]. For each training sample, 10 distorted replicas were added to the set, making the total number of training digits 660,000. We set aside 5,000 digits from the original training set for validation.
The training protocol is as follows.
Figure 2. (A) The fifty 7×7 filters learned by the sparse and shift-invariant model trained on the MNIST dataset. These filters are used to initialize the first layer of a deep convolutional network. (B) The fifty 7×7 filters in the first layer of the convolutional network after supervised training on the augmented dataset.
Figure 3. The 42 misclassified digits in the MNIST test set. The true label is shown in the upper left corner of each digit.
First, we train the first four layers of the convolutional network unsupervised on the whole original training dataset, using the hierarchical feature extractor described in the previous sections. Second, we train the top two layers supervised on the whole training dataset, using the features provided by the feature extractor. The parameters that give the minimum loss on the validation set provide the initial values for the final supervised training of the whole network, including the first four layers. Testing is done with the parameters that gave the minimum loss on the validation dataset.
In particular, the unsupervised training of the first four layers is done as follows. First, we learn the filters of the convolutional layer with the sparsifying encoder-decoder model described in sec. 2.4, trained on patches randomly extracted from training images. We learn fifty 7×7 filters that capture features that are invariant within a 2×2 neighborhood, since the max-pooling layer operates on 2×2 windows. Once training is complete, the encoder and decoder filters are frozen, and the Sparsifying Logistic is replaced by a tanh sigmoid function with a trainable bias and gain coefficient. The bias and the gain are trained with a few iterations of back-propagation through the encoder-decoder system. The rationale for relaxing the sparsity constraint is to produce representations with richer information content: while the Sparsifying Logistic drives the system to produce good filters, the quasi-binary codes it produces do not carry enough information for the later classification. Then, training images are run through this level to generate patches for the next level in the hierarchy. The second-level feature extractor (i.e., layers 3 and 4 of the convolutional net) has 1,280 filters of size 5×5 that connect 10 randomly chosen input feature maps to each output feature map. There are 128 output feature maps, which are subsequently max-pooled over 2×2 neighborhoods. This second feature extractor is trained in the same way as the first. When the feature extractor is fed with 34×34 padded digits, it produces features of size 128×5×5 that are passed to the top two layers of the convolutional network. These layers form a two-layer neural net with 200 hidden units and 10 output units. The supervised training of the top two layers, as well as of the whole network, is done by error back-propagation using a loss which is the average squared Euclidean distance between the output of the net and the target value for the input digit (a 1-of-N code of the label). The error rate on the test set is 0.42%, very close to the 0.39% record [10] for this dataset. For comparison, training the same network from random initial conditions by supervised back-propagation of gradients yields an error rate of 0.48%.
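As a sanity check on the sizes quoted above, the layer-by-layer shapes of the network can be traced as follows; the arithmetic assumes 'valid' convolutions and non-overlapping max-pooling:

```python
# All constants are taken from the text; only the shape arithmetic is ours.
x      = (1, 34, 34)                    # padded input digit
conv1  = (50, 34 - 7 + 1, 34 - 7 + 1)   # fifty 7x7 filters -> (50, 28, 28)
pool1  = (50, 28 // 2, 28 // 2)         # 2x2 max-pooling   -> (50, 14, 14)
conv2  = (128, 14 - 5 + 1, 14 - 5 + 1)  # 128 maps of 5x5   -> (128, 10, 10)
pool2  = (128, 10 // 2, 10 // 2)        # 2x2 max-pooling   -> (128, 5, 5)
n_feat = 128 * 5 * 5                    # 3200 features into the 200-10 net
```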
3.2 Compression of Text Document Images
In this section we describe how the unsupervised learning algorithm can be applied to the compression of document images. We considered the CCITT black-and-white test image number 5, which contains both text and drawings. At a resolution of 200 dpi, the image has size 2376×1728 pixels. First, we ran a connected component (CC) analysis, which revealed 1,421 components. We considered patches of size 30×30, which cover about 95% of these CCs, and built a dataset of patches with which to train the system. If a CC is bigger than 30×30 pixels, it is split into chunks of 30×30 pixels; if it is smaller, it is centered in a 30×30 patch. In total there were 3,102 patches, some of which are shown in fig. 4. We considered a (single-stage) invariant feature extractor with 256 filters of size 30×30 in both encoder and decoder. Since the input is evenly padded into a 34×34 window, the max-pooling layer operates over a 5×5 neighborhood. Encoding consists of performing the CC analysis, encoding the locations of the patches extracted from the document (whose entropy is 5.16 bits), propagating each patch through the encoder filter bank and the Sparsifying Logistic, thresholding the code in order to make it binary, and finally encoding both the code and the locations of the largest values selected by the max-pooling layer (which amounts to 13.9 bits per patch). Decoding uncompresses the bit stream, uses the decoder filters to reconstruct the codes and the transformation parameters, and places the thresholded reconstruction at the given location in the image.
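A skeleton of the per-patch encoding path might look as follows, reusing the max_pool_with_positions and sparsifying_logistic sketches from sections 2.2 and 2.3; the helper functions, the even 2-pixel padding, and the 0.5 binarization threshold are our assumptions (a 'valid' 30×30 convolution over a 34×34 input yields 5×5 maps, so a 5×5 pooling window leaves exactly one value and one position per map):

```python
import numpy as np

def pad_to_34(patch):
    """Center a 30x30 patch in an evenly padded 34x34 window."""
    out = np.zeros((34, 34))
    out[2:32, 2:32] = patch
    return out

def conv_bank(img, filters):
    """'Valid' correlation with 30x30 filters: a 34x34 input gives 5x5 maps."""
    k = filters.shape[-1]
    n = img.shape[0] - k + 1
    return np.array([[[np.sum(img[r:r+k, c:c+k] * f) for c in range(n)]
                      for r in range(n)]
                     for f in filters])

def encode_patch(patch, enc_filters, eta=0.01, beta=1.0):
    """Produce the binary code and the max locations to be entropy-coded."""
    fmaps = conv_bank(pad_to_34(patch), enc_filters)      # 256 maps, each 5x5
    pooled = [max_pool_with_positions(f, 5) for f in fmaps]
    z = np.array([zi[0, 0] for zi, _ in pooled])          # one value per map
    u = np.array([ui[0, 0] for _, ui in pooled])          # one position per map
    zbar = sparsifying_logistic(z[None, :], eta, beta)[0]
    code = (zbar > 0.5).astype(np.uint8)                  # assumed threshold
    return code, u
```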
Figure 4. LEFT PANELS: Examples of windows that are used for training and compressing the CCITT test page number 5. RIGHT PANELS: Reconstructions provided by the algorithm.
Figure 5. Detail of the CCITT test image number 5 at 200 dpi, compressed and reconstructed by our algorithm.
Ideally, the filters would be learned from a variety of images with different fonts and would be hard-wired in the decoder. Under the assumption that the decoder filters need not be encoded as well, the compression ratio is equal to 95. The percentage of reconstructed pixels whose value has been flipped is 1.35%. An example of patches that are encoded and reconstructed is shown in fig. 4, and fig. 5 shows a detail of the reconstructed image. For comparison, the state-of-the-art method for compressing such images is JB2 [3], which achieves a lossy compression ratio of 55 while producing a visually lossless result on this test image.
4. Conclusions
We have described an unsupervised method for learning hierarchies of shift-invariant features. We have applied this algorithm to pre-train the first four layers of a deep convolutional network, achieving state-of-the-art performance in the classification of handwritten digits from the MNIST dataset. An application to compression using a single-level invariant feature extractor has also been presented and shows promising results.
In this paper we have defined a novel method for learning locally shift-invariant features. Moreover, we have proposed a general and principled method for learning hierarchies of features. In our experiments we have limited ourselves to two levels of features, but one could stack as many such modules as needed in order to achieve the desired level of invariance and complexity. This method is useful for pre-training deep architectures such as convolutional neural networks. It can also be used to learn and extract features from datasets with few labelled examples, and to learn complex invariant features with the desired dimensionality.
Future work will consider other learning algorithms that train multiple levels of feature extractors together, the definition of a more efficient and hierarchical method for compression, the possibility of also learning the parameters of the Sparsifying Logistic which control the sparsity of the representation, and more exhaustive experimentation in order to better understand the conditions under which unsupervised learning is beneficial to supervised learning.
References
[1] http://yann.lecun.com/exdb/mnist/.
[2] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS. MIT Press, 2007.
[3] L. Bottou, P. Haffner, P. Howard, P. Simard, Y. Bengio, and Y. LeCun. High quality document image compression with DjVu. Journal of Electronic Imaging, 7(3):410-425, 1998.
[4] K. Fukushima and S. Miyake. Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 15(6):455-469, 1982.
[5] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-1554, 2006.
[6] F.-J. Huang and Y. LeCun. Large-scale learning with SVM and convolutional nets for generic object categorization. In CVPR. IEEE Press, 2006.
[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998.
[8] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[9] J. Mutch and D. Lowe. Multiclass object recognition with sparse, localized features. In CVPR, 2006.
[10] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In NIPS. MIT Press, 2006.
[11] T. Serre, L. Wolf, and T. Poggio. Object recognition with features inspired by visual cortex. In CVPR, 2005.
[12] P. Simard, D. Steinkraus, and J. Platt. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, 2003.