Comparing User-assisted and Automatic Query Translation

Daqing He¹, Jianqiang Wang², Douglas W. Oard², Michael Nossal¹

¹ Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742 USA

{daqingd,nossal}@umiacs.umd.edu

² College of Information Studies & Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742 USA

{wangjq,oard}@glue.umd.edu

Abstract. For the 2002 Cross-Language Evaluation Forum Interactive Track, the University of Maryland team focused on query formulation and reformulation. Twelve people performed a total of forty-eight searches in the German document collection using English queries. Half of the searches were with user-assisted query translation, and half with fully automatic query translation. For the user-assisted query translation condition, participants were provided two types of cues about the meaning of each translation: a list of other terms with the same translation (potential synonyms), and a sentence in which the word was used in a translation-appropriate context. Four searchers performed the official iCLEF task; the other eight searched a smaller collection. Searchers performing the official task were able to make more accurate relevance judgments with user-assisted query translation for three of the four topics. We observed that the number of query iterations seems to vary systematically with topic, system, and collection, and we are analyzing query content and ranked retrieval measures to obtain further insight into these variations in search behavior.

1 Introduction

Interactive Cross-Language Information Retrieval (CLIR) is an iterative process in which searcher and system collaborate to find documents that satisfy an information need, regardless of whether they are written in the same language as the query. Humans and machines bring complementary strengths to this process. Machines are excellent at repetitive tasks that are well specified; humans bring creativity and exceptional pattern recognition capabilities. Properly coupling these capabilities can result in a synergy that greatly exceeds the ability of either human or machine alone. The design of the fully automated components to support cross-language searching (e.g., structured query translation and ranked retrieval) has been well researched, but achieving true synergy requires that the machine also provide tools that will allow its human partners to exercise their skills to the greatest possible degree. Such tools are the focus of our work in the Cross-Language Evaluation Forum’s (CLEF) interactive track (iCLEF). In 2001, we began by exploring support for document selection [5]. This year, our focus is on query formulation.

Cross-language retrieval techniques can generally be classified as query translation, document translation, or interlingual designs [2]. We adopted a query translation design because the query translation stage provides an additional interaction opportunity not present in document-translation-based systems. Our searchers first formulate a query in English, then the system translates that query into the document language (German, in our case). The translated query is used to search the document collection, and a ranked list of document surrogates (first 40 words, in our case) is displayed. The searcher can examine individual documents, and can optionally repeat the process by reformulating the query. Although there are only three possible interaction points (query formulation, query translation, and document selection), the iterative nature of the process introduces significant complexity. We therefore performed extensive exploratory data analysis to understand how searchers employ the systems that we provided.

Our study was motivated by the following questions:

  1. What strategies do searchers apply when formulating their initial query and when reformulating that query? In what ways do their strategies differ from those used in monolingual applications? How do individual differences in subject knowledge, language skills, search experience, and other factors affect this process?
  2. What information do searchers need when reformulating their query, and how do they obtain that information?
  3. Can searchers find documents more effectively if we give them some degree of control over the query translation process? Do searchers prefer to exercise control over the query translation process? What reasons do they give for their preference?
  4. What measures can best illuminate the effect of interactive query reformulation on retrieval effectiveness?

These questions are, of course, far too broad to be answered completely by any single experiment. For the experiments reported in this paper, we chose to provide our searchers with two variants of a single retrieval system, one with support for interaction during query translation (which we call “manual”), and the other with fully automatic query translation (which we call “auto”). This design allowed us to test a hypothesis derived from our third question above. We relied on observations, questionnaires, semi-structured interviews, and exploratory data analysis to augment the insight gained through hypothesis testing, and to begin our exploration of the other questions.

In the next section, we describe the design of our system. Section 3 then describes our experiment, and Section 4 presents the results that we obtained. Section 5 concludes the paper with a brief discussion of future work.

2 System Design

In this section, we describe the resources that we used, the design of our cross-language retrieval system, and our user interface design.

2.1 Resources

We chose English as the query language and German as the document language because our population of potential searchers was generally skilled in English but not German. The full German document collection contained 71,677 news stories from the Swiss News Agency (SDA) and 13,979 news stories from Der Spiegel. We used the German-to-English translations provided by the iCLEF organizers for construction of document surrogates (for display in a ranked list) and for display of full document translations (when selected for viewing by the searcher). The translations were created using Systran Professional 3.0.

We obtained a German-English bilingual term list from the Chemnitz University of Technology³, and used the German stemmer from the “snowball” project⁴. Our Keyword in Context (KWIC) technique requires parallel (i.e., translation-equivalent) German/English texts; we obtained those from the Foreign Broadcast Information Service (FBIS) TIDES data disk, release 2.

2.2 CLIR System

We used the InQuery text retrieval system (version 3.1p1) from the University of Massachusetts, along with locally implemented extensions to support cross-language retrieval between German and English. We used Pirkola’s structured query technique for query translation [4], which aggregates German term frequencies and document frequencies separately before computing the weight for each English query term. This tends to suppress the contribution to the ranking computations of those English terms that have at least one translation that is a common German word (i.e., that occurs in many documents). For the automatic condition, all known translations were used. For the manual condition, only translations selected by the searcher were used. We employed a backoff translation strategy to maximize the coverage of the bilingual term list [3]. If no translation was found for the surface form of an English term, we stemmed the term (using the Porter stemmer) and tried again. If that failed, we also stemmed the English side of the bilingual term list and tried a third time. If that still failed, we treated the untranslated term as its own translation in the hope that it might be a proper name.
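The four backoff stages described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the function names, the dictionary representation of the bilingual term list, the toy entries, and the `toy_stem` stand-in for the Porter stemmer are all assumptions made for the example.

```python
from typing import Callable, Dict, List


def toy_stem(word: str) -> str:
    """Crude suffix stripper used only as a stand-in for the Porter stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def backoff_translate(term: str,
                      term_list: Dict[str, List[str]],
                      stem: Callable[[str], str] = toy_stem) -> List[str]:
    """Look up translations for an English term, backing off in four stages."""
    # Stage 1: surface form of the query term against surface-form entries.
    if term in term_list:
        return list(term_list[term])
    # Stage 2: stemmed query term against surface-form entries.
    stemmed = stem(term)
    if stemmed in term_list:
        return list(term_list[stemmed])
    # Stage 3: stemmed query term against stemmed entries.
    merged: List[str] = []
    for entry, translations in term_list.items():
        if stem(entry) == stemmed:
            merged.extend(t for t in translations if t not in merged)
    if merged:
        return merged
    # Stage 4: keep the term untranslated, hoping it is a proper name.
    return [term]


# Hypothetical toy term list (English -> German), for illustration only.
TERM_LIST = {"strike": ["Streik"], "hunting": ["Jagd"]}

print(backoff_translate("strike", TERM_LIST))    # stage 1
print(backoff_translate("strikes", TERM_LIST))   # stage 2
print(backoff_translate("hunts", TERM_LIST))     # stage 3
print(backoff_translate("Maryland", TERM_LIST))  # stage 4
```

Passing the stemmer in as a parameter keeps the sketch independent of any particular stemming library; a real Porter implementation could be substituted without changing the lookup logic.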

2.3 User Interface Design

For our automatic condition, we adopted an interface design similar to that of present Web search engines. Searchers entered English query terms in a one-line text field, based on their understanding of a full CLEF topic description (title, description, and narrative). We provided that topic description on paper in order to encourage a more natural query formulation process than might have been the case if cut-and-paste from the topic description were available. When the search button was clicked, a ranked list of document surrogates was displayed below the query field, thus allowing the query to serve as context when interpreting the ranked list. Ten surrogates were displayed simultaneously as a page, and up to 10 pages (100 surrogates in total) could be viewed by clicking the “next” button. Our surrogates consisted of the first 40 words in the TEXT field of the translated document. English words in the surrogate that shared a common stem with any query term (using the Porter stemmer) were highlighted in red. See Figure 1 for an illustration of the automatic user interface.

³ http://dict.tuchemnitz.de/

⁴ http://snowball.sourceforge.net

Fig. 1. User interface, automatic condition.

Each surrogate is labeled with a numeric rank (1, 2, 3, ...), which is displayed as a numbered button to the left of the surrogate. If the searcher selected the button, the full text of that document would be displayed in a separate window, with query terms highlighted in the same manner. In order to provide context, we repeated the numeric rank and the surrogate at the top of the document examination window. Figure 2 illustrates a document examination window.

We collected three types of information about relevance judgments. First, searchers could indicate whether the document was not relevant (“N”), somewhat relevant (“S”), or highly relevant (“H”). A fourth value, “?” (indicating unjudged), was initially selected by the system. Second, searchers could indicate their degree of confidence in their judgment as low (“L”), medium (“M”), or high (“H”), with a fourth value (“?”) being initially selected by the system.

Both relevance judgments and confidence values were recorded incrementally in a log file. Searchers could record relevance judgments and confidence values in either the main search window or in a document examination window (when that window was displayed). Finally, we recorded the times at which documents were selected for examination and the times at which relevance judgments for those documents were recorded. This allowed us to later compute the (approximate) examination time for each document. For documents that were judged without examination (e.g., based solely on the surrogate), we assigned zero as the examination time.

Fig. 2. Document examination window.
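As an illustration of how approximate examination times could be derived from such a log, the sketch below pairs each "open" event with the next judgment recorded for the same document. The event tuple layout and the action names `"open"` and `"judge"` are hypothetical; the actual log format is not described here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical log record: (timestamp in seconds, document id, action),
# where action is "open" (document selected for examination) or "judge"
# (relevance judgment recorded). Events are assumed to be in time order.
Event = Tuple[float, str, str]


def examination_times(events: Iterable[Event]) -> Dict[str, float]:
    """Sum, per document, the time between opening it and judging it."""
    opened_at: Dict[str, float] = {}
    totals: Dict[str, float] = defaultdict(float)
    for timestamp, doc_id, action in events:
        if action == "open":
            opened_at[doc_id] = timestamp
        elif action == "judge":
            totals[doc_id] += 0.0  # judged documents always get an entry
            start = opened_at.pop(doc_id, None)
            if start is not None:
                totals[doc_id] += timestamp - start
    return dict(totals)


log = [
    (0.0, "d1", "open"),
    (30.0, "d1", "judge"),
    (40.0, "d2", "judge"),   # judged from the surrogate only
    (50.0, "d1", "open"),    # re-examined later
    (65.0, "d1", "judge"),
]
times = examination_times(log)
print(times)
```

Documents judged without being opened receive an examination time of zero, and repeated examinations of the same document are summed.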

For the manual interface, we used a variant of the same interface with two additional items: 1) term-by-term control over the query translation process, and 2) a summary of the translations chosen for all query terms. We used a tabbed pane to allow the user to examine alternative translations for one English query term at a time. Each possible translation was shown on a separate line, and a checkbox to the left of each line allowed the user to deselect or reselect that translation. All translations were initially selected, so the manual and automatic conditions would be identical if the user did not deselect any translation.

Since we designed our interface to support searchers with no knowledge of German, we provided cues in English about the meaning of each German translation. For these experiments, searchers were able to view two types of cues: (1) back translation, and (2) Keyword In Context (KWIC). Each was created automatically, using techniques described below. Searchers were able to alternate between the two types of cues using tabs. The query translation summary area provided additional context for interpretation of the ranked list, simultaneously showing all selected translations (with one back translation each). In order to emphasize that two steps were involved (query translation, followed by search), we provided both “translate query” and “search” buttons. All other functions

Fig. 4. Back translations of “religious.”

Back Translation. Ideally, we would prefer to provide the searcher with English definitions for each German translation alternative. Dictionaries with these types of definitions do exist for some language pairs (although rights management considerations may limit their availability in electronic form), but bilingual term lists are much more easily available. What we call “back translations” are English terms that share a specific German translation, something that we can determine with a simple bilingual term list. For example, the English word religious has several German translations in the term list that we used, two of which are fromm and gewissenhaft. Looking in the same term list for cues to the meaning of fromm, we see that it can be translated into English as religious, godly, pious, piously, or godlier. Thus fromm seems to clearly correspond to the literal use of religious. By contrast, gewissenhaft’s back translations are religious, sedulous, precise, conscientious, faithful, or conscientiousness. This seems as if it might correspond with a more figurative use of religious, as in “he rode his bike to work religiously.” Of course, many German translations will themselves have multiple senses, so detecting a reliable signal in the noisy cues provided by back translation sometimes requires common-sense reasoning. Fortunately, that is a task for which humans are uniquely well suited. The original English term will always be its own back translation, so we suppress its display. Sometimes this results in an empty (and therefore uninformative) set of back translations. Figure 4 shows the back translation display for “religious” in our manual condition.
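A back translation lookup of this kind needs nothing more than the term list read in both directions. The sketch below assumes the term list is available as a sequence of (English, German) pairs; the function names are illustrative, the entries for fromm and gewissenhaft are a subset of the example above, and the "treasure"/"Schatz" pair is a hypothetical addition used to show the empty case.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def back_translations(english_term: str,
                      pairs: Sequence[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each German translation of english_term to its other English terms."""
    en_to_de: Dict[str, List[str]] = defaultdict(list)
    de_to_en: Dict[str, List[str]] = defaultdict(list)
    for en, de in pairs:
        if de not in en_to_de[en]:
            en_to_de[en].append(de)
        if en not in de_to_en[de]:
            de_to_en[de].append(en)
    # The original term is always its own back translation, so it is dropped.
    return {
        de: [en for en in de_to_en[de] if en != english_term]
        for de in en_to_de.get(english_term, [])
    }


# Toy term list for illustration only.
PAIRS = [
    ("religious", "fromm"), ("godly", "fromm"),
    ("pious", "fromm"), ("piously", "fromm"),
    ("religious", "gewissenhaft"), ("sedulous", "gewissenhaft"),
    ("precise", "gewissenhaft"), ("conscientious", "gewissenhaft"),
    ("faithful", "gewissenhaft"),
    ("treasure", "Schatz"),
]

for german, cues in back_translations("religious", PAIRS).items():
    print(german, "->", ", ".join(cues))
```

A term whose translation has no other English entries, such as the hypothetical "treasure" above, yields an empty cue list, which is the uninformative case noted in the text.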

Fig. 5. Constructing cross-language KWIC using a sentence-aligned parallel corpus.

Keyword in Context. One way to compensate for the weaknesses of back translation is to draw additional evidence from examples of usage. In keeping with the common usage in monolingual contexts [1], we call this approach “keyword in context” or “KWIC.” For each German translation of an English term, our goal is to find a brief passage in which the English term is used in a manner appropriate to the translation in question. To do this, we started with a collection of document pairs that are translations of each other. We used German news stories that had previously been manually translated into English by the Foreign Broadcast Information Service (FBIS) and distributed as a standard research corpus. We segmented the FBIS documents into sentences using rule-based software based on punctuation and capitalization patterns, and then produced aligned sentence pairs using the GSA algorithm (which uses dynamic programming to discover a plausible mapping of sentences within a pair of documents, based upon known translation relationships from the bilingual term list, sentence lengths, and relative positions in each document). We presented the entire English sentence, favoring the shortest one if multiple sentence pairs contained the same English term.⁵

Formally, let $t_e$ be an English term for which we seek an example of usage, and let $t_g$ be the German translation from the bilingual term list that is of interest. Let $S_e$ and $S_g$ be the shortest pair of sentences that contain $t_e$ and $t_g$ respectively. We then present $S_e$ as the example of usage for translation $t_g$. Figure 5 illustrates this process.
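The selection rule above can be sketched in a few lines. The sketch assumes that sentence alignment has already been done, that "shortest" is measured by the number of words in the English sentence, and that matching is exact on lowercased word tokens; the function names and the example sentence pairs are invented for illustration and are not taken from the FBIS corpus.

```python
import re
from typing import List, Optional, Sequence, Tuple


def tokens(sentence: str) -> List[str]:
    """Lowercased word tokens; a deliberately simple tokenizer."""
    return re.findall(r"\w+", sentence.lower())


def kwic_example(t_e: str, t_g: str,
                 aligned_pairs: Sequence[Tuple[str, str]]) -> Optional[str]:
    """Return the shortest English sentence S_e whose aligned German
    sentence S_g contains t_g, with S_e itself containing t_e."""
    best: Optional[str] = None
    for s_e, s_g in aligned_pairs:
        if t_e.lower() in tokens(s_e) and t_g.lower() in tokens(s_g):
            if best is None or len(tokens(s_e)) < len(tokens(best)):
                best = s_e
    return best


# Invented aligned sentence pairs (English, German), for illustration only.
ALIGNED = [
    ("The old man in the village is religious and kind.",
     "Der alte Mann im Dorf ist fromm und freundlich."),
    ("He is religious.", "Er ist fromm."),
    ("He is religious about his work.",
     "Er ist gewissenhaft bei seiner Arbeit."),
]

print(kwic_example("religious", "fromm", ALIGNED))
print(kwic_example("religious", "gewissenhaft", ALIGNED))
```

Exact token matching misses inflected German forms; matching on stems would be a natural refinement, but whether the original system did so is not stated.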

3 Experiment Design

Our experiment is designed to test the utility of user-assisted query translation in an interactive cross-language retrieval system. We were motivated to explore this question by two potential benefits that we foresaw:

The effectiveness of ranked retrieval might be improved if a more refined set of translations for key query terms were known.
The searcher’s ability to employ the retrieval system might be improved by providing greater transparency for the query translation process.

Formally, we sought to reject the null hypothesis that there is no difference between the $F_{\alpha=0.8}$ achieved using the automatic and manual systems. The F measure is an outcome measure, however, and we were also interested in understanding process issues. We used exploratory data analysis to improve our understanding of how the searchers used the cues we provided.

3.1 Procedure

We followed the standard protocol for iCLEF 2002 experiments. Searchers were sequentially given four topics (stated in English), two for use with the manual system and two for use with the automatic system. Presentation order for topics and system was varied systematically across searchers as specified in the track guidelines. After an initial training session, they were given 20 minutes for each search to identify relevant documents using the radio buttons provided for that purpose in our user interface. The searchers were asked to emphasize precision over recall (by telling them that it was more important that the documents that they selected be truly relevant than that they find every possible relevant document). We asked each searcher to fill out brief questionnaires before the first search (for demographic data), after each search, and after using each system. Each searcher used the same system at a different time, so we were able to observe each individually and make extensive observational notes. We also conducted a semi-structured interview (in which we tailored our questions based on our observations) after all searches were completed.

⁵ We did not highlight the query term in the current version due to time constraints. Another limitation of the current implementation is that a briefer passage may serve our purpose better in some cases.

We conducted a pilot study with a single searcher (umd01) to exercise our new system and refine our data collection procedures. Eight searchers (umd02–umd09) then performed the experiment using the eight-subject design specified in the track guidelines⁶. While preparing our results for submission, we noticed that no SDA document appeared in any ranked list. Investigation revealed that InQuery had failed to index those documents because we had not configured the SGML parsing correctly for that collection. We therefore corrected that problem, recruited four new searchers (umd10–umd13), and repeated the experiment, this time using the four-subject design specified in the track guidelines.

We submitted all twelve runs for use in forming relevance pools, but designated the second experiment as our official submission because the first experiment did not comply with one requirement of the track guidelines (the collections to be searched). Our results from the first experiment are, however, interesting for several reasons. First, it turned out that topic 3 had no relevant documents in the collection searched in the first experiment.⁷ This happens in real applications, of course, but the situation is rarely studied in information retrieval experiments because the typical evaluation measures are unable to discriminate between systems when no relevant documents exist. Second, the number of relevant documents for the remaining three topics was smaller in the first experiment than in the second. This provided an opportunity to study the effect of collection characteristics on searcher behavior.

For convenience, we refer to the first experiment as the small collection experiment, and the second as the large collection experiment.

3.2 Measures

We computed the following measures in order to gain insight into search behavior and search results:

$F_{\alpha=0.8}$, as defined in the track guidelines (with “somewhat relevant” documents treated as not relevant). We refer to this condition as “strict” relevance judgments. This value was computed at the end of each search session.
$F_{\alpha=0.8}$, but with “somewhat relevant” documents treated as relevant. We refer to this condition as “loose” relevance judgments. This value was also computed for each session.
Mean uninterpolated Average Precision (MAP) for the ranked list returned by each iteration of a search process.
A variant of MAP in which documents already marked as “highly relevant” are placed at the top of the ranked list (in an arbitrary order). We refer to this measure as “MAPS” (for “strict”).
A second variant of MAP in which documents already marked as “highly relevant” or “somewhat relevant” are placed at the top of the ranked list (in an arbitrary order). We refer to this measure as “MAPL” (for “loose”).
A third variant of MAP in which only the documents satisfying two conditions, 1) they are already marked as “highly relevant” by the subject and 2) they are truly relevant according to “ground truth,” are placed at the top of the ranked list (in an arbitrary order). We refer to this measure as “MAPR” (for “real”).
The total examination time (in seconds) for each document, summed over all instances of examination for the same document. If the full text of a document was never examined, an examination time of zero was recorded.
The total number of query iterations for each search.

⁶ http://terral.lsi.uned.es/iclef/2002/

⁷ In this paper, we number the topics 1, 2, 3, and 4 in keeping with the track guidelines. These correspond to CLEF topic numbers c053, c065, c056 and c080, respectively.

The set-oriented measures (strict and loose F) are designed to characterize end-to-end task performance using the system. The rank-oriented measures (MAP, MAPS, MAPL and MAPR) are designed to offer indirect insight into the query formulation process by characterizing the effect of a query based on the density of relevant documents near the top of the ranked list produced for that query (or for queries up through that iteration, viewed either from the subject’s own sense of performance, in the case of MAPS and MAPL, or from actual performance, in the case of MAPR). Examination time is intended for use in conjunction with relevance judgment categories, in order to gain some insight into the relevance judgment process. We have not yet finished our trajectory analysis or the analysis of examination duration, so in this paper we report results only for the final values of $F_{\alpha=0.8}$ and for the number of iterations.
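The following sketch shows one way these measures could be computed. It assumes that the track's F is van Rijsbergen's weighted harmonic mean, F = 1 / (α/P + (1 − α)/R), with α = 0.8 favoring precision; the paper defers to the track guidelines for the exact definition, so treat that formula, the zero-handling convention, and all function names as assumptions. Promoted documents are placed in sorted order here simply to make the "arbitrary order" deterministic.

```python
from typing import List, Sequence, Set


def f_alpha(precision: float, recall: float, alpha: float = 0.8) -> float:
    """Weighted harmonic mean of precision and recall (assumed definition)."""
    if precision == 0 or recall == 0:
        return 0.0  # assumed convention when either component is zero
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Uninterpolated average precision of one ranked list."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def promote(ranking: Sequence[str], marked: Set[str]) -> List[str]:
    """Place already-marked documents at the top of the ranked list."""
    return sorted(marked) + [d for d in ranking if d not in marked]


ranking = ["d1", "d2", "d3", "d4"]
ground_truth = {"d2", "d4"}          # truly relevant documents
marked_highly = {"d1", "d4"}         # searcher's "highly relevant" marks

ap = average_precision(ranking, ground_truth)
ap_s = average_precision(promote(ranking, marked_highly), ground_truth)
ap_r = average_precision(promote(ranking, marked_highly & ground_truth),
                         ground_truth)
print(ap, ap_s, ap_r, f_alpha(0.5, 0.25))
```

The MAPL-style variant follows the same pattern with the set of documents marked either "highly relevant" or "somewhat relevant" passed to `promote`; averaging these per-list values over topics gives the corresponding mean.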

4 Results

4.1 Searchers

Our searcher population was relatively homogeneous. Specifically, they were:

Affiliated with a university. Every one of our searchers was a student, staff member or faculty member at the University of Maryland.

Highly educated. Ten of the 12 searchers were either enrolled in a Masters degree program or had earned a Masters degree or higher. The remaining two were undergraduate students, both in the small collection experiment.

Mature. The average age over all 12 searchers was 31, with the youngest being 19 and the oldest being 43. The average age of the four searchers in the large collection experiment was 32.

Mostly female. There were three times as many female searchers as males, both overall and in the large collection experiment.

Experienced searchers. Six of the 12 searchers held degrees in library science. The searchers reported an average of about 6 years of online searching experience, with a minimum of 4 years and maximum of 10 years. Most searchers reported extensive experience with Web search services, and all reported at least some experience searching computerized library catalogs (ranging from “some” to “a great deal”). Eleven of the 12 reported that they search at least once or twice a day. The reported search experience of the four participants in the large collection experiment was slightly greater than that of the 12 searchers as a whole.

Not previous study participants. None of the 12 subjects had previously participated in a TREC or iCLEF study.

Inexperienced with machine translation. Nine of the 12 participants reported never having used any machine translation software or free Web translation service. The other 3 reported “very little experience” with machine translation software or services. The four participants in the large collection experiment reported the same ratio.

Native English speakers. All 12 searchers were native speakers of English.

Not skilled in German. Eight of the 12 searchers reported no reading skills in German at all. Another 3 reported poor reading skills in German, and one (umd12) reported good reading skills in German. Among the four searchers in the large collection experiment, 3 reported no German skills, with the fourth reporting good reading skills in German.

Fig. 6. $F_{\alpha=0.8}$, large collection, by condition and topic.

4.2 Large Collection Experiment

Our official results on the large collection experiment found that the manual system achieved a 48% larger value for $F_{\alpha=0.8}$ than the automatic system (0.4995 vs. 0.3371). However, the difference is not statistically significant, and the most likely reason is the small sample size. The presence of a searcher with good reading skills in German is also potentially troublesome given the hypothesis that we wished to test. We have not yet conducted a searcher-by-searcher analysis to determine whether searcher umd12 exhibited search behaviors markedly different from the other 11 searchers. For contrast, we recomputed the same results with loose relevance. In that case, the manual system in our large collection experiment achieved a 22% larger value for $F_{\alpha=0.8}$ than the automatic system (0.5095 vs. 0.4176).
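The reported percentages are relative differences with the automatic system as the baseline, which can be checked directly from the four values quoted above. The helper name below is illustrative.

```python
def relative_gain(manual: float, auto: float) -> float:
    """Relative difference of the manual score over the automatic baseline."""
    return (manual - auto) / auto


strict_gain = relative_gain(0.4995, 0.3371)  # strict relevance
loose_gain = relative_gain(0.5095, 0.4176)   # loose relevance
print(f"strict: {strict_gain:.1%}, loose: {loose_gain:.1%}")
```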

As Figure 6 shows, the manual system achieved the largest improvements for topics 1 (Genes and Diseases) and 4 (Hunger Strikes) with strict relevance, but the automatic system actually outperformed the manual system on topic 2 (Treasure Hunting). Loose relevance judgments exhibited a similar pattern. Searchers who were presented with topic 2 in the manual condition reported (in the questionnaire) that it was more difficult to identify appropriate translations for topic 2 than for any other topic, and searchers generally indicated that they were less familiar with topic 2 than with other topics. We have not yet completed our analysis of observational notes, so we are not able to say whether this resulted in any differences in search behavior. But it seems likely that, without useful cues, searchers removed translations that they would have been better off keeping. If confirmed through further analysis, this may have implications for user training.

Fig. 7. $F_{\alpha=0.8}$, small collection group, by condition.

4.3 Small Collection Experiment

The results of the small collection experiment shown in Figure 7 are quite different. The situation is reversed for topic 1, with automatic now outperforming manual, and topic 4 no longer discriminates between the two systems.⁸ Overall, the manual and automatic systems could not be distinguished using loose relevance (0.2889 vs. 0.2931), but the automatic system seemed to do better with strict relevance (0.2268 vs. 0.3206). Again, we did not find the difference to be statistically significant. The data that we have analyzed does, however, seem to suggest that our manual system is better suited to cases in which there are a substantial number of relevant documents. We plan to use this question to guide some of our further data analysis.

⁸ Topic 3, with no relevant documents in the small collection, is not shown.

4.4 Subjective Assessment

We analyzed questionnaire data and interview responses in an effort to understand how participants employed the systems and to better understand their impressions about the systems. Questionnaire responses are on a 1-5 scale (with 1 being “not at all,” and 5 being “strongly agree”).

Searchers in the large collection experiment reported that the manual and automatic systems were equally easy to search with (average 3.5), but searchers in the small collection experiment reported that the automatic system was easier to use than the manual system (3.4 vs. 2.75).

Searchers in the large collection experiment reported an equal need to reformulate their initial queries with both systems (average 3.25), but searchers in the small collection experiment reported that this was somewhat less necessary with the automatic system (3.9 vs. 4.1). One searcher, umd07, reported that it was “extremely necessary” to reformulate queries with both systems. We noticed from this searcher’s answers to our open questions that he/she thought the query translations were “usually very poor,” and that he/she would like both systems to support Boolean queries, proximity operators and truncation so that “noise” could be removed.

Searchers in the large collection experiment reported that they were able to find relevant documents more easily using the manual system than the automatic system (4.0 vs. 3.5), but searchers in the small collection experiment had the opposite opinion (2.6 vs. 3.0).

For questions unique to the manual system, the large collection group reported positive reactions to the usefulness of user-assisted query translation (with everyone choosing a value of 4). They generally felt that it was possible to identify unintended translations (an average of 3.5), and that most of the time the system provided appropriate translations (average of 3.9).

Most participants reported that they were not familiar with the topics, with topic 3 (European Campaigns against Racism) having the most familiarity, and topics 1 and 2 having the least.

4.5 Query iteration analysis

We determined the number of iterations for each search through log file analysis. In the large collection experiment, searchers averaged 9 query iterations per search across all conditions. Topic 2 had the largest number of iterations (averaging 16), and topic 4 had the fewest (averaging 6). Topics 1 and 2 exhibited little difference in the average number of iterations across systems, but topics 3 and 4 had substantially fewer iterations with the manual system. In the small collection experiment, searchers performed substantially more iterations per search than in the large collection experiment, averaging 13 iterations per search across all conditions. Topic 2 again had the greatest number of iterations (averaging 16), while topic 1 had the fewest (averaging 8).

4.6 The effect of the number of relevant documents

The unexpected problem with indexing the SDA collection reduced the number of searchers that contributed to our official results, but it provided us with an extra dimension for our analysis. Searchers in the large collection and small collection experiments were generally drawn from the same population, were given the same topics, used the same systems, and performed the same tasks. The main difference is the nature of the collection that they searched, and in particular the number of relevant documents that were available to be found. Summarizing the results above from this perspective, we observed the following differences between the two experiments:

Objectively, searchers seemed to achieve a better outcome measure with the manual system in the large collection experiment, but they seemed to do better with the automatic system in the small collection experiment.
Subjectively, searchers preferred using the manual system in the large collection experiment, but they preferred the automatic system in the small collection experiment.
Examining search behavior, we found that the average number of query refinement iterations per search was inversely correlated with the number of relevant documents.

We have not yet finished our analysis, but the preponderance of the evidence that is presently available suggests that collection characteristics may be an important variable in the design of interactive CLIR systems. We believe that this factor should receive attention in future work on this subject.

5 Conclusion and future work

We focused on supporting user participation in the query translation process, and tested the effectiveness of two types of cues, back translation and keyword in context, in an interactive CLIR application. Our preliminary analysis suggests that together these cues can sometimes be helpful, but that the degree of utility that is obtained is dependent on the characteristics of the topic, the collection, and the available translation resources.

Our experiments suggest a number of promising directions for future work. First, mean average precision is a commonly reported measure for the quality of a ranked list (and, by extension, for the quality of the query that led to the creation of that ranked list). We have found that it is difficult to draw insights from MAP trajectories (variations across sequential query refinement iterations), in part because we do not yet have a good way to describe the strategies that a searcher might employ. We are presently working to characterize these strategies in a useful way, and to develop variants of the MAP measure (three of which were described above) that may offer additional insight. Second, our initial experiments with using KWIC for user-assisted query translation seem promising, but there are several things that we might improve. For example, it would be better if we could find the examples of usage in a comparable corpus (or even the Web) rather than a parallel corpus, because parallel corpora are difficult to obtain. Finally, we observed far more query reformulation activity in this study than we had expected to see. Our present system provides some support for reformulation by allowing the user to see which query term translations are being used in the search. But we do not yet provide the searcher with any insight into the second half of that process: which German words correspond to potentially useful English terms that are learned by examining the translations? If we used the same resources for document translation as for query translation, this might not be a serious problem. But we do not, so it is an issue that we need to think about how to support.

The CLEF interactive track has proven to be an excellent source of insight into both system design and experiment design. We look forward to next year’s experiments!

Acknowledgments

The authors would like to thank Julio Gonzalo and Fernando López-Ostenero for their tireless efforts to coordinate iCLEF. This work has been supported in part by DARPA cooperative agreement N660010028910.

References

  1. Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
  2. Douglas W. Oard and Anne Diekema. Cross-Language Information Retrieval. Annual Review of Information Science and Technology, 33:223–256, 1998.
  3. Douglas W. Oard, Gina-Anne Levow, and Clara I. Cabezas. CLEF Experiments at Maryland: Statistical Stemming and Backoff Translation. In C. Peters, editor, Cross-Language Information Retrieval and Evaluation: Workshop of Cross-Language Evaluation Forum, CLEF 2000, pages 176–187, Lisbon, Portugal, 2000.
  4. Ari Pirkola. The Effects of Query Structure and Dictionary Setups in Dictionary-Based Cross-Language Information Retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 1998.
  5. Jianqiang Wang and Douglas W. Oard. iCLEF 2001 at Maryland: Comparing Word-for-Word Gloss and MT. In C. Peters, M. Braschler, J. Gonzalo, and M. Kluck, editors, Evaluation of Cross-Language Information Retrieval Systems: Second Workshop of the Cross-Language Evaluation Forum, CLEF 2001, pages 336–354, Darmstadt, Germany, 2001.