
SmartWeb Handheld — Multimodal Interaction with Ontological Knowledge Bases and Semantic Web Services

Daniel Sonntag, Ralf Engel, Gerd Herzog, Alexander Pfalzgraf, Norbert Pfleger, Massimo Romanelli, Norbert Reithinger
German Research Center for Artificial Intelligence, 66123 Saarbrücken, Germany
firstname.lastname@dfki.de

Abstract

SmartWeb aims to provide intuitive multimodal access to a rich selection of Web-based information services. We report on the current prototype with a smartphone client interface to the Semantic Web. An advanced ontology-based representation of facts and media structures serves as the central description for rich media content. The underlying content is accessed through conventional web service middleware, which connects the ontological knowledge base with an intelligent web service composition module for external web services; this module translates between ordinary XML-based data structures and explicit semantic representations for user queries and system responses. The presentation module renders the media content and the results generated by the services, and provides a detailed description of the content and its layout to the fusion module. The user can then employ multiple modalities, such as speech and gestures, to interact with the presented multimedia material in a multimodal way.

1 Introduction

The development of a context-aware, multimodal mobile interface to the Semantic Web [Fensel et al., 2003], i.e., to ontologies and web services, is a very interesting task, since it combines many state-of-the-art technologies such as ontology development, distributed dialog systems, standardized interface descriptions (EMMA^1, SSML^2, RDF^3, OWL-S^4, WSDL^5, SOAP^6, MPEG-7^7), and the composition of web services. In this contribution we describe the intermediate steps in the dialog system development process for the project SmartWeb [Wahlster, 2004], which was started in 2004 by partners from industry and academia.
^1 http://www.w3.org/TR/emma
^2 http://www.w3.org/TR/speech-synthesis
^3 http://www.w3.org/TR/rdf-primer
^4 http://www.w3.org/Submission/OWL-S
^5 http://www.w3.org/TR/wsdl
^6 http://www.w3.org/TR/soap
^7 http://www.chiariglione.org/mpeg

In our main scenario, the user carries a smartphone PDA and poses closed- and open-domain multimodal questions in the context of football games and a visit to a Football World Cup stadium. Many challenging tasks, such as interaction design for mobile devices with restricted computing power, have to be addressed: the user should be able to use the PDA as a question answering (QA) system, using speech and gestures to ask for information about players or games stored in ontologies, or for other up-to-date information, like weather forecasts, accessible through web services, Semantic Web pages (Web pages wrapped by semantic agents), or the Internet.

The partners of the SmartWeb project share experience from earlier dialog system projects [Wahlster, 2000; 2003; Reithinger et al., 2005b]. We followed guidelines for multimodal interaction, as explained in [Oviatt, 1999] for example, in the development process of our first demonstrator system [Reithinger et al., 2005a], which rests on the following principles: multimodality, since more modalities allow for more natural communication; encapsulation, since we separate the multimodal dialog interface proper from the application; standards, since adopting standards opens the door to scalability, because we can re-use our own as well as others' resources; and representation, since a shared representation and a common ontological knowledge base ease the data flow among components and avoid costly transformation processes. In addition, semantic structures are our basis for representing dialog phenomena such as multimodal references and user queries.
The same ontological query structures are input to the knowledge retrieval and web service composition process.

In the following we demonstrate the strength of Semantic Web technology for information-gathering dialog systems, especially the integration of multiple dialog components, and show how knowledge retrieval from ontologies and web services can be combined with advanced dialogical interaction, i.e., system-initiative callbacks, which represent a strong advancement over traditional QA systems. Traditional QA realizes, like a traditional NLP dialog system, a (recognize) - analyze - react - generate - (synthesize) pipeline [Allen et al., 2000]. Once a query is started, the information is pipelined until the end, which means that the user-system interaction is reduced to user and result messages. The types of dialogical phenomena we address and support include reference resolution, system-initiated clarification requests, and pointing gesture interpretation, among others. Support for underspecified questions and enumeration question types additionally demonstrates advanced QA functionality in a multimodal setting. One of the main contributions is the ontology-based integration of verbal and non-verbal system input (fusion) and output (system reaction).

The paper is organized as follows: we begin with an example interaction sequence; in section 3, we explain the dialog system architecture. In section 4, the ontological knowledge representation and web service access are described. Section 5 then gives a description of the underlying language parsing and discourse processing steps, and their integration. Conclusions about the success of the system so far and future plans are outlined in section 6.
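The difference between a strict QA pipeline and a system with system-initiative callbacks can be sketched in a few lines. This is our own minimal illustration, not SmartWeb code; the function names and slot keys (analyze, react, gps) are invented for the example.

```python
# Minimal sketch (not SmartWeb code): a traditional QA pipeline always runs
# through to an answer, while a dialogical system may break out of the
# pipeline with a system-initiative clarification request.

def analyze(query, context):
    """Toy semantic analysis: fill an event slot and a location slot."""
    return {
        "event": "leisure" if "spare time" in query else None,
        "location": context.get("gps"),
    }

def react(slots):
    """React stage: answer if the query is complete, otherwise clarify."""
    if slots["location"] is None:
        return ("clarify", "Where?")  # system-initiative callback to the user
    return ("answer", "Events in " + slots["location"])

def handle(query, context):
    return react(analyze(query, context))

print(handle("What can I do in my spare time on Saturday?", {}))
print(handle("What can I do in my spare time on Saturday?", {"gps": "Berlin"}))
```

The point of the sketch is that the react stage can address the user directly instead of merely passing results downstream, which is exactly what a pure user-message/result-message pipeline cannot do.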
2 Multimodal interaction sequence example

The following interaction sequence is typical for the SmartWeb dialog system.

(1) U: "When was Germany world champion?"
(2) S: "In the following 4 years: 1954 (in Switzerland), 1974 (in Germany), 1990 (in Italy), 2003 (in USA)"
(3) U: "And Brazil?"
(4) S: "In the following 5 years: 1958 (in Sweden), 1962 (in Chile), 1970 (in Mexico), 1994 (in USA), 2002 (in Japan)" + [team picture, MPEG-7 annotated]
(5) U: Pointing gesture on player Aldair + "How many goals did this player score?"
(6) S: "Aldair scored none in the championship 2002."
(7) U: "What can I do in my spare time on Saturday?"
(8) S: "Where?"
(9) U: "In Berlin."
(10) S: The cinema program, festivals, and concerts in Berlin are listed.

The first and second enumeration questions are answered by deductive reasoning within the ontological knowledge base modeled in OWL [Krötzsch et al., 2006], representing the static but very rich implicit knowledge that can be retrieved. The second example, beginning with (7), evokes a dynamically composed web service lookup. It is important to note that the query representation is the same for all access methods to the Semantic Web (cf. section 5.1) and is defined by foundational and domain-specific ontologies. Had the GPS coordinates been accessible from the mobile device, the clarification question would have been omitted.

3 Architecture approach

A flexible dialog system platform is required in order to allow for true multi-session operation with multiple concurrent users of the server-side system, as well as to support audio transfer and other data connections between the mobile device and a remote dialog server. Such systems have been developed, like the Galaxy Communicator [Cheyer and Martin, 2001] (cf. also [Seneff et al., 1999; Thorisson et al., 2004; Herzog et al., 2004; Bontcheva et al., 2004]), and commercial platforms from major vendors like VoiceGenie, Kirusa, IBM, and Microsoft use X+V, HTML+SALT, or derivatives for speech-based interaction on mobile devices. For our purposes these platforms are too limited. To implement new interaction metaphors and to use Semantic Web based data structures for both dialog system internal and external communication, we developed a platform designed around Semantic Web data structures for NLP components and backend knowledge server communication. The basic architecture is shown in figure 1.

Figure 1: SmartWeb handheld architecture.

It consists of three basic processing blocks: the PDA client, the dialog server, which comprises the dialog manager, and the Semantic Web access system.

On the PDA client, a local Java-based control unit takes care of all I/O and is connected to the GUI controller. A local VoiceXML-based dialog system resides on the PDA for interaction during link downtimes.

The dialog server system platform instantiates one dialog server for each call and connects the multimodal recognizer for speech and gesture recognition. The dialog system instantiates and sends the requests to the Semantic Mediator, which provides the umbrella for all the different access methods to the Semantic Web that we use. It consists of an open-domain QA system, a Semantic Web service composer, Semantic Web pages (wrapped by semantic agents), and a knowledge server.

The dialog system consists of different, self-contained processing components. To integrate them we developed a Java-based hub-and-spoke architecture [Reithinger and Sonntag, 2005]. The most important processing modules in the dialog system connected in the IHUB are: a speech interpretation component (SPIN), a modality fusion and discourse component (FADE), a system reaction and presentation component (REAPR), and a natural language generation module (NIPSGEN), all discussed in section 5.
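The hub-and-spoke integration of these modules can be pictured with a toy dispatcher. The Hub class below is our own sketch, not the actual Java IHUB API, and the registered lambdas only mimic the modules named above.

```python
# Toy hub-and-spoke dispatcher (our own sketch, not the actual IHUB):
# modules register under a name and the hub forwards each message along a
# declared chain, so components never call each other directly.

class Hub:
    def __init__(self):
        self.modules = {}

    def register(self, name, fn):
        self.modules[name] = fn

    def run(self, chain, message):
        # Forward the message through each registered module in turn.
        for name in chain:
            message = self.modules[name](message)
        return message

hub = Hub()
hub.register("SPIN",    lambda m: m + ["semantic interpretation"])
hub.register("FADE",    lambda m: m + ["fused, context-enriched query"])
hub.register("REAPR",   lambda m: m + ["system reaction"])
hub.register("NIPSGEN", lambda m: m + ["generated text"])

trace = hub.run(["SPIN", "FADE", "REAPR", "NIPSGEN"], ["user utterance"])
print(trace)
```

Because routing is configured in the hub rather than hard-wired between modules, a clarification turn can simply use a different chain, which is what makes the architecture flexible for the misinterpretation and clarification flows discussed below.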
An EMMA Unpacker/Packer (EUP) component provides the communication with the dialog server and the Semantic Web subsystem external to the multimodal dialog manager, and communicates with the other modules of the dialog server, the multimodal recognizer, and the speech synthesis system.

When processing a user turn, the data normally flows through SPIN → FADE → REAPR → Semantic Mediator → REAPR → NIPSGEN. However, the data flow is often more complicated when, for example, misinterpretations and clarifications are involved.

4 Ontology representation and web services

Figure 2: A SmartMedia instance representing the decomposition of the Brazil 1998 world cup football team image.

The ontological infrastructure of SmartWeb, the SWIntO (SmartWeb Integrated Ontology), is based on an upper model ontology realized by merging well-chosen concepts from two established foundational ontologies, DOLCE [Gangemi et al., 2002] and SUMO [Niles and Pease, 2001], into a single one: the SmartWeb foundational ontology SmartSUMO [Cimiano et al., 2004]. Domain-specific knowledge (sportevent, navigation) is defined in dedicated ontologies modeled as sub-ontologies of SmartSUMO. The SWIntO integrates question answering specific knowledge of a discourse ontology (DiscOnto) and the representation of multimodal information of a media ontology (SmartMedia). The data exchange is RDF-based.

We realized the discourse ontology (DiscOnto) with particular attention to the modeling of discourse interactions in QA scenarios. The DiscOnto provides concepts for dialogical interaction with the user as well as more technical request-response concepts for data exchange with the Semantic Web subsystem, including answer status, which is important in interactive systems. In particular, DiscOnto comprises concepts for multimodal dialog management, a dialog act taxonomy, lexical rules for syntactic-semantic mapping, HCI concepts (e.g., a pattern language for interaction design [Sonntag, 2005]), and concepts for questions, question focus, semantic answer types [Hovy et al., 2001], and multimodal results [Sonntag and Romanelli, 2006].

Information exchange between the components of the server-side dialog system is based on the W3C EMMA standard, which is used to realize containers for the ontological instances representing, e.g., multimodal input interpretations. SWEMMA is our extension to the EMMA standard which introduces additional Result structures in order to represent component output. On the ontological level we modeled an RDF/S representation of EMMA/SWEMMA.

SmartMedia is an MPEG-7-based media ontology and an extension to [Hunter, 2001; Benitez et al., 2002] that we use to represent output results, offering functionality for multimedia decomposition in space, time, and frequency (mpeg7:SegmentDecomposition), file format and coding parameters (mpeg7:MediaFormat), and a link to the upper model ontology (smartmedia:aboutDomainInstance). In order to close the semantic gap between the different levels of media representation, the smartmedia:aboutDomainInstance property has been located in the top-level class smartmedia:Segment. The link to the upper model ontology is inherited by all segments of a media instance decomposition, guaranteeing deep semantic representations for the smartmedia instances referencing the specific media object and for the segments making up its decomposition. Figure 2 shows an example of this procedure applied to an image of the Brazilian football team in the final match of the World Cup 1998, as introduced in the interaction example.
In the example, an instance of the class mpeg7:StillRegion, representing the complete image, is decomposed into different mpeg7:StillRegion instances representing the segments of the image which show individual players. The mpeg7:StillRegion instance representing the entire picture is then linked to a sportevent:MatchTeam instance, and each segment of the picture is linked to a sportevent:FieldFootballPlayer instance or sub-instance. These representations offer a framework for gesture and speech fusion when users interact with Semantic Web results such as MPEG-7-annotated images, maps with points of interest, or other interactive graphical media obtained from the ontological knowledge base or multimedia web services.

4.1 Multimodal access to web services

To connect to web services we developed a semantic representation formalism based on OWL-S and a service composition component able to interpret an ontological user query. We extended the OWL-S ontologies to flexibly compose and invoke web services on the fly, gaining a sophisticated representation of the information gathering services fundamental to SmartWeb.

Sophisticated data representation is the key to developing a composition engine that exploits the semantics of web service annotations and query representations. The composition engine follows a plan-based approach as explained, e.g., in [Ghallab et al., 2004]. It infers the initial and goal state from the semantic representation of the user query, whereas the set of semantic web services is considered as the planning operators. The output gained from automatic web service invocation is represented in terms of instances of the SmartWeb domain ontologies and enriched by additional media instances, if available. Media objects are represented in terms of the SmartMedia ontology (see above) and are annotated automatically during service execution.
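The inheritance of the domain link through a media decomposition can be sketched with a toy segment tree. The class below is our own; only the property and instance names echo the paper's smartmedia/mpeg7/sportevent vocabulary, and the identifier strings are invented.

```python
# Toy model of smartmedia segment decomposition (our own sketch): a segment
# without its own aboutDomainInstance link inherits the link of the media
# object it was decomposed from, so every segment has a deep semantic link.

class Segment:
    def __init__(self, ident, about=None, parent=None):
        self.ident = ident      # e.g. an mpeg7:StillRegion identifier
        self.about = about      # smartmedia:aboutDomainInstance target
        self.parent = parent    # enclosing segment in the decomposition

    def about_domain_instance(self):
        if self.about is not None:
            return self.about
        return self.parent.about_domain_instance() if self.parent else None

# Whole team picture, linked to the domain instance of the team.
photo = Segment("StillRegion:whole", about="sportevent:MatchTeam#Brazil1998")
# A player region with its own link, and an unannotated background region.
aldair = Segment("StillRegion:aldair",
                 about="sportevent:FieldFootballPlayer#Aldair", parent=photo)
background = Segment("StillRegion:background", parent=photo)

print(aldair.about_domain_instance())      # the player instance
print(background.about_domain_instance())  # inherited team instance
```

This inheritance is what lets a pointing gesture on any region of the picture, annotated or not, be mapped to some domain instance during fusion.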
This enables the dialog manager to support multimodal interaction with web service results. A key feature of the service composition engine is the detection of underspecified user queries, i.e., the lack of required web service input parameters. In these cases the composition engine is able to formulate a clarification request as specified within the discourse ontology (DiscOnto). This points out the missing pieces of information to be forwarded to the dialog manager. The composition engine then expects a clarification response, enabling it to replan on the basis of the refined ontological user query.

Figure 3: Data flow for the processing of a clarification request as in the example (7-10), "What can I do in my spare time on Saturday?".

Following the interaction example (7-10), the composition engine searches for a web service handling activity event types and gets its description. Normally, the context module incorporated in the dialog manager would complete the query with the venue obtained from a GPS receiver attached to the handheld device. In the case of no GPS signal, for instance indoors, the composition engine asks for the missing parameter (cf. figure 3), which makes the composition engine more robust and thus more suitable for interactive scenarios. In the interaction example (7-10) the composition planner considers the T-Info EventService appropriate for answering the query. This service requires both date and location for looking up events. While the date is already mentioned in the initial user query, the location is asked from the user by clarification request. After the location information (dialog step (9) in the example: In Berlin) is obtained from the user, the composition engine invokes in turn two T-Info (DTAG) web services^8 offered by Deutsche Telekom AG (see also [Ankolekar et al., 2006]): first the T-Info EventService as already mentioned above, and then the T-Info MapService for calculating an interactive map showing the venue as a point of interest.
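The missing-parameter check at the heart of this clarification loop can be sketched as follows. The service name is taken from the paper; the dictionaries and the compose function are our own invention, not the actual planner's data structures.

```python
# Sketch of underspecification detection (our own data structures): the
# composition engine compares the inputs a service requires with the slots
# filled in the query, and either emits a clarification request or invokes.

EVENT_SERVICE = {"name": "T-Info EventService",
                 "requires": ("date", "location")}

def compose(query):
    missing = [p for p in EVENT_SERVICE["requires"] if p not in query]
    if missing:
        # Forwarded to the dialog manager as a clarification request.
        return {"act": "ClarificationRequest", "missing": missing[0]}
    return {"act": "Invoke", "service": EVENT_SERVICE["name"],
            "args": dict(query)}

first = compose({"date": "Saturday"})                        # location missing
second = compose({"date": "Saturday", "location": "Berlin"})  # after replanning
print(first)
print(second)
```

The second call stands in for the replanning step after the user's clarification response has refined the query.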
Text-based event details, additional image material, and the location map are semantically represented (the map in MPEG-7) and returned to the dialog engine.

5 Semantic parsing and discourse processing

Semantic parsing and the other discourse processing steps are reflected on the interaction device as advanced perceptual feedback for the user. The following screenshot illustrates the two most important processing steps for system-user interaction: the feedback on the natural language understanding step and the presentation of multimodal results. The semantic parser produces a semantic query (illustrated on the left in figure 4), which is presented to the user in nested attribute-value form. The web service results (illustrated on the right in figure 4) for the interaction example (7-10) are presented in a multimodal way, combining text, image, and speech: 5 Veranstaltungen (five events).

Figure 4: Semantic query (illustrated on the left) and web service results (illustrated on the right).

^8 http://services.t-info.de/soap.index.jsp

5.1 Language understanding with SPIN and text generation with NIPSGEN

The parsing module is based on the semantic parser SPIN [Engel, 2005]. A syntactic analysis of the input utterance is not performed; instead, the ontology instances are created directly from the word level. The typical advantages of a semantic parsing approach are that processing is faster and more robust against speech recognition errors and disfluencies produced by the user, and that the rules are easier to write and maintain. Also, multilingual dialog systems are easier to realize, as a syntactic analysis is not required for each supported language. A disadvantage is that the complexity of the possible utterances is somewhat limited, but this is acceptable for most dialog systems.

One outstanding feature of the parser is the possibility of order-independent matching, i.e., the order of elements in the input stream is ignored if order-independent matching is active.
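Order-independent matching can be pictured as multiset containment of a rule's trigger elements in the input. The matcher below is a deliberately naive sketch of that idea, not SPIN's implementation, and the example rule tokens are invented.

```python
# Naive order-independent matcher (illustration only, not SPIN): a rule
# fires when all of its trigger tokens occur in the input, in any order.

from collections import Counter

def matches(rule_tokens, input_tokens):
    need = Counter(rule_tokens)   # tokens the rule requires (with counts)
    have = Counter(input_tokens)  # tokens present in the input stream
    return all(have[tok] >= n for tok, n in need.items())

rule = ["was", "world", "champion"]  # trigger tokens of a hypothetical rule
print(matches(rule, "when was brazil world champion".split()))
print(matches(rule, "world champion when was brazil".split()))  # reordered
print(matches(rule, "when was brazil founded".split()))
```

Trying all rules against all token subsets in this unordered fashion is what drives the worst-case complexity up, motivating the off-line optimizations described next.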
Order-independent matching simplifies the processing of free word order languages like German and increases robustness. It can, however, have a huge impact on performance, as parsing in general becomes an NP-complete task [Huynh, 1983]. To ensure fast processing notwithstanding, several off-line optimizations, like rule ordering, have been implemented which increase the performance for rule sets that are typical for dialog systems. The average processing time is about 50 ms per utterance, which ensures direct feedback to user inputs.

The knowledge base of the parser currently consists of 544 rules and 2250 lexicon entries. To give an impression of what the rules look like, four rules are provided as examples to process the utterance "When was Brazil world champion". The first one transforms the word Brazil to the ontology instance Country:

  Brazil → Country(name:BRAZIL)

The second one transforms countries to teams, as each country can stand for a team in our domain:

  $C=Country() → Team(origin:$C)

The third one processes "when", generating an instance of the type TimePoint which is marked as questioned:

  when → TimePoint(variable:QEVariable(focus:text))

The fourth rule processes the verbal phrase "<TimePoint> was <Team> world champion":

  $TP=TimePoint() was $TM=Team() world champion →
    QEPattern(patternArg:Tournament(winner:$TM, happensAt:$TP))

The text generation module uses the same SPIN parser that is used in the language understanding module, together with a TAG grammar which is modeled similarly to the XTAG grammar^9. The inputs of the generation module are instances of SWIntO representing the search results. These results are then verbalized in different ways, e.g., as a heading, as a row of a table, or as text to be synthesized. A processing option indicates the current purpose.

^9 http://www.cis.upenn.edu/~xtag/

The input is transformed into an utterance in four steps:

1. An intermediate representation is built up on the phrase level. The required rules are domain dependent.
2.
A set of domain-independent rules transforms the intermediate representation into a derivation tree for the TAG grammar.
3. The actual syntax tree is constructed using the derivation tree. After the tree has been built up, the features of the tree nodes are unified.
4. The correct inflections for all lexical leaves are looked up in the lexicon. Traversing the lexical leaves from left to right produces the result text.

In the SmartWeb system, currently 179 domain-dependent generation rules and 38 domain-independent rules are used.

5.2 Multimodal discourse processing with FADE

An important aspect of SmartWeb is its context-aware processing strategy. All recognized user actions are processed with respect to their situational and discourse context. A user is thus not required to pose separate and unconnected questions. In fact, she might refer directly to the situation, e.g., "How do I get to Berlin from here?", where here is resolved to GPS information, or to previous contributions (as in the elliptical expression "And in 2002?" in the context of a previously posed question "Who won the Fifa World Cup in 1990?"). The interpretation of user contributions with respect to their discourse context is performed by a component called Fusion and Discourse Engine (FADE) [Pfleger, 2005].^10 The task of FADE is to integrate the verbal and nonverbal user contributions into a coherent multimodal representation to be enriched with contextual information, e.g., through the resolution of referring and elliptical expressions.

The basic architecture of FADE consists of two interwoven processing layers: (1) a production rule system, PATE, that is responsible for the reactive interpretation of perceived monomodal events, and (2) a discourse modeler, DiM, that is responsible for maintaining a coherent representation of the ongoing discourse and for the resolution of referring and elliptical expressions.

In the following two subsections we briefly discuss some context-related phenomena that can be resolved by FADE.
Resolution of referring expressions

A key feature of the SmartWeb system is that it is capable of dealing with a broad range of referring expressions as they occur in natural dialogs. This means the user can employ deictic references that are accompanied by a pointing gesture (such as in "How often did this team [pointing gesture] win the World Cup?"), but also, if the context provides enough disambiguating information, references without any accompanying gesture (e.g., if the question is uttered in the context of a preceding request like "When was Germany World Cup champion for the last time?").

^10 The situational context is maintained by another component called SitCom that is not discussed in this paper.
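The two resolution strategies just described, gesture-based and context-based, can be sketched together. The resolve function and its entity dictionaries are our own toy model, not FADE's DiM.

```python
# Toy reference resolver (our own sketch, not FADE/DiM): a referring
# expression like "this team" resolves to the pointing-gesture target if one
# accompanies it, and otherwise to the most recently mentioned discourse
# entity of a compatible type.

def resolve(ref_type, gesture_target, discourse_history):
    if gesture_target is not None:
        return gesture_target  # deictic reference resolved by the gesture
    for entity in reversed(discourse_history):  # most recent first
        if entity["type"] == ref_type:
            return entity      # anaphoric reference resolved from context
    return None  # unresolved: a candidate for a clarification request

history = [
    {"type": "Team", "name": "Germany"},
    {"type": "TimePoint", "name": "1990"},
]

print(resolve("Team", None, history))
print(resolve("Team", {"type": "Team", "name": "Brazil"}, history))
```

Gesture targets take precedence here because a pointing gesture is the stronger, more explicit cue; the recency-based fallback stands in for the discourse modeler's context search.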