
1 Using subtitles to deal with Out-of-Domain interactions
Daniel Magarreiro, Luísa Coheur, Francisco S. Melo
INESC-ID / Instituto Superior Técnico, Lisbon, Portugal
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa (technology from seed)
03-02-2016

2 Index
Introduction
Building the Subtle Corpus
The Say Something Smart Engine
–Corpora Indexing and candidate extraction
–Choosing the answer
Evaluation
Meet Filipe
Conclusions and Future Work

3 Motivation
Users often insist on confronting domain-specialized virtual assistants with Out-Of-Domain (OOD) inputs.

4 Motivation
Considering that:
–people become more engaged with these applications if OOD requests are addressed (Bickmore and Cassell, 2000; Patel et al., 2006)
–system designers are not able to successfully anticipate all the possible OOD requests

5 Motivation
A possible approach:
–explore the (semi-)automatic creation/enrichment of the knowledge base of virtual assistants/chatbots, taking advantage of the vast number of dialogues available on the web.

6 Motivation
We will focus on movie subtitles (for now)

7 Motivation
Movie Subtitles
–the web offers a vast number of repositories with a comprehensive archive of subtitle files
  this allows data redundancy
  example:
  –How are you? Fine
  –So, how are you? Fine
  –How are you? Fine
  –How are you? I’m dying
–subtitles are often available in multiple languages

8 Motivation
Our approach
–Build a corpus of interactions from the subtitles
  the Subtle corpus
–Test a set of techniques to select an adequate response (from Subtle) to a user request
  Deployed in the Say Something Smart engine
–Evaluate the plausibility of the selected answers

10 Building the Subtle Corpus
The Subtle corpus will be a set of interactions
–Like Edgar’s knowledge base
Each interaction is a pair of turns (T, A):
–T is the trigger
–A is an answer (to the trigger)
Example:
–(T: So how old are you?, A: That’s none of your business)

11 Building the Subtle Corpus
Problem:
–Extracting interactions from subtitle files
–Example:
  770
  01:01:05,537 --> 01:01:08,905
  And makes an offer so ridiculous,

  771
  01:01:09,082 --> 01:01:11,881
  the farmer is forced to say yes.

  772
  01:01:12,752 --> 01:01:15,494
  We gonna offer to buy Candyland?

12 Building the Subtle Corpus
Starting point:
–2 GB of subtitles in Portuguese and English from OpenSubtitles
Building Subtle:
–Cleaning data
  Example: [TIRES SCREECHING]
–Finding real turns
  Based on handcrafted rules (previous example)
  The user can configure the maximum time allowed between two slots for them to be considered part of a dialogue

13 Building the Subtle Corpus
SubId - 100000
DialogId - 1
Diff - 3715
T - What a son!
A - How about my mother?

SubId - 100000
DialogId - 2
Diff - 80
T - How about my mother?
A - Tell me, did my mother fight you?

SubId - 100000
DialogId - 3
Diff - 1678
T - Tell me, did my mother fight you?
A - Did she fight me?
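The extraction step described on the previous slides can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `parse_srt`, `build_interactions` and `max_gap_ms` are hypothetical, `Diff` is assumed to be the gap in milliseconds between the end of one slot and the start of the next, and the handcrafted rules that merge slots belonging to the same turn are not reproduced.

```python
import re
from dataclasses import dataclass

# Matches an SRT timing line such as "01:01:05,537 --> 01:01:08,905".
TIMESTAMP = re.compile(
    r"(\d+):(\d+):(\d+),(\d+)\s*-->\s*(\d+):(\d+):(\d+),(\d+)"
)

SAMPLE = """770
01:01:05,537 --> 01:01:08,905
And makes an offer so ridiculous,

771
01:01:09,082 --> 01:01:11,881
the farmer is forced to say yes.

772
01:01:12,752 --> 01:01:15,494
We gonna offer to buy Candyland?
"""


@dataclass
class Slot:
    start_ms: int
    end_ms: int
    text: str


def to_ms(h, m, s, ms):
    """Convert hour/minute/second/millisecond strings to milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_srt(srt_text):
    """Parse SRT text into slots, dropping bracketed sound descriptions."""
    slots = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < 3:
            continue
        match = TIMESTAMP.match(lines[1])
        if not match:
            continue
        g = match.groups()
        text = " ".join(lines[2:])
        text = re.sub(r"\[[^\]]*\]", "", text).strip()  # e.g. [TIRES SCREECHING]
        if text:
            slots.append(Slot(to_ms(*g[:4]), to_ms(*g[4:]), text))
    return slots


def build_interactions(slots, max_gap_ms=4000):
    """Pair consecutive slots as (T, A) when the gap is at most max_gap_ms."""
    interactions = []
    for trigger, answer in zip(slots, slots[1:]):
        diff = answer.start_ms - trigger.end_ms
        if 0 <= diff <= max_gap_ms:
            interactions.append({"T": trigger.text, "A": answer.text, "Diff": diff})
    return interactions


for pair in build_interactions(parse_srt(SAMPLE)):
    print(pair)
```

Lowering `max_gap_ms` keeps only slots that follow each other closely, which is one way to realise the configurable maximum time between two slots.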

14 Building the Subtle Corpus
Language    # Movies  # Movies Ok  # Interactions  # Average
English     5,764     5,665        5,693,811       1,005
Portuguese  3,701     3,598        3,322,683       923

16 The Say Something Smart engine
The Say Something Smart Engine (SSS) uses the Subtle corpus to get an answer to a given user request.
User: Where do you live?
SSS: Anywhere I feel like!
Subtle:
(T10: What was your mother’s name?, A10: The mother’s name isn’t important.)
(T121: Where do you live?, A121: Beaver Creek, off the Route 10.)

17 The Say Something Smart engine
Problem:
–Because the distance between the user request and each interaction from the Subtle corpus has to be computed, the number of interactions considered must be limited.

18 The Say Something Smart engine
SSS main steps:
–Corpora indexing
–Candidate extraction
–Choosing the answer

19 The Say Something Smart engine
SSS main steps:
–Corpora indexing
–Candidate extraction
  Tokenizers, stemmers, and stop-word filters
  –the default ones for English
  –snowball analyzer for the Portuguese language
  The number of retrieved interactions can be configured
–Choosing the answer

21 The Say Something Smart engine
u: Do you have a brother?
(T4: You don’t have to go brother., A4: I’m not your brother.)
(T5: You have a brother?, A5: Yeah, I’ve got a brother, man. You know that.)
(T6: Joe doesn’t have a brother?, A6: No brother.)
(T7: Brother, do you have tooth paste?, A7: What brother?)
(T8: Have you seen my brother?, A8: He’s not your brother anymore.)
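Indexing and candidate extraction can be illustrated with a small inverted index. The deck relies on Lucene with its analyzers; the sketch below is a plain-Python stand-in, not Lucene's API, and the stop-word list, the `limit` parameter and the function names are assumptions made for the example (stemming is left out).

```python
import re
from collections import Counter, defaultdict

# Illustrative stop-word list; a real analyzer ships a much longer one.
STOP_WORDS = {"a", "the", "do", "you", "to", "have", "my", "your", "s", "t"}

INTERACTIONS = [
    ("You don't have to go brother.", "I'm not your brother."),
    ("You have a brother?", "Yeah, I've got a brother, man. You know that."),
    ("Joe doesn't have a brother?", "No brother."),
    ("Brother, do you have tooth paste?", "What brother?"),
    ("Have you seen my brother?", "He's not your brother anymore."),
    ("Where do you live?", "Right here."),
]


def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_index(interactions):
    """Map each trigger token to the positions of the interactions containing it."""
    index = defaultdict(set)
    for i, (trigger, _answer) in enumerate(interactions):
        for token in tokenize(trigger):
            index[token].add(i)
    return index


def retrieve(index, interactions, request, limit=25):
    """Return up to `limit` interactions sharing the most tokens with the request."""
    hits = Counter()
    for token in set(tokenize(request)):
        for i in index.get(token, ()):
            hits[i] += 1
    return [interactions[i] for i, _count in hits.most_common(limit)]


index = build_index(INTERACTIONS)
for trigger, answer in retrieve(index, INTERACTIONS, "Do you have a brother?"):
    print(trigger, "->", answer)
```

Only the retrieved candidates are passed on to the answer-selection step, which keeps the number of similarity computations bounded by `limit`.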

22 The Say Something Smart engine
Being given:
–A user request u
–The set of interactions, U, retrieved by Lucene
For each (T_i, A_i) in U, the measures M_j are combined into a score, where w_j is the weight assigned to measure M_j.
Measures M_1, M_2 and M_3 are based on Jaccard similarity:
J(A, B) = |A ∩ B| / |A ∪ B|
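The scoring formula itself is not legible in this transcript. One reading that is consistent with the per-measure weights w_j, and with the percentage weights listed later for setting S5, is a weighted sum of the four measures. The form below is an assumption, and the argument list of each M_j is illustrative:

```latex
\[
  \mathrm{score}(T_i, A_i) \;=\; \sum_{j=1}^{4} w_j \, M_j(u, T_i, A_i),
  \qquad
  J(A, B) \;=\; \frac{|A \cap B|}{|A \cup B|}
\]
```

Under this reading, the answer A_i of the highest-scoring interaction would be the one returned to the user.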

23 The Say Something Smart engine
M_1: Jaccard similarity between user request and trigger
u: What’s your mother’s name?
(T9: How nice. What’s your mother’s name?, …)
(T10: What was your mother’s name?, A10: The mother’s name isn’t important.)
(T11: What’s your name?, …)
(T12: What’s the name your mother and father gave you?, …)
(T13: Your mother? how dare you to call my mother’s name?, …)
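As a worked example of M_1, the sketch below computes the Jaccard similarity between the request and each trigger listed on this slide. The tokenization (lowercased words, apostrophes kept, no stop-word removal or stemming) is an assumption; the engine's own analyzers would give different token sets and therefore different values.

```python
import re


def tokens(text):
    """Lowercased word set; apostrophes are kept inside words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B|, defined as 0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


REQUEST = "What's your mother's name?"
TRIGGERS = {
    "T9": "How nice. What's your mother's name?",
    "T10": "What was your mother's name?",
    "T11": "What's your name?",
    "T12": "What's the name your mother and father gave you?",
    "T13": "Your mother? how dare you to call my mother's name?",
}

m1 = {name: jaccard(tokens(REQUEST), tokens(text)) for name, text in TRIGGERS.items()}
for name, value in sorted(m1.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f}")
```

With this naive tokenization the short trigger T11 ranks first, which illustrates how sensitive a single measure is to text processing and why several measures are combined.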

24 The Say Something Smart engine
M_2: a higher score is given to the most “frequent” answer (Jaccard)
u: How are you?
(T14: Where do you live?, A14: Right here.)
(T15: Where are you living?, A15: Right here.)
(T16: Where do you live?, A16: New York City.)
(T17: Where do you live?, A17: Dune Road.)
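The slide does not spell out how answer "frequency" is computed. One possible reading, sketched here purely as an assumption, scores each candidate answer by its average Jaccard similarity to the other candidate answers, so that answers repeated across interactions score higher. The function name `answer_frequency_scores` is hypothetical.

```python
import re


def tokens(text):
    """Lowercased word set; apostrophes are kept inside words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B|, defined as 0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def answer_frequency_scores(answers):
    """Average similarity of each answer to all other candidate answers."""
    token_sets = [tokens(a) for a in answers]
    scores = []
    for i, current in enumerate(token_sets):
        others = [jaccard(current, other)
                  for j, other in enumerate(token_sets) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    return scores


ANSWERS = ["Right here.", "Right here.", "New York City.", "Dune Road."]
for answer, value in zip(ANSWERS, answer_frequency_scores(ANSWERS)):
    print(f"{answer!r}: {value:.3f}")
```

On the four answers from the slide, the two occurrences of "Right here." support each other, while the two place names get no support.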

25 The Say Something Smart engine
M_3: Jaccard similarity between the user request and the answer
u: What’s your mother’s name?
(T9: How nice. What’s your mother’s name?, A9: Vickie.)
(T10: What was your mother’s name?, A10: The mother’s name isn’t important.)

26 The Say Something Smart engine
M_4: Time difference between trigger and answer
u: Are you joking?
(T: You're a joke! You're a joke!, A: Linda Kasabian gives birth to a son. She names the child Angel.)

28 Evaluation
Evaluation Setup
–Filipe, online since September 2013
–103, user requests
  20 were randomly selected

29 Evaluation
Experiment 1: Are subtitles adequate?
–Three human annotators
–First 25 interactions returned by Lucene to the 20 requests
–Question: is there at least one plausible answer in the 25 candidates?

30 Evaluation
Experiment 1: Are subtitles adequate?
Results
–Evaluator 1: “What country do you live?” not ok;
  Evaluator 3 considered “it depends” as a plausible answer
–Evaluator 2: “What country do you live?” not ok; “Are you a loser?” not ok;
  Evaluators 2 and 3 considered that “So what? You want to hit me?” or “Shut up.” were plausible answers
–Evaluator 3: “Where is the capital of Japan?” not ok;
  Evaluators 1 and 2 considered that “58% don’t know” was a plausible answer

31 Evaluation
Experiment 1: Are subtitles adequate?
The three annotators agreed that 17 out of 20 turns had a plausible answer.

32 Evaluation
Experiment 2: Answer selection
–Settings (S1, ..., S5):
  S1 – Only takes into account M1.
  S2 – Only takes into account M2.
  S3 – Takes into account M1 and M2.
  S4 – Takes into account M1, M2 and M3.
  S5 – Takes into account all four measures.
–Weights:
  S1–S4: the same weight was given to the measures.
  S5:
  –40% weight for M1
  –30% weight for M2
  –20% weight for M3
  –10% weight for M4.
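The five settings can be written as weight vectors over (M1, M2, M3, M4). The sketch below assumes that the measures are combined by a weighted sum and that equal weights are normalised to sum to 1; both are assumptions, and the measure values in `CANDIDATES` are made-up numbers for illustration, not results from the evaluation.

```python
# Weight vectors over (M1, M2, M3, M4), one per setting from the slide.
SETTINGS = {
    "S1": (1.0, 0.0, 0.0, 0.0),
    "S2": (0.0, 1.0, 0.0, 0.0),
    "S3": (0.5, 0.5, 0.0, 0.0),
    "S4": (1 / 3, 1 / 3, 1 / 3, 0.0),
    "S5": (0.4, 0.3, 0.2, 0.1),
}

# Hypothetical candidates with made-up measure values (M1, M2, M3, M4).
CANDIDATES = [
    {"answer": "Right here.", "measures": (0.6, 0.9, 0.0, 0.5)},
    {"answer": "New York City.", "measures": (1.0, 0.1, 0.0, 0.5)},
]


def score(measures, weights):
    """Weighted sum of the measure values (assumed combination rule)."""
    return sum(w * m for w, m in zip(weights, measures))


def select_answer(candidates, setting):
    """Return the answer of the highest-scoring candidate under a setting."""
    weights = SETTINGS[setting]
    best = max(candidates, key=lambda c: score(c["measures"], weights))
    return best["answer"]


for name in SETTINGS:
    print(name, "->", select_answer(CANDIDATES, name))
```

With these illustrative values, S1 prefers the candidate whose trigger matches best, while the settings that include M2 prefer the more frequent answer.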

33 Evaluation
Experiment 2: Answer selection
–21 people evaluated the returned response, given the 20 requests

34 Evaluation
Experiment 2: Answer selection
–Results
S1      S2      S3      S4      S5
39.29%  45.24%  46.90%  61.67%  51.19%
S4 – Takes into account M1, M2 and M3.

36 Meet Filipe (or “Filaipe”)

38 Conclusions and Future Work
We have built the Subtle corpus (PT and EN)
Tested several techniques to extract a plausible answer in the Say Something Smart engine
Still much room for improvement
–Organizing data
  Detecting paraphrases
  …
–Text processing
  Synonyms
  Named entities
–Combining the measures
–Adding other corpora
–Taking context into consideration
–…
