Controlled Experiment Design with Latin Squares by Example

Elder Cirilo
Pontifícia Universidade Católica do Rio de Janeiro, Laboratório de Engenharia de Software
Content based on the presentation by Eduardo Aranha, 1º Encontro de Engenharia de Software Experimental, Natal

Objectives
Present and motivate the use of controlled experiments in software engineering based on the Latin square technique.
Illustrate experiment designs that used the Latin square as a technique for controlling factors.
Illustrate the statistical analysis process in controlled experiments based on the Latin square.

Agenda
Illustrative example
Controlled experiments
Statistical experiment designs
  Completely randomized design
  Randomized complete block design
  Latin square design
Examples of experiments based on the Latin square
Statistical analysis in designs based on the Latin square, using the R tool.

Illustrative Example
How can different Model-based Testing (MBT) technologies be evaluated? A, B, C

Experiment Context
Choose the best MBT technology: A, B, or C
  Efficiency
  Effectiveness
  Ease of use
  Ease of maintenance
Software industry
  Professionals with varied profiles
  Different projects
  Limited resources
Requires scientifically grounded results

Illustrative Experiment Design
Evaluate MBT in three ongoing projects
Scope: test design, aiming to minimize costs.
[Diagram labels: Requirements, Developer, Approach (A, B, C)]

Possible Results
[Figure: one chart each for Project 1, Project 2, and Project 3, with a time axis from 10 to 40 h]
Was the MBT technology really responsible for the results?
If the study is run on new projects, will the result be the same?

How Can We Obtain More Relevant Results?
Control the elements present in the environment where the experiment is run.
Eliminate, reduce, or dilute the effect of these elements
We can control:
  the development environment
  the participants' experience
  the project complexity

Design 1 – Controlling the Environment
A single project
A single developer
All techniques are used
Order in which the techniques are used: A 1st, B 2nd, C 3rd

Considerations on the Proposal
Fixing the project and the developer eliminates some undesirable effects
  Project complexity
  Expertise and experience of the different developers
  ...
Is there a learning effect on the developer's side?
Training can mitigate this kind of effect, but is there any other threat associated with the experiment design?

Design 2 – Controlling Learning
Control
  Same developer profile
  Personal interest determines the technique
  Training in the chosen approach
A B C

The effect of the requirements remains eliminated
The developer effect is back
  However, it was reduced by fixing the participant profile
Could the choice of who uses each technique be biased?

Design 3 – Avoiding Bias
Control
  Same developer profile
  A random draw determines the technique
  Training in the drawn technique
B A C

With a random draw, the probability of a biased result is low.
Is a single observation for each approach enough?

Design 4 – Increasing Observations
A larger number of developers with the same profile
Random draw of the technique to be used, followed by training
B A C B A C
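
The random draw in Design 4 can be sketched in R, the tool used later in these slides. This is only a minimal illustration under assumptions: the developer identifiers D1 to D6 and the seed are hypothetical, and six developers are assumed so that each technique is observed twice.

```r
# Hypothetical participants: six developers with the same profile.
developers <- paste0("D", 1:6)
techniques <- c("A", "B", "C")

set.seed(42)  # fixed seed only so the draw can be reproduced

# Each technique appears twice; sample() shuffles the order at random,
# so no one (researcher or participant) chooses who uses which technique.
assignment <- data.frame(
  developer = developers,
  technique = sample(rep(techniques, times = 2))
)
print(assignment)
```

Drawing from a balanced pool (each technique repeated the same number of times) keeps the number of observations per technique equal, which simplifies the later analysis.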

Controlled Experiment
Procedures that deliberately change the variables of a process/system
Observe changes in the output and identify their causes
Collect evidence against a formulated hypothesis
[Diagram: Inputs → Process/System → Output; controlled variables x1, x2, …, xn; uncontrolled (and possibly unknown) variables z1, z2, …, zn]

Fundamental Principles of Controlled Experiments
Local control
  Eliminate, reduce, dilute, or isolate the effect of noise factors
  Fix certain levels for the variables not under investigation
  E.g., participants' experience, project complexity
Replication
  Application of a treatment to more than one experimental unit
  Dilutes the effect of the variability among similar people, projects, and artifacts
  Reduces the chance of obtaining results by chance
Randomization
  With replicates, dilutes the effect of differences in motivation and experience
  Eliminates possible researcher bias and reduces the learning effect

Cause-Effect Analysis
Possible only when the principles of replication with randomization are applied
Observational studies or quasi-experiments
  Absence of randomization and/or local control

Statistical Analysis: How to Analyze the Data
How do we evaluate situations where visual analysis does not clearly show whether there was a significant improvement?
Would the conclusions change with more observations?
Statistical analysis

Randomized Complete Block Design
Applied when:
  There is a non-investigated factor with a significant influence on the output variable
  It is not possible or desirable to fix a single level for that factor
Block
  Homogeneous group of experimental units
  Randomization is done within the blocks

Example of Blocks in SE
Blocks by level of development experience: Low, Medium, High (6 developers per block)
18 participants
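
Randomization within blocks can be sketched as follows. This is an illustrative sketch, not the procedure used in any of the studies: the participant identifiers P1 to P18 and the seed are hypothetical, and the three treatments are assumed to be the techniques A, B, and C from the illustrative example.

```r
set.seed(1)  # fixed seed only for reproducibility of the illustration

levels_exp <- c("Low", "Medium", "High")
techniques <- c("A", "B", "C")

# 18 hypothetical participants, 6 in each experience block.
participants <- data.frame(
  id = paste0("P", 1:18),
  experience = rep(levels_exp, each = 6)
)

# Shuffle the techniques independently inside each block, so every
# block receives each technique the same number of times (twice).
participants$technique <- NA_character_
for (lvl in levels_exp) {
  in_block <- participants$experience == lvl
  participants$technique[in_block] <- sample(rep(techniques, times = 2))
}

print(table(participants$experience, participants$technique))
```

Because the draw is repeated per block, differences between experience levels cannot be confounded with the technique: every level contributes equally to every technique.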

To What Extent Can We Generalize the Results?
The experiment result is limited to a single type of project
  Simple or complex
Size and complexity of projects in practice

Latin Squares
Applied when:
  There are two noise factors with a significant influence on the output variable
Block
  Combination of levels of the two noise factors (row, column)
  The number of participants grows significantly in this scenario.

Crossing Experience and Size
Columns: Project Size; rows: Level of Experience
B A C
C B A
A C B
Replicates are obtained by changing developers and/or projects
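
A Latin square like the one above can be generated and checked in R. The sketch below uses the standard cyclic construction followed by a random permutation of rows and columns; it is one common way to randomize a square, not necessarily the one used by the authors. The size labels (Small, Medium, Large) and the helper `is_latin` are hypothetical names introduced for illustration.

```r
treatments <- c("A", "B", "C")
n <- length(treatments)

# Cyclic construction: cell (i, j) receives treatment ((i + j - 2) mod n) + 1.
idx <- outer(1:n, 1:n, function(i, j) (i + j - 2) %% n + 1)
square <- matrix(treatments[idx], nrow = n)

set.seed(7)  # fixed seed only for reproducibility
# Permuting whole rows and whole columns keeps the Latin property.
square <- square[sample(n), sample(n)]
dimnames(square) <- list(
  experience = c("Low", "Medium", "High"),  # row noise factor
  size = c("Small", "Medium", "Large")      # column noise factor (assumed labels)
)

# A square is Latin when no treatment repeats in any row or column.
is_latin <- function(m) {
  all(apply(m, 1, function(r) !anyDuplicated(r))) &&
    all(apply(m, 2, function(cl) !anyDuplicated(cl)))
}

print(square)
print(is_latin(square))
```

Each treatment then meets every experience level and every project size exactly once, which is what lets the two noise factors be separated from the treatment effect with only n × n observations per replicate.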

Limitations of the Latin Square
Requires the same number of treatments, rows, and columns
Some squares need more than 2 replicates
  E.g., a square of size 2

Latin Square by Example
Example 1: Cirilo, E. et al.

Empirical Evaluation
Our main goal is to investigate whether different techniques for product line implementation influence the correct comprehension of the configuration knowledge. As in related efforts, two dimensions were evaluated:
Correctness
Time

Research Questions
We distinguish the following research questions.
RQ1: Does the availability of domain-specific models increase the correct comprehension of the configuration knowledge?
RQ2: Does the availability of domain-specific models reduce the time needed to correctly comprehend the configuration knowledge?
RQ3: Do individual differences in the expertise of product line engineers affect the correct comprehension of the configuration knowledge?
RQ4: Which types of configuration knowledge comprehension tasks benefit most from domain-specific techniques, and which from other code-oriented techniques?

Hypotheses
Associated with the first two research questions are two null hypotheses:
H10: The correct comprehension of the configuration knowledge does not depend on the different specification techniques.
H20: The time to correctly comprehend the configuration knowledge does not depend on the different specification techniques.
The alternative hypotheses are the following:
H11: The correct comprehension of the configuration knowledge depends on the different specification techniques.
H21: The time to correctly comprehend the configuration knowledge depends on the different specification techniques.

Empirical Evaluation
Correct Answers and Time Analysis
  The correspondence between each participant's number of correct answers and the tools/product lines
  The influence of each approach on the time each participant spends answering the questionnaire
Expertise Analysis
  The influence of the participants' expertise on the number of correct answers
Now I will present the results of our study and some general observations.

First Evaluation
The study involved six postgraduate students answering three questionnaires, one for each product line, following the Latin square design. Example questions:
Which abstraction(s)/code asset(s) is (are) related to feature X?
How many abstraction(s)/code asset(s) is (are) mapped to feature Y?
The Latin square design gave us a random allocation of the tools such that each tool is used once by each participant (row) and once for each product line (column). Overall, most participants have satisfactory expertise in the product line field. However, they were not familiar with the evaluated approaches, so they were given a short 60-minute demonstration of pure::variants, CIDE, and GenArch+. In this training session, we demonstrated specific functionalities of the tools and examples of configuration knowledge specification.
Participants   E-Shop   OLIS   Buyer
P1 and P4      G+       PV     C
P2 and P5
P3 and P6

Product Lines vs. Correct Answers
Buyer - highest number of correct answers
  Lowest number of features and no diversity of frameworks
First, comparing CIDE and pure::variants, we observed that there is no apparent difference between these techniques.

OLIS - intermediate number of correct answers
  Well-modularized features

E-Shop - lowest number of correct answers
  Features not well modularized
Regardless of the technique used, factors such as simplicity (number of frameworks and lines of code) and modularization have a direct influence on the results.

Techniques vs. Correct Answers
CIDE - lowest number of correct answers in the E-Shop product line
For example, participants 2 and 5 achieved the lowest number of correct answers in the E-Shop product line. It therefore seems that CIDE did not help the participants adequately understand features that are not well modularized.

pure::variants – higher number of hits than CIDE for the E-Shop product line
On the other hand, pure::variants helped participants 3 and 6 better comprehend the E-Shop product line.

CIDE – higher number of hits than pure::variants for the OLIS product line
However, comparing CIDE and pure::variants for the OLIS product line, we observed that CIDE yielded a higher number of hits than pure::variants. Therefore, we are not able to conclude which is in fact better, CIDE or pure::variants.

However, analyzing the GenArch+ results, we observe that, in general, the use of domain-specific abstractions helped the participants comprehend the configuration knowledge. Participant 1 was the only exception: for this participant, pure::variants and CIDE were superior to GenArch+ in this aspect. For the other participants, this tool presented intermediate or better results.

Techniques vs. Time
Time demanded to answer the questionnaire.
          G+        PV        C
e-Shop    1:35:47   1:43:29   1:33:45
OLIS      1:27:51   1:45:42   1:31:09
Buyer     0:43:05   1:17:42   1:14:42
Average time to correctly answer a question.
          G+        PV        C
          0:02:57   0:04:39   0:03:10
We also analyzed the time spent by the participants to locate and understand the different configuration knowledge using the three different product line tools. We observed that GenArch+ required the lowest time (3:40:45) for the participants to answer the questionnaires, followed by CIDE (4:19:56) and pure::variants (4:46:53). Considering only correct answers, GenArch+ presented the lowest average (0:02:57), while CIDE (0:03:10) and pure::variants (0:04:39) presented higher values. Based on these results, we conclude that the time to correctly comprehend the configuration knowledge depends on the specification technique, but only with respect to some of the techniques. We observe no significant difference between GenArch+ and CIDE in the time needed to locate and understand the configuration knowledge specification, but there is a significant difference between these two techniques and pure::variants.

Participants' Expertise
[Table 1: expertise claimed by each participant in each implementation technology: Spring, Struts, Spring MVC, Hibernate, iBatis, Spring-DM, Jadex. Values shown per product line: e-Shop – 1.5, OLIS – 2.87, Buyer – 1.5.]
We prepared two tables summarizing the background of the participants. The first table (rows 4-10) indicates the expertise that each participant claimed to have in each technology used to implement the different product lines. Most participants claimed to have little knowledge of the development of agent-oriented software systems with Jadex, and none had previously worked with service-oriented development using Spring Dynamic Modules (SpringDM). In general, however, all participants have at least basic knowledge of the relevant frameworks of the experiment.
We also calculated each participant's degree of expertise for each product line. Expertise is a value ranging from 1 to 5, where 1 means no expertise in a given framework and 5 means high expertise. The degree of expertise in a given product line is the average of the participant's expertise in the frameworks used to implement that product line. We observe no relevant difference among the evaluated software product lines in terms of the participants' expertise.
It is important to note that a different product line was used to avoid biasing the experiment results.
Table 2: Degree of expertise of the participants
e-Shop   1.5   1.25   1     2.25   2
OLIS     3     2.75   3.5   1.75   3.25
Buyer

Expertise Results
This chart relates the degree of expertise of each participant to his/her number of correct answers for each product line. Each bar indicates the expertise (x-axis) of a participant (y-axis) in a product line (bars). The bullets show the total number of correct answers (CA/Total, secondary x-axis) of each participant. Note that there is no relation between expertise and the number of correct answers. For example, participants 1, 2, and 3 claimed to have similar expertise; however, participant 3 achieved a higher number of correct answers. Correspondingly, participant 5, who has limited expertise, achieved a number of correct answers very close to that of participants 2 and 4, who claimed to have superior expertise.
We also compared the degree of expertise, the number of correct answers, and the product line tools. A high degree of expertise in the frameworks used to implement the OLIS product line, combined with the annotative approach provided by the CIDE tool, may have helped participants answer the questionnaire correctly; however, the same behavior was not observed in the other two product lines with the same tool. In contrast, the participants who used pure::variants to answer questions about the E-Shop product line achieved a high number of correct answers even though they claimed to have a low degree of expertise. For the other two product lines, however, the participants showed the behavior described above, i.e., they claimed to have a high degree of expertise but achieved a low number of correct answers. For GenArch+, we observed the same behavior: expertise in the frameworks was not fundamental to answering the questionnaire correctly. As a result, we can conclude that there is no relation between expertise and the number of correct answers, accepting the hypothesis H3.


Statistical Results – Answers
Kruskal-Wallis: chi-squared, df, p-value
Kruskal-Wallis: 14, 0.2217
25/03/2017 Nome do Autor © LES/PUC-Rio

Statistical Results – Time
ANOVA: DF, Sum Sq, Mean Sq, F Value, Pr(>F)
Tool: 2, 45259, 2.6310, 0.1324

Second Evaluation
Our study involved fifteen postgraduate students answering three questionnaires, one for each product line, following the Latin square design. Questions were divided into four different comprehensibility tasks:
Identifying all files in which source code of a feature occurs
Identifying all features that occur in a certain file
Identifying all framework-concept instances that implement a certain feature
Investigating dependencies between framework-concept instances
The Latin square design gave us a random allocation of the tools such that each tool is used once by each participant (row) and once for each product line (column). Overall, most participants have satisfactory expertise in the product line field. However, they were not familiar with the evaluated approaches, so they were given a short 60-minute demonstration of pure::variants, CIDE, and GenArch+. In this training session, we demonstrated specific functionalities of the tools and examples of configuration knowledge specification.

GenArch+ ~26.28% ~34.53%
ANOVA: DF, Sum Sq, F Value, Pr(>F)
Tools: 2, 24.808, 1.075e-05

GenArch+ ~26.28% ~34.53%
Tukey: diff, lwr, upr, p adj
Rows: G-C, P-C, P-G

Statistical Results – Time
GenArch+ ~19.41% ~62.65%
Kruskal-Wallis: chi-squared = 4.0495, df = 2, p-value = 0.1320

Individual Task Performance - Answers
50% 87%

Individual Task Performance - Time
5x

Latin Square by Example
Example 2: Ribeiro, M. et al.

Empirical Evaluation
Goal: to better understand the use of Emergent Interfaces in maintenance activities on preprocessor-based software product lines.
Hypotheses
H1: With and without Emergent Interfaces, developers spend on average the same time to complete a maintenance task involving feature dependencies.
H2: With and without Emergent Interfaces, developers commit on average the same number of errors when performing a maintenance task involving feature dependencies.

Design
The study involved 24 undergraduate and postgraduate students and two product lines (Bestlap and Mobile Media), following the Latin square design.
Participants   Bestlap   Mobile Media
P1 … n         EI        Without EI
P2 … n
The Latin square design gave us a random allocation of the treatments such that each one is used once by each participant (row) and once for each product line (column).

Collecting the Metrics
Eclipse plug-in that consists of two buttons:
  Play/Pause
  Finish

Maintenance Tasks
Implementation of a New requirement
  In Best Lap, subjects should allow the game score to be not only positive but also negative.
  In Mobile Media, subjects should replace the current web image server with another one that provides more and different image formats.
Fixing of an Unused-variable

Experiment Execution
To make subjects aware of preprocessors, VSoC, Feature Dependencies, Emergent Interfaces, and Emergo, a one-hour training session was provided before running the experiment. One toy example, JCalc, was used to explain these concepts and exemplify the New-requirement and Unused-variable tasks.
The experiment was performed with 10 MSc/PhD students at the Federal University of Pernambuco, Brazil (Round 2) and replicated with 14 undergraduate students at the Federal University of Alagoas, Brazil (Round 3).

Data Interpretation New requirement task – Round 2

Data Interpretation New requirement task – Round 3

Data Interpretation Unused Variable – Time Penalty

Data Interpretation Unused Variable – Round 2

Data Interpretation Unused Variable – Round 3

Conclusions
Question 1: Do Emergent Interfaces reduce effort during maintenance tasks involving feature code dependencies in preprocessor-based systems?
We conclude that Emergent Interfaces reduce the time spent to accomplish the New-requirement task. Without them, subjects are 3 and 3.1 times slower.
For the Unused-variable task, the time difference with and without Emergent Interfaces is smaller than for the New-requirement task. On average, subjects are 1.5 and 1.68 times slower without Emergent Interfaces.

Conclusions
Question 2: Do Emergent Interfaces reduce the number of errors during maintenance tasks involving feature code dependencies in preprocessor-based systems?
The results show that, with Emergent Interfaces, subjects are more likely to be aware of feature dependencies. Hence, the probability of changing the impacted features increases, and subjects are less likely to press the Finish button prematurely.
Without Emergent Interfaces, subjects committed 84% and 81% of the errors.
Subjects without Emergent Interfaces tend to write more feature expressions incorrectly than subjects with Emergent Interfaces: 75% and 78%.

Statistical Analysis in Designs Based on the Latin Square
The R Tool

The R Tool
Free tool for statistical analysis
Based on the R language
Used through commands typed in a console
Commands operate on data or on the results of functions
Data can be arranged in a vector, matrix, or data frame.

The R Tool

Representing the Latin Square
Replica, Estudante, EstudoDeCaso, Tecnica, Resposta
1, 1, bY, G,
 , 1, oL, C,
 , 1, eS, P,
 , 2, bY, P,
 , 2, oL, G,
 , 2, eS, C,
 , 3, bY, C,
 , 3, oL, P,
 , 3, eS, G, 6.90
2, 4, bY, G, 8.00
2, 4, oL, C, 2.10
2, 4, eS, P, 4.00
2, 5, bY, P, 4.60
2, 5, oL, G, 6.15
2, 5, eS, C, 5.15
2, 6, bY, C, 6.00
2, 6, oL, P, 4.90
2, 6, eS, G, 7.20
…

Commands
Loading the data
data.ql = read.table(file = "dados-resposta.txt", header = TRUE)
attach(data.ql)
Defining the elements of the Latin square
Replica <- factor(Replica.)
Estudante <- factor(Estudante.)
EstudoDeCaso <- factor(EstudoDeCaso.)
Tecnica <- factor(Tecnica.)
Plotting the results
plot(Resposta ~ Tecnica, col = "gray", xlab = "SPL Tool", ylab = "Answers")

Command – Analysis of Variance
anova.ql = aov(Resposta ~ Replica + Estudante:Replica + EstudoDeCaso + Tecnica)
summary(anova.ql)
kw <- kruskal.test(Resposta ~ Tecnica, data = data.ql)
Check whether the sample is normally distributed and has equal variances.
Normal distribution: Shapiro-Wilk
shapiro.test(Resposta)
If p-value > 0.05, normality is not rejected and the sample can be treated as normal.
Equal variances: Levene (leveneTest from the car package)
leveneTest(Resposta ~ Tecnica)
If p-value > 0.05, equality of variances is not rejected and the groups can be treated as having the same variance.
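
The analysis above can be tried end to end without the original data file. The sketch below is an assumption-laden illustration: the responses are simulated (the technique effects, the seed, and the noise level are invented for the example and are not results from the study), and the helper column names `linha` and `caso` are hypothetical. The layout of techniques per student and case study follows the data listing shown earlier. Because student identifiers are unique across replicates, a single `Estudante` term spans the same row effects as `Replica + Estudante:Replica`, which keeps the model full rank; the normality check is applied here to the model residuals.

```r
set.seed(123)  # simulated data only; NOT the study's measurements

cases <- c("bY", "oL", "eS")
# Technique used by each student (row) in each case study (column).
layout <- matrix(c("G", "C", "P",
                   "P", "G", "C",
                   "C", "P", "G"), nrow = 3, byrow = TRUE)

# Two replicates of the 3 x 3 square: students 1-3 and students 4-6.
d <- expand.grid(caso = 1:3, linha = 1:3, replica = 1:2)
d$Estudante    <- factor((d$replica - 1) * 3 + d$linha)
d$EstudoDeCaso <- factor(cases[d$caso])
d$Tecnica      <- factor(layout[cbind(d$linha, d$caso)])

# Assumed technique effects, used only to generate example responses.
effect <- c(C = 0, G = 1.5, P = 0.5)
d$Resposta <- unname(5 + effect[as.character(d$Tecnica)] + rnorm(nrow(d)))

# Latin square ANOVA: rows (students), columns (case studies), treatment.
fit <- aov(Resposta ~ Estudante + EstudoDeCaso + Tecnica, data = d)
print(summary(fit))

# Normality of the residuals, and a rank-based alternative for the
# technique effect when the ANOVA assumptions look doubtful.
print(shapiro.test(residuals(fit)))
print(kruskal.test(Resposta ~ Tecnica, data = d))
```

With 18 observations, the degrees of freedom split into 5 for students, 2 for case studies, 2 for techniques, and 8 for the residuals.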

Command – Multiple Comparisons
ANOVA: Tukey's method
fmTukey = TukeyHSD(anova.ql, "Tecnica")
fmTukey
Kruskal-Wallis: Nemenyi-Damico-Wolfe-Dunn method (oneway_test from the coin package)
oneway_test(Dificuldade ~ Tecnica, data = data.ql)
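
Tukey's pairwise comparison can be illustrated in the same way. As before, this is a sketch on simulated responses (the effects and the seed are assumptions made for the example, not study data), using only functions from base R's stats package; the helper column names `linha` and `caso` are hypothetical.

```r
set.seed(2024)  # simulated data only; NOT the study's measurements

cases <- c("bY", "oL", "eS")
layout <- matrix(c("G", "C", "P",
                   "P", "G", "C",
                   "C", "P", "G"), nrow = 3, byrow = TRUE)

# Two replicates of the 3 x 3 square, six students in total.
d <- expand.grid(caso = 1:3, linha = 1:3, replica = 1:2)
d$Estudante    <- factor((d$replica - 1) * 3 + d$linha)
d$EstudoDeCaso <- factor(cases[d$caso])
d$Tecnica      <- factor(layout[cbind(d$linha, d$caso)])

effect <- c(C = 0, G = 1.5, P = 0.5)  # assumed effects for the simulation
d$Resposta <- unname(5 + effect[as.character(d$Tecnica)] + rnorm(nrow(d)))

fit <- aov(Resposta ~ Estudante + EstudoDeCaso + Tecnica, data = d)

# Pairwise differences between techniques with family-wise adjusted
# confidence intervals and p-values.
tk <- TukeyHSD(fit, "Tecnica")
print(tk)
```

The result holds one row per pair of techniques (G-C, P-C, P-G) with the columns `diff`, `lwr`, `upr`, and `p adj`; a pair whose interval excludes zero differs at the chosen confidence level.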
