# Tópicos Especiais em Aprendizagem (Special Topics in Learning), Reinaldo Bianchi, Centro Universitário da FEI, 2012

## Presentation transcript

Lecture 4, Part B

The K-means algorithm

K-Means
- A well-known algorithm for clustering patterns.
- Used when the number of clusters can be specified in advance:
  - Choose the desired number of clusters.
  - Choose the cluster centers and members so as to minimize the error.
  - This cannot be done by exhaustive search: there are too many parameters.

K-Means
- Algorithm:
  - Fix the cluster centers.
  - Assign each point to the nearest cluster.
  - Recompute each cluster center as the mean of the points assigned to it.
  - Repeat until the centers stop moving.
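The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation: the function name `kmeans`, the `max_iter` cap, and the choice to leave an empty cluster's center where it was are all assumptions made for the example.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Illustrative K-means on tuples of numbers; returns (centers, labels)."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the centers have stopped moving
        labels = new_labels
        # Update step: each center becomes the mean of the points assigned to it.
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # assumption: an empty cluster keeps its previous center
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, labels

# Hypothetical toy data: two well-separated pairs of points.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
toy_centers, toy_labels = kmeans(toy_points, [(0.0, 0.0), (10.0, 10.0)])
print(toy_labels)   # [0, 0, 1, 1]
print(toy_centers)  # [(0.0, 0.5), (10.0, 10.5)]
```

The initial centers are passed in explicitly so that the effect of different starting points can be examined by calling the function again with other seeds.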

K-Means
- Can be applied to any attribute for which a distance can be computed…

Clustering
- Partitioning clustering approach:
  - a typical clustering analysis approach that partitions the data set iteratively
  - constructs a partition of the data set into several non-empty clusters (usually the number of clusters is given in advance)
  - in principle, the partition is obtained by minimising the sum of squared distances within each cluster

Clustering
- Given K, find a partition into K clusters that optimises the chosen partitioning criterion:
  - Globally optimal: exhaustively enumerate all partitions.
  - Heuristic method: the K-means algorithm (MacQueen67), in which each cluster is represented by its center and the algorithm converges to stable cluster centers.

Algorithm

Given the cluster number K, the K-means algorithm is carried out in three steps after initialisation:
- Initialisation: set seed points.
1. Assign each object to the cluster with the nearest seed point.
2. Compute new seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e., mean point, of the cluster).
3. Go back to Step 1; stop when there are no more new assignments.

Example
- Suppose we have 4 types of medicine and each has two attributes:
  - pH and
  - weight index.
- Our goal is to group these objects into K=2 groups of medicine.

Example

| Medicine | Weight | pH-Index |
|----------|--------|----------|
| A        | 1      | 1        |
| B        | 2      | 1        |
| C        | 4      | 3        |
| D        | 5      | 4        |

Step 1: Use initial seed points for partitioning
- Assign each object to the cluster with the nearest seed point, using the Euclidean distance.
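As a small sketch of this assignment step, the snippet below computes the Euclidean distance from each medicine to two seed points and picks the nearer one. The transcript does not say which seeds were used, so taking A and B as the initial seeds is an assumption, as is the helper name `nearest_seed`.

```python
import math

# Medicine data from the example, as (weight index, pH).
medicines = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}

# Assumption: A and B serve as the two initial seed points.
seeds = [medicines["A"], medicines["B"]]

def nearest_seed(point, seeds):
    """Index of the seed closest to `point` under the Euclidean distance."""
    distances = [math.dist(point, s) for s in seeds]
    return distances.index(min(distances))

assignment = {name: nearest_seed(p, seeds) for name, p in medicines.items()}
print(assignment)  # {'A': 0, 'B': 1, 'C': 1, 'D': 1}
```

With these seeds, only A stays with the first seed; C and D are closer to B (about 2.83 and 4.24 away) than to A (about 3.61 and 5.0 away).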

Step 2: Compute new centroids of the current partition
- Knowing the members of each cluster, we now compute the new centroid of each group based on these new memberships.
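As a worked instance of the centroid update, suppose (an assumption, since the transcript does not name the seeds) that A and B were the initial seeds, so that the first assignment leaves A alone in cluster 1 and puts B, C and D in cluster 2. Each centroid is then the coordinate-wise mean of its members, written as (weight, pH):

```latex
c_1 = (1,\ 1), \qquad
c_2 = \left(\frac{2+4+5}{3},\ \frac{1+3+4}{3}\right)
    = \left(\frac{11}{3},\ \frac{8}{3}\right) \approx (3.67,\ 2.67)
```

Under the same assumption, B is now closer to $c_1$ (distance 1) than to $c_2$ (distance about 2.36), so it changes cluster in the next assignment.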

Step 2: Renew membership based on new centroids
- Compute the distance of all objects to the new centroids.
- Assign the membership to objects.

Step 3: Repeat the first two steps until convergence
- Knowing the members of each cluster, we now compute the new centroid of each group based on these new memberships.

Repeat the first two steps until convergence
- Compute the distance of all objects to the new centroids.
- Stop, because there are no new assignments.

K-means Demo
1. The user sets the number of clusters they'd like (e.g. K=5).
2. Randomly guess K cluster centre locations.
3. Each data point finds out which centre it's closest to (thus each centre owns a set of data points).
4. Each centre finds the centroid of the points it owns…
5. …and jumps there.
6. …Repeat until terminated!

K-means example in Matlab

Relevant Issues
- Efficient in computation
  - O(tKn), where n is the number of objects, K is the number of clusters, and t is the number of iterations. Normally, K, t << n.
- Local optimum
  - sensitive to the initial seed points
  - may converge to a local optimum that is an unwanted solution

Relevant Issues
- Other problems
  - Need to specify K, the number of clusters, in advance
  - Unable to handle noisy data and outliers (addressed by the K-Medoids algorithm)
  - Not suitable for discovering clusters with non-convex shapes
  - Applicable only when the mean is defined; what about categorical data? (addressed by the K-Modes algorithm)
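The sensitivity to outliers can be seen on a single cluster. The sketch below compares the mean, which K-means uses as the cluster representative, with a medoid, the actual data point that minimises the total distance to the others. The data and the helper names are hypothetical, chosen only to illustrate the point.

```python
def mean(values):
    """Arithmetic mean, the representative K-means uses."""
    return sum(values) / len(values)

def medoid(values):
    """The data point with the smallest total distance to all other points."""
    return min(values, key=lambda v: sum(abs(v - u) for u in values))

# Hypothetical 1-D cluster with a single outlier at 100.
cluster = [1.0, 2.0, 3.0, 4.0, 100.0]
print(mean(cluster))    # 22.0
print(medoid(cluster))  # 3.0
```

One outlier drags the mean far away from the bulk of the points, while the medoid stays among them.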

Cluster Validity
- With different initial conditions, the K-means algorithm may produce different partitions of a given data set.
- Which partition is the best one for the given data set?
- In theory, there is no answer to this question, as no ground truth is available in unsupervised learning.

Cluster Validity
- Example: the ratio of the total between-cluster distance to the total within-cluster distance:
  - Between-cluster distance (BCD): the distance between the means of two clusters
  - Within-cluster distance (WCD): the sum of all distances between the data points and the mean in a specific cluster
  - A large BCD:WCD ratio suggests good compactness inside clusters and good separability among different clusters.
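A rough sketch of this ratio is shown below. Reading "total" BCD as the sum over all pairs of cluster means, and "total" WCD as the sum over all clusters, is an assumption; the function name `bcd_wcd_ratio` is also made up for the example. The two partitions compared are two ways of splitting the four medicines A(1,1), B(2,1), C(4,3), D(5,4).

```python
import math
from itertools import combinations

def centroid(cluster):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(x) / len(cluster) for x in zip(*cluster))

def bcd_wcd_ratio(clusters):
    """Total between-cluster distance divided by total within-cluster distance."""
    means = [centroid(c) for c in clusters]
    # BCD: distances between the means of every pair of clusters.
    bcd = sum(math.dist(m1, m2) for m1, m2 in combinations(means, 2))
    # WCD: distances from each point to the mean of its own cluster.
    wcd = sum(math.dist(p, m) for c, m in zip(clusters, means) for p in c)
    return bcd / wcd  # undefined when every cluster is a single point (wcd == 0)

partition_1 = [[(1, 1), (2, 1)], [(4, 3), (5, 4)]]  # {A, B} and {C, D}
partition_2 = [[(1, 1)], [(2, 1), (4, 3), (5, 4)]]  # {A} and {B, C, D}
print(bcd_wcd_ratio(partition_1) > bcd_wcd_ratio(partition_2))  # True
```

Under this reading, the partition {A, B}, {C, D} scores higher than {A}, {B, C, D}, which matches the intuition that it is both more compact and better separated.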

Conclusion
- The K-means algorithm is a simple yet popular method for clustering analysis.
- There are several variants of K-means that overcome its weaknesses:
  - K-Medoids: resistance to noise and/or outliers
  - K-Modes: extension to categorical data clustering analysis
  - CLARA: dealing with large data sets
  - Mixture models (EM algorithm): handling uncertainty of clusters

And in Matlab?

And in Matlab?
- Syntax:
  - IDX = kmeans(X,k)
- Description:
  - Partitions the points in the n-by-p data matrix X into k clusters.
  - This iterative partitioning minimizes the sum, over all clusters, of the within-cluster sums of point-to-cluster-centroid distances.
  - Returns an n-by-1 vector IDX containing the cluster index of each point.

RANSAC

RANSAC
- RANdom SAmple Consensus.
- An alternative way of finding good points from which to fit the line.
- Idea:
  - Choose a subset uniformly at random (the support points).
  - Fit the line to those points.
  - Everything that lies far from the fit is noise.
  - Repeat many times and choose the best fit.
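The idea above can be sketched for line fitting as follows. This is a minimal illustration under stated assumptions: the sample size is fixed at two points, "far from the fit" is a perpendicular-distance threshold, "best" means the largest number of supporting points, and the names `fit_line` and `ransac_line` as well as the data are hypothetical.

```python
import random

def fit_line(p, q):
    """Line through two points as (a, b, c) with a*x + b*y + c = 0 and a^2 + b^2 = 1."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    norm = (a * a + b * b) ** 0.5
    return a / norm, b / norm, -(a * x1 + b * y1) / norm

def ransac_line(points, n_trials, threshold, rng):
    """Return the line with the most supporting points, and those points."""
    best_line, best_support = None, []
    for _ in range(n_trials):
        p, q = rng.sample(points, 2)  # random subset: two support points
        if p == q:
            continue  # duplicate points do not define a line
        a, b, c = fit_line(p, q)
        # Points within `threshold` of the line support it; the rest is treated as noise.
        support = [(x, y) for x, y in points if abs(a * x + b * y + c) <= threshold]
        if len(support) > len(best_support):
            best_line, best_support = (a, b, c), support
    return best_line, best_support

# Hypothetical data: ten points on y = x plus three outliers.
inliers = [(float(i), float(i)) for i in range(10)]
outliers = [(2.0, 9.0), (7.0, 0.0), (4.0, 8.0)]
line, support = ransac_line(inliers + outliers, n_trials=50, threshold=0.1,
                            rng=random.Random(0))
print(len(support))  # 10
```

Passing the random generator in explicitly keeps the run reproducible; with other data, the threshold and the number of trials would need to be chosen to suit the noise level.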

RANSAC
- Issues:
  - How many times should it run? As few as possible…
  - How large should the subset be? As small as possible…
  - What counts as close? Estimating the order of magnitude is enough…
  - What is a good fit? One for which the number of nearby points is so large that it is unlikely that they are all noise.

RANSAC – Example
- [Figure: line fits labelled "11 supports" and "4 supports"]
- How many samples do we need to draw?

RANSAC – How many samples
- How many samples do we need to ensure, with probability p, that at least one of the random samples of S points is free from outliers? (w: inlier probability)
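One standard way to answer this, assuming the S points of each sample are drawn independently: a sample is outlier-free with probability w^S, so all N samples contain an outlier with probability (1 - w^S)^N. Requiring 1 - (1 - w^S)^N >= p gives N >= log(1 - p) / log(1 - w^S). The sketch below evaluates that bound; the function name `ransac_num_samples` is made up for the example, and it expects 0 < p < 1 and 0 < w < 1.

```python
import math

def ransac_num_samples(p, w, s):
    """Smallest integer N with 1 - (1 - w**s)**N >= p."""
    return math.ceil(math.log(1 - p) / math.log(1 - w ** s))

# Line fitting (s = 2) with half of the points being inliers, 99% confidence.
print(ransac_num_samples(p=0.99, w=0.5, s=2))  # 17
```

The required number of samples grows quickly with the sample size S and with the fraction of outliers, which is why the subset should be as small as possible.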

The RANSAC Song

Conclusion

- We have finished covering the purely statistical machine learning methods.
  - K-NN, Least Squares, PCA, LDA, K-Means
- Starting with the next lecture, we will look at methods that are no longer statistical but probabilistic.

Links
- Examples taken from:
  - www.cs.manchester.ac.uk/ugt/FCOMP241 11/materials/slides/K-means.ppt
