Condor: Overview and User Guide to the Condor Biostatistics Environment.


1 Condor: Overview and User Guide to the Condor Biostatistics Environment

2 2 Authorship Authors –Patrícia Kayser Vargas –September 2002 –Talk at Biostat, Wisconsin, USA Revisions –V1 C. Geyer PDP/2005-2, PPGC, UFRGS December 2005

3 3 Topics Introduction –What is Condor? –Why and when to use Condor? –What are Condor Universes? Running Jobs on Condor –C programs –YAP –Java programs Final Remarks

4 4 Introduction

5 5 What is Condor? Condor –is a distributed batch scheduling system The goal of Condor is to provide the highest feasible throughput by executing the most jobs over extended periods of time. [1] What is a job? –Several possibilities

6 6 What is Condor? Condor –is composed of a collection of different daemons that provide various services, such as a job queueing mechanism, scheduling policies, a priority scheme, monitoring, resource management, job management, matchmaking...

7 7 What is Condor? Architecture [1]

8 What is Condor? Architecture Machine types –Central Manager: manager of a Condor network (grid); one per pool; central point of failure –Submit Machines: users' machines; the user submits, monitors, and controls the execution of a job –Execution Machine (worker): runs jobs –A single machine can play several roles

9 What is Condor? Architecture Machine types (cont.) –Checkpoint Server: optional; stores checkpoint files

10 10 What is Condor? Architecture Condor has four daemons On Central Manager and on Submit Machines –startd: monitors the conditions of the resource where it runs, publishes resource-offer ClassAds, and is responsible for enforcing the resource owner's policy for starting, suspending, and evicting jobs –schedd: maintains a persistent job queue, publishes resource-request ClassAds, and negotiates for available resources

11 11 What is Condor? Architecture Only on Central Manager: –collector: the central repository of information; startd and schedd send periodic updates to the collector –negotiator: periodically performs a negotiation cycle, a matchmaking process in which the negotiator tries to find matches between ClassAds of resource offers and resource requests; once a match is made, both parties are notified and are responsible for acting on that match

12 12 What is Condor? Architecture [1]

13 13 What is Condor? Architecture [1] (figure labels: Submitter, Executing)

14 14 What is Condor? Architecture ClassAds for resources and for jobs are published by sending them to the collector –startd sends resource ClassAds –schedd sends job ClassAds The collector forwards everything to the negotiator, which performs the matchmaking

15 15 What is Condor? Architecture Matchmaking algorithm –the negotiator finds resources on which a job can run –it tells the schedd daemon on the submitting machine which machine to contact in order to ship the job –it tells the startd daemon on the machine chosen to run the job (an idle resource that meets the requirements) that it is about to receive a task

16 16 What is Condor? Architecture From this point on the central manager is no longer involved; the two machines carry out the job between themselves –the submit machine creates a shadow process to send the task and receive the results –the execute machine creates a starter process, which receives the task, and a user job process, which runs it; at the end the results are sent back to the submit machine
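The matchmaking cycle described on the preceding slides can be sketched in Python. This is only a toy model under simplifying assumptions: ads are plain dictionaries, requirements are Python callables rather than ClassAd expressions, and there are no priorities or ranking. The names `JobAd`, `MachineAd`, and `negotiate` are hypothetical and are not part of Condor.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Ad = Dict[str, object]


@dataclass
class JobAd:
    """Resource request, as a schedd would publish it (toy model)."""
    job_id: int
    attrs: Ad
    requirements: Callable[[Ad], bool]  # evaluated against a machine's attributes


@dataclass
class MachineAd:
    """Resource offer, as a startd would publish it (toy model)."""
    name: str
    attrs: Ad
    requirements: Callable[[Ad], bool]  # evaluated against a job's attributes
    idle: bool = True


def negotiate(jobs: List[JobAd], machines: List[MachineAd]) -> List[Tuple[int, str]]:
    """One negotiation cycle: pair each job with the first idle machine
    for which both sides' requirements hold."""
    matches = []
    for job in jobs:
        for machine in machines:
            if not machine.idle:
                continue
            if job.requirements(machine.attrs) and machine.requirements(job.attrs):
                machine.idle = False  # the machine is now claimed for this job
                matches.append((job.job_id, machine.name))
                break
    return matches


def wants_fast_intel(machine: Ad) -> bool:
    # Hypothetical job requirement, for illustration only.
    return machine["Arch"] == "INTEL" and machine["Mips"] >= 500


machines = [
    MachineAd("m1", {"Arch": "SUN4u", "OpSys": "SOLARIS", "Mips": 300}, lambda job: True),
    MachineAd("m2", {"Arch": "INTEL", "OpSys": "LINUX", "Mips": 800}, lambda job: True),
]
jobs = [
    JobAd(0, {"Owner": "kayser"}, wants_fast_intel),
    JobAd(1, {"Owner": "kayser"}, wants_fast_intel),
]
print(negotiate(jobs, machines))  # job 1 stays unmatched: the only suitable machine is taken
```

In this sketch the negotiator only produces the pairs; as on the slides, shipping and running the job would then be up to the submit and execute machines.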

17 17 Why and when to use Condor? Condor is useful when –there are several jobs to be submitted –there is one executable and several different input data sets

18 18 Why and when to use Condor? Condor is useful because it –can use different available machines (opportunistic scheduling) –controls file transfers (the job must be able to access its data files from any machine on which it can potentially run) –sends email when a job has completed (except for jobs submitted from a Linux machine)

19 19 What are Condor Universes? Types of universes –standard –vanilla –java –parallel The Universe attribute is specified in the submit description file –the default is standard

20 20 What are Condor Universes? standard –provides checkpointing and remote system calls –makes the job more reliable and gives it uniform access to resources from anywhere in the pool –to prepare a program as a standard universe job, it must be relinked with condor_compile

21 21 What are Condor Universes? standard –there are a few restrictions –complete list in the manual: http://www.cs.wisc.edu/condor/manual/v6.4/2_4Road_map_running.html –examples: no multi-process jobs (no fork(), exec(), or system()); no inter-process communication (including pipes, semaphores, and shared memory); no sending or receiving the SIGUSR2 or SIGTSTP signals; all files must be opened read-only or write-only

22 22 What are Condor Universes? vanilla –used for programs which cannot be successfully re-linked –useful for shell scripts –cannot checkpoint or use remote system calls –sometimes a job must restart from the beginning on another machine in the pool, because there is no checkpoint

23 23 What are Condor Universes? java –can execute on any machine in the pool that will run the Java Virtual Machine –at the moment it does not work at the Biostat department at Wisconsin –compiled Java programs can be submitted –creating a jar file is recommended for programs with several classes

24 24 What are Condor Universes? parallel –MPI and PVM: used for parallel programs that use message passing –Globus: must have Condor-G installed –I did not check whether they work at Biostat

25 25 Running Jobs on Condor

26 26 Running Jobs on Condor You can submit your jobs from any biostat machine, since all run schedd and startd You must –set PATH environment variable –prepare a submission file –compile your job with condor_compile if using standard universe –submit your job(s) with condor_submit command

27 27 Running Jobs on Condor Submission file –the submit description file is the file that specifies which executable to run, the directory where the output files will be placed, how many jobs will be instantiated, etc.

28 28 Running Jobs on Condor Submission file –this file is turned into one ClassAd for each job that needs to be instantiated –e.g., if the file contains the command 'queue 50', 50 jobs of that program have to be run, so 50 ClassAds are published to the central manager

29 29 Running Jobs on Condor Setting PATH environment variable Change PATH to find Condor commands (depending on your shell)
bash:
    source /s/pkg/condor/condor.sh
    PATH=$PATH:/s/pkg/`/s/share/ostoken`/condor/bin; export PATH
csh:
    source /s/pkg/condor/condor.csh
    set path = ( $path /s/pkg/`/s/share/ostoken`/condor/bin )
    rehash

30 30 Running Jobs on Condor Preparing a submission file ClassAds (Classified Advertisements) –attribute/value pairs –syntax similar to C/Java The commands are case insensitive, e.g., these two lines are equivalent:
    executable = fact
    Executable = fact

31 31 Running Jobs on Condor Preparing a submission file The file must have at least the executable attribute (your program/binary):
    Executable = fact
Another useful attribute is the input file (your data):
    input = test.data
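As a rough illustration of how such attribute lines can be read, the following Python sketch parses a submit description into a dictionary with lower-cased keys, which mirrors the case-insensitive command names. It is a simplification: `parse_submit` is a hypothetical helper, not a Condor tool, and it keeps only the last value of each attribute, whereas a real submit file applies the attributes in effect at each `Queue` statement.

```python
from typing import Dict, Tuple


def parse_submit(text: str) -> Tuple[Dict[str, str], int]:
    """Return (attributes, number_of_queued_jobs) for a submit description.

    Attribute names are lower-cased, so 'Executable' and 'executable'
    end up under the same key.
    """
    attrs: Dict[str, str] = {}
    queued = 0
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep:
            attrs[key.strip().lower()] = value.strip()
            continue
        words = line.split()
        if words[0].lower() == "queue":
            # 'Queue' alone means one job; 'Queue N' means N jobs.
            queued += int(words[1]) if len(words) > 1 else 1
    return attrs, queued


example = """
# minimal submit description
Executable = fact
input = test.data
Queue
"""
print(parse_submit(example))
```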

32 32 Running Jobs on Condor Compiling your job with condor_compile If using standard universe: –use condor_compile, because the program must be relinked with the Condor library
    condor_compile gcc fact.c -o fact

33 33 Running Jobs on Condor Submitting your job(s) with condor_submit In any Condor Universe –jobs are submitted using the condor_submit command with the submission file as parameter
    condor_submit condor1.sub
–the -v option shows information about the submission (the full ClassAd generated); it only prints the listing and exits (non-interactive)
    condor_submit -v condor1.sub

34 Example of C Program

35 35 Running Jobs on Condor C programs Compiler options for condor_compile: –gcc (the GNU C compiler) –cc (the system C compiler) –acc (ANSI C compiler, on Sun systems) –CC (the system C++ compiler) –… (http://www.cs.wisc.edu/condor/manual/v6.4/condor_compile.html)
    bash-2.03$ condor_compile gcc fact.c -o fact

36 36 Running Jobs on Condor C programs – example of a submission file
    ####################
    # C Example: demonstrate use of multiple directories
    # "Arguments = 5" to pass integer 5 as parameter
    #
    ####################
    Executable = fact
    Universe = standard
    output = loop.out
    error = loop.error
    Log = loop.log
    Arguments = 5
    Initialdir = run_1
    Queue
    Initialdir = run_2
    Queue

37 37 Running Jobs on Condor C programs Log –contains important information for evaluating the execution/performance of the application –for an ordinary user it may not be that relevant –describes each event that happens to the job, with date/time/machine information for when it was submitted, started executing, was suspended, was migrated, and terminated (with an error or successfully)

38 38 Running Jobs on Condor C programs Arguments –parameters for the executable –in the example, arguments = 5 is equivalent to running 'fact 5' in a terminal Initialdir –where the output/error/log files will be stored –initialdir = run_1 means the directory run_1

39 39 Running Jobs on Condor C programs Queue –runs a single instance of the job, using run_1 as initialdir –the directory must be created before running condor_submit, otherwise an error occurs Initialdir = run_2 and Queue –one more instance of the job, now in another directory
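Since the Initialdir directories have to exist before submission, a small helper can create them first. The Python sketch below is one way to do it; `prepare_run_dirs` is a hypothetical name, and the demonstration uses a temporary directory as the base so that running it leaves nothing behind. In practice the base would be the directory that holds the submission file.

```python
import os
import tempfile
from typing import List


def prepare_run_dirs(base_dir: str, run_names: List[str]) -> List[str]:
    """Create one directory per Initialdir value and return their paths.

    Existing directories are left untouched, so the helper can be
    called again before every submission.
    """
    created = []
    for name in run_names:
        path = os.path.join(base_dir, name)
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created


# Demonstration with the two directories used in the example submit file.
with tempfile.TemporaryDirectory() as base:
    for path in prepare_run_dirs(base, ["run_1", "run_2"]):
        print(os.path.basename(path), os.path.isdir(path))
```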

40 40 Running Jobs on Condor C programs another example of a submission file
    ####################
    # C Example:
    # each job runs with a different argument and
    # store results in different files
    ####################
    Executable = fact
    notify_user = kayser@cos.ufrj.br
    Input = in.$(Process)
    Output = out.$(Process)
    Error = err.$(Process)
    Log = fact.log
    Queue 2

41 41 Running Jobs on Condor C programs notify_user = kayser@cos.ufrj.br –tells Condor to send a message when the job finishes Input = in.$(Process) –$(Process): a Condor variable that is instantiated with a sequential integer for each job created –so the jobs use in.0, in.1, in.2, and so on

42 42 Running Jobs on Condor C programs Log = fact.log –a single log file even though there are several jobs –events are annotated with the job number Queue 2 –creates two jobs –any integer can be used –Queue 100 creates 100 tasks
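The effect of `$(Process)` together with `Queue N` can be mimicked in a few lines of Python. This is a sketch of the substitution only, not of how Condor implements it: `expand_process` is a hypothetical helper, and it handles just the exact spelling `$(Process)`.

```python
from typing import Dict, List


def expand_process(template: Dict[str, str], count: int) -> List[Dict[str, str]]:
    """Return one attribute dictionary per job, replacing $(Process)
    with the job's sequential number (0, 1, 2, ...)."""
    jobs = []
    for process in range(count):
        job = {key: value.replace("$(Process)", str(process))
               for key, value in template.items()}
        jobs.append(job)
    return jobs


template = {
    "Input": "in.$(Process)",
    "Output": "out.$(Process)",
    "Error": "err.$(Process)",
    "Log": "fact.log",  # no macro: every job shares this log file
}
for job in expand_process(template, 2):  # corresponds to 'Queue 2'
    print(job)
```

Each job gets its own input, output, and error file, while the log file name stays the same for all of them.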

43 43 Running Jobs on Condor C programs – YAP To configure YAP with Condor:
    configure --enable-depth-limit --enable-condor
    make

44 44 Running Jobs on Condor C programs – YAP condor.sub
    Universe = standard
    Executable = /u/dutra/Yap-4.3.20/condor/yap.$$(Arch).$$(OpSys)
    Initialdir = /u/dutra/App/f1/train_best
    Log = /u/dutra/App/f1/train_best/log
    Requirements = ((Arch == "INTEL" && OpSys == "LINUX") && (Mips >= 500) || (IsDedicated && UidDomain == "cs.wisc.edu"))
    Arguments = -b /u/dutra/Yap-4.3.20/condor/../pl/boot.yap
    Input = condor.in.$(Process)
    Output = /dev/null
    Error = /dev/null
    Queue 300
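To make the grouping in the Requirements expression explicit, here is the same condition written as a Python function. Because `&&` binds tighter than `||`, the expression accepts either a sufficiently fast Intel/Linux machine or a dedicated machine in the cs.wisc.edu domain. The sketch assumes machine attributes arrive as a plain dictionary and uses ordinary Python comparisons; it does not reproduce ClassAd evaluation rules such as the handling of undefined attributes. `meets_requirements` is a hypothetical name.

```python
from typing import Dict


def meets_requirements(machine: Dict[str, object]) -> bool:
    """Python rendering of the Requirements expression in condor.sub."""
    fast_intel_linux = (
        machine.get("Arch") == "INTEL"
        and machine.get("OpSys") == "LINUX"
        and machine.get("Mips", 0) >= 500
    )
    dedicated_uw = (
        bool(machine.get("IsDedicated", False))
        and machine.get("UidDomain") == "cs.wisc.edu"
    )
    return fast_intel_linux or dedicated_uw


# Illustrative machine descriptions (made-up values).
print(meets_requirements({"Arch": "INTEL", "OpSys": "LINUX", "Mips": 800}))
print(meets_requirements({"Arch": "INTEL", "OpSys": "LINUX", "Mips": 300}))
print(meets_requirements({"Arch": "SUN4u", "OpSys": "SOLARIS", "Mips": 300,
                          "IsDedicated": True, "UidDomain": "cs.wisc.edu"}))
```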

45 45 Running Jobs on Condor C programs – YAP condor.in.0
    ['~/Yap-4.3.20/condor/../pl/init.yap'].
    module(user).
    ['~/Aleph/aleph.pl'].
    read_all('~/App/f1/train_best/train').
    set(i,5).
    set(minacc,0.7).
    set(clauselength,5).
    set(recordfile,'~/App/f1/train_best/trace-0.7-5.0').
    set(test_pos,'~/App/f1/train_best/test.f').
    set(test_neg,'~/App/f1/train_best/test.n').
    set(evalfn,coverage).
    induce.
    write_rules('~/App/f1/train_best/theory-0.7-5.0').
    halt.

46 Example of Java Program

47 47 Running Jobs on Condor Java programs Using the Java universe –programs do not need to be compiled with Condor –use a jar file for programs with several classes: http://java.sun.com/docs/books/tutorial/jar/ –if using the Computer Science environment, you must grant access to the files to be used on AFS: http://www.cs.wisc.edu/condor/uwcs/

48 48 Running Jobs on Condor Java programs
    ####################
    # Example in Java Universe
    # executable must have the .class file and
    # arguments must have the main class as first argument
    ####################
    universe = java
    executable = Fact.class
    arguments = Fact
    notify_user = kayser@cos.ufrj.br
    output = loop.out
    error = loop.error
    log = loop.log
    Queue

49 49 Running Jobs on Condor Java programs
    ####################
    # Example in Java Universe using jar file
    ####################
    universe = java
    executable = jgfSection2.jar
    arguments = JGFAllSizeA 4
    jar_files = jgfSection2.jar
    transfer_files = ALWAYS
    output = logAllSection2f.out
    error = logAllSection2f.error
    log = logAllSection2f.log
    Queue

50 50 Running Jobs on Condor Java programs executable = jgfSection2.jar –it is a jar –not a .class file as in the previous example arguments = JGFAllSizeA 4 –two arguments –example built from the Java Grande benchmarks jar_files = jgfSection2.jar –looks redundant –but without this attribute the file is not transferred

51 51 Running Jobs on Condor Java programs transfer_files = ALWAYS –likewise: needed to transfer the .jar –possibly a bug that has since been fixed

52 52 Running Jobs on Condor Inspecting Condor Jobs Some useful commands: –condor_q: shows the queue of locally submitted jobs –condor_q -analyze: gives more information, helping to tell whether a job is not running because of a problem with its requirements or because no resource is available –condor_q -submitter

53 53 Running Jobs on Condor Inspecting Condor Jobs condor_q -run –shows only the jobs that are currently running condor_q -submitter –filters the output to show only the jobs submitted by the given user

54 54 Running Jobs on Condor Inspecting Condor Jobs condor_status –shows each of the machines in the Condor pool –with static information (e.g., the operating system) and dynamic information (e.g., whether the machine is idle or busy)

55 55 Running Jobs on Condor Inspecting Condor Jobs condor_rm –removes a job or a set of jobs from the queue –similar to kill –requires the job number condor_q -global –shows information for all queues –on all machines where jobs were submitted

56 56 Final Remarks

57 57 Final Remarks So, Condor... –controls execution of several jobs –can really improve your runtime (Yap+Aleph: 53,000 CPU hours over three months, with a peak of 400 machines) But, Condor... –does not automatically parallelize your job

58 58 Final Remarks Running Jobs on Condor - Observations: –the input data file and the directory used for the output/log/error files must be created beforehand, otherwise an error is reported and no job is executed –on each execution, the outputs are appended to the log files, while the results in the out files are overwritten –error, log and out files must have different names to avoid race conditions

59 59 Final Remarks Work on data management –though I am not sure how far it is integrated with Condor –Stork (Data Placement Scheduler): http://www.cs.wisc.edu/condor/stork –Kangaroo (this one seems to have been abandoned): http://www.cs.wisc.edu/condor/kangaroo –NeST (Network Storage): http://www.cs.wisc.edu/condor/nest/

60 60 Final Remarks Work on monitoring –Hawkeye System Monitoring Tool: http://www.cs.wisc.edu/condor/hawkeye/

61 61 Final Remarks More information about Condor: http://www.cs.wisc.edu/condor/ Tutorials –http://www.cs.wisc.edu/condor/CondorWeek2006/ –http://www.cs.wisc.edu/condor/CondorWeek2005/presentations.html More information about running Condor: http://www.cs.wisc.edu/condor/manual/v6.4/

62 62 Final Remarks References: –[1] WRIGHT, Derek. Cheap cycles from the desktop to the dedicated cluster: combining opportunistic and dedicated scheduling with Condor. In: Conference on Linux Clusters: The HPC Revolution, June, 2001, Champaign - Urbana, IL - USA. http://www.cs.wisc.edu/condor/doc/cheap-cycles.pdf

64 NMR-Star file to ClassAd Patrícia Kayser Vargas Mangan kayser@cos.ufrj.br September, 2002

65 65 NMR-Star to ClassAd BioMagResBank (http://www.bmrb.wisc.edu) –an international repository for biological NMR (nuclear magnetic resonance) data –uses the NMR Self-defining Text Archival and Retrieval (NMR-STAR) format to store its data NMR-STAR is characterized by a set of information organized as a hierarchical tree –stored as plain text file –some may have inconsistencies that are manually verified

66 66 NMR-Star to ClassAd ClassAds –a simple representation language first used in the Condor context Steps: –convert NMR-STAR data to ClassAd format using starlibj (a Java package) –use the ClassAds to detect inconsistencies in NMR-STAR files

67 67 NMR-Star to ClassAd Future work: –Matchmaking as consistency checker –try to learn similarities among NMR data Working with R. Kent Wenger from the Condor team of UW-Madison

69 TALK 1: Condor: Managing Resources in the Biostatistics Department Environment TALK 2: Using ClassAds to Represent NMR Data

70 70 What is Condor? Architecture After schedd receives a match for a given job, the schedd enters into a claiming protocol directly with the startd. Through this protocol, the schedd presents the job ClassAd to the startd and requests temporary control over the resource.

