Cuneiform A Functional Workflow Language Implementation in Erlang Jörgen Brandt Humboldt-Universität zu Berlin 2015-12-01 Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 1 / 27
Cuneiform Cuneiform A Functional Language for Large Scale Scientific Data Analysis Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 2 / 27
Cuneiform Cuneiform A Functional Language for Large Scale Scientific Data Analysis Black-box operator model Pro: Operators can be any piece of software Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 2 / 27
Cuneiform Cuneiform A Functional Language for Large Scale Scientific Data Analysis Black-box operator model Pro: Operators can be any piece of software Black-box data model Pro: Input and Output data can be anything Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 2 / 27
Cuneiform Cuneiform A Functional Language for Large Scale Scientific Data Analysis Black-box operator model Pro: Operators can be any piece of software Black-box data model Pro: Input and Output data can be anything Features of advanced workflow languages Abstractions, lists, operations on lists, conditions Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 2 / 27
Cuneiform Cuneiform A Functional Language for Large Scale Scientific Data Analysis Black-box operator model Pro: Operators can be any piece of software Black-box data model Pro: Input and Output data can be anything Features of advanced workflow languages Abstractions, lists, operations on lists, conditions Light-weight Foreign Function Interface (FFI) Wrapping in R, Matlab, Octave, Python, Lisp, Perl, Bash Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 2 / 27
Cuneiform Cuneiform A Functional Language for Large Scale Scientific Data Analysis Black-box operator model Pro: Operators can be any piece of software Black-box data model Pro: Input and Output data can be anything Features of advanced workflow languages Abstractions, lists, operations on lists, conditions Light-weight Foreign Function Interface (FFI) Wrapping in R, Matlab, Octave, Python, Lisp, Perl, Bash Automatic parallelization Scalability with large data sets Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 2 / 27
Motivation New hardware is increasingly parallel, so new programming languages must support concurrency or they will die. Joe Armstrong Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 3 / 27
New Hardware Jo rgen Brandt (HU Berlin) Cuneiform 2015-12-01 4 / 27
DNA Sequencing is becoming cheap Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 5 / 27
Decentralized software development Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 6 / 27
Scientific Workflow Systems Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 7 / 27
Scientific Workflow Systems Workflows as DAGs Scientific Workflows are DAGs Jo rgen Brandt (HU Berlin) Cuneiform 2015-12-01 8 / 27
Scientific Workflow Systems Workflows as DAGs Scientific Workflows are DAGs Nodes are tasks Jo rgen Brandt (HU Berlin) Cuneiform 2015-12-01 8 / 27
Scientific Workflow Systems Workflows as DAGs Scientific Workflows are DAGs Nodes are tasks Edges are data dependencies Jo rgen Brandt (HU Berlin) Cuneiform 2015-12-01 8 / 27
Example: Galaxy Workflow System Focus on Usability Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 9 / 27
Example: Galaxy Workflow System Focus on Usability Integration of tools/libraries Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 9 / 27
Example: Galaxy Workflow System Focus on Usability Systematic documentation Integration of tools/libraries Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 9 / 27
Example: Galaxy Workflow System Focus on Usability Systematic documentation Integration of tools/libraries Reproducibility Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 9 / 27
The Next Generation Sequencing use case Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 10 / 27
The Next Generation Sequencing use case Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 10 / 27
The Next Generation Sequencing use case Jo rgen Brandt (HU Berlin) Cuneiform 2015-12-01 10 / 27
The Next Generation Sequencing use case Jo rgen Brandt (HU Berlin) Cuneiform 2015-12-01 10 / 27
The Next Generation Sequencing use case Jo rgen Brandt (HU Berlin) Cuneiform 2015-12-01 10 / 27
Desired Features in a Language Is there a language that is... Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 11 / 27
Desired Features in a Language Is there a language that is... Like a workflow language So we can integrate all the tools Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 11 / 27
Desired Features in a Language Is there a language that is... Like a workflow language So we can integrate all the tools Like MapReduce So we can derive parallelism and distribute the work Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 11 / 27
Desired Features in a Language Is there a language that is... Like a workflow language So we can integrate all the tools Like MapReduce So we can derive parallelism and distribute the work Like a functional programming language So we can write arbitrary programs using lists and operations on lists Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 11 / 27
Cuneiform example deftask gunzip( out( File ) : gz( File ) )in bash *{ gzip -c -d $gz > $out }* Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 12 / 27
Cuneiform example deftask gunzip( out( File ) : gz( File ) )in bash *{ gzip -c -d $gz > $out }* gunzip gunzip( gz: 'myarchive1.gz' ); Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 12 / 27
Cuneiform example deftask gunzip( out( File ) : gz( File ) )in bash *{ gzip -c -d $gz > $out }* gunzip gunzip( gz: 'myarchive1.gz' 'myarchive2.gz' ); Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 12 / 27
Cuneiform example deftask gunzip( out( File ) : gz( File ) )in bash *{ gzip -c -d $gz > $out }* gunzip gunzip( gz: 'myarchive1.gz' 'myarchive2.gz' 'myarchive3.gz' ); Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 12 / 27
Workflow Implementations Available Available example workflows: Variant calling https://www.github.com/ joergen7/variant-call samtools-faidx tophat-single fastq-dump cuffdiff cufflinks cummerbund cuffmerge Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 13 / 27
Workflow Implementations Available Available example workflows: Variant calling https://www.github.com/ joergen7/variant-call Methylation https://www.github.com/ joergen7/methylation samtools-faidx fastq-dump tophat-single cuffdiff cufflinks cummerbund cuffmerge Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 13 / 27
Workflow Implementations Available Available example workflows: Variant calling https://www.github.com/ joergen7/variant-call Methylation https://www.github.com/ joergen7/methylation RNA-Seq https://www.github. com/joergen7/rna-seq fastq-dump samtools-faidx tophat-single cuffdiff cufflinks cummerbund cuffmerge Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 13 / 27
Workflow Implementations Available Available example workflows: Variant calling https://www.github.com/ joergen7/variant-call Methylation https://www.github.com/ joergen7/methylation RNA-Seq https://www.github. com/joergen7/rna-seq etc (ChIP-Seq, mirna detection, consensus prediction,... ) fastq-dump samtools-faidx tophat-single cuffdiff cufflinks cummerbund cuffmerge Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 13 / 27
Cuneiform Operational Semantics Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 14 / 27
Program Interpretation in Cuneiform Current design in Java implementation: EBNF Parser Generator (ANTLR) Scanner/ Transcrip on Abstract Program Parse tree Interpreter Result Parser Visitor Program Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 15 / 27
Program Interpretation in Cuneiform Designated design but never implemented: EBNF Opera onal Seman cs Parser Generator (ANTLR) Program Extrac on (?) Scanner/ Transcrip on Abstract Program Parse tree Interpreter Result Parser Visitor Program Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 15 / 27
Program Interpretation in Cuneiform Designated design in Erlang: EBNF Opera onal Seman cs Parser Generator (ANTLR) Program Extrac on (?) Scanner/ Transcrip on Abstract Program Parse tree Interpreter Result Parser Visitor Program Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 15 / 27
The eval Function fun eval(expr, ρ, GetFuture, Global, Fin) Result when Expr :: expr ρ :: string => expr GetFuture :: fun Global :: string => lam Fin :: id => expr Result :: expr Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 16 / 27
The eval Function fun eval(expr, ρ, GetFuture, Global, Fin) Result when Expr :: expr ρ :: string => expr GetFuture :: fun Global :: string => lam Fin :: id => expr Result :: expr Expr The expression to be evaluated Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 16 / 27
The eval Function fun eval(expr, ρ, GetFuture, Global, Fin) Result when Expr :: expr ρ :: string => expr GetFuture :: fun Global :: string => lam Fin :: id => expr Result :: expr Expr The expression to be evaluated ρ Current scope Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 16 / 27
The eval Function fun eval(expr, ρ, GetFuture, Global, Fin) Result when Expr :: expr ρ :: string => expr GetFuture :: fun Global :: string => lam Fin :: id => expr Result :: expr Expr The expression to be evaluated ρ Current scope GetFuture A function returning a future for a foreign task application Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 16 / 27
The eval Function fun eval(expr, ρ, GetFuture, Global, Fin) Result when Expr :: expr ρ :: string => expr GetFuture :: fun Global :: string => lam Fin :: id => expr Result :: expr Expr The expression to be evaluated ρ Current scope GetFuture A function returning a future for a foreign task application Global Task definitions Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 16 / 27
The eval Function fun eval(expr, ρ, GetFuture, Global, Fin) Result when Expr :: expr ρ :: string => expr GetFuture :: fun Global :: string => lam Fin :: id => expr Result :: expr Expr The expression to be evaluated ρ Current scope GetFuture A function returning a future for a foreign task application Global Task definitions Fin Results of foreign task applications Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 16 / 27
The eval Function fun eval(expr, ρ, GetFuture, Global, Fin) Result when Expr :: expr ρ :: string => expr GetFuture :: fun Global :: string => lam Fin :: id => expr Result :: expr Expr The expression to be evaluated ρ Current scope GetFuture A function returning a future for a foreign task application Global Task definitions Fin Results of foreign task applications Result The result of evaluation (may contain futures) Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 16 / 27
Computational Semantics eval(expr, ρ, GetFuture, Global, Fin) Next = step(expr, ρ, GetFuture, Global, Fin) case Next of Expr Expr eval(next, ρ, GetFuture, Global, Fin) end Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 17 / 27
Computational Semantics eval(expr, ρ, GetFuture, Global, Fin) Next = step(expr, ρ, GetFuture, Global, Fin) case Next of Expr Expr eval(next, ρ, GetFuture, Global, Fin) end single step is computed Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 17 / 27
Computational Semantics eval(expr, ρ, GetFuture, Global, Fin) Next = step(expr, ρ, GetFuture, Global, Fin) case Next of Expr Expr eval(next, ρ, GetFuture, Global, Fin) end single step is computed If the step has no effect evaluation terminates Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 17 / 27
Computational Semantics fun eval(expr, ρ, GetFuture, Global, Fin) Next = step(expr, ρ, GetFuture, Global, Fin) case Next of Expr Expr eval(next, ρ, GetFuture, Global, Fin) end single step is computed If the step has no effect evaluation terminates Otherwise eval is called recursively Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 17 / 27
Languages Suitable for Operational Semantics Common Lisp Choosing a language Functional Language Haskell Scala Erlang ML Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 18 / 27
Formalisms Suitable for Operational Semantics Common Lisp Choosing a language Functional Language With Pattern Matching Haskell Scala Erlang ML Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 18 / 27
Formalisms Suitable for Operational Semantics Common Lisp Choosing a language Functional Language With Pattern Matching Concurrency Orientation Haskell Scala Erlang ML Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 18 / 27
Program Interpretation in Cuneiform New design: EBNF Opera onal Seman cs Parser Generator (Leex/Yecc) Compiler (Erlang) Scanner/ Abstract Program Interpreter Result Parser Program Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 19 / 27
Program Interpretation in Cuneiform New design: EBNF Opera onal Seman cs Parser Generator (Leex/Yecc) Compiler (Erlang) Scanner/ Abstract Program Interpreter Result Parser Program Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 19 / 27
Fault Tolerance Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 20 / 27
Two types of complexity n Processes Distributed applications can be complex in two independent ways: n Tasks Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 21 / 27
Two types of complexity n Processes Distributed applications can be complex in two independent ways: in the number of processes involved to compute one task the more system components the more likely one component fails n Tasks Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 21 / 27
Two types of complexity n Processes n Tasks Distributed applications can be complex in two independent ways: in the number of processes involved to compute one task the more system components the more likely one component fails in the number of tasks contributing to a workflow the more tasks the more likely one task fails Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 21 / 27
Two types of complexity n Processes n Tasks Distributed applications can be complex in two independent ways: in the number of processes involved to compute one task the more system components the more likely one component fails in the number of tasks contributing to a workflow the more tasks the more likely one task fails Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 21 / 27
Two types of complexity n Processes n Tasks Distributed applications can be complex in two independent ways: in the number of processes involved to compute one task the more system components the more likely one component fails in the number of tasks contributing to a workflow the more tasks the more likely one task fails Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 21 / 27
Distributed Application: Workflow System Query Query Query Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 22 / 27
Distributed Application: Workflow System Query Query Query Cache Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 22 / 27
Distributed Application: Workflow System Query Query Query Cache Scheduler Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 22 / 27
Distributed Application: Workflow System Query Query Query Cache Scheduler FS Work FS Work FS Work Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 22 / 27
How to achieve fault tolerance when Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 23 / 27
How to achieve fault tolerance when Workflow systems are complex in both Number of processes involved in computing one task Number of tasks in one workflow so failures are likely Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 23 / 27
How to achieve fault tolerance when Workflow systems are complex in both Number of processes involved in computing one task Number of tasks in one workflow so failures are likely All components need to maintain state so plain restarting of components is not enough Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 23 / 27
How to achieve fault tolerance when Workflow systems are complex in both Number of processes involved in computing one task Number of tasks in one workflow so failures are likely All components need to maintain state so plain restarting of components is not enough Restarting of workflow helps only if workflows are small and system has few components Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 23 / 27
Generic process behaviour Golden path: P 1 sends request to P 2 P1 request P2 Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 24 / 27
Generic process behaviour Golden path: P 1 sends request to P 2 Asynchronously P 1 receives reply P1 reply request P2 Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 24 / 27
Generic process behaviour P 2 may fail: P 1 sends request to P 2 P 2 fails P1 request P2 Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 25 / 27
Generic process behaviour monitor P 2 may fail: P 1 sends request to P 2 P 1 creates monitor on P 2 P1 request P2 Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 25 / 27
Generic process behaviour monitor P 2 may fail: P 1 sends request to P 2 P 1 creates monitor on P 2 P 1 memorizes request P1 request P2 { } Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 25 / 27
Generic process behaviour monitor P 2 may fail: P 1 sends request to P 2 P 1 creates monitor on P 2 P 1 memorizes request P 2 fails P1 request P2 { } Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 25 / 27
Generic process behaviour P 2 may fail: P 1 sends request to P 2 P 1 creates monitor on P 2 P 1 memorizes request P 2 fails P1 P2 P 2 supervisor restarts P 2 { } Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 25 / 27
Generic process behaviour P 2 may fail: monitor P 1 sends request to P 2 P 1 creates monitor on P 2 P 1 memorizes request P 2 fails P 2 supervisor restarts P 2 P1 request P2 request is replayed to P 2 monitor is recreated { } Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 25 / 27
Generic process behaviour P 1 may fail: P 1 sends request to P 2 P 1 fails P1 request P2 Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 26 / 27
Generic process behaviour monitor P 1 may fail: P 1 sends request to P 2 P 2 creates monitor on P 1 P1 request P2 Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 26 / 27
Generic process behaviour monitor P 1 may fail: P 1 sends request to P 2 P 2 creates monitor on P 1 P 1 fails P1 request P2 Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 26 / 27
Generic process behaviour P 1 may fail: P 1 sends request to P 2 P 2 creates monitor on P 1 P 1 fails request is canceled supervisor restarts P 1 P1 P2 Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 26 / 27
Conclusion untar Cuneiform: gunzip bowtie2-build bowtie2-align samtools-view samtools-sort samtools-faidx samtools-mpileup varscan annovar Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 27 / 27
Conclusion untar Cuneiform: Functional gunzip bowtie2-build bowtie2-align samtools-view samtools-sort samtools-faidx samtools-mpileup varscan annovar Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 27 / 27
Conclusion untar Cuneiform: Functional Integrate anything gunzip bowtie2-build bowtie2-align samtools-view samtools-sort samtools-faidx samtools-mpileup varscan annovar Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 27 / 27
Conclusion untar Cuneiform: Functional Integrate anything Parallelism gunzip bowtie2-build bowtie2-align samtools-view samtools-sort samtools-faidx samtools-mpileup varscan annovar Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 27 / 27
Conclusion untar Cuneiform: Functional Integrate anything Parallelism gunzip bowtie2-build bowtie2-align samtools-view Runs on Hadoop samtools-sort samtools-faidx samtools-mpileup varscan annovar Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 27 / 27
Conclusion untar Cuneiform: Functional Integrate anything Parallelism Runs on Hadoop Implementation in Erlang: gunzip bowtie2-build bowtie2-align samtools-view samtools-sort samtools-faidx samtools-mpileup varscan annovar Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 27 / 27
Conclusion untar Cuneiform: Functional Integrate anything Parallelism Runs on Hadoop Implementation in Erlang: Concise stateless semantics gunzip bowtie2-build bowtie2-align samtools-view samtools-sort samtools-faidx samtools-mpileup varscan annovar Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 27 / 27
Conclusion untar Cuneiform: Functional Integrate anything Parallelism Runs on Hadoop Implementation in Erlang: Concise stateless semantics Fine-grained fault tolerance gunzip bowtie2-build bowtie2-align samtools-view samtools-sort samtools-faidx samtools-mpileup varscan annovar Jörgen Brandt (HU Berlin) Cuneiform 2015-12-01 27 / 27