Prototype for batch Python interface #4937

Merged
merged 12 commits into from
Jan 30, 2019

jigold
Contributor

@jigold jigold commented Dec 10, 2018

@cseed I got a version working!!!! I'd like to test on other potential pipelines and make a real example that will run for demo purposes.

In [1]: from pyapi import Pipeline
   ...: p = Pipeline()
   ...:
   ...: subset = (p.new_task()
   ...:            .label('subset')
   ...:            .command('plink --bfile {{bfile}} --make-bed {{tmp1}}')
   ...:            .command("awk '{ print $1, $2}' {{tmp1}}.fam | sort | uniq -c | awk '{ if ($1 != 1) print $2, $3 }' > {{tmp2}}")
   ...:            .command("plink --bfile {{bfile}} --remove {{tmp2}} --make-bed {{ofile}}"))
   ...:
   ...: shapeit_tasks = []
   ...: for contig in [str(x) for x in range(1, 4)]:
   ...:     shapeit = (p.new_task()
   ...:                 .label('shapeit')
   ...:                 .command('shapeit --bed-file {{bfile}} --chr ' + contig + ' --out {{ofile}}')
   ...:                 .inputs(bfile=subset.ofile))
   ...:     shapeit_tasks.append(shapeit)
   ...:
   ...: merger = (p.new_task()
   ...:            .label('merge')
   ...:            .command('cat {{files}} >> {{ofile}}')
   ...:            .inputs(files=[task.ofile for task in shapeit_tasks]))
   ...:
   ...:
   ...: p.write_output(merger.ofile + ".haps", "gs://jigold/final_output.txt")
   ...: p.run()
   ...:
#! /usr/bash
set -ex


# __TASK__0 subset
__RESOURCE__0=/tmp/9CiA1t
__RESOURCE__1=/tmp/y7HdVA
__RESOURCE__2=/tmp/l7skDb
__RESOURCE__3=/tmp/McFulO
plink --bfile $__RESOURCE__1 --make-bed $__RESOURCE__0
awk '{ print $1, $2}' $__RESOURCE__0.fam | sort | uniq -c | awk '{ if ($1 != 1) print $2, $3 }' > $__RESOURCE__2
plink --bfile $__RESOURCE__1 --remove $__RESOURCE__2 --make-bed $__RESOURCE__3


# __TASK__1 shapeit
__RESOURCE__4=/tmp/PQiR68
__RESOURCE__5=/tmp/McFulO
shapeit --bed-file $__RESOURCE__5 --chr 1 --out $__RESOURCE__4


# __TASK__2 shapeit
__RESOURCE__6=/tmp/sjoOQX
__RESOURCE__7=/tmp/McFulO
shapeit --bed-file $__RESOURCE__7 --chr 2 --out $__RESOURCE__6


# __TASK__3 shapeit
__RESOURCE__8=/tmp/gNw0he
__RESOURCE__9=/tmp/McFulO
shapeit --bed-file $__RESOURCE__9 --chr 3 --out $__RESOURCE__8


# __TASK__4 merge
__RESOURCE__10=/tmp/RY0Raq
__RESOURCE__11=(/tmp/PQiR68 /tmp/sjoOQX /tmp/gNw0he)
cat ${__RESOURCE__11[*]} >> $__RESOURCE__10


# __TASK__5
__RESOURCE__14=gs://jigold/final_output.txt
__RESOURCE__15=/tmp/RY0Raq.haps
cp $__RESOURCE__15 $__RESOURCE__14

@cseed
Collaborator

cseed commented Dec 12, 2018

This is really great.

I have some thoughts below, mostly brainstorming. Don't take any of it too seriously.

Some thoughts:

  1. I thought you wanted to support f-strings. Because the batch syntax marks placeholders with double curly braces, using them in an f-string means writing f'{{{{foo}}}}', which is a bit much. It also means that non-batch uses of {} need to be doubled, so awk '{{ ... }}' and f"awk '{{{{ ... }}}}'". Hmm. Maybe using the same escape syntax as f-strings is not ideal.

I don't have a no-brainer suggestion. Happy to brainstorm ideas offline. I think ultimately this is a minor syntactic choice.

  2. Inputs seem ... almost redundant, because they also appear in the command strings. What about:
.command('shapeit --bed-file {{<subset.ofile}} --chr ' + contig + ' --out {{>ofile}}')

Then the question becomes, how do you associate subset with the corresponding Python variable? You could use the task label, but then the user has to maintain two sets of names, which isn't ideal. Hmm, maybe this doesn't work.

  3. I like arrays of resources!

.command('cat {{files}} >> {{ofile}}')

I wonder, will we ever want arrays to be formatted other than joined with spaces? I worry the user will want more flexibility in formatting, and we'll want that in Python. What about if the argument is a function, it takes a dictionary from resource names to their string representation, and you can format however you want? Then you could write the last command as:

  .command(lambda rs: f"cat {' '.join(rs['files'])} >> {rs['ofile']}")

  4. I was confused by this:

p.write_output(merger.ofile + ".haps", ...)

What's the left hand side? Why isn't this just merger.ofile?

This suggests another issue: what if you want to use ofile in a plink command, but plink outputs some files with various extensions with ofile as the base? We might need an outputs that lists (docker local) output files based on a base path.
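
To make the escaping concern in point 1 concrete, here is a small illustration of how literal braces stack up in Python when a `{{name}}` placeholder is written inside an f-string. The function names are made up for this example and are not part of the PR.

```python
# Illustration only: brace escaping when a {{name}} placeholder
# has to survive being written inside a Python f-string.

def plain_template() -> str:
    # Regular string: the placeholder is written exactly as the
    # template engine expects to see it.
    return "plink --bfile {{bfile}} --make-bed {{tmp1}}"


def fstring_template(contig: str) -> str:
    # f-string: every literal brace must be doubled, so the
    # placeholder {{ofile}} has to be typed as {{{{ofile}}}}.
    return f"shapeit --chr {contig} --out {{{{ofile}}}}"


def fstring_awk() -> str:
    # Literal awk braces inside an f-string need to be doubled once.
    return f"awk '{{ print $1, $2 }}' input.fam"


if __name__ == "__main__":
    print(plain_template())
    print(fstring_template("1"))
    print(fstring_awk())
```

The second function is the case that "is a bit much": four braces in the source for two in the resulting string.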

@jigold
Contributor Author

jigold commented Dec 12, 2018

All of these things I completely agree with!

  1. I was lazy and used the Jinja template engine to parse and find the variable declarations. I need to write a custom parser, but wanted to figure out exactly what we're going to support first. That worries me: I don't want to implement an expression language, or if we do, it should be minimal.

  2. Tim suggested something similar: %%IN bfile%% and %%OUT ofile%%. This requires the custom parser. See comment 1 above.

  3. I was also concerned about the formatting of arrays. I tried using lambdas for comment 4 and it got complicated. I like your proposal but want to think about it more.

  4. PLINK, etc. output lots of files and you specify the file root and then it outputs a bunch of files with different extensions. We must be able to support this and make it easy for users. I agree with your suggestion. I'll try that in the example.
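
As a rough idea of how small such a custom parser could be, here is a minimal sketch for the %%IN name%% / %%OUT name%% markers from comment 2. The exact marker grammar (one space, word characters only) and the function names `parse_command` and `render` are assumptions made for illustration, not something settled in this thread.

```python
import re

# Hypothetical marker grammar: %%IN name%% or %%OUT name%%.
MARKER = re.compile(r"%%(IN|OUT)\s+(\w+)%%")


def parse_command(template):
    """Return (inputs, outputs) named in a template, in order of first use."""
    inputs, outputs = [], []
    for kind, name in MARKER.findall(template):
        target = inputs if kind == "IN" else outputs
        if name not in target:
            target.append(name)
    return inputs, outputs


def render(template, paths):
    """Replace each marker with the path bound to its name."""
    return MARKER.sub(lambda m: paths[m.group(2)], template)


if __name__ == "__main__":
    cmd = "shapeit --bed-file %%IN bfile%% --chr 1 --out %%OUT ofile%%"
    print(parse_command(cmd))
    print(render(cmd, {"bfile": "/tmp/in", "ofile": "/tmp/out"}))
```

Because the markers do not use curly braces, awk programs and f-strings pass through untouched, and there is no expression language to maintain.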

@jigold
Contributor Author

jigold commented Dec 14, 2018

@cseed I don't really like how this looks with lists as the inputs to the commands. To me, it's much harder to read and modify. I like the commands looking as much like writing a shell script as possible. Maybe others feel differently though... Also, you can't do something like this ' '.join([task.ofile for task in shapeit_tasks]) because you'll lose information about the dependencies before command sees the original resource inputs. What I really want is a version of f-string interpolation where I parse and detect the variables (known and unknown), handle them properly by either creating new resources or adding dependencies to the Task, and then execute the Python formatting code inside the curly braces. I'm not sure if it is possible to do this. If it is, it's probably complicated and we'll have to use the Python ast and parser modules and call eval ourselves.

from pyapi import Pipeline
p = Pipeline()

bfile_root = 'gs://jigold/input'
bed = bfile_root + '.bed'
bim = bfile_root + '.bim'
fam = bfile_root + '.fam'

p.write_input(bed=bed, bim=bim, fam=fam)

subset = p.new_task()
subset = (subset
           .label('subset')
           .command(['plink', '--bed', p.bed, '--bim', p.bim, '--fam', p.fam, '--make-bed', '--out', subset.tmp1])
           .command(['awk', "'{ print $1, $2}'", subset.tmp1 + '.fam', "| sort | uniq -c | awk '{ if ($1 != 1) print $2, $3 }'",
                     '>', subset.tmp2])
           .command(['plink', '--bed', p.bed, '--bim', p.bim, '--fam', p.fam, '--remove', subset.tmp2,
                     '--make-bed', '--out', subset.ofile]))

shapeit_tasks = []
for contig in [str(x) for x in range(1, 4)]:
    shapeit = p.new_task()
    shapeit = (shapeit
                .label('shapeit')
                .command(['shapeit', '--bed-file', subset.ofile, '--chr ', contig, '--out', shapeit.ofile]))
    shapeit_tasks.append(shapeit)

merger = p.new_task()
merger = (merger
           .label('merge')
           .command(['cat', ' '.join([task.ofile for task in shapeit_tasks]), '>>', merger.ofile]))

p.write_output(merger.ofile + ".haps", "gs://jigold/final_output.txt")
p.run()
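
One possible way around the lost-dependency problem, sketched under assumptions: if a resource formats itself as a unique token, plain Python string operations such as `' '.join(...)` and f-strings keep the token in the final command, and the dependencies can be recovered by scanning for tokens afterwards. The `Resource` class, the registry, and `find_dependencies` below are hypothetical and only loosely modeled on the `__RESOURCE__N` names in the generated script.

```python
import itertools
import re

_counter = itertools.count()


class Resource:
    """Hypothetical placeholder for a file written by some task."""

    def __init__(self, producer):
        self.producer = producer  # label of the task that writes the file
        self.uid = f"__RESOURCE__{next(_counter)}"

    def __str__(self):
        # Formatting a Resource yields its unique token, not a real path.
        return self.uid


def find_dependencies(command, registry):
    """Look up the producing task for every resource token in a command."""
    tokens = re.findall(r"__RESOURCE__\d+", command)
    return [registry[token].producer for token in tokens]


if __name__ == "__main__":
    outputs = [Resource(f"shapeit-{i}") for i in (1, 2, 3)]
    registry = {r.uid: r for r in outputs}
    # Ordinary string formatting; the tokens survive the join.
    command = f"cat {' '.join(str(r) for r in outputs)} >> merged"
    print(command)
    print(find_dependencies(command, registry))
```

With this approach the command stays an ordinary string that reads like shell, and no AST inspection or custom `eval` is needed.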

@danking
Contributor

danking commented Dec 20, 2018

Should this have an assigned reviewer?

@jigold
Contributor Author

jigold commented Jan 3, 2019

@cseed Sorry if this doesn't make sense -- we can discuss in person.

Here's my attempt to fix the problems outlined above. The tradeoff is that we have to refer to the task object explicitly (e.g., subset). So instead of writing the command as a template, where inputs specifies how to substitute into it, writing commands in this interface is the same as using Python to generate the final string. I'm not sure that I like this better. One advantage of writing commands as templates is that they are reusable. With this interface, the strings are still reusable if they are written as .format() templates instead of f-strings. So maybe it's approximately the same, but there's an extra step to define the inputs to format.

I tried hacking the Python AST to not have to refer to the object, but I think it's going to be difficult to get the AST parsing exactly right and not have too many implicit rules within our language. I also considered writing a DSL, but found that it's hard to specify the part with the shapeit_output in a DSL.

from pyapi import Pipeline, resource_group
p = Pipeline()

input_bfile = p.new_resource_group(bed="gs://jigold/input_root.bed",
                                   bim="gs://jigold/input_root.bim",
                                   fam="gs://jigold/input_root.fam")

def bfile(root):
    return resource_group(root, lambda x: {"bed": x + ".bed", "bim": x + ".bim", "fam": x + ".fam"})

subset = (p.new_task()
           .label('subset'))
subset = (subset
           .command(f'plink --bfile {input_bfile} --make-bed {bfile(subset.tmp1)}')
           .command("awk '{print $1, $2}' " +
                    subset.tmp1.fam +
                    " | sort | uniq -c | awk '{ if ($1 != 1) print $2, $3 }' > " +
                    subset.tmp2)
           .command(f"plink --bfile {input_bfile} --remove {subset.tmp2} --make-bed {bfile(subset.ofile)}"))

def shapeit_output(root):
    return resource_group(root, lambda x: {"haps": x + ".haps", "log": x + ".log"})

for contig in [str(x) for x in range(1, 4)]:
    shapeit = (p.new_task()
                .label('shapeit'))
    shapeit = (shapeit
                .command(f'shapeit --bed-file {subset.ofile} --chr {contig} --out {shapeit_output(shapeit.ofile)}'))

merger = (p.new_task()
           .label('merge'))
merger = (merger
           .command('cat {files} >> {ofile}'.format(files=" ".join([task.ofile.haps for task in p.select_tasks("shapeit")]), ofile=merger.ofile)))

p.write_output(merger.ofile, "gs://jigold/final_output.txt")
p.run()

@jigold
Contributor Author

jigold commented Jan 9, 2019

@cseed I'm really happy with the interface now! Could you please look over this again and let me know if there are any suggestions you have before I write some tests and give this to someone to code review. I also called this pyapi for lack of a better name and it's currently in the batch module...

from pyapi import Pipeline, resource_group_builder

p = Pipeline() # initialize a pipeline

# Define resource group builders (used with `declare_resource_group`)
rgb_bfile = resource_group_builder(bed="{root}.bed",
                                   bim="{root}.bim",
                                   fam="{root}.fam")

rgb_shapeit = resource_group_builder(haps="{root}.haps",
                                     log="{root}.log")

# Import a file as a resource
file = p.write_input('gs://hail-jigold/random_file.txt')

# Import a set of input files as a resource group
input_bfile = p.write_input_group(bed='gs://hail-jigold/input.bed',
                                  bim='gs://hail-jigold/input.bim',
                                  fam='gs://hail-jigold/input.fam')

# Remove duplicate samples from a PLINK dataset
subset = p.new_task()
subset = (subset
          .label('subset')
          .declare_resource_group(tmp1=rgb_bfile, ofile=rgb_bfile)
          .command(f'plink --bfile {input_bfile} --make-bed {subset.tmp1}')
          .command(f"awk '{{ print $1, $2}}' {subset.tmp1.fam} | sort | uniq -c | awk '{{ if ($1 != 1) print $2, $3 }}' > {subset.tmp2}")
          .command(f"plink --bed {input_bfile.bed} --bim {input_bfile.bim} --fam {input_bfile.fam} --remove {subset.tmp2} --make-bed {subset.ofile}"
))

# Run shapeit for each contig from 1-3 with the output from subset
for contig in [str(x) for x in range(1, 4)]:
    shapeit = p.new_task()
    shapeit = (shapeit
                .label('shapeit')
                .declare_resource_group(ofile=rgb_shapeit)
                .command(f'shapeit --bed-file {subset.ofile} --chr {contig} --out {shapeit.ofile}'))

# Merge the shapeit output files together
merger = p.new_task()
merger = (merger
           .label('merge')
           .command('cat {files} >> {ofile}'.format(files=" ".join([t.ofile.haps for t in p.select_tasks('shapeit')]),
                                                    ofile=merger.ofile)))

# Write the result of the merger to a permanent location
p.write_output(merger.ofile, "gs://jigold/final_output.txt")

# Execute the pipeline
p.run()
#! /usr/bash
set -ex


# define tmp directory
__TMP_DIR__=/tmp//pipeline.yG41vqpS/


# __TASK__0 write_input
cp gs://hail-jigold/random_file.txt ${__TMP_DIR__}/rsfKylng


# __TASK__1 write_input
cp gs://hail-jigold/input.bed ${__TMP_DIR__}/xJONBVn7.bed


# __TASK__2 write_input
cp gs://hail-jigold/input.bim ${__TMP_DIR__}/xJONBVn7.bim


# __TASK__3 write_input
cp gs://hail-jigold/input.fam ${__TMP_DIR__}/xJONBVn7.fam


# __TASK__4 subset
__RESOURCE_GROUP__0=${__TMP_DIR__}/xJONBVn7
__RESOURCE_GROUP__1=${__TMP_DIR__}/TB7ZUbj8
__RESOURCE__6=${__TMP_DIR__}/TB7ZUbj8.fam
__RESOURCE__10=${__TMP_DIR__}/EVeRHf7V
__RESOURCE__1=${__TMP_DIR__}/xJONBVn7.bed
__RESOURCE__2=${__TMP_DIR__}/xJONBVn7.bim
__RESOURCE__3=${__TMP_DIR__}/xJONBVn7.fam
__RESOURCE_GROUP__2=${__TMP_DIR__}/MXBQugBx
plink --bfile ${__RESOURCE_GROUP__0} --make-bed ${__RESOURCE_GROUP__1}
awk '{ print $1, $2}' ${__RESOURCE__6} | sort | uniq -c | awk '{ if ($1 != 1) print $2, $3 }' > ${__RESOURCE__10}
plink --bed ${__RESOURCE__1} --bim ${__RESOURCE__2} --fam ${__RESOURCE__3} --remove ${__RESOURCE__10} --make-bed ${__RESOURCE_GROUP__2}


# __TASK__5 shapeit
__RESOURCE_GROUP__2=${__TMP_DIR__}/MXBQugBx
__RESOURCE_GROUP__3=${__TMP_DIR__}/YSm1XkKf
shapeit --bed-file ${__RESOURCE_GROUP__2} --chr 1 --out ${__RESOURCE_GROUP__3}


# __TASK__6 shapeit
__RESOURCE_GROUP__2=${__TMP_DIR__}/MXBQugBx
__RESOURCE_GROUP__4=${__TMP_DIR__}/1HyBvsdN
shapeit --bed-file ${__RESOURCE_GROUP__2} --chr 2 --out ${__RESOURCE_GROUP__4}


# __TASK__7 shapeit
__RESOURCE_GROUP__2=${__TMP_DIR__}/MXBQugBx
__RESOURCE_GROUP__5=${__TMP_DIR__}/jtM69Ahm
shapeit --bed-file ${__RESOURCE_GROUP__2} --chr 3 --out ${__RESOURCE_GROUP__5}


# __TASK__8 merge
__RESOURCE__11=${__TMP_DIR__}/YSm1XkKf.haps
__RESOURCE__13=${__TMP_DIR__}/1HyBvsdN.haps
__RESOURCE__15=${__TMP_DIR__}/jtM69Ahm.haps
__RESOURCE__17=${__TMP_DIR__}/z6ccazmC
cat ${__RESOURCE__11} ${__RESOURCE__13} ${__RESOURCE__15} >> ${__RESOURCE__17}


# __TASK__9 write_output
__RESOURCE__17=${__TMP_DIR__}/z6ccazmC
cp ${__RESOURCE__17} gs://jigold/final_output.txt


# remove tmp directory
rm -r ${__TMP_DIR__}
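
For readers wondering how a `{root}` mapping might turn into the `.bed` / `.bim` / `.fam` paths seen in the script above, here is a minimal sketch. It assumes the templates are ordinary `str.format` strings with a single `root` field; `expand_resource_group` is a made-up helper, not the function in this PR.

```python
def expand_resource_group(root, templates):
    """Fill in {root} for every member of a resource group."""
    return {name: template.format(root=root) for name, template in templates.items()}


if __name__ == "__main__":
    bfile = {"bed": "{root}.bed", "bim": "{root}.bim", "fam": "{root}.fam"}
    # Root name borrowed from the generated script above.
    paths = expand_resource_group("TB7ZUbj8", bfile)
    for name, path in paths.items():
        print(name, path)
```

The group itself can then be passed to tools that expect a file root (plink --bfile), while individual members such as `.fam` remain addressable.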

@jigold
Contributor Author

jigold commented Jan 9, 2019

@danking suggested we move this to a separate project from batch. Possibly call it pipeline.

@cseed
Collaborator

cseed commented Jan 9, 2019

This really does look great! I have two small suggestions:

  1. I feel like you should say read_input rather than write_input. I'm thinking these commands are from the perspective of the pipeline since they are on Pipeline.

  2. Rather than building a group and then declaring it, I think you can do both at once:

# Remove duplicate samples from a PLINK dataset
subset = p.new_task()
subset.declare_resource_groups(tmp1={'bed': '{root}.bed', 'bim': '{root}.bim', 'fam': '{root}.fam'},
    ofile={...})
subset = (subset
          .label('subset')
          .command(f'plink --bfile {input_bfile} --make-bed {subset.tmp1}')
          .command(f"awk '{{ print $1, $2}}' {subset.tmp1.fam} | sort | uniq -c | awk '{{ if ($1 != 1) print $2, $3 }}' > {subset.tmp2}")
          .command(f"plink --bed {input_bfile.bed} --bim {input_bfile.bim} --fam {input_bfile.fam} --remove {subset.tmp2} --make-bed {subset.ofile}"
))

@jigold
Contributor Author

jigold commented Jan 10, 2019

This is now ready to be reviewed. @danking Could you please help me set up the tests to run on the CI?

@catoverdrive This is an example of the interface and the output generated. There's also a tests file in there. I'm happy to explain the design to you if you'd like.

from pipeline import Pipeline

p = Pipeline() # initialize a pipeline

# Define mapping for taking a file root to a set of output files
bfile = {'bed': '{root}.bed', 'bim': '{root}.bim', 'fam': '{root}.fam'}

# Import a file as a resource
file = p.read_input('gs://hail-jigold/random_file.txt')

# Import a set of input files as a resource group
input_bfile = p.read_input_group(bed='gs://hail-jigold/input.bed',
                                 bim='gs://hail-jigold/input.bim',
                                 fam='gs://hail-jigold/input.fam')

# Remove duplicate samples from a PLINK dataset
subset = p.new_task()
subset = (subset
          .label('subset')
          .docker('ubuntu')
          .declare_resource_group(tmp1=bfile, ofile=bfile)
          .command(f'plink --bfile {input_bfile} --make-bed {subset.tmp1}')
          .command(f"awk '{{ print $1, $2}}' {subset.tmp1.fam} | sort | uniq -c | awk '{{ if ($1 != 1) print $2, $3 }}' > {subset.tmp2}")
          .command(f"plink --bed {input_bfile.bed} --bim {input_bfile.bim} --fam {input_bfile.fam} --remove {subset.tmp2} --make-bed {subset.ofile}"))

# Run shapeit for each contig from 1-3 with the output from subset
for contig in [str(x) for x in range(1, 4)]:
    shapeit = p.new_task()
    shapeit = (shapeit
                .label('shapeit')
                .declare_resource_group(ofile={'haps': "{root}.haps", 'log': "{root}.log"})
                .command(f'shapeit --bed-file {subset.ofile} --chr {contig} --out {shapeit.ofile}'))

# Merge the shapeit output files together
merger = p.new_task()
merger = (merger
           .label('merge')
           .command('cat {files} >> {ofile}'.format(files=" ".join([t.ofile.haps for t in p.select_tasks('shapeit')]),
                                                    ofile=merger.ofile)))

# Write the result of the merger to a permanent location
p.write_output(merger.ofile, "gs://jigold/final_output.txt")

# Execute the pipeline
p.run(dry_run=True)
#!/bin/bash
set -ex


# change cd to tmp directory
cd /tmp//pipeline.jlQrNJZW/


# __TASK__0 read_input
cp gs://hail-jigold/random_file.txt nfVpMp4n


# __TASK__1 read_input
cp gs://hail-jigold/input.bed 33qZtfwg.bed


# __TASK__2 read_input
cp gs://hail-jigold/input.bim 33qZtfwg.bim


# __TASK__3 read_input
cp gs://hail-jigold/input.fam 33qZtfwg.fam


# __TASK__4 subset
__RESOURCE_GROUP__0=33qZtfwg
__RESOURCE_GROUP__1=yibUlBkL
__RESOURCE__6=yibUlBkL.fam
__RESOURCE__10=29aBQihd
__RESOURCE__1=33qZtfwg.bed
__RESOURCE__2=33qZtfwg.bim
__RESOURCE__3=33qZtfwg.fam
__RESOURCE_GROUP__2=YXS0tQKi
plink --bfile ${__RESOURCE_GROUP__0} --make-bed ${__RESOURCE_GROUP__1}
awk '{ print $1, $2}' ${__RESOURCE__6} | sort | uniq -c | awk '{ if ($1 != 1) print $2, $3 }' > ${__RESOURCE__10}
plink --bed ${__RESOURCE__1} --bim ${__RESOURCE__2} --fam ${__RESOURCE__3} --remove ${__RESOURCE__10} --make-bed ${__RESOURCE_GROUP__2}


# __TASK__5 shapeit
__RESOURCE_GROUP__2=YXS0tQKi
__RESOURCE_GROUP__3=gidGmbcC
shapeit --bed-file ${__RESOURCE_GROUP__2} --chr 1 --out ${__RESOURCE_GROUP__3}


# __TASK__6 shapeit
__RESOURCE_GROUP__2=YXS0tQKi
__RESOURCE_GROUP__4=W5hjCmPK
shapeit --bed-file ${__RESOURCE_GROUP__2} --chr 2 --out ${__RESOURCE_GROUP__4}


# __TASK__7 shapeit
__RESOURCE_GROUP__2=YXS0tQKi
__RESOURCE_GROUP__5=ySM8T0lZ
shapeit --bed-file ${__RESOURCE_GROUP__2} --chr 3 --out ${__RESOURCE_GROUP__5}


# __TASK__8 merge
__RESOURCE__11=gidGmbcC.haps
__RESOURCE__13=W5hjCmPK.haps
__RESOURCE__15=ySM8T0lZ.haps
__RESOURCE__17=Z5OLJG6Y
cat ${__RESOURCE__11} ${__RESOURCE__13} ${__RESOURCE__15} >> ${__RESOURCE__17}


# __TASK__9 write_output
__RESOURCE__17=Z5OLJG6Y
cp ${__RESOURCE__17} gs://jigold/final_output.txt
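
The merge task above collects its inputs with p.select_tasks('shapeit'). The thread does not say how the lookup works, so the following is only a sketch of one plausible behavior: match each task's label against a regular expression. The `Task` class here is a stand-in with nothing but a label.

```python
import re


class Task:
    """Minimal stand-in for a pipeline task; only the label matters here."""

    def __init__(self, label=None):
        self.label = label


def select_tasks(tasks, pattern):
    """Return the tasks whose label fully matches the regular expression."""
    return [t for t in tasks
            if t.label is not None and re.fullmatch(pattern, t.label)]


if __name__ == "__main__":
    tasks = [Task("subset"), Task("shapeit"), Task("shapeit"), Task("merge"), Task()]
    print([t.label for t in select_tasks(tasks, "shapeit")])
```

Selecting by label means the loop that creates the shapeit tasks does not have to keep its own list, at the cost of relying on labels being chosen consistently.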

@jigold
Contributor Author

jigold commented Jan 11, 2019

Other things to add in separate PRs:

  • logging
  • concept of a ResourceDirectory where you want to copy the files in/out from a directory
  • change the temp dir to be per task
  • Support environment variables, cpu, memory

Contributor

@catoverdrive catoverdrive left a comment

Looks great!

Let me know when you get the tests working and I'll approve it.

setup(
name = 'pipeline',
version = '0.0.1',
url = 'https://github.com/hail-is/pipeline.git',
Contributor

I don't think this URL works

@jigold
Contributor Author

jigold commented Jan 14, 2019

@catoverdrive Here's the output with docker commands:

#!/bin/bash



# change cd to tmp directory
cd /tmp//pipeline.S9YTZap5/


# __TASK__0 read_input
cp gs://hail-jigold/random_file.txt DWRmR1Lh


# __TASK__1 read_input
cp gs://hail-jigold/input.bed Aw2arWP9.bed


# __TASK__2 read_input
cp gs://hail-jigold/input.bim Aw2arWP9.bim


# __TASK__3 read_input
cp gs://hail-jigold/input.fam Aw2arWP9.fam


# __TASK__4 subset
docker run -v /tmp//pipeline.S9YTZap5/:/tmp//pipeline.S9YTZap5/ -w /tmp//pipeline.S9YTZap5/ ubuntu /bin/bash -c '__RESOURCE_GROUP__0=Aw2arWP9; __RESOURCE_GROUP__1=srXTmGQE; __RESOURCE__6=srXTmGQE.fam; __RESOURCE__10=8ueGZQqn; __RESOURCE__1=Aw2arWP9.bed; __RESOURCE__2=Aw2arWP9.bim; __RESOURCE__3=Aw2arWP9.fam; __RESOURCE_GROUP__2=ESEFn8Tm; plink --bfile ${__RESOURCE_GROUP__0} --make-bed ${__RESOURCE_GROUP__1}&& awk '"'"'{ print $1, $2}'"'"' ${__RESOURCE__6} | sort | uniq -c | awk '"'"'{ if ($1 != 1) print $2, $3 }'"'"' > ${__RESOURCE__10}&& plink --bed ${__RESOURCE__1} --bim ${__RESOURCE__2} --fam ${__RESOURCE__3} --remove ${__RESOURCE__10} --make-bed ${__RESOURCE_GROUP__2}'


# __TASK__5 shapeit
docker run -v /tmp//pipeline.S9YTZap5/:/tmp//pipeline.S9YTZap5/ -w /tmp//pipeline.S9YTZap5/ gcr.io/shapeit /bin/bash -c '__RESOURCE_GROUP__2=ESEFn8Tm; __RESOURCE_GROUP__3=K1TfWX3n; shapeit --bed-file ${__RESOURCE_GROUP__2} --chr 1 --out ${__RESOURCE_GROUP__3}'


# __TASK__6 shapeit
docker run -v /tmp//pipeline.S9YTZap5/:/tmp//pipeline.S9YTZap5/ -w /tmp//pipeline.S9YTZap5/ gcr.io/shapeit /bin/bash -c '__RESOURCE_GROUP__2=ESEFn8Tm; __RESOURCE_GROUP__4=8dRi0LwZ; shapeit --bed-file ${__RESOURCE_GROUP__2} --chr 2 --out ${__RESOURCE_GROUP__4}'


# __TASK__7 shapeit
docker run -v /tmp//pipeline.S9YTZap5/:/tmp//pipeline.S9YTZap5/ -w /tmp//pipeline.S9YTZap5/ gcr.io/shapeit /bin/bash -c '__RESOURCE_GROUP__2=ESEFn8Tm; __RESOURCE_GROUP__5=NIqfevqS; shapeit --bed-file ${__RESOURCE_GROUP__2} --chr 3 --out ${__RESOURCE_GROUP__5}'


# __TASK__8 merge
docker run -v /tmp//pipeline.S9YTZap5/:/tmp//pipeline.S9YTZap5/ -w /tmp//pipeline.S9YTZap5/ ubuntu /bin/bash -c '__RESOURCE__11=K1TfWX3n.haps; __RESOURCE__13=8dRi0LwZ.haps; __RESOURCE__15=NIqfevqS.haps; __RESOURCE__17=GLxOwBss; cat ${__RESOURCE__11} ${__RESOURCE__13} ${__RESOURCE__15} >> ${__RESOURCE__17}'


# __TASK__9 write_output
__RESOURCE__17=GLxOwBss
cp ${__RESOURCE__17} gs://jigold/final_output.txt
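
The '"'"' sequences in the docker lines above are the usual way to embed a single quote inside a single-quoted shell string. Python's standard `shlex.quote` produces the same form, so a wrapper along these lines could generate such commands. The function `docker_command` and its parameters are assumptions for illustration; the PR's actual code may differ.

```python
import shlex


def docker_command(image, workdir, definitions, commands):
    """Wrap variable definitions and commands in one `docker run ... bash -c`."""
    # Definitions are separated by ';', commands are chained with '&&'
    # so a failing command stops the rest of the task.
    script = "; ".join(definitions + [" && ".join(commands)])
    return (f"docker run -v {workdir}:{workdir} -w {workdir} {image} "
            f"/bin/bash -c {shlex.quote(script)}")


if __name__ == "__main__":
    print(docker_command(
        image="ubuntu",
        workdir="/tmp/pipeline.example/",
        definitions=["FAM=input.fam", "OUT=dups.txt"],
        commands=["awk '{ print $1, $2 }' ${FAM} | sort | uniq -c > ${OUT}"],
    ))
```

Splitting the result with `shlex.split` returns the original script as the last argument, which is a convenient way to check that the quoting round-trips.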

Contributor

@tpoterba tpoterba left a comment

Requesting changes so this doesn't keep getting tested. Batch testing is broken.

@danking danking merged commit c883545 into hail-is:master Jan 30, 2019
5 participants