Pipeline still unstable with large .fastqs #72

Open
Firedrops opened this issue Feb 18, 2019 · 9 comments
@Firedrops
Contributor

Firedrops commented Feb 18, 2019

I have tried increasing the provisioning MACHINE_TYPE to n1-standard-8 (8 vCPUs and 30 GB RAM), which should be more than enough for any of the reference files.

Large files (>~100 KB?) still get stuck with these error logs. Once these appear, the pipeline appears to be unsalvageable and needs to be cancelled and restarted.

2019-02-18 (11:33:28) Processing stuck in step Alignment for at least 05m00s without outputting or completing in state pro...

Processing stuck in step Alignment for at least 05m00s without outputting or completing in state process
  at java.net.SocketInputStream.socketRead0(Native Method)
  at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
  at java.net.SocketInputStream.read(SocketInputStream.java:170)
  at java.net.SocketInputStream.read(SocketInputStream.java:141)
  at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
  at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
  at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282)
  at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
  at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
  at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
  at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
  at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165)
  at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
  at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
  at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
  at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
  at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
  at org.apache.http.impl.execchain.ServiceUnavailableRetryExec.execute(ServiceUnavailableRetryExec.java:85)
  at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
  at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
  at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:72)
  at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:221)
  at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:165)
  at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:140)
  at com.theappsolutions.nanostream.util.HttpHelper.executeRequest(HttpHelper.java:105)
  at com.theappsolutions.nanostream.http.NanostreamHttpService.generateAlignData(NanostreamHttpService.java:58)
  at com.theappsolutions.nanostream.aligner.MakeAlignmentViaHttpFn.processElement(MakeAlignmentViaHttpFn.java:49)
  at com.theappsolutions.nanostream.aligner.MakeAlignmentViaHttpFn$DoFnInvoker.invokeProcessElement(Unknown Source)

2019-02-18 (11:38:28) Processing stuck in step Alignment for at least 10m00s without outputting or completing in state pro...

Processing stuck in step Alignment for at least 10m00s without outputting or completing in state process
  at java.net.SocketInputStream.socketRead0(Native Method)
  at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
  at java.net.SocketInputStream.read(SocketInputStream.java:170)
  at java.net.SocketInputStream.read(SocketInputStream.java:141)
  at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
  at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
  at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282)
  at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
  at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
  at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
  at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
  at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165)
  at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
  at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
  at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
  at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
  at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
  at org.apache.http.impl.execchain.ServiceUnavailableRetryExec.execute(ServiceUnavailableRetryExec.java:85)
  at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
  at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
  at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:72)
  at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:221)
  at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:165)
  at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:140)
  at com.theappsolutions.nanostream.util.HttpHelper.executeRequest(HttpHelper.java:105)
  at com.theappsolutions.nanostream.http.NanostreamHttpService.generateAlignData(NanostreamHttpService.java:58)
  at com.theappsolutions.nanostream.aligner.MakeAlignmentViaHttpFn.processElement(MakeAlignmentViaHttpFn.java:49)
  at com.theappsolutions.nanostream.aligner.MakeAlignmentViaHttpFn$DoFnInvoker.invokeProcessElement(Unknown Source)

2019-02-18 (11:38:38) org.apache.http.client.ClientProtocolException: Unexpected response status: 502

org.apache.http.client.ClientProtocolException: Unexpected response status: 502
        com.theappsolutions.nanostream.http.NanostreamResponseHandler.handleResponse(NanostreamResponseHandler.java:39)
        com.theappsolutions.nanostream.http.NanostreamResponseHandler.handleResponse(NanostreamResponseHandler.java:17)
        org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:223)
        org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:165)
        org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:140)
        com.theappsolutions.nanostream.util.HttpHelper.executeRequest(HttpHelper.java:105)
        com.theappsolutions.nanostream.http.NanostreamHttpService.generateAlignData(NanostreamHttpService.java:58)
        com.theappsolutions.nanostream.aligner.MakeAlignmentViaHttpFn.processElement(MakeAlignmentViaHttpFn.java:49)
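
Both "Processing stuck" traces end in SocketInputStream.socketRead0 under DefaultHttpResponseParser.parseHead, which suggests the worker is blocked waiting for the aligner's response headers. Below is a minimal sketch, written in Python rather than the pipeline's Java, of how a client-side read timeout turns an open-ended wait into a prompt, reportable failure. The function name `post_reads`, the `opener` parameter, and the timeout value are hypothetical and not taken from the nanostream code; whether the real client already sets a timeout is not known from this thread.

```python
import socket
import urllib.error
import urllib.request


def post_reads(url, payload, timeout_s=60.0, opener=urllib.request.urlopen):
    """POST one batch of reads and return (status, body).

    Raises TimeoutError if no response arrives within `timeout_s` seconds.
    `opener` is injectable only so the function can be exercised offline.
    """
    request = urllib.request.Request(url, data=payload, method="POST")
    try:
        with opener(request, timeout=timeout_s) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # Non-2xx replies such as 502 land here; report them instead of hanging.
        return err.code, err.read()
    except urllib.error.URLError as err:
        # Connection-phase timeouts are wrapped in URLError.
        if isinstance(err.reason, (socket.timeout, TimeoutError)):
            raise TimeoutError(f"no response within {timeout_s} s") from err
        raise
    except (socket.timeout, TimeoutError) as err:
        # Read-phase timeouts (waiting for the response head) arrive unwrapped.
        raise TimeoutError(f"no response within {timeout_s} s") from err
```

With a bound like this, a stalled aligner call fails after `timeout_s` seconds and can be logged or retried, rather than holding the step for 5 to 10 minutes.
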
@lachlancoin

lachlancoin commented Feb 18, 2019 via email

@allenday
Owner

allenday commented Feb 18, 2019 via email

@Firedrops
Contributor Author

Firedrops commented Feb 18, 2019

> The last stack trace indicates HTTP 502. You may have flooded the alignment cluster. How many reads are you submitting per batch?

Just 1.
On further testing, file size does not seem to be the main issue. 20170731_GP01_MNP_nohuman.fastq (866 KB) always causes that error.
A cassava file, test_Cassava_KE.barcode1_KE.barcode1.fasta (1,111 KB), did not cause the error.
Another cassava file, test_Cassava_UG.Barcode1_UG.Barcode1.fastq (43,940 KB), caused the 5-minute "stuck" error.

I'm still testing; it is slow because each run takes 5 minutes before the error appears. For now it looks like big .fasta files are OK, but .fastq files are not.

UPDATE:
Testing with another large fastq file also causes the 502 errors, as well as multiple occurrences of this:

2019-02-18 (15:16:55) java.net.SocketException: Broken pipe

java.net.SocketException: Broken pipe
        java.net.SocketOutputStream.socketWrite0(Native Method)
        java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
        java.net.SocketOutputStream.write(SocketOutputStream.java:153)
        org.apache.http.impl.io.SessionOutputBufferImpl.streamWrite(SessionOutputBufferImpl.java:124)
        org.apache.http.impl.io.SessionOutputBufferImpl.flushBuffer(SessionOutputBufferImpl.java:136)
        org.apache.http.impl.io.SessionOutputBufferImpl.write(SessionOutputBufferImpl.java:167)
        org.apache.http.impl.io.ContentLengthOutputStream.write(ContentLengthOutputStream.java:113)
        org.apache.http.entity.mime.content.StringBody.writeTo(StringBody.java:174)
        org.apache.http.entity.mime.AbstractMultipartForm.doWriteTo(AbstractMultipartForm.java:134)
        org.apache.http.entity.mime.AbstractMultipartForm.writeTo(AbstractMultipartForm.java:157)
        org.apache.http.entity.mime.MultipartFormEntity.writeTo(MultipartFormEntity.java:113)
        org.apache.http.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:156)
        org.apache.http.impl.conn.CPoolProxy.sendRequestEntity(CPoolProxy.java:160)
        org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:238)
        org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
        org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
        org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
        org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
        org.apache.http.impl.execchain.ServiceUnavailableRetryExec.execute(ServiceUnavailableRetryExec.java:85)
        org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
        org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
        org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:72)
        org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:221)
        org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:165)
        org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:140)
        com.theappsolutions.nanostream.util.HttpHelper.executeRequest(HttpHelper.java:105)
        com.theappsolutions.nanostream.http.NanostreamHttpService.generateAlignData(NanostreamHttpService.java:58)
        com.theappsolutions.nanostream.aligner.MakeAlignmentViaHttpFn.processElement(MakeAlignmentViaHttpFn.java:49)

As @lachlancoin suggested, this might be a batching issue, possibly one implemented in a way that works well with .fasta but not with .fastq?
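
One concrete difference between the two formats is that a FASTQ record normally spans four lines (header, sequence, `+`, quality string) while a FASTA record needs only a header and a sequence, so the same reads make a noticeably larger FASTQ payload. The sketch below, which is not the pipeline's batching code, reads four-line FASTQ records, groups them into batches, and compares the bytes a batch would occupy in each format. All function names are hypothetical, and multi-line FASTQ is assumed not to occur.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, quality) tuples from four-line FASTQ records."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if not record:
            return
        if len(record) != 4 or not record[0].startswith("@") or not record[2].startswith("+"):
            raise ValueError(f"malformed FASTQ record: {record!r}")
        yield record[0], record[1], record[3]


def batches(records, batch_size):
    """Group records into lists of at most `batch_size` items."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def payload_sizes(batch):
    """Return (fastq_bytes, fasta_bytes) for one batch of records."""
    fastq = sum(len(f"{h}\n{s}\n+\n{q}\n".encode()) for h, s, q in batch)
    # FASTA drops the quality line and uses '>' instead of '@' in the header.
    fasta = sum(len(f">{h[1:]}\n{s}\n".encode()) for h, s, _ in batch)
    return fastq, fasta
```

Running `payload_sizes` over each batch of a failing file would show whether request size, rather than read count, separates the .fastq runs that fail from the .fasta run that passed.
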

@Firedrops
Contributor Author

For specifics, I am using the current provisioning script (provision_species.sh), directly calling Allen's bwa-http-docker, and the dataflow command in the README.

The following modifications were made:

  1. change project name to nano-stream1
  2. change pubsub subscription name to ours (dataflow_species)
  3. specify the zone (asia-northeast1-c)
  4. change firestore names and tokens (mostly in the visualizer app) to ours.
  5. change provisioner machine type to n1-standard-8. This did not appear to help, so I should change it back to n1-standard-4, but have not done so yet.

I wonder if this issue might have been solved previously but not yet committed to the main branch? Most of the commits there are about a week old or more, and these issues have been mentioned in #23 so @obsh and @Pseverin would have known about them for a while.

@obsh
Collaborator

obsh commented Feb 20, 2019

I think we'll make batch size configurable, to try smaller fastq batches with the aligner.
Meanwhile, you can try decreasing it in the code and recompiling the jar file.
https://github.com/allenday/nanostream-dataflow/blob/master/NanostreamDataflowMain/src/main/java/com/google/allenday/nanostream/NanostreamApp.java#L54

Also, a new build of the allenday/bwa-http-docker:http container is available. It's not a performance improvement, just more correct error handling.

@Firedrops
Contributor Author

I agree; we just ran into the problem again with the EDTA sample. We'll try 100 and maybe 50 tomorrow. Since our builds seemed imperfect the last few times, it'd be a good idea to pull the batch size out into an argument.

@Firedrops
Contributor Author

Firedrops commented Feb 22, 2019

Have tried down to batch size 25, seems to slow down the entire pipeline, no firestore results generated after ~30 mins run time on alignment step. We got the 5 min error in the end and the whole thing had to be cancelled.

Also, it seems that once the 5 min error occurs, the whole provisioning cluster needs to be restarted. If we only restart the Dataflow job, we immediately get broken pipe errors:

image

UPDATE: Never mind, restarting the provisioning cluster doesn't seem to help either. The failures look random: the same builds and fastq files sometimes work and sometimes don't. We occasionally also get 404 errors.

image
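
For intermittent failures like these (502, 404, broken pipe), one common mitigation is a bounded retry with exponential backoff around the alignment request. The stack traces above show the Java client passing through RetryExec and ServiceUnavailableRetryExec, but how those are configured is not visible in this thread, so the following is only a generic sketch. `RetryableError`, `call_with_backoff`, and `send` are hypothetical names, and deciding which failures count as retryable is left to the caller.

```python
import time


class RetryableError(Exception):
    """Raised by `send` for failures worth retrying (e.g. 502, broken pipe)."""


def call_with_backoff(send, attempts=4, base_delay_s=1.0, sleep=time.sleep):
    """Call `send()` until it succeeds or `attempts` tries are used up.

    Waits base_delay_s, 2 * base_delay_s, 4 * base_delay_s, ... between tries.
    """
    for attempt in range(attempts):
        try:
            return send()
        except RetryableError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the failure
            sleep(base_delay_s * 2 ** attempt)
```

Backoff only helps if the aligner recovers on its own; if the cluster stays wedged until restarted, as described above, retries will simply delay the same failure.
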

@obsh
Collaborator

obsh commented Feb 26, 2019

> it'd be a good idea to pull the batch size out into an argument

Done now; see the optional --alignmentBatchSize parameter.

> Have tried down to batch size 25, seems to slow down the entire pipeline, no firestore results generated after ~30 mins run time on alignment step.

I've experimented with batch size, and it looks like a bigger batch size actually improves performance, because bwa's startup time then adds less overhead. The default value is 2000, as it worked well on the "dogbite" dataset in my tests.
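
The overhead argument can be made concrete with a rough model: if every batch pays a fixed bwa startup cost once, total time is approximately ceil(reads / batch size) × startup + reads × per-read time. The sketch below evaluates that model; the startup and per-read figures are invented for illustration and are not measurements from this pipeline, and the model ignores parallel workers and OS cache effects.

```python
import math


def estimated_alignment_seconds(n_reads, batch_size, startup_s, per_read_s):
    """Estimate serial alignment time when each batch pays `startup_s` once."""
    n_batches = math.ceil(n_reads / batch_size)
    return n_batches * startup_s + n_reads * per_read_s


# Hypothetical figures for illustration only: 4000 reads, 30 s for bwa to
# load the reference, 0.05 s of alignment work per read.
for size in (25, 500, 2000):
    total = estimated_alignment_seconds(4000, size, startup_s=30.0, per_read_s=0.05)
    print(f"batch size {size}: about {total:.0f} s")
```

Under these assumed numbers, a batch size of 25 pays the startup cost 160 times, while a batch size of 2000 pays it twice, which is consistent with small batches slowing the whole pipeline.
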

I assume that at least an n1-highmem-8 machine is required for the aligner when using the species reference database. With less memory the OS buffer cache does not seem to work, while with n1-highmem-8 the bwa loading time improves significantly on subsequent calls.

Also, in #95 we introduced the optional --bwaArguments parameter. With the default value '-t 4', bwa now uses 4 threads. For n1-highmem-8 you can even try --bwaArguments='-t 8' for better aligner performance.

@lachlancoin

I am still having a problem with large fastq files; see #98 (connection refused during the alignment step). Basically, the dataflow stalls at the alignment step and nothing comes out of it. This fastq had 4000 records, and I set a batch size of 500 (using the standard bwa docker). The scripts I use are here:
https://github.com/lachlancoin/gcloud/blob/master/init.sh
I set target-cpu-utilization to 0.5 (to manage costs!) and use the default '-t 4'.

I was wondering if it's possible to avoid the CGI step, which is problematic, by using Pub/Sub instead. I have some thoughts, which I will put in a new issue.

obsh added the tracked label on Sep 27, 2019