#33 Downloading individual job's output using ASM in a faster way

Milestone: trunk
Status: open
Owner: nobody
Labels: ASM (1)
Priority: 1
Updated: 2013-07-26
Created: 2013-02-05
Private: No

Dear all.

I have a workflow consisting of three jobs:
1. a Generator job (local script),
2. a BOINC job,
3. a Collector job (local script).

My workflow inputs are relatively big: 20 MB per work unit (WU). After the execution, I only want to retrieve one specific output file of the Collector job. Unfortunately, this takes a long time when there are many WUs.

For example, assuming there are 100 BOINC tasks, more than 2 GB has to be copied (100 x 20 MB already adds up to 2 GB) if I use the getFiletoPortalServer() method of ASMService. Once the workflow zip file has been copied over, I extract the information I need and write it to the output stream of my portlet's resource response. On my virtual machine, this takes about 5 minutes.
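To make the extraction step concrete, here is a minimal sketch of pulling one entry out of a zip stream and copying it to an output stream (for instance the one of a portlet resource response). It uses only java.util.zip; the class name, the method name and the suffix-based matching are assumptions made for illustration and are not part of ASM. Note that the whole zip still has to be transferred before the wanted entry can be reached, which is exactly the cost described above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of ASM.
public class ZipEntryExtractor {

    /**
     * Copies the first non-directory entry whose name ends with nameSuffix
     * from the zip stream to target. Returns true if such an entry was found.
     * The zip stream is closed when the method returns; target is left open.
     */
    public static boolean copyEntry(InputStream zipStream, String nameSuffix,
                                    OutputStream target) throws IOException {
        try (ZipInputStream zin = new ZipInputStream(zipStream)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory() || !entry.getName().endsWith(nameSuffix)) {
                    continue; // skip everything the caller did not ask for
                }
                int n;
                while ((n = zin.read(buffer)) != -1) {
                    target.write(buffer, 0, n);
                }
                target.flush();
                return true;
            }
        }
        return false;
    }
}
```

In a portlet, target would typically be the resource response's output stream, and nameSuffix the path of the Collector output inside the archive (the actual layout of the archive is not shown here).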

I investigated and found that getFiletoPortalServer() calls a private method, getFileStreamFromStorage(), as follows:

is = getFileStreamFromStorage(userID, workflowID, DownloadTypeConstants.InstanceAll);

Then it writes the contents of the retrieved input stream (i.e. is) to the user's workflow_outputs directory ([TOMCAT_DIR]/temp/tmp/users/[USER_ID]/workflow_outputs/[WF_NAME]). So basically getFiletoPortalServer() copies everything (all jobs' outputs plus the first job's input) and then returns the file path to the caller.
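The behaviour described above can be sketched roughly as follows. This is not the ASM source, only an illustration of the pattern (copy the whole stream to disk, return the path), written with java.nio.file. The class name, the method name and the file name outputs.zip are assumptions; whether [WF_NAME] is a directory or the file itself is not clear from what I have seen, and it is treated as a directory here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration, not the actual ASM implementation.
public class WorkflowOutputCopy {

    /**
     * Copies the complete stream under
     * [tomcatDir]/temp/tmp/users/[userId]/workflow_outputs/[workflowName]
     * and returns the path of the written file.
     */
    public static Path copyToOutputs(InputStream is, Path tomcatDir,
                                     String userId, String workflowName) throws IOException {
        Path dir = tomcatDir.resolve("temp").resolve("tmp").resolve("users")
                .resolve(userId).resolve("workflow_outputs").resolve(workflowName);
        Files.createDirectories(dir);
        Path target = dir.resolve("outputs.zip"); // assumed file name
        try (InputStream in = is) {
            // Every byte of the stream is written, so the cost grows with
            // the total size of all inputs and outputs, not with the size
            // of the one file the caller is interested in.
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}
```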

ASMService provides another method, getFileStream(), but unfortunately it also passes InstanceAll to getFileStreamFromStorage(), which results in a long execution time. At the time of writing, a new function, getSingleOutputFileStream(), is also available in the trunk version, but it is quite slow as well, for similar reasons (its implementation is similar to that of the other two functions).

As far as I know, the gUSE portal supports downloading outputs of individual jobs: Concrete/workflow + details/select instance + details/select job + details/Download file output. It is much faster than the ASM approach.

It would be nice if ASM could provide a function that worked in a similar fashion (it should download only what the caller requests). This way we could avoid slow download speed.

Many thanks in advance!

Best regards,
Attila Sasvari

Discussion

  • Akos Balasko

    Akos Balasko - 2013-07-26

    Dear Attila,

    Thanks for describing the problem. The next ASM release (3.4.5) will contain a method that returns an InputStream coming from the servlet interface of the Storage component, so it does not use the slow web-service interface.
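    For readers unfamiliar with the idea, a minimal sketch of what reading an InputStream from an HTTP servlet endpoint can look like on the caller's side is given below. It uses only java.net; the class name, the method name and whatever URL or query parameters are passed in are assumptions for illustration, and the method in the actual release may look different.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical client-side sketch, not the ASM 3.4.5 API.
public class ServletDownload {

    /**
     * Opens a GET connection to the given URL and returns the response body
     * as a stream. The caller is responsible for closing the stream.
     */
    public static InputStream openStream(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(10_000); // milliseconds
        conn.setReadTimeout(60_000);    // milliseconds
        int status = conn.getResponseCode();
        if (status != HttpURLConnection.HTTP_OK) {
            conn.disconnect();
            throw new IOException("Unexpected HTTP status " + status);
        }
        // The body is streamed, so the caller can forward it without
        // first writing the whole response to a temporary file.
        return conn.getInputStream();
    }
}
```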

    Cheers,
    akos

     
  • Peter Kacsuk

    Peter Kacsuk - 2013-07-26

    **[feature-requests:#33] Downloading individual job's output using ASM in a faster way**

    Status: open
    Labels: ASM
    Created: Tue Feb 05, 2013 12:05 PM UTC by CPC Westminster
    Last Updated: Tue Feb 05, 2013 12:05 PM UTC
    Owner: nobody

    I will be on holiday until the 20th of August. In urgent cases, please contact my substitute, Dr. Robert Lovas (rlovas@sztaki.hu). For SCI-BUS matters, please contact Zoltan Farkas (zfarkas@sztaki.hu) and/or Eva Feuer (feuer@sztaki.hu).

    Regards,
    Peter

     