Refactored Dispatcher Actions

The refactored dispatcher uses a Pipeline structure; see Pipeline construction and flow. Actions can have internal pipelines containing more actions. Actions are selected for a particular job using a Strategy (see Using strategy classes) which uses the parameters in the job submission and the device configuration to build the top level pipeline.

The refactored dispatcher does not make assumptions or guesses - if the job submission does not specify a piece of data, that piece of data will not be available to the pipeline. This may cause the job submission to be rejected if one or more Actions selected by the Strategy require this information. See Keep the dispatcher dumb.

Dispatcher Actions

Job submissions for the refactored dispatcher use YAML and can create a pipeline of actions based on five basic types. Parameters in the YAML and in the device configuration are used to select the relevant Strategy for the job and this determines which actions are added to the pipeline.

In addition, the job has some general parameters, including a job name and Timeouts.
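For example, a job might begin with a header along these lines (a sketch only - the values are illustrative and drawn from the KVM x86 example later in this document; the timeouts block is described under Timeouts):

job_name: kvm-pipeline
timeouts:
  job:
    minutes: 15
priority: medium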

Deploy

Many deployment strategies will run on the dispatcher. As such, these actions may contain commands which cannot be overridden in the job. See Only protect the essential components.

In general, the deployments do not modify the downloaded files. Where the LAVA scripts and test definitions need to be added, these are first prepared as a standalone tarball which is also retained within the final job data and is available for download later. Exceptions include specific requirements of bootloaders (like u-boot) to have a bootloader-specific header on a ramdisk to which LAVA needs to add the LAVA extensions.

  • Download files required by the job to the dispatcher, decompressing only if requested.
  • Prepare a LAVA extensions tarball containing the test definitions and LAVA API scripts, only if a Test action is defined.
  • Depending on the deployment, apply the LAVA extensions tarball to the deployment.
  • Deploy does not support repeat blocks but does support Retry on failure.

Parameters

Every deployment must specify a to parameter. This value is then used to select the appropriate Strategy class for the deployment which, in turn, will require other parameters to provide the data on how to deploy to the requested location.

  • to

    • tmpfs: Used to support QEMU device types which run on a dispatcher. The file is downloaded to a temporary directory and made available as an image to a predetermined QEMU command line:

      to: tmpfs
      
      • Requires an image parameter:

        image: http://images.validation.linaro.org/kvm-debian-wheezy.img.gz
        
      • The operating system of the image must be specified so that the LAVA scripts can install packages and identify other defaults in the deployment data. Supported values are android, ubuntu, debian or oe:

        os: debian
        
      • If the image is compressed, the compression method must be specified if any test actions are defined in the job. Supported values are gz, bz2 and xz:

        compression: gz
        
    • tftp: Used to support TFTP deployments, e.g. using UBoot. Files are downloaded to a temporary directory in the TFTP tree and the filenames are substituted into the bootloader commands specified in the device configuration or overridden in the job. The files to download typically include a kernel but can also include any file which the substitution commands need for this deployment. URL support is handled by the python requests module.

      to: tftp
      
      • kernel - in a format appropriate to the commands in use:

        kernel: http://images.validation.linaro.org/functional-test-images/bbb/zImage
        
      • dtb - in a format appropriate to the commands in use:

        dtb: http://images.validation.linaro.org/functional-test-images/bbb/am335x-bone.dtb
        
      • ramdisk - in a format appropriate to the commands in use. If a UBoot header is required, it must already have been added before download and the ramdisk-type: u-boot option specified. The original header is removed before unpacking so that the LAVA scripts can be overlaid, after which the header is replaced:

        ramdisk: http://images.validation.linaro.org/functional-test-images/common/linaro-image-minimal-initramfs-genericarmv7a.cpio.gz.u-boot
        ramdisk-type: u-boot
        
      • nfsrootfs - must be a tarball, compressed with either gz or bz2, handled using the standard python tarfile support. The rootfs is unpacked into a temporary directory on the dispatcher, in a location covered by the NFS exports:

        nfsrootfs: http://images.validation.linaro.org/debian-jessie-rootfs.tar.gz
        
      • os - The operating system of the NFS must be specified so that the LAVA scripts can install packages and identify other defaults in the deployment data. Supported values are android, ubuntu, debian or oe:

        os: debian
        
    • usb: Deploy unchanged images to secondary USB media. Any bootloader inside the image will not be used. Instead, the files needed for the boot are specified in the deployment. The entire physical device is available to the secondary deployment. Secondary relates to the expected requirement of a primary boot (e.g. ramdisk or NFS) which provides a suitable working environment to deploy the image directly to the secondary device. See Secondary media.

      Not all devices support USB media.

      The test writer needs to provide the following information about the image:

      • kernel: The path, within the image, to the kernel which will be used by the bootloader.
      • ramdisk: (optional). If used, must be a path, within the image, which the bootloader can use.
      • dtb: The path, within the image, to the dtb which will be used by the bootloader.
      • UUID: The UUID of the partition which contains the root filesystem of the booted image.
      • boot_part: the partition on the media from which the bootloader can read the kernel, ramdisk & dtb.

      Note

      If the image mounts the boot partition at a mountpoint below the root directory of the image, the path to files within that partition must not include that mountpoint. The bootloader will read the files directly from the partition.

      The UUID can be obtained by writing the image to local media and checking the contents of /dev/disk/by-uuid.

      The ramdisk may need adjustment for some bootloaders (like UBoot), so mount the local media and use something like:

      mkimage -A arm -T ramdisk -C none -d /mnt/boot/init.. /mnt/boot/init..u-boot
      
    • sata: Deploy unchanged images to secondary SATA media. Any bootloader inside the image will not be used. Instead, the files needed for the boot are specified in the deployment. The entire physical device is available to the secondary deployment. Secondary relates to the expected requirement of a primary boot (e.g. ramdisk or NFS) which provides a suitable working environment to deploy the image directly to the secondary device. See Secondary media.

      Not all devices support SATA media.

      The test writer needs to provide the following information about the image:

      • kernel: The path, within the image, to the kernel which will be used by the bootloader.
      • ramdisk: (optional). If used, must be a path, within the image, which the bootloader can use.
      • dtb: The path, within the image, to the dtb which will be used by the bootloader.
      • UUID: The UUID of the partition which contains the root filesystem of the booted image.
      • boot_part: the partition on the media from which the bootloader can read the kernel, ramdisk & dtb.

      Note

      If the image mounts the boot partition at a mountpoint below the root directory of the image, the path to files within that partition must not include that mountpoint. The bootloader will read the files directly from the partition.
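
For the usb and sata deployments described above, a deploy block could be sketched roughly as follows. This is only an illustration built from the parameter list above - the exact key names, nesting and any additional parameters (such as the location of the image itself) should be taken from the Secondary media documentation and the sample jobs:

- deploy:
    to: usb
    # illustrative values only - key names mirror the parameter list above
    kernel: /boot/vmlinuz
    ramdisk: /boot/initrd.img
    dtb: /boot/board.dtb
    UUID: 159d17cc-697c-4125-95a0-a2775e1deabe
    boot_part: 1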

Deploy example

actions:

   - deploy:
       timeout:
         minutes: 2
       to: tmpfs
       image: http://images.validation.linaro.org/kvm-debian-wheezy.img.gz
       compression: gz
       os: debian

Boot

Cause the device to boot using the deployed files. Depending on the Strategy class, this could be by executing a command on the dispatcher (for example qemu) or by connecting to the device. Depending on the power state of the device and the device configuration, the device may be powered up or reset to provoke the boot.

Every boot action must specify a method which is used by the Strategy classes to determine how to boot the deployed files on the device. Depending on the method, other parameters will be required.

  • method

    • qemu - boot the downloaded image from the deployment action using QEMU. This is the kvm device type and runs on the dispatcher. The QEMU command line is not available for modification. See Only protect the essential components.

      • media - ignored for the qemu method.

      - boot:
          method: qemu

    • u-boot - boot the downloaded files using UBoot commands.

      • commands - the predefined set of UBoot commands into which the location of the downloaded files can be substituted (along with details like the SERVERIP and NFS location, where relevant). See the device configuration for the complete set of commands.
      • type - the type of boot, dependent on the UBoot configuration. This needs to match the supported boot types in the device configuration, e.g. it may change the load addresses passed to UBoot.

      - boot:
          method: u-boot
          commands: nfs
          type: bootz

Boot example

- boot:
    method: qemu
    media: tmpfs
    failure_retry: 2

Test

The refactoring has retained compatibility with respect to the content of Lava-Test-Shell Test Definitions although the submission format has changed:

  1. The Test will never boot the device - a Boot must be specified. Multiple test operations need to be specified as multiple definitions listed within the same test block.
  2. The LAVA support scripts are prepared by the Deploy action and the same scripts will be used for all test definitions until another deploy block is encountered.

Note

There is a FIXME outstanding to ensure that only the test definitions listed in this block are executed for that test action - this allows different tests to be run after different boot actions, within the one deployment.

- test:
   failure_retry: 3
   name: kvm-basic-singlenode  # if not present, use "test $N"

Definitions

  • repository - a publicly readable repository location.
  • from - the type of the repository is not guessed; it must be specified explicitly. Support is planned for bzr, url, file and tar.
    • git - a remote git repository which needs to be cloned by the dispatcher.
    • inline - a simple test definition present in the same file as the job submission, allowing tests to run based on a single file. When combined with file:// URLs to the deploy parameters, this allows tests to run without needing external access. See Inline test definition example.
  • path - the path within that repository to the YAML file containing the test definition.
  • name - (optional) if not present, the name from the YAML is used. The name can also be overridden from the actual commands being run by calling the lava-test-suite-name API call (e.g. lava-test-suite-name FOO).
definitions:
    - repository: git://git.linaro.org/qa/test-definitions.git
      from: git
      path: ubuntu/smoke-tests-basic.yaml
      name: smoke-tests
    - repository: http://git.linaro.org/lava-team/lava-functional-tests.git
      from: git
      path: lava-test-shell/single-node/singlenode03.yaml
      name: singlenode-advanced

Test example

- test:
    failure_retry: 3
    name: kvm-basic-singlenode
    definitions:
        - repository: git://git.linaro.org/qa/test-definitions.git
          from: git
          path: ubuntu/smoke-tests-basic.yaml
          name: smoke-tests

Repeat

See Handling repeats.

Submit

Warning

As yet, pipeline data cannot be submitted - any details here are ignored.

Handling repeats

Selected Actions within the dispatcher support repeating an individual action (along with any internal pipelines created by that action) - these are determined within the codebase.

Blocks of actions can also be repeated to allow a boot and test cycle to be repeated. Only Boot and Test are supported inside repeat blocks.

Repeating single actions

Selected actions (RetryAction) within a pipeline (as determined by the Strategy) support repetition of all actions below that point. There will only be one RetryAction per top level action in each pipeline. e.g. a top level Boot action for UBoot would support repeating the attempt to boot the device but not the actions which substitute values into the UBoot commands as these do not change between boots (only between deployments).

Any action which supports failure_retry can support repeat, but not both in the same job (failure_retry is a conditional repeat, used only if the action fails; repeat is an unconditional repeat).
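
As a sketch of the difference, a boot action could take either parameter (but, as noted, not both within the same job):

# conditional - retry this boot only if it fails
- boot:
    method: qemu
    media: tmpfs
    failure_retry: 2

# unconditional - always run this boot three times
- boot:
    method: qemu
    media: tmpfs
    repeat: 3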

Retry on failure

Individual actions can be retried a specified number of times if a JobError or InfrastructureError exception is raised during the run step by this action or by any action within its internal pipeline.

Specify the number of retries which are to be attempted if a failure is detected using the failure_retry parameter.

- deploy:
   failure_retry: 3

A RetryAction will only repeat if a JobError or InfrastructureError exception is raised in any action inside its internal pipeline. This allows multiple actions in any one deployment to be RetryActions without repeating unnecessary tasks. e.g. download is a RetryAction, to allow for intermittent internet issues with third party downloads.

Unconditional repeats

Individual actions can be repeated unconditionally using the repeat parameter. This behaves similarly to Retry on failure except that the action is repeated whether or not a failure was detected. This allows a device to be booted repeatedly or a test definition to be re-run repeatedly. This repetition takes the form:

- actions:
  - deploy:
      # deploy parameters
  - boot:
      method: qemu
      media: tmpfs
      repeat: 3
  - test:
      # test parameters

Resulting in:

[deploy], [boot, boot, boot], [test]

Repeating blocks of actions

To repeat a specific boot and a specific test definition as one block ([boot, test], [boot, test], [boot, test] ...), nest the relevant Boot and Test actions in a repeat block.

actions:

   - deploy:
       timeout:
         minutes: 20
       to: tmpfs
       image: http://images.validation.linaro.org/kvm-debian-wheezy.img.gz
       os: debian
       root_partition: 1

   - repeat:
       count: 6

       actions:
       - boot:
           method: qemu
           media: tmpfs

       - test:
           failure_retry: 3
           name: kvm-smoke-test
           timeout:
             minutes: 5
           definitions:

This provides a shorthand which will get expanded by the parser into a deployment and (in this case) 6 identical blocks of boot and test.
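
Resulting in:

[deploy], [boot, test], [boot, test], [boot, test], [boot, test], [boot, test], [boot, test]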

Timeouts

Refactored timeouts now provide more detailed support. Individual actions have uniquely addressable timeouts.

Timeouts are specified explicitly in days, hours, minutes and seconds. Any unspecified value is set to zero.
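
For example, a job timeout of one and a half hours could be written by combining units (a sketch, following the syntax shown below):

timeouts:
  job:
    hours: 1
    minutes: 30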

The pipeline automatically records the time elapsed for the complete run of each action class as its duration, as well as which action set the current timeout. Server side processing can now identify when jobs are submitted with excessively long timeouts and highlight exactly which actions can use shorter timeouts.

Job timeout

The entire job will have an overall timeout - the job will fail if this timeout is exceeded, whether or not any other timeout is longer.

A timeout for a job means that the current action will be allowed to complete and the job will then fail.

timeouts:
  job:
    minutes: 15

Action timeout

Each action has a default timeout which is handled differently according to whether the action has a current connection to the device.

Note

This is per call made by each action class, not per top level action. i.e. the top level boot action includes many actions, from interrupting the bootloader and substituting commands to waiting for a shell session or login prompt once the boot starts. Each action class within the pipeline is given the action timeout unless overridden using Individual action timeouts.

Think of the action timeout as:

  • no single operation of this class should possibly take longer than ...

along with

  • the pipeline should wait no longer than ... to determine that the device is not responding.

When changing timeouts, review the pipeline logs for each top level action, deploy, boot and test. Check the duration of each action within each section and set the timeout for that top level action. Specific actions can be extended using the Individual action timeouts support.

Action timeouts behave differently, depending on whether the action has a connection or not. This allows quicker determination of whether the device has failed to respond. The type of action timeout can be determined from the logs.

If no action timeout is given in the job, the default action timeout of 30 seconds will be used.
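
The default can be raised for the whole job by setting an action timeout alongside the job timeout in the timeouts block (compare the KVM x86 example later in this document):

timeouts:
  job:
    minutes: 15
  action:
    minutes: 1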

Actions with connections

These actions use the timeout to wait for a prompt after sending a command over the connection. If the action times out, no further commands are sent and the job is marked as Incomplete.

  • Log message: ${name}: Wait for prompt:

    log: "expect-shell-connection: Wait for prompt. 24 seconds"
    

If the action has an active connection to a device, the timeout is set for each operation on that connection. e.g. u-boot-commands uses the same timeout for each line sent to UBoot.

Individual actions may make multiple calls on the connection - different actions are used when a particular operation is expected to take longer than other calls, e.g. boot.

Actions without connections

A timeout for these actions interrupts the executing action and marks the job as Incomplete.

  • Log message: ${name}: timeout:

    log: "git-repo-action: timeout. 45 seconds"
    

If the action has no connection (for example a deployment action), the timeout covers the entire operation of that action and the action will be terminated if the timeout is exceeded.

The log structure shows the action responsible for the command running within the specified timeout.

action:
  seconds: 45

Note

Actions which create a connection operate as actions without a connection. boot_qemu_image and similar actions will use the specified timeout for the complete operation, which is typically followed by an action (with a connection) which explicitly waits for the prompt (or performs an automatic login).

Individual action timeouts

Individual actions can also be specified by name - check the pipeline description output by the validate command for the full names of the action classes:

extract-nfsrootfs:
 seconds: 60

This allows typical action timeouts to be as short as practical, so that jobs fail quickly, whilst allowing for individual actions to take longer.

Typical actions which may need timeout extensions:

  1. lava-test-shell - unless changed, the Action timeout applies to running all of the individual commands inside each test definition. If install: deps: are in use, it could take a lot longer to update, download, unpack and set up the packages than to run any one test within the definition (see the sketch after this list).
  2. expect-shell-connection - used to allow time for the device to boot and then wait for a standard prompt (up to the point of a login prompt or shell prompt if no login is offered). If the device is expected to raise a network interface at boot using DHCP, this could add an appreciable amount of time.
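
Following the pattern of the extract-nfsrootfs example above, such an action could be given a longer timeout by name (the value here is purely illustrative):

lava-test-shell:
 minutes: 10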

Examples

Note

The unit tests supporting the refactoring contain a number of example jobs. However, these have been written to support the tests and might not be appropriate for use on actual hardware - the files specified are just examples of a URL, not a URL of a working file.

KVM x86 example

https://git.linaro.org/lava/lava-dispatcher.git/blob/HEAD:/lava_dispatcher/pipeline/test/sample_jobs/kvm.yaml

device_type: kvm

job_name: kvm-pipeline
timeouts:
 job:
   minutes: 5
 action:
   minutes: 1
 test:
   minutes: 3
priority: medium

actions:

   - deploy:
       timeout:
         minutes: 2
       to: tmpfs
       image: http://images.validation.linaro.org/kvm-debian-wheezy.img.gz
       compression: gz
       os: debian

   - boot:
       method: qemu
       media: tmpfs
       failure_retry: 2

   - test:
       failure_retry: 3
       name: kvm-basic-singlenode
       definitions:
           - repository: git://git.linaro.org/qa/test-definitions.git
             from: git
             path: ubuntu/smoke-tests-basic.yaml
             name: smoke-tests

Inline test definition example

https://git.linaro.org/lava/lava-dispatcher.git/blob/HEAD:/lava_dispatcher/pipeline/test/sample_jobs/kvm-inline.yaml

- test:
    failure_retry: 3
    name: kvm-basic-singlenode  # if not present, use "test $N"
    definitions:
        - repository:
            metadata:
                format: Lava-Test Test Definition 1.0
                name: smoke-tests-basic
                description: "Basic system test command for Linaro Ubuntu images"
                os:
                    - ubuntu
                scope:
                    - functional
                devices:
                    - panda
                    - panda-es
                    - arndale
                    - vexpress-a9
                    - vexpress-tc2
            run:
                steps:
                    - lava-test-case linux-INLINE-pwd --shell pwd
                    - lava-test-case linux-INLINE-uname --shell uname -a
                    - lava-test-case linux-INLINE-vmstat --shell vmstat
                    - lava-test-case linux-INLINE-ifconfig --shell ifconfig -a
                    - lava-test-case linux-INLINE-lscpu --shell lscpu
                    - lava-test-case linux-INLINE-lsusb --shell lsusb
                    - lava-test-case linux-INLINE-lsb_release --shell lsb_release -a
          from: inline
          name: smoke-tests-inline
          path: inline/smoke-tests-basic.yaml

TFTP deployment example

NFS

https://git.linaro.org/lava/lava-dispatcher.git/blob/HEAD:/lava_dispatcher/pipeline/test/sample_jobs/uboot.yaml

actions:
 - deploy:
    timeout:
      minutes: 4
    to: tftp
    kernel: http://images.validation.linaro.org/functional-test-images/bbb/zImage
    nfsrootfs: http://images.validation.linaro.org/debian-jessie-rootfs.tar.gz
    os: oe
    dtb: http://images.validation.linaro.org/functional-test-images/bbb/am335x-bone.dtb

Ramdisk

https://git.linaro.org/lava/lava-dispatcher.git/blob/HEAD:/lava_dispatcher/pipeline/test/sample_jobs/panda-ramdisk.yaml

# needs to be a list of hashes to retain the order
- deploy:
   timeout: 2m
   to: tftp
   kernel: http://images.validation.linaro.org/functional-test-images/panda/uImage
   ramdisk: http://images.validation.linaro.org/functional-test-images/common/linaro-image-minimal-initramfs-genericarmv7a.cpio.gz.u-boot
   ramdisk-type: u-boot
   dtb: http://images.validation.linaro.org/functional-test-images/panda/omap4-panda-es.dtb

Protocols

Protocols are similar to a Connection but operate over a known API instead of a shell connection. The protocol defines which API calls are available through the LAVA interface and the Pipeline determines when the API call is made.

Not all protocols can be called from all actions. Not all protocols are able to share data between actions.

A Protocol operates separately from any Connection, generally over a predetermined layer, e.g. TCP/IP sockets. Some protocols can access data passing over a Connection.

Multinode Protocol

The initial protocol available with the refactoring is Multinode. This protocol allows actions within the Pipeline to make calls using the MultiNode API outside of a test definition by wrapping the call inside the protocol. Wrapped calls do not necessarily have all of the functionality of the same call available in the test definition.

The Multinode Protocol allows data to be shared between actions, including data generated in one test shell definition being made available over the protocol to a deploy or boot action of jobs with a different role. It does this by adding handlers to the current Connection to intercept API calls.

The Multinode Protocol can underpin the use of other tools without necessarily needing a dedicated Protocol class to be written for those tools. Using the Multinode Protocol is an extension of using the existing MultiNode API calls within a test definition. The use of the protocol is an advanced use of LAVA and relies on the test writer carefully planning how the job will work.

protocols:
  lava-multinode:
    action: umount-retry
    request: lava-sync
    messageID: test

This snippet would add a lava-sync call at the start of the UmountRetry action. Points to note when using protocol calls in this way:

  • Actions which are too complex and would need data mid-operation need to be split up.

  • When a particular action is repeatedly used with the protocol, a dedicated action needs to be created. Any Strategy which explicitly uses protocol support must create a dedicated action for each protocol call.

  • To update the value available to the action, ensure that the key exists in the matching lava-send and that the value in the job submission YAML starts with $

    protocols:
      lava-multinode:
        action: execute-qemu
        request: lava-wait
        messageID: test
        message:
          ipv4: $IPV4
    

    This results in this data being available to the action:

    {'message': {'ipv4': '192.168.0.3'}, 'messageID': 'test'}
    
  • Actions check for protocol calls at the start of the run step, even before the internal pipeline actions are run.

  • Only the named Action instance inside the Pipeline will make the call.

  • The MultiNode API asserts that repeated calls to lava-sync with the same messageID will return immediately, so this protocol call in a Retry action will only synchronise the first attempt at the action.

  • Some actions may make the protocol call at the end of the run step.

The Multinode Protocol also exposes calls which are not part of the test shell API, which were formerly hidden inside the job setup phase.

lava-start API call

lava-start determines when Multinode jobs start, according to the state of other jobs in the same Multinode group. This allows jobs with one role to determine when jobs of a different role start, so that the delayed jobs can be sure that particular services required for those jobs are available. For example, if the server role is actually providing a virtualisation platform and the client is a VM to be started on the server, then a delayed start is necessary as the first action of the client role will be to attempt to connect to the server in order to boot the VM, before the server has even been deployed. The lava-start API call allows the test writer to control when the client is started, allowing the server test image to setup the virtualisation support in a way that allows attaching of debuggers or other interventions, before the VM starts.

The client enables a delayed start by declaring which role is expected to send the signal that starts the client.

protocols:
  lava-multinode:
    request: lava-start
    expect_role: server
    timeout:
      minutes: 10

The timeout specified with lava-start is the amount of time the job will wait for permission to start from the other jobs in the group.

Internally, lava-start is implemented as a lava-send and a lava-wait-all for the role of the action which will make the lava_start API call using the message ID lava_start.

It is an error to specify the same role and expect_role to lava-start.

Note

Avoid confusing host_role with expect_role. host_role is used by the scheduler to ensure that the job assignment operates correctly and does not affect the dispatcher or delayed start support. The two values may often have the same value but do not mean the same thing.

It is an error to specify lava-start on all roles within a job or on any action without a role specified.

All jobs without a lava-start API call specified for the role of that job will start immediately. Other jobs will write to the log files that the start has been delayed, pending a call to lava-start by actions with the specified role(s).

Subsequent calls to lava-start for a role which has already started will still be sent but will have no effect.

If lava-start is specified for a test action, the test definition is responsible for making the lava-start call.

run:
  steps:
    - lava-send lava_start

Passing data at startup

Various delayed start jobs will need dynamic data from the “server” job in order to be able to start, like an IP address. This is achieved by adding the lava-start call to the test action of the server where the test definition initiates a lava-send message. When this test action completes, the protocol will send the lava-start. The first thing the delayed start job does is a lava-wait which would be added to the deploy action of that job.

Server role                           Delayed client role

deploy
boot
test
  • lava-send ipv4 ipaddr=$(IP)
  • lava-start                        deploy
                                        • lava-wait ipv4
  • lava-test-case                    boot

deploy:
  role: client
  protocols:
    lava-multinode:
      api: lava-wait
      id: ipv4
      key: ipaddr

Depending on the implementation of the deploy action, determined by the Strategy class, the lava-wait call will be made at a suitable opportunity within the deployment. In the above example, the lava-send call is made before lava-start - this allows the data to be stored in the lava coordinator and the lava-wait will receive the data immediately.

The specified id and key must exactly match the message ID used for the lava-send call in the test definition. (An inline test definition can be useful for the test action of the job definition for the server role; see Inline test definition example.)

test:
  role: server
  protocols:
    lava-multinode:
      api: lava-start
      roles:
        - client