Version: 2.6


8 Basic Components

Most of these basic components follow standard GNU/Linux commands, so their purpose is immediately obvious.

Standard ETL components


8.1 Assign

(since EVL 1.2)

Assign the content of the input flow or file <f_in> to the shell variable <varname>, which is then exported into the environment. Don’t forget to apply ‘--text-output’ on the preceding component to get text content in <varname>.

This component doesn’t work for partitioned flow.

Assign
is to be used in EVS job structure definition file. <f_in> is either input file or flow name.

There is no standalone version of this component as you can use standard Bash behaviour for this purpose. For example:

VARNAME=$(evl cat filename some.evd --text-output)

EVS is EVL job structure definition file, for details see evl-evs(5).

Synopsis

Assign
<f_in> <varname>

evl assign
( --help | --usage | --version )

Options

Standard options:

--help
print this help and exit

--usage
print short usage information and exit

--version
print version and exit

Examples

  1. EVL job (an ‘evs’ file) which reads content of a binary file ‘hwm.bin’ into variable ‘HWM’:

    Read    hwm.bin   FLOW_HWM  evd/some.evd  --text-output
    Assign FLOW_HWM HWM

    Such a value can then be used (after the ‘Wait’ component!) within a mapping by:

    static int hwm = getenv_int("HWM",0);   // use 0 when $HWM is empty
    *out->incremental_id = ++hwm;
  2. To get a value from text file:

    Assign  hwm.txt  HWM
  3. To assign flow content into a ‘NATCO’ variable:

    Map     FLOW_01  FLOW_02 in.evd out.evd map.evm  --text-output
    Assign FLOW_02 NATCO

8.2 Cat

(since EVL 1.0)

Concatenate flows or files.

Cat
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

evl cat
is intended for standalone usage, i.e. to be invoked from command line.

EVD is EVL data definition file, for details see ‘man 5 evd’.

Synopsis

Cat
<f_in>... <f_out> (<evd>|-d <inline_evd>)
[ --validate ]
[ -x|--text-input | --text-input-dos-eol | --text-input-mac-eol ]
[ -y|--text-output | --text-output-dos-eol | --text-output-mac-eol ]

evl cat
[<file>...] (<evd>|-d <inline_evd>)
[ --validate ]
[ -x|--text-input | --text-input-dos-eol | --text-input-mac-eol ]
[ -y|--text-output | --text-output-dos-eol | --text-output-mac-eol ]
[ -v|--verbose ]

evl cat
( --help | --usage | --version )

Options

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be specified. Example: -d 'id int, user_id string enc=iso-8859-1'

--validate
without this option, no fields are checked against data types. With this option, all output fields are checked

-x, --text-input
suppose the input as text, not binary

--text-input-dos-eol
suppose the input as text with CRLF as end of line

--text-input-mac-eol
suppose the input as text with CR as end of line

-y, --text-output
write the output as text, not binary

--text-output-dos-eol
produce the output as text with CRLF as end of line

--text-output-mac-eol
produce the output as text with CR as end of line

Standard options:

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples

Print to stdout binary input in text format:

evl cat example.evd -y <input.bin

8.3 Cmd

(since EVL 1.2)

Basically, it calls:

cat <f_in> | <command> > <f_out>

When <f_in> is empty, then it runs:

<command> > <f_out>

and when <f_out> is empty:

cat <f_in> | <command>

<command> can also be a pipeline.

If <f_in> is partitioned, then <command> is applied to each partition and the output <f_out> is kept partitioned as well.
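The three invocation forms above can be modeled outside EVL. The following is only a sketch in Python, not the EVL implementation; the function name `run_cmd` is hypothetical, and it assumes a POSIX shell is available. An empty `f_in` or `f_out` leaves the corresponding stream untouched, as in the shell forms shown above.

```python
import subprocess
from pathlib import Path
from typing import Optional


def run_cmd(f_in: Optional[str], f_out: Optional[str], command: str) -> None:
    """Model of `cat <f_in> | <command> > <f_out>` (hypothetical helper).

    An empty f_in means the command inherits stdin; an empty f_out means
    it inherits stdout.
    """
    stdin = open(f_in, "rb") if f_in else None
    stdout = open(f_out, "wb") if f_out else None
    try:
        # shell=True lets <command> be a pipeline, e.g. "sort | uniq".
        subprocess.run(command, shell=True, stdin=stdin, stdout=stdout, check=True)
    finally:
        for handle in (stdin, stdout):
            if handle is not None:
                handle.close()
```

For a partitioned input, the same call would simply be repeated once per partition, each with its own input and output.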

Synopsis

Cmd
<f_in> <f_out> <command>

evl cmd
( --help | --usage | --version )

Options

Standard options:

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples

  1. Write ‘repeat some error message’ 10 times to STDERR and into the EVL job log:

    Cmd "" /dev/stderr "yes repeat some error message | head"
  2. Suppose ‘SOME_FLOW’ delivers integers, one per line; the median can then be computed by R and written into ‘/some/file’:

    Cmd SOME_FLOW /some/file "Rscript median.R"

    The file median.R might look like this:

    f <- file('stdin'); open(f); x <- c();
    while ( length( line <- readLines(f) ) > 0 ) x <- c(x,as.integer(line));
    write(median(x), stdout());

8.4 Component

(since EVL 1.0)

Run <component> from the project’s evc directory with arguments <comp_arg>. In the <component> these arguments are available as the array elements ‘COMP_ARG[1]’, ‘COMP_ARG[2]’, and so on; ‘COMP_ARG[0]’ is the component’s name.

When the <component> is not in current project subdirectory ‘evc/’, it tries the folder ‘$EVL_EVC_DIR/’.

You can also specify the full path to the component. Check examples.

Flow names within a component have unique prefixes, so they cannot conflict with those in the job. However, if you need to connect output flow(s) of the component, use the variable ‘$COMP_FLOW’, which the component sets to that prefix. A flow from the component, e.g. ‘FLOW_IN_COMP’, can then be read in the parent job as ‘$COMP_FLOW.FLOW_IN_COMP’. Check examples.

For input flows there is a variable ‘$PARENT_FLOW’ which can be used in the component. The parent flow ‘FLOW_INTO_COMP’ can be referenced within a component as ‘$PARENT_FLOW.FLOW_INTO_COMP’. Check examples for better understanding.

Comp
is to be used in EVS job structure definition file.

evl comp
is intended for standalone usage, i.e. to be invoked from command line and reading records from standard input and writing to standard output.

EVS is EVL job structure definition file, for details see evl-evs(5).

Synopsis

Comp
<component> [<comp_arg>...]

evl comp
( --help | --usage | --version )

Options

Standard options:

--help
print this help and exit

--usage
print short usage information and exit

--version
print version and exit

Examples

  1. Run custom component ‘evc/prepare_lkp.evc’ with neither input nor output:

    Comp prepare_lkp
  2. Run component from EVL Data Hub template project with three arguments:

    Comp $EVL_TEMPLATE_DIR/data-hub/evc/scd2_read_increments.evc party.*.csv evd/party.evd id
  3. Reading output from the component. Suppose you have a custom generic component ‘evc/read_files.evc’ which does some processing of JSON files, e.g.:

    jsons="${COMP_ARG[1]}"
    evd="${COMP_ARG[2]}"
    key="${COMP_ARG[3]}"
    Read "$jsons" JSONS "$evd"
    Tee JSONS A B "$evd" --key="$key"

    And you need to connect these output flows ‘A’ and ‘B’ into your job, e.g.:

    Comp read_files /landing/users.*.json evd/users.evd surname
    Sort $COMP_FLOW.A SORTED evd/users.evd
    Write $COMP_FLOW.B users.csv evd/users.evd
    ...
  4. Writing flow to the component. Suppose you have custom component ‘evc/write_log.evc’, e.g.:

    flow_in="${COMP_ARG[1]}"
    Write $flow_in some_file.log -d "X string" --text-output

    In the job it would look like this:

    Tail XXX LOG evd/XXX.evd -n 100
    Comp write_log.evc LOG

    Alternatively, the component could look like this:

    Write $PARENT_FLOW.LOG some_file.log -d "X string" --text-output

8.5 Cut

(since EVL 1.0)

Remove columns from input records. Use this component when you want to reduce the number of columns.

Cut
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

evl cut
is intended for standalone usage, i.e. to be invoked from command line and read records from standard input and write to standard output.

EVD is EVL data definition file, for details see evl-evd(5).

Synopsis

Cut
<f_in> <f_out> (<evd_in>|-D <inline_evd>) (<evd_out>|-d <inline_evd>)
[--validate] [-x|--text-input] [-y|--text-output]

evl cut
(<evd_in>|-D <inline_evd>) (<evd_out>|-d <inline_evd>)
[--validate] [-x|--text-input] [-y|--text-output]
[-v|--verbose]

evl cut
( --help | --usage | --version )

Options

-D, --input-definition=<inline_evd>
either this option or the file <evd_in> must be specified. Example: -D 'id int, user_id string'

-d, --output-definition=<inline_evd>
either this option or the file <evd_out> must be specified. Example: -d 'user_sum long'

--validate
without this option, no fields are checked against data types. With this option, all output fields are checked

-x, --text-input
suppose the input as text, not binary

-y, --text-output
write the output as text, not binary

Standard options:

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples

  1. Print to stdout only integer field ‘id’:

    evl cut example.evd -d'id int' -xy <in.txt

8.6 Departition

(since EVL 1.2)

Gather or merge partitions into one output flow or file. When ‘-k <key>’ is specified, each input partition is assumed to be sorted by <key> and the output is sorted again (i.e. the partitions are merged). Without ‘-k <key>’, the input partitions are gathered in round-robin fashion. When applied to a single partition, the input is simply written to the output. EVD is EVL data definition file, for details see evl-evd(5).

Synopsis

Departition
<f_in>... <f_out> (<evd>|-d <inline_evd>)
(--key=<key> | --round-robin)
[--validate] [-x|--text-input] [-y|--text-output]

evl departition
<file_in> <file_out> (<evd>|-d <inline_evd>)
(--key=<key> | --round-robin)
[--validate] [-x|--text-input] [-y|--text-output]
[-v|--verbose]

evl departition
( --help | --usage | --version )

Options

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be specified. Example: -d 'id int, user_id string enc=iso-8859-1'

-k, --key=<key>
merge partitioned flows/files according to the key, so the output is sorted by this key

-r, --round-robin
gather in round-robin fashion

--validate
without this option, no fields are checked against data types. With this option, all output fields are checked

-x, --text-input
suppose the input as text, not binary

-y, --text-output
write the output as text, not binary

Standard options:

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples

  1. To departition partitioned flow in the EVL job:

    Read  gs://my_bucket/cust.csv CUST $EVD_CUST
    Partition CUST CUST_P $EVD_CUST --round-robin
    Map CUST_P PROC_M $EVD_CUST $EVD_PROC $EVM_PROC
    Departition PROC_M PROC_G $EVD_PROC --round-robin
    Write PROC_G gdrive://proc.xlsx $EVD_PROC

8.7 Echo

(since EVL 2.0)

Write <string> into <f_out>. This component doesn’t produce partitioned flow.

Echo
is to be used in EVS job structure definition file.

<f_out> is either output file or flow name.

There is no standalone version of this component as you can use standard ‘echo’.

EVS is EVL job structure definition file, for details see evl-evs(5).

Synopsis

Echo
<string> <f_out> [-e] [-n]

evl echo
( --help | --usage | --version )

Options

-n
do not output the trailing newline (standard Bash echo option)

-e
enable interpretation of backslash escapes (standard Bash echo option)

Standard options:

--help
print this help and exit

--usage
print short usage information and exit

--version
print version and exit

Examples

  1. An EVL job (specified in an ‘evs’ file) which runs a simple select statement on a PostgreSQL table:

    Echo   "select max(id) from some_db.some_table;" SELECT
    RunPG SELECT MAX_ID
  2. To add two hardcoded records to the end of a flow:

    ...   ...    FLOW    -d "s string"
    Echo "Some string footer,\nwith two lines." FOOTER -e
    Cat FLOW FOOTER -d "s string"
    ...

8.8 Filter

(since EVL 1.0)

Filter records by the <condition>. Records for which the <condition> is false are forwarded to a reject file or flow, if specified.

In many cases it is better to filter records in the ‘Map’ component using the ‘discard()’ function. Having a ‘Filter’ component right before or after a ‘Map’ is not optimal for performance. Check ‘man evl-map’ for details.

Also using ‘Filter’ right after a ‘Read’ component is usually not performance optimal. It is usually better to shift filtering to the database for example. Check option ‘--where’ of ‘Read’ component for details.

Filter
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

evl filter
is intended for standalone usage, i.e. to be invoked from command line and read records from standard input and write to standard output.

EVD is EVL data definition file, for details see evl-evd(5).

Synopsis

Filter
<f_in> <f_out> (<evd>|-d <inline_evd>) <condition>
[-r|--reject=<f_out>]
[-x|--text-input] [-y|--text-output]

evl filter
(<evd>|-d <inline_evd>) <condition>
[-r|--reject=<f_out>]
[-x|--text-input] [-y|--text-output]
[-v|--verbose]

evl filter
( --help | --usage | --version )

Options

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be specified. Example: -d 'id int, user_id string enc=iso-8859-1'

-r, --reject=<f_out>
catch rejected records into file or flow

-x, --text-input
suppose the input as text, not binary

-y, --text-output
write the output as text, not binary

Standard options:

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples

Command line invocation examples:

  1. To print to stdout only records from file ‘ID.txt’ with value of id less than 100:

    evl filter -d 'id int' -xy '*id<100' < ID.txt

Field ‘id’ is a pointer, so to get the value, ‘*id’ must be used.

  2. Print to stdout only records from file ‘IDs.csv’ where ‘id1’ is different from ‘id2’; records with the same ids will be sent into ‘same_IDs.csv’:

    evl filter -d 'id1 int sep=",", id2 int' -xy -r same_IDs.csv \
    '*id1 != *id2' < IDs.csv

EVL job examples:

  1. In an ‘evs’ file:

    ...     ...    SOURCE  evd/sample.evd
    Filter SOURCE OUTPUT evd/sample.evd "price && *currency == \"EUR\""
    ... OUTPUT ... evd/sample.evd

This example filters out records with NULL ‘price’ and with currency other than ‘EUR’. (‘price’ is a pointer, so simply specifying ‘price’ in the condition means ‘price != nullptr’.)

  1. If a ‘Read’ component comes right before the ‘Filter’, consider using the ‘--where’ option instead, because in that case the filter is pushed down to the source DB, e.g.:

    SRC_HOST_URI="postgres://tech_etl@pg_server:5432"
    SRC_PATH="dwh_db?schema=public&table=invoices"

    Read $SRC_HOST_URI/$SRC_PATH INVOICES_EUR evd/invoices.evd \
    --where "price is not null AND currency = 'EUR'"
    Map INVOICES_EUR EUR_MAP evd/invoices.evd ...

    will run the query in PostgreSQL database with where condition:

    WHERE price is not null AND currency = 'EUR'

    One can also use EVL notation with this ‘--where’ option, e.g.:

    SRC_HOST_URI="postgres://tech_etl@pg_server:5432"
    SRC_PATH="dwh_db?schema=public&table=invoices"

    Read $SRC_HOST_URI/$SRC_PATH INVOICES_EUR evd/invoices.evd \
    --where 'price && *currency == "EUR"'
    Map INVOICES_EUR EUR_MAP evd/invoices.evd ...

    so then it would work also in case of reading a file:

    Read    data/invoices.csv INVOICES_EUR evd/invoices.evd \
    --where 'price && *currency == "EUR"'
    Map INVOICES_EUR EUR_MAP evd/invoices.evd ...

    in such case it is then internally the same as:

    Read    data/invoices.csv INVOICES_SRC evd/invoices.evd
    Filter INVOICES_SRC INVOICES_EUR evd/invoices.evd \
    'price && *currency == "EUR"'
    Map INVOICES_EUR EUR_MAP evd/invoices.evd ...
  2. And using ‘Filter’ to split a flow:

    ...     ...    INV  evd/invoices.evd
    Filter INV EUR evd/invoices.evd -r NONEUR '*currency == "EUR"'
    Sort EUR EUR_SRT evd/invoices.evd --key "price"
    Sort NONEUR NONEUR_SRT evd/invoices.evd --key "currency,price"
    ...
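The accept/reject split performed by ‘Filter’ with ‘--reject’ can be modeled in a few lines. This is a sketch only: the names `filter_records` and `price_set_and_eur` are hypothetical, records are represented as plain dictionaries, and `None` stands in for a NULL field (the counterpart of a null pointer in the EVL condition).

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, object]  # field name -> value; None models a NULL field


def filter_records(
    records: Iterable[Record], condition: Callable[[Record], bool]
) -> Tuple[List[Record], List[Record]]:
    """Split records into (accepted, rejected) according to the condition."""
    accepted: List[Record] = []
    rejected: List[Record] = []
    for rec in records:
        (accepted if condition(rec) else rejected).append(rec)
    return accepted, rejected


def price_set_and_eur(rec: Record) -> bool:
    """Counterpart of the EVL condition 'price && *currency == "EUR"'."""
    return rec["price"] is not None and rec["currency"] == "EUR"
```

Every input record ends up in exactly one of the two outputs, which is what makes the reject output usable for splitting a flow.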

8.9 Gather

(since EVL 1.2)

Gather several input flows or files into one output flow or file in round-robin fashion.

Gather
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

evl gather
is intended for standalone usage, i.e. to be invoked from command line. When <file> is ’-’, then read from stdin.

EVD is EVL data definition file, for details see evl-evd(5).

Synopsis

Gather
<f_in>... <f_out> (<evd>|-d <inline_evd>)
[--validate] [-x|--text-input] [-y|--text-output]

evl gather
[<file>...] (<evd>|-d <inline_evd>)
[--validate] [-x|--text-input] [-y|--text-output]
[-v|--verbose]

evl gather
( --help | --usage | --version )

Options

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be specified. Example: -d 'id int, user_id string enc=iso-8859-1'

--validate
without this option, no fields are checked against data types. With this option, all output fields are checked

-x, --text-input
suppose the input as text, not binary

-y, --text-output
write the output as text, not binary

Standard options:

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples

  1. Following command:

    evl gather file.a file.b file.c file.evd -xy

prints to stdout the first record of ‘file.a’, then the first record of ‘file.b’, then the first record of ‘file.c’, then the second records, and so on.

  1. To gather partitioned flow in the EVL job:

    Read      s3://my_bucket/cust.csv CUSTOMERS $EVD_CUST
    Partition CUSTOMERS CUST_P $EVD_CUST --round-robin
    Map CUST_P PROC_M $EVD_CUST $EVD_PROC $EVM_PROC
    Gather PROC_M PROC_G $EVD_PROC
    Write PROC_G sftp:///some/path/proc.csv.gz $EVD_PROC
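The round-robin order described in the first example can be sketched as follows. This is an illustration, not EVL code: the function name `gather_round_robin` is hypothetical, and the sketch assumes that an exhausted input is simply skipped while the remaining inputs continue, which the text above does not state explicitly.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")
_MISSING = object()  # marks inputs that have already run out of records


def gather_round_robin(*inputs: Iterable[T]) -> Iterator[T]:
    """Yield the first record of each input, then the second of each, and so on."""
    for row in zip_longest(*inputs, fillvalue=_MISSING):
        for rec in row:
            if rec is not _MISSING:
                yield rec
```

Because the output order depends only on arrival order, the result is in general not sorted, even when every input is.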

8.10 Generate

(since EVL 1.3)

Generates records according to the data definition (evd file) and writes them to stdout or to an output flow or file. EVD is EVL data definition file, for details see evl-evd(5).

When no <config_file> is specified:

Number data types
values from the whole range of given data type are randomly generated

Date, timestamp
values between 1970-01-01 and 2199-12-31 are randomly generated

String
random characters [a-zA-Z0-9] of the length between 0 and 10 are generated

Vector
random number of elements between 0 and 10 are generated

When <config_file> in JSON format is specified:

Number data types
range, values, probability of nulls

Date, timestamp
range, values, probability of nulls

String
range, values, min-length, max-length, probability of nulls

Vector
range, values, min-elements, max-elements, probability of nulls

When both a probability of nulls and a values list containing null are specified, only the probability is used. Overlapping range(s) and values have no effect on the probability: every value has the same probability of being generated. See the JSON examples below for details.

Synopsis

Generate
<f_out> (<evd>|-d <inline_evd>) [<config_file>]
[-n|--records <num>] [-y|--text-output]

evl generate
(<evd>|-d <inline_evd>) [<config_file>]
[-n|--records <num>] [-y|--text-output]
[-v|--verbose]

evl generate
( --help | --usage | --version )

Options

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be specified. Example: -d 'id int, user_id string enc=iso-8859-1'

-n, --records=<num>
generate <num> records instead of the default single record

-y, --text-output
write the output as text, not binary

Standard options:

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples

  1. Print to stdout one random uchar:

    evl generate -d 'value uchar' -y
  2. Example of config JSON file:

    {
      "int_field": {
        "values": [100, 200, 500],
        "range": { "min": 0, "max": 10 },
        "range": { "min": 50, "max": 60 },
        "null": 0.1
      },
      "float_field": {
        "range": { "min": 0, "max": 100 }
      },
      "date_field": {
        "values": [ null, "2018-03-07", "2018-03-08" ]
      },
      "struct_field.string_field1": {
        "min-length": 10,
        "max-length": 20
      },
      "struct_field.string_field2": {
        "values": ["abc", "def", "ghi", "jkl"]
      },
      "struct_field.decimal_field": {
        "range": { "min": "0.00", "max": "100.00" }
      },
      "vector_field": {
        "min-elements": 2,
        "max-elements": 5
      },
      "vector_field[]": {
        "range": { "min": "2018-03-07 05:00:00", "max": "2018-03-07 14:00:00" }
      }
    }

    where corresponding evd is:

    int_field        int           sep="|"  null=""
    float_field      float         sep="|"
    date_field       date          sep="|"  null=""
    struct_field     struct        sep="|"
      string_field1  string        sep=";"
      string_field2  string        sep=";"
      decimal_field  decimal(5.2)  sep=";"
    vector_field     vector        sep="\n"
      timestamp                    sep=","

    For the ‘int_field’, values 0,1,...,10,50,...,60,100,200,500 are generated randomly, but in 10% of cases a ‘NULL’ value is generated instead.
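The rule for ‘int_field’ (values and ranges pooled together, every candidate equally likely, nulls drawn with a separate probability) can be sketched as below. The function names `candidate_pool` and `generate_int` are hypothetical and this is not how EVL implements it; it only illustrates the stated behaviour, including that overlapping values and ranges do not raise a value's probability.

```python
import random
from typing import Iterable, List, Optional, Tuple


def candidate_pool(values: Iterable[int], ranges: Iterable[Tuple[int, int]]) -> List[int]:
    """Union of explicit values and inclusive ranges; duplicates count once."""
    pool = set(values)
    for low, high in ranges:
        pool.update(range(low, high + 1))
    return sorted(pool)


def generate_int(rng: random.Random, pool: List[int], null_probability: float) -> Optional[int]:
    """Return None with the given probability, otherwise a uniform pick from pool."""
    if rng.random() < null_probability:
        return None
    return rng.choice(pool)


# Counterpart of the "int_field" entry in the JSON example above.
int_field_pool = candidate_pool([100, 200, 500], [(0, 10), (50, 60)])
```

Using a set for the pool is what makes an overlap harmless: a value listed both explicitly and inside a range still appears only once.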


8.11 Head

(since EVL 1.1)

Prints the first <num> records of the input to the output. Without the -n option, the first 10 records are printed.

Head
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

evl head
is intended for standalone usage, i.e. to be invoked from command line.

EVD is EVL data definition file, for details see evl-evd(5).

Synopsis

Head
<f_in> <f_out> [<evd>|-d <inline_evd>] [-n [-]<num>] [-s|--skip-parse]
[--validate] [--skip-bom]
[ -x|--text-input | --text-input-dos-eol | --text-input-mac-eol ]
[ -y|--text-output | --text-output-dos-eol | --text-output-mac-eol ]

evl head
[<evd>|-d <inline_evd>] [-n [-]<num>] [-s|--skip-parse]
[--validate] [--skip-bom]
[ -x|--text-input | --text-input-dos-eol | --text-input-mac-eol ]
[ -y|--text-output | --text-output-dos-eol | --text-output-mac-eol ]
[-v|--verbose]

evl head
( --help | --usage | --version )

Options

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be specified. Example: -d 'id int, user_id string enc=iso-8859-1'

-n, --records=[-]<num>
output the first <num> records instead of the default first 10; or use -n -<num> to output all records except the last <num>

-s, --skip-parse
this option has no effect with ‘--records <num>’ (i.e. when the first <num> records are read and the rest is ignored). With ‘--records -<num>’, however, fields are not parsed; the component only jumps from one record separator to the next, i.e. over the separator of the last field. Be careful with this option: it is particularly useful for ‘csv’ files, for example to skip an oddly formatted footer, but it may give wrong results when some other field uses the same separator as the last one.

--skip-bom
skip the UTF-8 BOM (byte order mark), i.e. the bytes EF BB BF, at the beginning of the input. Windows tools usually add it to files in UTF-8 encoding

--validate
without this option, no fields are checked against data types. With this option, all output fields are checked

-x, --text-input
suppose the input as text, not binary

--text-input-dos-eol
suppose the input as text with CRLF as end of line

--text-input-mac-eol
suppose the input as text with CR as end of line

-y, --text-output
write the output as text, not binary

--text-output-dos-eol
produce the output as text with CRLF as end of line

--text-output-mac-eol
produce the output as text with CR as end of line

Standard options:

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples

  1. print to stdout only first 10 records:

    evl head example.evd -xy <in.txt
  2. read the binary input and omit the last 3 records without parsing them (i.e. they need not follow the data structure defined by the evd):

    cat input.bin | evl head -sy -n-3 \
    -d 'id int sep=",", updated date sep="\n"' \
    > output.txt
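The two meanings of ‘-n’ (first <num> records versus all records except the last <num>) can be sketched on a stream of records. The function name `head` below is a hypothetical Python helper, not the EVL component. For a negative count it keeps a sliding window of |num| records, so the input never has to be held in memory as a whole.

```python
from collections import deque
from itertools import islice
from typing import Deque, Iterable, Iterator, TypeVar

T = TypeVar("T")


def head(records: Iterable[T], num: int = 10) -> Iterator[T]:
    """num >= 0: first num records; num < 0: all records except the last -num."""
    if num >= 0:
        yield from islice(records, num)
        return
    window: Deque[T] = deque()
    for rec in records:
        window.append(rec)
        if len(window) > -num:
            # The oldest buffered record can no longer be among the last -num.
            yield window.popleft()
```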

8.12 Lookup

(since EVL 2.0)

Prepare lookup from sorted input, which can be used after Wait command till ‘Lookup remove’. Input must be sorted by the <key>.

Lookup [remove]
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

evl lookup [remove]
is intended for standalone usage, i.e. to be invoked from command line.

EVD is EVL data definition file, for details see evl-evd(5).

Synopsis

Lookup
<f_in> <lookup_name> (<evd>|-d <inline_evd>) -k <key> [-x|--text-input]

Lookup remove
<lookup_name>

evl lookup
<lookup_name> (<evd>|-d <inline_evd>) -k <key> [-x|--text-input]
[-v|--verbose]

evl lookup remove
<lookup_name>

evl lookup
( --help | --usage | --version )

Options

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be specified. Example: -d 'id int, user_id string enc=iso-8859-1'

-k, --key=<key>
key for looking up records

-x, --text-input
suppose the input as text, not binary

Standard options:

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples

  1. To prepare lookup at the beginning of the job:

    Read   dimension.csv DIM  evd/dim.evd  --text-input
    Sort DIM DIM_SRT evd/dim.evd --key="id"
    Lookup DIM_SRT dim_lkp evd/dim.evd --key="id"

8.13 Merge

(since EVL 1.2)

Merge sorted flows or files into one (sorted) output. In the case of only one input flow or file, it simply writes this file to output flow or file.

To merge based on all of the fields, use an empty <key>.

Merge
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

evl merge
is intended for standalone usage, i.e. to be invoked from command line. When <file> is ’-’, then read from stdin.

EVD is EVL data definition file, for details see evl-evd(5).

Synopsis

Merge
<f_in>... <f_out> [<evd>|-d <inline_evd>] -k|--key <key>
[-c|--check-sort] [-i|--ignore-case]
[--validate] [-x|--text-input] [-y|--text-output]

evl merge
[<file>...] [<evd>|-d <inline_evd>] -k|--key <key>
[-c|--check-sort] [-i|--ignore-case]
[--validate] [-x|--text-input] [-y|--text-output]
[-v|--verbose]

evl merge
( --help | --usage | --version )

Options

-c, --check-sort
check if the input is really sorted according to specified key

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be specified. Example: -d 'some_id long sep="|", some_value string sep="\n"'

-i, --ignore-case
be case insensitive for key fields

-k, --key=<key>
merge according to this key, where <key> is a comma-separated list of fields with type (either DESC or ASC, default type is ASC). When the <key> is empty, the whole record is used as the key.

--validate
without this option, no fields are checked against data types. With this option, all output fields are checked

-x, --text-input
suppose the input as text, not binary

-y, --text-output
write the output as text, not binary

Standard options:

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples

evl merge example.evd -k 'input_id' -y input1.bin input2.bin input3.bin
merge three (sorted) binary files, the output is in text and sorted by ’input_id’
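Merging already sorted inputs is a k-way merge. As a sketch (the function name `merge_sorted` is hypothetical, records are modeled as dictionaries, and only ascending keys are handled), Python's standard `heapq.merge` shows the idea, including the empty-key case where the whole record serves as the key:

```python
import heapq
from operator import itemgetter
from typing import Dict, Iterable, Iterator, Sequence

Record = Dict[str, object]


def merge_sorted(inputs: Sequence[Iterable[Record]], key_fields: Sequence[str]) -> Iterator[Record]:
    """Merge inputs that are each sorted ascending by key_fields."""
    if key_fields:
        key = itemgetter(*key_fields)
    else:
        # Empty key: compare whole records, field by field in record order.
        def key(rec: Record) -> tuple:
            return tuple(rec.values())
    return heapq.merge(*inputs, key=key)
```

The merge only ever looks at the current head record of each input, so it is correct only if every input is really sorted by the key; that is what ‘--check-sort’ is for.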


8.14 Partition

(since EVL 1.2)

Read the input flow or file and, according to the ‘--key’ or ‘--round-robin’ logic, distribute the records into several output flows or files. The number of partitions depends on the ‘EVL_PARTITIONS’ environment variable and also on the EVL version/edition.

Partition
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

evl partition
is intended for standalone usage, i.e. to be invoked from command line.

EVD is EVL data definition file, for details see evl-evd(5).

Synopsis

Partition
<f_in> <f_out> (<evd>|-d <inline_evd>)
(--key=<key> | --round-robin)
[--validate] [-x|--text-input] [-y|--text-output]

evl partition
<file_in> <file_out> (<evd>|-d <inline_evd>)
(--key=<key> | --round-robin)
[--validate] [-x|--text-input] [-y|--text-output]
[-v|--verbose]

evl partition
( --help | --usage | --version | --max-partitions )

Options

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be specified

-k, --key=<key>
key according to which to distribute data

-m, --max-partitions
return the maximum possible number of partitions

-r, --round-robin
split by round-robin, i.e. simply one record after another to one output flow/file after another

--validate
without this option, no fields are checked against data types. With this option, all output fields are checked

-x, --text-input
suppose the input as text, not binary

-y, --text-output
write the output as text, not binary

Standard options:

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples

  1. To partition flow in the EVL job:

    Read     s3://my_bucket/cust.csv CUST $EVD_CUST
    Partition CUST CUST_P $EVD_CUST --round-robin
    Map CUST_P PROC_M $EVD_CUST $EVD_PROC $EVM_PROC
    Departition PROC_M PROC_G $EVD_PROC --round-robin
    Write PROC_G sftp:///some/path/proc.csv.gz $EVD_PROC
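The two distribution modes can be sketched as follows. Both function names are hypothetical. The round-robin variant follows the option description directly; the key variant assumes a hash of the key decides the partition, which is an assumption of this sketch and not something the text above specifies. What matters for downstream components is only that equal keys land in the same partition.

```python
import zlib
from typing import Dict, Iterable, List

Record = Dict[str, object]


def partition_round_robin(records: Iterable[Record], partitions: int) -> List[List[Record]]:
    """Send one record after another to one output after another."""
    outputs: List[List[Record]] = [[] for _ in range(partitions)]
    for index, rec in enumerate(records):
        outputs[index % partitions].append(rec)
    return outputs


def partition_by_key(records: Iterable[Record], partitions: int, key_field: str) -> List[List[Record]]:
    """Send records with the same key value to the same output (assumed: by hash)."""
    outputs: List[List[Record]] = [[] for _ in range(partitions)]
    for rec in records:
        # crc32 is used because it is stable across runs, unlike Python's hash().
        digest = zlib.crc32(str(rec[key_field]).encode("utf-8"))
        outputs[digest % partitions].append(rec)
    return outputs
```

Round-robin gives evenly sized partitions; key-based distribution may be skewed but keeps each key's records together, which per-key processing needs.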

8.15 Sort

(since EVL 1.0)

Takes records from stdin or <f_in>, sorts them by <key> and writes them to stdout or <f_out>. With the ‘-u’ option it also deduplicates the data. At the moment only the traditional sort order is used (i.e. like LC_ALL=C), not a national one.

To sort based on all of the fields, use an empty <key>.

Sort
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

evl sort
is intended for standalone usage, i.e. to be invoked from command line and reading records from standard input and writing to standard output.

EVD and EVS are EVL definition files, for details see evl-evd(5) and evl-evs(5).

Synopsis

Sort
<f_in> <f_out> (<evd>|-d <inline_evd>) -k <key>
[-u <unique-key> [-t|--keep-first] [--reject=<file>]]
[-c|--check-sort] [-f|--file-storage] [-i|--ignore-case]
[--validate] [-x|--text-input] [-y|--text-output]

evl sort
(<evd>|-d <inline_evd>) -k <key>
[-u <unique-key> [-t|--keep-first] [--reject=<file>]]
[-c|--check-sort] [-f|--file-storage] [-i|--ignore-case]
[--validate] [-x|--text-input] [-y|--text-output]
[-v|--verbose]

evl sort
( --help | --usage | --version )

Options

-c, --check-sort
only check if the input is sorted and fail if not

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be specified. Example: -d 'id int, user_id string enc=iso-8859-1'

-f, --file-storage
store temporary files on disk instead of using memory

-i, --ignore-case
ignore case sensitivity for key fields

-k, --key=<key>
sort by a key, where <key> is a comma-separated list of fields with type (default type is ASC). When the <key> is empty, it sorts based on the whole record. Example: --key='id,user_id DESC,modify_dt ASC'

-r, --reject=<reject_file>
used together with option -u, it catches duplicate records into <reject_file>

-t, --keep-first
when deduplicating by --unique-key, keep the first record from the group

-u, --unique-key=<unique_key>
deduplicate the output by <unique_key>; keep only the last record of each group unless --keep-first is specified. Duplicate records can be caught by the -r option. Example: -u 'id,user_id'

--validate
without this option, no fields are checked against data types. With this option, all output fields are checked

-x, --text-input
suppose the input as text, not binary

-y, --text-output
write the output as text, not binary

Standard options:

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples

  1. Sort the text input by the whole record (i.e. by all fields) and write the result into a text output file:

    evl sort example.evd -k '' -xy <in.txt >out.txt
  2. Deduplicate the binary input (for example from another EVL component) by keeping the first record in each group with the same id (with the lowest updated date) and write the result into output.csv and duplicates into duplicates.csv:

    cat input.bin | \
    evl sort -ty -k'id,updated' -u'id' \
    -d'id int sep=",", updated date sep="\n"' -r duplicates.csv >output.csv
  3. Check, case-insensitively, that the text file input.txt is sorted by name and write the output into the file output.bin in binary (i.e. not as text):

    evl sort -cix -k'name' -d'name string sep="|", personal_id int sep="\n"' \
    <input.txt >output.bin
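
To picture what a key such as 'id,user_id DESC' means, the ordering can be reproduced on comma-separated text with the standard GNU `sort`. This is only a sketch of the ordering, not of `evl sort` itself: the sample data, the column positions and the numeric comparison of the first column are assumptions for illustration, whereas `evl sort` takes field names and types from the EVD.

```shell
# Sample records in the form id,user_id (hypothetical data).
input='2,bob
1,alice
1,carol
2,adam'

# Column 1 ascending (numeric), then column 2 descending: the analogue
# of --key='id,user_id DESC'. LC_ALL=C gives plain byte-wise ordering.
sorted=$(printf '%s\n' "$input" | LC_ALL=C sort -t, -k1,1n -k2,2r)

printf '%s\n' "$sorted"
```

Within each id the user_id values appear in reverse alphabetical order, which is what a DESC key on the second field asks for.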

8.16 Sortgroup

(since EVL 2.0)

Given an input already sorted by <group_key>, sort the records within each group defined by <group_key> according to <key>, so that the output is sorted by <group_key>,<key>. At the moment it uses only the traditional sort order (i.e. like LC_ALL=C), not a national one.

Sortgroup
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

evl sortgroup
is intended for standalone usage, i.e. to be invoked from command line and reading records from standard input and writing to standard output.

EVD and EVS are EVL definition files, for details see evl-evd(5) and evl-evs(5).

Synopsis

Sortgroup
<f_in> <f_out> (<evd>|-d <inline_evd>)
-g|--group-key=<group_key>
-k|--key=<key>
[-c|--check-sort] [-i|--ignore-case]
[--validate] [-x|--text-input] [-y|--text-output]

evl sortgroup
(<evd>|-d <inline_evd>)
-g|--group-key=<group_key>
-k|--key=<key>
[-c|--check-sort] [-i|--ignore-case]
[--validate] [-x|--text-input] [-y|--text-output]
[-v|--verbose]

evl sortgroup
( --help | --usage | --version )

Options

-c, --check-sort
check if the input is really sorted by <group_key>

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be present. Example: -d 'id int, user_id string'

-g, --group-key=<group_key>
the input is sorted by this key, where <group_key> is a comma-separated list of fields, each optionally followed by ASC or DESC (default is ASC). Example: -g 'id,user_id DESC'

-i, --ignore-case
ignore case sensitivity for key fields

-k, --key=<key>
sort by this key within each group of records with the same <group_key>. <key> is a comma-separated list of fields, each optionally followed by ASC or DESC (default is ASC). Example: -k 'modify_dt ASC'

--validate
without this option, no fields are checked against data types. With this option, all output fields are checked

-x, --text-input
treat the input as text, not binary

-y, --text-output
write the output as text, not binary

Standard options:

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples

  1. Suppose a dataset is already sorted by the field ‘customer’. TBA
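
As a rough illustration of the ordering Sortgroup produces, the following sketch uses only standard `awk` and `sort` on comma-separated text. It is not the EVL implementation: the sample data and the column layout (customer, then date) are assumptions, and the input is taken to be already sorted by the first column, which plays the role of <group_key>.

```shell
# Records in the form customer,order_date, already grouped by customer
# (hypothetical data).
input='acme,2024-03-01
acme,2024-01-15
zeta,2024-02-10
zeta,2024-02-01'

# Feed each group to its own sort process; closing the pipe when the
# group key changes makes sort emit that group before the next starts.
sorted=$(printf '%s\n' "$input" | LC_ALL=C awk -F, -v cmd='sort -t, -k2,2' '
  NR > 1 && $1 != prev { close(cmd) }
  { print | cmd; prev = $1 }')

printf '%s\n' "$sorted"
```

Only one group is held by `sort` at a time, which is the practical benefit of sorting within groups instead of re-sorting the whole input by both keys.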

8.17 Tail

(since EVL 1.1)

Print the last <num> records of the input. Without option ‘-n’, print the last 10 records.

Tail
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

evl tail
is intended for standalone usage, i.e. to be invoked from command line.

EVD is EVL data definition file, for details see evl-evd(5).

Synopsis

Tail
<f_in> <f_out> [<evd>|-d <inline_evd>] [-n [+]<num>] [-s|--skip-parse]
[--validate] [--skip-bom]
[ -x|--text-input | --text-input-dos-eol | --text-input-mac-eol ]
[ -y|--text-output | --text-output-dos-eol | --text-output-mac-eol ]


evl tail
[<evd>|-d <inline_evd>] [-n [+]<num>] [-s|--skip-parse]
[--validate] [--skip-bom]
[ -x|--text-input | --text-input-dos-eol | --text-input-mac-eol ]
[ -y|--text-output | --text-output-dos-eol | --text-output-mac-eol ]
[-v|--verbose]

evl tail
( --help | --usage | --version )

Options

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be present. Example: -d 'id int, user_id string enc=iso-8859-1'

-n, --records=[+]<num>
output the last <num> records instead of the default last 10; or use -n +<num> to output starting with record <num>

-s, --skip-parse
do not parse all fields, but only ‘jump’ over the record separator, i.e. the separator of the last field. Be careful with this option: it works well for ‘csv’ files, for example to skip an oddly formatted header, but it can give wrong results when some other field is separated by the same character as the last one.

--skip-bom
skip the UTF-8 BOM (byte order mark), i.e. the bytes EF BB BF, at the beginning of the input. Windows tools often add it to UTF-8 encoded files

--validate
without this option, no fields are checked against data types. With this option, all output fields are checked

-x, --text-input
treat the input as text, not binary

--text-input-dos-eol
treat the input as text with CRLF as end of line

--text-input-mac-eol
treat the input as text with CR as end of line

-y, --text-output
write the output as text, not binary

--text-output-dos-eol
produce the output as text with CRLF as end of line

--text-output-mac-eol
produce the output as text with CR as end of line

Standard options:

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples

  1. Print only the last 10 records to stdout:

    evl tail example.evd -xy <in.txt
  2. Read the binary input and skip the first 2 records without parsing them (i.e. they need not match the data structure defined by the EVD):

    cat input.bin | evl tail -sy -n+3 \
    -d'id int sep=",", updated date sep="\n"' \
    > output.txt
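
The two forms of ‘-n’ mirror the standard `tail` command, so their meaning can be checked on plain text lines without EVL. The sketch below uses only the standard utility; the sample lines are made up for illustration.

```shell
# Two header lines followed by two data records (hypothetical data).
input='header line 1
header line 2
1,2024-01-15
2,2024-02-01'

# -n +3 starts the output at record 3, i.e. it skips the first two.
from_third=$(printf '%s\n' "$input" | tail -n +3)

# A plain number selects that many records from the end.
last_one=$(printf '%s\n' "$input" | tail -n 1)

printf '%s\n' "$from_third"
printf '%s\n' "$last_one"
```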

8.18 Tee

(since EVL 1.0)

Replicate <f_in> to multiple <f_out>.

Tee
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

There is no standalone version of this component as one can use the standard UNIX command ‘tee’.

Synopsis

Tee
<f_in> <f_out>...

evl tee
( --help | --usage | --version )

Options

Standard options:

--help
print this help and exit

--usage
print short usage information and exit

--version
print version and exit

Examples

Replicate to output flows (or files) A,B,C,D,E,F:

Tee IN_FLOW A B C D E F
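
On the command line the same replication can be done with the standard `tee`, as noted above. A minimal sketch, in which the temporary directory and the file names A, B and C are chosen only for illustration:

```shell
# Scratch directory for the copies (hypothetical location).
workdir=$(mktemp -d)

# tee writes its input to every named file and to stdout,
# which is redirected here into a third copy.
printf 'a\nb\n' | tee "$workdir/A" "$workdir/B" > "$workdir/C"

# All three outputs should be byte-for-byte identical.
if cmp -s "$workdir/A" "$workdir/B" && cmp -s "$workdir/B" "$workdir/C"; then
  copies_match=yes
else
  copies_match=no
fi
echo "copies match: $copies_match"
```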

8.19 Trash

(since EVL 1.0)

Send <f_in> into /dev/null. Avoid it in a production environment where possible, since producing data only to redirect it to /dev/null still consumes resources.

Trash
is to be used in EVS job structure definition file. <f_in> is either input file or flow name, both can be partitioned.

There is no standalone version of this component as you can always use >/dev/null.

EVS is EVL job structure definition file, for details see evl-evs(5).

Synopsis

Trash
<f_in>...

evl trash
( --help | --usage | --version )

Options

Standard options:

--help
print this help and exit

--usage
print short usage information and exit

--version
print version and exit


8.20 Uniq

(since EVL 2.1)

Read stdin or <f_in> and write to stdout or <f_out> the last record of each group specified by <key>. The input must be sorted according to this key.

Uniq
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

evl uniq
is intended for standalone usage, i.e. to be invoked from command line and reading records from standard input and writing to standard output.

EVD and EVS are EVL definition files, for details see evl-evd(5) and evl-evs(5).

Synopsis

Uniq
<f_in> <f_out> (<evd>|-d <inline_evd>) -k <key> [-c|--check-sort]
[-i|--ignore-case] [--reject=<file>] [-t|--keep-first]
[--validate] [-x|--text-input] [-y|--text-output]

evl uniq
(<evd>|-d <inline_evd>) -k <key> [-c|--check-sort]
[-i|--ignore-case] [--reject=<file>] [-t|--keep-first]
[--validate] [-x|--text-input] [-y|--text-output]
[-v|--verbose]

evl uniq
( --help | --usage | --version )

Options

-c, --check-sort
check if the input is sorted and fail if not

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be present. Example: -d 'id int, user_id string enc=iso-8859-1'

-i, --ignore-case
ignore case sensitivity for key fields

-k, --key=<key>
deduplicate by a key, where <key> is a comma-separated list of fields, each optionally followed by ASC or DESC (default is ASC). Example: -k 'id,user_id DESC,modify_dt ASC'

-r, --reject=<reject_file>
write the duplicate records that are dropped into <reject_file>

-t, --keep-first
keep the first record of the group instead of the last one

--validate
without this option, no fields are checked against data types. With this option, all output fields are checked

-x, --text-input
treat the input as text, not binary

-y, --text-output
write the output as text, not binary

Standard options:

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples

  1. Deduplicate by all fields and write the result into a text output file:

    evl uniq example.evd -k'' -xy < in.txt > out.txt
  2. Deduplicate the binary input (for example from another EVL component), already sorted by id and updated, by keeping the first record in each group with the same id (i.e. the one with the lowest updated date) and write the result into output.csv and duplicates into duplicates.csv:

    cat input.bin | evl uniq -ty -k'id' \
    -d'id int sep=",", updated date sep="\n"' \
    -r duplicates.csv > output.csv
  3. Deduplicate by name, case-insensitively, the text file input.txt while checking that it is sorted, and write the output into the file output.bin in binary (i.e. not as text):

    evl uniq -cix --key="name" \
    -d 'name string sep="|", personal_id int sep="\n"' \
    < input.txt > output.bin
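
The difference between the default behaviour (keep the last record of each group) and ‘--keep-first’ can be pictured with a few lines of standard `awk` on comma-separated text. This is a sketch of the semantics only, not of how `evl uniq` is implemented; the sample data and the choice of the first column as the key are assumptions.

```shell
# Records in the form id,updated, sorted by id and then by updated
# (hypothetical data).
input='1,2024-01-01
1,2024-03-01
2,2024-02-01'

# Default: hold each record and emit it only when the key changes,
# so the last record of every group survives.
keep_last=$(printf '%s\n' "$input" | awk -F, '
  NR > 1 && $1 != key { print held }
  { key = $1; held = $0 }
  END { if (NR > 0) print held }')

# Analogue of --keep-first: emit a record only when a new key starts.
keep_first=$(printf '%s\n' "$input" | awk -F, '
  NR == 1 || $1 != key { print }
  { key = $1 }')

printf '%s\n' "$keep_last"
printf '%s\n' "$keep_first"
```

Both variants compare each record only with its predecessor, which is why the input has to be sorted by the key beforehand.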

8.21 Validate

(since EVL 1.1)

Fail as soon as a value with an invalid data type appears, unless the ‘--limit’ option specifies a different threshold.

Validate
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

evl validate
is intended for standalone usage, i.e. to be invoked from command line and reading records from standard input and writing to standard output.

EVD and EVS are EVL definition files, for details see evl-evd(5) and evl-evs(5).

Synopsis

Validate
<f_in> <f_out> (<evd>|-d <inline_evd>)
[-l|--limit <num>] [-y|--text-output]

evl validate
(<evd>|-d <inline_evd>)
[-l|--limit <num>] [-y|--text-output]
[-v|--verbose]

evl validate
( --help | --usage | --version )

Options

-l, --limit=<num>
fail after reaching <num> invalid records. If <num> is ‘0’, never fail. The default value is ‘1’, i.e. fail immediately after the first invalid record.

-y, --text-output
write the output as text, not binary

Standard options:

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit
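
The effect of ‘--limit’ on the exit status can be sketched with standard `awk`. The helper `check_ids` below is hypothetical and not part of EVL: it treats a record as invalid when its first comma-separated field is not an integer, and it models only the exit status, not what Validate writes to its output.

```shell
# check_ids LIMIT: hypothetical helper. Exits with status 1 once LIMIT
# records have a non-integer first field; LIMIT=0 means never fail.
check_ids() {
  awk -F, -v limit="$1" '
    $1 !~ /^-?[0-9]+$/ { bad++; if (limit > 0 && bad >= limit) exit 1 }'
}

# Three records, one of them with an invalid id (hypothetical data).
input='1,ok
x,bad
2,ok'

status_default=0; printf '%s\n' "$input" | check_ids 1 || status_default=$?
status_two=0;     printf '%s\n' "$input" | check_ids 2 || status_two=$?
status_never=0;   printf '%s\n' "$input" | check_ids 0 || status_never=$?

echo "limit=1: $status_default, limit=2: $status_two, limit=0: $status_never"
```

With a single invalid record, only the default limit of 1 leads to a failure; a limit of 2 is never reached and a limit of 0 disables the failure altogether.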


8.22 Watcher

(since EVL 1.2)

This component writes records passing through the <flow> into <file> in text format.

Works only when the variable ‘EVL_WATCHER’ is set to ‘1’, otherwise does nothing. One can use it for debugging data in a ‘DEV’ or ‘TEST’ environment and keep it switched off in ‘PROD’.

If <file> is not specified with a full path, it is written into the directory defined by the ‘EVL_WATCHER_DIR’ environment variable, which is by default the ‘watcher’ subfolder of the current project.

EVD is EVL data definition file, for details see evl-evd(5).

Synopsis

Watcher
<flow> <file> (<evd>|-d <inline_evd>) [-x|--text-input]

evl watcher
( --help | --usage | --version )

Options

-d, --output-definition=<inline_evd>
either this option or the file <evd> must be present. Example: -d 'user_sum long'

-x, --text-input
treat the input as text, not binary

Standard options:

--help
print this help and exit

--usage
print short usage information and exit

--version
print version and exit

Examples

  1. In EVL job (‘evs’ file):

    Sort     FLOW_01 FLOW_02 some.evd --key='id'
    Watcher FLOW_02 sorted.csv some.evd
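
The switch-on and directory rules described above can be imitated in plain shell. The function `watch_flow` below is a hypothetical stand-in written for illustration, not an EVL command: it passes its input through unchanged and saves a copy only when ‘EVL_WATCHER’ is ‘1’. Using ‘watcher’ relative to the current directory as the fallback is an assumption standing in for the project subfolder.

```shell
# watch_flow FILE: hypothetical helper, not an EVL command.
# Copies stdin to stdout; only when EVL_WATCHER is 1 does it also save
# a copy to FILE (relative names go under EVL_WATCHER_DIR).
watch_flow() {
  if [ "${EVL_WATCHER:-0}" != "1" ]; then
    cat                                          # switched off: pass through
    return
  fi
  case $1 in
    /*) target=$1 ;;                             # full path given
    *)  target=${EVL_WATCHER_DIR:-watcher}/$1 ;; # default directory
  esac
  mkdir -p "$(dirname "$target")"
  tee "$target"
}

EVL_WATCHER_DIR=$(mktemp -d)

EVL_WATCHER=0
off_out=$(printf '1\n2\n' | watch_flow sorted.csv)
if [ -e "$EVL_WATCHER_DIR/sorted.csv" ]; then off_saved=yes; else off_saved=no; fi

EVL_WATCHER=1
on_out=$(printf '1\n2\n' | watch_flow sorted.csv)
on_saved=$(cat "$EVL_WATCHER_DIR/sorted.csv")

echo "saved while off: $off_saved"
```

In both cases the data flowing through is unchanged; the variable only decides whether the side copy is written.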