Version: 2.6

8 Basic Components ¶

Most of these basic components follow standard GNU/Linux commands, so their purpose is immediately obvious.

Standard ETL components ¶

Assign
Cat
Cmd
Comp
Cut
Departition
Echo
Filter
Gather
Generate
Head
Lookup
Merge
Partition
Sort
Sortgroup
Tail
Tee
Trash
Uniq
Validate
Watcher

8.1 Assign ¶

(since EVL 1.2)

Assign the content of the input flow or file <f_in> to the shell variable <varname>, which is then exported into the environment. Don’t forget to apply ‘--text-output’ on the preceding component to get text content in <varname>.

This component doesn’t work for partitioned flow.

Assign
is to be used in EVS job structure definition file. <f_in> is either input file or flow name.

There is no standalone version of this component as you can use standard Bash behaviour for this purpose. For example:

VARNAME=$(evl cat filename some.evd --text-output)

EVS is EVL job structure definition file, for details see evl-evs(5).

Synopsis ¶

Assign
  <f_in> <varname>

evl assign
  ( --help | --usage | --version )

Options ¶

Standard options: ¶

--help
print this help and exit

--usage
print short usage information and exit

--version
print version and exit

Examples ¶

EVL job (an ‘evs’ file) which reads content of a binary file ‘hwm.bin’ into variable ‘HWM’:
```
Read    hwm.bin   FLOW_HWM  evd/some.evd  --text-output
Assign  FLOW_HWM  HWM
```
Such a value can then be used (after the ‘Wait’ component!) within a mapping:
```
static int hwm = getenv_int("HWM",0);   // use 0 when $HWM is empty
*out->incremental_id = ++hwm;
```
To get a value from text file:
```
Assign  hwm.txt  HWM
```

To assign flow content into a ‘NATCO’ variable:

Map     FLOW_01  FLOW_02 in.evd out.evd map.evm  --text-output
Assign  FLOW_02  NATCO

8.2 Cat ¶

(since EVL 1.0)

Concatenate flows or files.

Cat
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

evl cat
is intended for standalone usage, i.e. to be invoked from command line.

EVD is EVL data definition file, for details see ‘man 5 evd’.

Synopsis ¶

Cat
  <f_in>... <f_out> (<evd>|-d <inline_evd>)
  [ --validate ]
  [ -x|--text-input | --text-input-dos-eol | --text-input-mac-eol ]
  [ -y|--text-output | --text-output-dos-eol | --text-output-mac-eol ]

evl cat
  [<file>...]  (<evd>|-d <inline_evd>)
  [ --validate ]
  [ -x|--text-input | --text-input-dos-eol | --text-input-mac-eol ]
  [ -y|--text-output | --text-output-dos-eol | --text-output-mac-eol ]
  [ -v|--verbose ]

evl cat
  ( --help | --usage | --version )

Options ¶

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be provided. Example: ‘-d 'id int, user_id string enc=iso-8859-1'’

--validate
without this option, no fields are checked against data types. With this option, all output fields are checked

-x, --text-input
suppose the input as text, not binary

--text-input-dos-eol
suppose the input as text with CRLF as end of line

--text-input-mac-eol
suppose the input as text with CR as end of line

-y, --text-output
write the output as text, not binary

--text-output-dos-eol
produce the output as text with CRLF as end of line

--text-output-mac-eol
produce the output as text with CR as end of line

Standard options: ¶

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples ¶

Print to stdout binary input in text format:

evl cat example.evd -y <input.bin

8.3 Cmd ¶

(since EVL 1.2)

Basically, it runs:

cat <f_in> | <command> > <f_out>

When <f_in> is empty, then it runs:

<command> > <f_out>

and when <f_out> is empty:

cat <f_in> | <command>

<command> can also be a pipeline.

If <f_in> is partitioned, then <command> is applied to each partition and the output <f_out> is kept partitioned as well.

Synopsis ¶

Cmd
  <f_in> <f_out> <command>

evl cmd
  ( --help | --usage | --version )

Options ¶

Standard options: ¶

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples ¶

Write ‘repeat some error message’ 10 times to stderr and into the EVL job log:
```
Cmd "" /dev/stderr "yes repeat some error message | head"
```

Suppose ‘SOME_FLOW’ provides integers, one per line; the median can then be computed by R and written into ‘/some/file’:

Cmd SOME_FLOW /some/file "Rscript median.R"

The file median.R might look like this:

f <- file('stdin'); open(f); x <- c();
while ( length( line <- readLines(f) ) > 0 ) x <- c(x,as.integer(line));
write(median(x), stdout());

8.4 Comp ¶

(since EVL 1.0)

Run <component> from the project’s evc directory with arguments <comp_arg>. Within the <component>, these arguments are available as the array elements ‘COMP_ARG[1]’, ‘COMP_ARG[2]’, and so on; ‘COMP_ARG[0]’ is the component’s name.

When the <component> is not found in the current project’s ‘evc/’ subdirectory, the folder ‘$EVL_EVC_DIR/’ is tried.

You can also specify the full path to the component. Check examples.

Flow names within a component have unique prefixes, so they cannot conflict with those in the job. However, if you need to connect output flow(s) of the component, use the variable ‘$COMP_FLOW’, which the component sets to that prefix. A flow from the component, e.g. ‘FLOW_IN_COMP’, can then be read in the parent job as ‘$COMP_FLOW.FLOW_IN_COMP’. Check examples.

For input flows there is a variable ‘$PARENT_FLOW’, which can be used in the component. The parent flow ‘FLOW_INTO_COMP’ can be referenced within a component as ‘$PARENT_FLOW.FLOW_INTO_COMP’. Check the examples for better understanding.

Comp
is to be used in EVS job structure definition file.

evl comp
is intended for standalone usage, i.e. to be invoked from command line and reading records from standard input and writing to standard output.

EVS is EVL job structure definition file, for details see evl-evs(5).

Synopsis ¶

Comp
  <component> [<comp_arg>...]

evl comp
  ( --help | --usage | --version )

Options ¶

Standard options: ¶

--help
print this help and exit

--usage
print short usage information and exit

--version
print version and exit

Examples ¶

Run custom component ‘evc/prepare_lkp.evc’ with neither input nor output:
```
Comp prepare_lkp
```

Run component from EVL Data Hub template project with three arguments:

Comp $EVL_TEMPLATE_DIR/data-hub/evc/scd2_read_increments.evc party.*.csv evd/party.evd id

Reading output from the component. Suppose you have a custom generic component ‘evc/read_files.evc’ which does some processing of JSON files, e.g.:

jsons="${COMP_ARG[1]}"
evd="${COMP_ARG[2]}"
key="${COMP_ARG[3]}"
Read "$jsons" JSONS "$evd"
Tee  JSONS A B "$evd" --key="$key"

And you need to connect these output flows ‘A’ and ‘B’ into your job, e.g.:

Comp read_files /landing/users.*.json evd/users.evd surname
Sort  $COMP_FLOW.A SORTED evd/users.evd
Write $COMP_FLOW.B users.csv evd/users.evd
...

Writing flow to the component. Suppose you have custom component ‘evc/write_log.evc’, e.g.:

flow_in="${COMP_ARG[1]}"
Write $flow_in some_file.log -d "X string" --text-output

In the job it would look like this:

Tail XXX LOG evd/XXX.evd -n 100
Comp write_log.evc LOG

Alternatively, the component could look like this:

Write $PARENT_FLOW.LOG some_file.log -d "X string" --text-output

8.5 Cut ¶

(since EVL 1.0)

Remove columns from input records. Use this component when you want to reduce the number of columns.

Cut
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

evl cut
is intended for standalone usage, i.e. to be invoked from command line and read records from standard input and write to standard output.

EVD is EVL data definition file, for details see evl-evd(5).

Synopsis ¶

Cut
  <f_in> <f_out> (<evd_in>|-D <inline_evd>) (<evd_out>|-d <inline_evd>)
  [--validate] [-x|--text-input] [-y|--text-output]

evl cut
  (<evd_in>|-D <inline_evd>) (<evd_out>|-d <inline_evd>)
  [--validate] [-x|--text-input] [-y|--text-output]
  [-v|--verbose]

evl cut
  ( --help | --usage | --version )

Options ¶

-D, --input-definition=<inline_evd>
either this option or the file <evd_in> must be provided. Example: -D 'id int, user_id string'

-d, --output-definition=<inline_evd>
either this option or the file <evd_out> must be provided. Example: -d 'user_sum long'

--validate
without this option, no fields are checked against data types. With this option, all output fields are checked

-x, --text-input
suppose the input as text, not binary

-y, --text-output
write the output as text, not binary

Standard options: ¶

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples ¶

Print to stdout only integer field ‘id’:
```
evl cut example.evd -d'id int' -xy <in.txt
```

8.6 Departition ¶

(since EVL 1.2)

Gather or merge partitions into one output flow or file. When ‘-k <key>’ is specified, the input of each partition is assumed to be sorted and the output is again sorted (i.e. merged). Without ‘-k <key>’, input partitions are gathered in round-robin fashion. Applied to only one partition, it simply writes the input to the output.

EVD is EVL data definition file, for details see evl-evd(5).

Synopsis ¶

Departition
  <f_in>... <f_out> (<evd>|-d <inline_evd>)
  (--key=<key> | --round-robin)
  [--validate] [-x|--text-input] [-y|--text-output]

evl departition
  <file_in> <file_out> (<evd>|-d <inline_evd>)
  (--key=<key> | --round-robin)
  [--validate] [-x|--text-input] [-y|--text-output]
  [-v|--verbose]

evl departition
  ( --help | --usage | --version )

Options ¶

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be provided. Example: -d 'id int, user_id string enc=iso-8859-1'

-k, --key=<key>
merge partitioned flows/files according to the key, so the output is sorted by this key

-r, --round-robin
gather in round-robin fashion

--validate
without this option, no fields are checked against data types. With this option, all output fields are checked

-x, --text-input
suppose the input as text, not binary

-y, --text-output
write the output as text, not binary

Standard options: ¶

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples ¶

To departition a partitioned flow in an EVL job:

Read  gs://my_bucket/cust.csv CUST $EVD_CUST
Partition   CUST      CUST_P  $EVD_CUST --round-robin
Map         CUST_P    PROC_M  $EVD_CUST $EVD_PROC $EVM_PROC
Departition PROC_M    PROC_G  $EVD_PROC --round-robin
Write       PROC_G    gdrive://proc.xlsx $EVD_PROC

8.7 Echo ¶

(since EVL 2.0)

Write <string> into <f_out>. This component doesn’t produce partitioned flow.

‘Echo’ is to be used in EVS job structure definition file.

<f_out> is either output file or flow name.

There is no standalone version of this component as you can use standard ‘echo’.

EVS is EVL job structure definition file, for details see evl-evs(5).

Synopsis ¶

Echo
  <string> <f_out> [-e] [-n]

evl echo
  ( --help | --usage | --version )

Options ¶

-n
do not output the trailing newline (standard Bash echo option)

-e
enable interpretation of backslash escapes (standard Bash echo option)

Standard options: ¶

--help
print this help and exit

--usage
print short usage information and exit

--version
print version and exit

Examples ¶

An EVL job (specified in an ‘evs’ file) which runs a simple select statement against a PostgreSQL table:
```
Echo   "select max(id) from some_db.some_table;" SELECT
RunPG  SELECT MAX_ID
```

To add two hardcoded records to the end of a flow:

...   ...    FLOW    -d "s string"
Echo  "Some string footer,\nwith two lines." FOOTER -e
Cat   FLOW   FOOTER  -d "s string"
...

8.8 Filter ¶

(since EVL 1.0)

Filter records by the <condition>. Records for which the <condition> is false are forwarded to a reject file or flow, if specified.

In many cases it is better to filter records in the ‘Map’ component using the ‘discard()’ function. Having a ‘Filter’ component right before or after a ‘Map’ is not optimal for performance. Check ‘man evl-map’ for details.

Also using ‘Filter’ right after a ‘Read’ component is usually not performance optimal. It is usually better to shift filtering to the database for example. Check option ‘--where’ of ‘Read’ component for details.

Filter
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

evl filter
is intended for standalone usage, i.e. to be invoked from command line and read records from standard input and write to standard output.

EVD is EVL data definition file, for details see evl-evd(5).

Synopsis ¶

Filter
  <f_in> <f_out> (<evd>|-d <inline_evd>) <condition>
  [-r|--reject=<f_out>]
  [-x|--text-input] [-y|--text-output]

evl filter
  (<evd>|-d <inline_evd>) <condition>
  [-r|--reject=<f_out>]
  [-x|--text-input] [-y|--text-output]
  [-v|--verbose]

evl filter
  ( --help | --usage | --version )

Options ¶

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be provided. Example: -d 'id int, user_id string enc=iso-8859-1'

-r, --reject=<f_out>
catch rejected records into file or flow

-x, --text-input
suppose the input as text, not binary

-y, --text-output
write the output as text, not binary

Standard options: ¶

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples ¶

Command line invocation examples: ¶

To print to stdout only records from file ‘ID.txt’ with value of id less than 100:
```
evl filter -d 'id int' -xy '*id<100' < ID.txt
```

Field ‘id’ is a pointer, so to get the value, ‘*id’ must be used.

Print to stdout only records from file ‘IDs.csv’ where ‘id1’ is different from ‘id2’; records with the same ids are sent into ‘same_IDs.csv’:

evl filter -d 'id1 int sep=",", id2 int' -xy -r same_IDs.csv \
    '*id1 != *id2' < IDs.csv

EVL job examples: ¶

In an ‘evs’ file:

...     ...    SOURCE  evd/sample.evd
Filter  SOURCE OUTPUT  evd/sample.evd  "price && *currency == \"EUR\""
...     OUTPUT ...     evd/sample.evd

This example filters out records with NULL ‘price’ and with currency other than ‘EUR’. (‘price’ is a pointer, so simply specifying ‘price’ in the condition means ‘price != nullptr’.)

If a ‘Read’ component comes right before the ‘Filter’, consider using the ‘--where’ option instead, because in that case the filter is shifted to the source DB, e.g.:

SRC_HOST_URI="postgres://tech_etl@pg_server:5432"
SRC_PATH="dwh_db?schema=public&table=invoices"

Read    $SRC_HOST_URI/$SRC_PATH INVOICES_EUR evd/invoices.evd \
            --where "price is not null AND currency = 'EUR'"
Map     INVOICES_EUR            EUR_MAP      evd/invoices.evd ...

This runs the query in the PostgreSQL database with the where condition:

WHERE price is not null AND currency = 'EUR'

One can also use EVL notation with this ‘--where’ option, e.g.:

SRC_HOST_URI="postgres://tech_etl@pg_server:5432"
SRC_PATH="dwh_db?schema=public&table=invoices"

Read    $SRC_HOST_URI/$SRC_PATH INVOICES_EUR evd/invoices.evd \
            --where 'price && *currency == "EUR"'
Map     INVOICES_EUR            EUR_MAP      evd/invoices.evd ...

so then it would work also in case of reading a file:

Read    data/invoices.csv INVOICES_EUR evd/invoices.evd \
            --where 'price && *currency == "EUR"'
Map     INVOICES_EUR      EUR_MAP      evd/invoices.evd ...

in such case it is then internally the same as:

Read    data/invoices.csv INVOICES_SRC evd/invoices.evd
Filter  INVOICES_SRC      INVOICES_EUR evd/invoices.evd \
            'price && *currency == "EUR"'
Map     INVOICES_EUR      EUR_MAP      evd/invoices.evd ...

And using ‘Filter’ to split a flow:

...     ...    INV  evd/invoices.evd
Filter  INV    EUR  evd/invoices.evd -r NONEUR '*currency == "EUR"'
Sort    EUR    EUR_SRT    evd/invoices.evd --key "price"
Sort    NONEUR NONEUR_SRT evd/invoices.evd --key "currency,price"
...
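
The split behaviour of ‘Filter’ with ‘--reject’ can be pictured with a minimal Python sketch. This is not EVL code and says nothing about how EVL implements the component; `filter_split` and the sample invoices are hypothetical names and data used only for illustration.

```python
def filter_split(records, condition):
    """Split records into (accepted, rejected), keeping the input order."""
    accepted, rejected = [], []
    for record in records:
        # Records matching the condition go to the output, the rest to reject.
        (accepted if condition(record) else rejected).append(record)
    return accepted, rejected


# Hypothetical sample data standing in for the INV flow.
invoices = [
    {"currency": "EUR", "price": 10},
    {"currency": "USD", "price": 7},
    {"currency": "EUR", "price": 3},
]
eur, noneur = filter_split(invoices, lambda r: r["currency"] == "EUR")
print([r["price"] for r in eur])     # [10, 3]
print([r["price"] for r in noneur])  # [7]
```

Every input record ends up in exactly one of the two outputs, which is why a single ‘Filter’ is enough to split a flow in two.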

8.9 Gather ¶

(since EVL 1.2)

Gather several input flows or files into one output flow or file in round-robin fashion.

Gather
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

evl gather
is intended for standalone usage, i.e. to be invoked from command line. When <file> is ‘-’, it reads from stdin.

EVD is EVL data definition file, for details see evl-evd(5).

Synopsis ¶

Gather
  <f_in>... <f_out> (<evd>|-d <inline_evd>)
  [--validate] [-x|--text-input] [-y|--text-output]

evl gather
  [<file>...]  (<evd>|-d <inline_evd>)
  [--validate] [-x|--text-input] [-y|--text-output]
  [-v|--verbose]

evl gather
  ( --help | --usage | --version )

Options ¶

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be provided. Example: -d 'id int, user_id string enc=iso-8859-1'

--validate
without this option, no fields are checked against data types. With this option, all output fields are checked

-x, --text-input
suppose the input as text, not binary

-y, --text-output
write the output as text, not binary

Standard options: ¶

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples ¶

The following command:

evl gather file.a file.b file.c file.evd -xy

prints to stdout the first record of ‘file.a’, then the first record of ‘file.b’, then the first record of ‘file.c’, then the second records, and so on.
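
The round-robin order can be sketched in a few lines of Python. This is only an illustration of the ordering, not EVL's implementation; `gather_round_robin` is a hypothetical name, and skipping inputs that are already exhausted is an assumption, since the text above does not say what happens when the inputs differ in length.

```python
from itertools import zip_longest

_MISSING = object()  # marker for inputs that have run out of records


def gather_round_robin(*inputs):
    """Yield the 1st record of each input, then the 2nd of each, and so on."""
    for row in zip_longest(*inputs, fillvalue=_MISSING):
        for record in row:
            # Assumption: an exhausted input is simply skipped.
            if record is not _MISSING:
                yield record


file_a = ["a1", "a2", "a3"]
file_b = ["b1"]
file_c = ["c1", "c2"]
print(list(gather_round_robin(file_a, file_b, file_c)))
# ['a1', 'b1', 'c1', 'a2', 'c2', 'a3']
```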

To gather a partitioned flow in an EVL job:

Read      s3://my_bucket/cust.csv CUSTOMERS $EVD_CUST
Partition CUSTOMERS CUST_P  $EVD_CUST --round-robin
Map       CUST_P    PROC_M  $EVD_CUST $EVD_PROC $EVM_PROC
Gather    PROC_M    PROC_G  $EVD_PROC
Write     PROC_G    sftp:///some/path/proc.csv.gz $EVD_PROC

8.10 Generate ¶

(since EVL 1.3)

Generate records according to the data definition (evd file) and write them to stdout or to an output flow or file.

EVD is EVL data definition file, for details see evl-evd(5).

When no `<config_file>` is specified: ¶

Number data types
values from the whole range of given data type are randomly generated

Date, timestamp
values between 1970-01-01 and 2199-12-31 are randomly generated

String
random characters [a-zA-Z0-9] of the length between 0 and 10 are generated

Vector
random number of elements between 0 and 10 are generated

When `<config_file>` in JSON format is specified: ¶

Number data types
range, values, probability of nulls

Date, timestamp
range, values, probability of nulls

String
range, values, min-length, max-length, probability of nulls

Vector
range, values, min-elements, max-elements, probability of nulls

When both a probability of nulls and values containing null are specified, only the probability is taken into account. When range(s) and values overlap, this has no effect on the probability; all values have the same probability of being generated. See the examples of JSON below for details.

Synopsis ¶

Generate
  <f_out> (<evd>|-d <inline_evd>) [<config_file>]
  [-n|--records <num>] [-y|--text-output]

evl generate
  (<evd>|-d <inline_evd>) [<config_file>]
  [-n|--records <num>] [-y|--text-output]
  [-v|--verbose]

evl generate
  ( --help | --usage | --version )

Options ¶

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be provided. Example: -d 'id int, user_id string enc=iso-8859-1'

-n, --records=<num>
generate <num> number of records instead of the default one

-y, --text-output
write the output as text, not binary

Standard options: ¶

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples ¶

Print to stdout one random uchar:
```
evl generate -d 'value uchar' -y
```

Example of config JSON file:

{
  "int_field": {
    "values": [100, 200, 500],
    "range": { "min": 0, "max": 10 },
    "range": { "min": 50, "max": 60 },
    "null": 0.1
  },
  "float_field": {
    "range": { "min": 0, "max": 100 }
  },
  "date_field": {
    "values": [ null, "2018-03-07", "2018-03-08" ]
  },
  "struct_field.string_field1": {
    "min-length": 10,
    "max-length": 20
  },
  "struct_field.string_field2": {
    "values": ["abc", "def", "ghi", "jkl"]
  },
  "struct_field.decimal_field": {
    "range": { "min": "0.00", "max": "100.00" }
  },
  "vector_field": {
    "min-elements": 2,
    "max-elements": 5
  },
  "vector_field[]": {
    "range": { "min": "2018-03-07 05:00:00", "max": "2018-03-07 14:00:00" }
  }
}

where corresponding evd is:

int_field        int           sep="|"  null=""
float_field      float         sep="|"
date_field       date          sep="|"  null=""
struct_field     struct        sep="|"
  string_field1  string        sep=";"
  string_field2  string        sep=";"
  decimal_field  decimal(5.2)  sep=";"
vector_field     vector        sep="\n"
  timestamp                    sep=","

For the ‘int_field’ it randomly generates the values 0,1,...,10,50,...,60,100,200,500, but in 10% of cases a ‘NULL’ value is generated instead.
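
The sampling rule described for ‘int_field’ can be mimicked by the following Python sketch. It is an illustration only and does not read the JSON config format; `make_int_generator` and its parameters are hypothetical, and drawing uniformly from the union of ranges and values is my reading of the statement that overlaps do not change the probability.

```python
import random


def make_int_generator(ranges, values, null_probability, seed=None):
    """Return a function producing one random value (or None for NULL)."""
    rng = random.Random(seed)
    # A set removes overlaps, so every distinct value is equally likely.
    candidates = set(values)
    for low, high in ranges:
        candidates.update(range(low, high + 1))  # both bounds inclusive
    pool = sorted(candidates)

    def generate():
        # The null probability is applied first, then a uniform pick.
        if rng.random() < null_probability:
            return None
        return rng.choice(pool)

    return generate


gen = make_int_generator([(0, 10), (50, 60)], [100, 200, 500], 0.1, seed=1)
print([gen() for _ in range(5)])
```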

8.11 Head ¶

(since EVL 1.1)

Print the first <num> records of the input to the output. Without the ‘-n’ option, the first 10 records are printed.

Head
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

evl head
is intended for standalone usage, i.e. to be invoked from command line.

EVD is EVL data definition file, for details see evl-evd(5).

Synopsis ¶

Head
  <f_in> <f_out> [<evd>|-d <inline_evd>] [-n [-]<num>] [-s|--skip-parse]
  [--validate] [--skip-bom]
  [ -x|--text-input | --text-input-dos-eol | --text-input-mac-eol ]
  [ -y|--text-output | --text-output-dos-eol | --text-output-mac-eol ]

evl head
  [<evd>|-d <inline_evd>] [-n [-]<num>] [-s|--skip-parse]
  [--validate] [--skip-bom]
  [ -x|--text-input | --text-input-dos-eol | --text-input-mac-eol ]
  [ -y|--text-output | --text-output-dos-eol | --text-output-mac-eol ]
  [-v|--verbose]

evl head
  ( --help | --usage | --version )

Options ¶

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be provided. Example: -d 'id int, user_id string enc=iso-8859-1'

-n, --records=[-]<num>
output the first <num> records instead of the default first 10; or use ‘-n -<num>’ to output all records except the last <num>

-s, --skip-parse
this option has no effect with ‘--records <num>’ (i.e. the case when the first <num> records are read and the rest is ignored). But with ‘--records -<num>’ it does not parse all fields; it ‘jumps’ over the record separator, i.e. the separator of the last field. Be careful with this option: it is particularly good for ‘csv’ files when you want to skip, for example, a weirdly formatted footer, but it might be the wrong solution when some fields are separated by the same character as the last one.

--skip-bom
skip the UTF-8 BOM (byte order mark), i.e. the bytes EF BB BF, at the beginning of the input. Windows usually adds it to files in UTF-8 encoding

--validate
without this option, no fields are checked against data types. With this option, all output fields are checked

-x, --text-input
suppose the input as text, not binary

--text-input-dos-eol
suppose the input as text with CRLF as end of line

--text-input-mac-eol
suppose the input as text with CR as end of line

-y, --text-output
write the output as text, not binary

--text-output-dos-eol
produce the output as text with CRLF as end of line

--text-output-mac-eol
produce the output as text with CR as end of line

Standard options: ¶

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples ¶

Print to stdout only the first 10 records:
```
evl head example.evd -xy <in.txt
```

Read the binary input and omit the last 3 records without parsing them (i.e. they do not need to have the data structure defined by the evd):

cat input.bin | evl head -sy -n-3 \
                  -d 'id int sep=",", updated date sep="\n"' \
                    > output.txt
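
The difference between ‘-n <num>’ and ‘-n -<num>’ can be illustrated with a small Python sketch. It only mirrors the record selection, not parsing or ‘--skip-parse’; `head` here is a hypothetical helper, not an EVL API.

```python
from collections import deque
from itertools import islice


def head(records, num=10):
    """num >= 0: first num records; num < 0: all but the last -num records."""
    if num >= 0:
        yield from islice(records, num)
        return
    keep_back = -num
    buffer = deque()
    for record in records:
        buffer.append(record)
        # Release a record only once keep_back newer ones are buffered,
        # so the last keep_back records are never emitted.
        if len(buffer) > keep_back:
            yield buffer.popleft()


print(list(head(range(1, 8), 3)))   # [1, 2, 3]
print(list(head(range(1, 8), -3)))  # [1, 2, 3, 4]
```

The negative form has to buffer <num> records, because it cannot know which records are the last ones until the input ends.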

8.12 Lookup ¶

(since EVL 2.0)

Prepare a lookup from sorted input. The lookup can be used after the ‘Wait’ command until ‘Lookup remove’ is called. The input must be sorted by the <key>.

Lookup [remove]
is to be used in EVS job structure definition file. <f_in> is either input file or flow name.

evl lookup [remove]
is intended for standalone usage, i.e. to be invoked from command line.

EVD is EVL data definition file, for details see evl-evd(5).

Synopsis ¶

Lookup
  <f_in> <lookup_name> (<evd>|-d <inline_evd>) -k <key> [-x|--text-input]

Lookup remove
  <lookup_name>

evl lookup
  <lookup_name> (<evd>|-d <inline_evd>) -k <key> [-x|--text-input]
  [-v|--verbose]

evl lookup remove
  <lookup_name>

evl lookup
  ( --help | --usage | --version )

Options ¶

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be provided. Example: -d 'id int, user_id string enc=iso-8859-1'

-k, --key=<key>
key for looking up records

-x, --text-input
suppose the input as text, not binary

Standard options: ¶

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples ¶

To prepare lookup at the beginning of the job:

Read   dimension.csv DIM  evd/dim.evd  --text-input
Sort   DIM       DIM_SRT  evd/dim.evd  --key="id"
Lookup DIM_SRT   dim_lkp  evd/dim.evd  --key="id"

8.13 Merge ¶

(since EVL 1.2)

Merge sorted flows or files into one (sorted) output. In the case of only one input flow or file, it simply writes this file to output flow or file.

To merge based on all of the fields, use an empty <key>.

Merge
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

evl merge
is intended for standalone usage, i.e. to be invoked from command line. When <file> is ‘-’, it reads from stdin.

EVD is EVL data definition file, for details see evl-evd(5).

Synopsis ¶

Merge
  <f_in>... <f_out> [<evd>|-d <inline_evd>] -k|--key <key>
  [-c|--check-sort] [-i|--ignore-case]
  [--validate] [-x|--text-input] [-y|--text-output]

evl merge
  [<file>...]  [<evd>|-d <inline_evd>] -k|--key <key>
  [-c|--check-sort] [-i|--ignore-case]
  [--validate] [-x|--text-input] [-y|--text-output]
  [-v|--verbose]

evl merge
  ( --help | --usage | --version )

Options ¶

-c, --check-sort
check if the input is really sorted according to specified key

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be provided. Example: -d 'some_id long sep="|", some_value string sep="\n"'

-i, --ignore-case
be case insensitive for key fields

-k, --key=<key>
merge by this key, where <key> is a comma-separated list of fields with type (either DESC or ASC, default type is ASC). When the <key> is empty, it merges based on the whole record.

--validate
without this option, no fields are checked against data types. With this option, all output fields are checked

-x, --text-input
suppose the input as text, not binary

-y, --text-output
write the output as text, not binary

Standard options: ¶

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples ¶

Merge three (sorted) binary files; the output is text, sorted by ‘input_id’:

evl merge example.evd -k 'input_id' -y input1.bin input2.bin input3.bin
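
Merging already sorted inputs is a standard k-way merge. A minimal Python sketch using the standard library's `heapq.merge` shows the idea; it is an illustration under the assumption that each input is sorted ascending by the key, not a description of EVL internals. `merge_sorted` and the sample records are hypothetical.

```python
import heapq


def merge_sorted(inputs, key):
    """Merge inputs that are each sorted by key into one sorted list."""
    # heapq.merge reads the inputs lazily and never re-sorts them.
    return list(heapq.merge(*inputs, key=key))


input1 = [{"input_id": 1}, {"input_id": 4}]
input2 = [{"input_id": 2}, {"input_id": 3}]
input3 = [{"input_id": 3}, {"input_id": 9}]
merged = merge_sorted([input1, input2, input3], key=lambda r: r["input_id"])
print([r["input_id"] for r in merged])  # [1, 2, 3, 3, 4, 9]
```

If an input is not actually sorted, the result of such a merge is not sorted either, which is the situation the ‘--check-sort’ option is meant to detect.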

8.14 Partition ¶

(since EVL 1.2)

Read the input flow or file and, according to the ‘--key’ or ‘--round-robin’ logic, distribute its records to several output flows or files. The number of partitions depends on the ‘EVL_PARTITIONS’ environment variable and also on the EVL version/edition.

Partition
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

evl partition
is intended for standalone usage, i.e. to be invoked from command line.

EVD is EVL data definition file, for details see evl-evd(5).

Synopsis ¶

Partition
  <f_in> <f_out> (<evd>|-d <inline_evd>)
  (--key=<key> | --round-robin)
  [--validate] [-x|--text-input] [-y|--text-output]

evl partition
  <file_in> <file_out> (<evd>|-d <inline_evd>)
  (--key=<key> | --round-robin)
  [--validate] [-x|--text-input] [-y|--text-output]
  [-v|--verbose]

evl partition
  ( --help | --usage | --version | --max-partitions )

Options ¶

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be provided

-k, --key=<key>
key according to which to distribute data

-m, --max-partitions
return the number of maximal possible partitions

-r, --round-robin
split by round-robin, i.e. simply one record after another to one output flow/file after another

--validate
without this option, no fields are checked against data types. With this option, all output fields are checked

-x, --text-input
suppose the input as text, not binary

-y, --text-output
write the output as text, not binary

Standard options: ¶

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples ¶

To partition a flow in an EVL job:

Read     s3://my_bucket/cust.csv CUST $EVD_CUST
Partition   CUST      CUST_P  $EVD_CUST --round-robin
Map         CUST_P    PROC_M  $EVD_CUST $EVD_PROC $EVM_PROC
Departition PROC_M    PROC_G  $EVD_PROC --round-robin
Write       PROC_G    sftp:///some/path/proc.csv.gz $EVD_PROC
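
The two distribution modes can be contrasted in a short Python sketch. It is illustrative only: the function names are hypothetical, and using a CRC32 hash of the key for ‘--key’ is an assumption made for the example, since this page does not state how EVL maps keys to partitions. The property that matters is that records with the same key land in the same partition.

```python
import zlib


def partition_round_robin(records, partitions):
    """Send record i to partition i modulo the number of partitions."""
    outputs = [[] for _ in range(partitions)]
    for index, record in enumerate(records):
        outputs[index % partitions].append(record)
    return outputs


def partition_by_key(records, partitions, key):
    """Send records with an equal key to the same partition (assumed hash)."""
    outputs = [[] for _ in range(partitions)]
    for record in records:
        digest = zlib.crc32(str(key(record)).encode("utf-8"))
        outputs[digest % partitions].append(record)
    return outputs


print(partition_round_robin(list(range(7)), 3))  # [[0, 3, 6], [1, 4], [2, 5]]
customers = [("alice", 1), ("bob", 2), ("alice", 3), ("carol", 4), ("bob", 5)]
print(partition_by_key(customers, 2, key=lambda r: r[0]))
```

Round-robin gives evenly sized partitions; key-based partitioning keeps related records together, which matters when a later step groups or joins by that key.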

8.15 Sort ¶

(since EVL 1.0)

Take records from stdin or <f_in>, sort them by <key>, and write them to stdout or <f_out>. With the ‘-u’ option, the data is also deduplicated. At the moment only the traditional sort order is used (i.e. like LC_ALL=C), not a national one.

To sort based on all of the fields, use an empty <key>.

Sort
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

evl sort
is intended for standalone usage, i.e. to be invoked from command line and reading records from standard input and writing to standard output.

EVD and EVS are EVL definition files, for details see evl-evd(5) and evl-evs(5).

Synopsis ¶

Sort
  <f_in> <f_out> (<evd>|-d <inline_evd>) -k <key>
  [-u <unique-key> [-t|--keep-first] [--reject=<file>]]
  [-c|--check-sort] [-f|--file-storage] [-i|--ignore-case]
  [--validate] [-x|--text-input] [-y|--text-output]

evl sort
  (<evd>|-d <inline_evd>) -k <key>
  [-u <unique-key> [-t|--keep-first] [--reject=<file>]]
  [-c|--check-sort] [-f|--file-storage] [-i|--ignore-case]
  [--validate] [-x|--text-input] [-y|--text-output]
  [-v|--verbose]

evl sort
  ( --help | --usage | --version )

Options ¶

-c, --check-sort
only check if the input is sorted and fail if not

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be provided. Example: -d 'id int, user_id string enc=iso-8859-1'

-f, --file-storage
store temporary files on disk instead of using memory

-i, --ignore-case
ignore case sensitivity for key fields

-k, --key=<key>
sort by a key, where <key> is a comma-separated list of fields with type (default type is ASC). When the <key> is empty, it sorts based on the whole record. Example: --key='id,user_id DESC,modify_dt ASC'

-r, --reject=<reject_file>
when used with the ‘-u’ option, catch duplicate records into <reject_file>

-t, --keep-first
when deduplicating by ‘--unique-key’, keep the first record of each group

-u, --unique-key=<unique_key>
deduplicate the output by <unique_key>; keep only the last record of each group unless ‘--keep-first’ is specified. Duplicate records can be caught with the ‘-r’ option. Example: -u 'id,user_id'

--validate
without this option, no fields are checked against data types. With this option, all output fields are checked

-x, --text-input
treat the input as text, not binary

-y, --text-output
write the output as text, not binary

Standard options: ¶

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples ¶

Sort the text input by the whole record (i.e. by all fields) and write it to a text output file:
```
evl sort example.evd -k '' -xy <in.txt >out.txt
```
Deduplicate the binary input (for example from another EVL component) by keeping the first record in each group with the same id (with the lowest updated date) and write the result into output.csv and duplicates into duplicates.csv:
```
cat input.bin | \
evl sort -ty -k'id,updated' -u'id' \
  -d'id int sep=",", updated date sep="\n"' -r duplicates.csv >output.csv
```
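For comparison, the same keep-first deduplication can be sketched on text data with standard GNU tools. This is only an analogy, not how ‘evl sort’ is implemented, and the sample records are made up:
```shell
# Sort by id (numeric), then by date, and keep the first record of each id.
# ISO dates sort correctly as plain strings.
deduped=$(printf '2,2021-03-01\n1,2021-02-01\n1,2021-01-01\n' \
  | LC_ALL=C sort -t, -k1,1n -k2,2 \
  | awk -F, '!seen[$1]++')
printf '%s\n' "$deduped"
```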
Check, ignoring case, that the input text file input.txt is sorted by name, and write into file output.bin in binary (i.e. not as text):
```
evl sort -cix -k'name' -d'name string sep="|", personal_id int sep="\n"' \
  <input.txt >output.bin
```

8.16 Sortgroup ¶

(since EVL 2.0)

Given input already sorted by <group_key>, sort the records within each group defined by that <group_key> according to <key>, so that the output is sorted by <group_key>,<key>. At the moment it uses only the traditional sort order (i.e. like LC_ALL=C), not a national one.

Sortgroup
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

evl sortgroup
is intended for standalone usage, i.e. to be invoked from command line and reading records from standard input and writing to standard output.

EVD and EVS are EVL definition files, for details see evl-evd(5) and evl-evs(5).

Synopsis ¶

Sortgroup
  <f_in> <f_out> (<evd>|-d <inline_evd>)
  -g|--group-key=<group_key>
  -k|--key=<key>
  [-c|--check-sort] [-i|--ignore-case]
  [--validate] [-x|--text-input] [-y|--text-output]

evl sortgroup
  (<evd>|-d <inline_evd>)
  -g|--group-key=<group_key>
  -k|--key=<key>
  [-c|--check-sort] [-i|--ignore-case]
  [--validate] [-x|--text-input] [-y|--text-output]
  [-v|--verbose]

evl sortgroup
  ( --help | --usage | --version )

Options ¶

-c, --check-sort
check that the input is really sorted by <group_key>

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be present. Example: ‘-d 'id int, user_id string'’

-g, --group-key=<group_key>
the input is sorted by this key, where <group_key> is a comma-separated list of fields, each with a sort type (default type is ASC). Example: ‘-g 'id,user_id DESC'’

-i, --ignore-case
ignore case sensitivity for key fields

-k, --key=<key>
sort by this key within each group of records with the same <group_key>. <key> is a comma-separated list of fields, each with a sort type (default type is ASC). Example: ‘-k 'modify_dt ASC'’

--validate
without this option, no fields are checked against data types. With this option, all output fields are checked

-x, --text-input
treat the input as text, not binary

-y, --text-output
write the output as text, not binary

Standard options: ¶

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples ¶

Suppose a dataset is already sorted by field ‘customer’. TBA
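
The intended result, output sorted by <group_key>,<key>, can be illustrated on text data with GNU ‘sort’. This is an analogy only, not an ‘evl sortgroup’ invocation, and the ‘customer,amount’ sample records are made up:
```shell
# Input is already sorted by customer (field 1); order each customer's
# records by amount (field 2, numeric).
grouped=$(printf 'alice,30\nalice,10\nbob,5\nbob,1\n' \
  | LC_ALL=C sort -t, -k1,1 -k2,2n)
printf '%s\n' "$grouped"
```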

8.17 Tail ¶

(since EVL 1.1)

The command prints the last <num> records of the input. Without the ‘-n’ option it prints the last 10 records.

Tail
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

evl tail
is intended for standalone usage, i.e. to be invoked from command line.

EVD is EVL data definition file, for details see evl-evd(5).

Synopsis ¶

Tail
  <f_in> <f_out> [<evd>|-d <inline_evd>] [-n [+]<num>] [-s|--skip-parse]
  [--validate] [--skip-bom]
  [ -x|--text-input | --text-input-dos-eol | --text-input-mac-eol ]
  [ -y|--text-output | --text-output-dos-eol | --text-output-mac-eol ]


evl tail
  [<evd>|-d <inline_evd>] [-n [+]<num>] [-s|--skip-parse]
  [--validate] [--skip-bom]
  [ -x|--text-input | --text-input-dos-eol | --text-input-mac-eol ]
  [ -y|--text-output | --text-output-dos-eol | --text-output-mac-eol ]
  [-v|--verbose]

evl tail
  ( --help | --usage | --version )

Options ¶

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be present. Example: -d 'id int, user_id string enc=iso-8859-1'

-n, --records=[+]<num>
output the last <num> records instead of the default last 10; or use -n +<num> to output starting with record <num>

-s, --skip-parse
with this option the component does not parse all fields but ‘jumps’ to the record separator, i.e. the separator of the last field. Be careful with this option: it is particularly useful for ‘csv’ files, for example when you want to skip a strangely formatted header, but it may give wrong results when some other field is separated by the same character as the last one.

--skip-bom
skip the UTF-8 BOM (byte order mark), i.e. the bytes EF BB BF, at the beginning of the input. Windows usually adds it to files in UTF-8 encoding

--validate
without this option, no fields are checked against data types. With this option, all output fields are checked

-x, --text-input
treat the input as text, not binary

--text-input-dos-eol
treat the input as text with CRLF as end of line

--text-input-mac-eol
treat the input as text with CR as end of line

-y, --text-output
write the output as text, not binary

--text-output-dos-eol
produce the output as text with CRLF as end of line

--text-output-mac-eol
produce the output as text with CR as end of line

Standard options: ¶

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples ¶

Print only the last 10 records to stdout:
```
evl tail example.evd -xy <in.txt
```

Read the binary input and skip the first 2 records without parsing them (i.e. they need not follow the data structure defined by the evd):

cat input.bin | evl tail -sy -n+3 \
                  -d'id int sep=",", updated date sep="\n"' \
                  > output.txt
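
The ‘-n +<num>’ form mirrors the convention of GNU ‘tail’, where ‘+3’ means to start output at the third record, i.e. to skip the first two. A small sketch with GNU ‘tail’ on plain text lines (sample data made up):
```shell
# GNU tail: start output at line 3, i.e. skip the first two lines.
rest=$(printf 'header1\nheader2\nrow1\nrow2\n' | tail -n +3)
printf '%s\n' "$rest"
```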

8.18 Tee ¶

(since EVL 1.0)

Replicate <f_in> into multiple <f_out>.

Tee
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

There is no standalone version of this component, as one can use the standard UNIX command ‘tee’.

Synopsis ¶

Tee
  <f_in> <f_out>...

evl tee
  ( --help | --usage | --version )

Options ¶

Standard options: ¶

--help
print this help and exit

--usage
print short usage information and exit

--version
print version and exit

Examples ¶

Replicate to output flows (or files) A,B,C,D,E,F:

Tee IN_FLOW A B C D E F
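
On the command line, the standard ‘tee’ command mentioned above does the same job. A minimal sketch, in which the temporary directory and the file names A, B and C are only examples:
```shell
# Replicate one input stream into three files with the standard tee command.
workdir=$(mktemp -d)
printf '1\n2\n3\n' | tee "$workdir/A" "$workdir/B" > "$workdir/C"
```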

8.19 Trash ¶

(since EVL 1.0)

Send <f_in> to /dev/null. Try to avoid using it in a production environment, as redirecting to /dev/null also costs resources.

Trash
is to be used in EVS job structure definition file. <f_in> is either input file or flow name, both can be partitioned.

There is no standalone version of this component as you can always use >/dev/null.

EVS is EVL job structure definition file, for details see evl-evs(5).

Synopsis ¶

Trash
  <f_in>...

evl trash
  ( --help | --usage | --version )

Options ¶

Standard options: ¶

--help
print this help and exit

--usage
print short usage information and exit

--version
print version and exit

8.20 Uniq ¶

(since EVL 2.1)

Read stdin or <f_in> and write to stdout or <f_out> the last record of each group specified by the <key>. The input must be sorted according to this key.

Uniq
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

evl uniq
is intended for standalone usage, i.e. to be invoked from command line and reading records from standard input and writing to standard output.

EVD and EVS are EVL definition files, for details see evl-evd(5) and evl-evs(5).

Synopsis ¶

Uniq
  <f_in> <f_out> (<evd>|-d <inline_evd>) -k <key> [-c|--check-sort]
  [-i|--ignore-case] [--reject=<file>] [-t|--keep-first]
  [--validate] [-x|--text-input] [-y|--text-output]

evl uniq
  (<evd>|-d <inline_evd>) -k <key> [-c|--check-sort]
  [-i|--ignore-case] [--reject=<file>] [-t|--keep-first]
  [--validate] [-x|--text-input] [-y|--text-output]
  [-v|--verbose]

evl uniq
  ( --help | --usage | --version )

Options ¶

-c, --check-sort
check if the input is sorted and fail if not

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be present. Example: -d 'id int, user_id string enc=iso-8859-1'

-i, --ignore-case
ignore case sensitivity for key fields

-k, --key=<key>
deduplicate by a key, where <key> is a comma-separated list of fields, each with a sort type (default type is ASC). Example: -k 'id,user_id DESC,modify_dt ASC'

-r, --reject=<reject_file>
catch duplicate records into <reject_file>

-t, --keep-first
keep the first record of the group instead of the last one

--validate
without this option, no fields are checked against data types. With this option, all output fields are checked

-x, --text-input
treat the input as text, not binary

-y, --text-output
write the output as text, not binary

Standard options: ¶

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit

Examples ¶

Deduplicate by all fields and write into a text output file:
```
evl uniq example.evd -k'' -xy < in.txt > out.txt
```
Deduplicate the binary input (for example from another EVL component), already sorted by id and updated, by keeping the first record in each group with the same id (i.e. the one with the lowest updated date), and write the result into output.csv and duplicates into duplicates.csv:
```
cat input.bin | evl uniq -ty -k'id' \
     -d'id int sep=",", updated date sep="\n"' \
     -r duplicates.csv > output.csv
```

Deduplicate the input text file input.txt by name, ignoring case and checking that the input is sorted, and write into file output.bin in binary (i.e. not as text):

evl uniq -cix --key="name" \
         -d 'name string sep="|", personal_id int sep="\n"' \
         < input.txt > output.bin
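
The default rule, keeping the last record of each group in key-sorted input, can be sketched with ‘awk’ on text data. This only illustrates the rule; it is not how ‘evl uniq’ is implemented, and the sample records are made up:
```shell
# Keep the last record of each run of equal keys (field 1) in sorted input.
last_of_group=$(printf '1,a\n1,b\n2,c\n3,d\n3,e\n' | awk -F, '
  NR > 1 && $1 != prev { print rec }   # key changed: emit the previous group
  { prev = $1; rec = $0 }              # remember the current record
  END { if (NR > 0) print rec }        # emit the final group
')
printf '%s\n' "$last_of_group"
```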

8.21 Validate ¶

(since EVL 1.1)

Fail when a value of an invalid data type appears, unless the ‘--limit’ option specifies otherwise.

Validate
is to be used in EVS job structure definition file. <f_in> and <f_out> are either input and output file or flow name.

evl validate
is intended for standalone usage, i.e. to be invoked from command line and reading records from standard input and writing to standard output.

EVD and EVS are EVL definition files, for details see evl-evd(5) and evl-evs(5).

Synopsis ¶

Validate
  <f_in> <f_out> (<evd>|-d <inline_evd>)
  [-l|--limit <num>] [-y|--text-output]

evl validate
  (<evd>|-d <inline_evd>)
  [-l|--limit <num>] [-y|--text-output]
  [-v|--verbose]

evl validate
  ( --help | --usage | --version )

Options ¶

-l, --limit=<num>
fail after reaching <num> invalid records. If <num> is ‘0’, it never fails. The default value is ‘1’, i.e. fail immediately after the first invalid record.

-y, --text-output
write the output as text, not binary

Standard options: ¶

--help
print this help and exit

--usage
print short usage information and exit

-v, --verbose
print to stderr info/debug messages of the component

--version
print version and exit
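
The ‘--limit’ rule can be sketched in plain shell. The function ‘check_limit’ below is a hypothetical stand-in, not part of EVL: it assumes each input line is one record with a single integer field, counts the records that are not valid integers, and fails once the count reaches the limit (‘0’ meaning never fail):
```shell
# Hypothetical helper, not part of EVL: mimic the --limit counting rule
# for a single integer column read from stdin.
check_limit() {
  awk -v limit="$1" '
    $0 !~ /^-?[0-9]+$/ { bad++ }          # count invalid records
    limit > 0 && bad >= limit { exit 1 }  # fail once the limit is reached
  '
}

# One invalid record with the default limit of 1 makes the check fail.
printf '1\nx\n3\n' | check_limit 1 || echo "validation failed"
```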

8.22 Watcher ¶

(since EVL 1.2)

This component writes records passing through the <flow> into <file> in text format.

It works only when the variable ‘EVL_WATCHER’ is set to ‘1’; otherwise it does nothing. One can use it for debugging data in a ‘DEV’ or ‘TEST’ environment and keep it switched off in ‘PROD’.

If the <file> is not specified with a full path, it is written into the directory defined by the ‘EVL_WATCHER_DIR’ environment variable, which defaults to the ‘watcher’ subfolder of the current project.
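
A minimal sketch of switching Watcher on for a debugging session, assuming the variables are exported in the shell that starts the job. The directory path is only an example, and creating it in advance is a precaution rather than a documented requirement:
```shell
# Enable Watcher output for a debugging session (DEV/TEST only).
export EVL_WATCHER=1
# Optional: example target directory instead of the default 'watcher' subfolder.
export EVL_WATCHER_DIR="${TMPDIR:-/tmp}/evl_watcher"
mkdir -p "$EVL_WATCHER_DIR"
```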

EVD is EVL data definition file, for details see evl-evd(5).

Synopsis ¶

Watcher
  <flow> <file> (<evd>|-d <inline_evd>) [-x|--text-input]

evl watcher
  ( --help | --usage | --version )

Options ¶

-d, --output-definition=<inline_evd>
either this option or the file <evd> must be present. Example: ‘-d 'user_sum long'’

-x, --text-input
treat the input as text, not binary

Standard options: ¶

--help
print this help and exit

--usage
print short usage information and exit

--version
print version and exit

Examples ¶

In EVL job (‘evs’ file):

Sort     FLOW_01 FLOW_02 some.evd --key='id'
Watcher  FLOW_02 sorted.csv some.evd
