8 Basic Components ¶
Most of these basic components follow standard GNU/Linux commands, so their purpose should be immediately obvious.
Standard ETL components ¶
- Assign
- Cat
- Cmd
- Comp
- Cut
- Departition
- Echo
- Filter
- Gather
- Generate
- Head
- Lookup
- Merge
- Partition
- Sort
- Sortgroup
- Tail
- Tee
- Trash
- Uniq
- Validate
- Watcher
8.1 Assign ¶
(since EVL 1.2)
Assign the content of the input flow or file <f_in> to the shell variable <varname>, which
is then exported into the environment. Don’t forget to apply ‘--text-output’ on the preceding
component to get text content in <varname>.
This component doesn’t work for partitioned flows.
Assign
is to be used in EVS job structure definition file. <f_in> is either input file or flow
name.
There is no standalone version of this component as you can use standard Bash behaviour for this purpose. For example:
VARNAME=$(evl cat filename some.evd --text-output)
EVS is EVL job structure definition file, for details see evl-evs(5).
Synopsis ¶
Assign
<f_in> <varname>
evl assign
( --help | --usage | --version )
Options ¶
Standard options: ¶
--help
print this help and exit
--usage
print short usage information and exit
--version
print version and exit
Examples ¶
- An EVL job (an ‘evs’ file) which reads the content of a binary file ‘hwm.bin’ into the variable ‘HWM’:
    Read hwm.bin FLOW_HWM evd/some.evd --text-output
    Assign FLOW_HWM HWM
  Such a value can then be used (after the ‘Wait’ component!) within a mapping by:
    static int hwm = getenv_int("HWM",0); // use 0 when $HWM is empty
    *out->incremental_id = ++hwm;
- To get a value from a text file:
    Assign hwm.txt HWM
- To assign flow content into a ‘NATCO’ variable:
    Map FLOW_01 FLOW_02 in.evd out.evd map.evm --text-output
    Assign FLOW_02 NATCO
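The high-water-mark pattern from the first example can be modelled outside EVL. The following Python sketch is only an illustration: `assign` and this Python `getenv_int` are hypothetical helpers that mimic exporting a text value and reading it back as an integer with a fallback, not EVL's implementation.

```python
import os


def assign(text: str, varname: str) -> None:
    """Export the text content of a flow or file as an environment variable."""
    os.environ[varname] = text


def getenv_int(name: str, default: int) -> int:
    """Return the variable as an integer, or `default` when it is unset or empty."""
    value = os.environ.get(name, "")
    return int(value) if value.strip() else default


# The text output of the preceding component, e.g. the content of hwm.bin.
assign("41\n", "HWM")

hwm = getenv_int("HWM", 0)   # falls back to 0 when $HWM is empty
next_id = hwm + 1            # counterpart of ++hwm in the mapping
```

The fallback matters on the very first run, when no high-water mark has been stored yet.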
8.2 Cat ¶
(since EVL 1.0)
Concatenate flows or files.
Cat
is to be used in EVS job structure definition file. <f_in> and <f_out> are either
input and output file or flow name.
evl cat
is intended for standalone usage, i.e. to be invoked from command line.
EVD is EVL data definition file, for details see evl-evd(5).
Synopsis ¶
Cat
<f_in>... <f_out> (<evd>|-d <inline_evd>)
[ --validate ]
[ -x|--text-input | --text-input-dos-eol | --text-input-mac-eol ]
[ -y|--text-output | --text-output-dos-eol | --text-output-mac-eol ]
evl cat
[<file>...] (<evd>|-d <inline_evd>)
[ --validate ]
[ -x|--text-input | --text-input-dos-eol | --text-input-mac-eol ]
[ -y|--text-output | --text-output-dos-eol | --text-output-mac-eol ]
[ -v|--verbose ]
evl cat
( --help | --usage | --version )
Options ¶
-d, --data-definition=<inline_evd>
either this option or the file <evd> must be present. Example:
‘-d 'id int, user_id string enc=iso-8859-1'’
--validate
without this option, no fields are checked against data types. With this option, all output fields
are checked
-x, --text-input
treat the input as text, not binary
--text-input-dos-eol
treat the input as text with CRLF as end of line
--text-input-mac-eol
treat the input as text with CR as end of line
-y, --text-output
write the output as text, not binary
--text-output-dos-eol
produce the output as text with CRLF as end of line
--text-output-mac-eol
produce the output as text with CR as end of line
Standard options: ¶
--help
print this help and exit
--usage
print short usage information and exit
-v, --verbose
print to stderr info/debug messages of the component
--version
print version and exit
Examples ¶
Print to stdout binary input in text format:
evl cat example.evd -y <input.bin
8.3 Cmd ¶
(since EVL 1.2)
Basically, it calls:
cat <f_in> | <command> > <f_out>
When <f_in> is empty, then it runs:
<command> > <f_out>
and when <f_out> is empty:
cat <f_in> | <command>
<command> can be also a pipeline.
If <f_in> is partitioned, then <command> is applied to all partitions and the
output <f_out> is kept partitioned as well.
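The three cases above (both ends given, empty <f_in>, empty <f_out>) can be sketched in Python with the standard `subprocess` module. This is a minimal model for a single, non-partitioned flow; `run_cmd` is a hypothetical helper name and the sketch assumes a POSIX shell with `sort` and `uniq` available.

```python
import subprocess
import tempfile
from pathlib import Path


def run_cmd(f_in: str, f_out: str, command: str) -> None:
    """Run `command` through the shell, wiring f_in/f_out when they are non-empty."""
    stdin = open(f_in, "rb") if f_in else None      # empty f_in: inherit our stdin
    stdout = open(f_out, "wb") if f_out else None   # empty f_out: inherit our stdout
    try:
        subprocess.run(command, shell=True, stdin=stdin, stdout=stdout, check=True)
    finally:
        for handle in (stdin, stdout):
            if handle is not None:
                handle.close()


workdir = Path(tempfile.mkdtemp())
source = workdir / "in.txt"
target = workdir / "out.txt"
source.write_text("3\n1\n1\n2\n")

# <command> may itself be a pipeline.
run_cmd(str(source), str(target), "sort -n | uniq")
result = target.read_text()
```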
Synopsis ¶
Cmd
<f_in> <f_out> <command>
evl cmd
( --help | --usage | --version )
Options ¶
Standard options: ¶
--help
print this help and exit
--usage
print short usage information and exit
-v, --verbose
print to stderr info/debug messages of the component
--version
print version and exit
Examples ¶
- Write ‘repeat some error message’ 10 times to STDERR and into the EVL job log:
    Cmd "" /dev/stderr "yes repeat some error message | head"
- Suppose ‘SOME_FLOW’ provides integers, one per line; then the median can be computed by R and written into ‘/some/file’:
    Cmd SOME_FLOW /some/file "Rscript median.R"
  The file median.R might look like this:
    f <- file('stdin'); open(f); x <- c();
    while ( length( line <- readLines(f) ) > 0 ) x <- c(x,as.integer(line));
    write(median(x), stdout());
8.4 Component ¶
(since EVL 1.0)
Run <component> from the project’s evc directory with arguments <comp_arg>. In the
<component> these arguments are available as the array elements ‘COMP_ARG[1]’,
‘COMP_ARG[2]’, and so on; ‘COMP_ARG[0]’ is the component’s name.
When the <component> is not in the current project’s subdirectory ‘evc/’, the folder
‘$EVL_EVC_DIR/’ is tried.
You can also specify the full path to the component. Check the examples.
Flow names within a component have unique prefixes, so they cannot conflict with those in the job.
However, if you need to connect output flow(s) of the component, you need to use the variable
‘$COMP_FLOW’, which is set by the component to such a prefix. A flow from the component,
e.g. ‘FLOW_IN_COMP’, can then be read in the parent job as ‘$COMP_FLOW.FLOW_IN_COMP’. Check
the examples.
For input flows there is a variable ‘$PARENT_FLOW’ which can be used in the component. The parent
flow ‘FLOW_INTO_COMP’ can be referenced within a component as
‘$PARENT_FLOW.FLOW_INTO_COMP’. Check the examples for better understanding.
Comp
is to be used in EVS job structure definition file.
evl comp
is intended for standalone usage, i.e. to be invoked from command line and reading records from
standard input and writing to standard output.
EVS is EVL job structure definition file, for details see evl-evs(5).
Synopsis ¶
Comp
<component> [<comp_arg>...]
evl comp
( --help | --usage | --version )
Options ¶
Standard options: ¶
--help
print this help and exit
--usage
print short usage information and exit
--version
print version and exit
Examples ¶
- Run the custom component ‘evc/prepare_lkp.evc’ with neither input nor output:
    Comp prepare_lkp
- Run a component from the EVL Data Hub template project with three arguments:
    Comp $EVL_TEMPLATE_DIR/data-hub/evc/scd2_read_increments.evc party.*.csv evd/party.evd id
- Reading output from the component. Suppose you have a custom generic component ‘evc/read_files.evc’ which does some magic with json files, e.g.:
    jsons="${COMP_ARG[1]}"
    evd="${COMP_ARG[2]}"
    key="${COMP_ARG[3]}"
    Read "$jsons" JSONS "$evd"
    Tee JSONS A B "$evd" --key="$key"
  And you need to connect these output flows ‘A’ and ‘B’ into your job, e.g.:
    Comp read_files /landing/users.*.json evd/users.evd surname
    Sort $COMP_FLOW.A SORTED evd/users.evd
    Write $COMP_FLOW.B users.csv evd/users.evd
    ...
- Writing a flow to the component. Suppose you have a custom component ‘evc/write_log.evc’, e.g.:
    flow_in="${COMP_ARG[1]}"
    Write $flow_in some_file.log -d "X string" --text-output
  In the job it would look like this:
    Tail XXX LOG evd/XXX.evd -n 100
    Comp write_log.evc LOG
  Alternatively, the component could look like this:
    Write $PARENT_FLOW.LOG some_file.log -d "X string" --text-output
8.5 Cut ¶
(since EVL 1.0)
Remove columns from input records. Use this component when you want to reduce the number of columns.
Cut
is to be used in EVS job structure definition file. <f_in> and <f_out> are either
input and output file or flow name.
evl cut
is intended for standalone usage, i.e. to be invoked from command line and read records from
standard input and write to standard output.
EVD is EVL data definition file, for details see evl-evd(5).
Synopsis ¶
Cut
<f_in> <f_out> (<evd_in>|-D <inline_evd>) (<evd_out>|-d <inline_evd>)
[--validate] [-x|--text-input] [-y|--text-output]
evl cut
(<evd_in>|-D <inline_evd>) (<evd_out>|-d <inline_evd>)
[--validate] [-x|--text-input] [-y|--text-output]
[-v|--verbose]
evl cut
( --help | --usage | --version )
Options ¶
-D, --input-definition=<inline_evd>
either this option or the file <evd_in> must be present. Example: ‘-D 'id int, user_id string'’
-d, --output-definition=<inline_evd>
either this option or the file <evd_out> must be present. Example: ‘-d 'user_sum long'’
--validate
without this option, no fields are checked against data types. With this option, all output fields
are checked
-x, --text-input
treat the input as text, not binary
-y, --text-output
write the output as text, not binary
Standard options: ¶
--help
print this help and exit
--usage
print short usage information and exit
-v, --verbose
print to stderr info/debug messages of the component
--version
print version and exit
Examples ¶
- Print to stdout only the integer field ‘id’:
    evl cut example.evd -d'id int' -xy <in.txt
8.6 Departition ¶
(since EVL 1.2)
Gather or merge partitions into one output flow or file. When ‘-k <key>’ is specified, each
input partition is assumed to be sorted and the output is again sorted (i.e. merged). Without
‘-k <key>’, input partitions are gathered in round-robin fashion. Applied to only one partition,
it simply writes the input to the output.
EVD is EVL data definition file, for details see evl-evd(5).
Synopsis ¶
Departition
<f_in>... <f_out> (<evd>|-d <inline_evd>)
(--key=<key> | --round-robin)
[--validate] [-x|--text-input] [-y|--text-output]
evl departition
<file_in> <file_out> (<evd>|-d <inline_evd>)
(--key=<key> | --round-robin)
[--validate] [-x|--text-input] [-y|--text-output]
[-v|--verbose]
evl departition
( --help | --usage | --version )
Options ¶
-d, --data-definition=<inline_evd>
either this option or the file <evd> must be present. Example: ‘-d 'id int, user_id string enc=iso-8859-1'’
-k, --key=<key>
merge partitioned flows/files according to the key, so the output is sorted by this key
-r, --round-robin
gather in round-robin fashion
--validate
without this option, no fields are checked against data types. With this option, all output fields
are checked
-x, --text-input
treat the input as text, not binary
-y, --text-output
write the output as text, not binary
Standard options: ¶
--help
print this help and exit
--usage
print short usage information and exit
-v, --verbose
print to stderr info/debug messages of the component
--version
print version and exit
Examples ¶
- To departition a partitioned flow in the EVL job:
    Read gs://my_bucket/cust.csv CUST $EVD_CUST
    Partition CUST CUST_P $EVD_CUST --round-robin
    Map CUST_P PROC_M $EVD_CUST $EVD_PROC $EVM_PROC
    Departition PROC_M PROC_G $EVD_PROC --round-robin
    Write PROC_G gdrive://proc.xlsx $EVD_PROC
8.7 Echo ¶
(since EVL 2.0)
Write <string> into <f_out>. This component doesn’t produce partitioned flow.
‘Echo’ is to be used in EVS job structure definition file.
<f_out> is either output file or flow name.
There is no standalone version of this component as you can use standard ‘echo’.
EVS is EVL job structure definition file, for details see evl-evs(5).
Synopsis ¶
Echo
<string> <f_out> [-e] [-n]
evl echo
( --help | --usage | --version )
Options ¶
-n
do not output the trailing newline (standard Bash echo option)
-e
enable interpretation of backslash escapes (standard Bash echo option)
Standard options: ¶
--help
print this help and exit
--usage
print short usage information and exit
--version
print version and exit
Examples ¶
- An EVL job (specified in an ‘evs’ file) which runs a simple select statement on a PostgreSQL table:
    Echo "select max(id) from some_db.some_table;" SELECT
    RunPG SELECT MAX_ID
- To add two hardcoded records to the end of a flow:
    ... ... FLOW -d "s string"
    Echo "Some string footer,\nwith two lines." FOOTER -e
    Cat FLOW FOOTER -d "s string"
    ...
8.8 Filter ¶
(since EVL 1.0)
Filter records by the <condition>. Records for which the <condition> is false are
forwarded to a reject file or flow, if specified.
In many cases it is better to filter records in the ‘Map’ component using the
‘discard()’ function. Having a ‘Filter’ component right before or after a ‘Map’ is not
optimal for performance. Check ‘man evl-map’ for details.
Using ‘Filter’ right after a ‘Read’ component is usually not optimal for performance either. It
is usually better to shift the filtering to the source, for example to the database. Check the
option ‘--where’ of the ‘Read’ component for details.
Filter
is to be used in EVS job structure definition file. <f_in> and <f_out> are either
input and output file or flow name.
evl filter
is intended for standalone usage, i.e. to be invoked from command line and read records from
standard input and write to standard output.
EVD is EVL data definition file, for details see evl-evd(5).
Synopsis ¶
Filter
<f_in> <f_out> (<evd>|-d <inline_evd>) <condition>
[-r|--reject=<f_out>]
[-x|--text-input] [-y|--text-output]
evl filter
(<evd>|-d <inline_evd>) <condition>
[-r|--reject=<f_out>]
[-x|--text-input] [-y|--text-output]
[-v|--verbose]
evl filter
( --help | --usage | --version )
Options ¶
-d, --data-definition=<inline_evd>
either this option or the file <evd> must be present. Example: ‘-d 'id int, user_id string enc=iso-8859-1'’
-r, --reject=<f_out>
catch rejected records into a file or flow
-x, --text-input
treat the input as text, not binary
-y, --text-output
write the output as text, not binary
Standard options: ¶
--help
print this help and exit
--usage
print short usage information and exit
-v, --verbose
print to stderr info/debug messages of the component
--version
print version and exit
Examples ¶
Command line invocation examples: ¶
- To print to stdout only records from the file ‘ID.txt’ with a value of id less than 100:
    evl filter -d 'id int' -xy '*id<100' < ID.txt
  Field ‘id’ is a pointer, so to get the value, ‘*id’ must be used.
- Print to stdout only records from the file ‘IDs.csv’ where ‘id1’ is different from
  ‘id2’; records with the same ids are sent into ‘same_IDs.csv’:
    evl filter -d 'id1 int sep=",", id2 int' -xy -r same_IDs.csv \
      '*id1 != *id2' < IDs.csv
EVL job examples: ¶
- In an ‘evs’ file:
    ... ... SOURCE evd/sample.evd
    Filter SOURCE OUTPUT evd/sample.evd "price && *currency == \"EUR\""
    ... OUTPUT ... evd/sample.evd
  This example filters out records with NULL ‘price’ and with a currency other than ‘EUR’.
  (‘price’ is a pointer, so simply specifying ‘price’ in the condition means
  ‘price != nullptr’.)
- If there is a ‘Read’ component right before the ‘Filter’, consider using the option ‘--where’ instead, because in such a case the filter is shifted to the source DB, e.g.:
    SRC_HOST_URI="postgres://tech_etl@pg_server:5432"
    SRC_PATH="dwh_db?schema=public&table=invoices"
    Read $SRC_HOST_URI/$SRC_PATH INVOICES_EUR evd/invoices.evd \
      --where "price is not null AND currency = 'EUR'"
    Map INVOICES_EUR EUR_MAP evd/invoices.evd ...
  will run the query in the PostgreSQL database with the where condition:
    WHERE price is not null AND currency = 'EUR'
  One can also use EVL notation with this ‘--where’ option, e.g.:
    SRC_HOST_URI="postgres://tech_etl@pg_server:5432"
    SRC_PATH="dwh_db?schema=public&table=invoices"
    Read $SRC_HOST_URI/$SRC_PATH INVOICES_EUR evd/invoices.evd \
      --where 'price && *currency == "EUR"'
    Map INVOICES_EUR EUR_MAP evd/invoices.evd ...
  so that it also works when reading a file:
    Read data/invoices.csv INVOICES_EUR evd/invoices.evd \
      --where 'price && *currency == "EUR"'
    Map INVOICES_EUR EUR_MAP evd/invoices.evd ...
  in which case it is internally the same as:
    Read data/invoices.csv INVOICES_SRC evd/invoices.evd
    Filter INVOICES_SRC INVOICES_EUR evd/invoices.evd \
      'price && *currency == "EUR"'
    Map INVOICES_EUR EUR_MAP evd/invoices.evd ...
- And using ‘Filter’ to split a flow:
    ... ... INV evd/invoices.evd
    Filter INV EUR evd/invoices.evd -r NONEUR '*currency == "EUR"'
    Sort EUR EUR_SRT evd/invoices.evd --key "price"
    Sort NONEUR NONEUR_SRT evd/invoices.evd --key "currency,price"
    ...
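The split into accepted and rejected records can be modelled in a few lines of Python. This is a sketch of the semantics only, with NULL fields represented as `None`; `filter_records` and the sample invoices are illustrative names, not part of EVL.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, object]


def filter_records(
    records: Iterable[Record], condition: Callable[[Record], bool]
) -> Tuple[List[Record], List[Record]]:
    """Return (accepted, rejected): records where the condition holds, and the rest."""
    accepted: List[Record] = []
    rejected: List[Record] = []
    for record in records:
        (accepted if condition(record) else rejected).append(record)
    return accepted, rejected


invoices = [
    {"price": 10, "currency": "EUR"},
    {"price": None, "currency": "EUR"},   # NULL price
    {"price": 7, "currency": "USD"},
]

# Counterpart of 'price && *currency == "EUR"': the price must not be NULL.
eur, non_eur = filter_records(
    invoices, lambda r: r["price"] is not None and r["currency"] == "EUR"
)
```

Every input record ends up in exactly one of the two outputs, which is what makes the ‘-r’ option usable for splitting a flow.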
8.9 Gather ¶
(since EVL 1.2)
Gather several input flows or files into one output flow or file in round-robin fashion.
Gather
is to be used in EVS job structure definition file. <f_in> and <f_out> are either
input and output file or flow name.
evl gather
is intended for standalone usage, i.e. to be invoked from the command line. When <file> is ‘-’,
it reads from stdin.
EVD is EVL data definition file, for details see evl-evd(5).
Synopsis ¶
Gather
<f_in>... <f_out> (<evd>|-d <inline_evd>)
[--validate] [-x|--text-input] [-y|--text-output]
evl gather
[<file>...] (<evd>|-d <inline_evd>)
[--validate] [-x|--text-input] [-y|--text-output]
[-v|--verbose]
evl gather
( --help | --usage | --version )
Options ¶
-d, --data-definition=<inline_evd>
either this option or the file <evd> must be present. Example: ‘-d 'id int, user_id string enc=iso-8859-1'’
--validate
without this option, no fields are checked against data types. With this option, all output fields
are checked
-x, --text-input
treat the input as text, not binary
-y, --text-output
write the output as text, not binary
Standard options: ¶
--help
print this help and exit
--usage
print short usage information and exit
-v, --verbose
print to stderr info/debug messages of the component
--version
print version and exit
Examples ¶
- The following command:
    evl gather file.a file.b file.c file.evd -xy
  prints to stdout the first record of ‘file.a’, then the first record of ‘file.b’, then the first record of ‘file.c’, then the second records, and so on.
- To gather a partitioned flow in the EVL job:
    Read s3://my_bucket/cust.csv CUSTOMERS $EVD_CUST
    Partition CUSTOMERS CUST_P $EVD_CUST --round-robin
    Map CUST_P PROC_M $EVD_CUST $EVD_PROC $EVM_PROC
    Gather PROC_M PROC_G $EVD_PROC
    Write PROC_G sftp:///some/path/proc.csv.gz $EVD_PROC
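The round-robin order described in the first example can be sketched in Python as follows. `gather_round_robin` is a hypothetical name, and the handling of inputs of unequal length (an exhausted input is simply skipped) is an assumption of this sketch; the text above does not specify it.

```python
from itertools import zip_longest

_MISSING = object()   # marks positions of inputs that are already exhausted


def gather_round_robin(*inputs):
    """Yield the first record of each input, then the second of each, and so on."""
    for row in zip_longest(*inputs, fillvalue=_MISSING):
        for record in row:
            if record is not _MISSING:
                yield record


gathered = list(gather_round_robin(["a1", "a2", "a3"], ["b1"], ["c1", "c2"]))
```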
8.10 Generate ¶
(since EVL 1.3)
According to the data definition (evd file), generates records to stdout or to an output flow or file.
EVD is EVL data definition file, for details see evl-evd(5).
When no <config_file> is specified: ¶
Number data types
values from the whole range of given data type are randomly generated
Date, timestamp
values between 1970-01-01 and 2199-12-31 are randomly generated
String
random characters [a-zA-Z0-9] of the length between 0 and 10 are generated
Vector
random number of elements between 0 and 10 are generated
When <config_file> in JSON format is specified: ¶
Number data types
range, values, probability of nulls
Date, timestamp
range, values, probability of nulls
String
range, values, min-length, max-length, probability of nulls
Vector
range, values, min-elements, max-elements, probability of nulls
When both a probability of nulls and values containing null are specified, only the probability is used. When range(s) and values overlap, it has no effect on the probability: all values have the same probability of being generated. See the JSON example below for details.
Synopsis ¶
Generate
<f_out> (<evd>|-d <inline_evd>) [<config_file>]
[-n|--records <num>] [-y|--text-output]
evl generate
(<evd>|-d <inline_evd>) [<config_file>]
[-n|--records <num>] [-y|--text-output]
[-v|--verbose]
evl generate
( --help | --usage | --version )
Options ¶
-d, --data-definition=<inline_evd>
either this option or the file <evd> must be present. Example: ‘-d 'id int, user_id string enc=iso-8859-1'’
-n, --records=<num>
generate <num> records instead of the default single record
-y, --text-output
write the output as text, not binary
Standard options: ¶
--help
print this help and exit
--usage
print short usage information and exit
-v, --verbose
print to stderr info/debug messages of the component
--version
print version and exit
Examples ¶
- Print to stdout one random uchar:
    evl generate -d 'value uchar' -y
- Example of a config JSON file:
{
"int_field": {
"values": [100, 200, 500],
"range": { "min": 0, "max": 10 },
"range": { "min": 50, "max": 60 },
"null": 0.1
},
"float_field": {
"range": { "min": 0, "max": 100 }
},
"date_field": {
"values": [ null, "2018-03-07", "2018-03-08" ]
},
"struct_field.string_field1": {
"min-length": 10,
"max-length": 20
},
"struct_field.string_field2": {
"values": ["abc", "def", "ghi", "jkl"]
},
"struct_field.decimal_field": {
"range": { "min": "0.00", "max": "100.00" }
},
"vector_field": {
"min-elements": 2,
"max-elements": 5
},
"vector_field[]": {
"range": { "min": "2018-03-07 05:00:00", "max": "2018-03-07 14:00:00" }
}
}
where the corresponding evd is:
int_field int sep="|" null=""
float_field float sep="|"
date_field date sep="|" null=""
struct_field struct sep="|"
  string_field1 string sep=";"
  string_field2 string sep=";"
  decimal_field decimal(5.2) sep=";"
vector_field vector sep="\n"
  timestamp sep=","
For the ‘int_field’ it will randomly generate the values 0,1,...,10,50,...,60,100,200,500, but in 10% of cases a ‘NULL’ value will be generated.
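The rule for ‘int_field’ (union of values and inclusive ranges, each candidate equally likely, NULL with the given probability) can be sketched in Python. `make_int_generator` is a hypothetical helper and the sketch models only the stated rule, not EVL's actual generator.

```python
import random


def make_int_generator(values, ranges, null_probability, seed=None):
    """Build a generator function following the 'values' + 'range' + 'null' rule."""
    rng = random.Random(seed)
    candidates = set(values)
    for low, high in ranges:
        candidates.update(range(low, high + 1))   # ranges are inclusive
    pool = sorted(candidates)                      # overlaps collapse: uniform choice

    def generate():
        if rng.random() < null_probability:
            return None                            # a NULL value
        return rng.choice(pool)

    return generate


generate_int_field = make_int_generator(
    values=[100, 200, 500], ranges=[(0, 10), (50, 60)], null_probability=0.1, seed=1
)
sample = [generate_int_field() for _ in range(1000)]
```

Building a set first is what makes overlapping ranges and values irrelevant to the probability of each value.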
8.11 Head ¶
(since EVL 1.1)
Prints the first <num> records of the input to the output. Without the option ‘-n’, it prints the first 10
records.
Head
is to be used in EVS job structure definition file. <f_in> and <f_out> are either
input and output file or flow name.
evl head
is intended for standalone usage, i.e. to be invoked from command line.
EVD is EVL data definition file, for details see evl-evd(5).
Synopsis ¶
Head
<f_in> <f_out> [<evd>|-d <inline_evd>] [-n [-]<num>] [-s|--skip-parse]
[--validate] [--skip-bom]
[ -x|--text-input | --text-input-dos-eol | --text-input-mac-eol ]
[ -y|--text-output | --text-output-dos-eol | --text-output-mac-eol ]
evl head
[<evd>|-d <inline_evd>] [-n [-]<num>] [-s|--skip-parse]
[--validate] [--skip-bom]
[ -x|--text-input | --text-input-dos-eol | --text-input-mac-eol ]
[ -y|--text-output | --text-output-dos-eol | --text-output-mac-eol ]
[-v|--verbose]
evl head
( --help | --usage | --version )
Options ¶
-d, --data-definition=<inline_evd>
either this option or the file <evd> must be present. Example: ‘-d 'id int, user_id string enc=iso-8859-1'’
-n, --records=[-]<num>
output the first <num> records instead of the default first 10; or use ‘-n -<num>’ to output all records
except the last <num>
-s, --skip-parse
this option has no effect with ‘--records <num>’ (i.e. the case when the first <num> records are read and
the rest is ignored). But with ‘--records -<num>’ it does not parse all fields, but ‘jumps’ over
the record separator, i.e. the separator of the last field. Be careful with this option; it is
particularly good for ‘csv’ files, when you want to skip some weirdly formatted footer for example,
but might be a wrong solution when some fields are separated by the same character as the last one.
--skip-bom
skip the UTF-8 BOM (byte order mark) at the beginning of the input, i.e. EF BB BF. Windows usually adds it
to files in UTF-8 encoding
--validate
without this option, no fields are checked against data types. With this option, all output fields
are checked
-x, --text-input
treat the input as text, not binary
--text-input-dos-eol
treat the input as text with CRLF as end of line
--text-input-mac-eol
treat the input as text with CR as end of line
-y, --text-output
write the output as text, not binary
--text-output-dos-eol
produce the output as text with CRLF as end of line
--text-output-mac-eol
produce the output as text with CR as end of line
Standard options: ¶
--help
print this help and exit
--usage
print short usage information and exit
-v, --verbose
print to stderr info/debug messages of the component
--version
print version and exit
Examples ¶
- Print to stdout only the first 10 records:
    evl head example.evd -xy <in.txt
- Read the binary input and omit the last 3 records without parsing them (i.e. they need not have the data structure defined by the evd):
    cat input.bin | evl head -sy -n-3 \
      -d 'id int sep=",", updated date sep="\n"' \
      > output.txt
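The two meanings of ‘-n’ (first <num> records, or everything except the last <num>) can be sketched in Python. `head` here is an illustrative function working on any iterable of records; it does not model parsing or ‘--skip-parse’.

```python
from collections import deque
from itertools import islice


def head(records, num=10):
    """Yield the first `num` records, or all but the last `-num` when num is negative."""
    if num >= 0:
        yield from islice(records, num)
        return
    window = deque()                 # holds the records that may still be "the last ones"
    for record in records:
        window.append(record)
        if len(window) > -num:
            yield window.popleft()   # safe to emit: at least -num records follow it


first_two = list(head(range(1, 8), 2))
all_but_last_three = list(head(range(1, 8), -3))
```

The negative case needs only a buffer of <num> records, so the input can still be processed as a stream.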
8.12 Lookup ¶
(since EVL 2.0)
Prepare a lookup from sorted input; the lookup can be used after the ‘Wait’ command until ‘Lookup remove’.
The input must be sorted by the <key>.
Lookup [remove]
is to be used in EVS job structure definition file. <f_in> and <f_out> are either
input and output file or flow name.
evl lookup [remove]
is intended for standalone usage, i.e. to be invoked from command line.
EVD is EVL data definition file, for details see evl-evd(5).
Synopsis ¶
Lookup
<f_in> <lookup_name> (<evd>|-d <inline_evd>) -k <key> [-x|--text-input]
Lookup remove
<lookup_name>
evl lookup
<lookup_name> (<evd>|-d <inline_evd>) -k <key> [-x|--text-input]
[-v|--verbose]
evl lookup remove
<lookup_name>
evl lookup
( --help | --usage | --version )
Options ¶
-d, --data-definition=<inline_evd>
either this option or the file <evd> must be present. Example: ‘-d 'id int, user_id string enc=iso-8859-1'’
-k, --key=<key>
key for looking up records
-x, --text-input
treat the input as text, not binary
Standard options: ¶
--help
print this help and exit
--usage
print short usage information and exit
-v, --verbose
print to stderr info/debug messages of the component
--version
print version and exit
Examples ¶
- To prepare a lookup at the beginning of the job:
    Read dimension.csv DIM evd/dim.evd --text-input
    Sort DIM DIM_SRT evd/dim.evd --key="id"
    Lookup DIM_SRT dim_lkp evd/dim.evd --key="id"
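One reason a lookup may require sorted input is that sorted keys allow binary search. The Python sketch below illustrates that idea only; how EVL stores and searches lookups is not described here, and `SortedLookup` is a hypothetical class.

```python
from bisect import bisect_left


class SortedLookup:
    """Look records up by key with binary search over input sorted by that key."""

    def __init__(self, records, key):
        self._records = list(records)
        self._keys = [record[key] for record in self._records]
        if any(a > b for a, b in zip(self._keys, self._keys[1:])):
            raise ValueError("input must be sorted by the key")

    def find(self, value):
        """Return the first record with the given key value, or None."""
        index = bisect_left(self._keys, value)
        if index < len(self._keys) and self._keys[index] == value:
            return self._records[index]
        return None


dimension = [
    {"id": 1, "name": "bronze"},
    {"id": 4, "name": "silver"},
    {"id": 9, "name": "gold"},
]
dim_lkp = SortedLookup(dimension, key="id")
found = dim_lkp.find(4)
missing = dim_lkp.find(5)
```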
8.13 Merge ¶
(since EVL 1.2)
Merge sorted flows or files into one (sorted) output. In the case of only one input flow or file, it simply writes this input to the output flow or file.
To merge based on all of the fields, use an empty <key>.
Merge
is to be used in EVS job structure definition file. <f_in> and <f_out> are either
input and output file or flow name.
evl merge
is intended for standalone usage, i.e. to be invoked from the command line. When <file> is ‘-’,
it reads from stdin.
EVD is EVL data definition file, for details see evl-evd(5).
Synopsis ¶
Merge
<f_in>... <f_out> [<evd>|-d <inline_evd>] -k|--key <key>
[-c|--check-sort] [-i|--ignore-case]
[--validate] [-x|--text-input] [-y|--text-output]
evl merge
[<file>...] [<evd>|-d <inline_evd>] -k|--key <key>
[-c|--check-sort] [-i|--ignore-case]
[--validate] [-x|--text-input] [-y|--text-output]
[-v|--verbose]
evl merge
( --help | --usage | --version )
Options ¶
-c, --check-sort
check if the input is really sorted according to specified key
-d, --data-definition=<inline_evd>
either this option or the file <evd> must be present. Example: ‘-d 'some_id long sep="|", some_value string sep="\n"'’
-i, --ignore-case
be case insensitive for key fields
-k, --key=<key>
merge by this key, where <key> is a comma-separated list of fields with type (either DESC or ASC;
the default type is ASC). When the <key> is empty, it merges based on the whole record.
--validate
without this option, no fields are checked against data types. With this option, all output fields
are checked
-x, --text-input
treat the input as text, not binary
-y, --text-output
write the output as text, not binary
Standard options: ¶
--help
print this help and exit
--usage
print short usage information and exit
-v, --verbose
print to stderr info/debug messages of the component
--version
print version and exit
Examples ¶
- Merge three (sorted) binary files; the output is in text and sorted by ‘input_id’:
    evl merge example.evd -k 'input_id' -y input1.bin input2.bin input3.bin
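Merging already sorted inputs by a key is the classic k-way merge, which Python's standard library provides as `heapq.merge`. The sketch below uses it on in-memory records; `merge_sorted` and the sample data are illustrative, and only ascending order is shown.

```python
import heapq
from operator import itemgetter


def merge_sorted(inputs, key):
    """Lazily merge inputs that are each sorted ascending by the field `key`."""
    return heapq.merge(*inputs, key=itemgetter(key))


input1 = [{"input_id": 1}, {"input_id": 4}]
input2 = [{"input_id": 2}, {"input_id": 3}, {"input_id": 6}]
input3 = [{"input_id": 5}]

merged_ids = [r["input_id"] for r in merge_sorted([input1, input2, input3], "input_id")]
```

A merge only looks at the current head of each input, so it yields sorted output only if every input is sorted; this is what an option such as ‘--check-sort’ guards against.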
8.14 Partition ¶
(since EVL 1.2)
Read the input flow or file and, according to the ‘--key’ or ‘--round-robin’ logic, send records to
several output flows or files. The number of partitions depends on the ‘EVL_PARTITIONS’
environment variable and also on the EVL version/edition.
Partition
is to be used in EVS job structure definition file. <f_in> and <f_out> are either
input and output file or flow name.
evl partition
is intended for standalone usage, i.e. to be invoked from command line.
EVD is EVL data definition file, for details see evl-evd(5).
Synopsis ¶
Partition
<f_in> <f_out> (<evd>|-d <inline_evd>)
(--key=<key> | --round-robin)
[--validate] [-x|--text-input] [-y|--text-output]
evl partition
<file_in> <file_out> (<evd>|-d <inline_evd>)
(--key=<key> | --round-robin)
[--validate] [-x|--text-input] [-y|--text-output]
[-v|--verbose]
evl partition
( --help | --usage | --version | --max-partitions )
Options ¶
-d, --data-definition=<inline_evd>
either this option or the file <evd> must be present
-k, --key=<key>
key according to which the data are distributed
-m, --max-partitions
return the maximum possible number of partitions
-r, --round-robin
split by round-robin, i.e. simply send one record after another to one output flow/file after another
--validate
without this option, no fields are checked against data types. With this option, all output fields
are checked
-x, --text-input
treat the input as text, not binary
-y, --text-output
write the output as text, not binary
Standard options: ¶
--help
print this help and exit
--usage
print short usage information and exit
-v, --verbose
print to stderr info/debug messages of the component
--version
print version and exit
Examples ¶
- To partition a flow in the EVL job:
    Read s3://my_bucket/cust.csv CUST $EVD_CUST
    Partition CUST CUST_P $EVD_CUST --round-robin
    Map CUST_P PROC_M $EVD_CUST $EVD_PROC $EVM_PROC
    Departition PROC_M PROC_G $EVD_PROC --round-robin
    Write PROC_G sftp:///some/path/proc.csv.gz $EVD_PROC
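The two distribution modes can be contrasted in a small Python sketch. Round-robin follows the description above; for ‘--key’ the sketch assumes hashing of the key value modulo the number of partitions (here with CRC32), which is a common technique but not confirmed as EVL's method. Both function names are hypothetical.

```python
import zlib


def partition_round_robin(records, partitions):
    """Send record 0 to output 0, record 1 to output 1, ... and wrap around."""
    outputs = [[] for _ in range(partitions)]
    for index, record in enumerate(records):
        outputs[index % partitions].append(record)
    return outputs


def partition_by_key(records, key, partitions):
    """Send records with equal key values to the same output (assumed: hash modulo)."""
    outputs = [[] for _ in range(partitions)]
    for record in records:
        digest = zlib.crc32(str(record[key]).encode("utf-8"))
        outputs[digest % partitions].append(record)
    return outputs


round_robin = partition_round_robin(list(range(7)), 3)

customers = [{"id": i % 4, "row": i} for i in range(12)]
by_key = partition_by_key(customers, "id", 3)
```

Round-robin balances partition sizes, while key-based partitioning keeps all records of one key together, which matters for later per-key processing.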
8.15 Sort ¶
(since EVL 1.0)
Takes records from stdin or <f_in>, sorts them by <key> and writes them to
stdout or <f_out>. With the ‘-u’ option it deduplicates the data. At the moment it uses
only the traditional sort order (i.e. like LC_ALL=C), not a national one.
To sort based on all of the fields, use an empty <key>.
Sort
is to be used in EVS job structure definition file. <f_in> and <f_out> are either
input and output file or flow name.
evl sort
is intended for standalone usage, i.e. to be invoked from command line and reading records from
standard input and writing to standard output.
EVD and EVS are EVL definition files, for details see evl-evd(5) and evl-evs(5).
Synopsis ¶
Sort
<f_in> <f_out> (<evd>|-d <inline_evd>) -k <key>
[-u <unique-key> [-t|--keep-first] [--reject=<file>]]
[-c|--check-sort] [-f|--file-storage] [-i|--ignore-case]
[--validate] [-x|--text-input] [-y|--text-output]
evl sort
(<evd>|-d <inline_evd>) -k <key>
[-u <unique-key> [-t|--keep-first] [--reject=<file>]]
[-c|--check-sort] [-f|--file-storage] [-i|--ignore-case]
[--validate] [-x|--text-input] [-y|--text-output]
[-v|--verbose]
evl sort
( --help | --usage | --version )
Options ¶
-c, --check-sort
only check if the input is sorted and fail if not
-d, --data-definition=<inline_evd>
either this option or the file <evd> must be present. Example: ‘-d 'id int, user_id string enc=iso-8859-1'’
-f, --file-storage
store temporary files on disk instead of using memory
-i, --ignore-case
be case insensitive for key fields
-k, --key=<key>
sort by a key, where <key> is a comma-separated list of fields with type (the default type is
ASC). When the <key> is empty, it sorts based on the whole record. Example:
‘--key='id,user_id DESC,modify_dt ASC'’
-r, --reject=<reject_file>
used with the option ‘-u’, it catches duplicated records into <reject_file>
-t, --keep-first
when deduplicating by ‘--unique-key’, keep the first record from the group
-u, --unique-key=<unique_key>
deduplicate the output by <unique_key>; take only the last record unless ‘--keep-first’ is specified.
Duplicated records are caught by the ‘-r’ option. Example: ‘-u 'id,user_id'’
--validate
without this option, no fields are checked against data types. With this option, all output fields
are checked
-x, --text-input
treat the input as text, not binary
-y, --text-output
write the output as text, not binary
Standard options: ¶
--help
print this help and exit
--usage
print short usage information and exit
-v, --verbose
print to stderr info/debug messages of the component
--version
print version and exit
Examples ¶
- Sort the text input by the whole record (i.e. according to all fields) and write it into a text output file:
    evl sort example.evd -k '' -xy <in.txt >out.txt
- Deduplicate the binary input (for example from another EVL component) by keeping the first record in each group with the same id (i.e. the one with the lowest updated date), and write the result into output.csv and the duplicates into duplicates.csv:
    cat input.bin | \
    evl sort -ty -k'id,updated' -u'id' \
      -d'id int sep=",", updated date sep="\n"' -r duplicates.csv >output.csv
- Check the sort order (case-insensitively) of the input text file input.txt and write it into the file output.bin in binary (i.e. not as text):
    evl sort -cix -k'name' -d'name string sep="|", personal_id int sep="\n"' \
      <input.txt >output.bin
8.16 Sortgroup ¶
(since EVL 2.0)
Given input already sorted by <group_key>, sort the records within each group defined by
<group_key> according to <key>, so that the output is sorted by <group_key>,<key>. At the
moment only the traditional sort order (i.e. as with LC_ALL=C) is used, not a national one.
Sortgroup
is to be used in EVS job structure definition file. <f_in> and <f_out> are either
input and output file or flow name.
evl sortgroup
is intended for standalone usage, i.e. to be invoked from command line and reading records from
standard input and writing to standard output.
EVD and EVS are EVL definition files, for details see evl-evd(5) and evl-evs(5).
Synopsis ¶
Sortgroup
<f_in> <f_out> (<evd>|-d <inline_evd>)
-g|--group-key=<group_key>
-k|--key=<key>
[-c|--check-sort] [-i|--ignore-case]
[--validate] [-x|--text-input] [-y|--text-output]
evl sortgroup
(<evd>|-d <inline_evd>)
-g|--group-key=<group_key>
-k|--key=<key>
[-c|--check-sort] [-i|--ignore-case]
[--validate] [-x|--text-input] [-y|--text-output]
[-v|--verbose]
evl sortgroup
( --help | --usage | --version )
Options ¶
-c, --check-sort
check if the input is really sorted by ‘<group_key>’
-d, --data-definition=<inline_evd>
either this option or the file <evd> must be present. Example: ‘-d 'id int, user_id string'’
-g, --group-key=<group_key>
the input is sorted by this key, where <group_key> is a comma-separated list of fields with sort
order (default is ASC). Example: ‘-g 'id,user_id DESC'’
-i, --ignore-case
ignore case sensitivity for key fields
-k, --key=<key>
sort by this key within each group of records with the same <group_key>. <key> is a
comma-separated list of fields with sort order (default is ASC). Example: ‘-k 'modify_dt ASC'’
--validate
without this option, no fields are checked against data types. With this option, all output fields
are checked
-x, --text-input
treat the input as text, not binary
-y, --text-output
write the output as text, not binary
Standard options: ¶
--help
print this help and exit
--usage
print short usage information and exit
-v, --verbose
print to stderr info/debug messages of the component
--version
print version and exit
Examples ¶
- Suppose a dataset is already sorted by field ‘customer’. TBA
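The ordering described for this component can be illustrated with standard GNU tools instead of EVL itself. The following is only a sketch under assumptions: the CSV layout (customer, order date) and the sample records are hypothetical, and plain `sort` is used to reproduce the documented result (output sorted by <group_key>,<key> in LC_ALL=C order), which says nothing about how Sortgroup works internally.

```shell
# Hypothetical input: "customer,order_dt", already sorted by customer (the group key).
input='alice,2024-03-01
alice,2024-01-15
bob,2024-02-10
bob,2024-02-01'

# Reproduce the documented outcome: groups stay in customer order and
# records inside each group are ordered by the second field.
# LC_ALL=C gives the traditional (non-national) sort order.
sorted=$(printf '%s\n' "$input" | LC_ALL=C sort -t, -k1,1 -k2,2)

printf '%s\n' "$sorted"
```

Because the input is already ordered by the group key, a component like Sortgroup only has to reorder records inside each group, whereas the plain `sort` above reorders the whole input.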
8.17 Tail ¶
(since EVL 1.1)
Print the last <num> records of the input. Without option ‘-n’, print the last
10 records.
Tail
is to be used in EVS job structure definition file. <f_in> and <f_out> are either
input and output file or flow name.
evl tail
is intended for standalone usage, i.e. to be invoked from command line.
EVD is EVL data definition file, for details see evl-evd(5).
Synopsis ¶
Tail
<f_in> <f_out> [<evd>|-d <inline_evd>] [-n [+]<num>] [-s|--skip-parse]
[--validate] [--skip-bom]
[ -x|--text-input | --text-input-dos-eol | --text-input-mac-eol ]
[ -y|--text-output | --text-output-dos-eol | --text-output-mac-eol ]
evl tail
[<evd>|-d <inline_evd>] [-n [+]<num>] [-s|--skip-parse]
[--validate] [--skip-bom]
[ -x|--text-input | --text-input-dos-eol | --text-input-mac-eol ]
[ -y|--text-output | --text-output-dos-eol | --text-output-mac-eol ]
[-v|--verbose]
evl tail
( --help | --usage | --version )
Options ¶
-d, --data-definition=<inline_evd>
either this option or the file <evd> must be present. Example: -d 'id int, user_id string
enc=iso-8859-1'
-n, --records=[+]<num>
output the last <num> records instead of the default last 10; or use -n +<num> to output starting
with record <num>
-s, --skip-parse
with this option, fields are not parsed; the component only 'jumps' over the record separator, i.e.
the separator of the last field. Be careful with this option: it is particularly useful for 'csv'
files, for example to skip a strangely formatted header, but it may give wrong results when some
other field is separated by the same character as the last one.
--skip-bom
skip the UTF-8 BOM (byte order mark), i.e. the bytes EF BB BF, at the beginning of the input.
Windows usually adds it to files in UTF-8 encoding
--validate
without this option, no fields are checked against data types. With this option, all output fields
are checked
-x, --text-input
treat the input as text, not binary
--text-input-dos-eol
treat the input as text with CRLF as end of line
--text-input-mac-eol
treat the input as text with CR as end of line
-y, --text-output
write the output as text, not binary
--text-output-dos-eol
produce the output as text with CRLF as end of line
--text-output-mac-eol
produce the output as text with CR as end of line
Standard options: ¶
--help
print this help and exit
--usage
print short usage information and exit
-v, --verbose
print to stderr info/debug messages of the component
--version
print version and exit
Examples ¶
- Print only the last 10 records to stdout:
  evl tail example.evd -xy <in.txt
- Read the binary input and skip the first 2 records without parsing them (i.e. they need not have the data structure defined by evd):
  cat input.bin | evl tail -sy -n+3 \
  -d'id int sep=",", updated date sep="\n"' \
  > output.txt
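The two forms of ‘-n’ mirror the standard GNU ‘tail’ command, so their difference can be tried on plain text lines without EVL. This is a sketch only; the five sample records are hypothetical and each line stands for one record.

```shell
# Five hypothetical one-line records.
records=$(printf 'r1\nr2\nr3\nr4\nr5\n')

# "-n 2": only the last 2 records.
last_two=$(printf '%s\n' "$records" | tail -n 2)

# "-n +3": everything starting with record 3, i.e. the first 2 records are skipped.
from_third=$(printf '%s\n' "$records" | tail -n +3)

printf 'last two:\n%s\nfrom third:\n%s\n' "$last_two" "$from_third"
```

Note that the number after ‘+’ is the first record kept, not the number of records skipped.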
8.18 Tee ¶
(since EVL 1.0)
Replicate <f_in> to multiple <f_out>.
Tee
is to be used in EVS job structure definition file. <f_in> and <f_out> are either
input and output file or flow name.
There is no standalone version of this component, as one can use the standard UNIX command 'tee'.
Synopsis ¶
Tee
<f_in> <f_out>...
evl tee
( --help | --usage | --version )
Options ¶
Standard options: ¶
--help
print this help and exit
--usage
print short usage information and exit
--version
print version and exit
Examples ¶
Replicate to output flows (or files) A,B,C,D,E,F:
Tee IN_FLOW A B C D E F
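On the command line, the same replication can be done with the standard ‘tee’ command mentioned above. A minimal sketch, assuming hypothetical file names A and B in a temporary directory; standard output serves as one more copy of the stream.

```shell
# Work in a throwaway directory so the sketch leaves nothing behind.
workdir=$(mktemp -d)

# Replicate one input stream into files A and B; tee also passes it on to stdout.
passthrough=$(printf 'rec1\nrec2\n' | tee "$workdir/A" "$workdir/B")

copy_a=$(cat "$workdir/A")
copy_b=$(cat "$workdir/B")
rm -rf "$workdir"

# All three copies hold the same records.
[ "$copy_a" = "$copy_b" ] && [ "$copy_b" = "$passthrough" ] && echo "copies match"
```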
8.19 Trash ¶
(since EVL 1.0)
Send <f_in> into /dev/null. Try to avoid using it in a production environment, as redirecting
to /dev/null also costs resources.
Trash
is to be used in EVS job structure definition file. <f_in> is either input file or flow
name, both can be partitioned.
There is no standalone version of this component as you can always use >/dev/null.
EVS is EVL job structure definition file, for details see evl-evs(5).
Synopsis ¶
Trash
<f_in>...
evl trash
( --help | --usage | --version )
Options ¶
Standard options: ¶
--help
print this help and exit
--usage
print short usage information and exit
--version
print version and exit
8.20 Uniq ¶
(since EVL 2.1)
Read stdin or <f_in> and write to stdout or <f_out> the last record of each group
specified by <key>. The input must be sorted according to this key.
Uniq
is to be used in EVS job structure definition file. <f_in> and <f_out> are either
input and output file or flow name.
evl uniq
is intended for standalone usage, i.e. to be invoked from command line and reading records from
standard input and writing to standard output.
EVD and EVS are EVL definition files, for details see evl-evd(5) and evl-evs(5).
Synopsis ¶
Uniq
<f_in> <f_out> (<evd>|-d <inline_evd>) -k <key> [-c|--check-sort]
[-i|--ignore-case] [--reject=<file>] [-t|--keep-first]
[--validate] [-x|--text-input] [-y|--text-output]
evl uniq
(<evd>|-d <inline_evd>) -k <key> [-c|--check-sort]
[-i|--ignore-case] [--reject=<file>] [-t|--keep-first]
[--validate] [-x|--text-input] [-y|--text-output]
[-v|--verbose]
evl uniq
( --help | --usage | --version )
Options ¶
-c, --check-sort
check if the input is sorted and fail if not
-d, --data-definition=<inline_evd>
either this option or the file <evd> must be present. Example: -d 'id int, user_id string
enc=iso-8859-1'
-i, --ignore-case
ignore case sensitivity for key fields
-k, --key=<key>
deduplicate by a key, where <key> is a comma-separated list of fields with sort order (default is
ASC). Example: -k 'id,user_id DESC,modify_dt ASC'
-r, --reject=<reject_file>
catch duplicate records into <reject_file>
-t, --keep-first
keep the first record of the group instead of the last one
--validate
without this option, no fields are checked against data types. With this option, all output fields
are checked
-x, --text-input
treat the input as text, not binary
-y, --text-output
write the output as text, not binary
Standard options: ¶
--help
print this help and exit
--usage
print short usage information and exit
-v, --verbose
print to stderr info/debug messages of the component
--version
print version and exit
Examples ¶
- Deduplicate by all fields (i.e. by the whole record) and write into a text output file:
  evl uniq example.evd -k'' -xy < in.txt > out.txt
- Deduplicate the binary input (for example from another EVL component, already sorted by id and updated) by keeping the first record in each group with the same id (i.e. the one with the lowest updated date), and write the result into output.csv and the duplicates into duplicates.csv:
  cat input.bin | evl uniq -ty -k'id' \
  -d'id int sep=",", updated date sep="\n"' \
  -r duplicates.csv > output.csv
- Deduplicate the text input file input.txt by name (case-insensitively), checking that it is sorted, and write into file output.bin in binary (i.e. not as text):
  evl uniq -cix --key="name" \
  -d 'name string sep="|", personal_id int sep="\n"' \
  < input.txt > output.bin
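The difference between the default behaviour (keep the last record of each group) and ‘--keep-first’ can be shown with a few lines of awk. This is a sketch of the documented semantics only, not of how Uniq is implemented; the "id,updated" records are hypothetical and the input is assumed to be sorted by the key, as Uniq requires.

```shell
# Hypothetical records "id,updated", already sorted by id (the key).
input='1,2024-01-01
1,2024-02-01
2,2024-01-05
3,2024-03-01
3,2024-03-09'

# Default: keep the last record of each key group.
keep_last=$(printf '%s\n' "$input" | awk -F, '
  NR > 1 && $1 != key { print prev }   # key changed: emit last record of the previous group
  { key = $1; prev = $0 }
  END { if (NR > 0) print prev }       # emit last record of the final group
')

# Like --keep-first: keep the first record of each key group.
keep_first=$(printf '%s\n' "$input" | awk -F, '
  NR == 1 || $1 != key { print }       # first record of a new group
  { key = $1 }
')

printf 'keep last:\n%s\nkeep first:\n%s\n' "$keep_last" "$keep_first"
```

Both variants need only the previous record in memory, which is why the input has to be sorted by the key beforehand.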
8.21 Validate ¶
(since EVL 1.1)
Fail as soon as a record with an invalid data type appears, unless the ‘--limit’ option allows more invalid records.
Validate
is to be used in EVS job structure definition file. <f_in> and <f_out> are either
input and output file or flow name.
evl validate
is intended for standalone usage, i.e. to be invoked from command line and reading records from
standard input and writing to standard output.
EVD and EVS are definition files, for details see evl-evd(5) and evl-evs(5).
Synopsis ¶
Validate
<f_in> <f_out> (<evd>|-d <inline_evd>)
[-l|--limit <num>] [-y|--text-output]
evl validate
(<evd>|-d <inline_evd>)
[-l|--limit <num>] [-y|--text-output]
[-v|--verbose]
evl validate
( --help | --usage | --version )
Options ¶
-l, --limit=<num>
fail after reaching <num> invalid records. If <num> is ‘0’, never
fail. Default value is ‘1’, i.e. fail immediately after the first invalid record.
-y, --text-output
write the output as text, not binary
Standard options: ¶
--help
print this help and exit
--usage
print short usage information and exit
-v, --verbose
print to stderr info/debug messages of the component
--version
print version and exit
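The meaning of ‘--limit’ can be modelled with a small shell function. This is a sketch under assumptions: `check_limit` is a hypothetical helper, not part of EVL, and "anything that is not an integer" stands in for real data type checks. What Validate does with invalid records in its output is not described here, so only the exit status is modelled.

```shell
# check_limit LIMIT: read one record per line from stdin and return non-zero
# once LIMIT invalid records have been seen. LIMIT=0 never fails; default is 1.
check_limit() {
  awk -v limit="${1:-1}" '
    $0 !~ /^[-+]?[0-9]+$/ { invalid++ }          # hypothetical validity rule: integers only
    limit > 0 && invalid >= limit { exit 1 }     # limit reached: fail
  '
}

# One invalid record reaches limit 1, so the check fails.
printf '1\nx\n3\n' | check_limit 1 && echo "passed" || echo "failed"
# One invalid record does not reach limit 2, so the check passes.
printf '1\nx\n3\n' | check_limit 2 && echo "passed" || echo "failed"
# With limit 0 the check never fails.
printf 'x\ny\nz\n' | check_limit 0 && echo "passed" || echo "failed"
```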
8.22 Watcher ¶
(since EVL 1.2)
This component writes records passing through the <flow> into <file> in text format.
It works only when variable ‘EVL_WATCHER’ is set to ‘1’, otherwise it does nothing. One can use
it for debugging data in a ‘DEV’ or ‘TEST’ environment while keeping it switched off in
‘PROD’.
If <file> is not specified with a full path, it is written into the directory defined by the
‘EVL_WATCHER_DIR’ environment variable, which is by default the ‘watcher’ subfolder of the
current project.
EVD is EVL data definition file, for details see evl-evd(5).
Synopsis ¶
Watcher
<flow> <file> (<evd>|-d <inline_evd>) [-x|--text-input]
evl watcher
( --help | --usage | --version )
Options ¶
-d, --output-definition=<inline_evd>
either this option or the file <evd> must be present. Example:
‘-d 'user_sum long'’
-x, --text-input
treat the input as text, not binary
Standard options: ¶
--help
print this help and exit
--usage
print short usage information and exit
--version
print version and exit
Examples ¶
- In EVL job (‘evs’ file):
  Sort FLOW_01 FLOW_02 some.evd --key='id'
  Watcher FLOW_02 sorted.csv some.evd
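How the two environment variables decide whether and where the file is written can be sketched as a shell function. This is an illustration only: `watcher_target` is a hypothetical helper, not an EVL command, and the relative directory ‘watcher’ merely stands in for the default subfolder of the current project.

```shell
# watcher_target FILE: print the path a Watcher-style dump would be written to,
# or print nothing when EVL_WATCHER is not set to "1".
watcher_target() {
  [ "${EVL_WATCHER:-}" = "1" ] || return 0
  case "$1" in
    /*) printf '%s\n' "$1" ;;                                   # full path: use as given
    *)  printf '%s/%s\n' "${EVL_WATCHER_DIR:-watcher}" "$1" ;;  # otherwise: inside the watcher directory
  esac
}

# Subshells keep the variable settings local to each call.
(EVL_WATCHER=1; EVL_WATCHER_DIR=/tmp/watch; watcher_target sorted.csv)
(EVL_WATCHER=0; watcher_target sorted.csv)
```

The first call prints a path inside the configured directory; the second prints nothing, which corresponds to Watcher being switched off.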