Skip to main content
Version: 2.8

4 Utils


4.1 csv2evd

Read <file.csv> or standard input, and guess:

  • data types,
  • field separator (unless option ‘--separator=<char>’ is used),
  • if strings are quoted (unless option ‘--quote=<char>’ or ‘--optional-quote=<char>’ is used),
  • end-of-line character(s) (unless option ‘--dos-eol’ or ‘--lin-eol’ or ‘--mac-eol’ is used)

and write EVD to standard output or to <file.evd>.

It uses header line for field names, spaces are replaced by underscores.

Separator is trying to be guessed in this order: ‘,’ (comma), ‘;’ (semi-colon), ‘|’ (pipe), ‘\t’ (tab), ‘:’ (colon), (space).

Quotation character is guessed in this order: double quotes, single quotes.

EVD is EVL data definition file, for details see man 5 evd.

Synopsis

csv2evd
[<file.csv>] [-o|--output=<file.evd>]
[--inline]
[-d|--date=<format>]
[-h|--header=<field_name>,...]
[-n|--no-header]
[-l|--null=<string>]
[-q|--quote=<char> | --optional-quote=<char>]
[-s|--separator=<char>]
[-t|--datetime=<format>]
[--timestamp=<format>]
[--dos-eol | --lin-eol | --mac-eol]
[-v|--verbose]

csv2evd
( --help | --usage | --version )

Options

-d, --date=<format>
by default it tries only ‘%Y-%m-%d’, then ‘%d.%m.%Y

-h, --header=<field_name>,...
use comma separated list of field names instead of header line, for example when there is no header in csv file (option ‘-n’ must be used) or when other field names would be used

--inline
output EVD in the inline format (for example to use EVD by other component with ‘-d’ option)

--dos-eol
do not guess end-of-line character(s), but suppose the input is text with CRLF as end of line,

--lin-eol
do not guess end-of-line character(s), but suppose the input is text with LF as end of line

--mac-eol
do not guess end-of-line character(s), but suppose the input is text with CR as end of line

-n, --no-header
with this option it suppose there is no header. Fields will be named ‘field_001’, ‘field_002’, etc.

-l, --null=<string>
to specify what string is used for NULL values in CSV, empty string is allowed

-o, --output=<file.evd>
write output into file <file.evd> instead of standard output

--optional-quote=<char>
suppose optional quote character <char>, must be used together with ‘--separator

-q, --quote=<char>
do not guess if fields are quoted, but suppose <char> as quotation character

-s, --separator=<char>
do not guess the separator, but use <char> instead

-t, --datetime=<format>
by default it tries only ‘%Y-%m-%d %H:%M:%S

--timestamp=<format>
by default it tries only ‘%Y-%m-%d %H:%M:%S.%E*f

-v, --verbose
print to STDERR info/debug messages

--help
print this help and exit

--usage
print short usage information and exit

--version
print version and exit

Examples

  1. Having table.csv:

    id;started;value
    1;2019-06-06;some string

    This command:

    csv2evd table.csv

    will try to guess data types, field separator and if strings are quoted or not, and use header line for field names, to produce EVD to standard output:

    id       int              null="" sep=";"
    started date("%Y-%m-%d") null="" sep=";"
    value string null="" sep="\n"
  2. Just an alternative invocation forwording output EVD to a file:

    csv2evd < table.csv > table.evd
  3. To skip header and use different field names:

    csv2evd --header="first_field,other_field,last_one" \
    table.csv > table.evd
  4. Case when there is no header in CSV file, but use specified field names:

    csv2evd --no-header --header="first_field,other_field,last_one" \
    table.csv > table.evd
  5. No header in CSV and use generated field names ‘field_001’, ‘field_002’, etc.:

    csv2evd --no-header table.csv > table.evd
  6. Consider specific date format, here day of year (‘001..366’), and ‘|’ as a field separator:

    csv2evd --date="%j" -s '|' table.csv > table.evd

4.2 csv2parquet

Read <file.csv>, guess data types or use <evd> file or <inline_evd> and write Parquet file to <file.parquet> or standard output. For guessing data types (EVD) it uses utility ‘csv2evd’.

EVD is EVL data definition file, for details see man 5 evd.

Synopsis

csv2parquet
<file.csv>
[-o|--output=<file.parquet>]
[-d|--date=<format>]
[-h|--header=<field_name>,...]
[-n|--no-header]
[-l|--null=<string>]
[-q|--quote=<char> | --optional-quote=<char>]
[-s|--separator=<char>]
[-t|--datetime=<format>]
[--timestamp=<format>]
[-a|--dos-eol | -b|--mac-eol]
[-v|--verbose]

csv2parquet
<file.csv> (<evd>|-d <inline_evd>)
[-o|--output=<file.parquet>]
[-a|--dos-eol | -b|--mac-eol]
[-v|--verbose]

csv2parquet
( --help | --usage | --version )

Options

Standard options:

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be presented to use already defined EVD

-a, --dos-eol
suppose DOS end-of-line, i.e. replace CR+LF (‘\r\n’) by LF (‘\n’) on input

-b, --mac-eol
suppose Mac end-of-line, i.e. replace CR (‘\r’) by LF (‘\n’) on input

-o, --output=<file.parquet>
write output into <file.parquet> instead of standard output

-v, --verbose
print to STDERR info/debug messages

--help
print this help and exit

--usage
print short usage information and exit

--version
print version and exit

EVD options:

--date=<format>
by default it tries only ‘%Y-%m-%d’, then ‘%d.%m.%Y

-h, --header=<field_name>,...
use comma separated list of field names instead of header line, for example when there is no header in csv file (option ‘-n’ must be used) or when other field names should be used

-n, --no-header
with this option it suppose there is no header. Fields will be named ‘field_001’, ‘field_002’, etc.

-l, --null=<string>
to specify what string is used for NULL values in CSV, empty string is allowed

--optional-quote=<char>
suppose optional quote character <char>, must be used together with ‘--separator

-q, --quote=<char>
do not guess if fields are quoted, but suppose <char> as quotation character

-s, --separator=<char>
do not guess the separator, but use <char> instead

-t, --datetime=<format>
by default it tries only ‘%Y-%m-%d %H:%M:%S

--timestamp=<format>
by default it tries only ‘%Y-%m-%d %H:%M:%S.%E*f

Examples

  1. Having ‘some.csv’:

    id;started;value
    1;2019-06-06;some string

    The command:

    csv2parquet --null="NULL" some.csv > some.parquet

    will produce some.parquet file with these field:

    id       int               null="NULL"  sep=";"
    started date("%Y-%m-%d") null="NULL" sep=";"
    value string null="NULL" sep="\n"
  2. Following invocation will guess data types, field separator and if strings are quoted or not, and use header line for field names:

    csv2parquet table.csv > table.parquet

    With the ‘--verbose’ option it will write to standard error the whole EVD file which was used:

    csv2parquet --verbose table.csv > table.parquet
  3. To skip header and use different field names:

    csv2parquet --header="first_field,other_field,last_one"
    table.csv > table.parquet
  4. Case when there is no header in CSV file, but use specified field names:

    csv2parquet --no-header --header="first_field,other_field,last_one" \
    table.csv > table.parquet
  5. No header in CSV and use generated field names ‘field_001’, ‘field_002’, etc.:

    csv2parquet --no-header table.csv > table.parquet
  6. Consider specific date format, here day of year (‘001..366’), and ‘|’ as a field separator:

    csv2parquet --date="%j" -s '|' table.csv > table.parquet
  7. To use own (specific or already generated) EVD file (i.e. data types definition):

    csv2parquet table.csv table.evd > table.parquet

Another example

Let’s have following sample_1.csv file with header and with semicolon as a field separator:

ID;Name;Code;Price;Created at
1;item_11;11;15,85;2016-05-04
2;item_12;12;21,25;2019-11-11
3;item_13;13;12,99;2019-05-05

Then by running:

csv2parquet -v < sample_1.csv > sample_1.parquet

will guess data types and use field names from the header and produce sample_1.parquet file.

And it will also write to standard error EVL data types which were used:

Used EVD:
---------
ID int null="" sep=";"
Name string null="" sep=";"
Code int null="" sep=";"
Price decimal(8,2) null="" sep=";"
Created_at date null="" sep="\n"
---------

More complex example

If guessed data types are not correct for some reason, for example the date format would be different or field which looks like an integer might be next time a string, then it is better to provide own EVD file with EVL data definition description.

Suppose following sample_2.csv file with header and with semicolon as a field separator:

ID;Name;Code;Price;Created at
1;item_11;11;15,85;05/04/2016
2;item_12;12;21,25;11/11/2019
3;item_13;13;12,99;05/05/2019

Let’s start by guessing EVD by csv2evd utility:

csv2evd sample_2.csv > sample_2.evd

It will result in sample_2.evd:

ID         int          null="" sep=";"
Name string null="" sep=";"
Code int null="" sep=";"
Price decimal(8,2) null="" sep=";"
Created_at string null="" sep="\n"

We can either use option --date="%m/%d/%Y" or simply edit sample_2.evd manually. Both will result in:

ID         int              null="" sep=";"
Name string null="" sep=";"
Code int null="" sep=";"
Price decimal(8,2) null="" sep=";"
Created_at date("%m/%d/%Y") null="" sep="\n"

4.3 parquet2csv

(since EVL 2.8)

Read <file.parquet> and write CSV file to <file.csv> or standard output. It uses data types from Parquet header or from existing <evd> file or from <inline_evd>.

EVD is EVL data definition file, for details see man 5 evd.

Synopsis

parquet2csv
<file.parquet>
[-o|--output=<file.csv>]
[--all-as-string | --real-as-decimal[=<precision>,<scale>]]
[-d|--date=<format>]
[-h|--header=<field_name>,...]
[-n|--no-header]
[-l|--null=<string>]
[-q|--quote=<char>]
[-s|--separator=<char>]
[-t|--datetime=<format>]
[-a|--dos-eol | -b|--mac-eol]
[--filter=<condition>]
[--first-record=<n>]
[--guess-uniform-symbol-size]
[--low-memory]
[-v|--verbose]

parquet2csv
<file.parquet> (<evd>|-d <inline_evd>)
[-m|--match-fields]
[-o|--output=<file.csv>]
[-h|--header=<field_name>,...]
[-n|--no-header]
[-a|--dos-eol | -b|--mac-eol]
[--filter=<condition>]
[--first-record=<n>]
[--guess-uniform-symbol-size]
[--low-memory]
[-v|--verbose]

parquet2csv
( --help | --usage | --version )

Options

--all-as-string
interpret all fields as strings. (Since EVL 2.5.)

-d, --data-definition=<inline_evd>
either this option or the file <evd> must be presented to use already defined (custom) EVD

-a, --dos-eol
output DOS end-of-line, i.e. CR+LF (‘\r\n’)

-b, --mac-eol
output Mac end-of-line, i.e. CR (‘\r’)

--date=<format>
to specify a <format> for date data type

--filter=<condition>
read only records with given <condition>. (Since EVL 2.6.)

--first-record=<n>
start to read from the record number <n>. (Since EVL 2.6.)

--guess-uniform-symbol-size
might speed up indexing of dictionary, but it could not work in all cases. Use only in special cases when need really good performance. (Since EVL 2.6.)

-h, --header=<field_name>,...
use comma separated list of field names instead of header line, for example when you don’t want to use field names from Parquet header.

--low-memory
do not read dictionary into memory. This could save memory consumption, but slows down reading the source file. (Since EVL 2.6.)

-l, --null=<string>
to specify what string is used for NULL values in CSV, empty string is allowed

-m, --match-fields
to read only a subset of fields from Parquet file or to read them in different order

-n, --no-header
with this option it produces no header line

-o, --output=<file.csv>
write output into <file.csv> instead of standard output

-q, --quote=<char>
to use quoted fields for the CSV output. When data contains such <char>, all of them are escaped by duplicating them. For example using ‘--quote="\""’ will serve data like ‘some "text"’ as ‘"some ""text"""’.

--real-as-decimal[=<precision>,<scale>]
interpret ‘real’ data types as ‘decimal(<precision>,<scale>)’. When no <precision> or <scale> is specified, use values from environment variables ‘EVL_DEFAULT_DECIMAL_PRECISION’ and ‘EVL_DEFAULT_DECIMAL_SCALE’, which are by default set to 18 and 2. (Since EVL 2.5.)

-s, --separator=<char>
to use <char> as field separator for the CSV output

-t, --datetime=<format>
to specify a <format> for datetime data type

-v, --verbose
print to standard error output info/debug messages

--help
print this help and exit

--usage
print short usage information and exit

--version
print version and exit

Examples

  1. Having ‘some.parquet’, the command to produce CSV file with empty strings representing NULL values, dates in format ‘DD.MM.YYYY’ and with Windows end-of-line (i.e. CRLF):

    parquet2csv --null="" --date="%d.%m.%Y" --dos-eol some.parquet > some.csv
  2. To filter only particular records from ‘large.parquet’, for example we would like to read only latest records represented by field ‘invoice_id’:

    parquet2csv --filter="invoice_id>7654000" large.parquet > latest.csv
  3. To cut only particular columns from ‘large.parquet’, for example only column ‘invoice_id’:

    parquet2csv --match-fields -d 'invoice_id int null=""' large.parquet > latest.csv
  4. To read only after by some number of rows:

    parquet2csv --first-record=1234000 huge.parquet > latest.csv

    This could be quite useful when reading a huge Parquet file.