FileSets

We've seen that you can explicitly list every input file in the EXTRACT statement. When there are many files, listing them all by hand each time quickly becomes impractical.

FileSets let you define a pattern that identifies the set of files to read.

In the simplest case, we read all the files in a folder.

@rs =
    EXTRACT 
        user   string,
        id     string
    FROM 
        "/input/{*}"
    USING Extractors.Csv();

Specifying a list of files that have an extension

This requires only a small change to the pattern. The example below extracts all the files that end with ".csv".

@rs =
    EXTRACT 
        user   string,
        id     string
    FROM 
        "/input/{*}.csv"
    USING Extractors.Csv();

Getting filenames as a column in the RowSet

Because we are reading rows from multiple files, it is convenient for each row to carry some information about the file it came from. We can adjust the query slightly to make this possible.

@rs =
    EXTRACT 
        user       string,
        id         string,
        __filename string
    FROM 
        "/input/{__filename}"
    USING Extractors.Csv();

You are probably wondering about the __ prefix in the column name __filename. It isn't required at all; it is simply a useful way of marking that this value came from the process of extracting the file, not from the data in the file itself.

To emphasize that the name __filename and the __ prefix are completely arbitrary, below is the same script with a different name (foo).

@rs =
    EXTRACT 
        user string,
        id   string,
        foo  string
    FROM 
        "/input/{foo}"
    USING Extractors.Csv();
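
Once the filename is a column, it can be used like any other column in later statements. As a minimal sketch, going back to the __filename version of the script, the following counts how many rows each file contributed. The rowset name @counts, the column name row_count, and the output path /output/rows_per_file.csv are hypothetical names chosen only for illustration.

@rs =
    EXTRACT 
        user       string,
        id         string,
        __filename string
    FROM 
        "/input/{__filename}"
    USING Extractors.Csv();

// Count the rows that came from each file in the FileSet.
@counts =
    SELECT __filename,
           COUNT(*) AS row_count
    FROM @rs
    GROUP BY __filename;

// Hypothetical output location; adjust to your own store.
OUTPUT @counts
    TO "/output/rows_per_file.csv"
    USING Outputters.Csv();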

Getting parts of a filename as a column in the RowSet

Instead of the full filename, we can also get part of the filename. The sample below shows how to get just the number part.

@rs =
    EXTRACT 
        user       string,
        id         string,
        __filenum  int
    FROM 
        "/input/data{__filenum}.csv"
    USING Extractors.Csv();
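
Because __filenum is declared as an int, it can be compared numerically in a later statement. The sketch below keeps only the rows from files data1.csv through data10.csv; the range, the rowset name @subset, and the output path are assumptions made for illustration. A predicate like this on a column that comes from the file path may also allow the compiler to skip files outside the range, but treat that as something to verify for your own job rather than a guarantee.

@rs =
    EXTRACT 
        user       string,
        id         string,
        __filenum  int
    FROM 
        "/input/data{__filenum}.csv"
    USING Extractors.Csv();

// Keep only rows whose source file number is between 1 and 10 (illustrative range).
@subset =
    SELECT user,
           id,
           __filenum
    FROM @rs
    WHERE __filenum >= 1 AND __filenum <= 10;

// Hypothetical output location.
OUTPUT @subset
    TO "/output/subset.csv"
    USING Outputters.Csv();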

Notes

  • The schemas for all the files in the FileSet must match the schema specified in the EXTRACT statement.
  • The more files there are in the FileSet, the longer compilation takes.
