How can we improve Microsoft Azure Data Lake?

Support more wildcards in ADLA file sets

It would be nice to have more wildcards besides the asterisk in file sets. Suppose we have two sets of files, e.g.

file01.tbl
file02.tbl

and

file0101.tbl
file0102.tbl
file0201.tbl
file0202.tbl

It is impossible to select just one of the two sets, since the syntax

@set1 = EXTRACT ..... FROM "/file{*}.tbl" USING .....;

matches all the files. The proposal is to allow another wildcard, e.g. ? to mean a single character, so we could write

@set1 = EXTRACT ..... FROM "/file{??}.tbl" USING .....;
@set2 = EXTRACT ..... FROM "/file{????}.tbl" USING .....;

Of course, the actual syntax/wildcard does not have to be ?; the request is about having more flexibility to match subsets of file names.
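
To make the requested semantics concrete, here is a minimal sketch in Python rather than U-SQL. It uses the standard `fnmatch` module, where `*` matches any run of characters and `?` matches exactly one. This only illustrates the behavior being asked for; it is not how ADLA file sets work today, and the `select` helper is a hypothetical name introduced for the example.

```python
from fnmatch import fnmatchcase

# The two sets of files from the example above.
files = [
    "file01.tbl",
    "file02.tbl",
    "file0101.tbl",
    "file0102.tbl",
    "file0201.tbl",
    "file0202.tbl",
]


def select(pattern, names):
    """Return the names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


# '*' matches any number of characters, so both sets are selected together.
all_files = select("file*.tbl", files)

# '?' matches exactly one character, so the two sets can be told apart.
set1 = select("file??.tbl", files)
set2 = select("file????.tbl", files)
```

With a single-character wildcard, `set1` contains only the two-digit names and `set2` only the four-digit names, which is the separation the asterisk alone cannot express.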

9 votes

Anonymous shared this idea

3 comments

  • Michel Caradec commented

    Here is a set of files:
    1aaa.txt
    2bbb.txt
    3ccc.txt
    4ddd.txt
    5eee.txt

    Their names start with a number from 1 to 9.
    I'd like to extract files not starting with 1, 2 or 3.

    A file set pattern based on regular expressions would make this possible:

    EXTRACT ... FROM "^[^1-3]{1}[^$]+$" USING ...;

    Thank you.
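
As an illustration of what such a pattern would select, the sketch below applies the regular expression from this comment to the listed file names using Python's `re` module. It assumes the pattern is matched against the bare file name; how a regex-based file set would actually be evaluated in ADLA is not specified in this thread.

```python
import re

# File names from the comment above.
files = ["1aaa.txt", "2bbb.txt", "3ccc.txt", "4ddd.txt", "5eee.txt"]

# The pattern as written in the comment: the first character must not be
# 1, 2 or 3, followed by one or more characters that are not a literal '$'.
pattern = re.compile(r"^[^1-3]{1}[^$]+$")

# Keep only the names that the pattern accepts.
selected = [name for name in files if pattern.match(name)]
```

Here `selected` ends up holding `4ddd.txt` and `5eee.txt`, the files whose names do not start with 1, 2 or 3.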

  • Michael Amadi commented

    Hi there, have you tried combining the placeholder with a 'LIKE' pattern match in the WHERE clause?

    Using your example, this would be something along these lines...

    @set1 =
    EXTRACT .....
    FROM "/file{number}.tbl"
    USING .....;

    @result =
    SELECT .....
    FROM @set1
    WHERE number LIKE "____";

    etc..
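
The idea behind this workaround is to capture the variable part of the file name as a value and filter on it afterwards. A rough Python emulation of that two-step approach is sketched below. The `PLACEHOLDER` regex and the `extract_with_number` helper are hypothetical stand-ins for the `{number}` placeholder and the EXTRACT step, and the length check plays the role of the four-underscore LIKE pattern.

```python
import re

files = [
    "file01.tbl",
    "file02.tbl",
    "file0101.tbl",
    "file0102.tbl",
    "file0201.tbl",
    "file0202.tbl",
]

# Stand-in for "/file{number}.tbl": capture whatever sits between
# "file" and ".tbl" under the name 'number'.
PLACEHOLDER = re.compile(r"^file(?P<number>.*)\.tbl$")


def extract_with_number(names):
    """Return (file name, captured number) pairs for matching names."""
    rows = []
    for name in names:
        match = PLACEHOLDER.match(name)
        if match:
            rows.append((name, match.group("number")))
    return rows


rows = extract_with_number(files)

# Equivalent of filtering on a pattern of exactly four characters.
set2 = [name for name, number in rows if len(number) == 4]

# The same approach with two characters selects the other set.
set1 = [name for name, number in rows if len(number) == 2]
```

Note that in this emulation every file is matched first and the filter is applied afterwards; whether ADLA can avoid reading the excluded files in the equivalent U-SQL query is not stated in this thread.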

  • Anonymous commented

    It would be nice to have more wildcards besides the asterisk for ADLA file sets. Currently, if you have two distinct sets of files, e.g.

    file01.tbl
    file02.tbl

    and

    file0101.tbl
    file0102.tbl
    file0201.tbl
    file0202.tbl

    you can't just write

    EXTRACT ... FROM "/file{*}.tbl" USING ...;

    because that matches all the files. So it would be nice to have, e.g., a ? wildcard to match single characters, for example:

    @set1 = EXTRACT ..... FROM "/file{??}.tbl" USING .....;
    @set2 = EXTRACT ..... FROM "/file{????}.tbl" USING .....;

    The ? is just a suggestion, of course. The request is about having more flexible ways to match subsets of files.
