How can we improve Microsoft Azure Data Lake?

Support more wildcards in ADLA file sets

It would be nice to have more wildcards besides the asterisk in file sets. Suppose we've got two sets of files, e.g.

file01.tbl
file02.tbl

and

file0101.tbl
file0102.tbl
file0201.tbl
file0202.tbl

It is currently impossible to select just one of the two sets, since the syntax

@set1 = EXTRACT ..... FROM "/file{*}.tbl" USING .....;

matches all the files. The proposal is to allow another wildcard, e.g. ?, to mean a single character, so we could write

@set1 = EXTRACT ..... FROM "/file{??}.tbl" USING .....;
@set2 = EXTRACT ..... FROM "/file{????}.tbl" USING .....;

Of course, the actual syntax/wildcard does not have to be ?; the request is about having more flexibility to match subsets of file names.
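To make the requested semantics concrete, here is a minimal sketch in Python rather than U-SQL. It uses the standard `fnmatch` module, where `*` matches any run of characters and `?` matches exactly one. The `select` helper and the `FILES` list are illustrative names only, and this shows the behaviour being asked for, not anything ADLA currently does.

```python
from fnmatch import fnmatchcase

# The two sets of files from the example above.
FILES = [
    "file01.tbl", "file02.tbl",
    "file0101.tbl", "file0102.tbl", "file0201.tbl", "file0202.tbl",
]

def select(pattern, names):
    """Return the names that match a shell-style pattern (hypothetical helper)."""
    return [n for n in names if fnmatchcase(n, pattern)]

set_all = select("file*.tbl", FILES)     # asterisk: matches every file
set1 = select("file??.tbl", FILES)       # two single-character wildcards
set2 = select("file????.tbl", FILES)     # four single-character wildcards

print(set1)
print(set2)
```

With a single-character wildcard, the length of the variable part is enough to tell the two sets apart, which the asterisk alone cannot do.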

8 votes

    Anonymous shared this idea

    3 comments

      • Michel Caradec commented

        Here is a set of files:
        1aaa.txt
        2bbb.txt
        3ccc.txt
        4ddd.txt
        5eee.txt

        Their names start with a number from 1 to 9.
        I'd like to extract files not starting with 1, 2 or 3.

        A file set pattern based on regular expressions would make it possible:

        EXTRACT ... FROM "^[^1-3]{1}[^$]+$" USING ...;

        Thank you.
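As a rough illustration of what the regular expression in this comment would select, the sketch below applies it verbatim to the five file names using Python's `re` module. This is Python, not U-SQL, and regex-based file set patterns are the commenter's proposal, not an existing ADLA feature; `FILES` and `PATTERN` are illustrative names.

```python
import re

# File names from the comment above.
FILES = ["1aaa.txt", "2bbb.txt", "3ccc.txt", "4ddd.txt", "5eee.txt"]

# The pattern exactly as written in the comment: the first character must not
# be 1, 2 or 3, followed by one or more characters other than a literal '$'.
PATTERN = re.compile(r"^[^1-3]{1}[^$]+$")

selected = [name for name in FILES if PATTERN.match(name)]
print(selected)
```

Note that `[^1-3]` also accepts any non-digit first character; if only names starting with 4 to 9 should match, a class such as `[4-9]` would be stricter.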

      • Michael Amadi commented

        Hi there, have you tried combining the placeholder with a 'LIKE' pattern match in the WHERE clause?

        Using your example, this would be something along these lines...

        @set1 =
        EXTRACT .....
        FROM "/file{number}.tbl"
        USING .....;

        @result =
        SELECT .....
        FROM @set1
        WHERE number LIKE '____';

        etc.
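The logic of this workaround can be sketched outside U-SQL: capture the variable part of each file name as a column, then filter on that column with a LIKE-style pattern. The Python below is only an emulation under stated assumptions. `extract_number` and `like` are hypothetical helpers, and `like` implements just the two common LIKE wildcards (`_` for one character, `%` for any run of characters).

```python
import re

FILES = [
    "file01.tbl", "file02.tbl",
    "file0101.tbl", "file0102.tbl", "file0201.tbl", "file0202.tbl",
]

def extract_number(name):
    """Return the text that {number} would capture in "file{number}.tbl", or None."""
    m = re.fullmatch(r"file(?P<number>.*)\.tbl", name)
    return m.group("number") if m else None

def like(value, pattern):
    """Minimal SQL-style LIKE: '_' is one character, '%' is any run of characters."""
    regex = "".join(
        ".*" if c == "%" else "." if c == "_" else re.escape(c)
        for c in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None

# Step 1: read every file and keep the captured part alongside the name.
rows = [(name, extract_number(name)) for name in FILES]

# Step 2: filter on the captured column, as the WHERE clause would.
set1 = [name for name, number in rows if number is not None and like(number, "__")]
set2 = [name for name, number in rows if number is not None and like(number, "____")]

print(set1)
print(set2)
```

The filter runs on the captured value after the file set has been matched, so the two sets are separated in the query itself and the file set pattern does not need any new wildcard.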

      • Anonymous commented

        It would be nice to have more wildcards besides the asterisk for ADLA file sets. Currently, if you have two distinct sets of files, e.g.

        file01.tbl
        file02.tbl

        and

        file0101.tbl
        file0102.tbl
        file0201.tbl
        file0202.tbl

        you just can't say

        EXTRACT ... FROM "/file{*}.tbl" USING ...;

        because that matches all the files. So it would be nice to have, e.g., a ? wildcard to match single characters, for example:

        @set1 = EXTRACT ..... FROM "/file{??}.tbl" USING .....;
        @set2 = EXTRACT ..... FROM "/file{????}.tbl" USING .....;

        The ? is just a suggestion, of course. The request is about having more flexible ways to match subsets of files.
