Ability to profile data in data lake
One of the key advantages of a data lake is its exploration capability; having a data profile handy would enhance the exploration experience for the user. Moreover, the "data profile" could be integrated with the data catalog to further enhance the experience.
A data profile is a set of statistics about the data, such as minimum, maximum, average, data type, length, discrete values, uniqueness, occurrence of null values, and typical string patterns. It helps the user judge whether the data is appropriate and whether it contains any anomalies. The reason for having profiling capabilities in the data lake is to offload the profiling work to the data lake. This in turn reduces the load on the existing DW infrastructure (if the data lake is used to complement an existing DW) and, at the same time, reduces overall cost.
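To make the request concrete, here is a minimal sketch of the kind of per-column profile described above, in plain Python. It is only an illustration, not an existing data lake or catalog API: the function names `profile_column` and `string_pattern`, the output keys, and the pattern notation (digits shown as `9`, letters as `A`) are all assumptions made for this example.

```python
from collections import Counter
import re


def string_pattern(value: str) -> str:
    """Generalize a string: every digit becomes 9, every ASCII letter becomes A."""
    pattern = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", pattern)


def profile_column(values):
    """Return basic profile statistics for one column (hypothetical helper).

    `values` is a list of Python values; None stands for a null.
    """
    non_null = [v for v in values if v is not None]
    distinct = set(non_null)
    profile = {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(distinct),
        "is_unique": len(distinct) == len(non_null),
        "data_types": sorted({type(v).__name__ for v in non_null}),
    }

    # Numeric statistics (bool is excluded because it subclasses int).
    numbers = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if numbers:
        profile["min"] = min(numbers)
        profile["max"] = max(numbers)
        profile["avg"] = sum(numbers) / len(numbers)

    # String statistics: length range and the most frequent patterns.
    strings = [v for v in non_null if isinstance(v, str)]
    if strings:
        lengths = [len(s) for s in strings]
        profile["min_length"] = min(lengths)
        profile["max_length"] = max(lengths)
        patterns = Counter(string_pattern(s) for s in strings)
        profile["top_patterns"] = patterns.most_common(3)

    return profile


# Made-up sample columns, purely for illustration.
ages = [34, 51, None, 29, 34]
postcodes = ["AB1 2CD", "XY9 8ZW", None, "12345"]

age_profile = profile_column(ages)
postcode_profile = profile_column(postcodes)
print(age_profile)
print(postcode_profile)
```

In this sample the rare `99999` pattern among the postcodes is the sort of anomaly a profile is meant to surface. A result like this could then be attached to the column's entry in the data catalog.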
Shannon Lowder commented
Even if we need to add jobs through Data Factory to keep these statistics up to date, pushing the refreshed statistics to Data Catalog (and to the users or applications using the ADC information) would be a killer feature.