On February 7, 2019, Azure Data Lake Storage (ADLS) Gen2 became generally available, and it has continued to evolve and mature since then. This article explains the benefits and what you need to know to get started. If you want to learn more about the concepts behind a data lake, please also read our eBook, Data Lakes in a Modern Data Architecture.
1. The data lake story in Azure is unified with the introduction of ADLS Gen2
When we needed cloud storage in Azure for a data lake deployment before ADLS Gen2, we had to choose between Azure Data Lake Storage Gen1 (previously known as Azure Data Lake Store) and Azure Storage (specifically blob storage). In order to decide which service to utilise, we had to assess the business and technical requirements against the options provided. While ADLS Gen1 provides important optimizations for analytic workloads as well as more granular security (see section 3 for more information), Azure Storage has built-in features such as geo-redundancy, hot/cold/archive tiers, additional metadata, and broader regional availability that are very appealing. In the past, in some cases, we either accepted some trade-offs or stored the data twice.
Azure Storage serves as the foundation for the new ADLS Gen2 service. An otherwise ordinary general purpose V2 storage account becomes ADLS Gen2 when the hierarchical namespace (HNS) property is enabled (see section 2 for details). As a result, ADLS Gen2 does not appear in Azure as a separate service. Many people have been confused by this change because ADLS Gen1 is a separate service. You can check whether ADLS Gen2 is enabled for a storage account in a couple of ways:
When viewing the Azure Storage account, if the file system service is displayed this indicates that ADLS Gen2 is supported:
If the hierarchical namespace (HNS) is configured in the Azure Storage account configuration parameters, this implies that ADLS Gen2 is supported:
Key Conclusion: We no longer have to choose between separate services when we need a data lake in Azure for an analytics project. With the hierarchical namespace enabled, Azure Storage is now the service of choice for building a data lake on Azure cloud storage.
2. ADLS Gen2 converges the worlds of object storage and hierarchical file storage
Fundamentally, ADLS Gen2 aims to take advantage of file system benefits while maintaining the scalability and cost-effectiveness of an object store:
As indicated in Section 4, full feature support for ADLS Gen2 is currently evolving. The longer-term vision is depicted in the diagram below:
Fig: Azure Data Lake Storage
The dark blue shading represents new features introduced with ADLS Gen2.
The three new areas depicted above include:
(1) File system. ADLS Gen2 introduces a change in terminology: the concept of a container (from blob storage) is referred to as a file system in ADLS Gen2.
(2) Hierarchical namespace. The hierarchical namespace (HNS), in conjunction with the DFS endpoint, is what enables the performance and security enhancements outlined in section 3.
(3) File system driver and DFS endpoint. ADLS Gen2 uses the ABFS driver, which is part of Apache Hadoop. The ABFS driver connects to ADLS Gen2 through the DFS endpoint, which enables the performance and security optimizations.
- ABFS = Azure Blob File System
- DFS = Distributed File System
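To make the driver and endpoint concrete, here is a minimal Python sketch of how the addresses are typically formed. The account name `contosolake`, the file system name `raw`, and the helper function names are hypothetical. The `abfs[s]://<file system>@<account>.dfs.core.windows.net/<path>` pattern is the URI form used by the ABFS driver.

```python
def abfs_uri(account: str, file_system: str, path: str = "", secure: bool = True) -> str:
    """Build an ABFS driver URI; 'abfss' is the TLS-secured variant of 'abfs'."""
    scheme = "abfss" if secure else "abfs"
    return f"{scheme}://{file_system}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def endpoint_url(account: str, endpoint: str = "dfs") -> str:
    """Return the HTTPS base URL for the DFS or the blob endpoint of an account."""
    if endpoint not in ("dfs", "blob"):
        raise ValueError(f"unknown endpoint: {endpoint}")
    return f"https://{account}.{endpoint}.core.windows.net"


# Hypothetical account and file system names, for illustration only.
print(abfs_uri("contosolake", "raw", "sales/2019/"))
print(endpoint_url("contosolake", "dfs"))
print(endpoint_url("contosolake", "blob"))
```

The same storage account exposes both endpoints. The file system optimizations described in this article apply when a client connects through the DFS endpoint.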
Documentation for each:
Key Conclusion: The longer-term vision (shown above), which includes full compatibility between the object store and file system models, will allow us to store data once and access it in a variety of ways depending on the use case. Multi-protocol access is the term for this.
3. ADLS Gen2 has significant performance and security advantages for analytical workloads
Both the object store model (such as Azure blob storage) and the hierarchical file system model (ADLS Gen1 and Gen2) are compatible with HDFS (Hadoop Distributed File System). This is accomplished by drivers that translate HDFS semantics into calls to the remote storage APIs, making ADLS Gen2 behave very similarly to native HDFS. In terms of performance and security, however, there are significant differences between object storage and hierarchical file system storage.
Folders in object storage are only virtual. Although we appear to be able to create folders in object storage, they are simply simulated within the URI string (or sometimes metadata is used as an alternative). Although that might initially seem trivial, it has the following implications:
(1) Query performance. With a hierarchical file system like ADLS Gen2, a query that retrieves only a subset of the data can use partition scans to prune the data it reads (predicate pushdown). For compute engines that know how to take advantage of partition scans, this can drastically improve query performance.
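The idea of pruning by directory can be sketched in a few lines of Python. This is a simplified illustration, not how any particular compute engine implements it. The `key=value` directory naming is a common partitioning convention that we assume here, and the paths and function names are hypothetical.

```python
from typing import Dict, List


def partition_values(path: str) -> Dict[str, str]:
    """Parse 'key=value' directory segments out of a path."""
    values = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            values[key] = value
    return values


def prune(directories: List[str], predicate: Dict[str, str]) -> List[str]:
    """Keep only the directories whose partition values satisfy the predicate."""
    kept = []
    for directory in directories:
        values = partition_values(directory)
        if all(values.get(key) == wanted for key, wanted in predicate.items()):
            kept.append(directory)
    return kept


# Hypothetical partitioned layout.
directories = [
    "sales/year=2018/month=12",
    "sales/year=2019/month=01",
    "sales/year=2019/month=02",
]

# Only the matching directory needs to be scanned; the others are skipped entirely.
selected = prune(directories, {"year": "2019", "month": "02"})
print(selected)
```

Because the decision is made from directory names alone, the engine never has to open the files in the directories that are pruned away.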
(2) Data load performance.
It is sometimes necessary to rename or move files from one directory to another.
With the object store driver, directory operations are not performed as efficiently. Relocating 10,000 files from a Temp directory to their permanent location would need 10,000 rename operations and 10,000 delete operations, resulting in 20,000 calls.
When connecting through the DFS endpoint with a file system like ADLS Gen2, however, this is a metadata-only operation. This leads to dramatically improved data load performance, especially at higher data volumes.
In addition to improving performance, metadata-only operations are more cost-effective in the long run since they use fewer compute engine resources.
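The difference in call counts can be illustrated with a toy model in Python. This is only a sketch: the two in-memory dictionaries stand in for a flat object store and a hierarchical namespace, and the function names and paths are hypothetical. It counts calls the way the 10,000-file example above does.

```python
from typing import Dict


def move_flat(store: Dict[str, bytes], src_prefix: str, dst_prefix: str) -> int:
    """Move every object under src_prefix in a flat store; return the number of calls."""
    calls = 0
    for key in [k for k in store if k.startswith(src_prefix)]:
        store[dst_prefix + key[len(src_prefix):]] = store[key]  # write under the new name
        calls += 1
        del store[key]  # delete the original object
        calls += 1
    return calls


def move_hierarchical(tree: Dict[str, Dict[str, bytes]], src_dir: str, dst_dir: str) -> int:
    """Rename a directory in a hierarchical namespace: a single metadata operation."""
    tree[dst_dir] = tree.pop(src_dir)
    return 1


# Flat namespace: the "directory" exists only as a prefix inside each object name.
flat_store = {f"temp/file{i}.csv": b"" for i in range(10_000)}
flat_calls = move_flat(flat_store, "temp/", "final/")

# Hierarchical namespace: the directory is a real entry that owns its files.
tree = {"temp": {f"file{i}.csv": b"" for i in range(10_000)}}
hierarchical_calls = move_hierarchical(tree, "temp", "final")

print(flat_calls, hierarchical_calls)
```

In the toy model the flat store needs two calls per file, while the hierarchical rename touches one directory entry no matter how many files it contains.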
(3) Atomic operations for data consistency. The object store driver does not provide atomic operations, as seen in the preceding example of 10,000 files to be relocated. In the event of a failure, the data may be left in an inconsistent state. A file system like ADLS Gen2, on the other hand, supports atomic operations via the DFS endpoint, which improves data consistency by ensuring that the entire action succeeds or fails as a unit.
(4) Directory and file-level granular security. The ADLS Gen2 (and Gen1) hierarchical file system is POSIX-compliant. To create granular security, access control lists (ACLs) can be defined at the directory and file level, providing much-needed flexibility for controlling data-level security.
Key Conclusion: The use of the ABFS driver for connectivity, in combination with enabling the hierarchical namespace for an Azure Storage account, allows for file system optimizations that affect performance, data consistency, and security.
4. Feature support in ADLS Gen2 is still evolving
Although ADLS Gen2 has been designated as generally available, a number of planned features are still to be delivered over time. Like most technology companies, Microsoft gets features to market and then iterates on them. Supporting modern data warehouse and advanced analytics scenarios is the first priority for ADLS Gen2.
- The preview for multi-protocol data access was released in July 2019. This allows a number of previously unsupported features to become available. The initial preview is a ‘whitelist,’ in which clients must request access, followed by an open public preview, and then full release, as is normal. It’s also common practise to start with a small number of Azure regions and gradually expand.
Please verify current feature support utilizing these two resources:
Key Conclusion: Multi-protocol data access (as shown in the figure in section 2) is a vital capability that is still evolving. When it arrives, data can be landed using whichever endpoint is required (e.g., to support an unaltered or legacy application or service), and analytical processing can be done on the new endpoint to achieve performance advantages.
5. ADLS Gen2 is the underlying storage for Power BI Dataflows
Dataflows in Power BI are a new capability aimed at reusable, self-service data preparation. The results of queries built in the web-based Power Query Online are stored in ADLS Gen2. The goal is to handle the queries and data preparation once, and then have many Power BI datasets consume the output.
If dataflows are managed entirely by Power BI, the ADLS Gen2 account exists but is visible only through the Power BI dataflows user interface. The ‘bring your own storage’ option (seen below) is suitable for businesses that want to interact with the data in the data lake using tools and compute engines other than Power BI:
Key Conclusion: Power BI dataflows use the ADLS Gen2 storage service, which can be an essential aspect of a self-service business intelligence strategy.
6. There are two levels of security in ADLS Gen2
The two levels of security applicable to ADLS Gen2 were also in effect for ADLS Gen1. Even though this is not new, it is worth calling out the two levels of security because it’s a very fundamental piece to getting started with the data lake and it is confusing for many people just getting started.
(1) Role-based access control (RBAC). RBAC includes built-in Azure roles such as reader, contributor, and owner, as well as custom roles. RBAC is typically assigned for two reasons. One is to specify who can manage the service (i.e., update settings and properties for the storage account). The other is to permit the use of the built-in data explorer tools, which require reader permissions.
(2) Access control lists (ACLs). Access control lists define which data objects a user is allowed to read, write, or execute (execute is required to browse the directory structure). ACLs are POSIX-compliant, so people with a Unix or Linux background will be familiar with them.
Because POSIX does not use a security inheritance model, access ACLs must be specified for each object. Default ACLs are important so that new files in a directory get the right security settings, but they should not be thought of as inheritance. Because assigning ACLs to each object is time-consuming and there is a maximum of 32 ACL entries per object, it is critical to manage data-level security in ADLS Gen1 or Gen2 with Azure Active Directory groups.
Fortunately, the ACLs for both directories and files are enforced regardless of which endpoint is used to access the data through multi-protocol access.
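The following Python sketch illustrates two of the ACL points above: execute is needed on each directory along the path, and a default ACL is copied at creation time rather than inherited. It is a deliberately simplified model, not the service's actual evaluation logic. It assumes a single group entry per object and ignores owners, the mask, and umask, and the group name, paths, and function names are hypothetical.

```python
from typing import Dict, Iterator

# path -> {group name -> permission string such as "r-x"}
Acls = Dict[str, Dict[str, str]]


def parent_directories(path: str) -> Iterator[str]:
    """Yield the root and every directory above `path`, from the top down."""
    parts = path.strip("/").split("/")
    yield "/"
    for depth in range(1, len(parts)):
        yield "/" + "/".join(parts[:depth])


def can_read(access: Acls, group: str, path: str) -> bool:
    """Reading needs execute on every parent directory and read on the item itself."""
    for directory in parent_directories(path):
        if "x" not in access.get(directory, {}).get(group, "---"):
            return False
    return "r" in access.get(path, {}).get(group, "---")


def create_file(access: Acls, defaults: Acls, parent: str, name: str) -> str:
    """Create a file whose access ACL is copied from the parent's default ACL."""
    child = parent.rstrip("/") + "/" + name
    access[child] = dict(defaults.get(parent, {}))  # copied once, at creation time
    return child


access: Acls = {"/": {"analysts": "--x"}, "/sales": {"analysts": "r-x"}}
defaults: Acls = {"/sales": {"analysts": "r--"}}

first = create_file(access, defaults, "/sales", "2019-01.csv")
defaults["/sales"] = {}  # the default ACL is changed after the first file exists
second = create_file(access, defaults, "/sales", "2019-02.csv")

print(can_read(access, "analysts", first))
print(can_read(access, "analysts", second))
```

In this model the first file stays readable after the default ACL changes, because its access ACL was copied when it was created. The second file is not readable, since it was created after the default entry was removed.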
Key Conclusion: There is a lot of flexibility for designing security for ADLS Gen2 with RBAC and ACLs.
7. Planning for ADLS Gen2 involves multiple levels
When planning for a data lake, there are a lot of things to think about, especially if you have a lot of varied data import patterns, different data usage patterns, different sorts of users, and different tools/languages. Some companies want to create a single global data lake, while others choose to use multiple lakes.
With the introduction of ADLS Gen2, there is a new level to consider that did not exist in ADLS Gen1: the file system. A file system in ADLS Gen2 is the counterpart of a container in the blob service. The following levels must be considered during the planning process:
- Account
- File system(s) within an account
- Directory structure within a file system
- Account-level features include region and geo-replication. Multiple storage accounts will be required if there are multiple data residency requirements and/or distinct geo-replication requirements. If you have specialised compute engines (such as HDInsight or Azure Databricks) that are located in a certain region, the best performance will be achieved if the ADLS Gen2 account is also located in that region.
- The hierarchical namespace is enabled at the account level. If there are use cases where the benefits of the hierarchical namespace aren’t required, that data should be stored in a separate storage account.
- For blob storage, immutable policies and shared access policies are established at the container level (so we can expect them to apply at the file system level for an ADLS Gen2-enabled account). Separate file systems may be necessary if various policies are desired.
- In ADLS Gen1, the root for ACLs was at the account level, whereas in ADLS Gen2 the root is at the file system level.
- The integration of Power BI dataflows with the Common Data Model, as explained in section 5, will necessitate one or more file systems.
Key Conclusion: You may need to split data across more than one data lake because of use cases, access restrictions, or cost considerations (see section 8). When planning, keep in mind that the file system is a new level with its own set of attributes.
8. Pricing for ADLS Gen2 is almost as economical as object storage
Object storage, such as Azure blob storage, is well known for being economical. Microsoft has priced ADLS Gen2 the same as Azure blob storage in terms of direct storage costs (i.e., block blob pricing). There is no concept of reserving a set size; you just pay for the storage that you utilise.
However, transaction costs are slightly higher for storage accounts with the hierarchical namespace enabled. Transaction prices are typically quoted per 10,000 operations.
Please refer to the official documentation and the online pricing calculator for more complete pricing details. The FAQs section for ADLS Gen2 pricing has an excellent practical example which contrasts pricing for the flat namespace (i.e., block blob storage) and the hierarchical namespace (i.e., ADLS Gen2).
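As a rough illustration of how per-10,000 transaction pricing plays out, here is a small Python sketch. All prices in it are hypothetical placeholders chosen for easy arithmetic, not actual Azure rates. It also ignores metadata storage and any differences between operation types, so treat it as a way to reason about the shape of the cost rather than as a calculator.

```python
def monthly_cost(
    storage_gb: float,
    operations: int,
    storage_price_per_gb: float,
    price_per_10k_operations: float,
) -> float:
    """Storage cost plus transaction cost, with transactions priced per 10,000."""
    storage = storage_gb * storage_price_per_gb
    transactions = operations / 10_000 * price_per_10k_operations
    return storage + transactions


# Hypothetical prices, for illustration only; use the pricing page for real figures.
STORAGE_PRICE_PER_GB = 0.02   # assumed identical for both namespace types
FLAT_PRICE_PER_10K = 0.05     # flat namespace (block blob)
HNS_PRICE_PER_10K = 0.065     # hierarchical namespace, assumed slightly higher

flat = monthly_cost(1_000, 2_000_000, STORAGE_PRICE_PER_GB, FLAT_PRICE_PER_10K)
hns = monthly_cost(1_000, 2_000_000, STORAGE_PRICE_PER_GB, HNS_PRICE_PER_10K)

print(f"flat namespace: {flat:.2f}")
print(f"hierarchical namespace: {hns:.2f}")
```

With these placeholder numbers the storage portion is identical and only the transaction portion differs, which mirrors the pricing structure described above.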
Key Conclusion: When the hierarchical namespace is enabled for a storage account, the transaction and metadata storage costs are greater, while the storage costs remain the same. Even though transaction costs are still very low, workloads that will never use the hierarchical namespace (HNS) capabilities should be stored in a storage account that does not have HNS enabled.
9. Azure Data Lake Analytics and U-SQL have an uncertain future
The initial Azure services supported by ADLS Gen2 via the ABFS driver include:
- Azure Databricks
- Azure HDInsight
- Azure Data Factory
- Azure SQL Data Warehouse (PolyBase)
When you consider that U-SQL within Azure Data Lake Analytics (ADLA) isn’t one of the first services to be supported by the improved ABFS driver, it’s clear where we should be betting. Although Microsoft hasn’t revealed its plans for ADLA, we’ve noticed that open source technologies like Spark appeal to a broader client base than proprietary tools and languages.
We would advise any customers considering using ADLA on future projects to proceed with caution.
Key Conclusion: There is currently no serverless (pay-per-use) solution to run queries against ADLS Gen2. Direct querying capabilities are currently provided by Azure Databricks and HDInsight.
10. ADLS Gen1 will be supported for quite some time
All indications are that ADLS Gen1 will not be phased out anytime soon. There is no need to be concerned if you have a large ADLS Gen1 implementation.
There are various upgrade options if you want to move from ADLS Gen1 to ADLS Gen2. A few significant considerations are as follows:
- Because there is presently no migration tool available, migrating data via Azure Data Factory is the simplest approach to execute a one-time data migration.
- If you have any files in ADLS Gen1 that are larger than 5TB, you’ll need to split them up before migrating.
- Any references to the adl:// addressing scheme must be updated to use abfs[s]:// connectivity, the new REST APIs, and/or the new SDKs instead.
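Updating the addressing scheme is largely a mechanical rewrite, which the following Python sketch illustrates. The account names, the target file system name `lake`, and the function name are hypothetical. Because ADLS Gen1 has no file system level, the target file system has to be chosen as part of the migration.

```python
from urllib.parse import urlparse

GEN1_HOST_SUFFIX = ".azuredatalakestore.net"


def gen1_to_gen2_uri(adl_uri: str, gen2_account: str, file_system: str) -> str:
    """Rewrite an adl:// URI as an abfss:// URI for a chosen account and file system."""
    parsed = urlparse(adl_uri)
    if parsed.scheme != "adl" or not parsed.netloc.endswith(GEN1_HOST_SUFFIX):
        raise ValueError(f"not an ADLS Gen1 URI: {adl_uri}")
    # The Gen1 path is kept as-is and placed under the chosen Gen2 file system.
    return f"abfss://{file_system}@{gen2_account}.dfs.core.windows.net{parsed.path}"


# Hypothetical account and file system names, for illustration only.
old_uri = "adl://contosogen1.azuredatalakestore.net/raw/sales/2019.csv"
new_uri = gen1_to_gen2_uri(old_uri, gen2_account="contosogen2", file_system="lake")
print(new_uri)
```

A helper like this only covers URI strings in configuration and job definitions. Code that calls the Gen1 REST APIs or SDKs directly still needs to be ported separately.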
Key Conclusion: It is not necessary to migrate from ADLS Gen1, although you should do so if it is practical. If there are no feature gaps, new implementations should use ADLS Gen2.