This is an all to common, and unfortunate, occurrence. So long as Excel dominates, this is the reality with which we will have to live. Ultimately, it’s the user needing to do advanced or in depth Analytics that loses.
The basic gist of this article is that the exercise of data modeling is just as important when using the big data and NoSQL technologies as it is when using the more traditional relational algebra based technologies.
This conclusion came after a series of experiments were performed pitting Cloudera’s Hadoop distribution against an unidentified ‘major relational database’. A suite of 5 business questions were distilled into either SQL for the relational database, or HQL for execution against Hadoop stacked with Hive. For each of the queries, for each data store, 5 experimental scenarios were explored:
- Flat Schema vs. Star Schema
- Using compressed data vs. uncompressed in Hadoop
- Indexing appropriate columns
- Partitioning the data by date
- Compare Hive/HQL/MapReduce to Cloudera Impala
Details of the experiment and intermediate results can be found in the article, but at a macro level, the results were mixed with the exception of it being clear that a flat un-modeled schema was not a scenario one should use and expect performance. As the article points out, the question is not whether one should model or not, but rather how and when.
Tamara Dull, Director of Emerging Technologies, SAS Best Practices wrote the following at SAS’ “The Data Roundtable” blog:
It was refreshing to see that the RDBMS skills some of us have developed over the years still apply with these new big data technologies. And while discussions of late binding (i.e., applying structure to the data at query time, not load time) work their way down our corporate hallways, we are reminded once again that “it depends” is a far more honest and accurate answer than it’s been given credit for in the past.
To model or not to model is no longer the question. How and when to model is.
To recap, version 1 of Hadoop is made up of two basic components; the foundation is a fault-resilient distributed file system called the Hadoop Distributed File System (HDFS), upon which a framework for the parallel processing of that distributed data, called MapReduce, is built. In this post we’ll start to look into the components, architecture and key configuration options of version 2 of the Hadoop Distributed File System (HDFS-V2). This will set us up to learn about the makeup and workings of MapReduce version 2 (MRV2) and the new resource management system, YARN (Yet Another Resource Navigator).
The Hadoop development community has rallied and worked diligently to make version 2 of HDFS a much more scalable, efficient and enterprise-friendly storage platform. This has been accomplished by focusing on and addressing the key functionality areas of:
- High Availability
- NameNode Performance and Scalability
- Backup and Recoverability
Almost all of the significant changes between HDFS V1 and V2 have to do with the NameNode. With that in mind, this post will take a dive into the functionality of the NameNode and a look into the data on which it operates.
HDFS NameNode Availability
It is no secret that the high availability Achilles heel of HDFS is the single point of failure expressed by the NameNode. If the NameNode goes down, the entire associated Hadoop cluster is rendered unavailable. Further, a worse case is realized with the unrecoverable loss of the NameNode server. This would result in the loss of all of the metadata necessary to rebuild access to the data stored on the cluster resulting in potentially significant data loss. Because of the potential for data loss, a number of mechanisms exist to mitigate this potential.
NameNode MetaData and Operation
Before discussing the mechanisms available enhance the NameNode operation, we need to have an understanding of what it is the NameNode does, how it does it and the data on which it does it’s do.
NameNode Directory Structure
A newly initialized NameNode creates the following directory structure for persistent storage of filesystem state and metadata:
$(dfs.name.dir) |— current/ |— VERSION |— edits |— fsimage |— fstime
Note: $(dfs.name.dir) is a configuration variable used to specify a list of directories to be used for persistent storage. It is typical to configure multiple directories to be used for the persistent storage of the NameNode so that in the case of a hardware failure, a copy of the persistent data is saved and available to be used for recovery. A common configuration is to locate a directory on two independent local disks and one remote directory via a network share (for a total of three copies of the NameNode persistent data).
The VERSION file is a java properties file which contains various information about the version of Hadoop and HDFS being used. Contents of a typical file might look as follows:
#Tue Feb 25 08:35:40 GMT 2014 namespaceID=142527538 cTime=0 storageType=NAME_NODE layoutVersion=-22
The namespaceID is a unique identifier for the filesystem. It is created when the filesystem is initialized. The namenode uses this to identify new DataNodes. New DataNodes will not know the namespaceID of the filesystem until they have registered with the NameNode.
The creation time of the NameNode’s storage directory is held in the cTime property. Newly initialized storage always has the value, zero. It is updated to a timestamp whenever the filesystem is upgraded.
This storageType (NAME_NODE) indicates this directory contains data for a NameNode.
The layoutVersion is a negative number that identifies the version of HDFS’s data structures. This version number has no relation to the release number of Hadoop. When file layouts change, the version number is decremented (for example, the version after −22 is −23) and HDFS is required to be upgraded. NameNodes and DataNodes will not operate without there being matching and correct layoutVersion numbers throughout the system.
fsimage and edits Files
The fsimage and edits files are binary files that use Hadoop
Writable objects as their serialization format for persistent storage.
As the NameNode processes client requests to create, delete and move directories and files (essentially, write operations) it maintains the state and associated metadata of the directories and files in an in-memory version of the fsimage file (presumably a mnemonic for File System Image). The in-memory version of fsimage is not written to persistent storage when a change is made to it as it has the potential to be very large (hundereds of megabytes to gigabytes) and would take significant time to write.
The fsimage file contains a list of all the directory and file inodes in the filesystem. The data maintained for each inode is such information as the file’s replication level, modification and access times, access permissions, block size, and the blocks a file is made up of. For directories, data about modification time, permissions, and quota is stored.
The fsimage file does not record information about the datanodes on which the blocks are stored. Instead, the namenode keeps this mapping in memory only. This information is constructed by asking the datanodes for their block lists when they join the cluster and periodically afterward to ensure the namenode’s block mapping is up-to-date.
Prior to making any change to the in-memory version of fsimage, a write-forward record of the changes to be made are written to the edits file, essentially as a form of transaction logging. If the NameNode has been configured to maintain multiple persistent store directories, the writes to the edits files must be flushed and synced to all configured copies before returning a notice of success. This is done to ensure that no data is lost due to a machine failure. The configuration option of specifying multiple persistent store directories for the NameNode is one of the mechanisms available to mitigate the potential of data loss due to a catastrophic NameNode failure.
The fstime file is also a binary file making use of the Hadoop
Writable object for serialization for persistent storage. The primary use of the fstime file is to record the time of the last checkpoint operation of the fsimage and edits files.
In the most basic of configurations, a checkpoint operation of the fsimage and edits files occurs at NameNode startup as part of its ‘safe mode’ sequence of operations. The NameNode self-checkpoint process is essentially the following sequence of events:
- Freeze the file system for any write operations- read-only mode
- Load fsimage from persistent storage into memory
- Apply Edit Log items to fsimage
- Write updated, checkpointed fsimage to persistent storage (and replicate to multiple locations if so configured)
- Purge edits file (and replicate to multiple locations if so configured)
- Write timestamp of the checkpoint operation to the fstime file
Before the NameNode may exit ‘safe mode’ it must receive information about the DataNode block mapping a from a minimum number of Datanodes.
The downside to this method of operation is the that it can take a significant amount of time to roll the Edit Log forward and checkpoint the fsimage file. Depending on the amount of file system activity and the length of time since the last checkpoint, it is not unheard of for the checkpoint process to take several hours. An obvious way to reduce the amount of time for NameNode checkpoint to run is to execute a checkpoint of the fsimage file on a periodic basis. This is exactly the purpose and operation of the Secondary name node.
Secondary NameNode Checkpointing
As mentioned, the Secondary NameNode facilitates the online checkpointing of the fsimage maintained by the Primary NameNode. It does not act as a second or a backup NameNode despite its name. Periodically (the default is once per hour, or whenever the size of the edits file exceeds 64MB), the Secondary NameNode will send a message to the Primary NameNode to initiate a checkpoint. This in turn sets off the following sequence of events:
- Primary NameNode: Freeze writing to the current Edit Log
- Primary NameNode: Initialize a new Edit Log
- Secondary NameNode: HTTP GET Frozen Edit Log
- Secondary NameNode: HTTP GET fsimage
- Secondary NameNode: Load fsimage into memory
- Secondary NameNode: Apply Edit Log items to fsimage
- Secondary NameNode: Write fsimage to local persistent storage
- Secondary NameNode: HTTP PUT checkpointed fsimage file
- Primary NameNode: Roll Edit Log forward
- Primary NameNode: Roll checkpointed fsimage file forward
- Primary NameNode: Write checkpoint timestamp to fstime file
- High Availability for the HDFS NameNode
- By using Quorum Journal Manager (QJM)
- By using Network File System (NFS)
- Namespace Federation for Performance and Scaling
- HDFS Snapshots to support Backup and Recoverability
Hadoop Distributed File System: Version 2 – Part I by Mike Pluta is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.