A typical “Big Data” architecture, covering most of the components involved in a data pipeline.
This diagram and the corresponding blog entry (found at VenuBlog.com) present a straightforward architecture for a typical Big Data endeavor.
Is it complete or perfect? Of course not! Would I change anything? Duh!
Until requirements have been communicated, collected and analyzed every which way from Sunday, anything we present is going to be an academic guess at best. What this picture gives us is a point from which we can begin a conversation.
The economics of flash memory are staggering. If you’re not using SSD, you are doing it wrong.
Not quite true, but close. Some small applications fit entirely in memory – less than 100GB – great for in-memory solutions. There’s a place for rotational drives (HDD) in massive streaming analytics and petabytes of data. But for the vast space between, flash has become the only sensible option.
If you have a dataset under 10TB, and you’re still using rotational drives, you’re doing it wrong. The new low cost of flash makes rotational drives useful only for the lightest of workloads.
These are very bold statements. I agree with them.
Switch your databases to Flash storage. Now. Or you’re doing it wrong.
Via: High Scalability
In a traditional RDBMS, the data is first written to the database, then to memory. When memory reaches a certain threshold, it is written to the logs. The log files are used for recovery in case of a server crash. In an RDBMS, before returning success on an insert or update to the client, the data has to be validated against the predefined schema, indexes have to be updated, and so on, which makes it slower than the NoSQL approach discussed below.
In a NoSQL database like HBase, the data is first written to the log (WAL), then to memory. When memory reaches a certain threshold, it is written to the database. Before returning success for a put call, the data only has to be written to the log file; it does not need to be written to the database or validated against a schema.
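The WAL-first flow described above can be sketched in a few lines of Python. This is a toy illustration under assumed names (`WalStore`, `memstore`, `flushed`), not the HBase API: a put is acknowledged as soon as it is durable in the log, while the main data structure is only updated lazily when the in-memory buffer fills up.

```python
import json

class WalStore:
    """Toy sketch of a WAL-first key-value store in the HBase style.
    All names here are illustrative assumptions, not real HBase classes."""

    def __init__(self, wal_path, flush_threshold=3):
        self.memstore = {}        # in-memory buffer (like HBase's MemStore)
        self.flushed = {}         # stands in for the on-disk data files
        self.flush_threshold = flush_threshold
        self.wal = open(wal_path, "a")

    def put(self, key, value):
        # 1. Append the mutation to the write-ahead log and force it out.
        #    Once this returns, the write is durable, so success can be
        #    acknowledged without touching the main data files at all.
        self.wal.write(json.dumps({"k": key, "v": value}) + "\n")
        self.wal.flush()
        # 2. Update the in-memory buffer.
        self.memstore[key] = value
        # 3. Flush to the "database" only when the buffer is large enough.
        if len(self.memstore) >= self.flush_threshold:
            self.flushed.update(self.memstore)
            self.memstore.clear()

    def get(self, key):
        # Reads consult the memstore first, then the flushed data, so a
        # just-written value is visible even before it is flushed.
        return self.memstore.get(key, self.flushed.get(key))
```

Note that `get` checks the in-memory buffer before the flushed data; skip that step and you get exactly the run-to-run inconsistency the next paragraph warns about.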
This blog post very simply explains a good part of the why behind NoSQL performance, and also exposes the ACID implications.
Unless queries are resolved through the log, the results of a given query will not be consistent from run to run; consistency is traded away for performance.
RDBMS vs NoSQL Data Flow Architecture
Via: Hadoop Tips