Hadoop and S3: 6 Tips for Top Performance

mortardata:


Netflix kicked off the first session at this summer’s Hadoop Summit, telling the crowd about the Hadoop stack that powers its world-renowned data science practice. The punchline: Netflix runs everything on the Amazon Web Services cloud, using Amazon S3, Elastic MapReduce (EMR), and its own platform-as-a-service, Genie.

Putting S3 at the base of your Hadoop strategy, as Netflix and Mortar have, catapults you past many of the Hadoop headaches others will face. No running out of storage unexpectedly: S3 gives you (essentially) infinite, low-cost storage, with frequent price cuts. No need to worry about your data: Amazon estimates they might lose one of your objects every 10 million years or so. And best of all, no waiting in line behind other people’s slow jobs: spin up your own personal cluster whenever you want and point it at the same underlying S3 files.

A lot of these benefits come directly from S3. It’s a pretty magical technology, and we use it extensively at Mortar. Along the way, we’ve learned some tricks to get the best performance out of it in conjunction with Hadoop. I’m going to share those with you now; some can improve your performance 10X or more.
