Treasure Data’s Plazma: Columnar Cloud Storage
Last updated June 21, 2013

Treasure Data was built by Hadoop experts. We get Hadoop, and in many ways it's part of our core. As we built out the platform, we saw that the storage layer needed to be multi-tenant, elastic, and easy to manage while retaining scalability and efficiency. This led us to create Plazma, our own distributed columnar storage system, in place of HDFS. We wanted to leverage the "store everything now, analyze later" model of our schema-less architecture and deliver better performance in both storage and query processing.
By separating Hadoop's MapReduce processing engine from the storage layer, we could optimize the elasticity, efficiency, and reliability of each independently. Making the system more modular also let us store data in a columnar format, so queries read only the relevant columns instead of scanning the whole dataset. Plazma lets us process queries faster, manage databases more simply, and make better use of our schema-less database architecture.
We achieved our technical goals by architecting Plazma in the following ways:
- JSON processing: automatically converts row-based JSON objects into a columnar format
- Columnar storage: uses a columnar file storage format which significantly reduces disk IO for analytical queries
- IO optimizations: implements various IO optimizations such as parallel pre-fetch and background decompression
- Scalability and ease of management: Plazma is built on top of object-based storage, which is easier to scale and maintain than HDFS
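To make the first two points concrete, here is a minimal sketch, in Python, of what pivoting row-based JSON into per-column storage looks like and why it cuts IO. This is purely illustrative pseudocode of the general technique, not Plazma's actual format or API; all names are hypothetical.

```python
import json
import zlib

# Hypothetical example: row-based JSON records, as an app would log them.
rows = [
    {"time": 1371772800, "path": "/", "code": 200},
    {"time": 1371772801, "path": "/about", "code": 404},
    {"time": 1371772802, "path": "/", "code": 200},
]

# Pivot rows into columns. Schema-less input: take the union of all keys,
# filling missing values with None.
all_keys = sorted({k for r in rows for k in r})
columns = {key: [row.get(key) for row in rows] for key in all_keys}

# Compress each column independently. Values in one column tend to be
# similar, so per-column compression is effective.
stored = {name: zlib.compress(json.dumps(vals).encode())
          for name, vals in columns.items()}

# An analytical query over "code" decompresses only that column,
# never touching the bytes of "path" or "time" -- the IO saving
# that columnar layout buys.
codes = json.loads(zlib.decompress(stored["code"]).decode())
print(sum(1 for c in codes if c == 200))
```

A real columnar engine adds typed encodings, chunking, and metadata on top of this idea, but the core IO win is the same: a query touches only the columns it references.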
These are some of the key innovations we made with Plazma to optimize query processing and storage and give us a more efficient distributed storage system. Some companies argue that building on HDFS lets a business benefit from open source innovation. For our purposes, however, Plazma is far more efficient at query processing, and separating the processing and storage layers improves both performance and manageability.
While this technology is currently proprietary to Treasure Data, we have discussed open sourcing it to provide developers with the tools they need for efficient distributed storage systems meant for big data analytics processing.
What do you think? Would you find this kind of technology useful and would you be interested in using it? Leave your thoughts in the comments.