Continuous Optimization for Distributed Big Data Analysis (Databases and Storage Systems)
I am a software engineer at Treasure Data Inc., where I develop and maintain the distributed processing platform behind our service. My specialty is distributed computing using open source software such as:
- Hadoop;
- Spark;
- Presto.
In addition, I am a committer of Apache Hivemall, a scalable machine learning library running on Hive/Spark.
The performance of distributed data analysis depends heavily on the data structure, such as the file format. In practice, we also need to consider the application workload. We developed a technology that improves OLAP performance by continuously optimizing the storage structure to fit the application workload.
The performance of distributed data analysis depends heavily on the data structure. The main factors that affect performance are:
- File format.
- Metadata statistics.
- Network bandwidth.
There is a lot of research and technology aimed at optimizing these factors. For example, we have several de facto standard columnar file formats such as ORC and Parquet. They are optimized for OLAP processing, and their metadata enables us to skip partitions and row groups that a query does not need.
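To make this skipping concrete, here is a minimal, illustrative sketch (Python with pyarrow, flat schema assumed; not part of our platform) of how Parquet row-group min/max statistics can be used to prune row groups that cannot match an equality predicate; the file name and column are placeholders.

```python
import pyarrow.parquet as pq

def row_groups_matching(path, column, value):
    """Return indices of row groups whose [min, max] range may contain `value`."""
    pf = pq.ParquetFile(path)
    col_idx = pf.schema_arrow.get_field_index(column)
    candidates = []
    for rg in range(pf.metadata.num_row_groups):
        stats = pf.metadata.row_group(rg).column(col_idx).statistics
        if stats is None or not stats.has_min_max:
            candidates.append(rg)            # no statistics: cannot skip safely
        elif stats.min <= value <= stats.max:
            candidates.append(rg)            # value may be inside this row group
    return candidates

# Read only the row groups that can contain user_id == 42 (file name is a placeholder).
# pf = pq.ParquetFile("events.parquet")
# table = pf.read_row_groups(row_groups_matching("events.parquet", "user_id", 42))
```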
But we found that in practical use cases the best storage structure also depends on the application workload. The best storage layout for one user is not always the best one for another user. This means we need to take use cases and application workloads into consideration when optimizing the storage system. What we needed to do (sketched in code after the list) was:
1. Collect accurate metrics of storage access (access rate of each partition, column, and so on).
2. Decide the target tables based on the metrics from step 1.
3. Then continuously optimize the user's storage system.
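A minimal sketch of steps 1 and 2 under an assumed query-log format, in which each record lists the table, partitions, and columns a query touched (the field names are hypothetical, not our actual schema):

```python
from collections import Counter

def pick_optimization_targets(query_log, top_n=10):
    """Rank tables by how often their partitions are scanned."""
    partition_hits = Counter()   # (table, partition) -> scan count
    column_hits = Counter()      # (table, column)    -> reference count
    table_hits = Counter()       # table              -> total partition scans
    for q in query_log:
        for p in q["partitions"]:
            partition_hits[(q["table"], p)] += 1
            table_hits[q["table"]] += 1
        for c in q["columns"]:
            column_hits[(q["table"], c)] += 1
    targets = [t for t, _ in table_hits.most_common(top_n)]
    return targets, partition_hits, column_hits
```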
We call this process Continuous Optimization. In Continuous Optimization, we reorganize the storage structure to fit the application workload. Concretely, we reorganize the partitions and the metadata of a table using the application workload metrics. This reduces the average storage access cost and thereby improves query performance.
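As one illustration of such a reorganization, rewriting a table clustered and sorted by its most frequently filtered column tightens the per-file min/max ranges, so later queries can skip more data. The PySpark sketch below is only illustrative; the paths, column name, and partition count are assumptions, not our production code.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("continuous-optimization-rewrite").getOrCreate()

hot_column = "event_time"                      # e.g. the most frequently filtered column
df = spark.read.parquet("s3://bucket/table/")  # current layout
(df.repartition(64, hot_column)                # cluster rows with similar values together
   .sortWithinPartitions(hot_column)
   .write.mode("overwrite")
   .parquet("s3://bucket/table_optimized/"))   # new layout, swapped in via a metadata update
```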
Since the storage files are often huge, we make use of the Hadoop infrastructure that also powers our cloud service. This lets us ensure scalability when improving a user's storage without any additional instance cost. At the same time, we achieve high cluster utilization by improving the scheduling algorithm of the Continuous Optimization jobs.
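One possible scheduling policy in this spirit, sketched purely for illustration (an assumption, not the actual algorithm), is to launch Continuous Optimization jobs only while the shared cluster has spare capacity, so user queries are not slowed down:

```python
import time

def run_when_idle(candidate_tables, cluster_utilization, optimize, threshold=0.7):
    """Drain the optimization queue opportunistically."""
    queue = list(candidate_tables)
    while queue:
        if cluster_utilization() < threshold:
            optimize(queue.pop(0))   # e.g. submit a compaction/rewrite job
        else:
            time.sleep(60)           # back off while the user workload is heavy
```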
Last but not least, data consistency is one of the most important factors in OLAP. While optimizing the storage, we ensure the consistency of the dataset by using transactional metadata updates. We developed an RDBMS-based distributed storage system that guarantees no data loss and no data duplication, even under distributed update semantics. We talked about this storage system in a previous talk; see page 24 of those slides.
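The core idea can be sketched as an all-or-nothing metadata swap. The example below uses SQLite purely for illustration (the real metadata store is a different RDBMS, and the schema here is hypothetical): readers see either the old partition set or the new one, never a mixture.

```python
import sqlite3

def commit_reorganized_partitions(db_path, table_id, old_partition_ids, new_partitions):
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # a single transaction: commit on success, roll back on error
            conn.executemany(
                "DELETE FROM partitions WHERE table_id = ? AND partition_id = ?",
                [(table_id, pid) for pid in old_partition_ids])
            conn.executemany(
                "INSERT INTO partitions (table_id, path, num_rows) VALUES (?, ?, ?)",
                [(table_id, p["path"], p["num_rows"]) for p in new_partitions])
    finally:
        conn.close()
```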
So the key points of Continuous Optimization are:
- analysis of application workload;
- scalability and high cluster utilization;
- distributed update semantics.
I’ll talk about the architecture and implementation to achieve these goals.