Massive data analysis in cloud-scale datacenters plays a crucial role in making critical business decisions and improving quality of service. High-level scripting languages free users from understanding various system trade-offs and complexities, support a transparent abstraction of the underlying system, and provide the system great opportunities and challenges for query optimization. Data shuffling is the most expensive operation in such an environment. Its judicious placement and implementation techniques play a vital role in the effectiveness and efficiency of cloud-scale query execution. We describe several advanced partitioning techniques to significantly improve data shuffling efficiency and integrate such complex reasoning into the query optimizer to generate much more efficient query plans. The system intelligently exploits the input data properties and performs partial partitioning by moving only a small subset of the input dataset whenever possible. A novel index-based partitioning strategy is also used for the system to efficiently support a massive data partitioning operation that generates thousands of partitions. The techniques are incorporated in Scope, running over data clusters of tens of thousands of machines, and have proven to be effective, greatly improving query performance for a wide range of real-world jobs.
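To make the idea of partial partitioning concrete, the following is a minimal sketch, not Scope's actual implementation. It assumes the input is already range-partitioned on an integer key and that the required partitioning differs only in its split points, so only rows whose target partition changes need to move. The names `partition_of` and `partial_repartition` and the sample data are hypothetical and chosen for illustration only.

```python
from bisect import bisect_right


def partition_of(key, split_points):
    """Return the index of the range partition that holds `key`.

    `split_points` is a sorted list of exclusive upper bounds, so
    split_points [120, 200] describes (-inf,120), [120,200), [200,+inf).
    """
    return bisect_right(split_points, key)


def partial_repartition(partitions, new_split_points):
    """Re-range-partition rows, counting only the rows that change partition.

    `partitions[i]` holds the keys currently stored in partition i.
    Returns (new_partitions, moved), where `moved` is the number of rows
    whose target partition differs from their current one.
    """
    new_partitions = [[] for _ in range(len(new_split_points) + 1)]
    moved = 0
    for current, rows in enumerate(partitions):
        for key in rows:
            target = partition_of(key, new_split_points)
            if target != current:
                moved += 1  # only these rows would cross the network
            new_partitions[target].append(key)
    return new_partitions, moved


# Hypothetical input, range-partitioned with split points [100, 200].
old_partitions = [[5, 50, 99], [100, 110, 150], [200, 250]]

# The required partitioning shifts the first boundary from 100 to 120.
new_partitions, moved = partial_repartition(old_partitions, [120, 200])
total = sum(len(rows) for rows in old_partitions)
print(f"moved {moved} of {total} rows")
```

In this toy case only the rows with keys 100 and 110 change partition, whereas a full shuffle that ignores the existing partitioning would treat all eight rows as candidates for movement.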
Wrong, it was supposed to be translated into Chinese =.= I've gotten confused with all this translating.