Yan Chen (yanchen@cs.berkeley.edu)
March 1, 2000
ABSTRACT
The bakeoff required us to build, with the help of libgist and amdb, an access method (index structure) that beats the R*-tree on a given workload. The characteristics of the dataset and queries were initially unknown. In this report we discuss how we analyzed the workload and the design process we used to produce a result that is 75% faster, measured in leaf-level I/O's, than the R*-tree implementation. In fact, our access method requires only 43% more I/O than the theoretical minimum. We then suggest some ways to reduce the remaining overhead even further. Finally, we demonstrate that if key precision is flexible, performance departs drastically from the R*-tree: our index is 160% faster in leaf-level I/O's than the R*-tree and requires only 29% more I/O than the theoretical minimum. The access method analysis and visualization features provided by libgist and amdb were very valuable to our work.
DATASET AND WORKLOAD ANALYSIS
The workload we were given consists of 85,000 ordered insertions of data items with 5-dimensional keys, and 2,125 200-nearest neighbor queries on the data set. To optimize towards this workload, our first task was to understand it. R*-tree gave the following results under AMDB workload analysis:
[Figure: amdb workload analysis of the R*-tree. (b) The leaf I/O statistics of the R*-tree.]
The bounding rectangles clearly overlap badly in the two dimensions that amdb displays. The figures and numbers above also show a large clustering loss in the amdb analysis. Our task, then, was to cluster these data points better and to minimize the overlap between the rectangles.
Spreadsheet analysis gave us some more information. Dataset key statistics:

Dimension   Max Interval      Mean       Stdev      66% interval        95% interval
X-d         [0.367, 1.029]    0.997478   0.03632    [0.961, 1.033]      [0.925, 1.070]
Y-d         [-0.379, 1.029]   -0.13042   0.091197   [-0.221, -0.039]    [-0.312, 0.052]
Z-d         [-0.267, 0.411]   0.193732   0.094833   [0.099, 0.289]      [0.004, 0.383]
K-d         [-0.413, 0.400]   -0.15699   0.129915   [-0.287, -0.027]    [-0.416, 0.103]
M-d         [-0.315, 0.338]   -0.11218   0.162924   [-0.275, 0.051]     [-0.438, 0.214]

Query point statistics:

Dimension   Max Interval      Mean       Stdev      66% interval        95% interval
X-d         [0.576, 1.029]    0.999038   0.03322    [0.966, 1.032]      [0.933, 1.065]
Y-d         [-0.372, 0.251]   -0.13414   0.09346    [-0.227, -0.041]    [-0.321, 0.052]
Z-d         [-0.200, 0.405]   0.19515    0.095926   [0.099, 0.291]      [0.003, 0.387]
K-d         [-0.408, 0.303]   -0.15884   0.121521   [-0.280, -0.037]    [-0.402, 0.084]
M-d         [-0.315, 0.329]   -0.10929   0.162467   [-0.272, 0.053]     [-0.434, 0.216]
To summarize, we found that the workload consists of 85,000 data points grouped in a tight range in each dimension, with the exception of a few outliers (which are responsible for ballooning some bounding predicates). In fact, the 95% intervals of both the dataset and the query set lie within tight bounds. Finally, we would need to consider a parameter other than distance alone.
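For reference, statistics like those tabulated above can be computed with a short routine such as the following (a C++ sketch; the actual analysis was done in a spreadsheet, and the names here are ours):

```cpp
#include <cmath>
#include <vector>

struct DimStats {
    double min, max, mean, stdev;
};

// Min / max / mean / standard deviation for one key dimension.
// The 66% and 95% intervals in the tables are roughly
// [mean - stdev, mean + stdev] and [mean - 2*stdev, mean + 2*stdev].
DimStats computeStats(const std::vector<double>& v) {
    DimStats s = {v[0], v[0], 0.0, 0.0};
    for (double x : v) {
        if (x < s.min) s.min = x;
        if (x > s.max) s.max = x;
        s.mean += x;
    }
    s.mean /= v.size();
    for (double x : v) s.stdev += (x - s.mean) * (x - s.mean);
    s.stdev = std::sqrt(s.stdev / v.size());
    return s;
}
```

Running this once per dimension over the 85,000 keys (and again over the query points) yields the intervals above.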
Based on the above analysis, we proceeded to modify our R*-tree implementation.
APPROACHES AND RESULTS
1. Are they clustered? - Penalty()
Our first glance at the spreadsheet analysis and the 2-D projection in the amdb view gave us the impression that the data are highly clustered: they have very small standard deviations in the X, Y, and Z dimensions. So we tried to cluster them.
We wrote a Penalty() method that tried to minimize the distance from the point to the rectangle instead of minimizing the area enlargement or overlap. It turned out to cause the rectangles to grow larger and larger, and eventually produced the bad results shown below:
[Figure: result of the distance-based Penalty()]
Our goal was difficult: minimize the bounding-predicate overlap at the leaf level, knowing that the dataset/workload is tightly bounded. To achieve it we had to take into account our workload analysis and the failure above (distance from the new point to the center of a bounding predicate would not work). As a result, we decided to focus on the penalty function and minimize the penalty of an insertion.
To minimize the penalty of an insertion, we had to change the R*-tree penalty implementation from considering only the local consequences of an insert to considering its global consequences. Specifically, the original R*-tree penalty implementation considered only the internal node whose area would increase the least; it did not consider the overlap and area increase of the actual leaf that the new point would be inserted into. This observation led us directly to modify the R*-tree penalty implementation: when page.level() > 2, we pick a tentative page for the new point if and only if its area enlargement and overlap enlargement are smaller than the current minimum area and overlap enlargements.
This change to the penalty function almost halved the I/O's at the leaves: from a total of 80,777 I/O's to approximately 49,000 I/O's. We were well on our way toward the optimal of 26,130 I/O's.
The change worked so well because we did not restrict ourselves to the internal node whose area would increase the least (as in the original R*-tree); instead we considered the actual leaf node whose area and overlap would increase the least. Once we noticed that considering the leaves had such a positive effect, we wanted to consider a combination of the leaves and internal nodes, so we applied the same criterion at both levels: we insert a new point along the root-to-leaf path whose set of area and overlap enlargements increases the least. This brought the performance of our index further down, to approximately 46,000 I/O's, even closer to the optimal of 26,130.
Penalty(penaltyNode, &overlapReturned, &areaReturned)
{
    if (current_level == 2)  // children are leaf nodes
    {
        // return the minimum area and overlap enlargement of the
        // page the new point might be inserted into
        return overlapReturned and areaReturned;
    }
    minOverlapEnlargementLowerLevel   = MAXDOUBLE;
    minAreaEnlargementLowerLevel      = MAXDOUBLE;
    minOverlapEnlargementCurrentLevel = MAXDOUBLE;
    minAreaEnlargementCurrentLevel    = MAXDOUBLE;
    for each child N of penaltyNode
    {
        areaEnlargementCurrentLevel    = computeAreaEnlargement(N, newPoint);
        overlapEnlargementCurrentLevel = computeOverlapEnlargement(N, newPoint);
        Penalty(N, overlapReturned, areaReturned);  // recurse into the subtree
        if (overlapReturned <= minOverlapEnlargementLowerLevel &&
            areaReturned <= minAreaEnlargementLowerLevel &&
            overlapEnlargementCurrentLevel <= minOverlapEnlargementCurrentLevel &&
            areaEnlargementCurrentLevel <= minAreaEnlargementCurrentLevel)
        {
            minOverlapEnlargementLowerLevel   = overlapReturned;
            minAreaEnlargementLowerLevel      = areaReturned;
            minOverlapEnlargementCurrentLevel = overlapEnlargementCurrentLevel;
            minAreaEnlargementCurrentLevel    = areaEnlargementCurrentLevel;
            bestChild = N;
        }
    }
    return bestChild;
}
PickSplit()
1. Minimum Split Utilization Constraint
We realized that the R*-tree actually optimizes in the right direction for clustered data in that it tries to minimize overlap. We went back to check the three different losses (clustering loss, utilization loss, and excess coverage loss) in figure 4(b). Utilization loss is the smallest of the three, so perhaps we could relax the utilization restriction in the R*-tree algorithm (which allows at most a 40/60 imbalance). The purpose was to allow a node split to separate two clusters cleanly instead of forcing it to divide individual clusters between two nodes to satisfy the utilization restriction. [KSM99]
We changed the restriction and tested values from 0.15 to 0.4. The following figure shows our results:
Clearly, 0.25 is the optimal minimum split utilization for the given dataset and queries, and performance gets worse as the minimum split utilization grows larger or smaller. Interestingly, the optimal minimum split utilization not only reduces the clustering and excess coverage losses, it also reduces the utilization loss, which is unexpected because splits are allowed to be less balanced. This trend stops when the minimum split utilization reaches 0.25.
2. Modification of pickSplit() algorithm
We improved the PickSplit() algorithm as follows:
Algorithm Split
S1 Invoke ChooseSplitAxis to determine the axis, perpendicular to which the split is performed.
S2 Invoke ChooseSplitIndex to determine the best distribution into two groups along that axis.
S3 Distribute the entries into two groups.

Algorithm ChooseSplitAxis
CSA1 For each axis:
    Sort the entries by the lower and then by the upper value of their rectangles and determine all distributions as described above. Compute S, the sum of all margin-values of the different distributions and all overlap-values of the different distributions.
    (The old algorithm considered only the margin values.)
CSA2 Choose the axis with the minimum S as the split axis.

Algorithm ChooseSplitIndex
CSI1 Along the chosen split axis, choose the distribution with the minimum overlap-value and minimum area-value.
    (The old algorithm used only the minimum overlap value, falling back to the minimum area value only to break ties.)
We improved this algorithm with the two changes marked above. After combining the improved pickSplit() with the improved penalty(), we reached our best performance. We believe the reason is that the R*-tree is a general heuristic method, and the heuristics it uses are not optimal for our dataset and queries. Our improvements were based on step-by-step debugging with the aid of amdb.
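The ChooseSplitAxis step can be sketched in C++ as follows. This is a simplified sketch with our own names, operating on point entries (leaf level); the real code operates on rectangle entries, and for brevity this version sums only the margin values, as in the original R*-tree, omitting the overlap term our change adds to S:

```cpp
#include <algorithm>
#include <array>
#include <vector>

const int DIM = 5;
using Point = std::array<double, DIM>;

// Margin (perimeter analogue) of the bounding box of entries [b, e).
double margin(std::vector<Point>::const_iterator b,
              std::vector<Point>::const_iterator e) {
    double m = 0.0;
    for (int d = 0; d < DIM; d++) {
        double lo = (*b)[d], hi = (*b)[d];
        for (auto it = b; it != e; ++it) {
            lo = std::min(lo, (*it)[d]);
            hi = std::max(hi, (*it)[d]);
        }
        m += hi - lo;
    }
    return m;
}

// For each axis, sort the entries along it and sum the margin values of
// every legal two-group distribution; pick the axis with the smallest sum.
// minFill is the minimum number of entries allowed per group.
int chooseSplitAxis(std::vector<Point> entries, int minFill) {
    int bestAxis = 0;
    double bestS = 1e300;
    for (int axis = 0; axis < DIM; axis++) {
        std::sort(entries.begin(), entries.end(),
                  [axis](const Point& a, const Point& b) {
                      return a[axis] < b[axis];
                  });
        double S = 0.0;
        int n = (int)entries.size();
        for (int k = minFill; k <= n - minFill; k++) {
            S += margin(entries.begin(), entries.begin() + k);
            S += margin(entries.begin() + k, entries.end());
        }
        if (S < bestS) { bestS = S; bestAxis = axis; }
    }
    return bestAxis;
}
```

ChooseSplitIndex then scans the distributions along the winning axis, minimizing overlap and area.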
Reinsertion
We had almost run out of time when we came up with the idea of reinsertion. We noticed that dynamic reinsertion is not implemented in gist_rstartree, and we did not have time to add it. Instead we experimented with ad hoc deletion of the first 40,000 data items followed by their reinsertion, using a Perl script. We ran two experiments, as displayed below: (a) bulk-delete the first 40,000 data items and then reinsert all 40,000; (b) interleave deletion and reinsertion for the first 40,000 items (delete item A, reinsert item A; delete B, reinsert B; and so on). The results were worse than the original. The reason is that, for performance reasons, deletion does not actually remove nodes even when they hold fewer records than half the fanout; as we recall from earlier access-method papers, index deletion is not fully executed for performance reasons.
Results:
The results are best demonstrated by the four pictures below (figures 8a, 8b, 9, and 10). Figures 8a and 8b show, in two dimensions, the tightness of Our-tree's bounding predicates against the original R*-tree's. Figures 9 and 10 are Our-tree and Our*-tree, respectively. We will not repeat the performance statistics, since they are listed below, except to note how close to optimal both cases are, especially if the workload can tolerate a less precise representation.
Given the improved Penalty() and PickSplit() algorithms above and all key values stored as doubles, we achieve the results displayed in the pictures.
Given the improved Penalty() and PickSplit() algorithms above and single-precision float key values, we achieve the results displayed in the pictures.
Results continued
The bottom line is the set of three tables below. The first table is Our Star Tree (Our*-tree), the second is Our-tree, and the third is the R*-tree.
The first, Our*-tree, uses some lossy compression on the key (floats instead of doubles). There are many applications where float precision is accurate enough to index and query a workload. By using a float instead of a double, our index can better utilize and cluster data on a page. We demonstrate that below with an outstanding 30,661 actual-clustering I/O's, 160% better than the R*-tree and only 29% above the optimal. Even the total number of I/O's at the internal and leaf nodes combined is 157% better than the R*-tree's total.
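The fanout gain from the double-to-float change can be estimated directly: halving the key size nearly doubles the entries per page, so the same data clusters onto fewer leaves. A rough sketch, where the 8 KB page size and 12-byte per-entry overhead are illustrative assumptions, not libgist's exact page layout:

```cpp
// Rough leaf fanout: usable page bytes divided by bytes per entry.
// Each entry holds a 5-D key plus some fixed per-entry overhead.
int leafFanout(int pageBytes, int keyScalarBytes, int dims, int entryOverhead) {
    return pageBytes / (dims * keyScalarBytes + entryOverhead);
}

// With the assumed numbers:
//   doubles: 8192 / (5*8 + 12) = 157 entries per page
//   floats:  8192 / (5*4 + 12) = 256 entries per page
```

The exact constants differ in libgist, but the ratio explains why the float keys cluster so much better.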
Our*-tree:

Items Retrieved: 425000        Avg. Util: 69.744

                        Leaf I/O's        Overhead   Internal    Total
Theoretical Minimum:    4250              N/A        -           -
Optimal Clustering:     21619             5.086      -           -
Random Clustering:      349263.2 (302)    16.155     -           -
Clustering Loss:        -2799.179         0.87       N/A         -2799.18
Utilization Loss:       1472              1.075      2129.258    3601.437
Exc. Coverage Loss:     10369             1.51       2460.439    12829.48
Actual Clustering:      30661             1.418      8752        39413
The second tree we call Our-tree. It is the best index we could create without using less accurate floats; Our-tree is closer to a lossless tree since the data is stored as doubles. Its performance is still 75% better than the R*-tree at the leaf nodes and 43% above the optimal.
Our-Tree:

Items Retrieved: 425000        Avg. Util: 69.475

                        Leaf I/O's        Overhead   Internal    Total
Theoretical Minimum:    4250              N/A        -           -
Optimal Clustering:     26183             6.16       -           -
Random Clustering:      377071.3 (273)    14.401     -           -
Clustering Loss:        -556.369          0.978      N/A         -556.369
Utilization Loss:       2042.368          1.079      737.6378    2780.007
Exc. Coverage Loss:     18490             1.668      5881.526    24371.53
Actual Clustering:      46159             1.7962     14673       60832
R*-Tree:

Items Retrieved: 425000        Avg. Util: 70.339

                        Leaf I/O's        Overhead   Internal    Total
Theoretical Minimum:    4250              N/A        -           -
Optimal Clustering:     26130             6.148      -           -
Random Clustering:      377195.7 (303)    14.435     -           -
Clustering Loss:        10379.817         1.397      N/A         10379.82
Utilization Loss:       2644.177          1.066      3521.611    6165.788
Exc. Coverage Loss:     41623             2.063      6879.781    48502.78
Actual Clustering:      80777             3.091      20439       101216
CONCLUSIONS AND FUTURE WORK
We have achieved excellent performance, due to the combined effect of our modifications to Penalty() and PickSplit(). Specifically, Penalty() gave us a global view of an insert; given that global view, we inserted the new point into the node that would enlarge the least at every level.
It was only later that we realized that a less precise representation of the data is acceptable in some domains; hence Our*-tree, which demonstrated how close to optimal our index can perform. The double-to-float transformation allowed us to reach the outstanding benchmark of 160% better than the R*-tree in leaf I/O's on the given workload. If we had more time we would pursue changing some of the 32-bit int values to 16-bit shorts, which we would expect to cluster even better.
If we had more time we would definitely pursue one appealing future direction: dynamic reinsertion, described in [BK90]. To achieve dynamic reorganization, the R*-tree forces entries to be reinserted during the insertion routine. Forced reinsertion can move entries between neighboring nodes and thus decreases overlap. As a side effect, storage utilization is improved, and with more restructuring, fewer splits occur.
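The core of forced reinsertion is picking which entries to evict from an overflowing node. A sketch of that victim selection (our own simplified 1-D version with hypothetical names; the real keys are 5-D, and [BK90] orders entries by the distance between their centers and the node's center):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// On node overflow, instead of splitting immediately, evict the given
// fraction of entries farthest from the node's center and reinsert them
// from the root ([BK90] suggests a fraction around 0.3).
std::vector<double> pickReinsertVictims(std::vector<double> entries,
                                        double fraction) {
    double center = 0.0;
    for (double e : entries) center += e;
    center /= entries.size();
    // Sort by distance from the center, farthest first.
    std::sort(entries.begin(), entries.end(),
              [center](double a, double b) {
                  return std::fabs(a - center) > std::fabs(b - center);
              });
    size_t k = (size_t)(fraction * entries.size());
    return std::vector<double>(entries.begin(), entries.begin() + k);
}
```

Unlike our Perl-script experiment, this happens inside the insert routine itself, while bounding predicates are still being adjusted.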
We did not have time to implement dynamic reinsertion in libgist, but we would definitely follow that path if we did; our cheap approximation, the Perl-script delete-and-reinsert experiment described above, was no substitute.
REFERENCES
[AOKI97] Paul M. Aoki, "Generalizing `Search' in Generalized Search Trees", UCB 1997.
[BK90] N. Beckmann, H. Kriegel, "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles", SIGMOD 90.
[KSM99] M. Kornacker, M. Shah, J. M. Hellerstein, "Amdb: A Design Tool for Access Methods", UCB 1999.