Hakim Weatherspoon (hweather@cs.berkeley.edu)

Yan Chen (yanchen@cs.berkeley.edu)

CS 262B

March 1, 2000

Homework 1
 

Indexing Bake-Off

 

 

ABSTRACT

The bake-off required us to produce, with the help of libGiST and amdb, an access method (index structure) that outperforms the R*-tree on a given workload. The characteristics of the dataset and queries were initially unknown. In this report we describe how we analyzed the workload and the design process we used to produce an access method that is 75% faster, in number of I/Os at the leaf level, than the R*-tree implementation. In fact, our access method requires only 43% more I/O than the theoretical minimum. We then suggest ways to reduce the remaining overhead further. Finally, we demonstrate that if precision is flexible, performance departs drastically from the R*-tree: an outstanding improvement of 160% in the number of leaf I/Os over the R*-tree, with only 29% more I/O than the theoretical minimum. The access method analysis and visualization features provided by libGiST and amdb were invaluable to our work.
 
 

DATASET AND WORKLOAD ANALYSIS

 
Figure 1. First 4,500 points of the dataset in each dimension (X-d, Y-d, Z-d, K-d, M-d).

Figure 2. 2,125 nearest-neighbor queries in insert order. Each dimension has the same color as in Figure 1.

 

The workload we were given consists of 85,000 ordered insertions of data items with 5-dimensional keys, plus 2,125 200-nearest-neighbor queries over the data set. To optimize for this workload, our first task was to understand it. The R*-tree gave the following results under amdb workload analysis:

Figure 3. amdb view of the R*-tree on the given dataset and queries.

 
 
 
 
Figure 4. (a) Subtree view of the first level;
(b) leaf I/O statistics of the R*-tree.

 
 

The rectangles clearly overlap badly within the two dimensions that amdb displays. The figures and numbers above also show a large amount of clustering loss in the amdb analysis. Our task, then, was to cluster these data points better and to minimize the overlap between the rectangles.

Spreadsheet analysis gave us some more information:

Table 1. 85,000 5-dimensional data points.

 
Dimension   Max Interval      Mean       Stdev     66% interval       95% interval
X-d         [0.367, 1.029]    0.997478   0.03632   [0.961, 1.033]     [0.925, 1.070]
Y-d         [-0.379, 1.029]   -0.13042   0.091197  [-0.221, -0.039]   [-0.312, 0.052]
Z-d         [-0.267, 0.411]   0.193732   0.094833  [0.099, 0.289]     [0.004, 0.383]
K-d         [-0.413, 0.400]   -0.15699   0.129915  [-0.287, -0.027]   [-0.416, 0.103]
M-d         [-0.315, 0.338]   -0.11218   0.162924  [-0.275, 0.051]    [-0.438, 0.214]
Table 2. cs262bqueries: 2,125 nearest-neighbor searches, each returning the closest 200 points.

 
Dimension   Max Interval      Mean       Stdev     66% interval       95% interval
X-d         [0.576, 1.029]    0.999038   0.03322   [0.966, 1.032]     [0.933, 1.065]
Y-d         [-0.372, 0.251]   -0.13414   0.09346   [-0.227, -0.041]   [-0.321, 0.052]
Z-d         [-0.200, 0.405]   0.19515    0.095926  [0.099, 0.291]     [0.003, 0.387]
K-d         [-0.408, 0.303]   -0.15884   0.121521  [-0.280, -0.037]   [-0.402, 0.084]
M-d         [-0.315, 0.329]   -0.10929   0.162467  [-0.272, 0.053]    [-0.434, 0.216]
Our analysis of the workload showed that the dataset and the queries have similar distributions. Because of the tight bounds and high overlap, we suspected that using the distance between a new insertion point and a tentative bounding predicate would not work well. We tried it anyway, and as expected the resulting index performed poorly.
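For reference, the 66% and 95% intervals in Tables 1 and 2 are simply the mean plus or minus one and two standard deviations. A minimal C++ sketch of that computation follows; the input file name and its whitespace-delimited, five-values-per-point format are assumptions (we actually used a spreadsheet):

// Sketch: per-dimension mean, stdev, and the mean +/- 1 and 2 stdev
// intervals reported in Tables 1 and 2.
#include <cmath>
#include <cstdio>
#include <fstream>
#include <vector>

int main() {
    const int DIMS = 5;
    std::ifstream in("cs262bdata.txt");  // hypothetical dump of the dataset
    std::vector<double> sum(DIMS, 0.0), sumSq(DIMS, 0.0);
    long n = 0;
    double x;
    while (in >> x) {                    // values arrive point by point
        sum[n % DIMS] += x;
        sumSq[n % DIMS] += x * x;
        n++;
    }
    long points = n / DIMS;
    for (int d = 0; d < DIMS; d++) {
        double mean = sum[d] / points;
        double sd = std::sqrt(sumSq[d] / points - mean * mean);
        std::printf("dim %d: mean %.6f sd %.6f 66%% [%.3f, %.3f] 95%% [%.3f, %.3f]\n",
                    d, mean, sd, mean - sd, mean + sd,
                    mean - 2 * sd, mean + 2 * sd);
    }
    return 0;
}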

 
 

To summarize, the workload consists of 85,000 data points grouped in a tight range in each dimension, with the exception of a few outliers (which are responsible for ballooning some bounding predicates). In fact, the 95% intervals of most dimensions of both the dataset and the query set were tightly bounded. Finally, we would have to use a parameter other than distance.

Based on the above analysis, we proceeded to modify our R*-tree implementation.
 
 

APPROACHES AND RESULTS

1. Are they clustered? - Penalty()

Our first glance at the spreadsheet analysis and the 2-d projection in the amdb view gave us the impression that the data are highly clustered; note the very small standard deviations in the X, Y, and Z dimensions. So we tried to cluster them.

We wrote a Penalty() method that minimized the distance from the new point to the bounding rectangle, instead of minimizing area enlargement or overlap. This caused the rectangles to grow larger and larger, eventually producing the poor results shown in Figure 5.
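For concreteness, a minimal sketch of that distance-based penalty; the Rect type and helper are our own illustration, not the actual libgist classes:

// Sketch of the failed distance-based Penalty(): rank candidate pages by
// the distance from the new point to their bounding rectangles, ignoring
// area and overlap enlargement entirely.
#include <cmath>

const int DIMS = 5;

struct Rect { double lo[DIMS], hi[DIMS]; };  // illustrative bounding predicate

// squared distance from point p to rectangle r (0 if p lies inside)
double minDist2(const Rect &r, const double p[DIMS]) {
    double d2 = 0.0;
    for (int i = 0; i < DIMS; i++) {
        double d = 0.0;
        if (p[i] < r.lo[i]) d = r.lo[i] - p[i];
        else if (p[i] > r.hi[i]) d = p[i] - r.hi[i];
        d2 += d * d;
    }
    return d2;
}

// penalty of inserting p under bounding predicate r: pure distance
double distancePenalty(const Rect &r, const double p[DIMS]) {
    return minDist2(r, p);
}

Because nearly every point in this workload lies close to nearly every rectangle, the closest rectangle is often not the one whose enlargement is cheapest, which is consistent with the ballooning we observed.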
 
Figure 5. Performance of the bad tree. (a) Level-1 subtree view; (b) leaf I/O metrics.

 

Our goal was difficult: minimize bounding-predicate overlap at the leaf level, knowing that the dataset and workload are tightly bounded. To achieve it we had to take into account both our workload analysis and our failure above (distance from the new point to the center of a bounding predicate does not work). We therefore decided to focus on the penalty function and minimize the penalty of an insertion.

To minimize the penalty of an insertion, we changed the R*-tree penalty implementation from considering only the local consequences of an insert to considering the global consequences. Specifically, the original R*-tree penalty only picks the internal node whose area would increase the least; it never considers the overlap and area increase of the actual leaf the new point will be inserted into. This observation led us directly to modify the R*-tree penalty implementation: whenever page.level() > 2, we pick a tentative page for the new point if, and only if, its area enlargement and overlap enlargement are both smaller than the current minimum area and overlap enlargements.

This change to the penalty function nearly halved the I/Os at the leaves: from a total of 80,777 leaf I/Os to approximately 49,000. We were well on our way toward the optimal 26,130 I/Os.

The change worked so well because we did not restrict ourselves to the internal node whose area would increase the least (as in the original R*-tree); instead we considered the actual leaf node whose area and overlap would increase the least. Once we saw that considering the leaves had such a positive effect, we extended the idea to a combination of the leaves and internal nodes, applying the same test at both levels. That is, we insert a new point only along the root-to-leaf path whose set of areas and overlaps would increase the least. This brought our index down further, to approximately 46,000 leaf I/Os, even closer to the optimal 26,130.

Penalty(penaltyNode, &overlapReturned, &areaReturned)
{
    // Recursively pick the child whose entire root-to-leaf path minimizes
    // both area and overlap enlargement for the new point.

    if (penaltyNode.level == 2)  // children are leaf nodes
    {
        // set overlapReturned and areaReturned to the minimum overlap and
        // area enlargement over the leaf pages the new point might be
        // inserted into, and return the child achieving that minimum
        return bestLeafChild;
    }

    minOverlapEnlargementLowerLevel   = MAXDOUBLE;
    minAreaEnlargementLowerLevel      = MAXDOUBLE;
    minOverlapEnlargementCurrentLevel = MAXDOUBLE;
    minAreaEnlargementCurrentLevel    = MAXDOUBLE;
    bestChild = NULL;

    for each child N of penaltyNode
    {
        areaEnlargementCurrentLevel    = computeAreaEnlargement(N, newPoint);
        overlapEnlargementCurrentLevel = computeOverlapEnlargement(N, newPoint);

        // recurse to obtain the best enlargements achievable below N
        Penalty(N, overlapReturned, areaReturned);

        // accept N only if it improves on both the current level and
        // the levels below
        if (overlapReturned <= minOverlapEnlargementLowerLevel &&
            areaReturned <= minAreaEnlargementLowerLevel &&
            overlapEnlargementCurrentLevel <= minOverlapEnlargementCurrentLevel &&
            areaEnlargementCurrentLevel <= minAreaEnlargementCurrentLevel)
        {
            minOverlapEnlargementLowerLevel   = overlapReturned;
            minAreaEnlargementLowerLevel      = areaReturned;
            minOverlapEnlargementCurrentLevel = overlapEnlargementCurrentLevel;
            minAreaEnlargementCurrentLevel    = areaEnlargementCurrentLevel;
            bestChild = N;
        }
    }

    // report the minima found below this node back to the caller
    overlapReturned = minOverlapEnlargementLowerLevel;
    areaReturned    = minAreaEnlargementLowerLevel;
    return bestChild;
}
 

pickSplit()

1. Minimum Split Utilization Constraint

We realized that the R*-tree actually optimizes in the right direction for clustered data, in that it tries to minimize overlap. We went back and checked the three losses in Figure 4(b): clustering loss, utilization loss, and excess coverage loss. Utilization loss is the smallest of the three, so perhaps we could relax the utilization restriction in the R*-tree algorithm (which allows at most a 40/60 imbalance). The purpose was to allow a node split to separate two clusters cleanly instead of forcing it to divide individual clusters between two nodes to satisfy the utilization restriction [KSH99]. A sketch of how this threshold bounds the candidate split distributions appears below.
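To make the restriction concrete, here is a small sketch (with our own names, not the actual libgist symbols) of how the threshold bounds the distributions a split may consider:

// Sketch: a minimum split utilization of u on a node with M entries
// forces each side of a split to receive at least ceil(u * M) entries.
// Lowering u admits more unbalanced distributions, letting a split
// separate two clusters cleanly instead of dividing one of them.
#include <cmath>
#include <cstdio>

int minEntriesPerSide(int M, double minSplitUtil) {
    return (int)std::ceil(minSplitUtil * M);
}

int main() {
    const int M = 100;  // assumed number of entries on an overflowing node
    const double utils[] = {0.15, 0.2, 0.25, 0.3, 0.4};
    for (int i = 0; i < 5; i++) {
        int m = minEntriesPerSide(M, utils[i]);
        // each k in [m, M - m] is one candidate size for the first group
        std::printf("util %.2f: first group may take %d..%d entries\n",
                    utils[i], m, M - m);
    }
    return 0;
}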

We varied the minimum split utilization from 0.15 to 0.4 and tested each setting. Figure 6 shows our results:
 
Figure 6. Performance under different minimum split utilizations: (a) top left, 0.15; (b) top right, 0.2; (c) bottom left, 0.25; (d) bottom right, 0.3.

 

Clearly 0.25 is the optimal minimum split utilization for the given dataset and queries, and performance gets worse as the minimum split utilization grows larger or smaller. Interestingly, the optimal setting not only reduced the clustering and excess coverage losses, it also reduced utilization loss, which we did not expect given that splits were allowed to be less balanced. The trend ends once the minimum split utilization reaches 0.25.

2. Modification of the pickSplit() algorithm

We improved the pickSplit() algorithm as follows:

Algorithm Split
S1 Invoke ChooseSplitAxis to determine the axis, perpendicular to which the split is performed.
S2 Invoke ChooseSplitIndex to determine the best distribution into two groups along that axis.
S3 Distribute the entries into two groups.

Algorithm ChooseSplitAxis
CSA1 For each axis:
            Sort the entries by the lower, then by the upper value of their rectangles, and determine all distributions as described above. Compute S, the sum of all margin-values and all overlap-values of the different distributions.
            (The old algorithm considered only the margin values.)
          end
CSA2 Choose the axis with the minimum S as the split axis.

Algorithm ChooseSplitIndex
CSI1 Along the chosen split axis, choose the distribution with the minimum overlap-value and the minimum area-value.
        (The old algorithm used only the minimum overlap value, falling back on the minimum area value only to break ties.)

We improved the algorithm in the two ways marked above. After we combined the improved pickSplit() with the new Penalty(), we reached our best performance. We believe the reason is that the R*-tree is a general heuristic method, and the heuristics it uses are not optimal for our particular dataset and queries. We arrived at these improvements through step-by-step debugging with the aid of amdb. A condensed sketch of the two pickSplit() changes follows.
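The Distribution type and scoring below are our own illustration of the two changes, not the actual gist_rstartree code; a Distribution here simply carries the precomputed margin, overlap, and area values of one candidate two-group split.

// Sketch of the two pickSplit() changes.
#include <cfloat>
#include <cstddef>
#include <vector>

struct Distribution {
    double margin;   // sum of both groups' margins
    double overlap;  // overlap volume of the two groups' BPs
    double area;     // total volume of the two groups' BPs
};

// Change 1: score an axis by the sum of margin AND overlap values over
// all of its distributions (the original R*-tree summed margins only);
// the axis with the minimum score is chosen as the split axis.
double axisGoodness(const std::vector<Distribution> &dists) {
    double s = 0.0;
    for (std::size_t i = 0; i < dists.size(); i++)
        s += dists[i].margin + dists[i].overlap;
    return s;
}

// Change 2: along the chosen axis, minimize overlap and area jointly
// (the original minimized overlap, using area only to break ties);
// summing the two is one simple way to combine them.
int chooseSplitIndex(const std::vector<Distribution> &dists) {
    int best = 0;
    double bestScore = DBL_MAX;
    for (std::size_t i = 0; i < dists.size(); i++) {
        double score = dists[i].overlap + dists[i].area;
        if (score < bestScore) { bestScore = score; best = (int)i; }
    }
    return best;
}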

Reinsertion

We had nearly run out of time when we came up with the idea of reinsertion. We noticed that dynamic reinsertion is not implemented in gist_rstartree, and we did not have time to implement it. Instead we experimented with ad hoc deletion of the first 40,000 data items followed by reinsertion, driven by a perl script. We ran two experiments, shown below: (a) bulk: delete the first 40,000 items, then reinsert all 40,000; (b) interleaved: delete and reinsert item by item (delete A, reinsert A; delete B, reinsert B; and so on). The results were worse than the original. The reason is that deletion does not actually remove nodes, even when they hold fewer records than half the fanout; as we recalled from the earlier access-methods papers, index deletion is deliberately left incomplete for performance reasons.
 
Figure 7. Performance of ad hoc delete and reinsert of the first 40,000 items: (a) bulk; (b) interleaved.

 
 
 
 

Results:
The results are best demonstrated by the four pictures below (Figures 8a, 8b, 9, and 10). Figures 8a and 8b compare, in two dimensions, the tightness of Our-tree's bounding predicates against the original R*-tree's. Figures 9 and 10 show Our-tree and Our*-tree, respectively. We will not repeat the performance statistics, which are listed below, except to note how close to optimal both trees come, especially if the workload can tolerate a less precise key representation.
 
(a) 2-d view of the clustering of Our-tree.


(b) 2-d view of the original R*-tree.
Figure 8. Overall tree-view comparison between Our-tree and the original R*-tree.

 
Figure 9. Final performance of our optimized R*-tree (Our-tree), a nearly lossless index.

Given the improved Penalty() and pickSplit() algorithms above, with all values stored as doubles, we achieved the results displayed in the pictures.

 
 
 
Figure 10. Final performance of our exceptional R*-tree (Our*-tree), a lossy index.

Given the improved Penalty() and pickSplit() algorithms above, with single-precision float values, we achieved the results displayed in the pictures.

Results continued

The bottom line:

The bottom line is the set of three tables below: the first is Our*-tree (Our Star Tree), the second is Our-tree, and the third is the R*-tree.
 

 

The first table, Our*-tree, uses lossy compression on the key (floats instead of doubles). In many applications float precision is accurate enough to index and query a workload, and by using a float instead of a double our index can better utilize and cluster data on a page. We demonstrate that below with an outstanding 30,661 actual-clustering leaf I/Os, 160% better than the R*-tree and only 29% from optimal. Even the total number of I/Os at the internal and leaf nodes combined is 157% better than the R*-tree's total.
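The mechanism is simple; a sketch (with illustrative struct names, not the actual gist_rstartree key layout):

// Sketch: storing the 5-d key as single-precision floats instead of
// doubles halves the key from 40 to 20 bytes, so roughly twice as many
// entries fit on a page and the data cluster more tightly per page.
#include <cstdio>

const int DIMS = 5;

struct DoubleKey { double coord[DIMS]; };   // lossless: 40 bytes
struct FloatKey  { float  coord[DIMS]; };   // lossy:    20 bytes

// lossy conversion applied when a point is inserted
FloatKey compress(const DoubleKey &k) {
    FloatKey f;
    for (int i = 0; i < DIMS; i++)
        f.coord[i] = (float)k.coord[i];     // drops ~29 bits of mantissa
    return f;
}

int main() {
    const unsigned PAGE = 8192;             // assumed page size
    std::printf("entries per page: double %u, float %u\n",
                PAGE / (unsigned)sizeof(DoubleKey),
                PAGE / (unsigned)sizeof(FloatKey));
    return 0;
}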
 

 

Our*-tree:

Items Retrieved: 425,000        Avg. Util: 69.744

                        Leaf I/O's       Overhead   Internal I/O's   Total I/O's
Theoretical Minimum:    4250             N/A
Optimal Clustering:     21619            5.086
Random Clustering:      349263.2 (302)   16.155
Clustering Loss:        -2799.179        0.87       N/A              -2799.18
Utilization Loss:       1472             1.075      2129.258         3601.437
Exc. Coverage Loss:     10369            1.51       2460.439         12829.48
Actual Clustering:      30661            1.418      8752             39413

The second tree we call Our-tree. It is the best index we could create without resorting to less accurate floats; Our-tree is closer to a lossless tree since the data is stored as doubles. Its performance is still 75% better than the R*-tree at the leaf nodes and 43% from optimal.
 

 

Our-tree:

Items Retrieved: 425,000        Avg. Util: 69.475

                        Leaf I/O's       Overhead   Internal I/O's   Total I/O's
Theoretical Minimum:    4250             N/A
Optimal Clustering:     26183            6.16
Random Clustering:      377071.3 (273)   14.401
Clustering Loss:        -556.369         0.978      N/A              -556.369
Utilization Loss:       2042.368         1.079      737.6378         2780.007
Exc. Coverage Loss:     18490            1.668      5881.526         24371.53
Actual Clustering:      46159            1.7962     14673            60832

 

R*-tree:

Items Retrieved: 425,000        Avg. Util: 70.339

                        Leaf I/O's       Overhead   Internal I/O's   Total I/O's
Theoretical Minimum:    4250             N/A
Optimal Clustering:     26130            6.148
Random Clustering:      377195.7 (303)   14.435
Clustering Loss:        10379.817        1.397      N/A              10379.82
Utilization Loss:       2644.177         1.066      3521.611         6165.788
Exc. Coverage Loss:     41623            2.063      6879.781         48502.78
Actual Clustering:      80777            3.091      20439            101216

 

CONCLUSIONS AND FUTURE WORK

We achieved excellent performance through the combined effect of our modifications to Penalty() and pickSplit(). Specifically, Penalty() gave us a global view of an insert, and given that global view we inserted the new point into the node that increases area and overlap the least at every level.

Only later did we realize that a less precise representation of the data is acceptable in some domains; hence Our*-tree, which demonstrated how close to optimal our index can perform. The double-to-float transformation let us reach the outstanding benchmark of 160% better than the R*-tree in leaf I/Os on the given workload. With more time we would pursue changing some of the 32-bit int values to 16-bit shorts, which we would expect to cluster even better.

If we had more time we would definitely pursue one appealing future direction: dynamic reinsertion, described in [BK90]. To achieve dynamic reorganization, the R*-tree forces entries to be reinserted during the insertion routine. Forced reinsertion can move entries between neighboring nodes and thus decreases overlap; as a side effect, storage utilization improves, and with more restructuring fewer splits occur.
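A sketch of the forced-reinsert step from [BK90], with placeholder types rather than libgist code:

// Sketch of R*-tree forced reinsertion [BK90]: on the first overflow at
// a given level, instead of splitting, remove the p entries (about 30%)
// whose centers lie farthest from the node's BP center and re-insert
// them from the root; they may migrate to neighboring nodes.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Entry { double centerDist; /* distance of entry center to node BP center */ };
struct Node  { std::vector<Entry> entries; };

static bool fartherFirst(const Entry &a, const Entry &b) {
    return a.centerDist > b.centerDist;
}

void forcedReinsert(Node &n, std::vector<Entry> &toReinsert,
                    double fraction = 0.3) {
    std::size_t p = (std::size_t)(fraction * n.entries.size());
    // order entries by decreasing distance from the BP center
    std::sort(n.entries.begin(), n.entries.end(), fartherFirst);
    // evict the p farthest entries; the caller recomputes n's BP and
    // re-inserts the evicted entries from the root
    toReinsert.assign(n.entries.begin(), n.entries.begin() + p);
    n.entries.erase(n.entries.begin(), n.entries.begin() + p);
}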

We did not have time to implement dynamic reinsertion in libgist, but we would definitely follow that path if we did. The perl-script delete-and-reinsert experiments described above were our cheap approximation of it.

REFERENCES

[AOKI97] P. M. Aoki, "Generalizing 'Search' in Generalized Search Trees", UC Berkeley, 1997.

[BK90] N. Beckmann, H.-P. Kriegel, R. Schneider, B. Seeger, "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles", SIGMOD 1990.

[KSH99] M. Kornacker, M. Shah, J. M. Hellerstein, "Amdb: A Design Tool for Access Methods", UC Berkeley, 1999.