Hakim Weatherspoon (hweather@cs.berkeley.edu)

Yan Chen (yanchen@cs.berkeley.edu)

CS 262B

March 1, 2000

Homework 1
 

Indexing Bake-Off

 

 

ABSTRACT

The bake-off required us to produce, with the help of libGiST and amdb, an access method (index structure) that outperforms the R*-tree on a given workload. The characteristics of the dataset and queries were initially unknown. In this report we describe how we analyzed the workload and the design process we used to produce an access method that is 75% faster, in number of I/Os at the leaf level, than the R*-tree implementation. In fact, our access method requires only 43% more I/O than the theoretical minimum. We then suggest ways to reduce the remaining overhead further. Finally, we demonstrate that if precision is flexible, performance departs drastically from the R*-tree: an outstanding improvement of 160% in the number of leaf I/Os over the R*-tree, with only 29% more I/O than the theoretical minimum. The access method analysis and visualization features provided by libGiST and amdb were invaluable to our work.
 
 

DATASET AND WORKLOAD ANALYSIS

 
Figure 1. First 4,500 points of the dataset in each dimension (X-d, Y-d, Z-d, K-d, M-d).

Figure 2. 2,125 nearest-neighbor queries in insert order. Each dimension has the same color as in Figure 1.

 

The workload we were given consists of 85,000 ordered insertions of data items with 5-dimensional keys, plus 2,125 200-nearest-neighbor queries over the data set. To optimize for this workload, our first task was to understand it. The R*-tree gave the following results under amdb workload analysis:

Figure 3. amdb view of the R*-tree on the given dataset and queries.

 
 
 
 
Figure 4. (a) Subtree view of the first level;
(b) leaf I/O statistics of the R*-tree.

 
 

The rectangles clearly overlap badly within the two dimensions that amdb displays. The figures and numbers above also show a large amount of clustering loss in the amdb analysis. Our task, then, was to cluster these data points better and to minimize the overlap between the rectangles.

Spreadsheet analysis gave us some more information:

Table 1. 85,000 5-dimensional data points.

 
Dimension   Max Interval      Mean       Stdev     66% interval       95% interval
X-d         [0.367, 1.029]    0.997478   0.03632   [0.961, 1.033]     [0.925, 1.070]
Y-d         [-0.379, 1.029]   -0.13042   0.091197  [-0.221, -0.039]   [-0.312, 0.052]
Z-d         [-0.267, 0.411]   0.193732   0.094833  [0.099, 0.289]     [0.004, 0.383]
K-d         [-0.413, 0.400]   -0.15699   0.129915  [-0.287, -0.027]   [-0.416, 0.103]
M-d         [-0.315, 0.338]   -0.11218   0.162924  [-0.275, 0.051]    [-0.438, 0.214]
Table 2. cs262bqueries: 2,125 nearest-neighbor searches, each returning the closest 200 points.

 
Dimension   Max Interval      Mean       Stdev     66% interval       95% interval
X-d         [0.576, 1.029]    0.999038   0.03322   [0.966, 1.032]     [0.933, 1.065]
Y-d         [-0.372, 0.251]   -0.13414   0.09346   [-0.227, -0.041]   [-0.321, 0.052]
Z-d         [-0.200, 0.405]   0.19515    0.095926  [0.099, 0.291]     [0.003, 0.387]
K-d         [-0.408, 0.303]   -0.15884   0.121521  [-0.280, -0.037]   [-0.402, 0.084]
M-d         [-0.315, 0.329]   -0.10929   0.162467  [-0.272, 0.053]    [-0.434, 0.216]
Our analysis of the workload showed that the dataset and the queries have similar distributions. Because of the tight bounds and high overlap, we suspected that using the distance between a new insertion point and a tentative bounding predicate would not work well. We tried it anyway, and as expected the resulting index performed poorly.
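For reference, the 66% and 95% intervals in Tables 1 and 2 are simply the mean plus or minus one and two standard deviations. A minimal C++ sketch of that computation follows; the input file name and its whitespace-delimited, five-values-per-point format are assumptions (we actually used a spreadsheet):

// Sketch: per-dimension mean, stdev, and the mean +/- 1 and 2 stdev
// intervals reported in Tables 1 and 2.
#include <cmath>
#include <cstdio>
#include <fstream>
#include <vector>

int main() {
    const int DIMS = 5;
    std::ifstream in("cs262bdata.txt");  // hypothetical dump of the dataset
    std::vector<double> sum(DIMS, 0.0), sumSq(DIMS, 0.0);
    long n = 0;
    double x;
    while (in >> x) {                    // values arrive point by point
        sum[n % DIMS] += x;
        sumSq[n % DIMS] += x * x;
        n++;
    }
    long points = n / DIMS;
    for (int d = 0; d < DIMS; d++) {
        double mean = sum[d] / points;
        double sd = std::sqrt(sumSq[d] / points - mean * mean);
        std::printf("dim %d: mean %.6f sd %.6f 66%% [%.3f, %.3f] 95%% [%.3f, %.3f]\n",
                    d, mean, sd, mean - sd, mean + sd,
                    mean - 2 * sd, mean + 2 * sd);
    }
    return 0;
}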

 
 

To summarize, the workload consists of 85,000 data points grouped in a tight range in each dimension, with the exception of a few outliers (which are responsible for ballooning some bounding predicates). In fact, the 95% intervals of most dimensions of both the dataset and the query set were tightly bounded. Finally, we would have to use a parameter other than distance.

Based on the above analysis, we proceeded to modify our R*-tree implementation.
 
 

APPROACHES AND RESULTS

1. Are they clustered? - Penalty()

Our first glance at the spreadsheet analysis and the 2-d projection in the amdb view gave us the impression that the data are highly clustered; note the very small standard deviations in the X, Y, and Z dimensions. So we tried to cluster them.

We wrote a Penalty() method that minimized the distance from the new point to the bounding rectangle, instead of minimizing area enlargement or overlap. This caused the rectangles to grow larger and larger, eventually producing the poor results shown in Figure 5.
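For concreteness, a minimal sketch of that distance-based penalty; the Rect type and helper are our own illustration, not the actual libgist classes:

// Sketch of the failed distance-based Penalty(): rank candidate pages by
// the distance from the new point to their bounding rectangles, ignoring
// area and overlap enlargement entirely.
#include <cmath>

const int DIMS = 5;

struct Rect { double lo[DIMS], hi[DIMS]; };  // illustrative bounding predicate

// squared distance from point p to rectangle r (0 if p lies inside)
double minDist2(const Rect &r, const double p[DIMS]) {
    double d2 = 0.0;
    for (int i = 0; i < DIMS; i++) {
        double d = 0.0;
        if (p[i] < r.lo[i]) d = r.lo[i] - p[i];
        else if (p[i] > r.hi[i]) d = p[i] - r.hi[i];
        d2 += d * d;
    }
    return d2;
}

// penalty of inserting p under bounding predicate r: pure distance
double distancePenalty(const Rect &r, const double p[DIMS]) {
    return minDist2(r, p);
}

Because nearly every point in this workload lies close to nearly every rectangle, the closest rectangle is often not the one whose enlargement is cheapest, which is consistent with the ballooning we observed.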
 
Figure 5. Performance of the bad tree. (a) Level-1 subtree view; (b) leaf I/O metrics.

 

Our goal was difficult: minimize bounding-predicate overlap at the leaf level, knowing that the dataset and workload are tightly bounded. To achieve it we had to take into account both our workload analysis and our failure above (distance from the new point to the center of a bounding predicate does not work). We therefore decided to focus on the penalty function and minimize the penalty of an insertion.

To minimize the penalty of an insertion, we changed the R*-tree penalty implementation from considering only the local consequences of an insert to considering the global consequences. Specifically, the original R*-tree penalty only picks the internal node whose area would increase the least; it never considers the overlap and area increase of the actual leaf the new point will be inserted into. This observation led us directly to modify the R*-tree penalty implementation: whenever page.level() > 2, we pick a tentative page for the new point if, and only if, its area enlargement and overlap enlargement are both smaller than the current minimum area and overlap enlargements.

This change to the penalty function nearly halved the I/Os at the leaves: from a total of 80,777 leaf I/Os to approximately 49,000. We were well on our way toward the optimal 26,130 I/Os.

The change worked so well because we did not restrict ourselves to the internal node whose area would increase the least (as in the original R*-tree); instead we considered the actual leaf node whose area and overlap would increase the least. Once we saw that considering the leaves had such a positive effect, we extended the idea to a combination of the leaves and internal nodes, applying the same test at both levels. That is, we insert a new point only along the root-to-leaf path whose set of areas and overlaps would increase the least. This brought our index down further, to approximately 46,000 leaf I/Os, even closer to the optimal 26,130.

Penalty(penaltyNode, &overlapReturned, &areaReturned)
{
    // Recursively pick the child whose entire root-to-leaf path minimizes
    // both area and overlap enlargement for the new point.

    if (penaltyNode.level == 2)  // children are leaf nodes
    {
        // set overlapReturned and areaReturned to the minimum overlap and
        // area enlargement over the leaf pages the new point might be
        // inserted into, and return the child achieving that minimum
        return bestLeafChild;
    }

    minOverlapEnlargementLowerLevel   = MAXDOUBLE;
    minAreaEnlargementLowerLevel      = MAXDOUBLE;
    minOverlapEnlargementCurrentLevel = MAXDOUBLE;
    minAreaEnlargementCurrentLevel    = MAXDOUBLE;
    bestChild = NULL;

    for each child N of penaltyNode
    {
        areaEnlargementCurrentLevel    = computeAreaEnlargement(N, newPoint);
        overlapEnlargementCurrentLevel = computeOverlapEnlargement(N, newPoint);

        // recurse to obtain the best enlargements achievable below N
        Penalty(N, overlapReturned, areaReturned);

        // accept N only if it improves on both the current level and
        // the levels below
        if (overlapReturned <= minOverlapEnlargementLowerLevel &&
            areaReturned <= minAreaEnlargementLowerLevel &&
            overlapEnlargementCurrentLevel <= minOverlapEnlargementCurrentLevel &&
            areaEnlargementCurrentLevel <= minAreaEnlargementCurrentLevel)
        {
            minOverlapEnlargementLowerLevel   = overlapReturned;
            minAreaEnlargementLowerLevel      = areaReturned;
            minOverlapEnlargementCurrentLevel = overlapEnlargementCurrentLevel;
            minAreaEnlargementCurrentLevel    = areaEnlargementCurrentLevel;
            bestChild = N;
        }
    }

    // report the minima found below this node back to the caller
    overlapReturned = minOverlapEnlargementLowerLevel;
    areaReturned    = minAreaEnlargementLowerLevel;
    return bestChild;
}
 

pickSplit()

1. Minimum Split Utilization Constraint

We realized that the R*-tree actually optimizes in the right direction for clustered data, in that it tries to minimize overlap. We went back and checked the three losses in Figure 4(b): clustering loss, utilization loss, and excess coverage loss. Utilization loss is the smallest of the three, so perhaps we could relax the utilization restriction in the R*-tree algorithm (which allows at most a 40/60 imbalance). The purpose was to allow a node split to separate two clusters cleanly instead of forcing it to divide individual clusters between two nodes to satisfy the utilization restriction [KSH99]. A sketch of how this threshold bounds the candidate split distributions appears below.
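To make the restriction concrete, here is a small sketch (with our own names, not the actual libgist symbols) of how the threshold bounds the distributions a split may consider:

// Sketch: a minimum split utilization of u on a node with M entries
// forces each side of a split to receive at least ceil(u * M) entries.
// Lowering u admits more unbalanced distributions, letting a split
// separate two clusters cleanly instead of dividing one of them.
#include <cmath>
#include <cstdio>

int minEntriesPerSide(int M, double minSplitUtil) {
    return (int)std::ceil(minSplitUtil * M);
}

int main() {
    const int M = 100;  // assumed number of entries on an overflowing node
    const double utils[] = {0.15, 0.2, 0.25, 0.3, 0.4};
    for (int i = 0; i < 5; i++) {
        int m = minEntriesPerSide(M, utils[i]);
        // each k in [m, M - m] is one candidate size for the first group
        std::printf("util %.2f: first group may take %d..%d entries\n",
                    utils[i], m, M - m);
    }
    return 0;
}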

We varied the minimum split utilization from 0.15 to 0.4 and tested each setting. Figure 6 shows our results:
 
Figure 6. Performance under different minimum split utilizations: (a) top left, 0.15; (b) top right, 0.2; (c) bottom left, 0.25; (d) bottom right, 0.3.

 

Clearly 0.25 is the optimal minimum split utilization for the given dataset and queries, and performance gets worse as the minimum split utilization grows larger or smaller. Interestingly, the optimal setting not only reduced the clustering and excess coverage losses, it also reduced utilization loss, which we did not expect given that splits were allowed to be less balanced. The trend ends once the minimum split utilization reaches 0.25.

2. Modification of the pickSplit() algorithm

We improved the pickSplit() algorithm as follows:

Algorithm Split
S1 Invoke ChooseSplitAxis to determine the axis, perpendicular to which the split is performed.
S2 Invoke ChooseSplitIndex to determine the best distribution into two groups along that axis.
S3 Distribute the entries into two groups.

Algorithm ChooseSplitAxis
CSA1 For each axis:
            Sort the entries by the lower, then by the upper value of their rectangles, and determine all distributions as described above. Compute S, the sum of all margin-values and all overlap-values of the different distributions.
            (The old algorithm considered only the margin values.)
          end
CSA2 Choose the axis with the minimum S as the split axis.

Algorithm ChooseSplitIndex
CSI1 Along the chosen split axis, choose the distribution with the minimum overlap-value and the minimum area-value.
        (The old algorithm used only the minimum overlap value, falling back on the minimum area value only to break ties.)

We improved the algorithm in the two ways marked above. After we combined the improved pickSplit() with the new Penalty(), we reached our best performance. We believe the reason is that the R*-tree is a general heuristic method, and the heuristics it uses are not optimal for our particular dataset and queries. We arrived at these improvements through step-by-step debugging with the aid of amdb. A condensed sketch of the two pickSplit() changes follows.
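The Distribution type and scoring below are our own illustration of the two changes, not the actual gist_rstartree code; a Distribution here simply carries the precomputed margin, overlap, and area values of one candidate two-group split.

// Sketch of the two pickSplit() changes.
#include <cfloat>
#include <cstddef>
#include <vector>

struct Distribution {
    double margin;   // sum of both groups' margins
    double overlap;  // overlap volume of the two groups' BPs
    double area;     // total volume of the two groups' BPs
};

// Change 1: score an axis by the sum of margin AND overlap values over
// all of its distributions (the original R*-tree summed margins only);
// the axis with the minimum score is chosen as the split axis.
double axisGoodness(const std::vector<Distribution> &dists) {
    double s = 0.0;
    for (std::size_t i = 0; i < dists.size(); i++)
        s += dists[i].margin + dists[i].overlap;
    return s;
}

// Change 2: along the chosen axis, minimize overlap and area jointly
// (the original minimized overlap, using area only to break ties);
// summing the two is one simple way to combine them.
int chooseSplitIndex(const std::vector<Distribution> &dists) {
    int best = 0;
    double bestScore = DBL_MAX;
    for (std::size_t i = 0; i < dists.size(); i++) {
        double score = dists[i].overlap + dists[i].area;
        if (score < bestScore) { bestScore = score; best = (int)i; }
    }
    return best;
}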

Reinsertion

We had nearly run out of time when we came up with the idea of reinsertion. We noticed that dynamic reinsertion is not implemented in gist_rstartree, and we did not have time to implement it. Instead we experimented with ad hoc deletion of the first 40,000 data items followed by reinsertion, driven by a perl script. We ran two experiments, shown below: (a) bulk: delete the first 40,000 items, then reinsert all 40,000; (b) interleaved: delete and reinsert item by item (delete A, reinsert A; delete B, reinsert B; and so on). The results were worse than the original. The reason is that deletion does not actually remove nodes, even when they hold fewer records than half the fanout; as we recalled from the earlier access-methods papers, index deletion is deliberately left incomplete for performance reasons.
 
Figure 7. Performance of ad hoc delete and reinsert of the first 40,000 items: (a) bulk; (b) interleaved.

 
 
 
 

Results:
The results are best demonstrated by the four pictures below (Figures 8a, 8b, 9, and 10). Figures 8a and 8b compare, in two dimensions, the tightness of Our-tree's bounding predicates against the original R*-tree's. Figures 9 and 10 show Our-tree and Our*-tree, respectively. We will not repeat the performance statistics, which are listed below, except to note how close to optimal both trees come, especially if the workload can tolerate a less precise key representation.
 
(a) 2-d view of the clustering of Our-tree.


(b) 2-d view of the original R*-tree.
Figure 8. Overall tree-view comparison between Our-tree and the original R*-tree.

 
Figure 9. Final performance of our optimized R*-tree (Our-tree), a nearly lossless index.

Given the improved Penalty() and pickSplit() algorithms above, with all values stored as doubles, we achieved the results displayed in the pictures.

 
 
 
Figure 10. Final performance of our exceptional R*-tree (Our*-tree), a lossy index.

Given the improved Penalty() and pickSplit() algorithms above, with single-precision float values, we achieved the results displayed in the pictures.

Results continued

The bottom line:

The bottom line is the set of three tables below: the first is Our*-tree (Our Star Tree), the second is Our-tree, and the third is the R*-tree.
 

 

The first table, Our*-tree, uses lossy compression on the key (floats instead of doubles). In many applications float precision is accurate enough to index and query a workload, and by using a float instead of a double our index can better utilize and cluster data on a page. We demonstrate that below with an outstanding 30,661 actual-clustering leaf I/Os, 160% better than the R*-tree and only 29% from optimal. Even the total number of I/Os at the internal and leaf nodes combined is 157% better than the R*-tree's total.
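The mechanism is simple; a sketch (with illustrative struct names, not the actual gist_rstartree key layout):

// Sketch: storing the 5-d key as single-precision floats instead of
// doubles halves the key from 40 to 20 bytes, so roughly twice as many
// entries fit on a page and the data cluster more tightly per page.
#include <cstdio>

const int DIMS = 5;

struct DoubleKey { double coord[DIMS]; };   // lossless: 40 bytes
struct FloatKey  { float  coord[DIMS]; };   // lossy:    20 bytes

// lossy conversion applied when a point is inserted
FloatKey compress(const DoubleKey &k) {
    FloatKey f;
    for (int i = 0; i < DIMS; i++)
        f.coord[i] = (float)k.coord[i];     // drops ~29 bits of mantissa
    return f;
}

int main() {
    const unsigned PAGE = 8192;             // assumed page size
    std::printf("entries per page: double %u, float %u\n",
                PAGE / (unsigned)sizeof(DoubleKey),
                PAGE / (unsigned)sizeof(FloatKey));
    return 0;
}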
 

 

Our*-tree:

Items Retrieved: 425,000        Avg. Util: 69.744

                        Leaf I/O's       Overhead   Internal I/O's   Total I/O's
Theoretical Minimum:    4250             N/A
Optimal Clustering:     21619            5.086
Random Clustering:      349263.2 (302)   16.155
Clustering Loss:        -2799.179        0.87       N/A              -2799.18
Utilization Loss:       1472             1.075      2129.258         3601.437
Exc. Coverage Loss:     10369            1.51       2460.439         12829.48
Actual Clustering:      30661            1.418      8752             39413

The second tree we call Our-tree. It is the best index we could create without resorting to less accurate floats; Our-tree is closer to a lossless tree since the data is stored as doubles. Its performance is still 75% better than the R*-tree at the leaf nodes and 43% from optimal.
 

 

Our-tree:

Items Retrieved: 425,000        Avg. Util: 69.475

                        Leaf I/O's       Overhead   Internal I/O's   Total I/O's
Theoretical Minimum:    4250             N/A
Optimal Clustering:     26183            6.16
Random Clustering:      377071.3 (273)   14.401
Clustering Loss:        -556.369         0.978      N/A              -556.369
Utilization Loss:       2042.368         1.079      737.6378         2780.007
Exc. Coverage Loss:     18490            1.668      5881.526         24371.53
Actual Clustering:      46159            1.7962     14673            60832

 

R*-tree:

Items Retrieved: 425,000        Avg. Util: 70.339

                        Leaf I/O's       Overhead   Internal I/O's   Total I/O's
Theoretical Minimum:    4250             N/A
Optimal Clustering:     26130            6.148
Random Clustering:      377195.7 (303)   14.435
Clustering Loss:        10379.817        1.397      N/A              10379.82
Utilization Loss:       2644.177         1.066      3521.611         6165.788
Exc. Coverage Loss:     41623            2.063      6879.781         48502.78
Actual Clustering:      80777            3.091      20439            101216

 

CONCLUSIONS AND FUTURE WORK

We achieved excellent performance through the combined effect of our modifications to Penalty() and pickSplit(). Specifically, Penalty() gave us a global view of an insert, and given that global view we inserted the new point into the node that increases area and overlap the least at every level.

Only later did we realize that a less precise representation of the data is acceptable in some domains; hence Our*-tree, which demonstrated how close to optimal our index can perform. The double-to-float transformation let us reach the outstanding benchmark of 160% better than the R*-tree in leaf I/Os on the given workload. With more time we would pursue changing some of the 32-bit int values to 16-bit shorts, which we would expect to cluster even better.

If we had more time we would definitely pursue one appealing future direction: dynamic reinsertion, described in [BK90]. To achieve dynamic reorganization, the R*-tree forces entries to be reinserted during the insertion routine. Forced reinsertion can move entries between neighboring nodes and thus decreases overlap; as a side effect, storage utilization improves, and with more restructuring fewer splits occur.
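A sketch of the forced-reinsert step from [BK90], with placeholder types rather than libgist code:

// Sketch of R*-tree forced reinsertion [BK90]: on the first overflow at
// a given level, instead of splitting, remove the p entries (about 30%)
// whose centers lie farthest from the node's BP center and re-insert
// them from the root; they may migrate to neighboring nodes.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Entry { double centerDist; /* distance of entry center to node BP center */ };
struct Node  { std::vector<Entry> entries; };

static bool fartherFirst(const Entry &a, const Entry &b) {
    return a.centerDist > b.centerDist;
}

void forcedReinsert(Node &n, std::vector<Entry> &toReinsert,
                    double fraction = 0.3) {
    std::size_t p = (std::size_t)(fraction * n.entries.size());
    // order entries by decreasing distance from the BP center
    std::sort(n.entries.begin(), n.entries.end(), fartherFirst);
    // evict the p farthest entries; the caller recomputes n's BP and
    // re-inserts the evicted entries from the root
    toReinsert.assign(n.entries.begin(), n.entries.begin() + p);
    n.entries.erase(n.entries.begin(), n.entries.begin() + p);
}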

We did not have time to implement dynamic reinsertion in libgist, but we would definitely follow that path if we did. The perl-script delete-and-reinsert experiments described above were our cheap approximation of it.

REFERENCES

[AOKI97] P. M. Aoki, "Generalizing 'Search' in Generalized Search Trees", UC Berkeley, 1997.

[BK90] N. Beckmann, H.-P. Kriegel, R. Schneider, B. Seeger, "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles", SIGMOD 1990.

[KSH99] M. Kornacker, M. Shah, J. M. Hellerstein, "Amdb: A Design Tool for Access Methods", UC Berkeley, 1999.