Spatial Indexes: MySQL

In two previous articles I introduced The spherical Earth model and importing data into SQLite for querying geographical data. In this article we're going to have a look at importing the data into MySQL and finding out how to best store and query spatial data in the databases.

MySQL

MySQL has some support for Spatial Extensions, but it's not particularly useful. For example, there is no way to query anything within the radius around a specific point. Their community pages list a method of implementing it, but it only calculates for a flat Earth model.

Instead, we'll have to implement our own algorithms again. But first of all, let us import the data into MySQL. First we create the MySQL database:

derick@whisky:~$ mysqladmin -u root -p create poi
mysql> CREATE TABLE poi(id int, type int, lat float, lon float, name char(255), address char(255), cuisine char(64), phone char(18));

And then we take the script from the previous article on importing data, and change the third line from:

$d = ezcDbFactory::create( 'sqlite://' . dirname( __FILE__ ) . '/pois.sqlite' );

to:

$d = ezcDbFactory::create( 'mysql://root:root@localhost/poi' );

Of course, substitute the username and password (root, root) and the database name (poi) to one that suits yourself. We then run again:

php parseoi.php.txt great_britain_pubs.osm

And check that our import is complete:

mysql> SELECT count(*) from poi\G
*************************** 1. row ***************************
count(*): 28147

Now that we have all the POIs imported, we can query the database. Because MySQL doesn't have working spatial extensions, nor the register-function capabilities from SQLite such as we saw in the previous article, we have to come up with a new solution. The most obvious one is writing a stored procedure for the task, another (non-explored option) would be to write a user defined function. Writing the stored procedure is simple enough; we just have to convert the distance() function to SQL. On the MySQL command line you can enter:

delimiter //

CREATE FUNCTION distance (latA double, lonA double, latB double, LonB double)
    RETURNS double DETERMINISTIC
BEGIN
    SET @RlatA = radians(latA);
    SET @RlonA = radians(lonA);
    SET @RlatB = radians(latB);
    SET @RlonB = radians(LonB);
    SET @deltaLat = @RlatA - @RlatB;
    SET @deltaLon = @RlonA - @RlonB;
    SET @d = SIN(@deltaLat/2) * SIN(@deltaLat/2) +
        COS(@RlatA) * COS(@RlatB) * SIN(@deltaLon/2)*SIN(@deltaLon/2);
    RETURN 2 * ASIN(SQRT(@d)) * 6371.01;
END//

Fetching all the pubs within a 250 meter radius around 51.5375°N, 0.1933°W is than as easy as running:

mysql> SELECT name, address, phone
       FROM poi
       WHERE DISTANCE(lat, lon, 51.5375, -0.1933) < 0.25;

With the result being:

| name            | address                        | phone           |
+-----------------+--------------------------------+-----------------+
| Mrs Betsy Smith | Kilburn High Road 77 , NW6 6HY | +44 20 76245793 |
| The Cock Tavern | Kilburn High Road 125          | NULL            |
| The Old Bell    | NULL                           | NULL            |
| The Westbury    | Kilburn High Road 34 , NW6 5UA | +44 20 76257500 |
+-----------------+--------------------------------+-----------------+
4 rows in set (0.72 sec)

If we want to also include the distance from the centre point in the result, we need to modify the query to:

mysql> SELECT name, DISTANCE(lat, lon, 51.5375, -0.1933) AS dist
       FROM poi
       HAVING dist < 0.25
       ORDER BY dist;

with as result:

| name            | dist                |
+-----------------+---------------------+
| Mrs Betsy Smith | 0.00825473345748987 |
| The Cock Tavern |     0.2420193460511 |
| The Old Bell    |   0.103123484090313 |
| The Westbury    |   0.150294300836645 |
+-----------------+---------------------+
4 rows in set (0.72 sec)

In both cases, the query takes 0.72 seconds. This is not overly fast, and the main reason for this is that the distance() function has to be called for every row in the table. An index can not be created on this either. However, what we can do is to filter out the results roughly first, by calculating a bounding box of latitude/longitude pairs around the centre point. Calculating the latitude boundaries can be done by:

d = (distInKm / 6371.01 * 2π) * 360
lat1 = centreLat - d
lat2 = centreLat + d

In our example that becomes:

d = (0.250 / 6371.01 * 2π) * 360 = 0.0022483
lat1 = 51.5375 - 0.0022483 = 51.5352.. -> 51.5351
lat2 = 51.5375 + 0.0022483 = 51.5397.. -> 51.5398

Or in PHP:

<?php
function findLatBoundary($dist, $lat, &$lat1, &$lat2)
{
    $d = ($dist / 6371.01 * 2 * M_PI) * 360;
    $lat1 = $lat - $d;
    $lat2 = $lat + $d;
}
?>

We round slightly up and down to combat inaccuracies in the calculations—it is a rough estimate after all. After adding the index on the lat column, and the index on the lon column, we reissue the query:

mysql> CREATE index poi_lat ON poi(lat);
mysql> CREATE index poi_lon ON poi(lon);

mysql> SELECT name, DISTANCE(lat, lon, 51.5375, -0.1933) AS dist
       FROM poi
       WHERE lat BETWEEN 51.5351 AND 51.5398
       HAVING dist < 0.25;

Which returns the same result as before, but faster:

| name            | dist                |
+-----------------+---------------------+
| The Westbury    |   0.150294300836645 |
| The Old Bell    |   0.103123484090313 |
| Mrs Betsy Smith | 0.00825473345748987 |
| The Cock Tavern |     0.2420193460511 |
+-----------------+---------------------+
4 rows in set (0.01 sec)

We can pre-filter the result set even more, by also limiting on the longitude boundary. This involves a few more calculations than for the latitude boundaries. The distance in degrees longitude that belongs to a distance in km depends on the latitude of the location. So first we need to calculate the latitude boundaries as we have done above, and with that information calculate the longitude boundaries.

sphere-distance.png

In the first step (red), we calculate the northern and southern boundaries of the circle. We then calculate the western and eastern boundaries for the northern boundary (green), and southern boundary (blue).

In PHP this algorithm becomes:

<?php
function findLonBoundary($dist, $lat, $lon, $lat1, $lat2, &$lon1, &$lon2)
{
    $d = $lat - $lat1;

    $d1 = $d / cos(deg2rad($lat1));
    $d2 = $d / cos(deg2rad($lat2));

    $lon1 = min($lon - $d1, $lon - $d2);
    $lon2 = max($lon + $d1, $lon + $d2);
}
?>

If we use both functions we calculate as boundaries:

<?php
$dist = 0.25;
$lat = 51.5375;
$lon = -0.1933;

findLatBoundary($dist, $lat, $lat1, $lat2);
findLonBoundary($dist, $lat, $lon, $lat1, $lat2, $lon1, $lon2);

echo "SELECT name, DISTANCE(lat, lon, $lat, $lon) AS dist
      FROM poi
      WHERE lat BETWEEN $lat1 AND $lat2
        AND lon BETWEEN $lon1 AND $lon2
      HAVING dist < $dist
      ORDER BY dist;\n";
?>

Which returns the query to execute:

SELECT name, DISTANCE(lat, lon, 51.5375, -0.1933) AS dist
FROM poi
WHERE lat BETWEEN 51.535251699514 AND 51.539748300486
  AND lon BETWEEN -0.19691479627963 AND -0.18968520372037
  HAVING dist < 0.25
  ORDER BY dist;

If we run this query, we get as result:

| name            | dist                |
+-----------------+---------------------+
| Mrs Betsy Smith | 0.00825473345748987 |
| The Old Bell    |   0.103123484090313 |
| The Westbury    |   0.150294300836645 |
| The Cock Tavern |     0.2420193460511 |
+-----------------+---------------------+
4 rows in set (0.00 sec)

As you can see, we are getting the result a lot faster now—0.00 sec vs 0.72 sec. Please do note, that if we would just have used the boundaries without using the HAVING dist < 0.25 clause, we would have gotten one extra result that is slightly too far away (0.281 km):

| name            | dist                |
+-----------------+---------------------+
| Mrs Betsy Smith | 0.00825473345748987 |
| The Old Bell    |   0.103123484090313 |
| The Westbury    |   0.150294300836645 |
| The Cock Tavern |     0.2420193460511 |
| Bar Hemia       |   0.281138534935208 |
+-----------------+---------------------+

Conclusion

In this article we saw how we can use a MySQL stored procedure to find our pubs and bars within a certain distance from a central location. In the next part, we will be looking on how to solve the same problem with PostgreSQL.

Shortlink

This article has a short URL available: http://drck.me/spat-mysql-8ls

Comments

Excellent post.

Although I'm still not convinced that using the haversine formula for small distances is the best approach here. For those distances (say, up to a few hundred meters, or maybe even a few several kilometers/miles) we can assume a Euclid geometry so Pythagoras is probably more efficient to use in this case.

So I'm very curious to see on how those 2 different formula's would differ in output when querying on short radius and how the performance would be.

Use 3959 instead of 6371.01 for miles.

You sure it is not better to use the spatial extension in MySQL if you have very large datasets? Say I want the 10 closest POIs to a (lat,lng) point, but I don't know if they are within 1 km, 10 km, 100 km or 1000 km? I could of course do a series of queries, first with 1 km, then 5 km if too few are returned within 1 km, then 10 km and so on - but that doesn't sound very effective either?

Great post thank you.

But I would be interested to know if MySQL by default has a functionality to return all locations within bounding box. So lets say I have a top left hand and bottom right corner coordinates of the map and I would like to display all pubs that would fit on this map. If not default in MySQL, how can this be achieved still keeping the performance?

Another question relating to the above, lets say for the given map it would return two thousand pubs (as my zoom level may be high), how can I group the locations in MySQL so that only a maximum of 50 markers are returned to be displayed on the map (so some may be a single marker while others will be a group marker that will contain say hundreds of locations for this particular zoom level.)

thanks

Inspiring, I was happy to know my shared hosting had the spatial extensions to its MySQL, but quite disappointed to know said extensions have no distance function working. So what's the point of having them altogether. Your approach I like very much. As a previous commenter said, this is not much useful if you want to order by distance unregarding of boundaries, but since I wouldn't be doing this anyway.. thanks!

I use a square instead of a circle.... My 2cents...

Q: Do you know any tricks to creating a good index?

eg, to speed up lat/lon searches...

Thanks for posting!!

@Paul - for grouping of results which are near you can GROUP BY a rounded (less accurate) latitude and longitude concatenated together / sha1'd / md5'd

So 51.53, -0.19 would become

concat: 51.53-0.19 sha1: a4e569a76ac978607edf2bb6ec6f5c53db0bbd49 md5: 2bf8a4ef399be9403be086245eae08a5

You can then also return the number of results in that with a SUM() with your GROUP BY

I’m finding it hard to convert function findLatBoundary($dist, $lat, &$lat1, &$lat2)

$d = ($dist / 6371.01 * 2 * M_PI) * 360; $lat1 = $lat - $d; $lat2 = $lat + $d;

To c# can you point be in the right direction?

I was wondering if abs needs to be added for distance computation:

$d = abs($lat - $lat1);

$d1 = abs($d / cos(deg2rad($lat1))); $d2 = abs($d / cos(deg2rad($lat2)));

$lon1 = min($lon - $d1, $lon - $d2); $lon2 = max($lon + $d1, $lon + $d2);

I am working on a site that needs to regularly do bounding box queries and could easily have millions of records per queried table.

I really like this example but I am worried about how well separate indexes on latitude and longitude will hold up on larger data sets, since the data cannot be stored using spatial indexing.

Do you think that there will be a significant performance hit using MySQL to do this over PostgreSQL?

To clarify I would still be using bounding box and haversine instead of relying on the built in methods if I used PostgreSQL it would literally just be to ensure the data is physically stored based on spatial indexes.

I do not know how much testing you did but you seem to be familiar with both systems, and I would love to hear your opinion on the matter.

Also I cannot seem to find the guide on PostgreSQL implementation of this that you mentioned you would release, can you link me to it?

@Dave:

I must admit I have never written the PostGreSQL variant. Instead I started experimenting with MongoDB's features instead, mostly because I work for them now.

Theoretically, a spatial index would be better because having a compound index with two ranges for each of the index elements (a range on lat and a range on lon) is difficult for non-spatial indexes to handle. I am currently writing a talk on MongoDB 2.4's geo features, and as part of that I am exploring the new geospatial index. The intention is then to also write a similar article to this one but instead using MongoDB. If you let me know what sort of queries and results you expect, and what sort of dataset, I can make sure to address that in this article.

If you intend to use PostGreSQL, I would suggest you use the PostGIS extensions to do you querying, as that will be a lot faster than just relying on a spatial index.

You can always use a search indexer like Sphinx Search. Here's a sample of doing geo searchging: http://www.god-object.com/2009/10/20/geospatial-search-using-sphinx-search-and-php/

nice post. I have avoided distance() because I just use a box. But I may try "stored procedures" which I didn't understand. I am also curious about mariaDB which is up and coming.

Add Comment

Name:
Email:

Will not be posted. Please leave empty instead of filling in garbage though!
Comment:

Please follow the reStructured Text format. Do not use the comment form to report issues in software, use the relevant issue tracker. I will not answer them here.


All comments are moderated

Life Line