triptic

Weblog by triptic | Online Communication in Eindhoven

MySQL performance nightmares: Solr to the rescue

For almost a year we have worked very hard on a huge project: Oneindig Noord-Holland. In case you were wondering why it’s been a bit quiet from my side: That’s why.

One of the key features of this website was its search feature. It’s going to be a huge collection of items of different types: stories, events, news, profiles, images, videos and a lot more… All these types of  items can have the same properties: title, author, description, tags, dates, date ranges, specific locations, location areas and links to other items. These items can also have content which contains text, images, videos, links to other items and more. As a visitor you need to be able to work your way through all these items and that’s where the search comes in.

We wanted everything in the search we could think of and we wanted it to filter on most of the properties:

  • Text: in text, tags, title, subtitle, description etc.
  • Tag
  • Location: both by region and within a certain distance from a specified location
  • Time: both by a period in time and by a specific point in time
  • Author
  • State: published or not
  • Used: used in other items
  • etc.

I thought everything through and looked at each type of search option separately and concluded most properties weren’t a problem with the queries in our database MySQL. There were some that did pose some problems, but I found solutions for this in advance:

  • Filter by a region: There are only 8 regions. When an item is added or edited we can just calculate to which regions it belongs and save that too.
  • Filter by a specific location: Not filter by radius but filter by a box. We now have minimum and maximum longitudes and latitudes to filter by.
  • Filter by text: Easy enough: just use a lot of LIKE ‘%word%’ and that’s it. When this would prove to be a performance issue we could always create a shadow table in MyISAM with a fulltext index and that’s it.

With a potential number of 20 tables involved with this query this wasn’t going to be an easy task and performance was going to be an issue. But I’ve read the book High Performance MySQL and a lot of articles on the MySQL Perfomance Blog. So I know the explain statement, I know MySQL can’t use more than one index from a table in a specific query, so I have to choose my indexes carefully and I know a large UNION statement performs better than a simple OR. I know everything I know to get the job done. What could go wrong?

The answer to this question is as simple as it is frustrating:
A lot!

The problems began when we started some test imports of images from external sources into our development environment. These were literally tens of thousands of images. Suddenly the site didn’t perform very well on our development environment: Our search query was taking 2 minutes to run when we hadn’t even used all the filtering options. No problem: we would locate the bottlenecks and fix them! We did and performance improved… a bit. We did it again and performance improved… a bit. So now we were really digging into the endless possibilities of queries and we were trying to find out which indexes were wrong and how we could tempt MySQL into using the right indexes. Until we ran into a bit of a problem: dependent subqueries, which really didn’t really seem that dependent at all… A bit of googling taught me this is a known problem in MySQL and there are workarounds. These workarounds involve writing stored procedures or clientside code to collect the data from the subquery in a separate query and feed the results to the main query. Because our subqueries can potentially produce thousands of results and they need to be combined with other subqueries, I didn’t think these workarounds were gonna give the performance benefits we were aiming for. It was time to sound the alarms…

Solr

It didn’t take very long to find the tool we needed. I attended a workshop about integrating site search at the dutch phpconference in 2010 where the speaker tried to – but couldn’t really – hide the fact that Solr was his favorite tool to use for site searching. With only a few weeks until the deadline we didn’t really have time to obtain knowledge about using a to us completely new tool like Solr. Luckily Bas de Nooijer was able to lend us a helping hand and we got Solr up and running and integrated in our site in just a few days! The performance benefits were amazing, though we did have some issues adding multiple locations and date ranges to the items in Solr. Without getting to much into the technical details I’ll try to explain what happened.

Multiple locations and areas

The newest beta versions of Solr do have support for multiple locations, but we didn’t want to take the risk of using a beta version. On top of that we still had the problem of multiple dates and dateranges. The problem of filtering on multiple locations and areas was solved by storing location and area ids in Solr so we could make a query in MySQL of which the resulting ids would be passed to the Solr query. If we would have a lot of locations this would still pose a problem but fortunately this was not the case.

Multiple dates and dateranges

Bas and I found a, what I think, clever solution for this part of the problem. We first thought we could store all days as a number within an item. Because we had items with ranges from 3000 B.C. until 1000 A.D. it turned out that an item like this really had to much days in it to be able to store it in any convenient way. This was solved by taking the full years out of these date ranges and store them in a separate field. The number of days was now small enough to be stored and we could filter by a custom date range by splitting it up into years and days and find the related items with a simple OR in the Solr query.

Thanks to Solr and Bas de Nooijer we could launch the site as scheduled with a very nice and fast search feature. Check it out at Oneindig Noord-Holland (site is in dutch).

Written by Onno Marsman
Tags: , ,

Respond to this post