Using Raw SQL snippets to create Thinking-Sphinx Indexes

We have used Thinking-Sphinx in several projects to implement search engines. Thinking-Sphinx is a wrapper in Ruby to use ActiveRecord with Sphinx. But in the project I’m working right now I had to solve a different problem: the user should be able to search for formatted values, however without using the formatting characters, i.e, suppose a field containing 111.444.777-35, the user should be able to find it searching for 11144477735 (or a substring of it).

The default behaviour for Thinking-Sphinx is to index the exact values for each column it scans, but we can by-pass it. The Thinking-Sphinx index definition allows us to use raw SQL to define which values should be indexed for each attribute. In my case, I used the native PostgreSQL regular expressions functions to get rid of the punctuation characters, indexing only the “clean” version of each value. My database column is called “entity_number”:

define_index do 
  indexes "regexp_replace(entity_number, E'[[:punct:]]', '', 'g')", :as => :raw_entity_number
  group_by "raw_entity_number"
end

However, there are two gotchas:

  • The regular expression metacharacters have to be escaped, so Sphinx can build the right SQL query. If you will or will not need to do the same, will depend on what you’re using inside your SQL snippet.
  • Sphinx has to put each indexed column in a GROUP BY clause. By default, fields created like I did above are not automatically included in this clause, so your indexing will fail. You’ll need to explicitly include your new field to the GROUP BY clause using “group_by :raw_entity_number”. The new field name will be appended to the query, without messing up the values already present in it.