Thinking Sphinx in Arabic/Unicode
While using Thinking Sphinx in one of my Rails projects, I needed to search Arabic content. Since Sphinx supports Unicode, I thought that would be easy. It was not, because Unicode support through Thinking Sphinx is poorly documented. So here is what to do to support Arabic (Unicode) search.
After reading a little of the Sphinx documentation, I learned that to support non-English languages I had to create a charset_table for Sphinx to use while indexing my data. After some research, I found a nice charset table for several languages. So I went to the configuration file created by Thinking Sphinx (app/config/development.sphinx.conf) and added an English/Arabic charset_table. I stopped searchd, reindexed and restarted it. Then I tried searching in Arabic, with no luck! I noticed that my new configuration, including the charset_table, was gone! Why? Thinking Sphinx regenerates the configuration file before reindexing!
After a lot of research, I discovered that to add your custom configuration, you must create the file app/config/sphinx.yml, which Thinking Sphinx will use to override its default configuration. Hey, why didn't anyone tell me that?!
After two hours of YAML syntax errors, I got it working. Here is my sphinx.yml:
development: &my_settings
  enable_star: true
  min_prefix_len: 0
  min_infix_len: 1
  min_word_len: 1
  charset_table: "0..9, a..z, _, A..Z->a..z, U+621..U+63a, U+640..U+64a, U+66e..U+66f, U+671..U+6d3, U+6d5, U+6e5..U+6e6, U+6ee..U+6ef, U+6fa..U+6fc, U+6ff"
test:
  <<: *my_settings
production:
  <<: *my_settings
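Since most of the pain here came from YAML syntax, it can help to check the file from plain Ruby before handing it to Thinking Sphinx. The following is only a minimal sketch: it parses the same settings with Ruby's standard YAML library, confirms that the `&my_settings` anchor is merged into the other environments, and tests whether every character of a word falls inside the charset_table ranges. The helper names (`codepoint`, `charset_ranges`, `indexable?`) are hypothetical and written for this illustration; the parsing only handles the entry forms used in this table and is not how Sphinx itself reads it. It assumes a Ruby version whose `YAML.safe_load` accepts the `aliases:` keyword.

```ruby
require "yaml"

# The same settings as sphinx.yml, inlined so the sketch is self-contained.
SPHINX_YML = <<~YAML
  development: &my_settings
    enable_star: true
    min_prefix_len: 0
    min_infix_len: 1
    min_word_len: 1
    charset_table: "0..9, a..z, _, A..Z->a..z, U+621..U+63a, U+640..U+64a, U+66e..U+66f, U+671..U+6d3, U+6d5, U+6e5..U+6e6, U+6ee..U+6ef, U+6fa..U+6fc, U+6ff"
  test:
    <<: *my_settings
  production:
    <<: *my_settings
YAML

# aliases: true lets the parser resolve the *my_settings references.
SETTINGS = YAML.safe_load(SPHINX_YML, aliases: true)

# Hypothetical helper: turn "U+621" or a literal character into a codepoint.
def codepoint(token)
  token = token.strip
  match = token.match(/\AU\+(\h+)\z/i)
  match ? match[1].to_i(16) : token.ord
end

# Hypothetical helper: turn each charset_table entry into a codepoint range.
# For a mapping such as "A..Z->a..z" only the source side is kept.
def charset_ranges(table)
  table.split(",").map do |entry|
    source = entry.split("->").first.strip
    first, last = source.split("..").map { |token| codepoint(token) }
    first..(last || first)
  end
end

# Hypothetical helper: true when every character is covered by some range.
def indexable?(word, ranges)
  word.each_char.all? { |char| ranges.any? { |range| range.cover?(char.ord) } }
end

RANGES = charset_ranges(SETTINGS["development"]["charset_table"])

puts SETTINGS["production"]["min_infix_len"]      # value merged from development
puts indexable?("\u0643\u062A\u0627\u0628", RANGES) # the Arabic word "kitab"
```

If the file has an indentation or quoting mistake, the `YAML.safe_load` call raises an error that points at the offending line, which is quicker than reindexing to find out.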
Other Settings
- min_word_len: 1
Setting the minimum indexed word length to 1 means index everything.
- min_prefix_len: 0
Setting the minimum word prefix length to index to 0 disables prefix indexing. If set to a positive number, the indexer would index all the possible keyword prefixes (i.e. word beginnings) in addition to the keywords themselves.
- min_infix_len: 1
Setting the minimum infix length to index to 1 asks the indexer to index all the possible keyword infixes (i.e. substrings) in addition to the keywords themselves. This allows wildcard searching by 'start*', '*end', and '*middle*' wildcards. However, indexing infixes will make the index grow significantly (because of many more indexed keywords), and will degrade both indexing and searching times. Note that you can't enable both prefix and infix indexing at the same time; that's why I disabled prefix indexing.
- enable_star: true
This enables "star-syntax", or wildcard syntax, when searching through indexes which were created with prefix or infix indexing enabled. It only affects searching, so it can be changed without reindexing by simply restarting searchd.
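To get a feel for why infix indexing makes the index grow, here is a rough Ruby sketch that lists every substring of a keyword that is at least min_infix_len characters long. The `infixes` helper is hypothetical and only illustrates the idea; it says nothing about how Sphinx stores infixes internally.

```ruby
require "set"

# Hypothetical helper: every distinct substring of keyword with at least
# min_infix_len characters.
def infixes(keyword, min_infix_len)
  result = Set.new
  (0...keyword.length).each do |start|
    (min_infix_len..(keyword.length - start)).each do |length|
      result << keyword[start, length]
    end
  end
  result
end

puts infixes("sphinx", 1).size          # prints 21
puts infixes("sphinx", 3).size          # prints 10
puts infixes("sphinx", 1).include?("hin") # prints true
```

A single six-letter keyword already yields 21 entries at a minimum length of 1, which is why a larger min_infix_len is worth considering when index size or indexing time becomes a problem.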
Now, stop, reindex and restart searchd:
rake thinking_sphinx:stop
rake thinking_sphinx:index
rake thinking_sphinx:start
Finally, for the wildcard search to work, your controller should look something like this:
class PostsController < BaseController
  def search
    @posts = Post.search "*#{params[:search_query]}*"
  end
end
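One thing to watch with multi-word queries: as far as I understand, Sphinx treats each whitespace-separated word as its own keyword, so in "*foo bar*" the first word only gets a leading star and the last word only a trailing one. A possible workaround, sketched below, is to wrap each word separately before searching. The `star_query` helper is hypothetical and not part of Thinking Sphinx.

```ruby
# Hypothetical helper: wrap every word of the raw query in stars.
# Stars typed by the user are dropped first, and nil or blank input
# produces an empty string.
def star_query(raw)
  raw.to_s.split
     .map { |term| term.delete("*") }
     .reject(&:empty?)
     .map { |term| "*#{term}*" }
     .join(" ")
end

puts star_query("foo bar") # prints *foo* *bar*
puts star_query("*foo")    # prints *foo*
```

In the controller, this would be used as `Post.search star_query(params[:search_query])`.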
You should be enjoying Arabic search now.
Jerome
March 8th, 2009 - 12:42 PM
Thank you !!!!! Works great with hebrew too using U+5D0..U+5EA, U+5F0..U+5F2, U+5BE, U+5C0, U+5C3, U+5F3, U+5F4
Faisal
April 22nd, 2009 - 08:54 PM
Thanx a lot... This is a great guide that gathers all important info in one place. Wish the sphinx official documentation was as concise and elaborate as your post here. The only thing I might add, is that you need to make sure that your database is UTF8. Even though you might create your tables with charset utf8, you need to make sure that the data is being stored as utf8 by ensuring that your connection is set with "SET names utf8". If your data is not in utf8, nothing will be indexed and you'll get zero results for your search. A great guide of how to convert your old rails DB into utf8 can be found here: http://tumblelog.jauderho.com/post/27806549/converting-your-rails-app-to-utf8
Hatem
April 23rd, 2009 - 09:21 PM
Glad to hear that, @Jerome. Thanks @Faisal for the nice comment and for the info.