I current what I belive is a novel index for indexing and looking out supply code. It copies concepts from Bing bitfunnel implementation to create a really quick, reminiscence environment friendly trigram index over supply code.
- searchcode.com is now utilizing a customized constructed index written by yours actually
- It indexes 180-200 million paperwork and 75 billions strains of code
- The index works utilizing bloom filters sharded by distinctive doc trigrams
- It borrows among the core concepts of bitfunnel utilized in microsoft’s bing
- Using trigrams inside a bloom filter search is so far as I can inform distinctive
- It lowered the index search instances from many seconds to ~40ms throughout searchcode
- Finish consumer searches nonetheless take round 300 ms to course of although
- It additionally improved search relevance and reliability
- Structure smart issues grew to become easier, as the brand new index sits on a single machine not 4
- It makes use of caddy because the reverse proxy and redis as a degree 2 cache
- The index sits solely in reminiscence on a 16 core 5950x CPU and 128 GB RAM machine
- It at present processes near one million searches on daily basis
- I additionally gave the location a brand new feel and appear which is a lot better than what it seemed like beforehand
- I had an absolute blast studying and constructing it
On Monday twenty first November 2022 I up to date the DNS entries for searchcode to level on the new searchcode server working a customized applied index, ending the reliance on sphinx/manticore with a purpose to energy searches. Since being launched I’ve noticed it run over 1 million searches a day with a rolling common runtime of ~40 ms for substitute index.
Someday through the 2020 lock downs once I commented on the corporate slack that I ought to construct my very own customized index for searchcode. A couple of responded encouragingly that if anybody can do it it was me. Whereas I appreciated their confidence I didn’t do something till mountaineering with a mate someday in 2021 the place after a couple of beers on the finish of the day I brimming with confidence laid out my plan of assault.
I had felt like a fraud for some time. I get plenty of questions on indexing code and my reply has all the time been that I take advantage of sphinx search. Not too long ago that has modified, as I moved over to its forked model manticore search. Manticore is a wonderful successor to sphinx and I actually do advocate it.
Nevertheless, utilizing the above is me outsourcing the core performance of searchcode to a 3rd social gathering, and I strongly consider you shouldn’t outsource your core competency. So I assumed I ought to actually strive to do that myself. Apart from, I knew I might actually take pleasure in this course of and so a couple of days later I began work.
The results of my effort may be considered at searchcode.com and appears just like the beneath.
A bit of over a yr’s value of effort and I can now speak/write about it. I take into account it the top of my growth profession to this point, and I’m very happy with my achievement. Be happy to learn additional if you need to learn the way I did it, what I used to be considering and so on… What follows is a set of my growth ideas and concepts.
What follows are my ideas I stored as I applied issues. Included so I’ve one thing to look again on, and maybe somebody will discover it helpful. It could be a bit disjointed as plenty of it was written as I used to be constructing issues.
The Downside with Indexing Supply Code
So why even construct your personal index? Why would anybody take into account doing this when so many tasks can do that for you. Off the highest of my head now we have code accessible for lucene, sphinx, solr, elasticsearch, manticore, xaipan, gigablast, bleve, bludge, mg4j, mnoGoSearch… you get the concept. There are plenty of them on the market. Most of them nonetheless are targeted on textual content and never supply code, and the distinction in terms of search is bigger than you would possibly anticipate.
It’s additionally value noting that the actually massive scale engines corresponding to Bing and Google have their very own indexing engines. Partly as a result of it provides you the power to wring extra efficiency out of the system since you already know the place you may reduce corners or save house, assuming you may have the talent or information.
I suppose the saying “the factor about reinventing the wheel is you will get a spherical one” applies right here. By implementing my very own answer I may guarantee I get one thing that works inside the bounds of what I would like.