History

More than twenty years ago, a colleague, David, and I created an internet indexing website for a project at Auckland University of Technology. Our thinking back then was that the internet was like a book: each website was a page, but the book had no index. So our project was to create one, which we did. This was at a time when the internet was mainly available to educational institutions such as universities. AltaVista didn’t exist yet.

The machine was a 486DX. The operating system was Slackware Linux (pre-version 1.0). The web server predated Apache.

Linux at the time did not have a database engine stable enough to use, so I wrote one. The engine supported a very simple subset of SQL and was programmed in C++. The logic for the website was also written in C++, and the code was compiled with GCC.

Fast-forward to 2017. Search engines have now been around for a while and have matured, so we use Google as the benchmark for a modern search engine.

Method

This time I have used a Raspberry Pi 3 as the machine, with a 500 GB USB hard drive attached for the space needed. For the operating system I chose Raspbian Jessie (Debian Linux), installed in a headless configuration (no graphical interface). Database engines have improved over time, so I used MySQL Server (5.5.57). For the programming I chose PHP, a scripting language, editing it in Notepad++. To configure the Raspberry Pi and edit files remotely, I used an SSH client.
I’ve created one script that runs in the background, crawling the internet for websites. For this project it only grabs each site’s main page, not the rest of its pages. It records two meta tags if they exist, description and keywords, along with the website URL and the index page itself; an index number and a date stamp are generated automatically.
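
As an illustration, here is a minimal sketch of what that per-site step could look like in PHP. The table name websites, its columns and the connection details are assumptions for the example, not the actual project schema.

    <?php
    // Hedged sketch of the crawler's per-site step. Table name, column names
    // and credentials are placeholders, not the real project schema.

    $db = new mysqli('localhost', 'crawler', 'secret', 'searchdb');

    function indexSite($db, $url)
    {
        // Grab only the site's main page.
        $page = @file_get_contents($url);
        if ($page === false) {
            return; // skip unreachable sites
        }

        // get_meta_tags() is a PHP built-in that parses <meta> tags from a page.
        $meta        = @get_meta_tags($url);
        $description = isset($meta['description']) ? $meta['description'] : '';
        $keywords    = isset($meta['keywords'])    ? $meta['keywords']    : '';

        // The auto-increment id supplies the index; NOW() supplies the date stamp.
        $stmt = $db->prepare(
            'INSERT INTO websites (url, index_page, description, keywords, crawled_at)
             VALUES (?, ?, ?, ?, NOW())'
        );
        $stmt->bind_param('ssss', $url, $page, $description, $keywords);
        $stmt->execute();
        $stmt->close();
    }

    indexSite($db, 'http://example.com/');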

The search engine itself, where you type in the query, needs some logic to pull out the most relevant results. I created a SQL function that scores your keywords against the database; this is wrapped in a stored procedure that returns the results in descending order of relevance.
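
To make that concrete, here is one possible shape of the query side in PHP. The procedure name SearchSites and the returned columns are hypothetical, used only to illustrate calling a ranking stored procedure from a script.

    <?php
    // Hedged sketch of calling a ranking stored procedure from PHP.
    // "SearchSites" and the column names are assumptions, not the project's actual code.

    $db = new mysqli('localhost', 'search', 'secret', 'searchdb');

    $query = isset($_GET['q']) ? trim($_GET['q']) : '';

    // The procedure is assumed to score each stored page against the keywords
    // and return rows ordered by that score, highest first.
    $stmt = $db->prepare('CALL SearchSites(?)');
    $stmt->bind_param('s', $query);
    $stmt->execute();

    $result = $stmt->get_result(); // requires the mysqlnd driver
    while ($row = $result->fetch_assoc()) {
        echo $row['url'] . ' - ' . $row['description'] . "\n";
    }
    $stmt->close();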

The user interface of our search engine needs to be very simple. If you examine Google’s search page, you’ll notice it is straight to the point, not cluttered with anything irrelevant.
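
As a rough idea of how little is needed, a single PHP page along these lines would do. The field name and the search.php target are placeholders, not the actual interface.

    <?php // Hedged sketch of a deliberately minimal search page: one box, one button. ?>
    <!DOCTYPE html>
    <html>
    <head><title>Search</title></head>
    <body>
      <form method="get" action="search.php">
        <input type="text" name="q" size="50">
        <input type="submit" value="Search">
      </form>
      <?php
      // search.php (assumed name) would run the ranking procedure shown above
      // and print each result as a link followed by its meta description.
      ?>
    </body>
    </html>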

Conclusion

With the changes over the years, a small computer that fits in your hand can now be used as a search engine. Using PHP rather than C++ makes development quicker: PHP has a lot of useful built-in functions, with the bonus of a syntax similar to C++, and because you end up relying on those built-in functions for much of the work, they also keep your code small.

Database engines are now powerful and very fast. Using MySQL Workbench, a graphical front end, made database development very quick. The project only needed four tables, four stored procedures and one function.

In conclusion, such a project is a much simpler undertaking today: less time goes into troubleshooting and debugging, and there is plenty of documentation on the internet to help you find the right PHP functions and SQL methods.

Future additions

- Add spell checking, which should give more accurate results when the query contains a spelling error.

- Enhance the results-ranking algorithm.

- Split up the URL crawler so that parts of it run in parallel, speeding up the process and populating the search tables more quickly.

- Tag stored URLs with categories, e.g. spyware, politics and news, to list a few.