Saturday, November 26, 2005

Baidu is better than Google in Chinese Search

2005 Quality Comparison of Worldwide Chinese Search Engines

According to industry estimates, the worldwide search market will grow rapidly, at a 35% annual rate, for the next four years, and its total size will reach $7 billion in 2007. Over the next three years, the Chinese search market will grow even faster, at 60-70% a year. It reached 880 million yuan in 2004 and will reach 2.4 billion yuan in 2006. Search engines have now leaped from a simple IT technology into a search economy. Search could become the next profit generator in the internet sector, after value-added wireless services and online gaming services. For that reason, the importance of search engine quality has also reached a historic high.
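As a quick consistency check on the figures above, the two market sizes can be turned into an implied compound annual growth rate. This is only a back-of-the-envelope sketch; the variable names are illustrative and the inputs are the 2004 and 2006 numbers quoted in the paragraph.

```python
# Market sizes quoted above, in millions of yuan.
market_2004 = 880.0
market_2006 = 2400.0
years = 2006 - 2004

# Compound annual growth rate implied by the two endpoints.
implied_rate = (market_2006 / market_2004) ** (1 / years) - 1

print(f"Implied annual growth: {implied_rate:.1%}")
```

The result comes out at roughly 65% a year, which sits inside the quoted 60-70% range.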

In September 2005, the IT Usability Lab at Qinghua University (the top university in China) finished another comparative study of the overall quality of Chinese search engines. The study included the following leading search engines:

Google, 一搜(Yisou, from Yahoo.China), 百度(Baidu), 中搜(Zhongsou), 爱问 (iAsk, from Sina), 搜狗(Sogou, from Sohu).

The quality of a search engine is determined by the quality of the web pages it returns. Finding related web pages is the foundation of a search engine. The quality standards are based on relevancy, coverage, dead-link rate, fraud rate, and Chinese phrasing.

Relevancy

Relevancy is the relationship between a user's search intention and the content of a returned web page. It is the capability of a search engine to find what a user wants to find, and it is closely tied to the efficiency of the search engine and to user satisfaction. A highly relevant search engine saves the user's most precious resources, time and network usage, which makes relevancy the most critical measurement of a search engine.

This test used real search-term samples from the log database and followed external procedures. The test results were then analysed using multiple parameters.

Fig 1. Relevancy data

Result: the data shows that, for commonly used keywords, there is no significant difference between the six search engines. Google, Baidu, and Zhongsou are in the lead. Search times were also very short for all of them.

Web Page Coverage

Web page coverage, also called indexed volume, indicates the number of web pages indexed by each search engine. Each individual test result is a relative value, compared against all the search engines as well as published internet data. To reduce the effect of repetitive results, the calculation used an algorithm that excludes repetitive data both in each single calculation and in the final summary calculations.
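The study does not publish its algorithm, so the following is only a minimal sketch of one way a relative, deduplicated coverage value could be computed. It assumes that duplicates are identified by identical URLs and that each engine is scored against the pool of distinct URLs found by all engines; the function name, engine labels, and URLs are hypothetical.

```python
def relative_coverage(indexed):
    """Share of the pooled distinct URLs that each engine has indexed."""
    # Remove repeats within each engine's own sample first.
    unique = {engine: set(urls) for engine, urls in indexed.items()}
    # Pool every distinct URL seen by any engine (repeats across engines collapse).
    pool = set().union(*unique.values())
    if not pool:
        return {engine: 0.0 for engine in unique}
    return {engine: len(urls) / len(pool) for engine, urls in unique.items()}


# Hypothetical sample: engine_a returns one URL twice.
sample = {
    "engine_a": ["http://example.com/1", "http://example.com/1", "http://example.com/2"],
    "engine_b": ["http://example.com/2", "http://example.com/3", "http://example.com/4"],
}
print(relative_coverage(sample))
```

Scoring against the pooled set keeps the values comparable across engines without needing to know the true size of the web.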

Fig 2. Web page coverage (yellow line is the total number, blue line is the number of still pages, and purple line is the number of active pages)

Result: Google and Baidu led in active pages. In overall number, Google, Baidu, and Zhongsou were better than the others.

Dead-link Rate

Dead links point to web pages that no longer exist or cannot be reached. The dead-link rate is affected by time, region, and network state.
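A dead-link rate can be sketched as the share of sampled result URLs that cannot be reached, averaged over several runs to smooth out network conditions. This is an illustration, not the lab's procedure: the function names are hypothetical, and the reachability check is passed in as a callable so that a real HTTP check or a simulated one can be used.

```python
from typing import Callable, Iterable, Sequence


def dead_link_rate(urls: Iterable[str], is_reachable: Callable[[str], bool]) -> float:
    """Fraction of URLs for which the reachability check fails."""
    urls = list(urls)
    if not urls:
        return 0.0
    dead = sum(1 for url in urls if not is_reachable(url))
    return dead / len(urls)


def average_rate(rates: Sequence[float]) -> float:
    """Mean of several measured rates, e.g. runs taken at different times."""
    return sum(rates) / len(rates)


# Simulated check: only the URLs in this set count as reachable.
alive = {"http://example.com/a", "http://example.com/b", "http://example.com/d"}
sample = [f"http://example.com/{name}" for name in "abcd"]
print(dead_link_rate(sample, lambda url: url in alive))
```

Because results vary with time, region, and network state, repeating the measurement and averaging the runs gives a steadier figure than any single pass.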

Fig 3. Average dead-link rate (average of three dead-link rates)

Result: the dead-link rate test is affected by network and server state, but less by sample selection. Yisou, Baidu, and Sogou did better, while iAsk lagged.

Fraud Rate

Search fraud comes from those who use automated or manual means to inflate their web pages' rankings in search engine results. The sampling method is similar to that of the relevancy test.

Fig 4. Fraud rate (blue line is from the numbers for the front page, purple line is from the numbers for the first three pages)

Result: Zhongsou has the lowest fraud rate, followed by Baidu and iAsk. In this category, the lead was significant.

Repetitive Rate

Repetitiveness is an important factor that lowers the quality of the returned web pages. It not only hurts the user experience but also consumes system resources and reduces search efficiency. This test was done only for the top three players: Google, Baidu, and Zhongsou. The repetitive rate was analysed over the first five pages of results. Sampling was similar to that of the relevancy test. In total, 160 samples were collected, with 10 results per page. Paid subscriptions were excluded.
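One way to picture a repetitive rate is as the share of results that repeat something already shown earlier in the same result list. The sketch below assumes that definition, which the study does not spell out; the function name and the `key` parameter (used to decide when two results count as the same content, for example a normalized title) are hypothetical.

```python
def repetitive_rate(results, key=lambda result: result):
    """Fraction of results whose key was already seen earlier in the list."""
    seen = set()
    repeats = 0
    for result in results:
        k = key(result)
        if k in seen:
            repeats += 1  # same content as an earlier result
        else:
            seen.add(k)
    return repeats / len(results) if results else 0.0


# Hypothetical titles from one query; two are re-posts of an earlier item.
titles = ["Market report", "market report", "Search news", "Other story", "SEARCH NEWS"]
print(repetitive_rate(titles, key=str.lower))
```

For a query in the study's setup, `results` would hold the 50 entries from the first five pages, and the rates would then be averaged over all sampled queries.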

Fig 5. Repetitive rate of the first five pages

Result: Baidu has the lowest repetitive rate. Most of the repetitive data are re-posts, and news re-posts dominate among them.

Chinese Phrasing

Chinese phrasing refers to the process by which a computer divides the Chinese characters in a sentence into appropriate phrases. Multiple factors are used in this analysis.
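To make the idea concrete, here is a minimal sketch of forward maximum matching, a classic dictionary-based way to split Chinese text into words. It is offered only as an illustration and is not claimed to be the method used by any of the engines tested; the function name and the tiny dictionary are hypothetical.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Greedy left-to-right segmentation using the longest dictionary match."""
    words = []
    i = 0
    while i < len(text):
        longest = min(max_len, len(text) - i)
        # Try the longest candidate first, then shrink.
        for size in range(longest, 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


# Toy dictionary for illustration only.
vocabulary = {"北京", "大学", "北京大学", "学生"}
print(forward_max_match("北京大学学生", vocabulary))  # ['北京大学', '学生']
print(forward_max_match("李小明", vocabulary))        # ['李', '小', '明']
```

The second call shows why names are hard for a purely dictionary-based approach: a personal name missing from the dictionary falls apart into single characters.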

Fig 6. Chinese phrasing composite analysis

Explanation: Acceptability indicates overall and partial correctness.

Result: Baidu leads in overall Chinese phrasing, followed by Zhongsou and Google.

In the individual tests, names, both Chinese and foreign, were the most difficult part for search engines; Zhongsou, Baidu, and iAsk did better here. For Chinese local names, iAsk and Baidu scored the highest. Overall, domestic search engines have clear advantages in Chinese phrasing.

Summary Tables


Table 1. Overall coverage


Table 2. Overall quality analysis

Table 3. Overall quality turnover rate

Conclusion: Domestic search engines have advanced significantly. They have clearly improved their search quality, especially Baidu and Zhongsou, and in many categories they have surpassed Google. Baidu has passed Google to lead in the overall analysis. Domestic search engines are doing better in the areas of relevancy, coverage, anti-fraud, and Chinese phrasing. The lead in Chinese phrasing is significant.


Translated by Huatong (Chinese link: tech.china.com)
