Posts tagged ‘Yandex’


Testing the search engines: Bing likes antiquity; most favour HTML over PHP

21.09.2022

Bing is spidering new pages, as long as they’re very, very old.

Last week, we added a handful of Lucire pages from 1998 and 1999. An explanation is given here. And I’ve spotted at least two of those among Bing’s results when I do a site:lucire.com search.

As a couple of newer pages have also shown up, I doubt there’s any issue with the template, and the home page now appears, too. But, by and large, Bing is Microsoft’s own Wayback Machine, and most of the Lucire results are from the 1990s and early 2000s.

It got me thinking: do the other search engines do this, too? For years, Google grandfathered older pages and they came up earlier. (Meanwhile, searches for my own name still have this site, and the company site, down, having lost first and second when we switched from HTTP to HTTPS in March. Contrary to expert opinion, you don’t recover, at least not quickly.)

As Lucire includes the date of the article in the URL, this should be an easy investigation. We’ll only do the first 50 results as that’s all Bing’s capable of. I’ll try not to include any repeat results out of fairness. ‘Contents’ pages’ include the home page, the Lucire TV and Lucire print shopping pages, and tag and category pages.
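
For anyone wanting to repeat the tally, here is a minimal Python sketch of the counting described above. It is not the method actually used for the figures below. It assumes the year appears as the first four digits of some path segment (for example `/2001/…` or `/20220921/…`), and the function names and sample URLs are made up for illustration. Anything without a recognisable year is lumped in with the contents’ pages.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumption: a dated article has a path segment beginning with a
# four-digit year between 1900 and 2099, e.g. /2001/... or /20220921/...
YEAR_SEGMENT = re.compile(r"^(19\d{2}|20\d{2})")


def year_of(url):
    """Return the year found in the URL path, or None if there is none."""
    for segment in urlparse(url).path.split("/"):
        match = YEAR_SEGMENT.match(segment)
        if match:
            return match.group(1)
    return None


def tally_by_year(urls, limit=50):
    """Count unique results per year within the first `limit` results.

    Returns (counts, repeats): a Counter keyed by year (or "contents"
    for undated pages) and the number of repeated URLs skipped.
    """
    seen = set()
    counts = Counter()
    repeats = 0
    for url in urls[:limit]:
        if url in seen:
            repeats += 1
            continue
        seen.add(url)
        counts[year_of(url) or "contents"] += 1
    return counts, repeats


# Hypothetical sample, not real search results.
sample = [
    "https://lucire.com/",
    "https://lucire.com/2001/example.shtml",
    "https://lucire.com/2001/example.shtml",
    "https://lucire.com/insider/20220921/example/",
]
counts, repeats = tally_by_year(sample)
print(dict(counts), "repeats:", repeats)
```

Dividing the repeat count by the number of results examined gives the share of repeats for an engine.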
 
Bing
Contents’ pages ★★★
1997
1998
1999 ★★★★
2000 ★
2001 ★★★★★★★★
2002 ★★
2003 ★★★
2004 ★★★★
2005 ★★
2006
2007 ★★★
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018 ★
2019 ★
2020
2021
2022
 
Google
Contents’ pages ★★★★★★★★★★★★★
1997
1998
1999
2000
2001
2002 ★★
2003
2004 ★★
2005
2006
2007 ★
2008
2009
2010 ★
2011 ★★★
2012 ★
2013 ★★
2014 ★★★
2015 ★
2016 ★★
2017 ★
2018 ★★★
2019 ★★★
2020 ★★★★★★★
2021 ★
2022 ★★★★
 
Mojeek
Contents’ pages ★★★★★★
1997
1998
1999
2000
2001
2002
2003
2004 ★
2005
2006
2007
2008
2009 ★
2010 ★★
2011 ★★
2012 ★★★
2013 ★★★★
2014 ★★★
2015 ★★★★★
2016 ★★★★★★★
2017 ★★★★★★
2018 ★★★
2019 ★★★★
2020 ★★★
2021
2022
 
Baidu
Contents’ pages ★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★★
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018 ★
2019 ★
2020
2021 ★★★
2022 ★
 
Yandex
Contents’ pages ★★★★★
1997
1998
1999 ★★★★★
2000 ★★★★★★
2001 ★★★
2002 ★★★
2003 ★★★
2004 ★
2005
2006
2007 ★★★★
2008 ★★
2009 ★★
2010 ★★★★
2011 ★★★
2012 ★★
2013 ★
2014 ★★
2015
2016
2017
2018
2019
2020 ★★★
2021 ★
2022
 

To me, that was fascinating. My instincts weren’t wrong with Bing: it’s old and it favours the old (two of the restored articles were indexed). From the first 50 results, 18 were repeats: that’s 36 per cent. I’m of the mind that Bing is so shot that it can only index old pages that don’t take up much space. New ones have a lot more data to them, generally.

Google does a good job with the top-level and second-level contents’ pages, though there were a few strange tag indices. But the distribution is what you’d expect: people would search for more recent stories. I know we had some popular stories from 2002 that still get hit a lot.

Mojeek has a similar distribution, though it should be noted that you can’t do a blanket site: search. There must be a keyword, and in this case it’s Lucire. The 2016 pages form the mode, which I don’t have a huge problem with; it’s better than the 2001 pages, which Bing has over everything else.

Baidu’s one is crazy as individual stories are seldom spat out in the first five pages, the search engine preferring tag indices, though half a dozen later story pages do make it into its top 50.

Finally, Yandex leans toward older pages, too, including our most popular 2002 piece. It’s the 2000 stories it has the most of among the top 50, and there’s a strange empty period between 2015 and 2019. But at least there is a fairer distribution than Bing can muster.

The other query that I had was whether these search engines were biasing their results toward HTML pages, rather than PHP ones. If that’s the case, then it could explain Bing’s preference for the old stuff (Lucire didn’t have PHP pages till 2008; prior to that it was all laboriously hand-coded, albeit within templates).
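
A rough way to automate the HTML-versus-PHP split is sketched below in Python. The classification rule is purely an assumption for illustration: URLs ending in `.html`, `.htm` or `.shtml` are treated as static HTML, and everything else is treated as PHP-generated. That rule would misfile a static home page served at `/`, so such cases would need checking by hand. The function names and sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

# Assumption: static pages can be recognised by these file suffixes.
STATIC_SUFFIXES = (".html", ".htm", ".shtml")


def page_type(url):
    """Label a URL "HTML" if its path has a static suffix, else "PHP"."""
    path = urlparse(url).path.lower()
    return "HTML" if path.endswith(STATIC_SUFFIXES) else "PHP"


def split_by_type(urls):
    """Count how many URLs fall under each label."""
    return Counter(page_type(url) for url in urls)


# Hypothetical sample, not real search results.
sample = [
    "https://lucire.com/2001/example.shtml",
    "https://lucire.com/2004/example.html",
    "https://lucire.com/insider/20220921/example/",
]
print(dict(split_by_type(sample)))
```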
 
Bing
★★★★★★★★★★★★★★★★★★★★★★★★★ HTML
★ PHP
 
Google
β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜… HTML
β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜… PHP
 
Mojeek
β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜… HTML
β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜… PHP
 
Baidu
β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜… HTML
β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜… PHP
 
Yandex
β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜…β˜… HTML
β˜…β˜…β˜…β˜…β˜…β˜… PHP
 

I think we can safely say there’s a preference for HTML over PHP. Mojeek brings up a lot of HTML pages after the top 50, even though this sample shows the split isn’t as severe.

Our PHP pages are less significant though: they contain news stories, and these are often ones other media covered, too. But I would have thought some of the more popular stories would have made the cut, and here it’s Mojeek’s distribution that looks superior to the others’. It seems like it’s actually analysing the page content’s text, which is what you want a search engine to do.

Baidu’s PHP-heaviness is down to all the tag indices, rendering it not particularly helpful as a search engine.

On these two tests, Mojeek and Google rank best, and Yandex comes in third. Baidu and Bing are a distant fourth and fifth.

Posted in China, culture, internet, media, publishing, technology, UK, USA


Testing the seven search engines in the world

22.08.2022

After reading Mojeek’s blog post from last July, I learned there are only seven search engines in the world now. In other words, I was checking more search engines out in the 1990s. It’s rather depressing, especially as the search market is largely a monopoly with Google dominating it (and all the ills that brings), and Bing and its licensees (like Duck Duck Go) with their 6 per cent.

Knowing there are seven, I fed the site:lucire.com search into all of them to see where each stood.

The first figure is the claimed number of results, the second the actual number shown (without repeats removed, which Bing is guilty of).

I can’t use Brave here as its site search is Bing as well.

Yandex appears to be capped at 250 and Mojeek at 1,000, but at least they aren’t arbitrary like Google and Baidu. Baidu has a lot of category and tag pages from the WordPress section of our site to bump up the numbers.
 
Gigablast 0/0
Sogou 19/13
Bing 243/50
Baidu 13,700/213
Yandex 2,000/250
Google 6,280/315
Mojeek 3,654/1,000
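
To put those pairs on a common footing, a small Python sketch can work out what share of each claimed total was actually shown. The figures are the ones in the list above; the dictionary layout and the `shown_share` helper are only illustrative.

```python
# (claimed, shown) pairs copied from the list above.
results = {
    "Gigablast": (0, 0),
    "Sogou": (19, 13),
    "Bing": (243, 50),
    "Baidu": (13700, 213),
    "Yandex": (2000, 250),
    "Google": (6280, 315),
    "Mojeek": (3654, 1000),
}


def shown_share(claimed, shown):
    """Percentage of claimed results actually shown; None if none claimed."""
    if claimed == 0:
        return None
    return 100 * shown / claimed


for engine, (claimed, shown) in results.items():
    share = shown_share(claimed, shown)
    label = "n/a" if share is None else f"{share:.1f}%"
    print(f"{engine}: {shown} of {claimed} shown ({label})")
```

Yandex, for instance, shows 250 of its claimed 2,000, or 12.5 per cent.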
 

Frankly, more of us should go to Mojeek. It can only get better with a wider user base. Unlike Bing, it hasn’t collapsed. I know most of you will keep going to Google, but I just don’t like the look of those limits (not to mention the massive privacy issues).

Mojeek is now at 5,900 million pages, which must be the largest index in the west outside of Google.

Posted in China, internet, publishing, technology, UK, USA


Putting the search engines through their paces

24.07.2022

One more, and I might give the subject a rest. Here I test the search engines for the term Lucire. This paints quite a different picture.

Lucire is an established site, dating from 1997, indexed by all major search engines from the start. The word did not exist online till the site began. It does exist in old Romanian. There is a (not oft-used) Spanish conjugated verb, I believe, spelt the same.

The original site is very well linked online, as you might expect after 25 years. You would normally expect, given its age and the inbound links, to see lucire.com at the top of any index.

There is a Dr Yolande Lucire in Australia whom I know, and whom I’m used to seeing in the search engine results.

The scores are simply for getting sites relevant to us into the top 10, and no judgement is made about their quality or ranking.
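
The scoring rule can be sketched in a few lines of Python. This is an illustration rather than the way the scores were actually tallied: the `relevance_score` function is hypothetical, and the example uses the Brave top 10 from this post, where all but the last two results relate to us.

```python
def relevance_score(results, relevant):
    """Count how many of the first 10 results are in the relevant set.

    A domain that appears twice in the top 10 is counted twice.
    """
    return sum(1 for domain in results[:10] if domain in relevant)


# Brave's top 10 as listed in this post.
brave = [
    "lucire.com",
    "instagram.com",
    "twitter.com",
    "wikipedia.org",
    "linkedin.com",
    "lucire.net",
    "facebook.com",
    "fashion.net",
    "wiktionary.org",
    "nsw.gov.au",
]

# Everything except the dictionary entry and the page about Yolande Lucire.
relevant_to_us = set(brave) - {"wiktionary.org", "nsw.gov.au"}

print(f"{relevance_score(brave, relevant_to_us)}/10")  # prints 8/10
```

Which domains count as relevant remains a manual judgement; the code only does the counting.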
 
Google
lucire.com
twitter.com
lucire.net
instagram.com
wikipedia.org
linkedin.com
facebook.com
pinterest.nz
neighbourly.co.nz
I hate to say it, as someone who dislikes Google, but all of the top 10 results are relevant. Fair play. Then again, with the milliards it has, and with this as its original product, it should do well. 10/10
 
Mojeek
scopalto.com
lucirerouge.com
lucire.net
lucire.com
mujerhoy.com
portalfeminino.com
paperblog.com
dailymotion.com
eldiablovistedezara.net
hispanaglobal.com
Mojeek might be flavour of the month for me, but these results are disappointing. Scopalto retails Lucire in France, so that’s fair enough, but disappointing to see the original lucire.com site in fourth. Fifth, sixth, seventh, ninth and tenth are irrelevant and relate to the Spanish word lucir. You’d have to get to no. 25 to see Lucire again, for Yola’s website. Then it’s more lucir results till no. 52, the personal website of one of our editors. 5/10
 
Swisscows
lucire.net
wikipedia.org
lucire.com
spanishdict.com
lucire.net
lucire.com
drlucire.com
facebook.com
spanishdict.com
viyeshierelucre.com
Considering it sources from Bing, it makes the same mistakes by placing the rarely linked lucire.net up top, and lucire.com in third. Fourth, ninth and tenth are irrelevant, and the last two relate to different words. Yola’s site is seventh, which is fair enough. 6/10
 
Baidu
lucire.net
lucire.com
lucire.cc
lucire.com
kanguowai.com
hhlink.com
vocapp.com
forvo.com
kuwo.cn
lucirehome.com
Interesting mixture here. Strange, too, that lucire.net comes up top. We own lucire.cc but it’s now a forwarding domain (it was once our link shortener, up to a decade ago). Seventh and ninth relate to the Romanian word strălucire and eighth to the Romanian word lucire. The tenth domain is an old one, succeeded a couple of years ago by lucirerouge.com. Not very current, then. 7/10
 
Startpage
lucire.com
lucire.com
lucire.net
instagram.com
wikipedia.org
linkedin.com
facebook.com
pinterest.nz
fashionmodeldirectory.com
twitter.com
All relevant, as expected, since it’s all sourced from Google. 10/10
 
Virtual Mirage
lucire.com
instagram.com
wikipedia.org
lucire.net
facebook.com
linkedin.com
pinterest.nz
lucirerouge.com
nih.gov
twitter.com
I don’t know much about this search engine, since I only heard about it from Holly Jahangiri earlier today. A very good effort, with only the ninth one being irrelevant to us: it’s a paper co-written by Yola. 9/10
 
Yandex
lucire.com
lucire.net
facebook.com
twitter.com
wikipedia.org
instagram.com
wikipedia.eu
pinterest.nz
en-academic.com
wikiru.wiki
This is the Russian version. All are relevant, and they are fairly expected, other than the ninth result, which I’ve not come across this high before, although it still relates to Lucire. 10/10
 
Bing
lucire.net
wikipedia.org
lucire.com
spanishdict.com
lucire.com
facebook.com
drlucire.com
spanishdict.com
twitter.com
lucirahealth.com
How Bing has slipped. There are sites here relating to the Spanish word lucirse and to Lucira, who makes PCR tests for COVID-19. One is for Yola. 7/10
 
Qwant.com
lucire.net
wikipedia.org
spanishdict.com
drlucire.com
spanishdict.com
tumblr.com
lucirahealth.com
lacire.co
amazon.com
lucirahealth.com
For a Bing-licensed site, this is even worse. No surprise to see lucire.com gone here, given how inconsistently Bing has treated it of late. But there are results here for Lucira and a company called La Cire. The Amazon link is also for Lucira. 3/10
 
Qwant.fr
lucire.net
wikipedia.org
reverso.net
luciremen.com
lucire.com
twitter.com
lacire.co
lucirahealth.com
viyeshierelucre.com
lucirahealth.com
The sites change slightly if you use the search box at qwant.fr. The Reverso page is for the Spanish word luciré. Sixth through tenth are irrelevant and do not even relate to the search term. Eleventh and twelfth are for lucire.com and facebook.com, so there were more relevant pages to come. The ranking of relevant results, then, leaves something to be desired. 5/10
 
Duck Duck Go
lucire.com
lucire.net
wikipedia.org
spanishdict.com
drlucire.com
spanishdict.com
lucirahealth.com
amazon.com
lacire.co
luciremen.com
Well, at least the Duck puts lucire.com up top, and the home page at that (even if Bing can’t). Only four relevant results, with Lucire Men coming in at tenth. 4/10
 
Brave
lucire.com
instagram.com
twitter.com
wikipedia.org
linkedin.com
lucire.net
facebook.com
fashion.net
wiktionary.org
nsw.gov.au
For the new entrant, not a bad start. Shame about the smaller index size. All of these relate to us except the last two, one a dictionary and the other referring to Yolande Lucire. 8/10
 

The results from these first results’ pages are surprising.
 
★★★★★★★★★★ Google
★★★★★★★★★★ Yandex
★★★★★★★★★★ Startpage
★★★★★★★★★☆ Virtual Mirage
★★★★★★★★☆☆ Brave
★★★★★★★☆☆☆ Baidu
★★★★★★★☆☆☆ Bing
★★★★★★☆☆☆☆ Swisscows
★★★★★☆☆☆☆☆ Mojeek
★★★★★☆☆☆☆☆ Qwant.fr
★★★★☆☆☆☆☆☆ Duck Duck Go
★★★☆☆☆☆☆☆☆ Qwant.com
 

It doesn’t change my mind about the suitability of Mojeek for internal searches though. It’s still the one with the largest index aside from Google, and it doesn’t track you.

Posted in China, France, internet, publishing, technology, UK, USA