AI-Driven Search

I just dusted off the login here to realize I hadn't posted in about a half-year & figured it was time to write another one. ;)

Yandex Source Code Leak

Some of Yandex's old source code was leaked, and few cared about the ranking factors shared in the leak.

Mike King made a series of Tweets on the leak.

The signals used for ranking included things like link age

and user click data including visit frequency and dwell time

Google came from behind and was eating Yandex's lunch in search in Russia, particularly by leveraging search default bundling in Android. The Russian antitrust regulator nixed that, and once it was nixed Yandex regained strength. Of course the war in Ukraine has made everything crazy in terms of geopolitics. That's one reason almost nobody cared about the Yandex data leak. The other reason is that few could probably make sense of what all the signals are or how to influence them.

The complexity of search - when it is a big black box which has big swings 3 or 4 times a year - shifts any successful long term online publishers away from being overly focused on information retrieval and ranking algorithms to focus on the other aspects of publishing which will hopefully paper over SEO issues. Signs of a successful & sustainable website include:

  • It remains operational even if a major traffic source goes away.
  • People actively seek it out.
  • If a major traffic source cuts its distribution people notice & expend more effort to seek it out.

As black box as search is today, it is only going to get worse in the coming years.

ChatGPT Hype

The hype surrounding ChatGPT is hard to miss. Fastest growing user base. Bing integration. A sitting judge using the software to help write documents for the court. And, of course, the get-rich-quick crew is out in full force.

Some enterprising people with specific professional licenses may be able to mint money for a window of time

but for most people the way to make money with AI will be doing something that AI can not replicate.

Bing Integration of OpenAI Technology

The New Bing integrated OpenAI's ChatGPT technology to allow chat-based search sessions which ingest web content and use it to create something new, giving users direct answers and allowing re-probing for refinements. Microsoft stated the AI features also improved their core rankings outside of the chat model: "Applying AI to core search algorithm. We’ve also applied the AI model to our core Bing search ranking engine, which led to the largest jump in relevance in two decades. With this AI model, even basic search queries are more accurate and more relevant."

Fawning Coverage

Some of the tech analysis around the AI algorithms is more than a bit absurd. Consider this passage:

the information users input into the system serves as a way to improve the product. Each query serves as a form of feedback. For instance, each ChatGPT answer includes thumbs up and thumbs down buttons. A popup window prompts users to write down the “ideal answer,” helping the software learn from its mistakes.

A long time ago the Google Toolbar had a smiley face and a frown face on it. The signal there was basically pure spam. At one point Matt Cutts mentioned Google would look at things that got a lot of upvotes to see how else they were spamming. Direct Hit was also spammed into oblivion many years before that.

In some ways the current AI search stuff is trying to re-create Ask Jeeves, but Ask had already lost to Google long ago. The other thing AI search is similar to is voice assistant search. Maybe the voice assistant search stuff which has largely failed will get a new wave of innovation, but the current AI search stuff is simply a text interface of the voice search stuff with a rewrite of the content.

High Confidence, But Often Wrong

There are two other big issues with correcting an oracle.

  • You'll lose your trust in an oracle when you repeatedly have to correct it.
  • If you know the oracle is awful in your narrow niche of expertise you probably won't trust it on important issues elsewhere.

Beyond those issues there is the concept of blame or fault. When a search engine returns a menu of options if you pick something that doesn't work you'll probably blame yourself. Whereas if there is only a single answer you'll lay blame on the oracle. In the answer set you'll get a mix of great answers, spam, advocacy, confirmation bias, politically correct censorship, & a backward looking consensus...but you'll get only a single answer at a time & have to know enough background & have enough topical expertise to try to categorize it & understand the parts that were left out.

We are making it easier and cheaper to use software to re-represent existing works, at the same time we are attaching onerous legal liabilities to building something new.

Creating A Fuzzy JPEG

This New Yorker article did a good job explaining the concept of lossy compression:

"The fact that Xerox photocopiers use a lossy compression format instead of a lossless one isn’t, in itself, a problem. The problem is that the photocopiers were degrading the image in a subtle way, in which the compression artifacts weren’t immediately recognizable. If the photocopier simply produced blurry printouts, everyone would know that they weren’t accurate reproductions of the originals. What led to problems was the fact that the photocopier was producing numbers that were readable but incorrect; it made the copies seem accurate when they weren’t. ... If you ask GPT-3 (the large-language model that ChatGPT was built from) to add or subtract a pair of numbers, it almost always responds with the correct answer when the numbers have only two digits. But its accuracy worsens significantly with larger numbers, falling to ten per cent when the numbers have five digits. Most of the correct answers that GPT-3 gives are not found on the Web—there aren’t many Web pages that contain the text “245 + 821,” for example—so it’s not engaged in simple memorization. But, despite ingesting a vast amount of information, it hasn’t been able to derive the principles of arithmetic, either. A close examination of GPT-3’s incorrect answers suggests that it doesn’t carry the “1” when performing arithmetic."

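To make the "doesn't carry the 1" failure concrete, here is a minimal Python sketch of column-by-column addition that throws away every carry. It is only an illustration of the error pattern the quote describes, not a claim about how GPT-3 works internally, and the function name `add_without_carry` is made up for this example.

```python
def add_without_carry(a: int, b: int) -> int:
    """Add two non-negative integers column by column, discarding every carry."""
    result, place = 0, 1
    while a > 0 or b > 0:
        # Keep only the units digit of each column sum; the carry is dropped.
        column = (a % 10 + b % 10) % 10
        result += column * place
        a, b, place = a // 10, b // 10, place * 10
    return result


# No column overflows, so the shortcut happens to match real addition.
print(12 + 34, add_without_carry(12, 34))
# The units column overflows (7 + 8 = 15), so the answer is readable but wrong.
print(47 + 38, add_without_carry(47, 38))
```

The wrong answer still looks like a plausible number, which is the same problem as the photocopier: the output is readable, so the error is easy to miss.
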
Exciting New Content Farms

Ted Chiang then goes on to explain the punchline ... we are hyping up eHow 2.0:

Even if it is possible to restrict large language models from engaging in fabrication, should we use them to generate Web content? This would make sense only if our goal is to repackage information that’s already available on the Web. Some companies exist to do just that—we usually call them content mills. Perhaps the blurriness of large language models will be useful to them, as a way of avoiding copyright infringement. Generally speaking, though, I’d say that anything that’s good for content mills is not good for people searching for information. The rise of this type of repackaging is what makes it harder for us to find what we’re looking for online right now; the more that text generated by large language models gets published on the Web, the more the Web becomes a blurrier version of itself.

The same New Yorker article mentioned the concept that if the AI was great it should trust its own output as input for making new versions of its own algorithms, but how could it score itself against itself when its own flaws are embedded recursively in layers throughout algorithmic iteration without any source labeling?

Testing on your training data is considered a cardinal error in machine learning. Using prior output as an input creates similar problems.
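A toy illustration of why scoring a model on its own training data is misleading. This is a minimal Python sketch under stated assumptions: the labels are coin flips (so there is nothing real to learn), the "model" is a 1-nearest-neighbor memorizer, and the names `nearest_label` and `accuracy` are invented for this example.

```python
import random


def nearest_label(point, memory):
    """Return the label of the memorized example closest to `point`."""
    closest = min(memory, key=lambda example: abs(example[0] - point))
    return closest[1]


def accuracy(examples, memory):
    """Share of `examples` whose label matches the memorizer's prediction."""
    hits = sum(1 for x, y in examples if nearest_label(x, memory) == y)
    return hits / len(examples)


rng = random.Random(0)
# Each example is (feature, label); the label is a coin flip unrelated to the feature.
data = [(rng.random(), rng.choice([0, 1])) for _ in range(400)]
train, held_out = data[:200], data[200:]

train_acc = accuracy(train, train)        # scored on what it memorized
held_out_acc = accuracy(held_out, train)  # scored on data it never saw

print(f"accuracy on training data: {train_acc:.2f}")
print(f"accuracy on held-out data: {held_out_acc:.2f}")
```

The memorizer looks perfect when graded against its own training set and falls back to roughly coin-flip accuracy on unseen data. A model graded against its own prior output has the same blind spot.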

Each time AI eats a layer of the value chain it leaves holes in the ecosystem, where the primary solution is to pay for what was once free. Even the "buy nothing" movements have a commercial goal worth fighting over.

As AI offers celebrity voices, impersonates friends, tracks people, automates marketing, and creates deep fake celebrity-like content, it will move more of social media away from ad revenue over to a subscription-based model. Twitter's default "for you" tab will only recommend content from paying subscribers. People will subscribe to and pay for a confirmation bias they know (even - or especially - if it is not approved within the state-preferred set of biases), provided there is a person & a personality associated with it. They'll also want any conversations with AI agents to remain private.

When the AI stuff was a ragtag startup with little to lose the label "open" was important to draw interest. As commercial prospects improved with the launch of GPT-4 they shifted away from the "open," explaining the need for secrecy for both safety and competitive reasons. Much of the wow factor in generative AI is in recycling something while dropping the source to make something appear new while being anything but. And then the first big money number is the justification for further investments in add ons & competitors.

Google's AI Strategy

Google fast followed Bing's news with a vaporware announcement of Bard. Some are analyzing Google letting someone else go first as being a sign Google is behind the times and is getting caught out by an upstart.

Google bought DeepMind in 2014 for around $600 million. They've long believed in AI technology, and clearly lead the category, but they haven't been using it to re-represent third party content in the SERPs to the degree Microsoft is now doing in Bing.

My view is Google had to let someone else go first in order to defuse any associated antitrust heat. "Hey, we are just competing, and are trying to stay relevant to change with changing consumer expectations" is an easier sell when someone else goes first. One could argue the piss poor reception to the Bard announcement is actually good for Google in the long term as it makes them look like they have stronger competition than they do, rather than being a series of overlapping monopoly market positions (in search, web browser, web analytics, mobile operating system, display ads, etc.)

Google may well have major cultural problems, but "They are all the natural consequences of having a money-printing machine called “Ads” that has kept growing relentlessly every year, hiding all other sins. (1) no mission, (2) no urgency, (3) delusions of exceptionalism, (4) mismanagement," though Google is not far behind in AI. Look at how fast they opened up Bard to end users.

AI = Money / Increased Market Cap

The capital markets are the scorecard for capitalism. It is hard to miss how much the market loved the Bing news for Microsoft & how bad the news was for Google.

Millions Suddenly Excited About Bing

In a couple days over a million people signed up to join a Bing wait list.

Your Margin is My Opportunity

Microsoft is pitching this as a margin compression play for Google

that may also impact their TAC spend

ChatGPT costs around a couple cents per conversation: "Sam, you mentioned in a tweet that ChatGPT is extremely expensive on the order of pennies per query, which is an astronomical cost in tech. SA: Per conversation, not per query."

The other side of potential margin compression comes from requiring additional computing power to deliver results:

Our sources indicate that Google runs ~320,000 search queries per second. Compare this to Google’s Search business segment, which saw revenue of $162.45 billion in 2022, and you get to an average revenue per query of 1.61 cents. From here, Google has to pay for a tremendous amount of overhead from compute and networking for searches, advertising, web crawling, model development, employees, etc. A noteworthy line item in Google’s cost structure is that they paid in the neighborhood of ~$20B to be the default search engine on Apple’s products.
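The per-query figure in that passage can be reproduced with back-of-envelope arithmetic. A minimal Python sketch, assuming the quoted rate of ~320,000 queries per second holds all year and a 365-day year (both are simplifications, and the variable names are mine):

```python
QUERIES_PER_SECOND = 320_000            # rate quoted in the passage above
SEARCH_REVENUE_USD = 162.45e9           # 2022 Search revenue quoted above
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # assumes a non-leap year

queries_per_year = QUERIES_PER_SECOND * SECONDS_PER_YEAR
revenue_per_query_cents = SEARCH_REVENUE_USD / queries_per_year * 100

print(f"{queries_per_year:.3e} queries per year")
print(f"{revenue_per_query_cents:.2f} cents of revenue per query")
```

That works out to roughly 10 trillion queries a year and about 1.61 cents of revenue per query, which is why an AI response costing even a fraction of a cent matters at this scale.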

Beyond offering a conversational interface, Bing is also integrating AI content directly in their search results on some search queries. It goes *BELOW* all the ads & *ABOVE* the organic results.

The above sort of visual separator eye candy has historically had a net effect of shifting click distributions away from organics toward the ads. It is why Google features "people also ask" and similar in their search results.

AI is the New Crypto

Microsoft is pitching that even when AI is wrong it can offer "usefully" wrong answers. And a lot of the "useful" wrong stuff can also be harmful: "there are a ton of very real ways in which this technology can be used for harm. Just a few: Generating spam, Automated romance scams, Trolling and hate speech, Fake news and disinformation, Automated radicalization (I worry about this one a lot)"

"I knew I had just seen the most important advance in technology since the graphical user interface. This inspired me to think about all the things that AI can achieve in the next five to 10 years. The development of AI is as fundamental as the creation of the microprocessor, the personal computer, the Internet, and the mobile phone. It will change the way people work, learn, travel, get health care, and communicate with each other. Entire industries will reorient around it. Businesses will distinguish themselves by how well they use it." - Bill Gates

Since AI is the new crypto, everyone is integrating it, if only in press release format, while banks ban it. All of Microsoft's consumer-facing & business-facing products are getting integrations. Google is treating AI as the new Google+.

Remember all the hype around STEM? If only we can churn out more programmers? Learn to code!

Well, how does that work out if the following is true?

"The world now realizes that maybe human language is a perfectly good computer programming language, and that we've democratized computer programming for everyone, almost anyone who could explain in human language a particular task to be performed." - Nvidia CEO Jensen Huang

AI is now all over Windows. And for a cherry on top of the hype cycle:

A gradual transition gives people, policymakers, and institutions time to understand what’s happening, personally experience the benefits and downsides of these systems, adapt our economy, and to put regulation in place. It also allows for society and AI to co-evolve, and for people collectively to figure out what they want while the stakes are relatively low.

We believe that democratized access will also lead to more and better research, decentralized power, more benefits, and a broader set of people contributing new ideas. As our systems get closer to AGI, we are becoming increasingly cautious with the creation and deployment of our models.

We have a nonprofit that governs us and lets us operate for the good of humanity (and can override any for-profit interests), including letting us do things like cancel our equity obligations to shareholders if needed for safety and sponsor the world’s most comprehensive UBI experiment.

Algorithmic Publishing

The algorithms that allow dirt cheap quick rewrites won't be used just by search engines re-representing publisher content, but also by publishers to churn out bulk content on the cheap.

After Red Ventures acquired CNET they started publishing AI content. The series of tech articles covering that AI content lasted about a month and only ended recently. In the past it was the sort of coverage which would have led to a manual penalty, but with the current antitrust heat Google can't really afford to rock the boat & prove their market power that way. In fact, Google's editorial stance is now such that Red Ventures can do journalist layoffs in close proximity to that AI PR blunder.

Men's Journal also had AI content problems.

AI content poured into a trusted brand monetizes the existing brand equity until people (and algorithms) learn not to trust the brands that have been monetized that way.

A funny sidebar here is the original farmer update that aimed at eHow skipped hitting eHow because so many journalists were writing about how horrible eHow was. These collective efforts to find the best of the worst of eHow & constantly writing about it made eHow look like a legitimately sought after branded destination. Google only downranked eHow after collecting end user data on a toolbar where angry journalists facing less secure job prospects could vote to nuke eHow, thus creating the "signal" that eHow rankings deserve to be torched. Demand Media's Livestrong ranked well far longer than eHow did.


The process of pouring low cost backfill into a trusted masthead is the general evolution of online media ecosystems:

This strategy meant that it became progressively harder for shoppers to find things anywhere except Amazon, which meant that they only searched on Amazon, which meant that sellers had to sell on Amazon. That's when Amazon started to harvest the surplus from its business customers and send it to Amazon's shareholders. Today, Marketplace sellers are handing 45%+ of the sale price to Amazon in junk fees. The company's $31b "advertising" program is really a payola scheme that pits sellers against each other, forcing them to bid on the chance to be at the top of your search. ... once those publications were dependent on Facebook for their traffic, it dialed down their traffic. First, it choked off traffic to publications that used Facebook to run excerpts with links to their own sites, as a way of driving publications into supplying fulltext feeds inside Facebook's walled garden. This made publications truly dependent on Facebook – their readers no longer visited the publications' websites, they just tuned into them on Facebook. The publications were hostage to those readers, who were hostage to each other. Facebook stopped showing readers the articles publications ran, tuning The Algorithm to suppress posts from publications unless they paid to "boost" their articles to the readers who had explicitly subscribed to them and asked Facebook to put them in their feeds. ... "Monetize" is a terrible word that tacitly admits that there is no such thing as an "Attention Economy." You can't use attention as a medium of exchange. You can't use it as a store of value. You can't use it as a unit of account. Attention is like cryptocurrency: a worthless token that is only valuable to the extent that you can trick or coerce someone into parting with "fiat" currency in exchange for it. You have to "monetize" it – that is, you have to exchange the fake money for real money. ... 
Even with that foundational understanding of enshittification, Google has been unable to resist its siren song. Today's Google results are an increasingly useless morass of self-preferencing links to its own products, ads for products that aren't good enough to float to the top of the list on its own, and parasitic SEO junk piggybacking on the former.

Bing finally won a PR battle against Google, yet Microsoft is shooting itself in the foot by undermining the magic & imagination of the narrative: pushing stricter chat limits, increasing search API fees, testing ads in the AI search results, and threatening to cut off search syndication partners if the index is used to feed AI chatbots.

The enshittification concept feels more like a universal law than a theory.

When Yahoo, Twitter & Facebook underperform and the biggest winners like Google, Microsoft, and Amazon are doing big layoff rounds, everyone is getting squeezed.

AI rewrites accelerate the squeeze:

"When WIRED asked the Bing chatbot about the best dog beds according to The New York Times product review site Wirecutter, which is behind a metered paywall, it quickly reeled off the publication’s top three picks, with brief descriptions for each." ... "OpenAI is not known to have paid to license all that content, though it has licensed images from the stock image library Shutterstock to provide training data for its work on generating images."

The above is what Paul Kedrosky was talking about when he wrote of AI rewrites in search being a Tragedy of the Commons problem.

A parallel problem is the increased cost of getting your science fiction short story read when magazines shut down submissions due to a rash of AI-spam submissions:

The rise of AI-powered chatbots is wreaking havoc on the literary world. Sci-fi publication Clarkesworld Magazine is temporarily suspending short story submissions, citing a surge in people using AI chatbots to “plagiarize” their writing.

The magazine announced the suspension days after Clarkesworld editor Neil Clarke warned about AI-written works posing a threat to the entire short-story ecosystem.

Warnings Serving As Strategy Maps

"He who fights with monsters might take care lest he thereby become a monster. And if you gaze for long into an abyss, the abyss gazes also into you." - Nietzsche

Going full circle here, early Google warned against ad-driven search engines, then Google became the largest ad play in the world. Similarly ...

Elon wants to create a non-woke AI, but he'll still have some free speech issues.

Over time more of the web will be "good enough" rewrites, and the JPEG will keep getting fuzzier:

"This new generation of chat-based search engines are better described as “answer engines” that can, in a sense, “show their work” by giving links to the webpages they deliver and summarize. But for an answer engine to have real utility, we’re going to have to trust it enough, most of the time, that we accept those answers at face value. ... The greater concentration of power is all the more important because this technology is both incredibly powerful and inherently flawed: it has a tendency to confidently deliver incorrect information. This means that step one in making this technology mainstream is building it, and step two is minimizing the variety and number of mistakes it inevitably makes. Trust in AI, in other words, will become the new moat that big technology companies will fight to defend. Lose the user’s trust often enough, and they might abandon your product. For example: In November, Meta made available to the public an AI chat-based search engine for scientific knowledge called Galactica. Perhaps it was in part the engine’s target audience—scientists—but the incorrect answers it sometimes offered inspired such withering criticism that Meta shut down public access to it after just three days, said Meta chief AI scientist Yann LeCun in a recent talk."

Check out the sentence Google chose to bold here:

As the economy becomes increasingly digital the AI algorithms have deep implications across the economy. Things like voice rights, knock offs, virtual re-representations, source attribution, copyright of input, copyright of output, and similar are obvious. But how far do we allow algorithms to track a person's character flaws and exploit them? Horse racing ads that follow a gambling addict around the web, or a girl with anorexia who keeps clicking on weight loss ads.

One of the biggest use cases for paid AI chatbots so far is fantasy sexting. It is far easier to program a lovebot filled with confirmation bias than it is to improve oneself. Digital soma.

When AI is connected directly to the Internet and automates away many white collar jobs, what comes next? As AI does everything for you, do the profit margins shift from core product sales to hidden junk fees (e.g. ticket scalper marketplaces or ordering flowers for Mother's Day where you get charged separately for shipping, handling, care, weekend shipping, Sunday shipping, holiday shipping)?

"LLMs aren’t just the biggest change since social, mobile, or cloud–they’re the biggest thing since the World Wide Web. And on the coding front, they’re the biggest thing since IDEs and Stack Overflow, and may well eclipse them both. But most of the engineers I personally know are sort of squinting at it and thinking, “Is this another crypto?” Even the devs at Sourcegraph are skeptical. I mean, what engineer isn’t. Being skeptical is a survival skill. ... The punchline, and it’s honestly one of the hardest things to explain, so I’m going the faith-based route today, is that all the winners in the AI space will have data moats." - Steve Yegge

Monopoly Bundling

The thing that makes the AI algorithms particularly dangerous is not just that they are often wrong while appearing high-confidence, it is that they are tied to monopoly platforms which impact so many other layers of the economy. If Google pays Apple billions to be the default search provider on iPhone any error in the AI on a particular topic will hit a whole lot of people on Android & Apple devices until the problem becomes a media issue & gets fixed.

The analogy here would be if Coca-Cola was poisoned and the same company also poured the Pepsi products.

These cloud platforms also want to help retailers manage in-store inventory:

Google Cloud said Friday its algorithm can recognize and analyze the availability of consumer packaged goods products on shelves from videos and images provided by the retailer’s own ceiling-mounted cameras, camera-equipped self-driving robots or store associates. The tool, which is now in preview, will become broadly available in the coming months, it said. ... Walmart Inc. notably ended its effort to use roving robots in store aisles to keep track of its inventory in 2020 because it found different, sometimes simpler solutions that proved just as useful, said people familiar with the situation.

Microsoft has a browser extension for adding coupons to website checkouts. Google is also adding coupon features to their SERPs.

Every ad network can use any OS, email, or web browser hooks to try to reset user defaults & suck users into that particular ecosystem.

AI Boundaries

Generative AI algorithms will always have a bias toward being backward looking, as they can only recreate content based on other ingested content that has gone through some editorial process. AI will also overemphasize the recent past, as more dated cultural references can represent an unneeded risk & most forms of spam will target things that are sought after today. Algorithmic publishing will lead to more content created each day.

From a risk perspective it makes sense for AI algorithms to promote consensus views while omitting or understating the fringe. Promoting fringe views represents risk. Promoting consensus does not.

Each AI algorithm has limits & boundaries, with humans controlling where they are set. Injection attacks can help explore some of the boundaries, but each hole gets patched until the boundaries are probed again.

Boundaries will often be set by changing political winds:

"The tech giant plans to release a series of short videos highlighting the techniques common to many misleading claims. The videos will appear as advertisements on platforms like Facebook, YouTube or TikTok in Germany. A similar campaign in India is also in the works. It’s an approach called prebunking, which involves teaching people how to spot false claims before they encounter them. The strategy is gaining support among researchers and tech companies. ... When catalyzed by algorithms, misleading claims can discourage people from getting vaccines, spread authoritarian propaganda, foment distrust in democratic institutions and spur violence."

Stating facts about population subgroups will be limited in some ways to minimize perceived racism, sexism, or other fringe fake victim group benefits fund flows. Never trust Marxists who own multiple mansions.

At the same time individual journalists can drop napalm on any person who shares too many politically incorrect facts.

Some things are quickly labeled or debunked. Other things are blown out of proportion to scare and manipulate people:

Dr. Ioannidis et al. found that across 31 national seroprevalence studies in the pre-vaccine era, the median IFR was 0.0003% at 0-19 years, 0.003% at 20-29 years, 0.011% at 30-39 years, 0.035% at 40-49 years, 0.129% at 50-59 years, and 0.501% at 60-69 years. This comes out to 0.035% for those aged 0-59 and 0.095% for those aged 0-69.

The covid response cycle sacrificed childhood development (and small businesses) to offer fake protections to unhealthy elderly people (and bountiful subsidies to large "essential" corporations).

‘Civilisation and barbarism are not different kinds of society. They are found – intertwined – whenever human beings come together.’ This is true whether the civilisation be Aztec or Covidian. A future historian may compare the superstition of the Aztec to those of the Covidian. The ridiculous masks, the ineffective lockdowns, the cult-like obedience to authority. It’s almost too perfect that Aztec nobility identified themselves by walking with a flower held under the nose.

A lot of children had their childhoods destroyed by the idiotic lockdowns. And a lot of those children are now destroying the lives of other children:

In the U.S., homicides committed by juveniles acting alone rose 30% in 2020 from a year earlier, while those committed by multiple juveniles increased 66%. The number of killings committed by children under 14 was the highest in two decades, according to the most recent federal data.

Now we get to pile inflation and job insecurity on top of those headwinds to see more violence.

The developmental damage (school closed, stressed out parents, hidden faces, less robust immune systems, limited social development) is hard to overstate:

The problem with this is that the harm of performative art in this regard is not speculative, particularly in young children where language development is occurring and we know a huge percentage of said learning comes from facial expressions which of course a mask prevents from being seen. Every single person involved in this must face criminal sanction and prison for the deliberate harm inflicted upon young children without any evidence of benefit to anyone. When the harm is obvious and clear but the benefit dubious proceeding with a given action is both stupid and criminal.

Some entities will claim their own statements are conspiracy theory, even when directly quoted:

“If Russia invades . . . there will be no longer a Nord Stream 2. We will bring an end to it.” - President Joseph R. Biden

In an age of deep fakes, confirmation bias driven fast social shares (filter bubble), legal threats, increased authenticity of impersonation technology, AI algorithms which sort & rewrite media, & secret censorship programs ... who do you trust? How are people informed when nation states offer free global internet access with a thumb on the scale of truth, even as aggregators block access to certain sources demanding payments?

Lab leaks sure sound a lot like an outbreak of chocolatey goodness in Hershey, PA!

"The fact that protesters could be at once both the victims and perpetrators of misinformation simply shows how pernicious misinformation is in modern society." - Canadian Justice Paul Rouleau

What is freedom?

By 2016, however, the WEF types who’d grown used to skiing at Davos unmolested and cheering on from Manhattan penthouses those thrilling electoral face-offs between one Yale Bonesman and another suddenly had to deal with — political unrest? Occupy Wall Street was one thing. That could have been over with one blast of the hose. But Trump? Brexit? Catalan independence? These were the types of problems you read about in places like Albania or Myanmar. It couldn’t be countenanced in London or New York, not for a moment. Nobody wanted elections with real stakes, yet suddenly the vote was not only consequential again, but “often existentially so,” as American Enterprise Institute fellow Dalibor Rohac sighed. So a new P.R. campaign was born, selling a generation of upper-class kids on the idea of freedom as a stalking-horse for race hatred, ignorance, piles, and every other bad thing a person of means can imagine