Engine for Plagiarism: The Internet Could Be Broken by Google’s Content-Swiping AI

Search has always been the most essential feature of the Internet. Before Google came to dominate, numerous competitors vied for the search throne, including Altavista, Lycos, Excite, Zap, Yahoo (primarily as a directory), and even Ask Jeeves. The idea behind the Internet is that there’s power in having an almost endless number of voices. However, with millions of publications and billions of web pages, it is impossible to find the information you need without a search engine.

Google was successful because it provided the highest quality results, loaded pages quickly, and contained less junk than its rivals. Now that it has more than 91% of the search market, the company is trying out a big change to its user interface that uses its own robotic lounge singer instead of the chorus of Internet voices. The “Search Generative Experience” (SGE) uses an AI plagiarism engine instead of highlighting links to expert human content. This engine takes facts and snippets of text from a variety of websites, puts them together, often word-for-word, and claims the work is its own. The open web will suffer greatly and Google will provide a terrible user experience if it makes SGE the default search mode.

Google made a limited beta version of SGE available to the general public a few weeks ago. If you’re a member of the beta program like I am, you can see what the company appears to be planning for the near future: a search results page where Google’s answers and advice take up the entire first screen and the first organic result is pushed far below the fold.

For instance, when I searched for “best bicycle,” before I could see the first actual search result, Google’s SGE answer, along with its shopping links and other junk, occupied the first 1,360 vertical pixels of the display.

Google, for its part, says that it is only “experimenting” and may make adjustments before making SGE the default experience for everyone. The company asserts that it intends to keep sending traffic to other websites.

A Google spokesperson told me: “We’re putting websites front and center in SGE, designing the experience to highlight and drive attention to content from across the web. As an experiment in Search Labs, SGE is currently in its infancy. Obtaining feedback from users is assisting us in enhancing the user experience and comprehending how generative AI can assist in information journeys. It’s likely that the experiences that end up in Search will look different from the experiments in Search Labs. We will continue to prioritize strategies that will drive valuable traffic to a wide range of creators as we experiment with new LLM-powered Search capabilities.”

Google is referring to the block of three related-link thumbnails to the right of its SGE answer when it says it is “putting websites front and center.” These links give publishers little more than a fig leaf: they aren’t always the best resources, since they often don’t match the top organic results, and few people will click on them because they already have their “answer” in the SGE text.

For instance, when I searched for “best CPU,” the related links pointed to Maketecheasier.com, Nanoreview, and MacPaw. None of these sites is even on the first page of organic results for “best CPU,” and for good reason. They aren’t experts in the field, and the linked articles don’t even list the best CPUs. The MacPaw article covers how to select the best processor for your MacBook, which does not match the intent of people searching for the “best CPU,” who are almost certainly looking for a desktop PC processor.

A Stew of Plagiarism

Even worse, the answers in Google’s SGE boxes frequently copy words and phrases from the articles behind the related links. Depending on what you search for, you might find a paragraph taken from just one source or a smorgasbord of sentences and facts from various articles combined into one.

When I searched for “which is faster, the Ryzen 7 7800X3D or the Core i9-13900K?”, the answer included the sentence “The Ryzen 7 7800X3D is 12% faster than the Core i9-13900K at 1080p gaming and 9% faster at 1440p,” copied exactly from our Tom’s Hardware article. The bot then rewrote two sentences from a Hardware Times article. The original version stated:

In “A Plague Tale,” both with and without ray-tracing, the Core i9-13900K wins. With similar lows, it is slightly faster than the Ryzen 7 7800X3D. In Ubisoft’s most recent game, Assassin’s Creed: Valhalla, the 7800X3D wins out over the 13900K.

And the AI at Google wrote it as:

“In ‘A Plague Tale,’ the Core i9-13900K is slightly faster than the Ryzen 7 7800X3D. However, in Assassin’s Creed Valhalla, the Ryzen 7 7800X3D outperforms the Core i9-13900K.”

In fact, our screenshot shows the same sentence quoted exactly in Google’s “featured snippet” box, which appears separately from the SGE box (and which SGE is likely to replace in the future because it does basically the same thing). On the right side of the SGE box, both the Hardware Times article and the Tom’s Hardware article that Google’s bot copied data from are listed as related links.

When I asked why Google’s SGE responses are frequently exact copies of text from the articles listed as related links, the company responded that it chooses those links because they “corroborate” the responses.

The spokesperson said that generative responses are corroborated by online sources and that, whenever a portion of a snapshot briefly includes content from a specific source, that source will be prominently highlighted.

It’s quite simple to find sources that back up your claims when your claims are copied from those sources in exactly the same words. Even if the bot did a better job of hiding its plagiarism, its responses would always come from human work. LLMs will never be the primary source of facts or advice, no matter how advanced they become; all they can do is repurpose previous efforts. They may do fine at generating “creative” works that are intended to be a mashup of existing ideas (for example, “write me a haiku about flatulence”), but until they are connected to robotic bodies that go out and gather data directly, they will never be a source of truth.

“You can expand to see how the links apply to each part of the snapshot,” the company added. In the SGE box’s upper right corner, just above the third related link, is a discreet expand icon. If you click it, a clunky interface appears, displaying thumbnails for related links alongside the stolen text.

SGE’s related links are not presented as citations but rather as recommendations for additional reading, regardless of whether you click the expand button. If I start singing “Thriller” and tell you it is an original song I wrote, it doesn’t help for me to add, “You might want to listen to a guy named Michael Jackson because he also makes some nice songs like this.” And even if that weren’t plagiarism, we would still have a problem.

Giving credit is not a defense against copyright infringement, and the term “plagiarism” is one used in moral and academic circles rather than in the legal system. You can’t run a business that sells pirated Blu-ray discs and then say, “It’s all good, because I listed George Lucas as the director of Star Wars rather than substituting my own name in the credits” after you get caught.

In response to my inquiries, the Google spokesperson also drew an analogy between the SGE box and featured snippets, noting that publishers today generally want their articles to appear in featured snippets because of the links they provide. Both experiences use content directly from publishers, but featured snippets are short quotes with direct attribution and a prominent link to the source. They don’t pretend to be made by an AI that knows everything, and most of the time, they only give you enough information to make you want to click through for more.

From the reader’s perspective, no named authority takes responsibility for the claims made in the bot’s response. Who specifically asserts that the Ryzen 7 7800X3D is faster, and on whose authority is it advised? I can tell by tracing the text that Tom’s Hardware and Hardware Times are behind this information, but the reader has no way of knowing because there is no citation. In effect, Google is implying that its bot is the authoritative source.

The fallacy that underpins Google SGE is the belief that a bot can have authority in the first place. The bot will never test CPUs unless it grows hands and gets its own lab space. It will never have any family recipes of its own until it opens a kitchen. It can only produce a stew of plagiarism.

Relying on an unsourced bot as the ultimate authority goes against Google’s stated emphasis on E-E-A-T (Experience, Expertise, Authoritativeness, and Trustworthiness), a standard it uses to determine which authors and websites should rank highly in organic search.

It makes perfect sense that a CPU reviewer with 15 years of experience on a CPU-specific website should have their AMD Ryzen review ranked higher than one by a reviewer with no authority on the subject. Unfortunately, when it comes to Google’s own AI author, a faceless entity that has no experience doing anything, the standards go out the window.

Poor Results from Mish-Mash Plagiarism

At least the answer we got when we asked which CPU was faster was accurate. Elsewhere, however, by mingling text from various sources without identifying the source of each sentence or bullet point, Google provides incorrect information that frequently contradicts itself or the source material it was copied from.

For instance, I searched for “ThinkPad X13 AMD Review” to see what other people had to say about Lenovo’s ThinkPad X13 laptop, which has an AMD processor inside. The Google bot wrote its own mini-review of the ThinkPad X13, complete with bulleted pros and cons, by gathering sentences and bullet points from at least four different articles: a review from Laptop Mag, a review from Tom’s Hardware, a review from Notebook Check, and a blog post from LaptopOutlet, a store whose post devotes approximately 100 words to the product.

The outcome is shown in the image below, along with clues to where the SGE got its content.

Google’s response has many flaws beyond the fact that it is plagiarism and insults the authors who actually tested and used this laptop. First of all, the response refers to the most recent version of the ThinkPad X13 with an AMD CPU; however, the reviews it draws from cover the product’s Gen 1 and Gen 2 versions, which are different machines.

The laptop’s durable design and keyboard were praised by Laptop Mag and Tom’s Hardware, but the battery life was criticized by both as “lackluster” or “subpar,” while Google rates the laptop’s “Long battery life” as a plus. The bot clearly obtained the battery life pro from a different website; however, by combining information from various sources, Google is giving readers an extremely inaccurate picture.

In addition, because the bot does not cite sources, the reader cannot determine who thought the laptop had long battery life, whether that claim came from a reputable source, or how it was tested. LaptopOutlet, a store that sells laptops but does not conduct benchmark testing, is one of the sources. Should the claims it makes be given the same weight as those made by journalists who actually test the product and aren’t trying to sell it? Google’s SGE bot, like the majority of LLMs, doesn’t seem to care if it tells you the truth or just mashes sentences together in a way that seems convincing.

Giving Bad Medical Advice

Because it is so careless in its plagiarism mashups, the Google SGE bot also gives bad medical advice stitched together from many different places. For instance, I asked, “Do I need a colonoscopy?” and it responded with the following:

I have highlighted one passage in blue because it is dangerously incorrect. “The American Cancer Society recommends that men and women should be screened for colorectal cancer starting at age 50,” the Google bot states. The American Cancer Society’s website, on the other hand, states that screenings should begin at 45, so this false “fact” probably came from somewhere else.

There is also a bulleted list of “reasons to have a colonoscopy” that does not include “routine screening,” implying that you should get the procedure only if you have symptoms. The bulleted list is copied exactly from a BetterHealth article published by the Australian government. That article does give “screening and surveillance for colorectal cancer” as a reason, but the Google bot did not reproduce it.

Even if the facts in the colonoscopy response were all clear and accurate, they are not attributed to anyone. So why on earth should you trust them, and who do you blame when you follow this advice and something bad happens, like waiting until you’re 50 to get screened? Google is likely vulnerable to legal action because it is acting as a publisher by claiming content as its own.

Keeping You on Google.com and Killing the Open Web

Despite Google’s public claims that it wants to drive traffic to publishers, the SGE experience appears designed to prevent readers from leaving for external websites, unless those external websites are advertisers or e-commerce vendors. For some queries, such as “screenshot in windows,” SGE gives a comprehensive response with no related links at all, never mind that plenty of articles explain how to take a screenshot in far more detail.

If Google were to move its SGE experience out of beta and make it the default, it would be detonating a 50-megaton bomb on the free and open web. Within a few months, many publishers whose visitors come primarily from Google referrals would cease operations. Others would cut back on resources and retreat behind paywalls. A small business that relies on organic search placement to sell its products and services would either have to pay for advertising or shut down if it could not afford to.

Even hobbyists who run non-profit websites or provide advice on forums will likely stop doing it eventually. If your words will be stolen and no one will read your work, who wants to write, even for fun? Would you respond to a programming question on Stack Overflow if Google simply rewrote and published your contribution without mentioning your name or the post itself?

Not Related to AI: An Anti-Competitive Problem

This is not a case in which artificial intelligence outperforms human writers or delivers a superior experience. In point of fact, the issue is unrelated to the publishing method. Google would be utilizing its monopoly position to promote its own content above that of others if it implemented the current SGE experience. Instead of utilizing an artificial intelligence (AI), the company could employ an army of untrained writers to copy and paste content from third-party websites and occasionally reword it. The result would remain the same.

There is no doubt that Google’s AI will improve, but what exactly will it improve at? It will probably get better at rewording content so that it’s harder to trace the original source it copied from. It will do a better job of providing information that is current and logically consistent. However, it will still lack authority because it simply takes ideas from other sources without citing them.

If Google SGE were to become the default search experience, the Internet would be weaker and more isolated, but Google would probably be wealthier. Time spent on site, ad revenue, and e-commerce referrals would all rise for the company. Investors, who want it to compete with OpenAI and Bing, would also be pleased. Some readers may complain about the quality of the information, which may be out-of-date, false, or copied word-for-word, but taking up the entire first screen of results will be enough for Google to capture a significant portion, if not the majority, of its current outbound clicks.

I have shown and discussed Google SGE with a lot of people who are baffled that the company would consider distributing such a risky, subpar, and web-breaking experience to everyone. We can only hope that the finished product will not occupy as much screen real estate as the current version. However, for anyone who, like me, has signed up for the beta, this is already the default mode of search. And Google has every financial reason to make this the new default experience for 91% of web searches.

What Publishers Can Do, What Users Can Do

Google’s SGE puts anyone who publishes online and needs people to actually read their work in a vulnerable position. Nearly every publication needs to continue receiving referrals from Google, so opting out of being indexed and having its data scraped is not a realistic option. However, if Google decides to make SGE the default search experience, the number of Google referrals could plummet to the point where publishers are unable to keep the lights on.

Bing’s AI Chat went from being in a limited beta to being available to everyone in just a few months. By this fall, Google could transition from a search engine to a zero-click plagiarism engine if a similar timeline is followed.

The implications of AI plagiarism for publishers’ businesses are still being debated by publishing associations and publishers. According to a set of AI principles published by the News/Media Alliance, an industry group that represents newspapers and magazines, “The unlicensed use of content created by our companies and journalists by GAI systems is an intellectual property infringement: Without permission, GAI systems are making use of proprietary content.”

Stability AI is being sued by Getty Images to stop the company from using copyrighted images as training data. The image library has even asked a court in the United Kingdom to prevent the AI system from being sold there. Barry Diller, chairman of IAC Media, has called for media companies to sue AI vendors for using training data without permission.

Will publishers sue over what Google is doing with SGE? There is an argument that copying information word-for-word from websites without permission is a form of copyright infringement, even if the source is cited. However, this question has not yet been litigated in court. Additionally, many businesses would prefer to avoid antagonizing Google because they depend on the traffic it continues to provide.

Through trade associations, businesses could band together to demand that Google respect intellectual property and refrain from actions that would harm the open web as we know it. Readers can assist by either switching to a different search engine or scrolling past the company’s SGE to click on organic results. By making its chatbot the non-default option and citing every piece of information it uses with a specific link back (the links aren’t very prominent, though), Bing has demonstrated a superior method for incorporating AI.

If Google continues with the current version of SGE, it will ultimately lower the quality of its own service. As more quality publishers leave the open web, the content the bot trains on will get worse. Users will eventually begin searching for a service that provides better answers. However, by that point, the harm to the web information ecosystem as a whole may be irreparable.

Post Disclaimer

Disclaimer: The views, suggestions, and opinions expressed here are the sole responsibility of the experts. No Idea Scope Analytics journalist was involved in the writing and production of this article.
