
September 5, 2024

Do LLMs store what they’re trained on? A reason to think not.

If you ask chatGPT to give you the first paragraph of Moby-Dick, it gets it exactly right. This and other such examples are sometimes taken as evidence that it has kept a copy of that book in memory, a potential copyright violation. The makers of these Large Language Models say they don’t, or at least I haven’t come across any who cop to it.

My understanding is that they analyze the texts they’re trained on looking for statistical relationships, keep the stats, and ditch the texts. (In this, they’re similar to how search engines compile their indexes.)
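Here’s a toy sketch of that idea (my own illustration, not how any actual LLM or search engine works): keep only word-pair counts, discard the texts entirely, and an oft-repeated phrase can still come back verbatim.

```python
from collections import defaultdict, Counter

def train_bigrams(corpus):
    """Count which word follows which; the texts themselves are discarded."""
    stats = defaultdict(Counter)
    for text in corpus:
        words = text.split()
        for a, b in zip(words, words[1:]):
            stats[a][b] += 1
    return stats

def continue_from(stats, word, n):
    """Greedily emit the most frequent next word, n times."""
    out = [word]
    for _ in range(n):
        most_common = stats[out[-1]].most_common(1)
        if not most_common:
            break
        out.append(most_common[0][0])
    return " ".join(out)

# An oft-quoted opening dominates the counts over a rarer phrase
corpus = ["call me ishmael some years ago"] * 50 + ["call me maybe"]
stats = train_bigrams(corpus)
print(continue_from(stats, "call", 5))  # "call me ishmael some years ago"
```

The counts alone are enough to regurgitate the frequently repeated phrase, even though no copy of any text was retained. Whether that mechanism scales up to what real LLMs do is exactly the question at issue here.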

But my understanding is quite limited. So, I poked around.

PROMPT: What is the first paragraph of Moby-Dick, verbatim?

CHATGPT:

Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people’s hats off—then, I account it high time to get to sea as soon as I can

Perfect.

PROMPT: What is the first paragraph of David Weinberger’s book Everything Is Miscellaneous?

CHATGPT:

The first paragraph of Everything is Miscellaneous by David Weinberger reads:

“If you keep your music in iTunes or your photos in Flickr, you’re using an application that lets you put a piece of information in more than one category at a time. That’s weird. In the real world—where our species evolved—everything has its place. Our books are on one shelf at a time, our plates are in one cupboard, and our shoes are by the door or in the closet. That’s the way it’s supposed to be because that’s the way the world is. Except, of course, it’s not.”

This opening sets the stage for Weinberger’s exploration of how digital organization challenges traditional, physical-world categorization.

That’s not a bad intro to the book’s themes: it picks up some of the examples in the book, and it is certainly reminiscent of my style of writing. But it has nothing to do with the first paragraph of the first chapter:

Before the Web, browsing was usually a polite way of telling a salesperson to buzz off. “May I help you?”, a salesperson asks. “I’m just browsing,” you reply with a little smile. With that word, a customer declares a lack of commitment. With that smile, she asserts that she’s within her rights: “Just try and stop me, salesboy!”

I’m slightly embarrassed by that, but the point is that the real first paragraph doesn’t talk about anything mentioned in chatGPT’s attempt to quote it.

Further, while chatGPT can quote the opening of Alice’s Adventures in Wonderland, if you ask it to quote the opening of Carroll’s lesser-known Sylvie and Bruno, it responds:

One winter night, at a little town called Elveston, a telegram was received by the Sub-Warden, who sat sipping his coffee, and wondering lazily whether it was worth while to finish his cigar, or whether he would take a stroll down to the Club, and see if Captain Muriel had returned yet.

Here’s the actual opening:

—and then all the people cheered again, and one man, who was more excited than the rest, flung his hat high into the air, and shouted (as well as I could make out) “Who roar for the Sub-Warden?” Everybody roared, but whether it was for the Sub-Warden, or not, did not clearly appear: some were shouting “Bread!” and some “Taxes!”, but no one seemed to know what it was they really wanted.

The phrase “one winter” doesn’t appear anywhere in the original. Nor does “a little town called Elveston”. 

Finally, if you repeat the prompt, it gives substantially different responses, including this one, unfortunately smutty these days:

‘Less bread!’ cried the poor. I looked again, and saw it was not bread they had needed, but more head.

It seems to unqualified me that this bolsters the claims of the LLM makers that they don’t keep copies of their training materials (or at least don’t make them accessible to the model in operation), and that the training-time statistical analysis of texts that quote other texts, plus the magic of transformer technology, is sufficient to explain how they can provide verbatim quotes of oft-cited works.

Am I wrong? It’s highly likely.

Tweet
Follow me

Categories: ai, copyright, machine learning, tech Tagged with: ai • chatgpt • copyright • llm • ml • scale Date: September 5th, 2024 dw

1 Comment »

March 27, 2024

chatGPT knows its letters

A friend on a mailing list pointed to the massive palindromic story “2002” by Nick Montfort and William Gillespie, posted on Feb. 20, 2002, AKA 20-02-2002. It’s 2002 words long. And while it’s not exactly a great read, it is an amazing accomplishment.

So, I asked chatGPT to confirm that “2002” is an accurate palindrome. It confirmed it.

Then I asked it to write a palindromic story, saying it could be quite short. It came up with:

"Eva, can I see bees in a cave?"

When I asked it to write the next palindromic sentence, it came up with:

“Eva, can I see bees in a cave? Eva, can I see bees in a cave?”

which it acknowledged is maybe too simple a way out:

This repetition isn’t the most creative continuation, but it maintains the palindromic structure. Creating a palindromic story with varied content while preserving the integrity of the palindrome is an intricate endeavor, usually resulting in very simple and repetitive narratives. The key challenge lies in the reversal of letters and sounds, which typically leads to mirrored sentences rather than continued narrative development.

This was interesting to me, because the Large Language Models that produce chat AI applications like chatGPT turn the words they’re being trained on into tokens, which are unique numeric labels for each word, but also for strings of words, parts of words, punctuation, etc. So, when chatGPT is processing a prompt that contains the word “bees”, it recognizes that as a particular token number. But you need to know about the actual letters, not just the token number, to construct a palindrome. So what’s going on?
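To make the puzzle concrete, here’s a toy sketch. The vocabulary and token IDs below are invented for illustration (real tokenizers, like GPT’s byte-pair encoder, assign IDs to learned subword chunks): the sentence is a palindrome at the letter level, but the token-ID sequence a model actually sees is not mirrored at all.

```python
def is_letter_palindrome(s):
    """Check palindromicity on letters only, ignoring case and punctuation."""
    letters = [c.lower() for c in s if c.isalpha()]
    return letters == letters[::-1]

# Hypothetical token vocabulary with made-up IDs
vocab = {"Eva": 101, ",": 5, " can": 202, " I": 7, " see": 310,
         " bees": 415, " in": 12, " a": 3, " cave": 500, "?": 9}

def toy_tokenize(s):
    """Greedy longest-match tokenization against the toy vocabulary."""
    ids = []
    while s:
        for piece, tid in sorted(vocab.items(), key=lambda kv: -len(kv[0])):
            if s.startswith(piece):
                ids.append(tid)
                s = s[len(piece):]
                break
        else:
            raise ValueError(f"untokenizable: {s!r}")
    return ids

sentence = "Eva, can I see bees in a cave?"
print(is_letter_palindrome(sentence))  # True: the letters mirror perfectly
print(toy_tokenize(sentence))          # the ID sequence does not
```

The letter sequence reads the same both ways, but the token IDs the model works with don’t, so recognizing the palindrome requires some representation of what letters each token contains.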

In this case, there’s a way around this seeming limitation (which we’ll see in a minute isn’t actually a limitation): “Eva, can I see bees in a cave?” exists on the Web as an example of a palindrome. So perhaps chatGPT didn’t construct it out of nothing.

But when given:

“mom bird rib taco cat taco cat bird rib mom”

it identifies it as a palindrome even though that phrase is a sort of googleWhack: it gets zero hits on Google.

When given the entire “2002” text and asked if it’s a palindrome, chatGPT confidently says that it is.

It also correctly says:

“mom bird rib tackle cat taco cat bird rib mom”

is not a palindrome, and points to the exact error.

So, when I asked chatGPT how it recognizes patterns in the letters that compose a word, it replied:

Even though my processing of text is based on tokens, these tokens include representations for individual letters, words, and punctuation marks, which allows me to analyze and generate text based on specific linguistic rules and patterns, including identifying palindromic structures.

A good answer. Or an excellent hallucination.

And since it was my friend Drew’s grandma who pointed out “2002” to her, I can conclude:

An answer, Drew’s Nana!

PS: No, my friend isn’t really named Drew. So sue me!

PPS: In case you’re wondering why the formatting of chatGPT’s responses is so inconsistent in this post, it’s because WordPress’ “Gutenberg” block editor is an enshittification of their plain old editor. But chatGPT guided me to the “classic” block in the Block editor, and I’ll be making a more thorough switch back to classic. I’m just too annoyed to fix the formatting on this particular post.


Categories: ai, blogs, everyday chaos, machine learning, tech Tagged with: ai • llm • ml Date: March 27th, 2024 dw

Be the first to comment »

June 6, 2023

Apple’s Vision Pro intro video is the product

The Apple Vision Pro launch video is a timelessly great example of its kind. Indeed, with a $3,500 price tag for what is essentially a display, the video is the actual product: it claims a new space for Apple.

At first glance, I was disappointed and skeptical: a cumbersome headset so I can watch a 2D virtual display? But it’s been so thoughtfully designed; they have, it seems, thought of everything. The video counters every objection that springs to mind. Do you want to show up in a FaceTime call with the dumb visor on? Oh, it erases it through the magic of AI! You want to be able to interact with real people in your physical space? Translucence to the rescue. So, good job, Apple!

I was impressed by the video’s under-selling of its 3D capabilities. That took a lot of marketing restraint. But playing them up would have distracted from what’s actually new: Apple’s intense focus on what makes this thing useful now, with the software tools we have and the way we work now, even though that will hardly justify the price for any of us.

At what price would I buy one? I won’t know until I try one, which I suppose is one mark of a genuinely innovative product. 


The Apple video in many ways reminds me of Bruce Tognazzini’s 1992 “StarFire” video that sketched a headset-free virtual workspace. It, however, was purely visionary with no product in sight. (Tog provides context for the video here.)


Categories: cluetrain, marketing, reviews, tech Tagged with: cluetrain • marketing • vr Date: June 6th, 2023 dw

3 Comments »

February 25, 2023

Trial by Fitbit

I watched some of the cross-examination of Alex Murdaugh, who is on trial for brutally murdering his wife and son — I’m a lawyer voyeur, as well as a reader of Bob Loblaw’s Law Blog — and happened to come in as the prosecution was pinning Murdaugh down with step-and-time data from the family’s cell phones. Sample:

8:05:35-8:09:52 p.m.: Alex Murdaugh’s phone records 54 steps.

8:05:46-8:15:24 p.m.: Paul Murdaugh’s phone records 303 steps.

8:06 p.m.: Paul’s phone begins moving from the kennels to the main house.

8:07:20 p.m.: Paul Murdaugh sends a Snapchat message to several friends.

8:09-9:02 p.m.: Alex’s phone records no steps, indicating he was not moving with the phone in his possession. He later told investigators he was sleeping during that time.

8:11:08-8:31:15 p.m.: Maggie Murdaugh’s phone is locked.

8:14-8:35 p.m.: Paul Murdaugh’s phone puts him at the main house.

8:15:55-8:21:45 p.m.: Paul’s phone records 140 steps.

8:17-8:18 p.m.: Maggie Murdaugh’s phone records 38 steps taken.

Note to self: The next time I plan on criming, leave my mobile phone at home watching PBS.


Categories: law, tech Tagged with: geolocation • law • maps • privacy Date: February 25th, 2023 dw

2 Comments »

January 14, 2023

How word processing changed my life: A brief memoir

I  typed my doctoral dissertation in 1978 on my last electric typewriter, a sturdy IBM Model B.

Old IBM Model 2 typewriter
Figure 1

My soon-to-be wife was writing hers out longhand, which I was then typing up.

Then one day we took a chapter to a local typist who was using a Xerox word processor which was priced too high for grad students or for most offices. When I saw her correcting text, and cutting and pasting, my eyes bulged out like a Tex Avery wolf.

As soon as KayPro IIs were available, I bought one from my cousin, who had recently opened a computer store.

Kay-Pro II
Figure 2

The moment I received it and turned it on, I got curious about how the characters made it to the screen, and became a writer about tech. In fact, I became a frequent contributor to Pro-Files, the KayPro magazine, writing ’splainers about the details of how these contraptions worked.

I typed my wife’s dissertation on it — which was my justification for buying it — and the day its power really hit her was when I used WordStar’s block-move command to instantly swap sections 1 and 4, as her thesis advisor had suggested; she had unthinkingly assumed the suggestion meant I’d be retyping the entire chapter.

People noticed the deeper implications early on. E.g., Michael Heim, a fellow philosophy prof (which I had been, too), wrote a prescient book, Electric Language, in the early 1990s (I think) about  the metaphysical implications of typing into an utterly malleable medium. David Levy wrote Scrolling Forward about the nature of documents in the Age of the PC. People like Frode Hegland are still writing about this and innovating in the text manipulation space.

A small observation I used to like to make around 1990 about the transformation that had already snuck into our culture: Before word processors, a document was a one-of-a-kind piece of writing, like a passport, a deed, or an historic map used by Napoleon; a document was tied to its material embodiment. Then the word processing folks needed a way to talk about anything you could write using bits, thus severing “documents” from their embodiment. Everything became a document as everything became a copy.

In any case, word processing profoundly changed not only how I write but how I think, since I think by writing. Having a fluid medium lowers the cost of trying out ideas, and it also makes it easy for me to change the structure of my thoughts. Since thinking is generally about connecting ideas, and those connections almost always assume a structure that changes their meaning (not just a linear scroll of one-liners), word processing is a crucial piece of “scaffolding” (in Clark and Chalmers’ sense) for me, and I suspect for most people.

In fact, I’ve come to recognize I am not a writer so much as a re-writer of my own words.

Figures

  1. Norsk Teknisk Museum – Teigen fotoatelier, CC BY-SA 4.0
    https://creativecommons.org/licenses/by-sa/4.0, via Wikimedia Commons
  2. By Autopilot – Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=39098108

Categories: culture, libraries, media, personal, philosophy, tech Tagged with: writing Date: January 14th, 2023 dw

3 Comments »

January 2, 2023

“Background items added” from “Fei Lv”

[Please note important boldface corrections in this post – Feb. 1, 2023]

Just in case you’ve started getting notifications on your Mac that “Software from ‘Fei Lv’ added items that can run in the background. You can manage this in Login Items Settings”, here’s an explanation that I could not find anywhere on the Internet.

Error message from Apple

This seems to be coming from NordPass [Nope. Coincidence.], a password manager I have been trying out as a replacement for 1Password. I like 1Password; it is a well-regarded and trusted password manager, but its UI has been getting overly complex for my tastes, mainly because I’ve entered too many redundant, broken entries. I recommend 1Password and will probably be going back to it. But NordPass was offering a great intro deal, and I’ve been a satisfied user of NordVPN for years now.

I started getting the annoying Fei Lv notifications and struggled to find which app, piece of software, or sneaky malware was causing them. Apple does not make it easy. It’s relatively easy to find in the log that the notification is happening, but not which app “Fei Lv” refers to. Neither Google nor Bing searches helped.

Trial and error, however, worked well. It looks like it’s NordPass [Nope]. So, if you’re hearing from the mysterious Fei Lv, try turning off NordPass through System Preferences and see if that does the trick. [It won’t] It did for me [For a few days].

I’ll consider turning it back on if NordPass reassures me that Fei Lv isn’t some malware that snuck onto my computer. [NordPass is not the problem. I still don’t know what is.]


Categories: misc, tech Tagged with: errors • notifications • passwords • tech Date: January 2nd, 2023 dw

4 Comments »

June 26, 2022

Life without Time Machine backups

While I’m waiting for Western Digital to replace my external backup drive — it lasted for 9 months, so that’s pretty good, right? — I’ve been finding comfort in using borg (free) (documentation), wrapped in the LaunchControl UI ($18), to do hourly incremental backups to my flush Transcend SD card.

LaunchControl is still a little too techy for me, but I got it working pretty quickly. Recommended! And the Transcend is fast enough that my Mac doesn’t hiccup when it’s being backed up to.


Categories: tech Tagged with: tech Date: June 26th, 2022 dw

Be the first to comment »

March 28, 2022

Semantic Wordle

There’s a new version of Wordle called Semantle — not one that I “predicted” — that wants you to find the target word by looking not for a chain of spellings but for a chain of semantics. For example, if you started with the word “child” you might get to the answer as follows:

  1. child
  2. play
  3. game
  4. chess
  5. square
  6. circle
  7. donut
  8. Homer

In short, you’re playing word associations except the associations can be very loose. It’s not like Twenty Questions where, once you get down a track (say “animals”), you’re narrowing the scope until there’s only one thing left. In Semantle, the associations can take a sudden turn in any of a thousand directions at any moment.

Which means it’s basically impossible to win.

It is, however, a good introduction to how machine learning “thinks” about words. Or at least one of the ways. Semantle is based on word2vec, which creates text embeddings derived from an analysis of some large — sometimes very very large — set of texts. Text embeddings map the statistical relationships among words based on their proximities in those texts.

In a typical example, word2vec may well figure out that “queen” and “king” are semantically close, which also might well let it figure out that “king” is to “prince” as “queen” is to “princess.”
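That analogy is just vector arithmetic. Here’s a toy sketch with hand-made three-dimensional vectors (invented purely for illustration; real word2vec embeddings have hundreds of dimensions learned from co-occurrence statistics, not assigned by hand):

```python
import numpy as np

# Made-up toy vectors; imagine the axes as roughly
# (royalty, maleness, adulthood)
vec = {
    "king":     np.array([0.9,  0.9, 0.9]),
    "queen":    np.array([0.9, -0.9, 0.9]),
    "prince":   np.array([0.9,  0.9, 0.1]),
    "princess": np.array([0.9, -0.9, 0.1]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "king" is to "prince" as "queen" is to "princess":
# king - prince + princess should land near queen
analogy = vec["king"] - vec["prince"] + vec["princess"]
print(cosine(analogy, vec["queen"]))  # 1.0 with these toy vectors
```

With real learned embeddings the result is close to, rather than exactly, the target word’s vector, but the arithmetic is the same idea.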

But there are, of course, many ways that words can be related — different axes of similarity, different dimensions. Those are called “vectors” (as in “word2vec“). When playing Semantle, you’re looking for the vectors in which a word might be embedded. There are many, many of those, some stronger than others. For example, “king” and “queen” share a dimension, but so do “king” and “chess”, “king” and “bed size”, and “king” and “elvis.” Words branch off in many more ways than in Wordle.

For example, in my first game of Semantle, after 45 attempts to find a word that is even a little bit close to the answer, I found that “city” is vaguely related to it. But now I have to guess at the vector “city” and the target share. The target could be “village”, “busy”, “taxi”, “diverse”, “noisy”, “siege”, or a bazillion other words that tend to appear relatively close to “city” but that are related in different ways.

In fact, I did not stumble across the relevant vector. The answer was “newspaper.”

I think Semantle would be more fun if they started you with a word that was at some reasonable distance from the answer, rather than making you guess what a reasonable starting word might be. Otherwise, you can spend a long time — 45 tries to get “city” — just generating random words. But if we knew a starting word was, say, “foot”, we could start thinking of vectors that that word is on: measure, toe, body, shoe, soccer, etc. That might be fun, and would stretch our minds.

As it is, Semantle is a game the unplayability of which teaches us an important lesson.

And now I shall wait to hear from the many people who are actually able to solve Semantles. I hate you all with a white hot and completely unreasonable passion.[1]

[1] I’ve heard from people who are solving it. I no longer hate them.


Categories: games, machine learning, tech Tagged with: ai • games • wordle Date: March 28th, 2022 dw

2 Comments »

January 12, 2022

Google Pixel 3 vs Pixel 6 Pro: Photos

I upgraded to a Pixel 6 because my Pixel 3’s charging plug no longer worked, the glass on the back was shattered, and battery life was down to 2-3 hours. I decided to splurge on the Pro version primarily because of its superior cameras.

But while I loved the photos the 3 took, I wasn’t as happy with the 6 Pro’s. I don’t know enough about photography to be able to articulate why I liked the 3’s photos better, other than to say the colors and shadows were richer and deeper. The 6’s photos were clearer and more detailed, but I generally just liked the 3’s better.

Then I uploaded three sets of samples into this blog post and discovered that most of the differences were due to the displays on the two phones. Viewing them on my MacBook Pro, I think I probably like the Pixel 6 at least as much as the Pixel 3….except for the photo of the hand below.

Here are unedited photos from each so you can decide for yourself.

Pixel 3 – backyard
Pixel 6 – backyard
Pixel 3 – nail polish
Pixel 6 – nail polish
Pixel 3 – hand
Pixel 6 – hand

Categories: reviews, tech Tagged with: google pixel • google pixel 6 Pro • review Date: January 12th, 2022 dw

2 Comments »

November 30, 2021

When your Chromebook colors get trippy

Our 3-year-old grandchild was randomly typing on my wife’s Chromebook and somehow made all of the images go wonky. This includes all images displayed in the browser, the system’s wallpaper, and even the icons.

I could not find any mention of this problem anywhere on the Internet, apparently because I insisted on using the word “posterize” to describe the images’ condition. The rest of the world apparently calls this “inverted.” I have been calling inverted images posterized probably since the late 1980s. It has never before steered me wrong. But according to dictionaries and what Google Search has learned from the Internet, I’ve definitely been misusing it.

But first, the solution to the Chromebook problem. I learned this from Iain Tait (@iaintait), who responded to my tweet asking for help. He pointed to this article in Chrome Unboxed. Our granddaughter unwittingly put the Chromebook into “high contrast mode.” Pressing Ctrl+Search+H will undo the little devil’s mischief.

Now, back to how I went wrong.

Posterization apparently was coined in the 1950s to refer to the process of turning a color image into the sort of stylized image often used in posters. Gradations in color are flattened, colors are brightened, and so forth, until the image would have been acceptable to The Beatles in their late psychedelic phase. Inversion is a 1:1 flipping of colors so that the original looks like what I think a color negative of it would look like, but I’m probably wrong about that too.
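The difference shows up in the per-channel arithmetic. This is a minimal sketch of my own (real tools like Pixelmator apply fancier curves, and the band values here are just one common convention):

```python
def invert(value):
    """Inversion: map each 8-bit channel value to its complement."""
    return 255 - value

def posterize(value, levels=4):
    """Posterization: flatten the 0-255 gradient into a few flat bands."""
    step = 256 // levels
    band = min(value // step, levels - 1)
    return band * (255 // (levels - 1))

print(invert(200))                                  # 55
print([posterize(v) for v in (10, 100, 180, 250)])  # [0, 85, 170, 255]
```

Inversion keeps every distinct value distinct (just mirrored), which is why the result reads as a negative; posterization collapses many nearby values into one, which is what flattens the gradations.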

Here’s an example using a photo of our post-Thanksgiving walk (CC-BY-SA-NC by me).

Original:

unfiltered image of people on a street

Posterized:

Posterized version

Inverted:

Inverted version

But the real point of this post is to let Google Search see a few more instances of posterize, posterized, and posterizing in the same sentence as image inversion and Google Chromebooks so that the next fool who confuses posterization and image inversion when faced with an image inverted by Chromebook will find at least one damn entry that clarifies a mistake that apparently no one else has ever made.

Posterize inverted images Chromebook. Posterize inverted images Chromebook. Posterize inverted images Chromebook. Posterize inverted images Chromebook.

(The transformations are by Pixelmator Pro.)

* * *

Iain has written a lovely post about how our paths crossed after many years.


Categories: misc, tech Tagged with: chromebook • images • posterization Date: November 30th, 2021 dw

Be the first to comment »

Next Page »


Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
TL;DR: Share this post freely, but attribute it to me (name (David Weinberger) and link to it), and don't use it commercially without my permission.

Joho the Blog uses WordPress blogging software.
Thank you, WordPress!