<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Academically Impertinent]]></title><description><![CDATA[This is my unfiltered, NSFW takes on AI without all the egos and pomp that I see in most of Academia/Silicon Valley/Wall Street.]]></description><link>https://academicallyimpertinent.com</link><generator>RSS for Node</generator><lastBuildDate>Tue, 14 Apr 2026 02:59:21 GMT</lastBuildDate><atom:link href="https://academicallyimpertinent.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Those chocolate chip cookie recipes online are the real AI danger]]></title><description><![CDATA[Ever try to google a recipe and you get the most annoying freaking article about how the author’s mom’s best friend’s adopted dog’s vacation to South America made them really want to learn how to make a proper Chilean Sea Bass? It goes on for pages. 
...]]></description><link>https://academicallyimpertinent.com/those-chocolate-chip-cookie-recipes-online-are-the-real-ai-danger</link><guid isPermaLink="true">https://academicallyimpertinent.com/those-chocolate-chip-cookie-recipes-online-are-the-real-ai-danger</guid><category><![CDATA[AI]]></category><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[alignment]]></category><dc:creator><![CDATA[Academically Impertinent]]></dc:creator><pubDate>Tue, 29 Apr 2025 18:14:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1745907124381/724bbd2a-8798-40b4-ae9b-b5d7616f713e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ever try to google a recipe and you get the most annoying freaking article about how the author’s mom’s best friend’s adopted dog’s vacation to South America made them really want to learn how to make a proper Chilean Sea Bass? It goes on for pages. Normally, with my hands covered in flour or something sticky to coat the screen of my phone, I scroll and scroll and scroll to find the one ingredient that I can’t remember – getting angrier at the stupid article.</p>
<p><img src="https://github.com/AcademicallyImpertinent/image_uploads/blob/main/pizelles.jpg?raw=true" alt="pizelles" class="image--center mx-auto" /></p>
<p>(<em>Look! I made cookies and I didn’t need to know about your brother’s Zen retreat to Hershey, PA.</em>)</p>
<p>The reason this happens is one of the biggest problems in AI. Everyone is worried about safety, and the destruction of the world, or Terminator, or job losses. Nope, the biggest problem is the tons of crap that precedes the ingredient list or oven temperature in a recipe.</p>
<p>This is because any AI or Machine Learning algorithm is really good at optimizing for whatever you told it you wanted. And yes, search engines like Google are AI. You’ve been using types of AI for decades now.</p>
<p>You told it you wanted the best recipe for some chocolate chip cookies – but a computer doesn’t know jackshit about cookies – it knows about words that describe cookies. And really, it doesn’t even know that. It finds patterns of words that people click on when they type words about cookies. So rather than knowing anything about cookies, it solves a proxy problem that seems close to the real one. Most of the time, a search engine is really good at finding words on webpages that are related to your search. Words describing what you want, but you didn’t know that you wanted … or didn’t know what they were – “Napoleon Defeat” —&gt; “Waterloo”. They co-occur and often answer your “information seeking need”.</p>
<p><img src="https://github.com/AcademicallyImpertinent/image_uploads/blob/main/cookie_word_cloud.png?raw=true" alt="recipe word cloud" class="image--center mx-auto" /></p>
<p>(<em>All the words on the left vs. only the recipe on the right</em>)</p>
<p>But for that killer Paella, there aren’t enough words in my 15-step recipe to distinguish between my recipe and the 50,000 other people who have posted something. So seeing a lot more words has helped Google/Bing/whatever the fuck else is out there/Ask Jeeves to find that recipe. If I describe every possible detail about saffron, some of those details are bound to be deemed related to paella, or better yet, even more closely related to saffron than that other jerk posting his recipe without his personal travel blog attached.</p>
<p>This is a classic problem where the algorithm has over-optimized for the proxy – what we call the objective function. It is really good at finding related words for what you search for – and most of the time that’s what you want – but not all the time. It’s good at something that appears similar to what you want. And it will get as good as possible at doing that thing – but it’s not that thing. It isn’t really telling you how much wasabi you need. It is trying to win a game and get the best score. More words about “Japan” or “horseradish” make that little score go up.</p>
<p>So instead of finding the perfect recipe, it just rewards longer and longer articles. The law of unintended consequences.</p>
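<p>If you want to see how easy a proxy like this is to game, here’s a toy sketch. It is nothing like a real search ranker (Google uses clicks, links, and a thousand other signals) – it just scores a page by counting query-related words, which is enough to watch the padded article win:</p>

```python
# Toy illustration of a proxy objective -- not a real search ranker.
RELATED = {"cookie", "chocolate", "chip", "butter", "sugar", "flour", "oven", "bake"}

def proxy_score(page_words):
    """The proxy: count words related to the query, not recipe quality."""
    return sum(1 for w in page_words if w.lower() in RELATED)

recipe_only = "preheat oven cream butter sugar fold in chocolate chip dough bake".split()
life_story = ("my grandmother first baked these chocolate chip treats after her trip "
              "and oh the butter the sugar the flour the oven the chocolate").split()
recipe_with_story = life_story + recipe_only

print(proxy_score(recipe_only))        # 6
print(proxy_score(recipe_with_story))  # 13 -- the padding wins
```

<p>Optimize that score hard enough and every recipe grows a travel blog.</p>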
<p>And this is all those fancy AI bots are good at as well … making some little score go up. You feed the algorithm the whole internet and it gets good at knowing that “United States of” is probably going to be followed by “America” … or when you ask ChatGPT to talk like a pirate, that the last word is probably “Yarrrrr”.</p>
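<p>The “United States of America” trick is easy to demo. Here’s a toy next-word predictor in the autocomplete spirit – it just counts which word follows which in a tiny made-up corpus (real LLMs use giant neural networks, but the objective is the same flavor):</p>

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; real models train on roughly the whole internet.
corpus = ("the united states of america " * 5 + "the united states of mind").split()

# Count, for each word, what tends to come next.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Predict the most frequent follower of `word` in the corpus."""
    return follows[word].most_common(1)[0][0]

print(predict_next("of"))      # america -- beats "mind" five to one
print(predict_next("states"))  # of
```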
<p>This is the biggest problem in AI … we are hyper-optimizing for little scores that aren’t actually exactly what we want. We aren’t going to destroy the world … if anything, we are just going to flood our emails with verbose, AI generated content that is wayyy too long and resembles the quote-unquote perfect cookie recipe that no one actually asked for.</p>
<p>Me, I’d take the killer robots any day of the week over having to read another stupid story about some personal discovery when I’m trying to cook, but unfortunately that’s not what we are going to get.</p>
]]></content:encoded></item><item><title><![CDATA[Xenophobia or Racism around DeepSeek …. Or just incompetence?]]></title><description><![CDATA[DeepSeek freaked out my Grandfather and College Roommate – but none of the AI researchers I know. I’ve spent the last 15+ years watching new AI models come out all the time. Progress is constantly being made and rarely a week goes by that I’m not imp...]]></description><link>https://academicallyimpertinent.com/xenophobia-or-racism-around-deepseek-or-just-incompetence</link><guid isPermaLink="true">https://academicallyimpertinent.com/xenophobia-or-racism-around-deepseek-or-just-incompetence</guid><category><![CDATA[h800]]></category><category><![CDATA[Deepseek]]></category><category><![CDATA[DeepSeekR1]]></category><category><![CDATA[DeepSeek-v3]]></category><category><![CDATA[janus pro]]></category><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[#Artificial Intelligence #Machine Learning #Deep Learning #AI Models #Neural Networks #Predictive Analytics #Data Science #Natural Language Processing #Computer Vision #Recommender Systems #Transfer Learning #Supervised Learning #Unsupervised Learning #Robotics #Big Data #Computer Science #Ethics in AI #AI Applications #AI in Business #Future of AI]]></category><category><![CDATA[Qwen 2.5 max]]></category><category><![CDATA[NVIDIA]]></category><category><![CDATA[h100 gpu]]></category><dc:creator><![CDATA[Academically Impertinent]]></dc:creator><pubDate>Tue, 11 Feb 2025 21:03:58 GMT</pubDate><content:encoded><![CDATA[<p>DeepSeek freaked out my Grandfather and College Roommate – but none of the AI researchers I know. I’ve spent the last 15+ years watching new AI models come out all the time. Progress is constantly being made and rarely a week goes by that I’m not impressed with a new model or technique. For me, and my colleagues, that’s what DeepSeek was … an impressive new model, but nothing that was mind blowing. 
There are a lot of cool insights and neat engineering advancements in the paper … not to sell it short … but that is scientific progress. It’s been fascinating watching the world (or at least the US stock market) freak out over DeepSeek the past few weeks, because it isn’t freaking out AI researchers. Most of us spent some time reading the paper, figuring it out, and playing around with it online, without having our world turned upside down. That is just part of the job. In fact, a lot of us had read the DeepSeek paper before everything blew up and had moved on. (<a target="_blank" href="https://stratechery.com/2025/deepseek-faq/">"Why haven't you written about DeepSeek yet?"</a>).  Instead….</p>
<p>The questions I get from researchers are … “<strong>Why?</strong>” And “<strong>Is this xenophobia or racism?</strong>”</p>
<p>It’s weird: normally I get technical questions from technical people while the rest of the world asks about racism and the world ending with AI. <em>With DeepSeek, I’m getting the AI researchers asking about racism</em>. I’m sure that’s part of it, but I think a lot of it is just incompetence from the financial analysts who cover AI. I don’t think they understand it – or at least the vast majority don’t. There’s been some good reporting and blog posts over the past few weeks as things have normalized a bit, but there’s still a lot of people freaking out about how a Chinese company did this.</p>
<p><strong>Technology improves</strong>. The way we are training our AI models is incredibly stupid … we throw insane amounts of data at these models … pretty much the entire internet … to do exactly what your iPhone’s stupid autocomplete algorithm does. You get some pretty neat results, but it isn’t really clear that brute force is the smartest way to do these things. There’s a landgrab going on right now with AI models and this simple, but expensive, way of doing things is the easiest. I don’t think any serious AI researcher expects this to continue. We’ve just incentivized Chinese researchers to move down this path a bit sooner since they can only get <a target="_blank" href="https://www.techradar.com/pro/a-look-at-the-unbelievable-nvidia-gpu-that-powers-deepseeks-ai-global-ambition">NVIDIA H800s instead of H100s</a>.</p>
<p>We will continue to see advancements like this going forward. Who knows where the next crazy advancement is going to come from? It could be a foreign company no one has ever heard of, or it might be a student at a University in the US. Anyone actually in the field is probably expecting this. But Wall Street is likely to freak out again … and then recover a few days later.</p>
<p><strong>Cost. $5.6 million</strong>. This is the thing that is really scaring most people – but it highlights a fundamental disconnect between someone who has built models and the broader public following NVIDIA’s stock price. This is a major underestimate of the total cost. In their paper, that figure covers only the “official training of DeepSeek-V3”. That is one run … which doesn’t count all the runs that go into preparing for that training run, nor the costs of DeepSeek-V2 which fed into it, nor the cost of salaries, etc. Plus, things break constantly when training models. You often have to go back and redo parts of the process. This engineering is hard and underestimated. Also, I remember hearing that the final training run for the LLM Bloomberg built was roughly $10M. Under $6 million is super impressive, but it is still a lot of money. And engineers cost a lot more: recent PhD hires are getting $500,000 salaries while Master’s students are getting $400,000. Sure, labor is cheaper in China, but costs add up quite quickly beyond a single training run. I’ve heard a senior professor in the field hypothesize that funding the entire DeepSeek company, not a single run, is actually a <em>billion dollars</em> … so, yes, you will still see a lot of CapEx from big tech companies going forward.</p>
<p>Great, so you can do a lot with less right now … now let’s see what happens when we scale that $5.6 million up to a hundred million. I bet we see this happening with a major US tech company (or a Chinese one). These improvements do not necessarily have to make things cheaper. We are instead going to see wayyy more powerful models coming out that are quite expensive.</p>
<p><strong>Cheating?</strong> I hear a bit as well that DeepSeek is cheating somehow. Like, stealing other companies’ intellectual property. Based on things like the number of parameters in the model, it doesn’t seem to be copied from any other model to start (no upcycling, etc.). However, there probably is distillation. OpenAI claims that DeepSeek copied some of their model’s outputs. Sure, this breaks the Terms of Service … but most of what we see in our AI models is vacuumed up from around the internet anyway. This is just slightly less socially acceptable than most of the other things we see. OpenAI doesn’t tell you where their data comes from. More open models like Meta’s Llama models do not tell you their data either. This feels more like it is on that spectrum rather than complete corporate espionage.</p>
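<p>For the curious, distillation in its most stripped-down form just means fitting your model to another model’s outputs instead of real labels. Here’s a toy sketch with made-up data and linear “models” standing in for the real networks:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                # made-up "prompts"

teacher_w = np.array([1.0, -2.0, 0.5, 3.0])  # stand-in for the big, expensive model
teacher_out = X @ teacher_w                  # the teacher's answers -- no real labels anywhere

# Distillation in miniature: the student is fit to the teacher's outputs.
student_w, *_ = np.linalg.lstsq(X, teacher_out, rcond=None)

assert np.allclose(student_w, teacher_w)     # the student now mimics the teacher
```

<p>Swap the linear fit for a neural network and the teacher’s outputs for ChatGPT responses, and you have the thing OpenAI is complaining about.</p>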
<p><strong>What actually scares me: User Logs</strong>. What makes Google search work? It’s not some fancy algorithm … it’s that Google knows what people actually click on when they type in a query. That is worth more than anything. As people use DeepSeek more and more, this will be the most valuable thing that they get … user interactions.</p>
<p><strong>My main takeaways are:</strong></p>
<ul>
<li><p>Cool, I should play around with Mixture-of-Experts more.</p>
</li>
<li><p>We will still throw a lot of money at GPUs.</p>
</li>
<li><p>User logs will be the main goldmine that this generates for DeepSeek.</p>
</li>
<li><p><s>Auto-Regressive models</s> iPhone autocorrect will continue to be key (see Janus below).</p>
</li>
</ul>
<h2 id="heading-some-other-thoughts-about-deepseek-and-chinese-ai-models">Some other thoughts about DeepSeek and Chinese AI models:</h2>
<p>For a more technical audience, I think the following blog is super useful: <a target="_blank" href="https://epoch.ai/gradient-updates/how-has-deepseek-improved-the-transformer-architecture">https://epoch.ai/gradient-updates/how-has-deepseek-improved-the-transformer-architecture</a> The start of the conclusion sums up what I think a lot of AI researchers think about DeepSeek:</p>
<blockquote>
<p>“I see many of the improvements made by DeepSeek as “obvious in retrospect”: they are the kind of innovations that, had someone asked me in advance about them, I would have said were good ideas. However, as I’ve said earlier, this doesn’t mean it’s easy to come up with the ideas in the first place.”</p>
</blockquote>
<p><strong>Mixture-of-Experts!</strong> I expect a lot more work will come out talking about Mixture-of-Experts (MoE). This seems to be a key insight from DeepSeek. MoE has been around for basically ever (in AI terms), and the modern sparsely-gated version was introduced by Google in 2017 (<a target="_blank" href="https://arxiv.org/abs/1701.06538">MoE</a>). It is actually older than the Transformer paper – the main AI model that everyone uses now. Meta (aka Facebook) used it in their No Language Left Behind (NLLB) paper a few years ago, which is where I started seeing it gain more traction.</p>
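<p>For the uninitiated, the sparsely-gated MoE idea fits in a few lines: a router scores the experts, only the top-k winners actually run for a given input, and their outputs get blended. Here’s a toy numpy sketch – real experts are neural sub-networks, not random matrices, but the routing logic is the same shape:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 4, 2

# Each "expert" is just a matrix here; in a real model it's a feed-forward network.
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
router = rng.normal(size=(d, n_experts))     # the gating network

def moe(x):
    logits = x @ router
    top = np.argsort(logits)[-top_k:]        # pick the top-k experts for this input
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the winners
    # Only the chosen experts run -- the other experts cost nothing for this input.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

y = moe(rng.normal(size=d))
assert y.shape == (d,)
```

<p>The payoff: you can grow the total parameter count (more experts) without growing the compute per token, which is a big part of how you train something impressive on a constrained GPU budget.</p>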
<p>I started digging into more of DeepSeek’s papers and they basically spell out everything that they do … going back a while. In hindsight, if you’d been following their papers (but why would anyone have?) you would have seen that they have been going all-in on Mixture-of-Experts.</p>
<p><strong>Alibaba Qwen-2.5-Max</strong>. Not to be upstaged by a rival Chinese company, Alibaba quickly pushed out a press release about its model and how it beats DeepSeek: <a target="_blank" href="https://qwenlm.github.io/blog/qwen2.5-max/">https://qwenlm.github.io/blog/qwen2.5-max/</a> I spent a bit of time reading this and quickly realized that it is an unfair comparison. The world was shocked by DeepSeek R1, not as much by V3 (R1 is trained off of V3). Alibaba’s press release is deceptive and they actually do not beat R1 (for now … I’m sure their model will improve). Here’s the figure they show … augmented with R1’s actual numbers.</p>
<p><img src="https://github.com/AcademicallyImpertinent/image_uploads/blob/main/deep_seek.png?raw=true" alt="DeepSeek" class="image--center mx-auto" /></p>
<p><strong>Janus-Pro-7B</strong> came out shortly after the R1 model and lots of people took interest in this as well. This is their computer vision model. It too is impressive, but it didn’t shock me any more than any other tech company’s latest model. It does two tasks (including the harder generation task, aka making images), but I was most reminded of <a target="_blank" href="https://arxiv.org/abs/2310.03744">LLaVA-v1.5</a> which came out of the University of Wisconsin-Madison. It’s interesting to see this and <a target="_blank" href="https://arxiv.org/abs/2409.18869">Emu-3</a>, which are showing that auto-regressive models are winning in this modality (images, not just text) as well. Overall, you can train those models with academic-level resources. In other words, you do not need to be a wealthy tech company.</p>
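<p>“Auto-regressive for images” sounds fancy, but it’s the same next-token loop as text, just over a flattened grid of image tokens. A toy sketch, with a dummy random model standing in for the trained network:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 4          # a tiny 4x4 "image"
vocab_size = 256   # e.g. one token per 8-bit pixel value

def dummy_model(tokens):
    """Stand-in for a trained network: should condition on `tokens`; here it's random."""
    probs = rng.random(vocab_size)
    return probs / probs.sum()

tokens = []
for _ in range(H * W):                   # the exact same next-token loop as a text LLM
    probs = dummy_model(tokens)
    tokens.append(int(rng.choice(vocab_size, p=probs)))

image = np.array(tokens).reshape(H, W)   # un-flatten the sequence back into an image
assert image.shape == (H, W)
```

<p>That reuse is the appeal: one architecture, one training recipe, any modality you can chop into tokens.</p>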
<p>And since we are on the theme of Ancient Rome, my prediction for when the next iteration of Janus comes out is the Ides of March. Based on a lot of other DeepSeek papers, I’m guessing that this will be using a lot of Mixture-of-Experts. I wholeheartedly expect the financial markets to freak out again.</p>
]]></content:encoded></item></channel></rss>