LLMs Are Useful Liars with Andriy Burkov

KIMBERLY NEVALA: Welcome to Pondering AI. I'm your host, Kimberly Nevala.

In this episode, I am beyond pleased to be joined by Andriy Burkov. Andriy is a Ukrainian footballer, or at least a football aficionado, who holds a PhD in artificial intelligence. More specifically, I believe, in multi-agent systems. Now, I am teasing him a little bit here, but I believe both of those things still to be true.

As the author of The Hundred-Page Machine Learning Book and, more recently, The Hundred-Page Language Models Book, Andriy is well-known for his deep expertise and no-nonsense takes on all things machine learning and AI.

Today, we're going to be discussing the current state of affairs relative to LLMs and agentic AI, as well as what we can all do to develop better intuitions about this tech. So, Andriy, welcome to the show.

ANDRIY BURKOV: Happy to be here, Kimberly. And everything was correct except for the footballer part. I don't know why those chatbots keep calling me a Ukrainian footballer. There is no one with this first and last name in Ukraine. But somehow, this information managed to get into their parameters, and there is probably no way to remove it.

KIMBERLY NEVALA: I did a little research. I heard that reference on a previous podcast and figured it was so apropos, given the course of today's conversation, that I couldn't resist throwing it out there.

ANDRIY BURKOV: Yeah, I also, apparently, worked for Google and started quite a few companies and sold them to various parties - again, based on someone's claims.

KIMBERLY NEVALA: Well, congratulations. You're even more impressive than we already knew. Now, before we talk more specifically about LLMs and agentic AI, I'm curious how your own experiences working in the field have influenced your approach as an author, both in terms of the topics you choose to take on and the questions you seek to address in your books.

ANDRIY BURKOV: Well, of course, when I started to work on the language models book, I was like, OK, if I work on the language models book and there are language models, then, probably, it's a good idea to at least try to use them for writing.

And I was quite disappointed by the experience because it lies. It lies all the time. When I started, it was about a year ago. Of course, today the behavior has changed a little bit, thanks to reinforcement learning, larger data sets, and larger parameter counts.
What I've noticed change over the past year is that previously, when it lied and you said it was wrong, it answered something like, oh, yeah, right, that's wrong, here is a new version. And this new version could also be a lie, but at least you could make it change its position on some topic. But now, with these new reasoning models, they very often double down on lies. They tell you a wrong thing, you say, this is wrong, and the answer is, no, this is not wrong, and here is why it's not wrong. And it adds additional lies to support the previous one. So I was disappointed.

But I found a good use, still. It's to polish my English because, well, I have a mix of Russian sentence structures and English words. So if I really feel like something doesn't sound right to me, I can go to an LLM and say, can you rephrase this so that it sounds more natural? And another good use case is to verify your own text. Of course, we all make mistakes, and sometimes we can miss something important.
And in your head, it feels logical. But you can still put this text back into an LLM and say, does it contain any mistakes? Is there anything that I missed or should have said differently? And it might find errors.

So, like, proofreading, copy editing? Yes. Machine translation? Yes. Writing? No.

KIMBERLY NEVALA: And one of the things I always appreciate about your work is you are very clear about the fact that you are not critiquing the tech, per se. Because the tech is what the tech is. But you're very targeted at critiquing the rampant hype or misrepresentations of what the tech can do, and, therefore, how it can or should be used.

You alluded, I think, to some of this in that introduction. But for those of us working in the field, for folks working within businesses, thinking about where and when and how to apply this, and equally important, how not to apply large language models in particular, what are some of the core ground truths, if you will, or limitations that we need to always acknowledge or keep front and center? Not to avoid using it, but in order to use and apply these systems correctly.

ANDRIY BURKOV: Yeah, I just will answer the first part about criticizing.

So I'm actually surprised that more people don't do this like I do. Because if you are a scientist and you are in this domain, you know that many claims those CEOs and venture capitalists make about what will happen next year or in five years - it will replace this profession, that profession - those claims have no grounds at all. They just made them up. And why don't scientists become vocal more often? I know that Yann LeCun has criticized these claims. But other than him and Gary Marcus, probably, I cannot really remember any other scientist doing this.

So from the practical standpoint, I think the biggest challenge for me and for everyone else who tries to find good uses for this technology is that we don't know the data set that was used to train it. And why is it important? Because, well, maybe two years ago, when ChatGPT was first released, we were under the impression that it really thinks, that it really can give you answers no one else could think of.

But now we understand that they are limited to the data set on which they were trained. And this is exactly the same scenario as when you work on, let's say, a classifier, and you want the classifier to distinguish between different kinds of animals, for example. Even if you define a class for some animal, if you don't have pictures of it in your training set, the model will always be wrong about this specific class.

So this is the same thing with LLMs. If some use case wasn't represented by the documents in the training set, then the model will hallucinate a lot in this specific domain. And the reason this is a problem is that you don't know in advance whether you will waste your time or save time.

And this is important because, for example, when I code and I start from scratch… I just posted yesterday that over the weekend I built an entire application for professional translators, because I am translating my book from English to French. And I was tired of doing this manual back and forth between my text file and an LLM. You cannot paste an entire chapter; you need to paste three, four paragraphs. Otherwise, it will rewrite it wrong.

So I decided, OK, rather than wasting my time doing this back and forth, why wouldn't I build an application that will do this? I just click a button: translate the next batch. It communicates with an LLM, performs all the needed validations and verifications, and then shows me the result that I can correct slightly, then click next batch, next batch, next batch.

So I didn't know whether I would succeed, whether I would save time rather than waste time trying to build this application myself. And it turns out that for this specific application, after two days of work, I now have an application. I even connected the billing and the authentication, everything.
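
To make that workflow concrete, here is a minimal sketch of the batch-translation loop Andriy describes. It assumes a placeholder ask_llm() function standing in for whatever LLM API is used; the function name, the validation rule, and the batch size are illustrative, not the actual application.

```python
# A minimal sketch of the batch-translation loop described above.
# ask_llm() is a placeholder for whatever LLM API is used; batching by a few
# paragraphs mirrors the "three, four paragraphs at a time" constraint.

def translate_in_batches(ask_llm, paragraphs, batch_size=4):
    translated = []
    for i in range(0, len(paragraphs), batch_size):
        batch = "\n\n".join(paragraphs[i:i + batch_size])
        draft = ask_llm(
            "Translate the following paragraphs from English to French. "
            "Keep the paragraph breaks and do not add or remove content.\n\n" + batch
        )
        # basic validation: the translation should keep the same number of paragraphs
        if draft.count("\n\n") != batch.count("\n\n"):
            draft = ask_llm("Redo the translation, preserving paragraph breaks:\n\n" + batch)
        translated.append(draft)  # shown to the translator for light correction
    return translated
```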

But if you work in some other domain - for example, some very niche application, say, I don't know, drivers for some hardware - and you want an LLM to generate code for you, I would bet that it will not work very well. You will try. You will explain what your hardware is and what you want this driver to be able to do with it. But low-level code is rarely online; usually those GitHub projects are user-facing applications rather than middleware or firmware.

So you will try. You will see that it doesn't work. You will go to a different LLM. You will try again. It doesn't work. Eventually, you will realize that you wasted several hours, maybe a day, and it doesn't work. So maybe if you had spent that day just coding it by hand - if you are, of course, a professional developer - you would have saved time by working by hand.

So it's the same for all domains. The problem is we don't know, for this specific tool, what it's capable of. And there is no way for us to know unless we have access to the training set. And those companies don't want to share their training sets, for many reasons. Well, they say it's for competitive reasons.
There are also copyright reasons. They don't want everyone to see that the data sets contain stolen books. Like DeepSeek, in their paper, said, yeah, we wanted to increase the quality of our model, so we downloaded the LibGen archive. And once we added this data, our model became state-of-the-art. So they don't want to disclose for this reason either.

So yeah, it's kind of the first tool we have where we don't know what it can do. You have a hammer - you know what it can do. You have, I don't know, a screwdriver - again, you know what it can do. You have AI, and you have no way to know.

KIMBERLY NEVALA: Yeah, I wasn't sure whether I should be, we should be, applauding or just groaning over DeepSeek's admission of that. Since it's probably true for a lot of the other models as well. So there's --

ANDRIY BURKOV: Well, I'm sure it is, and I will not accuse anyone.

But everyone knows that the best-quality LLM output comes from high-quality data. So when they start pre-training, they usually start with the lowest-quality data and then, gradually, add data that is higher quality and more complex. When I say quality, it's books. And when I say complex, well, it's code, or scientific papers with a lot of examples of how to prove, let's say, a specific theorem, and so on.

So if you don't have this quality data, your model will be fluent, but it will be fluent in a useless way. So when you see that they stay competitive - despite claiming that it's the Chinese who steal data - you wonder why. If you say that you don't steal, how come you are as good as those who do?
And it's not a tiny thing. If you take LibGen, it contains almost every book ever published. It's a lot.

So how can you just look at it and say, oh, yeah, well, I can use it, but I won't because I'm a good person?

KIMBERLY NEVALA: [LAUGHS] Now, you said earlier that, for instance, you have expertise and you code. And maybe you haven't coded in certain areas, but you're able to use it. Probably because you have a level of experience and intuition that tells you when what is being provided, the output, is at least within the realm of reason or possibility. That may be poorly stated.

But something that gets brought up a lot with large language models, or this type of model, is what people call the expert paradox. The catch-22, they will say, is: in order to know whether this is just confidently confabulating something or making it up - a hallucination, as it's more commonly called - you actually have to know something about the topic as well. And therefore, a lot of these claims about where this will drive efficiency and productivity are really, really thin.

How do you think about that issue? Sometimes I've heard that called, as I said, the expert paradox. And does that limit how or when we should think about utilizing these tools or incorporating these tools into things like business workflows?

ANDRIY BURKOV: Well, I think the problem with LLMs is that they are regular machine learning models. And regular machine learning models don't know what they don't know. There is no way for them to distinguish between high-quality, reliable training data and low-quality, unreliable training data. Nor is there any way for them to distinguish between situations where they should provide an answer and situations where they shouldn't, because there was no data to support it.

So every time it tells you yes or no, it's a gamble for you as a user whether to trust it or not. Because this yes or no, first of all, doesn't come with any kind of certainty score. And even if it did come with a certainty score, this score would be arbitrarily wrong. Because usually, in neural networks, if you train a classifier, the predicted class will always have a certainty of about 90%, 95%, 98%, independently of whether the prediction is correct.
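
For readers who want to see why such a "certainty score" is not very reassuring, here is a tiny numerical sketch. The logits are made-up; the point is only that a softmax head can report a high top-class probability whether or not the prediction is right.

```python
import numpy as np

def softmax(logits):
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

# A classifier head produces logits for three classes. Even for an input
# far from anything in the training data, the softmax still yields a
# confident-looking score for whichever class happens to win.
logits_for_unfamiliar_input = np.array([4.1, 0.3, -1.2])
probs = softmax(logits_for_unfamiliar_input)
print(probs)        # roughly [0.97, 0.02, 0.01]
print(probs.max())  # "certainty" ~0.97, regardless of whether the prediction is correct
```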

Like when I use it, for me it's always a question of: how big a problem would it be for me if I blindly trusted this answer? In many situations in life, you have no idea at all, and you would just like to have some idea. So if you ask a question and get some idea, I would say that maybe 80% or 90% of the answers, for the most basic questions, would be OK, would be reliable. And maybe 10% would be wrong. But again, when they are wrong, is it a deal breaker for the user?

Often, you would just prefer to be slightly informed rather than not informed at all. Or slightly wrong, because in life we are often slightly wrong. We discuss something between different people, and someone might say, oh, this works this way, and someone else might say, no, no, it works that way. Eventually, you can google it or open Wikipedia and really read about how it works.
But between the two people, one was wrong while thinking they were right. And they could still live their life, and it didn't hurt their overall life experience.

So for everything mission-critical, of course, you need to verify every claim. When you work, let's say, on a scientific paper and you ask an LLM to prove some property of your algorithm and it gives you a proof, you cannot just take it and put it in the article and publish. So you still need to read through the proof. If it contains wrong assumptions or direct mistakes, you should go back and say, here, it's wrong. And at some point, you can even realize that the LLM cannot prove this theorem, so you will have to prove it yourself.

But again, starting from something that doesn't work is often better than starting entirely from scratch. So this is how I usually use this tool: as a provider of an almost-completed project that you may or may not be able to complete. But at least you don't start from scratch.

KIMBERLY NEVALA: There's an old saying that you can make a good decision with bad information, as long as you have a sense of how bad the information is.

ANDRIY BURKOV: Yeah.

And by the way, something I often say is that to really maximize your benefit from these models, you should be an expert in the domain. So if you want it to prove a theorem, you should be a scientist in this domain. In this case, you will see where it's wrong, and you will correct it. If you wanted to write code, you should be an experienced coder so that when you see that it applies a band-aid instead of addressing the core issue, you will say, no, no, I don't accept the band-aid. Find the core issue.

But if you just blindly accept everything it generates, you will end up with an application that kind of works. But if you really try to improve it in the future, you will see it's so poorly done that it's better to rewrite it from scratch.

So if you are an expert, you will benefit the most. But many people, unfortunately, are not experts. And for them, having access to this tool is probably their closest opportunity to discuss the domain with someone. Imagine if every one of us had access to a university professor experienced in some domain. Imagine how cool it would be. We want to build something, write something, invent something, and we could call that person and say, hey, I'm stuck with this problem. And they would say, yeah, you should try this tool and read this paper, and so on. We could all become very productive builders. But very few of us can afford such connections.

So ChatGPT or whatever GPT is the closest thing that we can have. You can always ask very hard questions to an LLM. Of course, it's a gamble whether the answer will be accurate, but the alternative would be no answer at all.

KIMBERLY NEVALA: You will hear about hallucinations particularly when we're looking at domains, especially within a business context, where there's a really low tolerance for error, or where the implications of a wrong answer are fairly profound. And then we bring in things like RAG (or retrieval-augmented generation), fine-tuning, and other types of techniques.

Can you give folks a sense of how those techniques work - maybe use RAG as an example, or fine-tuning? And then, to what extent can those really address what I would say is more of a feature than a bug, this issue of "hallucinations"?

ANDRIY BURKOV: Well, RAG is, of course, a good feature because LLMs, as we discussed earlier, were trained on some data set, and you don't know exactly what it's composed of. So when you can put documents that you trust into the context and get answers to your questions, of course, it's a very good feature.

Well, it doesn't exclude hallucinations entirely. They still can happen. And it was demonstrated on multiple occasions that you can put, let's say, a scientific article into the context and start asking questions about the article, and the LLM will provide answers that are not supported by the article text, or are supported but in an entirely different way. So you still have to be cautious about hallucinations, but of course, they are reduced a lot. When I say of course, it's just from experience, not because I know better than everyone else how it works.
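
For concreteness, here is a minimal sketch of the RAG pattern being discussed: retrieve the document chunks most similar to the question and put them into the prompt. The embed() and ask_llm() helpers are placeholders for whatever embedding model and LLM API you use; they are assumptions, not a specific library.

```python
import numpy as np

def retrieve(question, chunks, chunk_vectors, embed, k=3):
    """Return the k chunks whose embeddings are closest to the question."""
    q = embed(question)
    scores = chunk_vectors @ q / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q) + 1e-9
    )
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

def answer_with_rag(question, chunks, chunk_vectors, embed, ask_llm):
    # stuff the retrieved chunks into the prompt and constrain the answer to them
    context = "\n\n".join(retrieve(question, chunks, chunk_vectors, embed))
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return ask_llm(prompt)
```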

But the downside of RAG is that your model might have been able to provide an answer based on its training data, but you say, forget about that, this is my data. So if there is some hole in the information in your provided document, the model might just assume there is nothing to fill that hole, because you said, answer my question based on this article. Without RAG, the model might have provided you with an answer that is actually better, because it was trained on data that is more detailed or covers a wider range of information than just what you provided in your document.

So it's the precision-versus-recall trade-off in machine learning. If you want your model to make as few mistakes as possible, it will probably work for only 10% of the inputs; for the rest, you will not have any prediction at all. Or if you want your model to give predictions for all possible inputs, then the accuracy of those predictions will go down. And it's a smooth curve, so you need to choose where you want to be. Do you want to provide as many answers as possible, or as few errors as possible? It's always a trade-off; you cannot have both.
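
Here is a toy illustration of that trade-off, with entirely synthetic numbers: the "model" answers only when its top probability clears a threshold, and raising the threshold trades coverage for accuracy.

```python
import numpy as np

# Toy illustration of the coverage-versus-accuracy trade-off: the model
# answers only when its top-class probability clears a threshold.
rng = np.random.default_rng(0)
n = 1000
confidence = rng.uniform(0.34, 1.0, n)      # model's top-class probability (synthetic)
correct = rng.random(n) < confidence * 0.9  # assume higher confidence tends to be right more often

for threshold in (0.0, 0.5, 0.7, 0.9):
    answered = confidence >= threshold
    coverage = answered.mean()
    accuracy = correct[answered].mean()
    print(f"threshold={threshold:.1f}  coverage={coverage:.0%}  accuracy on answered={accuracy:.0%}")
```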

RAG, in this sense, is similar. You say, OK, I will only accept answers based on this document because I trust it, and it will give you maybe half of the answers you might ask for. But if you remove the RAG constraint, it will give you 100% of the answers. Some of them, obviously, will be less accurate, but some may still help you in your business.

So maybe it makes sense to try both. So give the user two perspectives. For example, OK, this is the answer based on our knowledge base. And this is the answer based on overall training set. So use both of them, depending on your aversion to risk. So can you tolerate more information with higher chance of error or not?

KIMBERLY NEVALA: And I get this question a lot, but I suspect you'll have a better way of explaining it than I will. Which is: someone will take an example or a document and ask it to be summarized. And the summary will pull information that, A, is not at all represented in the input document, and sometimes it's completely unrelated. How does that happen?

ANDRIY BURKOV: Can you repeat the question?

KIMBERLY NEVALA: So, I've had this example where folks have said, I asked - maybe it was Copilot, or I put it into ChatGPT - to summarize a document. And it came back with a summarization that, for the most part, might have been right. Maybe it didn't summarize the points they thought were most important, which is something to be aware of. But then it throws in a factoid or a reflection that was not part of the source document at all. And I'm interested in how you explain why that happens to folks who are not analytically or algorithmically inclined.

ANDRIY BURKOV: Yeah. Well, the reason for this is that the model, when it starts generating the answer or the summary, doesn't know where it will end. It starts by predicting the first word or token, then the second word, then the third word. So each prediction is based on what has already been predicted and on the original input.
So if it started by predicting something, that something logically leads to a particular next word. The model now has a conflict: it wants to predict something that is very logical given its initial prediction, but it's also asked to stay conditioned on the original input. And what choice will the model make?

Well, the model doesn't make choices. We, as designers, say, OK, every time you predict a new word, you give us a distribution of probabilities over every possible token, and we sample the next token from this distribution. So in this situation, the model might give high weight to two words: one word that is more logical given the sequence that has already been generated, and another word that also has high probability given the original input you want summarized.

And then it's down to chance. You sample from this distribution. You have two words that are equally probable, and you sample the wrong one. The model, instead of "potato," puts "tomato." And then it's stuck with this: it has already predicted tomato rather than potato, so it needs to continue talking about tomatoes and not potatoes.
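
A tiny sketch of that sampling step, with made-up numbers: the model assigns probabilities to candidate next tokens and one is drawn at random, after which every later token is conditioned on that draw.

```python
import numpy as np

# Toy sketch of the sampling step described above.
rng = np.random.default_rng()

candidates = ["potato", "tomato", "carrot"]
# Hypothetical next-token probabilities: "potato" fits the source document,
# "tomato" fits the text generated so far, and the two are nearly tied.
probs = np.array([0.48, 0.47, 0.05])

next_token = rng.choice(candidates, p=probs)
print(next_token)  # roughly half the time this is "tomato",
                   # and every later token is then conditioned on that choice
```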

And this is why, for example, these reasoning models improve the user experience and overall model quality. It's because of this reasoning phase, which is usually hidden - well, today you can unroll it and see what was going on, but usually it's not for the user, it's for the model. It's a buffer where the model can put different predictions: different candidate summaries, different versions of how to phrase different sentences. And once it has put enough in this reasoning sequence, it says, OK, now I have enough here to generate a better-quality summary.

It's because it now has access not just to the original document in the prompt and the beginning of the sequence, but also to this large internal conversation where it has put different facts. If it put enough different facts into this conversation, it can condition on them too. So even though the final generation is still based on token-by-token prediction, it can now condition on many more factors. So "tomato" will probably not outweigh "potato," because it has put a lot of support for potato and not for tomato. That's one reason.

Another reason, as I said, is that the model has blanks in its training data. So even if you put in a document you want summarized, if this document comes from an entirely different universe of documents - something that is very rarely online - the words in this document and the sequences of those words will look very new to the model.

And when I say it will look very new, it means the model will embed them - convert them into internal numerical representations - based on very poorly trained parts of its set of parameters. And if it represents your input with very poor-quality numbers, then the output will also be poor quality. And it will not be poor quality in a good sense; it will be poor quality in an unpredictable sense. So sometimes your summary might be entirely wrong just because the representation of your original document was not high quality.

KIMBERLY NEVALA: And so is it fair to say that when it comes to using LLMs, even when we're thinking about applying those in production systems, that prompt engineering is not all you need here? That we're still going to need to rely on good old-fashioned engineering and application life-cycle rigor?

ANDRIY BURKOV: Well, again, it's the same problem here.

So people often say, oh, you just didn't use the right prompt. With the right prompt, it would solve your problem. But again, if your problem is not in the training data distribution, and you have no way to know, then no matter how hard you try and you write your prompt, it will make mistakes.

And well, for example, let's take coding. I said that we can very quickly get something that works but isn't perfect and then go from there. This first initial prototype that works will not be hard for you to obtain. Because, for example, if you build a user-facing application, you need a login page, a billing page, a projects page, a documents page, and so on.

But then maybe you want the documents your users work with in your application to be processed in some unique way - say, you invented a new way of converting documents into something, or extracting something from them. If you just ask the model, I want you to extract all, I don't know, location names from the document, that will work very well, because the model has seen so many locations. But if you say, I want to extract molecules having a certain structure, and you are the first person on Earth working on these molecules, the model will not know how to do it.

So basically, if you want your LLM to generate code that implements your novel algorithm applied to those molecules in the document, then you will have to write, line by line, exactly what you want your code to do. And if you write line by line, well, this is regular coding; you've just replaced Python with English. You don't save much, because writing line by line is still very time-consuming.

And I think you would rather write it in Python. Because at least when you write in Python, you can be sure that when you run it, it will most likely do exactly what you want. But if you replace Python with English and write a solution line by line, you are never sure that it will implement this verbal algorithm exactly the way you intend.

So yes, the prompt is still important. You need to put as many details in your prompt as you can. And I often do this: before I ask an LLM to generate code, I write a prompt, and then I say, rewrite my prompt for clarity and let me validate it first, and add any details that you think I missed but would rather have there. So the LLM rewrites your prompt in a more structured and more detailed way. I read it, I can copy and paste it, add some additional constraints or instructions, and then I ask for the code. This greatly improves the probability of getting a high-quality result.
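
As a rough sketch of that two-step flow - refine the prompt, validate it by hand, then ask for the code - something like the following works; ask_llm() is again a placeholder for whatever chat API you call.

```python
# A rough sketch of the two-step prompting flow described above.
# ask_llm() is a placeholder for whatever chat-completion call you use.

def refine_prompt(ask_llm, rough_prompt):
    """Step 1: have the LLM rewrite the request; the human then reads and edits it."""
    return ask_llm(
        "Rewrite the following coding request for clarity and structure. "
        "Add any details you think I missed but would want specified. "
        "Return only the rewritten request.\n\n" + rough_prompt
    )

def generate_code(ask_llm, validated_prompt):
    """Step 2: run only after the refined prompt has been read, edited, and approved."""
    return ask_llm(validated_prompt + "\n\nNow write the code.")
```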

But again, as I said, because I don't know whether this specific functionality was present in some training document or not, it might be good. It might be not good. And it's not always the prompt to blame. There are more fundamental reasons for this.

KIMBERLY NEVALA: Yeah, thank you for that.

It may seem like a bit of a pedantic point, and perhaps it is. But I hear a lot of frustration, probably in a lot of cases from business leaders or program leads, who get stuck on how fast it was to develop an initial proof or prototype, even using an LLM. And then they can't quite figure out all of the daylight between doing that and actually getting something foolproof, or to a level of accuracy or standard that they feel comfortable with in production.

It can be very frustrating, then, for them to say, well, in fact, the final solution may have nothing to do with your LLM. It's one way to use it as an on-ramp to proving the solution. And it may or may not need to be part of the final. So I think that is helpful for the audience as well.

ANDRIY BURKOV: Yeah, and what I really appreciate about working with LLMs is this. When you work in AI or with machine learning, let's say you build some complex system where you need to plug different machine learning models into specific places to implement the workflow of your application. For example, the inputs flow here, then some model predicts whether it's this or that, then you implement some hard-coded logic, and then, again, there is a decision made by a model.

So in the past, to get the first working prototype, you had to program and train all these models from scratch. You needed to gather the data set, design the code that would feed this data set into a model and train it, and then work on making it better - feature engineering and so on.
But now, you can replace all or most of these places with an LLM. You just say, you act as a classifier, you have a choice between three classes, and the documents or inputs you receive will look like this. And you will get a classifier, just like that. Of course, it will probably not be a perfect classifier, but you will have all of them in the right places in almost no time.
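
A minimal sketch of that "LLM as a stand-in classifier" idea while prototyping; ask_llm() and the three class names are placeholders, not part of any particular system.

```python
# Using an LLM as a provisional classifier while prototyping a pipeline.
# ask_llm() is a placeholder for your chat-completion call; the class
# names are just an example.

CLASSES = ["invoice", "contract", "support_ticket"]

def llm_classify(ask_llm, document_text):
    prompt = (
        "You act as a classifier. Choose exactly one label from "
        f"{CLASSES} for the document below and answer with the label only.\n\n"
        + document_text
    )
    label = ask_llm(prompt).strip().lower()
    return label if label in CLASSES else None  # fall back to None on unexpected output
```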

And then once you finalize your entire system, you can say, OK, this classifier works more or less well; I don't think it makes sense to invest time in training my own. But this other one I would really want to make more accurate, because it produces too many false positives or something like that. So you reach your proof-of-concept stage very fast and then you iterate. In the past, you would reach your proof-of-concept phase after maybe six months or a year of work. And in the end, you might say, well, it doesn't work as well as we expected; let's move on to something else. So you could save half a year or a year by paying $20 a month.

So you never know. Sometimes it's a huge time saver. Sometimes it will make you lose time. So you should be just experienced enough to distinguish the situations.

KIMBERLY NEVALA: Yeah. So, really quickly, to be or not to be agentic seems to be the question of the year right now. Can you tell us a little bit about how the definition of an agent or agentic AI has changed or morphed of late? And what organizations need to be mindful of when they are evaluating, sourcing, or deploying AI agents today?

ANDRIY BURKOV: Well, I think that many people heard about agents for the first time maybe last year. But I did my PhD in multi-agent systems, and it was 15 years ago.

So the science of agents and multi-agent systems has been there for decades. If you open the most widely used textbook in universities today, Artificial Intelligence: A Modern Approach by Russell and Norvig, chapter 2 is called "Agents." And you have a definition of what it is.

So in very simple terms, an agent is an entity that is acting in some environment, and it has some notion of optimality. It understands what actions it can execute in the environment. It understands what feedback it might receive from the environment. And based on the feedback received from the environment, it knows how to choose the next optimal action in this environment.
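
In code, that textbook definition boils down to a loop like the sketch below: observe, choose the action expected to best serve the goal, act, and use the environment's feedback to choose again. The environment object and the estimate_value() function are assumptions for illustration, not a specific framework.

```python
# A minimal sketch of the textbook agent loop: the agent observes the
# environment, picks the action it estimates will best serve its goal,
# acts, and repeats based on the feedback it receives.
# `environment` and `estimate_value` are illustrative assumptions.

def run_agent(environment, actions, estimate_value, max_steps=100):
    observation = environment.reset()
    for _ in range(max_steps):
        # choose the action with the best estimated outcome for the agent's goal
        action = max(actions, key=lambda a: estimate_value(observation, a))
        observation, goal_reached = environment.step(action)  # feedback from the environment
        if goal_reached:
            break
    return observation
```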

So the agent has a purpose, because it was optimized for a specific environment and it wants to achieve some goal in that environment. But what do they call agents today? Mostly, they just take an off-the-shelf LLM and, in the system prompt, proclaim that it's now an agent. You just say: you are a very helpful agent, and you can use a search engine if you need some information; you can use, I don't know, a calculator if you need some math; and now you must solve this business problem for me.

So, OK, you proclaimed it an agent, but how do you know that it will actually accomplish your goal with any degree of success? You don't. And even if you ask it, OK, repeat what your goal was, it will repeat it, because it's a language model. This is what they do. They repeat text. So it will rewrite exactly what you said in different words. And you will say, oh, wow, you really understand what I mean. Cool. You are a real agent. No, it just repeated your words. That is what it was trained to do.

So now you put this agent in some environment and you wait. After some time, it will come back with some result. For some cases, maybe it will work. For example, if you say that your goal as an agent is to bring me news on some specific topic, it will execute a search once a day, download all the news, and make some digest, like bullet points of what's important. Yes, it will work. Why? Because the LLMs were trained on a lot of internet content, so it knows what news is, what an article is, what HTML is, what bullet points are. If you ask it to make a hotel reservation, maybe it will do that too, because, again, it has seen so much hotel reservation code on GitHub and elsewhere that it knows what the main stages of a hotel reservation are.

But if you ask it something like - OK, let's say you have a company that builds structures, and sometimes you are missing some piece and need a replacement. The replacement comes from a different supplier, with a different description of the piece, and it's made of a similar material, but not exactly the same. And then you say, oh, you are an agent, and you will find the best piece for my tractor. I give you a 99.99999% guarantee it will buy you a piece for a BMW or a Mercedes, because it has no idea about your tractors. It probably has no idea about tractors at all, because I think there is not much detailed information on the internet about what tractors contain inside.

So it's a gamble. Again, if the data supporting your agent is online, there are chances that it will work. If not, depending on how much you will lose if it fails, sometimes it could be a catastrophe for a business if you just blindly put it online.

KIMBERLY NEVALA: So really, in the interest of time, and at the risk of grossly oversimplifying, I would summarize that by saying all the previous caveats, constraints, and cautions that you have spoken of during this session apply in this brave newish – not really, maybe renamed - world of agentic AI.

Well, with that, I know we are up on a hard stop there, and I appreciate your patience with our technical difficulties. Tech will get you every time, as I said. And really appreciate your insights and time today.

ANDRIY BURKOV: It was a pleasure to be here. And let me know. Maybe next year, we can talk about something else.

KIMBERLY NEVALA: Yeah, no, that would be great. I would love to get you back. Maybe we circle back next year and also talk about the issue of are we over indexing on LLMs right now and then undervaluing other forms of AI. And I still want to get your thoughts on how we can help those who are not, as you called it, algorithmically inclined develop good intuitions about this technology. So we will take you up on that and circle back in the new year. That would be awesome.

To continue learning from thinkers, doers, and advocates such as Andriy, please subscribe to Pondering AI now. You'll find us wherever you listen to your podcasts, and also on YouTube. In addition, if you have questions, comments, or guest suggestions you can write to us directly at PonderingAI@SAS.com.
