Orchestrating Public Sector AI with Taka Ariga

KIMBERLY NEVALA: Welcome to Pondering AI. I'm your host, Kimberly Nevala.

It has been an absolute pleasure to orchestrate this episode with Taka Ariga for you. In addition to being an accomplished cellist, Taka is the founder of Sol Imagination, and has previously served as the Chief Data Officer, Chief Artificial Intelligence Officer, and the Chief Data Scientist at the US Office of Personnel Management and the GAO, or the Government Accountability Office. So welcome to the show, Taka.

TAKA ARIGA: Thank you, Kimberly. Thank you for having me.

KIMBERLY NEVALA: Oh, absolutely. Now, have you always had an affinity or inclination that you wanted to work in the public sector, or is that something that has come about naturally through the course of your career?

TAKA ARIGA: You know, honestly, after graduating from college, I really wasn't sure what I wanted to do. I spent honestly most of my time practicing my cello at Johns Hopkins, pursuing an engineering degree. So that tells you where my priorities were.

But after graduating, I ran into this consulting firm, and they were focusing on delivering some financial management analytics for the Pentagon. And so that's where I got my start. But I have to say, I got bitten by the analytics bug pretty early on. And so I've been pursuing that path around analytics and data science ever since.

And eventually, once I joined GAO, AI became really the focus, not only in terms of the possibilities for delivering the public sector mission, but the risks that are inherent to these kinds of emerging technologies. And so this work has been at the public sector level, especially across federal agencies.

But I, as much as I can, try to dialogue with my partners at the international, state, and local level to make sure that we're learning from each other from a best practices point of view. But also, if we don't have to reinvent the wheel, we don't have to reinvent the wheel. So it's been certainly a very interesting, I think, journey so far. And so I'm looking forward to continuing that.

KIMBERLY NEVALA: And we're glad to see that you have at no point thrown up your hands and decided to just pick up the bow full time. Although I will be curious if you've ever been tempted.

TAKA ARIGA: I have. As a matter of fact, recently, I was asked to do a session on the intersection of AI and arts. And to be perfectly honest, it's a topic I've been avoiding because, as a classically trained musician, I have this ivory tower view of my tradecraft. Like, AI shall not infringe upon my playing.
But given that I was invited to give this talk, I had to really figure out what can AI do relative to the different kind of sound, different kind of output. And so I did struggle a bit.

But I think the long story short there is that I don't think AI will replace musicians. It will actually spur, I think, different imagination, different possibilities, much faster than you might be able to do traditionally - mixing different sounds, different styles. But it's not going to replicate, it's not going to replace, really, the focus on lyricism, on tonality.

And so it was a very interesting exploration. I have to say, it was very uncomfortable because I really did not want to intersect AI with what I do in my personal time.

KIMBERLY NEVALA: Yeah. Yeah. Well, as somebody who just recently tried to pick up the bow and is trying to learn to play the cello, I can say I have such great appreciation for humans doing that. More than I do necessarily when it is engineered. But I definitely take your point there, and maybe we'll come back to that as well.

TAKA ARIGA: Absolutely. Yeah.

KIMBERLY NEVALA: So you have worked both in the trenches, if you will, of the US government within public sector and also are working with organizations more broadly. Where do you see AI driving value today and/or which areas do you think are most promising at the current moment?

TAKA ARIGA: Yeah, I think where we are on this AI journey is that the novelty of prototypes is starting to wane a little bit. A lot of organizations sort of rushed out there to buy things or develop things. They were able to successfully come up with a prototype. And now, we have these questions around, OK, that's a great prototype, but how do you scale it? How do you actually measure value?

So from a public sector perspective, I usually talk about this in three tranches. There are the personal productivity kind of capabilities - like Office 365 Copilots, summarizing meetings, transcribing meetings, things of that nature - that are I think very pervasive. And a lot of people are certainly using it.

And then, the next layer up to me are some of these operational types of use cases. Coding, for example, software coding. Some of these fraud, waste, and abuse kind of capabilities, I think, are certainly exciting, but that's a different layer of complexity.

And then, I think the most difficult part, at least for the public sector, is around mission effectiveness. How do we make sure that the warfighters have the latest intelligence? How do we make sure that we're connecting veterans with the right benefits? Those kind of mission applications have high consequences. If we get it wrong, somebody might die. If we get it wrong, we might infringe upon somebody's privacy or civil liberty.

And so I think the mission part is difficult. We're seeing a lot of adoption around personal productivity. We're seeing a lot of these operational use cases. And more and more, as we build muscle memory on how we develop these and how we deal with the governance issues, I think we are starting to see the scaled applications of AI.

And so I think that's really, to me, the theme of 2026: how do we go beyond these novel prototypes?

KIMBERLY NEVALA: And are there areas where we have aspirations to use AI where, given the nature of public sector organizations and the nature of their missions or charters, you think we're overreaching with AI? Or where our aspirations don't line up well with what the technology can, or just should, deliver in this context?

TAKA ARIGA: Yeah, no, that's a great question. When we talk about a governmental entity, usually the word bureaucracy comes to mind. We have layers and layers of these analog processes that we've built over decades. It's very difficult to then just bolt on a purely digital and, more importantly, probabilistic solution like AI and say, well, if I put AI here, I am magically going to solve all of the existing challenges. The underlying processes themselves may have been defective to begin with. So automation alone doesn't usually, I think, address the totality of the issue.

And within the public sector, I think one of the key differences from maybe the commercial sector is that we're not in the business of monetizing data. As a matter of fact, we're in the business of protecting non-public sensitive information for the purpose of serving our mission. So how do we deal with the governance of that underlying information? How do we deal with the sort of sovereignty and integrity of governmental data?

So one example here is that most of us are familiar with ChatGPT as a browser-based service. But for public sector entities, that is probably one of the gnarliest use cases. Because you don't want someone to upload a non-public, pre-decisional policy document just to have a rewritten version of it. You don't want to upload a performance evaluation with personally identifiable information just to have it summarized. And so those are the kinds of things that we really want to think about: how do we protect privacy? How do we protect the integrity of the data?

But also, how do we think about all of this proliferating unstructured data? We're drowning in policy documents and reports and PDF files. A lot of our systems are traditionally built to handle structured data, transactional information. And so, how do we make sure that we can integrate within that environment, but also provide the level of context? A certain term may have one semantic meaning at one agency and a different meaning at another. But as users, we don't necessarily think about that semantic consistency.

So I think context building will at least be a very important part of the AI journey for the public sector.

KIMBERLY NEVALA: And do you think leaders are more aware of this now?

Because, certainly, I think with things like ChatGPT, large language models, they look like almost the perfect off-the-shelf solution for dealing with a lot of these kinds of unstructured documents, right?

TAKA ARIGA: Yeah.

KIMBERLY NEVALA: Upload them. It'll summarize them. Right. It in theory or at least in publicity or PR, will help just dig through and organize that for you by pulling out the bits. But what I hear you saying is that the data foundation still really matter here. And it's not a panacea. And so maybe you can speak to that a little bit.

But also, we've started to notice this use, particularly in popular usage - and I'm wondering if you're seeing this at the decision maker level too - where we're using the term AI very often to mean something like ChatGPT. So how do we also reconcile that sort of vision of AI if, as you said, ChatGPT is the AI?

TAKA ARIGA: Yeah. The way I talk about AI often with decision makers is in the context of different archetypes. Certainly, ChatGPT - and I'm not picking on ChatGPT; there's Anthropic, there's Perplexity, other browser-based services of that type. That's one archetype.

Most of us are familiar because we might use ChatGPT to plan our vacations. We interact with different forms of AI for recommendations. Even within Office 365, we may have a sort of Copilot at home to help us create presentations, Excel spreadsheets, things of that nature. But that's just one archetype. There are other archetypes that are more bespoke.

So for example, at OPM, we were trying to develop an AI use case to streamline the federal hiring experience. Now, I can guarantee you there is no commercial product built specifically for that purpose. And within a federal entity, there are a lot of unique security requirements. So with product vendors that have never heard of a concept like FedRAMP, we're having a different level of conversation. And so under those circumstances, it may make sense for an agency to build their own AI-enabled solution, whether that's in the cloud or in some other construct. But that's a completely different archetype. You're building that solution inside the agency, within the infrastructure boundary. You may have more controls around what comes in, what goes out. And so your system is a little happier under that construct.

Another archetype deals with all of these bespoke products that we may purchase. Grammarly is a good example of that. Canva is a good example of that. Many of these products have AI-enabled capabilities, but they're very narrowly focused. So Grammarly is not going to do my policy analysis for me, but it will be great at re-imagining my writing. And you've started to see this proliferation of these narrow, focused AI products that, a lot of times, are not easily integrated. They're not interoperable. So you end up with all these downstream manifestations of not only discrete products, but also the kinds of risks that are inherent to AI.

Especially, I always say, one of the key differentiators of the public sector versus maybe a commercial entity is that we don't get to say, oops. We can't say, we violated your privacy, my bad. Or, somehow we infringed upon your rights, or we released some documents we shouldn't have. So we have to be very mindful of how we manage our infrastructure. How do we deal with the data foundation? The technology piece is actually relatively easy, right?

And your first question is around: are leaders starting to recognize these foundational issues? I think without a doubt, the answer is yes. If you go back a couple years on the OMB inventory of federal AI use cases, it's been growing at a pretty amazing clip. I think the first inventory was less than 1,000, and the next year was like 1,300. And the last year was about 1,700. We're still waiting for the 2025 inventory. That should drop any moment now.

And so I think a lot of agencies are understanding the possibilities of what AI can do. But I think they're learning the hard lesson that, for example, we can't just shove all these documents into a model and expect clarity. We can't just bolt on some retrieval-augmented generation solution and expect context. We have to train our workforce. This concept of human in the loop is meaningless if the workforce themselves don't understand the probabilistic nature of AI. In other words, the answer you get on Monday may be different than the answer you get on Thursday. Under that scenario, how do you make decisions?

So I think all of that is now coalescing into heightened expectations around, well, we're spending all this money. What is the value that we're getting out of it? And I think there is no substitute for really focusing on the foundational elements of data, AI governance, technology, cybersecurity, workforce development. All of that has to work together in an agile way so that governance doesn't come across as a brake on innovation but can coexist with it. And, as a matter of fact, be an accelerator of innovation.

KIMBERLY NEVALA: Now, one of the things that we've seen, or I've seen, is that one of the counterintuitive metrics for governance working well is the ability of that governance organization to say no. Not necessarily to say yes. Because right now, in a lot of places, it is almost like AI is becoming the culture. It's the short-term objective: AI in everything, everywhere.

But organizations also know it's pretty easy to say yes to everything. It is very difficult to, in a structured yet resilient and responsive way, know when to say no and when to put the right guardrails up. And so this struggle to ground or balance aspirations for innovation with the capacity to deliver, to balance that appetite for innovation with the tolerance for risk and change, is something that existed well before AI but in this current moment is particularly acute.

What are you seeing in terms of that? And where are organizations going astray? And what can they do to maybe bring that all together in a more effective way?

TAKA ARIGA: Yeah, and I couldn't agree with you more. When I was at GAO, I always said I built my career on the number of no's I told people. And it's not because I'm trying to be skeptical or cynical about AI. It's really trying to focus.

What is the underlying problem that you're looking to solve? We should not be walking around the agency hallway with an AI hammer to say, who has a nail? Who has a nail? But really, what is the bottleneck? What is the challenge that you're facing? Let's focus on the problem and then evaluate whether that problem actually requires AI to address it and is worth the level of complexity.

Because AI is not something that you set and forget. There are a lot of issues around hallucination. There are issues around data drift and model drift and things of that nature. So you have to be able to sustain it. So even with a fantastic idea, when I was leading the Innovation Lab, I would always ask: well, if we were magically successful, how many people would this benefit? Five people? 50 people? 5,000 people, or the entire population? And how do we scale that?

And also, the question of ownership: can your organization sustain this product if we're successful? Because when I was leading the Innovation Lab, I was not leading the sustainment lab. My job was to explore the use of emerging technology. But once we're successful, we have to transition that to an IT organization or to a product owner. And if there's no viable path toward that product ownership, you don't get to just drop an idea and say, this is now your problem.

So that is really thinking through the life cycle: the idea is great, but do we have a path to scalability? And can we focus the limited resources we have on the use cases that deliver the most bang? Again, back at GAO, we did a fairly in-depth study of the agency's capacity to productionalize any software capability, AI or otherwise. The short answer was one. Given the size of the agency, given the complexity of our mission, we could, at best, productionalize and scale one solution.

So having 30, 40, 50 different prototypes floating around is not actually helpful. If anything, it's probably a great example of wasteful spending. You are learning that muscle memory, which is great. But we're spending all this money on tokens, all this money on infrastructure, with no viable path around how we scale, how we deliver value. And frankly, who cares if we're successful?

And so at GAO, our innovation product life cycle was very much focused on these iterative questions that we ask along the way. We do rapid prototyping and we tried to make sure that we're confirming the value proposition. And all the way along the path here, we have one eye towards scalability. To say if we're successful, what would it take to scale? What would it take to sustain? What would it take to upkeep this capability going forward?

And by the way, since we're a public sector organization, how do we give this capability away so that other agencies don't have to reinvent the wheel? We're not here to monetize. We're here to help other public sector agencies be more effective. So if OPM is developing a use case to improve federal hiring experience, that use case is applicable to every other federal agency, as well as state and possibly even local level. So if we're able to crack that nut, I always say plagiarism. This is not college. We love plagiarism when it comes to--

KIMBERLY NEVALA: Please plagiarize.

TAKA ARIGA: Plagiarism is a great thing, right? We want to make sure that we are uncovering best practices, but also lessons learned. Give it to someone else. Maybe they can build upon it, and we can learn back from them.

I don't think we do enough of that at the public sector level. It's always, oh, let's not air our dirty laundry, which I don't really understand. There is no shame in trying something that doesn't work out. We don't necessarily use the Silicon Valley mentality of fail fast. I don't think that's an appropriate mentality for a governmental entity. You have a fiduciary responsibility to make sure that you're using taxpayer dollars wisely and effectively.

But you can pivot that to say, can you learn as much as you can, but also share that learning so that you're not reinventing the wheel at the agency boundary? And we're starting to see some of that at the governmental level. The General Services Administration has pushed out this OneGov approach to AI procurement. So you're starting to see the federal government acting as one single buyer as opposed to thousands of individual buyers. I think that makes sense. We're starting to really think about how we break down data silos - caveat: correctly. And so I think a lot of these sorts of efforts, while not as fantastical-sounding as AI, are a necessary foundation that we have to build for AI to be successful.

KIMBERLY NEVALA: And so one of the things that you've been alluding to throughout is rethinking governance, perhaps, and developing a more responsive governance model. And I know that you did develop an AI accountability framework. But thinking about governance - even there I was about to say "agile." But I don't think agile is quite right. Agile comes with some expectations, I think, for speed. And speed is often, but not always, the objective, as you said.

Are there particular elements when we think about developing a governance framework for AI and beyond that we need to really think about? Particularly in terms of things like policies and permissions so that it is appropriately responsive and resilient, but not overly so?

TAKA ARIGA: Absolutely. And we piloted something similar at OPM. We didn't call it agile governance, but that's what it was.

I think one of the first goals of that is making sure you have this horizontal purview across different risks. AI risk is not a data scientist's problem. It's not a CIO's problem. You have to have lawyers at the table. You have to have procurement folks at the table. You should probably have your bargaining unit representative at the table if there is a human capital impact. You should have your data people. You should have your technology people.

And so the way I was thinking about it is, how do we have these horizontal conversations in a rapid manner that doesn't necessarily live within your traditional governance model? Which oftentimes goes something like this: we meet once a month for an hour as an agency and we talk about everything that is facing the agency from a risk perspective. So at best, I will get a five-minute agenda to talk about AI.
Now, there's no possible way we can address the nuances and the decision making required in that five-minute period. Especially in a room full of people who may not necessarily be swimming in the weeds of AI solution development.

So part of what we did is have an offshoot of that agency governance structure where we actually met on a biweekly basis. We took sort of a two pizza rule to say we're going to meet quickly. But one of the key enablements there is that we were empowered to make decisions. So if we see certain risks, we will ask those product owners to make certain adjustments. If we don't see certain evaluation procedures done, we will say, you know what? This is not allowed to go forward unless we have certain things. We were comfortable making decisions, even though we may not always have the completeness of that information.

And so I think that was an important aspect: making sure that you are talking horizontally more frequently, but also empowering people to make decisions along the way, so that you don't have developers going all the way down a path just to realize, oh, we've got to start from step two again. That was, I think, the beginning of how we were thinking about AI in particular. How do we structure that governance at the speed of innovation and empower it to really make those tactical changes?

Some of these changes were not popular. When I first joined OPM, I knew people were using commercially available AI services like ChatGPT and Perplexity. I just wasn't sure how many services people were using. So I worked with our IT organization and did a perimeter scan of how many AI services people were accessing. 55. That is a shocking, almost unmanageable number of disparate AI services. So that presented a risk itself right away.
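A perimeter scan like the one described can be approximated by checking outbound traffic against a watchlist of known AI-service domains. Here is a minimal, hypothetical sketch in Python; the log format, the domain watchlist, and the helper name are invented for illustration and are not the actual tooling used:

```python
# Hypothetical sketch: count the distinct AI services appearing in an
# outbound proxy log. The log format and domain watchlist below are
# invented examples, not the actual scan tooling used at OPM.

KNOWN_AI_DOMAINS = {
    "chat.openai.com",
    "www.perplexity.ai",
    "claude.ai",
    "gemini.google.com",
}

def ai_services_accessed(log_lines):
    """Return the set of watchlisted AI domains that appear in the log."""
    seen = set()
    for line in log_lines:
        # Assume each line looks like: "timestamp user destination_host ..."
        parts = line.split()
        if len(parts) >= 3 and parts[2] in KNOWN_AI_DOMAINS:
            seen.add(parts[2])
    return seen

log = [
    "2025-01-06T09:14 alice chat.openai.com GET /",
    "2025-01-06T09:15 bob www.perplexity.ai GET /search",
    "2025-01-06T09:16 alice chat.openai.com POST /chat",
]
print(sorted(ai_services_accessed(log)))
# prints ['chat.openai.com', 'www.perplexity.ai']
```

In practice the same idea would run over real proxy or DNS logs, with the watchlist maintained as services proliferate.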

And we were in the process of developing a compliance plan and some of this governance structure. So we made a decision to shut down access to all 55 services. That did not make me a lot of friends. But I think folks understand that once this sensitive, non-public information leaves the agency's boundary, there is nothing we can do. And we're not going to risk our reputation based on some terms and conditions stipulated in a contract, because if something goes wrong, it is not the product vendor that is necessarily liable. It's me, first of all, as the chief AI officer. But the agency itself then also owns that risk.

So that was an example of some of these risk-based decisions that we took early on. But it's not just about shutting things down. It's making sure that we provide an alternative. So instead of people going onto the public internet to access this capability, we quickly stood up some of these general AI chatbot solutions internally and said: experiment there, as opposed to through an external browser. And so we made sure that we didn't just cut off the risk; we also provided alternatives so that some of these use case developments could continue to progress.
So that's an example. And these days, when I talk with clients, that's the kind of thing I think about. Don't view governance as a brake. View governance as actually an accelerator of innovation. Because if you do it right, it will help you prevent all of the rework and technical debt that you will likely incur downstream.

KIMBERLY NEVALA: And if I understood what you were saying there as well correctly, part of governance, then, is providing people the confidence to make decisions under uncertainty, in uncertain conditions. And you also alluded - or not alluded, said - earlier that part of the workforce enablement and part of this transition is helping people understand the difference in decision making when they're dealing with outputs from a probabilistic versus a deterministic solution. So I'm interested in how you have approached that from a learning and development perspective? Whether that's in the context of deployment or more broadly, both within your roles at OPM and GAO, or what you're seeing outside today as you work with organizations across the spectrum.

TAKA ARIGA: Yeah. And there's no shortage of training content. All you have to do is go on YouTube and type in any AI-related topic, and you'll get hundreds, if not thousands, of tips and videos to teach you the mechanics - anywhere from data engineering all the way to some of these model-tuning conversations. And certainly, there are a lot of training service providers that talk about prompt engineering and some of these architectural concerns. So I don't think there's a shortage of training content.

Where I do see a sort of desert is function-specific training: if I'm an accountant, if I'm an analyst, if I'm a lawyer, how do I use this kind of AI to do these kinds of functions? So if I'm a policy person and there's an agency-developed AI enablement, how do I use that tool in a way that allows me to bring in different data sets and analyze this content correctly, methodically, and in an appropriate way? And so I don't subscribe to the notion that AI will replace our jobs. I do think the difference will be the accountants and analysts and lawyers and procurement officials who use AI well versus those who don't.

But the predicate is that you need to make sure a procurement official knows how to use a very specific kind of AI. Teaching him or her how to use ChatGPT, useful as it may be as a foundational, introductory conversation, doesn't solve how we create terms and conditions. How do we make sure that we have record keeping? How do we make sure that we have some sort of audit trail for some of these workflows that we're developing? All of that is function-specific training that only the agencies themselves can develop.

So this is where the human capital organization's role is very important. To say this is not only to set the stage to say we're building AI to drive efficiencies and effectiveness. But this is the training that we'll provide so that you can be successful using it. And that goes, I think, a long way towards what you were suggesting, the kind of permission structure.

When I give AI-related talks, I sort of paint this landscape of the reality of AI. The policymakers and agency leadership think we are just wearing the Superman cape when it comes to AI: we can do all sorts of things, we can vibe code, we can deliver really valuable mission outcomes. When I talk to my boss about AI usage, they cringe a little bit and say, you're using AI for what? Is that appropriate? And then I talk to the technology executives, and they're melting. There are all these new skill sets they need around AI engineering, around context engineering, around data governance that, traditionally, maybe they haven't had within their organizations. And that's, I think, the opposite direction from the cost-saving mandate that perhaps they're under.

And then what that really meant is that collectively, we're just shrugging, to say, well, AI is great. But I'm not sure what I'm supposed to do with it, or what I am allowed to do with it. I certainly don't want to get my wrist slapped if I did something inappropriate. And so anecdotally, I was starting to see some of this "speakeasy" way of using AI. To say, hey Taka, don't tell anybody, but let me show you what I have on my phone. I'm like, oh my God. Don't tell me that.

So I think we need to be very clear about, for your organization, what is AI meant to do and not meant to do? And how do we then provide function-specific upskilling to those individuals so that they can use these tools successfully and with confidence? My friend Sarah Moffat said something recently that really resonated with me: AI adoption will move at the speed of confidence, not at the speed of innovation. All these new models, all these new exciting technologies, they're great. But if the users don't have confidence in them, you can spend all the money you want. You will see variability in adoption.

KIMBERLY NEVALA: Yeah. And I think this has been a longstanding, I don't want to say problem, but area of concern that we've had to address. Particularly within the public sector where we're dealing with services like social services. So it might be access to really critical benefits, or housing, or food support, or child supports, or making really consequential judicial decisions. And again, this has been with us since the good old-fashioned machine learning, predictive days - which are still quite in use - and is certainly getting increased attention with the advent of the latest and greatest flavors of AI, the foundation models and LLMs.

Because developing confidence is, again, a spectrum. When should I be confident? When should I not be confident? Do I understand, when something is making a prediction, for instance, whether it's a risk score about an offender or whether someone needs support? And I'm exposing my own bias here in hoping that the government will get to a point where we start to use some of these smart capabilities to provide services. As opposed to only going after fraud, or trying to decide who shouldn't be getting things, but deciding who should, and being proactive on that side.

But that issue of appropriately calibrating your confidence is both having to understand the context in which you're using the information and also the baseline capacity, or capability, of the technology itself. So is that also something that you've seen? And how do you think about addressing, or recommend people address, that?

TAKA ARIGA: Yeah, absolutely. And that's one of the reasons I'm actually very grateful to have started my public service career with the Government Accountability Office. Because if you've ever met an auditor, they are professional skeptics. They get paid to basically criticize what you did wrong with your program. And that extends to emerging technology like AI. They don't like anything that's automagical even if the answer is correct.

So when we were developing use cases within the Innovation Lab, we really had to be mindful of, first of all, where is this response being cited from? There's a difference between trusted, curated data sources from which a response has been generated versus the best hits of the internet. As you might imagine, GAO reports can't possibly rely on Reddit commentary or even cite Wikipedia. Those may be useful for background research, but we need to make sure that we are couching those citations appropriately.

Second part is reasoning. Before you ask a foundation model to go retrieve something, or to perform some sort of comparative analysis, give me your plan of attack. To say, step one, do I need this? Do I even have access to this information? Do I need this additional context from the internet, for example?
So if you are asking an LLM to do a calculation, maybe I need, as a stake in the ground, what is today's date? Just verify what today's date is, as the model may not necessarily have that information and needs to retrieve it from an internet API. And so go through that reasoning. Let me approve your plan of attack before you go execute. And once you retrieve that response back, then give me the reasoning again for how you came up with that particular rationale.

And then, by the way, remind me that I'm not perfect, that you should still review the LLM output. And some of this is a technical issue in making sure that we configure those LLM responses to reduce the sycophantic behavior. Don't just compliment the user on how smart and fantastic they are. Dial down the toxicity. Dial down the creativity. We're trying to use this capability for a very - in a good way, I think - bureaucratic mission. So dial down all of those well-documented examples of adoration and sycophantic behavior. We don't need any of that. And that's part of the configuration that we can do. And we can curate the data set so that the responses we get have much higher confidence.

And I'll give you an example. Certainly, when I was at GAO, historically published audit reports were one major source of data. We would love to be able to do some trend analysis of Medicaid spend, for example; compare and contrast different findings on the same programs. We also relied on congress.gov because that was the source for all of the legislative activities. We didn't have to rely on, necessarily, Google results for that. So we took a lot of care in curating what we considered a trusted data source, and we applied the kind of context engineering, we applied the kind of reasoning, but also the user interface, in a way that really put human-centered design front and center.

So in other words, this is not, like, data scientists and AI engineers created these tools - now everybody, please go use it. We really tried to solicit feedback in terms of what are the features that are necessary? What are the features that are not necessary? And certainly within that journey, there are a lot of surprises. We always have some supposition about how people might be interacting with AI, but maybe that's not how they intend to use it. And then I think testing is a big part of this conversation; to say this is not build it and they will come.

We need to test it not only to confirm the features, but figure out how we might monitor, for instance, if something goes wrong. And so how do we capture all of the prompts? How do we capture all the responses? How might we actually use AI to analyze those prompts and responses for appropriateness? How do we solicit feedback from the user to say, was this response helpful, not helpful, relevant, not relevant to help us, really, with the notion of continuous improvement?

That is, I think, a new paradigm in the technology space. Because usually we deploy a solution and we move on to the next one. But AI is something that you have to continue to care for and feed to make sure that it still does what you intend it to do. And, by the way, shut it down if the utility of that system no longer holds. So I think there are a lot of these operational mechanics that we really have to think about from a user's perspective.

One thing I strongly believe - and I learned this lesson the hard way - is that everyone says they want innovation. Nobody wants change. That doesn't work. We shouldn't just use AI to automate existing processes. I think it's critically important for us to have the room and space to deliberate. Should we actually reimagine what that process should have been now that we have AI?

I'll give you one example. Auditors have a very report-centric way of operating. We have this audit. We produce this report in a PDF format and we move on to the next audit. We produce that PDF report. And that PDF report production process takes upwards of months and months because you have to collate narratives. You have graphic design. You have this quality assurance process that goes back and forth.
I'm of the mind to say, well, reading PDF reports is probably not how most people consume knowledge these days. Can we take more of a digital first, digital native approach to those insights? Critical insights, critical findings, critical recommendations. Not in the PDF form, but maybe in a chatbot response that we can serve out to the general public? And so that means reimagining that product development cycle. Instead of treating everything as a PDF first, maybe PDF as a secondary consideration.

And so really having the courage and the space to allow for that reimagination, I think, is a critical element of AI adoption.

KIMBERLY NEVALA: And the example you just gave is interesting as well, because in that circumstance, the chatbot, if you will, that LLM-based interface, is really about communicating the outputs. It is unlikely that that is the full-stop, holistic system that actually develops what those outputs and the analysis are as well.

So as you were talking about the rubric that you were somewhat using for thinking through, particularly, designing systems that are LLM-based, it struck me that we could abstract that up a level and use it also to help drive thinking about what is it that we are actually trying to do? What's the problem? And what, if any, flavor of AI or analytics is the right one to solve that problem with the level of confidence, repeatability, et cetera, that we need?
Because there is so much work that really is required with the newer technologies, which look like you can just take them off the shelf. But what we know about that is you really, really can't. And there may be other old-fashioned analytics, old-fashioned machine learning algorithms, that are faster, cheaper, easier, and more appropriate for the task.

And so having that kind of a decision-making rubric that allows folks to think through and make sure we're picking up, as you said, the right solution for the right problem is interesting or important as well, I would think.

TAKA ARIGA: Yeah, absolutely. I think part of this excitement toward AI adoption largely stems from the amount of money that has been sunk in by many of these foundation model companies. I think I read some figure somewhere that the totality of investment in AI last year exceeded the GDP of Saudi Arabia. So we're talking about an extraordinary sum of money.

So of course, there's a tendency to say, well, we've got this fantastic foundation model. Let me walk around the agency with an AI hammer to see what I can strike. But we also live in a reality where resources are not unlimited. At some point, I think it's very appropriate for people to ask the question and say, well, you promised productivity gain. Was it meaningful? You could spend $20 million buying a certain product. But if your productivity gain is marginal, was that really worth the taxpayer dollars?

And I mentioned the fraud, waste, and abuse context as well. It's not good enough just to identify potential anomalies. We need to be really thinking about the level of error that is potentially introduced in that conversation, because you don't want to send your investigator down some false-positive rabbit hole. That is just going to consume that individual's energy. And so I've talked to compliance folks as well who say, if you're just going to give me a bunch of red flags that are false positives, I'd almost rather you didn't do it.

So that then becomes a backlash against adoption. To say we can't just be identifying anomalies. We have to really think about the cause and implications of just dumping all those anomalies on the doorsteps of the investigators. Because AI is not at a point where you can, again, automagically say, I found fraud. You're guilty. That's not how that works. We identify anomalies. Then we have to go through some procedure to validate, to investigate, and to substantiate. And it is the court of law that really then determines whether something is fraudulent or not.

So it's really keeping the end users in mind to say, how do I elevate that level of confidence so that the AI solution is additive, not a burden placed upon investigators and auditors?

KIMBERLY NEVALA: And does this tie into something you've written, and I believe spoken, about as well - especially as we move forward with aspirations and applications for agentic AI, where we are automating certain decision making? And again, in the public sector, in some of these areas, there is a heightened level of risk just due to the context in which these decisions are being made.

And you've said we are developing almost a reflexive reliance on humans in the loop. Why is that problematic, or not a panacea? And are there particular concerns that we paper over, or maybe miss entirely, when we just assume a human in the loop is the way to go? I mean, this is one good example: if you're sending them all the stuff and it's all just wrong, at some point their eyes are also going to glaze over. But what else do you think about when you think about this reflexive reliance we're creating?

TAKA ARIGA: Yeah. I think part of that is as, maybe, civil servants, we want to be in a space where the professional judgment can apply.

So there are use cases where I actually don't think human in the loop is appropriate. Cybersecurity, for example. AI can go through millions and millions of cybersecurity logs much faster to identify anomalies, and you probably want that solution to automatically take some action, right? Stop the access, whatever the case may be. And under that circumstance, you may not want to insert a human and say, could you take a look at this before I do anything? By the time that happens, it may be too late. So I think there are certainly use cases where a human in the loop at a selective level may be appropriate, but it's not always the necessary element.

The second part of it is we all like to feel like we're relevant. So if AI is doing all of our work, well, what am I supposed to be here doing? So part of that is not necessarily to always think about how we insert humans along the way. It's really thinking, from an outcome perspective, what are we trying to do? And if we're able to give you a 20% productivity gain, might you do that work in a higher-value way, as opposed to inserting yourself, reviewing that particular outcome?

Now there are certainly tons and tons of examples where you want to make sure that there are human reviewers of those outputs. So I mentioned audit, for example. If you're comparing and contrasting different policy documents at the international level, federal level, state and local level, you'd probably want an individual experienced in that particular policy to say, OK, this reasoning makes sense or doesn't make sense. Let me try different things. We certainly want to be able to do that.

But it's not necessarily helpful if that person can't really identify that maybe some of these outputs are hallucinated. You have all these lawyers getting sanctioned by judges for citing fictitious cases. Not because they're incompetent, but because some of these citations come across as very credible. We didn't train these lawyers with the kind of muscle memory that says you should probably validate that. Even if you get the citation link, click on that link and corroborate it. Is it, in fact, valid? You have law libraries at your law firm. Look up that case to see if it even exists.

And so there's a tendency for us to see a response and say, oh, I reviewed it, it sounds good to me, so let me go with it. And so that goes back to the notion of workforce training. How do we make sure that we are continuing to hone that professional judgment, that professional skepticism, in a way that is helpful? But not necessarily to always interject the notion of, oh, we have that human in the loop, and assume the problem goes away.

At least within the public sector, there are certain tasks that, by design, should be deliberative. They should take time. Policy iterations, for example. You probably don't want a congressional staffer to say, dear ChatGPT, how do I create policies around X, have it draft them, and then have members of Congress start voting on them. I don't think that's necessarily appropriate. But you might want to iterate a draft from AI; compare that to other study groups having listening sessions. I think AI could have a role in summarizing or collating those contents, but there are specific public sector functions for which it is absolutely appropriate to be deliberative.

So in the context of will agentic AI take over - at least I don't think so within the public sector. I think even in a lot of commercial organizations, you're starting to see, well, we probably don't want to take AI-generated code and put it in a production environment automatically. Probably someone should take a look at it and make sure that it is appropriate, and that it's accurate. It's free of defects, bugs, and all that good stuff.

So I do think agentic AI will be disruptive. And frankly, with the notion of agentic AI, do the loops even exist? If AIs are just automatically taking these actions, where are you planning on inserting humans along the way? So I think this is where going back to the notion of governance, to say agentic or not, AI or not, what is the workflow that we're envisioning here? And what are the intersections of appropriate human oversight? It could even be machine oversight. You can have AI evaluating the outputs of AI. I think that's perfectly fine.

But really, it's thinking more in a hybrid space as opposed to one or the other. I think we're the last generation that will likely manage an all-human workforce. Going forward, the reality will likely be a hybrid workforce of humans and machines. And so knowing how we can leverage both strengths, I think, is appropriate. As opposed to, oh, we don't trust machines, so humans always have to intervene at some point.

KIMBERLY NEVALA: Very interesting. All right. So a final question and then a final confession. Or an opportunity for a final confession, I will say. So my final question for you, really, is what question or questions are folks like myself or leaders that you're working with broadly not asking or do we need to really be focusing on as we move forward here in this brave new world?

TAKA ARIGA: Yeah. I often stand on this soapbox, in that I think an organization that is not data-ready is likely not going to be AI-ready. I'm having a hard time thinking of any organization that can just rely on whatever is coming across from the internet. So if you're using enterprise-level AI, you likely have to rely on your internal data; whether that's customer service, whether that's sales, whether that's public sector policy, things like that.

So that means you have to get your data engineering, data architecture, and data governance right. And part of that means data quality. If it's bad quality, why are you bothering shoving it into an LLM? Another part is interoperability, in terms of being machine readable. And by that, I don't mean PDF reports. You have to really parse that out in a way that is easy for a foundation model to consume. And the third piece here is context building. We can't just shove documents, and photos, and videos, whatever, into an LLM and just expect clarity.

How do we build that context engineering in such a way that we have clarity around what topics we're talking (about), what entities we're talking (about), what we are not talking about? So the word "finding," for example, in an audit-specific context means something very different than finding a widget or finding a treasure. So those are the kinds of context engineering that I think are super important.

Otherwise, what I see a lot of organizations say is, this is interesting output. It's giving me a lot of interesting reasoning. I'm not sure I'm going to trust it, because something is not quite right. So they're sort of swimming in the 60% to 80% confidence range, which is, OK, interesting. But I think most organizations would like to rely on decisions based on at least 90-plus percent confidence in the output. And so that is the kind of conversation that we're trying to get to, to say, the technology part? That is relatively easy. But get your data ecosystem right.

KIMBERLY NEVALA: All right. Now, because I'm completely fascinated by your being such an accomplished cellist. And I was completely charmed by the title of your talk, "Confession of a Hypocritical AI Evangelist," where you were talking about reconciling, or thinking about, the intersection between your work as a classically trained human musician and how AI works. So I guess I will use that as the pretext for asking, for the purposes of this conversation and for the audience, is there sort of a last confession, Taka, a last word you'd like to leave with them?

TAKA ARIGA: Yeah. For someone who works in the AI world, I am maybe a little over-cynical when it comes to what I can and should do.

Certainly, I'm no stranger to the fact that I can use ChatGPT to create a travel itinerary, and I've done that before. It was OK. It wasn't the greatest itinerary. I know you can use agentic AI in a way that says, oh, help me find the lowest airfare from this point to that point. And, by the way, go and book it for me. I have some trepidation around how much information I personally want to relinquish in the name of productivity. And even in writing, I sometimes hesitate to use AI to help me rewrite drafts, because that is an opportunity for me to think through my own reasoning; to say, how do I want to convey this particular topic? How do I want to convey this narrative? And if I can't clearly write it out myself, I don't think I can talk about it in a convincing way.

So I do have some resistance, in my personal life, to saying, oh, ChatGPT is great. I'm a subscriber to it, but I probably am not using it as much as people think I'm using it. Same thing with all of these AI tools that we have at home - Alexa, the Siris of the world. I'm in a huge fight with Alexa right now because it won't play the radio station that I wanted to play, and I literally have to punch it out on the keyboard. Which kind of defeats the purpose of the convenience.

So knowing that it is not perfect, there are certain complex policy, governance-related issues that I just would like to personally think through with a level of clarity that I'm confident in. And maybe, at some point, I will rely on AI to refine around the edges. And one of the use cases I love is how AI can help me create illustrations in a way that I can't possibly create in Adobe Illustrator. To say, here's the concept I'm describing. I used that for the paper I published around AI culture eating strategy, because there was nothing that quite fit the bill; that said, well, how do I describe a human brain eating a robot as a concept of culture eating strategy? And that took no more than five minutes, tinkering around the edges using ChatGPT. So that was great. But the actual content of that paper, I really resisted using AI to write in a way that's convenient for me, because I need to lay out the logical order of that narrative.

KIMBERLY NEVALA: Well, that is an excellent, excellent reflection to end on. So with that, I will thank you for your time, and all of the insights you've shared with us today, Taka.

TAKA ARIGA: Thank you, Kimberly. It was fun.

KIMBERLY NEVALA: Awesome. Now to continue learning from thinkers, doers, and activists, such as Taka, you can subscribe to Pondering AI. You'll find us wherever you listen to podcasts and also on YouTube.

Creators and Guests

Kimberly Nevala
Host; Strategic advisor at SAS

Taka Ariga
Guest; Founder of SOL Imagination