The Shape of Synthetic Data with Dietmar Offenhuber

KIMBERLY NEVALA: Welcome to Pondering AI. I'm your host, Kimberly Nevala.

In this episode, I'm so pleased to bring you Dietmar Offenhuber. Dietmar is the Department Chair of Art and Design at Northeastern University. In addition to his professorial duties, he researches the material and sensory aspects as well as the social implications of environmental information and evidence construction.

For purposes of this conversation, construction may be the operative word, as he joins us to discuss both the philosophical and the practical ramifications of synthetic data. Welcome to the show, Dietmar.

DIETMAR OFFENHUBER: Thanks, Kimberly, for the invitation. It's great to be here.

KIMBERLY NEVALA: Absolutely. Now, your educational background is interesting because it spans from architecture to a Masters in Media Arts and Science and a PhD in Urban Studies and Planning. And I think the link between architecture and urban studies and planning is probably somewhat intuitive to most of us. Maybe not so much media arts and sciences. So tell us a little bit about what inspired your path and has ultimately led to your current interest in how we understand and use data today.

DIETMAR OFFENHUBER: I would say that in everything I did, I was always interested in cities: as human places where social interactions happen, where there is a lot of activity, a lot of change, a lot of intersection with technology, and so on. So my interest as an architecture student was always mostly on the urban level. But in the middle of my architecture studies, I actually dropped out of school to join Ars Electronica, a media art organization that had a research lab. It was a very interesting opportunity. I later returned and finished my studies.

So this connection between data, cities, and visual representations or perceptions of cities: that was the place I started from. Media arts and sciences refers to the Media Lab at MIT, where I joined Judith Donath's Sociable Media Group, as it was called back then. She was really interested in the connections between mediated spaces and physical space.

And my PhD was in Urban Planning. So compared to architecture, much more of a social science perspective, but again focusing on similar topics: dealing with urban data and how it corresponds to urban activity, to what people are doing, what people are experiencing, and so on. So that connection between data, representation, visualization, and cities was always the connecting element.

KIMBERLY NEVALA: Yeah. As you've been talking about it, that just makes good sense: that connection between the information and the data we have, and how it actually relates to the activity and the engagement in reality, so to speak.

You recently wrote a paper on the topic of synthetic data and how synthetic data may challenge us or require us to shift some of our traditional perspectives on data or how we conceptualize data and understand data. And before we talk about that shift, it would probably be important for us to define what we mean by synthetic data for purposes of this conversation. So when you think about synthetic data and talk about synthetic data, what is it that you're describing?

DIETMAR OFFENHUBER: I think in the context of this conversation, we're probably mostly going to focus on data that is generated by AI models to resemble observations: data that mimic traditional, observed data.

But of course, synthetic data is a much broader topic. Data that is artificially generated, not derived from observations of the world but generated by algorithms, by all kinds of operations, is a very old topic.

I mean, look at what network scientists were doing in the 1950s and 1960s: they generated random networks to study them. Those were already a form of synthetic data. Or in statistics, you have a lot of statistical methods that require a complete data set, without any gaps, because certain operations don't work otherwise. So what do you fill the gaps with? Do you put in some zeros that might create an artifact or a bias?

So data imputation, how to augment a data set, is also a very old topic. And then synthetic aspects creep into data sets in all kinds of other ways. I'm sure you have had a lot of that on your show. But it starts, if you're talking about urban phenomena, where you deal with census data: the boundaries of the census tracts shift, so the data sets are not really compatible. You have to involve some form of synthesis to harmonize them, to make them comparable.
And those can be very, very small interventions or very big interventions.
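(The imputation point above can be sketched in a few lines of Python. The numbers and the simple mean-imputation strategy are invented here purely for illustration, not taken from the episode.)

```python
# Filling gaps with zeros biases the mean, while even simple mean
# imputation preserves it.
import statistics

observed = [4.0, 5.0, None, 6.0, None, 5.0]  # a data set with gaps

# Option 1: fill gaps with zeros -> drags the mean down (an artifact)
zero_filled = [x if x is not None else 0.0 for x in observed]

# Option 2: impute the mean of the observed values -> mean is preserved
known = [x for x in observed if x is not None]
mean_imputed = [x if x is not None else statistics.mean(known) for x in observed]

print(statistics.mean(zero_filled))   # biased low: 20/6 ≈ 3.33
print(statistics.mean(mean_imputed))  # matches the observed mean: 5.0
```

Even this tiny example shows why the choice of fill value is itself a synthetic intervention in the data set.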

So in this paper, I made a small typology that covers some of these synthetic aspects.

KIMBERLY NEVALA: And without going into all the typologies you identified: one of the things that jumped out - and I think it's obvious when you think about it in retrospect - was that synthetic data really does exist on a spectrum. There are maybe those two discrete endpoints and then some waypoints in between.

So as you think about it, what are those endpoints? And why is it important, or is it important, for folks to differentiate where they are on that spectrum when thinking about how and where to apply different synthetic data techniques?

DIETMAR OFFENHUBER: Yeah. I mean, I guess there are different dimensions of it.

One of them is really: why do we synthesize data? And I think right now the two most common reasons are, first, to protect privacy, as in differential privacy. There, the synthetic methods are a little gentler because it's more a kind of recombining: just making sure that individuals, or their identities, are not disclosed in a data set - which is actually quite difficult. But that's a different topic.

At the other end of the spectrum, we're really looking at synthetic training data that are not just augmentations of data sets we don't have. Sometimes they can be completely improbable outliers that are meant to make the models a little more robust. Or we have data sets that represent different outcomes of the same situation. We only have one world and only one way things went; there's no counterfactual scenario. But synthetic data allows us to generate counterfactual scenarios and train models on these different possibilities.

KIMBERLY NEVALA: One of the things the paper proposes is that when we are thinking about and applying synthetic data, we might need a different conception, a different way to understand it. And I'd like to talk about the difference in those models and why that might make sense. Perhaps we can start with what our traditional conception of data has been, which you called the representational model. So what is the representational model of data that we've all known and loved historically?

DIETMAR OFFENHUBER: Yeah. I mean, data are - or is - a form of representation. This is how most people would understand it: data sets consist of symbols that reference things that exist in the world. For example, a location data point references a place in the world.

And the idea of what I call the representational model is that the only thing that matters is this reference: the meaning is in the data value. So it doesn't matter if the data set is etched in stone or sung from a mountaintop or found on the internet. It is just the data value, and what it represents in the world, that matters. Which, of course, is very convenient, because then we can combine data from many different sources. And we know, of course, that it's a reduction, a simplification. But that is very often a good trade-off to make.

But of course, there's also a disadvantage, because if we completely ignore where data come from, we miss that different origins come with different biases, different issues, and so on. If you again take the example of geolocation data: it might be acquired from a GPS device. It might be geocoded from an address. It might be acquired by just throwing a dart at a map. All these methods come with different errors and biases. But if I only focus on the data content, I don't see that.

KIMBERLY NEVALA: And there is an implicit assumption in this model as well: that the data directly has some sort of - maybe not 1-to-1, but 1-to-something - relationship to an actual real-world concept or construct or thing.

Dr. Erica Thompson talks about this a little bit. And she very much cautions us that the data we collect and the models we erect are, just as you said, representations, not replications of the actual physical object or phenomenon we are trying to model. Although it is very easy for us to lose sight of that sometimes. What was the old adage? All models are wrong, but some are useful.

You also note that even in this traditional view of data, there are some shortcomings. And a related, although slightly different, point the paper makes is that sometimes data is just a byproduct. It's not really an encapsulation, truly, of the real-world concept or construct. Why was that important for you to point out and for us to understand?

DIETMAR OFFENHUBER: Now, this is a very good point, because even on the most basic level, not all data, or not all data that we work with, is a representation of something in the world. Sometimes it's really just a byproduct of some algorithm or something. You could think of such data as second-order representations, but that's not super useful. Often, they are just artifacts or objects that are used in some form of operation.

But before we go into these alternative ways to conceptualize data, you mentioned other disadvantages of this representational model. And one important one that I want to come back to is that if we think of data as references to things in the world, the world is, of course, not static. The world is constantly changing, constantly evolving, also in response to the way we use data, to how data is operationalized in the world. So that is also not really considered here.

And of course, representation has always been a very fraught concept, but it's also useful in some ways because it allows us to do science. It allows us to make deliberate assumptions and deliberate reductions, to use data in a way that transcends their original purpose. So we can look at a data set of night lights on Earth from space and use that as a proxy for economic activity. All of that depends on a certain understanding of representation.

But representation is not the only way we can look at data. So I'm not saying one thing is wrong and the other is right; those are different lenses. And representation, or media in general, has this effect: it creates its own reality. Once I have a data set, the data set creates its own reality, and I start looking at the world through the lens of that data set. So for that reason alone, I think it's important to look at alternative conceptualizations of data.

KIMBERLY NEVALA: Well, it is interesting, because even in that very simple example of trying to use lamps and lighting as a measure of economic activity or progress, that's a point in time. We might look at data as something static or immutable, versus something fluid that represents a point in time rather than an absolute reality. Certainly, there are other reasons why you might not have a lot of lighting in an area that is still economically prospering or developing. So it's one interpretation.

That's one aspect of it: not all interpretations are the same. But also, to that point, it's a point in time, and it doesn't necessarily represent - and maybe this is particularly true when we're talking about people-related data - the fluidity and the potential of reality and how it can change.

DIETMAR OFFENHUBER: Yeah, absolutely. So I think one alternative way to think about it - and we all know this from daily experience - is that meaning is not necessarily explicit in the words we are saying; very often, it is in the way we say them, in which situations we say them. So meaning is very often in the context.

And what I describe as a relational concept of data basically means that the meaning is contextual. It depends on how people use a data set to make an argument in a concrete situation, let's say in a courtroom. It's about its materiality and how that impacts its use. So this would be an alternative aspect, one that is deliberately stripped away from the representational model to simplify things. But we can still take all kinds of data sets and look at them through a relational lens.

KIMBERLY NEVALA: And do I understand you correctly that with a relational lens, we are not assuming - or presuming or projecting - that the information in front of us is, in fact, a replication or a duplication or a direct representation of a concept or a construct. Rather, it's specific to a usage and a context of the information: when and where it was collected, how it was generated, and for what purpose.

So is this idea of value, performance, and context more intrinsically linked to our understanding of what we're trying to do with the data than it might otherwise be in that representational model, where we assume the representation is just, again, maybe a digital duplicate of something real?

DIETMAR OFFENHUBER: Yeah, exactly.

I mean, focusing on the representation often makes us overlook all these other aspects that contribute to what we consider meaningful or not. And this more relational approach to data really looks at the way people use it, how people share it among themselves. Is it something very ephemeral? Is it something very immutable? What kind of settings does that data exist in? Yanni Loukissas, an academic in critical data studies, talks about data settings instead of data sets.

So I mean, all these things are, of course, important in the way we use data in the real world. And if we reduce it just to the content of the data, then we ignore all of that.

KIMBERLY NEVALA: And why do you think this model, the relational model in particular, is more productive - I don't know if productive is the right word; you can tell me the right adjective - but a better lens through which to think about and view synthetic data in particular?

DIETMAR OFFENHUBER: Yeah. I mean, it's interesting, because all these criticisms also apply to any data set that we are using. But synthetic data makes it particularly clear.

First of all, because representation goes out the window right from the beginning. These particular kinds of synthetic data sets that we're talking about, the ones that mimic real data, don't have a referent in the world. They just mimic other observations. They are simulated observations of the world, but they don't have a direct referent. So there is no representation. That's the first thing.
But then I go into a little more detail about the way we think about data quality in relation to synthetic data, which, as I argue, completely inverts our understanding of data. For example, if we consider the data quality of, let's say, a census data set or something else that represents a population, we of course want to know: how well is that population represented in the data set? And that goes back to questions like: how did people knock on doors and ask people for their information? This evidentiary chain of custody that points back to the phenomenon is the one thing that is very important.

Now, if we look at synthetic data, it's exactly the other way around. How useful a data set is is not necessarily a question of how well it represents particular people in the world, because there is no connection; it just mimics the kind of observations. The question is rather: how successful is an AI model that has been trained with this particular synthetic data set at recognizing things in the world?

And if we consider these different use cases of synthetic data - whether it's for privacy protection, whether it's for training or conducting statistical analysis - we find that a certain degree of artificiality of a synthetic data set is appropriate for one use case, but maybe not for another.

So in a way, it's no longer about how well a synthetic data set represents the world, but about what a model that has been trained with it is able to do. And that's a very, very different way to think about data: much more relational.

KIMBERLY NEVALA: Yeah. I was going to say it's a quiet, but really fairly profound point as you're thinking about that. Does that also mean that when we're using fully synthesized or generated data, that our conception of a ground truth also shifts? And if so, why or how?

DIETMAR OFFENHUBER: Yeah. I mean, ground truth is, of course, a very interesting topic in machine learning. And I'm working with Beth Coleman from the University of Toronto on a paper along those lines.

But in machine learning, of course, ground truth-- Well, let's start with the traditional sense. In geography, ground truth means that we have a satellite image, we analyze it, and we try to figure out what's going on. In order to figure out whether our inference is really correct, someone has to go there and see if the situation is actually the way we read it in the satellite scene. So that would be the ground truth: you actually go to a place and figure out what is going on there. Which, of course, is also a very important idea in a military context.

But in machine learning, it refers to the training data set. The training data set is the ground truth. The training data set could be entirely artificial, generated by a 3D game engine creating an artificial city that is used to train a vision system for an autonomous vehicle. But that's the ground truth we are talking about here.

Again, I'm not saying this kind of representational quality is not important. Obviously, it is important to understand what kind of reality gap we are talking about. But everything is a little bit more complicated.

KIMBERLY NEVALA: You made mention a little bit earlier here about the idea of counterfactuals and our ability to generate data and synthesize data to play through different scenarios that don't necessarily have to relate explicitly to something that's happening "on the ground" or in reality.

So it would seem to me that there's a liberating factor in our ability to do that. As long as we don't lose sight of the fact that this is, or should be, thought of as a counterfactual. Not as sort of real-world evidence. Do I have that correct? And how should we think about that with synthetic data?

DIETMAR OFFENHUBER: I mean, there's an interesting aspect.

I mean, first of all, data is not just for analysis. We do a lot of things with data, and data analysis is just one of them.

But very often, data is also some kind of speculative artifact for thinking through a scenario. And this is, of course, what is important in training models where you are not just trying to train it on the most common cases but also on extreme outliers that are not really encountered on a day-by-day basis.

But since I'm coming more from social science, there is this idea of data being more than just bureaucratic evidence to support decisions I may have already made in the first place, where I'm just looking for data to justify them--

KIMBERLY NEVALA: That doesn't happen, Dietmar.

[LAUGHING]

DIETMAR OFFENHUBER: --but as something more. Imagine I have some kind of agent that throws speculative scenarios at me, built on synthetic data sets, and my decisions change in response. Then you have a different counterfactual world that is constantly evolving.

So in that sense, as long as we understand what we are doing, I think we can also use this speculative quality or capacity of hallucinations or synthetic data artifacts.

KIMBERLY NEVALA: But when does that get dangerous? Is it when we lose sight of the fact that this is really a relational representation, and we start to believe that it is, in fact, representative of reality? Or is it also just a matter of how we interpret the outputs of a model differently?

DIETMAR OFFENHUBER: I mean, it's inherently very dangerous, because data sets don't always come with a label saying how exactly they were generated and where they come from. And in the sciences, that is maybe a little less of a problem than, let's say, in all those professions where people are professional data analysts but don't really go into the metadata all that much.
So for me, also in my other research, this question of data origin - what are the processes, the social processes, that lead to a data set - is, again, where the interesting stuff is happening.

So yeah, there is this danger of data pollution with synthetic data, of biases that self-replicate and get worse and worse. All of those are considerations. But they were already implied in a way. So I would say it's not limited to synthetic data; synthetic data just makes the problem more obvious.

KIMBERLY NEVALA: And you mentioned your work and interest in, or from, the social sciences perspective. One of the things we hear often about synthetic data is: this is great because it allows us to expand the patterns in the data. And I'm interested in your thoughts on that. Does this only allow us to create a bigger volume of data about things we already know? Or does synthetic data really help us capture what is sometimes referenced as all possible patterns?

DIETMAR OFFENHUBER: I mean, again, it really cuts both ways. And it's a fascinating question.

So on the one hand, when you do data analysis on real-world data sets, they are usually very messy. And sometimes the errors or mistakes they contain are the actual evidence I'm looking for. So nobody can just say: OK, let's disregard this, it's an outlier, it doesn't matter. It may be the only thing that matters.

So these kinds of frictions and contradictions that we sometimes find in data sets - which usually have their source in the particular way the data were generated - are important, are meaningful. And we can look at them in different ways. We can look at them as biases that need to be corrected. We can look at them as errors or other kinds of anomalies that we have to exclude. But we have to look at them from all kinds of different perspectives. And with synthetic data, there's a little bit of a danger that we stop doing that.

If I look at data from a material perspective, noise always has some interesting meaning in it. Sometimes, it's the noise that tells us what we want to know. I can give you an example. New York has a data set of GPS traces from all taxi rides. And the data are extremely blurry in Midtown and downtown. This is, of course, because the tall buildings and the narrow streets interfere with the GPS signal. But then, from just the two-dimensional geodata I have, I can suddenly extract the three-dimensional shape of the city from the noise, from the error.

So it's not always clear what can be discarded in a data set. And we can't really say in advance what the relevant features of a data set are. So with synthetic data, there is a danger of making it too good to be true.
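(The taxi-GPS example can be illustrated with a small simulation. This is not the actual New York data set; the coordinates, noise levels, and grid are invented to show how positional scatter itself carries information about the built environment.)

```python
# In "urban canyons," GPS fixes scatter more, so the per-block blur of
# nominally identical locations hints at the three-dimensional city.
import random
import statistics

random.seed(0)

def gps_fixes(true_xy, noise_m, n=200):
    """Simulate n GPS fixes of one location with Gaussian noise (in meters)."""
    x, y = true_xy
    return [(random.gauss(x, noise_m), random.gauss(y, noise_m)) for _ in range(n)]

open_area = gps_fixes((0.0, 0.0), noise_m=5.0)   # clear sky: tight fixes
midtown   = gps_fixes((0.0, 0.0), noise_m=40.0)  # tall buildings: multipath error

def scatter(points):
    """Standard deviation of the fixes' x-coordinates: a simple blur measure."""
    return statistics.stdev(p[0] for p in points)

# The noise itself is the signal: the blurrier block is the built-up one.
print(scatter(open_area), scatter(midtown))
```

A synthetic data set generated to be "clean" would erase exactly this kind of recoverable structure.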

KIMBERLY NEVALA: That representational idea, I suppose, relates in some ways to the idea of bias and the very real fact that in a lot of cases, there are populations or profiles or scenarios that aren't represented in our digital data, for a number of reasons. And so synthetic data has come into play here as well.

Another of the significant value propositions is that we can use it to debias our data. You pose a bit of a provocative question, or a provocation, relative to this. Is synthetic data an effective way to debias, or get rid of bias in, data sets?

DIETMAR OFFENHUBER: I mean, you can maybe correct one particular aspect of it. But that doesn't mean that the whole data set becomes more representative. But this is what a lot of people are working on.

They say, OK, the data set has a known bias. And we can deal with known biases. But maybe we were always able to: if we know a bias, we can do something about it. The problem is always the bias that we don't know anything about.

And of course, bias also implies that there is a certain gold standard, some truth or true representation. We say: OK, I have a population to compare my data set to. How does it match? How does it fit? So I can quantify the bias there. But if we talk about cultural phenomena, it becomes a little harder to conceptualize biases that represent one perspective of thinking versus another. We can't really quantify that in the same way.

But let's say we stay focused on this quantitative idea of bias that we want to correct for. There are a lot of researchers in synthetic data who propose methods of debiasing by just adding missing observations into the training data set, often in surprisingly crude ways. It's a very brute-force approach.

If all you want to do is shift the mean a couple of percentage points, then that's possible. What I'm wondering is: is that a viable strategy, at scale, for thinking about aspects of the population that we are not aware of?
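(The "brute force" debiasing Dietmar describes can be sketched in a few lines. The sample, the target rate, and the append-until-it-matches loop are invented for illustration; real pipelines are more elaborate, but the logic, and its limitation, is the same.)

```python
# Append synthetic observations until a summary statistic hits a target.
import statistics

biased_sample = [1.0] * 70 + [0.0] * 30  # 70% positive; assume the true rate is 50%
target_rate = 0.50

data = list(biased_sample)
synthetic = []
while statistics.mean(data) > target_rate:
    synthetic.append(0.0)  # add one synthetic minority observation
    data.append(0.0)

print(statistics.mean(data))  # the mean is now "corrected" to 0.5
print(len(synthetic))         # at the cost of 40 fabricated records
```

The mean is fixed, but only the one bias we already knew about; any unknown bias in the original 100 records is untouched, and 40 of the 140 records now reference no one in the world.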

KIMBERLY NEVALA: And if we back up a little and take a wider view: why was it - or why is it - important for you in your work to take the time to research and think critically about both the nature of synthetic data and where and how it's being applied today?

DIETMAR OFFENHUBER: Yeah. A couple of years ago, I got really interested in data materiality. I wrote a book, Autographic Design, which is about ways to visualize the material origin of data and to work with physical traces. That interest was triggered by my observation that a lot of activists - in environmental activism, in social justice, or people who look at misinformation from a forensic side - are increasingly using very material or physical displays to make a case for evidence. And that happened after 10 years of data literacy and visualization literacy.

So: if visualization and data analysis are so superior, why do they go back to these very analog ways of showing information? I was interested in these methods of showing the data origin as part of the larger discourse around how we think about a particular phenomenon.
But what struck me was that the nonrepresentational approaches that are central to these methods are also visible, generally, in the field of AI, where we often don't have the ability to trace data, or the outputs of a model, back to a thing in the world, to a representation. We somehow have to deal with a data generation process that is opaque. But we can probe it. We can experiment with it. We can do certain things.

So this is where I think synthetic data is interestingly somewhat similar to these kinds of trace phenomena in the world that just pick up a lot of fingerprints as they move from hand to hand.

KIMBERLY NEVALA: Interesting. Now, you said one of the questions you were thinking about there was why we were reverting to more analog ways - for some reason, I'm now over-indexing on the word representation myself - more analog ways of thinking about or visualizing information. And I have to ask: why were we?

DIETMAR OFFENHUBER: Why? Well, it's interesting, because if you look at the big controversies - let's say climate change or pollution - it's not really the interpretation of a visualization that is put into question; it's always the data source.

So people don't say: well, this visualization is great, but you can also look at it that way. They immediately jump to the point where they say: how do we know that the data you're working with is even valid? And this happened when the famous hockey stick chart began circulating in the media. The first challenge to it was: there's this tree ring data set that is just way too sparse, and we can't really draw any conclusions from these few tree rings. So it was immediately a debate about the data source.

The problem with visualization and data analysis is that they can't really help us make these arguments, because visualization can only start once we have data. And the data origin, as we know, is easily forgotten, because data generate their own reality.

So there is the question of how to make data collection more accountable. Especially when a data set is collected by activists or amateurs who don't have institutional credentials, they have to try very hard to show that the way they collected the data is compelling. In other words, they need to very tightly connect the phenomenon they care about to the data they generate. And this is where you really get into these very material dimensions.

KIMBERLY NEVALA: Yeah. And you can see how that shifts as we move forward. Because you said it's the data they generate. And in that case, the data they're generating may very well have been data they were collecting and inputting based on observation or physical discussions or surveys or what have you.

But today, the data we're collecting may represent what we have generated based on an idea we had. So answering the question of why I should believe that the information you're giving me is valid, based on the assumptions you used in generating it, gets really, really interesting and really sticky.
Does this then change how we have to think about data literacy - model literacy, if you will - and how we teach and understand data and information?

DIETMAR OFFENHUBER: I think so. I mean, I think so.

Data literacy is an interesting word. I have some problems with it, but I also use it because it is important.

But literacy implies that someone taught you, in elementary school, how to read a book. That is not necessarily how we interpret visualizations and images. Even if we don't have a canonical way of decoding them, we might see some meaning in them. And we might construct meaning together with other people.

I mean, this is the case the whole critical cartography movement in the '90s made: the way we read maps is not that we study the legend in detail and ask, oh, what does this symbol mean? We are on the subway platform with other people, asking them: how do I get there based on this map? People construct meaning from whatever is at their disposal in that particular situation. And the legends on maps are rarely ever really consistent - unless you are, let's say, a sailor working with NOAA charts, which are very consistent and very precise. But there, the legend is a book of 100 pages, and the charts themselves are very reduced.

So I think data literacy, in a way, takes the responsibility from the designer and puts it on the audience. You say: OK, you are supposed to learn how to read in school; it's not the author's responsibility whether you can read the book, because you're supposed to be able to read. That argument doesn't fly in visualization. The responsibility of the author is much bigger, I think, because they have to somehow anticipate that construction of meaning in a certain situation.

KIMBERLY NEVALA: And you had made the comment earlier about the traditional-- I’ll call it a problem, charitably. Where, even in traditional data analytics, you have a decision you have already made and then you go find the data that supports it. And it seems to me that if we want to be in a position - particularly in an era where a lot of the data is being synthesized, and is therefore a projection of a belief about the world or the state of things in relation to some objective and outcome - we really do have to take more onus to explain why we think that is an appropriate projection, maybe even an appropriate counterfactual.

Otherwise, we really do open the door to this continuing problem we have now. Which is: it's so easy for folks now to go out and generate their own view of the world. I think you might have referred to it somewhere as the malleability and ease of today's generative AI systems. Which makes it so easy for everyone to fabricate their own, I think you called it, Instagram reality of data. And it just puts additional pressure on this point, doesn't it?

DIETMAR OFFENHUBER: Yeah. I mean, that's for me a really scary, scary aspect of this whole story where, yeah, you can no longer distinguish which parts of the data set are observations, which parts are assumptions, which parts are-- And even observations contain, obviously, all these assumptions, as we all know. But things get much more complicated with synthetic data.

And it has something to do with the way we scaffold meaning. There is no real ground truth where we can dig down and say, OK, this is where we can anchor everything. We can only compare and contrast things. And contrasting things is the way we learn, I would say. But it becomes incredibly difficult if you have media bubbles where these contrasts are prepared for you, where you have scaffolds that support a few points from many different, yeah, directions.

KIMBERLY NEVALA: So I want to take just a slight turn. Because in the interest of nonrepresentational computing systems, I feel like it would be remiss if I didn't ask about your water computer collaboration with Orkan Telhan.

DIETMAR OFFENHUBER: So I mean, again, this came out of this interest in nonrepresentational ideas of data. Which, on the one hand, started with this data materiality that went into the nonrepresentational aspects of digital data. But it also took me back to a lot of analog computational approaches.

You could say if you take a wind tunnel, a wind tunnel is a computer to calculate turbulence. The problem is it doesn't give you a number. It instantaneously gives you a picture of that turbulence, but you have to measure it somehow. Which is the problem with analog computers. But the idea of having AI systems and having computational approaches that don't require silicon and digital artifacts and digital representations was something compelling.

There's a whole field in computer science and physics called physical reservoir computing. So they look at physical phenomena, anything from water ripples to vibrations or light impulses, light reflections, to replace some parts of deep learning models. The idea is to use the partly chaotic, partly deterministic qualities of these phenomena to transform data in a way that makes it easier for simpler neural networks, ones that don't need as much energy to train, to pick it up.

And this notion of using physical media to calculate. As an urbanist originally, I was interested in how can we think of a city as a computer? And so we currently have a water computer in the Venice Biennale of Architecture that predicts the time of day in Venice, based on how disturbed the water surfaces in the canals are. And it uses water to calculate that.

And I think a lot of people are recently interested in these kinds of unconventional computing approaches because everyone, of course, knows about the immense energy consumption that machine learning and AI require. And they also understand that not everything needs numerical accuracy and determinism in order to produce something useful. So a lot of people are looking at bringing these analog aspects into machine learning.

KIMBERLY NEVALA: That is absolutely fascinating, I think, that in an age of all things digital, of going digital faster and becoming more digital, we are harkening back to more analog perspectives. And I had never heard of this particular area of computation and physics. We'll provide a link to it in the show notes.

So as we're wrapping up here, and you've been looking at and thinking critically about synthetic data, what do you think are some of the most important open questions or things we should be aware of as we move forward in this world where synthetic data and generated information becomes more and more prevalent?

DIETMAR OFFENHUBER: I mean, there's no single point that I can highlight because all of the aspects that we discussed are super meaningful in their own particular way.

But for me, it really comes down to widening our understanding of data a little bit more.

That data is not always a digital file. It can be a physical artifact. It can be a physical trace. Data is not always, a data set is not always, something to be analyzed through statistical means. It can be something that plays a role in the way people debate a certain issue in all kinds of other ways. Data can be provocation. Data can be a prompt. Data can be a counterfactual scenario.

So thinking about data a little bit outside of these very narrow confines, I think, is one way we can talk about all of these scary aspects and dangers, maybe a little bit more nuanced.

KIMBERLY NEVALA: I love that, and I think that's a great note to leave with all of us who are buried in analytics and AI and data this and that up to our armpits, if you will.

So thank you so much, Dietar-- Dietmar. Goodness. Thank you so much, Dietmar. I really appreciated your time and the insights today.

DIETMAR OFFENHUBER: Thank you, Kimberly. It was great. Thank you very much.

KIMBERLY NEVALA: To continue learning from thinkers, doers, and advocates such as Dietmar, you can find us wherever you listen to your podcasts, and also on YouTube. In addition, we want to hear from you. So if you have comments, questions, or guest suggestions, please write to us at ponderingai@sas.com. That's ponderingai@sas.com.
