Empiricism and Rationalism in Modern Machine Learning
Models, data, models of learning from data, models of data
Ernest Gellner’s Legitimation of Belief, which he considered the fullest statement of his philosophical point of view, argued chiefly that empiricism and rationalism both fail as theories of knowledge, but:
The failure—if this be granted—of such models as explanations, does not exclude their usefulness or validity as norms, as charters of cognitive practices. Perhaps they may constitute sound and cogent norms, well-founded principles governing the limits of cognitive comportment and propriety, even if they fail as specifications of the underlying mechanism of cognition.
The book fleshes out those norms at some length. On the one hand, there is the empiricist tradition, which demands that all arguments be testable and requires some manner by which they can be tested. Exactly what it means to be testable, and what the manner of testing should be, is something the empiricists and their intellectual descendants do not agree on among themselves, but the spirit of testing and testability is the important thing.
On the other hand, the rationalist tradition demands that everything be explained in terms of mechanics, of structures; in other words, in terms of simplified models whose internal logic is thoroughly fleshed out.
These statements seem completely banal to us now because they have so thoroughly permeated our culture; they are the basic assumptions we hold about the right ways to make progress in our understanding of the world. Gellner’s argument is that the adoption of these norms was a world-historic event, the consequences of which are still playing out in our time and have only begun to be understood.
Gellner of course rehearses the familiar problems with each tradition if they are taken as explanations rather than norms. Empirical “data” is always structured by assumptions; to put it simply, observation is not free of theory. Meanwhile, theory without observation, without testing, is a recipe for abstractions without any explanatory value whatsoever. Much of capital-T Theory in the academic world, thanks to its practitioners’ hostility to anything they perceive as “positivism,” exhibits this particular failure mode.
Now, I want to use Gellner’s framing to think about “synthetic data,” a term of art from the world of machine learning, with the caveat that I am really just an interested outside observer, and about as far from an expert on these topics as one can be.
Machine learning itself is interesting from the point of view of the empiricist-rationalist distinction. A machine learning model clearly is a model, so it bears some relationship to the rationalist tradition. But what is it a model of? It is a model of how to learn from observation. No one seriously thinks it is a model of human cognition in all of its complexity, but I do not think the word “learning” is off the mark (though I still in principle refuse to refer to Large Language Models as “Artificial Intelligence”). The teams that build search engine algorithms know what inputs their models are going to receive (the phrases users type into the search engine, the links those users end up clicking on, how long it takes them to click, how soon they perform another search, and so on) and attempt to model how a system might take those inputs and produce better outputs from them over time.
In short, a machine learning model is like an empiricist built by a rationalist. Here is a mechanistic articulation of a thing that takes in observations and puts out successively more accurate outputs based on those observations.
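To make that image concrete, here is a deliberately toy sketch in Python of such a thing: a hand-rolled online learner that ingests a stream of (features, clicked) observations of the sort described above and nudges its weights toward better predictions. Every name and number in it is invented for illustration; real ranking systems are vastly more elaborate, but the shape is the same: observations in, successively better guesses out.

```python
# Toy sketch: an "empiricist built by a rationalist." A hand-rolled online
# logistic-regression learner. All feature names, data, and constants are
# invented for illustration only.
import math

def predict(weights, features):
    """The model's current guess at the probability of a click."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def update(weights, features, clicked, lr=0.1):
    """Nudge the weights toward whatever the observation actually said."""
    error = clicked - predict(weights, features)
    return [w + lr * error * x for w, x in zip(weights, features)]

# Hypothetical observations: ([how prominent the result was,
# how long the user hesitated], did they click?)
observations = [
    ([1.0, 0.2], 1),
    ([0.1, 0.9], 0),
    ([0.8, 0.3], 1),
    ([0.2, 0.8], 0),
]

weights = [0.0, 0.0]
for features, clicked in observations * 50:  # replay the stream of observations
    weights = update(weights, features, clicked)

print(weights)  # the model has "learned" something from what it observed
```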
Synthetic data is interesting in this context because it demonstrates how limited this model empiricist necessarily is. Synthetic data is, as its name implies, “fake” data that is itself produced by models. The classical empiricists would no doubt be horrified at the very idea. What could be more solipsistic than to create a model of what observations we might have if we actually attempted to observe the world? Why not just cut out the middleman and observe the world?
Well, the issue is the nature of the inputs that go into a machine learning model. Not everything in the world can realistically be introduced into the stream of “observations” available to these models. Not everything in the world is as readily structured as a search query or browsing behavior. But we might know something about the rest of the world, and we might even know it through empirical investigation rather than rationalist modeling. Social scientists produce mountains of studies of human behavior all the time, for example. Programmers, who are human beings capable of reading and thinking through the implications of what they have read, can learn from these studies and attempt to model the implications that are difficult to capture directly in a form digestible to their machine learning models. Synthetic data can therefore be seen as something like an interpretation of the world, and like all interpretations it may be supported by stronger or weaker evidence if the modelers are called upon to justify it. Of course, being pragmatic by nature (after all, this is all done largely for profit), they are more likely to simply point to the results, if the result is machine learning models with higher-quality outputs.
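Purely as an illustration of the idea, with every number invented: generating synthetic data can be as simple as writing down your interpretation of the world as a small generative model and sampling fake observations from it, in the same shape as the real ones, so they can be fed to the same learner sketched above.

```python
# Toy sketch of synthetic data: an invented "finding" (hesitant users rarely
# click low-ranked results) encoded as a tiny generative model that emits
# fake observations shaped like the real ones above. All numbers are made up.
import random

def synthesize_observation(rng):
    rank = rng.random()        # how prominent the imagined result was
    hesitation = rng.random()  # how long the imagined user dallied
    # The "interpretation of the world": click probability rises with rank
    # and falls with hesitation. Modelers could be asked to justify this rule,
    # or they could simply point to improved results downstream.
    p_click = max(0.0, min(1.0, 0.2 + 0.9 * rank - 0.4 * hesitation))
    clicked = 1 if rng.random() < p_click else 0
    return [rank, hesitation], clicked

rng = random.Random(0)
synthetic_stream = [synthesize_observation(rng) for _ in range(200)]

# These fake observations can then be replayed through the same update()
# loop as the genuine ones, "teaching" the model the lesson the modelers
# believe to be true.
```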
There’s still something amusing to me about creating a model of how to learn from inputs, on the one hand, and then creating a class of inputs yourself in order to “teach” the model the correct “lesson.” Though now that I put it like that, I suppose it isn’t all that conceptually different from education itself!