Part III · The Physics of Thought
III.C — Time Crystallized in Weights
The copy is cheap; the search was not. This asymmetry—between the negligible cost of duplicating a trained model and the enormous cost of producing it—is the economic puzzle that defines the emerging regime. In a digital medium, the artifact is infinitely replicable; the process that produced it is not.
The cost of running a trained model—inference—is only part of the story. The larger cost, and the one that creates durable economic value, is the cost of producing the model in the first place: the search process that discovered, out of an astronomical space of possible configurations, a particular arrangement of weights that happens to be useful.
This is where the concept of thermodynamic depth becomes relevant. The term was introduced by Seth Lloyd and Heinz Pagels in a 1988 paper that asked a deceptively simple question: what makes one physical state more complex than another? (Seth Lloyd and Heinz Pagels, "Complexity as thermodynamic depth," Annals of Physics 188, no. 1 (1988): 186–213.) Their answer was that complexity is not a property of the state itself but of the history that produced it. A state is complex if producing it required a large amount of computational work—equivalently, if a large amount of entropy had to be dissipated along the way. Thermodynamic depth measures how much irreversible processing was needed to arrive at a given configuration.
The intuition can be made concrete. Diamond and coal are both made of carbon atoms, arranged in different crystalline structures. The difference in their properties—hardness, transparency, conductivity—is a consequence of that structural difference. The structural difference itself reflects a difference in thermodynamic history: diamond forms under conditions of extreme pressure and temperature, sustained over geological time; coal forms from the compression of organic matter under much milder conditions. The atoms are the same; the history is different. Reproducing the structure requires paying the thermodynamic price, or acquiring an existing artifact that already paid it. (The social overlays on diamond pricing are real but separate from this physical point.)
A trained neural network is analogous. The weights of a model—the billions of numerical parameters that determine its behavior—are not random. They are the product of a search process that explored a vast space of possible configurations and converged on one that minimizes a loss function. The search is not omniscient; it is an optimization constrained by architecture, data, and objective. What crystallizes is not truth but a usable compressive structure that survives deployment. That search process is called training, and it is expensive: it requires feeding enormous quantities of data through the network, computing gradients, adjusting weights, and repeating the process millions or billions of times. Each step dissipates energy. The final configuration of weights is a record of all that work.
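The loop itself is easy to state, even though running it at frontier scale is not. Below is a minimal sketch in Python, using a toy least-squares model in place of a neural network; the dimensions, step size, and step count are illustrative assumptions rather than figures from any real training run.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the search described above: gradient descent on a
# least-squares loss. Frontier training differs enormously in scale and
# architecture, but the loop (forward pass, loss, gradient, update) has
# the same shape.
X = rng.normal(size=(256, 8))                   # training inputs (assumed sizes)
w_true = rng.normal(size=8)
y = X @ w_true + 0.1 * rng.normal(size=256)     # training targets

w = np.zeros(8)        # starting point in weight space
lr = 0.1               # step size (assumed)

for step in range(500):
    residual = X @ w - y                  # forward pass and error
    loss = 0.5 * np.mean(residual ** 2)   # the quantity being minimized
    grad = X.T @ residual / len(y)        # gradient of the loss w.r.t. w
    w -= lr * grad                        # each update is a small, irreversible
                                          # piece of work; the final w records all of it
    if step % 100 == 0:
        print(step, round(loss, 4))
```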
The cost is not merely computational in the abstract sense. It is physical. Training a frontier language model requires thousands of GPUs running continuously for weeks or months, consuming megawatts of power and generating corresponding quantities of waste heat. Total expenditures for frontier training runs now exceed $100 million, with some estimates reaching several hundred million, though figures vary with architecture, scale, and accounting methodology. These are invoices paid in joules and dollars, and they represent an irreversible expenditure that cannot be recovered.
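A back-of-envelope calculation makes the scale concrete. The numbers below are hypothetical round figures chosen only to illustrate the order of magnitude, not measurements from any particular run.

```python
# Hypothetical round numbers, chosen only to illustrate the order of magnitude.
gpus = 10_000            # accelerators running in parallel (assumed)
watts_per_gpu = 700      # board power per accelerator, in watts (assumed)
pue = 1.2                # datacenter overhead for cooling etc. (assumed)
days = 90                # wall-clock duration of the run (assumed)

joules = gpus * watts_per_gpu * pue * days * 24 * 3600
gwh = joules / 3.6e12    # 1 kWh = 3.6e6 J, so 1 GWh = 3.6e12 J
print(f"{joules:.1e} J ≈ {gwh:.0f} GWh")   # ~6.5e13 J, roughly 18 GWh
```

At typical industrial electricity prices, that much energy costs on the order of a few million dollars, which suggests the nine-figure totals cited above are dominated by hardware, networking, and amortized capital rather than by power alone.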
The result of that expenditure is a file—a collection of floating-point numbers that can be stored on a disk and copied at negligible cost. This is the paradox that confuses intuitions about the economics of AI. The marginal cost of copying a trained model is essentially zero: bits can be duplicated as easily as any other digital information. But the fixed cost of producing the model in the first place is enormous, and that fixed cost is what gives the model its durable advantage. The copy is downstream of an original thermodynamic expenditure that had to be paid at least once.
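The asymmetry can be put in numbers. The figures below are illustrative assumptions (a hypothetical 70-billion-parameter model stored in fp16, a 10 Gb/s link, rough commodity egress pricing), not a claim about any specific system.

```python
# Marginal cost of the copy vs. fixed cost of the search (illustrative).
params = 70e9                      # parameter count (assumed)
bytes_per_param = 2                # fp16 storage
file_gb = params * bytes_per_param / 1e9       # ≈ 140 GB weight file

link_gbps = 10                     # network bandwidth (assumed)
copy_minutes = file_gb * 8 / link_gbps / 60
copy_cost_usd = 0.02 * file_gb     # rough per-GB transfer pricing (assumed)

training_cost_usd = 100e6          # order of magnitude cited above

print(f"copy: ~{copy_minutes:.0f} min, ~${copy_cost_usd:.0f}")
print(f"fixed/marginal cost ratio: ~{training_cost_usd / copy_cost_usd:.0e}")
```

Under these assumptions the ratio of fixed to marginal cost is on the order of ten million to one, which is the shape of the paradox, whatever the exact inputs.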
Charles Bennett, in work roughly contemporaneous with Lloyd and Pagels, developed a related concept he called “logical depth.” (Charles H. Bennett, "Logical depth and physical complexity," in The Universal Turing Machine: A Half-Century Survey, ed. Rolf Herken (Oxford; New York: Oxford University Press, 1988), 227–257.) Logical depth measures the computational resources required to produce a given output from a minimal description of it. A string of random digits has low logical depth: its shortest description is essentially the string itself, and printing out a literal string is fast. A string that encodes the first million digits of pi has high logical depth because there is a short program that can generate it—but running that program takes a long time. The depth is in the computation, not in the description.
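The pi example can be made literal. The generator below is a standard unbounded spigot construction: the whole description fits in a few lines, but the integers it carries grow as digits are emitted, so each successive digit costs more arithmetic than the last. That gap between description length and computation time is what logical depth measures.

```python
from itertools import islice

def pi_digits():
    """Yield the decimal digits of pi one at a time.

    A standard unbounded spigot construction: the program is short, but the
    integers q, r, t grow as digits are emitted, so each successive digit
    costs more computation than the last.
    """
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, t, k, n, l = (10 * q, 10 * (r - n * t), t, k,
                                (10 * (3 * q + r)) // t - 10 * n, l)
        else:
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)

print(list(islice(pi_digits(), 12)))   # [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
```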
Bennett’s point was that depth, not mere complexity, is what distinguishes interesting structures from random ones. A random configuration of matter has high entropy but low depth; it can be produced by any number of uncontrolled processes. A crystal has low entropy and low depth; it can be produced by straightforward physical assembly. A living organism, or a trained neural network, has high depth: it is the product of an extended, selective, information-rich process for which there was no shortcut.
The value does not reside in the bits themselves—the raw information content of the weight file—but in the thermodynamic depth of the process that produced them. The file you can copy is a compressed record of that history, and the record has value precisely because the history was costly.
There is an obvious counterargument: if the file can be copied, and copying is cheap, then why does the original cost matter? Once the model exists, anyone who obtains a copy gets the benefit of the search without paying for it. The fixed cost is sunk, and competition should drive the price toward marginal cost—which is essentially zero.
This argument is correct as far as it goes, but it misses the structure of the problem. Access to the model is not automatic. The weights of frontier models are closely held; even when released, terms of use may restrict commercial applications. Control over the weights is a form of property right, and property rights can sustain prices above marginal cost. Running the model at scale requires infrastructure that is not free—inference consumes compute, and compute consumes energy. A model that costs nothing to copy still costs money to deploy, and that cost scales with usage.
More fundamentally, the frontier is always moving. A model trained today will be superseded by a model trained tomorrow, with more parameters, more data, more compute, and better performance. The thermodynamic expenditure is not a one-time cost but an ongoing race. The company that stops investing in training falls behind; the company that keeps investing stays at the frontier. The sunk cost of a trained model is only sunk until the next generation arrives. Distillation and imitation can transfer capability at lower cost than first-principles training: a smaller model can be trained to mimic the outputs of a larger one, packing much of its capability into fewer parameters at a fraction of the original training cost. This compresses rents within a generation, but the distilled model is still downstream of the original thermodynamic expenditure. It accelerates diffusion; it does not remove the search cost borne by whoever reaches the frontier first.
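Schematically, distillation swaps the original training target for the teacher's output distribution. The sketch below shows the standard soft-target objective in PyTorch; the temperature value and the stand-alone function are illustrative choices, not a description of any particular lab's pipeline.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target loss: push the student's output distribution toward the
    teacher's (classic knowledge-distillation objective, schematic form)."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)        # soft targets
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student); the t^2 factor keeps gradient magnitudes
    # comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (t * t)
```

The asymmetry is visible in the signature: the teacher's logits are cheap to read once the teacher exists; producing the teacher was not.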
The practical meaning of “intelligence,” in this context, is uncertainty reduction that survives selection. A model is intelligent insofar as it can take an input—a question, an image, a situation—and produce an output that reduces the user’s uncertainty about what to do or believe. The reduction is valuable because it saves time, avoids errors, or enables actions that would otherwise be impossible. And it survives selection because the model, having been trained on vast quantities of data, has learned patterns that generalize to new cases.
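One schematic way to make this precise (an illustrative formalization, not a definition drawn from the text): if $A$ is what the user needs to decide or believe and $M(x)$ is the model's output on input $x$, the uncertainty reduction can be read as the entropy removed by conditioning on that output,

$$\Delta H = H(A) - H\bigl(A \mid M(x)\bigr),$$

and the reduction only counts to the extent that it holds on the distribution of inputs deployment actually serves, which is what "survives selection" requires.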
But the uncertainty reduction is not free. It was purchased at the cost of the training process, which itself was a form of selection: the gradient descent algorithm explored the space of possible models and selected, step by step, the configurations that minimized the loss. Training is the internal selection procedure; deployment is the external one. The intelligence in the model is the crystallized residue of internal selection, but durable value accrues only where internal and external selection align. A model that minimizes training loss but fails to reduce user uncertainty on deployment has depth without value.
The implications for capital should now be clear. In an economy where intelligence is a factor of production, the scarce resource is not information—information can be copied—but the capacity to produce new intelligence. That capacity is grounded in the physical infrastructure of training: the chips, the power, the cooling, the data, the engineering talent, and the organizational capability to coordinate all of these at scale. The model file is the output; the training pipeline is the asset.
This is the sense in which energy is being “structured into computation” and computation is being “structured into intelligence.” Each layer of structure represents an irreversible transformation: energy into switching states, switching states into learned weights, learned weights into uncertainty-reducing outputs. The value accumulates at each step, and the value is proportional to the thermodynamic depth of the process. The file can be copied, but the history cannot be reproduced at frontier performance without paying comparable cost.
Intelligence is physical. It has a thermodynamic cost. And the cost—the irreversible thermodynamic expenditure of the search—is what makes the advantages durable. The next part develops the formal machinery to describe this process: the production function for intelligence, the role of selection in shaping value, and the implications for pricing, investment, and the structure of the emerging regime.