<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <author>
    <name>Francisco Romaldo Fernandes Mendes</name>
  </author>
  <generator uri="https://hexo.io/">Hexo</generator>
  <icon>https://franciscormendes.github.io/gallery/favicon-32x32.png</icon>
  <id>https://franciscormendes.github.io/</id>
  <link href="https://franciscormendes.github.io/" rel="alternate"/>
  <link href="https://franciscormendes.github.io/atom.xml" rel="self"/>
  <rights>All rights reserved 2026, Francisco Romaldo Fernandes Mendes</rights>
  <subtitle>
    <![CDATA[Machine Learning & Statistics]]>
  </subtitle>
  <title>Francisco Mendes</title>
  <updated>2026-04-10T16:42:50.520Z</updated>
  <entry>
    <author>
      <name>Francisco Romaldo Fernandes Mendes</name>
    </author>
    <category term="statistics" scheme="https://franciscormendes.github.io/categories/statistics/"/>
    <category term="bayesian-statistics" scheme="https://franciscormendes.github.io/tags/bayesian-statistics/"/>
    <category term="a-b-testing" scheme="https://franciscormendes.github.io/tags/a-b-testing/"/>
    <category term="statistics" scheme="https://franciscormendes.github.io/tags/statistics/"/>
    <category term="experimentation" scheme="https://franciscormendes.github.io/tags/experimentation/"/>
    <content>
      <![CDATA[<div class="series-box">  <div class="series-label">Series</div>  <div class="series-name">Bayesian Methods and Experimentation</div>  <ol class="series-list"><li class="series-item"><a href="/2024/07/19/bayesian-statistics/">Bayesian Statistics : A/B Testing, Thompson sampling of multi-armed bandits, Recommendation Engines and more from Big Consulting</a></li><li class="series-item"><a href="/2024/11/08/consulting-ab-testing/">The Management Consulting Playbook for AB Testing (with an emphasis on Recommender Systems)</a></li><li class="series-item"><a href="/2024/08/04/rct-your-way-to-policy/">No, You Cannot RCT Your Way to Policy</a></li><li class="series-item series-current"><span>Bayesian A/B Testing Is Not Immune to Peeking: Insights from the AV Marketplace</span></li></ol></div><h1 id="The-Setup"><a href="#The-Setup" class="headerlink" title="The Setup"></a>The Setup</h1><p>Imagine you are running an experiment to test the efficacy of a rewards program built to incentivize the use of autonomous vehicles in a ride-share marketplace. AVs cost more to operate than driver cars, so the business case depends heavily on whether riders can be nudged toward them at sufficient volume. The rewards program is the nudge — discounts, points, whatever it takes — and you need to know if it works.</p><p>The catch is that the rewards program itself costs money for every day it runs. Every subsidised ride is a line item. So there is real pressure to end the experiment as early as possible. Enter some Bayesian fanatic who proposes the solution: run a Bayesian experiment instead of a frequentist one. The argument is that Bayesian methods allow you to check results continuously and stop the moment you have sufficient evidence, which would dispense entirely with the need for a fixed sample size, the indignity of waiting, and <em>crucially</em> the problem of peeking.</p><p><img src="/gallery/thumbnails/xkcd-frequentist-bayesian.png" alt="XKCD #1132 — Frequentists vs. 
Bayesians (Randall Munroe, CC BY-NC 2.5)"><br><em>The Bayesian in this comic is right about priors. The Bayesian in our meeting was right about priors too. Neither of them was right about the experiment being cheap.</em></p><p>My disagreement was vigorous enough that simply asserting it felt insufficient, and so I brought the math, which has the considerable advantage of being harder to dismiss than mere opinion.</p><h1 id="Frequentist-Sample-Size"><a href="#Frequentist-Sample-Size" class="headerlink" title="Frequentist Sample Size"></a>Frequentist Sample Size</h1><p>To set the baseline, here is the standard frequentist formulation. We are testing whether the rewards program (arm B) increases AV ride take-rate relative to no rewards (arm A), where $\theta$ is the probability a rider chooses an AV:</p>$$H_0: \theta_A = \theta_B, \quad H_1: \theta_B > \theta_A$$<p>With Type I error $\alpha$ and power $1-\beta$, the required sample size per arm is:</p>$$n_\text{freq} = \frac{\left( z_{1-\alpha/2} + z_{1-\beta} \right)^2 \left[ \theta_A (1-\theta_A) + \theta_B (1-\theta_B) \right]}{(\theta_B - \theta_A)^2}$$<p>where $z_q$ denotes the $q$-th quantile of the standard normal distribution. The numerator grows with the variance of each arm; the denominator shrinks with the effect size squared. If the rewards program moves the AV take-rate only slightly, you need a very large experiment. This was, in fact, the source of the cost anxiety — the expected lift was small, which meant the required sample size was large, which meant the rewards program would run for a long time at a loss.</p><p>This is the formula the Bayesian fanatic wanted to escape. On to the proposed alternative.</p><h1 id="Bayesian-Sample-Size"><a href="#Bayesian-Sample-Size" class="headerlink" title="Bayesian Sample Size"></a>Bayesian Sample Size</h1><p>The Bayesian formulation replaces the frequentist error guarantees with a posterior expected loss criterion. 
We approximate the posterior on each arm’s conversion rate as Gaussian — reasonable for proportions with sufficient data:</p>$$\theta_A \mid D_A \sim \mathcal{N}(\hat{\theta}_A, \sigma_A^2), \quad\theta_B \mid D_B \sim \mathcal{N}(\hat{\theta}_B, \sigma_B^2)$$<p>with posterior variances:</p>$$\sigma_A^2 \approx \frac{\hat{\theta}_A (1-\hat{\theta}_A)}{n}, \quad\sigma_B^2 \approx \frac{\hat{\theta}_B (1-\hat{\theta}_B)}{n}$$<p>Instead of controlling Type I error, we set a threshold $\epsilon$ on the probability of selecting the wrong arm:</p>$$p_\text{wrong} = \mathbb{P}(\text{choose wrong arm}) < \epsilon$$<p>Solving for $n$, the required sample size per arm is:</p>$$n_\text{bayes} = \frac{\hat{\theta}_A (1-\hat{\theta}_A) + \hat{\theta}_B (1-\hat{\theta}_B)}{(\hat{\theta}_B - \hat{\theta}_A)^2} \cdot \left[ \Phi^{-1}(1-\epsilon) \right]^2$$<p>where $\Phi^{-1}$ is the inverse standard normal CDF. Look at the structure. It is identical to the frequentist formula. The variance terms are the same. The effect size in the denominator is the same. The only difference is the squared prefactor: $\left[\Phi^{-1}(1-\epsilon)\right]^2$ instead of $\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2$.</p><h1 id="Example"><a href="#Example" class="headerlink" title="Example"></a>Example</h1><p>Put some numbers on it. Suppose the baseline AV take-rate is 50% and the rewards program is expected to lift it by 2 percentage points:</p><ul><li>$\theta_A = 0.50$, $\theta_B = 0.52$</li><li>Frequentist: $\alpha = 0.05$, power $= 0.8$ $\implies z_{1-0.025} + z_{0.8} \approx 1.96 + 0.84 = 2.8$</li><li>Bayesian: $\epsilon = 0.05 \implies \Phi^{-1}(0.95) \approx 1.645$</li></ul><p>Setting aside the variance terms, which are identical for both, the sample sizes scale as:</p>$$n_\text{freq} \propto (2.8)^2 = 7.84, \quad n_\text{bayes} \propto (1.645)^2 = 2.71$$<p>On paper, the Bayesian approach needs roughly a third of the frequentist sample. 
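</p><p>The arithmetic above is easy to reproduce. The sketch below (my own illustration, not part of the original analysis; function names are mine) computes both per-arm sample sizes directly from the two formulas:</p>

```python
from statistics import NormalDist

def n_frequentist(theta_a, theta_b, alpha=0.05, power=0.8):
    """Per-arm sample size from the frequentist two-proportion formula."""
    z = NormalDist().inv_cdf
    prefactor = (z(1 - alpha / 2) + z(power)) ** 2
    variance = theta_a * (1 - theta_a) + theta_b * (1 - theta_b)
    return prefactor * variance / (theta_b - theta_a) ** 2

def n_bayesian(theta_a, theta_b, eps=0.05):
    """Per-arm sample size from the posterior wrong-arm probability criterion."""
    z = NormalDist().inv_cdf
    variance = theta_a * (1 - theta_a) + theta_b * (1 - theta_b)
    return z(1 - eps) ** 2 * variance / (theta_b - theta_a) ** 2

# theta_A = 0.50, theta_B = 0.52: prefactors (1.96 + 0.84)^2 vs 1.645^2
print(round(n_frequentist(0.50, 0.52)))  # about 9800 riders per arm
print(round(n_bayesian(0.50, 0.52)))     # about 3380 riders per arm
```

<p>For the 50% vs 52% scenario this gives roughly 9,800 riders per arm for the frequentist design and roughly 3,380 for the naive Bayesian one, matching the 7.84 vs 2.71 prefactor ratio.</p><p>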
If you are the person trying to minimise the cost of subsidising AV rides, this looks like exactly what you wanted, and it is the kind of result that tends to end conversations in rooms where people are more motivated by the cost of the experiment than the integrity of it. It is also, as it turns out, not quite right.</p><h1 id="Bayesian-Is-Not-Immune-to-Peeking"><a href="#Bayesian-Is-Not-Immune-to-Peeking" class="headerlink" title="Bayesian Is Not Immune to Peeking"></a>Bayesian Is Not Immune to Peeking</h1><p>The critical assumption buried in the Bayesian sample size formula is that you collect $n_\text{bayes}$ samples and <em>then</em> evaluate the stopping criterion. You do not evaluate it after every ride. You do not check it at the end of each day because finance is asking. You do not peek.</p><p>Peeking is the practice of inspecting results before the planned sample size is reached and stopping early if the numbers look good. It is what invalidates frequentist tests when p-values are checked repeatedly mid-experiment — the false positive rate inflates because you are effectively running multiple tests and keeping the best result. The same logic applies to the Bayesian posterior.</p><p><img src="/gallery/thumbnails/xkcd-significant.png" alt="XKCD #882 — Significant (Randall Munroe, CC BY-NC 2.5)"><br><em>Run enough tests, check often enough, and green jelly beans will cause acne. The Bayesian equivalent: check the posterior enough times and your rewards program will appear to work. The AV subsidy line item does not care which framework licensed your false positive.</em></p><p>If you evaluate $p_\text{wrong} < \epsilon$ continuously and stop the moment it dips below threshold, you have not run the experiment described by the formula above. You have run something different, with different — and worse — statistical properties. The Bayesian framing does not make this problem disappear. It reframes it. 
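</p><p>The inflation is straightforward to demonstrate by simulation. The following sketch (my own illustration; the parameters are arbitrary) runs A/A experiments in which both arms share the same true take-rate, then evaluates the Gaussian-approximate posterior either once at the final sample size or at repeated interim checkpoints, counting how often each procedure confidently declares a winner that cannot exist:</p>

```python
import random
from statistics import NormalDist

def p_b_beats_a(sa, sb, n):
    """Posterior P(theta_B > theta_A) under the Gaussian approximation."""
    ta, tb = sa / n, sb / n
    var = (ta * (1 - ta) + tb * (1 - tb)) / n
    return 0.5 if var == 0 else NormalDist().cdf((tb - ta) / var ** 0.5)

def declares_winner(rng, theta, n_max, checkpoints, eps):
    """True if the posterior ever crosses the decision threshold."""
    sa = sb = 0
    for n in range(1, n_max + 1):
        sa += rng.random() < theta
        sb += rng.random() < theta
        if n in checkpoints:
            p = p_b_beats_a(sa, sb, n)
            if p > 1 - eps or p < eps:
                return True
    return False

rng = random.Random(0)
eps, n_max, trials = 0.05, 4000, 300
peeking = set(range(100, n_max + 1, 100))   # a look every 100 riders
peek = sum(declares_winner(rng, 0.5, n_max, peeking, eps)
           for _ in range(trials)) / trials
fixed = sum(declares_winner(rng, 0.5, n_max, {n_max}, eps)
            for _ in range(trials)) / trials
print(f"false winner rate: fixed-n {fixed:.2f}, peeking {peek:.2f}")
```

<p>With a single look, the false-winner rate sits near the $2\epsilon$ you would expect from the two tails; with a look every 100 riders it is several times larger. The guarantee in the sample size formula holds only if the stopping criterion is evaluated once.</p><p>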
The stopping rule is still a rule, and it must be respected as such.</p><h1 id="The-Deeper-Point"><a href="#The-Deeper-Point" class="headerlink" title="The Deeper Point"></a>The Deeper Point</h1><p>Now consider what happens when you align the frequentist and Bayesian error guarantees. Under a non-informative prior and the Gaussian approximation, choosing $\epsilon$ so that</p>$$\left[ \Phi^{-1}(1-\epsilon) \right]^2 = \left( z_{1-\alpha/2} + z_{1-\beta} \right)^2$$<p>makes the two formulas coincide term by term. And once the experiment has run, the plug-in estimates satisfy $\hat{\theta}_A \approx \theta_A$ and $\hat{\theta}_B \approx \theta_B$, so the computed sample sizes converge as well. The Bayesian framework is not buying you a smaller experiment — it is buying you a different interpretation of the same data collected over the same period, subsidising the same number of AV rides.</p><p>The cost of the rewards program does not go down because you chose a different statistical paradigm. The experiment still needs to run for exactly as long as the sample size demands, the rides still need to be subsidised for the duration of it, and the rewards program still costs the same amount of money regardless of what you call the statistical framework governing your decision.</p><p>If there is a genuine desire to reduce experiment duration, the honest levers are: a larger expected effect size (better rewards design), higher tolerance for error ($\epsilon$ or $\alpha$), or accepting lower power. Switching from frequentist to Bayesian and calling it done is not one of them.</p>]]>
    </content>
    <id>https://franciscormendes.github.io/2026/04/10/bayesian-vs-frequentist-sample-size/</id>
    <link href="https://franciscormendes.github.io/2026/04/10/bayesian-vs-frequentist-sample-size/"/>
    <published>2026-04-10T00:00:00.000Z</published>
    <summary>A ride-share AV rewards program, a Bayesian fanatic, and the claim that Bayesian experiments let you peek. They do not. Here is the math.</summary>
    <title>Bayesian A/B Testing Is Not Immune to Peeking: Insights from the AV Marketplace</title>
    <updated>2026-04-10T16:42:50.520Z</updated>
  </entry>
  <entry>
    <author>
      <name>Francisco Romaldo Fernandes Mendes</name>
    </author>
    <category term="autonomous-vehicles" scheme="https://franciscormendes.github.io/categories/autonomous-vehicles/"/>
    <category term="machine-learning" scheme="https://franciscormendes.github.io/categories/autonomous-vehicles/machine-learning/"/>
    <category term="autonomous-vehicles" scheme="https://franciscormendes.github.io/tags/autonomous-vehicles/"/>
    <category term="sensor-fusion" scheme="https://franciscormendes.github.io/tags/sensor-fusion/"/>
    <category term="signal-processing" scheme="https://franciscormendes.github.io/tags/signal-processing/"/>
    <category term="machine-learning" scheme="https://franciscormendes.github.io/tags/machine-learning/"/>
    <category term="embedded-ml" scheme="https://franciscormendes.github.io/tags/embedded-ml/"/>
    <content>
<![CDATA[<h1 id="Introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h1><p>In autonomous driving, perception systems typically rely on photons, i.e., cameras, lidar, and radar. But what if we could also listen to the environment, capturing sound cues that are invisible to traditional vision-based sensors?</p><p>There are many intuitively appealing use cases where an additional sensing modality could enhance awareness of the surroundings. Acoustic sensing itself is not new in automotive systems. For example, ultrasonic sensors have long been used for short-range applications such as parking assistance. Extending this idea to environmental sound sensing—allowing a vehicle to effectively hear its surroundings—has been explored by organizations such as the Fraunhofer Institute and Renesas Electronics. At CVPR ’23, the Princeton Computational Imaging lab presented work that creates 2D “images” from passive acoustic listening using beamforming (more on this later) and fuses them with RGB camera data. </p><p><img src="/2026/03/07/acoustic-sensor-fusion/beamforming.gif" alt="Acoustic Beamforming for Multimodal Scene Understanding: Related work that uses a microphone array to create a pixelized output.  "></p><p>While the Princeton paper was highly influential to this work, our client was interested in passing only certain scenarios, without relying too heavily on (or expending energy on) a highly complex multi-dimensional sensor modality. 
In this post, we explore several motivations for adding a simpler version of passive acoustic sensing to the autonomous vehicle sensor stack.</p><div style="text-align:center;">  <img src="/2026/03/07/acoustic-sensor-fusion/emergency_vehicle_camera.gif"     style="display:block; margin-left:auto; margin-right:auto; max-width:100%;">  <p><em>Sneak Peek of our solution: Flashing red/cyan vehicle is emitting sound</em></p></div><h3 id="Why-consider-acoustic-sensing"><a href="#Why-consider-acoustic-sensing" class="headerlink" title="Why consider acoustic sensing?"></a>Why consider acoustic sensing?</h3><ul><li>Obstructed-view scenarios are increasingly emphasized in safety standards such as Euro NCAP. Detecting hazards before they become visible is critical for improving safety metrics.</li><li>With the rise of autonomous systems in defense and security applications, additional sensing modalities may provide a differentiator when competing for contracts.</li><li>Sound does not require line-of-sight (LoS). 
Important events such as children playing in the street, emergency vehicle sirens, or approaching traffic can be detected even when visually occluded.</li><li>Sound is a natural communication modality for humans, and could provide a mechanism for richer interaction between the environment and the ego vehicle.</li><li>Acoustic signals can intrinsically provide directional information (heading), which can improve situational awareness metrics such as MAPH (Mean Average Precision with Heading).</li><li>Beamforming+RGB outperforms RGB alone in challenging occluded scenarios.</li></ul><h3 id="Key-disadvantages"><a href="#Key-disadvantages" class="headerlink" title="Key disadvantages"></a>Key disadvantages</h3><p>Acoustic sensing also introduces several challenges:</p><ul><li>Passive acoustic systems typically provide Angle-of-Arrival (AoA) information but not reliable distance estimates.</li><li>Performance can degrade due to vehicle noise, wind noise, and environmental interference.</li></ul><h2 id="Toy-Example-Acoustic-Direction-Improves-Early-Detection"><a href="#Toy-Example-Acoustic-Direction-Improves-Early-Detection" class="headerlink" title="Toy Example: Acoustic Direction Improves Early Detection"></a>Toy Example: Acoustic Direction Improves Early Detection</h2><p>To illustrate the value of acoustic sensing, consider a simple scenario:</p><ul><li>An emergency vehicle approaches from the bottom-right relative to the ego vehicle.  </li><li>Acoustic sensing estimates the direction of arrival using time differences of arrival (TDOA) between microphones, but cannot determine distance.  </li><li>Camera and lidar detect the vehicle only once it enters their field of view.</li></ul><p>In the simulation, the vehicle moves toward the ego vehicle. 
The acoustic system continuously estimates a coarse directional sector (a sextant of the scene), while the camera and lidar begin detecting the vehicle only after it enters their sensing range.</p><p>This allows the fusion system to gain early directional awareness, giving planning systems a chance to anticipate the approaching vehicle before visual confirmation. Even though the acoustic angle estimate is noisy, it provides information beyond the field of view of both camera and lidar. After fusing with lidar and camera data, the system produces more accurate position estimates.</p><p><img src="/2026/03/07/acoustic-sensor-fusion/acoustic-kalman-filter-2.png" alt="Sensor Fusion Toy Example"></p><h3 id="Context"><a href="#Context" class="headerlink" title="Context"></a>Context</h3><p>The work described here was originally developed at Reality AI, which was later acquired by Renesas Electronics to explore the commercial feasibility of passive acoustic sensing in automotive systems. My role focused on scaling the solution and validating it across different environments.</p><p>We conducted experiments using simulated emergency sirens in multiple environments, including:</p><ul><li>controlled warehouse setups  </li><li>busy urban streets  </li><li>open environments with realistic traffic noise</li></ul><p>We also collaborated with external partners to collect additional datasets and explore multi-sensor fusion approaches.</p><p>In this article, I will explore PAMVON (Passive Acoustic Monitoring for Vehicles and Objects)—a system that uses microphone arrays, signal processing, and machine learning to detect and localize important acoustic events in the driving environment.</p><h1 id="Passive-Acoustic-Monitoring-PAM"><a href="#Passive-Acoustic-Monitoring-PAM" class="headerlink" title="Passive Acoustic Monitoring (PAM)"></a>Passive Acoustic Monitoring (PAM)</h1><p>Passive Acoustic Monitoring (PAM) detects environmental sounds without emitting signals. Instead, the system passively listens for events in the surrounding environment such as emergency vehicle sirens, horns, tire skids, engine noise, drones or machinery, and even children playing in the street.</p><p>The key advantage of this approach is that sound does not require line-of-sight. Important cues can be detected even when they are visually occluded, in low-light conditions, or in adverse weather. This makes acoustic sensing particularly attractive for early warning scenarios, such as an approaching ambulance that has not yet entered the field of view of the vehicle’s cameras or lidar.</p><p>Recent developments in multimodal large language models also change how one might think about acoustic perception. Rather than requiring a rigid classifier that assigns each sound to a predefined category, modern multimodal systems can reason over audio signals more flexibly and incorporate them into a broader contextual understanding of the scene. 
In practice this means the acoustic signal can act less as a strict classification task and more as an additional stream of environmental information that the perception system can interpret alongside vision and other sensor modalities.</p><h1 id="Microphone-Arrays-and-Beamforming"><a href="#Microphone-Arrays-and-Beamforming" class="headerlink" title="Microphone Arrays and Beamforming"></a>Microphone Arrays and Beamforming</h1><p>Sound, like light, travels in approximately straight lines, so its direction can be inferred from differences in arrival time across spatially separated microphones; in our system, an array of four microphones provides an accurate, unambiguous estimate of the angle of arrival of the sound wave.<br>A single microphone provides limited spatial information. To estimate where a sound originates, passive acoustic monitoring systems typically use small arrays of microphones. By observing the time differences between when a signal reaches each microphone, the system can estimate the direction of arrival of the sound source. Arrays also make it possible to improve signal quality by combining signals from multiple sensors.</p><p>In practice this enables several useful capabilities. The system can estimate the direction of arrival of a sound, approximate the location of the source under certain assumptions, and improve the signal-to-noise ratio by combining measurements across the array.</p><p>Beamforming is the signal processing technique that makes this possible. The idea is simple: signals arriving from a particular direction reach each microphone at slightly different times. 
By applying the appropriate delays and summing the signals together, the array reinforces sounds from the desired direction while suppressing sounds from other directions.</p><p>The microphone array can be visualized like this:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">Mic1 ----------- Mic2</span><br><span class="line">   \               /</span><br><span class="line">    \             /</span><br><span class="line">     \           /</span><br><span class="line">       ( sound )</span><br><span class="line">         source</span><br><span class="line">     /           \</span><br><span class="line">    /             \</span><br><span class="line">   /               \</span><br><span class="line">Mic3 ----------- Mic4</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>In practice the system estimates the relative delay between microphones using cross-correlation. When a sound arrives at the array, it reaches each microphone at slightly different times. By computing the cross-correlation between pairs of microphone signals, the system can estimate the time difference of arrival between them.</p><p>These time differences constrain the direction from which the sound could have originated. With multiple microphone pairs, the system can estimate a consistent direction of arrival for the source.</p><p>Once the delays are known, the array can also combine the microphone signals in a way that reinforces sounds coming from that direction while suppressing others. 
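</p><p>A minimal delay-and-sum sketch (my own illustration, assuming a far-field source and a linear array, not the production implementation):</p>

```python
import numpy as np

def delay_and_sum(signals, mic_x, theta, fs, c=343.0):
    """Steer a linear array toward angle theta (radians) by delay-and-sum.

    signals: (n_mics, n_samples) microphone recordings
    mic_x:   microphone positions along the array axis, in metres
    """
    out = np.zeros(signals.shape[1])
    for m in range(signals.shape[0]):
        # extra path length for a plane wave arriving from direction theta
        delay_samples = int(round(mic_x[m] * np.sin(theta) / c * fs))
        out += np.roll(signals[m], -delay_samples)
    return out / signals.shape[0]
```

<p>Steering toward the true direction aligns the copies of the signal so they add coherently; steering elsewhere leaves them misaligned and the sum partially cancels.</p><p>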
In effect, the array behaves like a steerable listening sensor that can focus on different parts of the acoustic scene.</p><h3 id="Angle-of-Arrival-AoA-Estimation-via-Cross-Correlation"><a href="#Angle-of-Arrival-AoA-Estimation-via-Cross-Correlation" class="headerlink" title="Angle of Arrival (AoA) Estimation via Cross-Correlation"></a>Angle of Arrival (AoA) Estimation via Cross-Correlation</h3><p>In a microphone array, a sound source reaches each microphone at slightly different times. By comparing these signals, the system can estimate the relative delay between them. A common way to do this is through cross-correlation, which measures how similar two signals are as one is shifted in time relative to the other.</p><p>For two microphone signals $x_1(t)$ and $x_2(t)$, the cross-correlation can be written as</p>$$R_{12}(\tau) = \int x_1(t) \, x_2(t+\tau) \, dt$$<p>The time shift $\tau$ that maximizes this correlation corresponds to the time difference of arrival between the two microphones:</p>$$\tau_{\text{max}} = \arg\max_\tau R_{12}(\tau)$$<p>If the microphones are separated by a distance $d$, this delay can be converted into an estimate of the angle of arrival:</p>$$\theta = \arcsin\left(\frac{c \cdot \tau_{\text{max}}}{d}\right)$$<p>where $c$ is the speed of sound.</p><p>In real environments, reflections and background noise can make the correlation peak less reliable. A commonly used approach to improve robustness is generalized cross-correlation with phase transform (GCC-PHAT). This method emphasizes phase information in the frequency domain and reduces the influence of signal magnitude differences:</p>$$R_{12}(\tau) = \mathcal{F}^{-1}\left\{\frac{X_1(f)\, X_2^*(f)}{\left|X_1(f)\, X_2^*(f)\right|}\right\}$$<p>Here $X_1(f)$ and $X_2(f)$ are the Fourier transforms of the microphone signals. 
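</p><p>A compact sketch of the estimator (my own illustration for a single far-field microphone pair, not the production code):</p>

```python
import numpy as np

def gcc_phat(x1, x2, fs):
    """Estimate the delay of x2 relative to x1, in seconds, via GCC-PHAT."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n=n), np.fft.rfft(x2, n=n)
    R = X2 * np.conj(X1)            # conjugation order sets the sign convention
    R /= np.abs(R) + 1e-12          # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift]))  # centre zero lag
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def angle_of_arrival(tau, d, c=343.0):
    """Convert a pairwise delay into a bearing via theta = arcsin(c*tau/d)."""
    return np.arcsin(np.clip(c * tau / d, -1.0, 1.0))
```

<p>In a full array, the same estimate would be computed for several microphone pairs and combined into a single consistent direction of arrival.</p><p>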
The peak of $R_{12}(\tau)$ provides a stable estimate of the arrival delay, which can then be used to infer the direction of the sound source.</p><h1 id="Signal-Processing-Pipeline"><a href="#Signal-Processing-Pipeline" class="headerlink" title="Signal Processing Pipeline"></a>Signal Processing Pipeline</h1><p>Passive acoustic monitoring typically follows a structured processing pipeline:</p><ol><li>Preprocessing: The raw microphone signals are filtered to remove irrelevant frequency bands, and gain normalization ensures consistent amplitude levels across microphones.  </li><li>Time-frequency analysis: Signals are converted into spectrograms using the Short-Time Fourier Transform (STFT), revealing how frequency content evolves over time.  </li><li>Beamforming: Directional enhancement techniques, such as delay-and-sum or cross-correlation-based beamforming, focus on sounds from specific directions while suppressing noise and interference.  </li><li>Event detection: Open-source neural networks, including VGGish, convolutional-recurrent networks (CRNNs), and transformers, analyze the spectrograms to detect and classify events such as sirens, horns, or tire skids.</li><li>Localization: Time Difference of Arrival (TDOA) estimates, often computed using GCC-PHAT cross-correlation, are combined across microphone pairs to infer the direction of incoming sounds and, in some cases, approximate source locations.</li></ol><p>This pipeline allows the system to transform raw audio into actionable information for autonomous vehicle perception, providing early warning of hazards even when they are outside the line of sight of cameras or lidar.</p><h1 id="Acoustic-Sensor-Data-Representation"><a href="#Acoustic-Sensor-Data-Representation" class="headerlink" title="Acoustic Sensor Data Representation"></a>Acoustic Sensor Data Representation</h1><p>In a generalized form, data from a passive acoustic monitoring array can be represented as a tuple capturing the relevant information for 
fusion:</p>$$\displaystyle z_{\mathrm{ac}} = (\theta, \sigma_\theta, c, f, t, p_{\mathrm{ego}})$$<p>Where:</p><ul><li>$\theta$: Estimated angle of arrival (AoA) of the sound, typically computed using TDOA and cross-correlation (GCC-PHAT).  </li><li>$\sigma_\theta$: Uncertainty of the angle estimate, reflecting noise, reverberation, or low SNR.  </li><li>$c$: Sound class probability vector produced by the ML model. The classes correspond to ambulance, police, and other unknown loud sounds. For example, $c = [0.7, 0.2, 0.1]$</li><li>$f$: Frequency-domain features, such as Mel spectrogram or STFT frame, optionally used for downstream ML fusion.  </li><li>$t$: Timestamp of the measurement, to allow temporal alignment with other sensors.  </li><li>$\mathbf{p}_{\text{ego}}$: Pose of the ego vehicle when the measurement was captured, typically $(x, y, \psi)$ in 2D or 3D coordinates.</li></ul><p>This representation allows the acoustic signal to integrate easily into perception and fusion pipelines:</p><ul><li>$\theta$ provides a directional prior for early detection.  </li><li>$c$ informs semantic understanding of the source.  </li><li>$\sigma_\theta$ can be used in probabilistic fusion (e.g., weighted averaging, Kalman updates).  </li><li>$f$ allows future retraining or fine-tuning of ML models.  
</li><li>$t$ and $\mathbf{p}_{\text{ego}}$ allow projection into bird’s-eye view (BEV) maps or occupancy grids alongside camera and lidar data.</li></ul><p>For an array of $N$ microphones, the raw signals can also be stored as:</p>$$\mathbf{X}_{\text{raw}} = [x_1(t), x_2(t), \dots, x_N(t)]$$<p>These raw signals are processed into the generalized form above, providing a compact yet rich representation for sensor fusion.</p><h1 id="Simple-ID-Based-Matching"><a href="#Simple-ID-Based-Matching" class="headerlink" title="Simple ID-Based Matching"></a>Simple ID-Based Matching</h1><p>Before exploring a more technical late fusion approach, we first evaluated a simpler strategy based on ID matching. In this setup, acoustic detections were associated directly with annotated object identities in the dataset.</p><p>The acoustic classifier produced class probabilities for events such as ambulance sirens, police sirens, or other loud sounds. When the classifier detected a high probability ambulance siren, we matched that event to the corresponding object detection annotation in the scene. In practice this meant associating the acoustic event with the object ID labeled as an emergency vehicle in the perception dataset.</p><p>One challenge is that the acoustic detector often produces a directional estimate much earlier than the moment when the vehicle becomes visible and is annotated by the vision system. The acoustic pipeline provides an angle of arrival $\theta$, but not a direct range estimate. To place this information in the BEV representation, we projected the acoustic bearing into the map by creating an artificial point along the direction of arrival at a fixed distance $d$ from the ego vehicle. 
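</p><p>In code, the projection just described is a one-liner (an illustrative sketch; names are mine):</p>

```python
import math

def project_bearing_to_bev(ego_xy, theta, d):
    """Place an artificial BEV point at distance d along the acoustic bearing."""
    return (ego_xy[0] + d * math.cos(theta), ego_xy[1] + d * math.sin(theta))
```

<p>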
The distance was chosen to be larger than the field of view of the camera and lidar sensors so that the acoustic signal could represent a potential source outside the current perception range.</p><p>This artificial point can be written as</p>$$p_{ac} =\begin{bmatrix}x_{ego} \\y_{ego}\end{bmatrix}+d\begin{bmatrix}\cos \theta \\\sin \theta\end{bmatrix}$$<p>where $(x_{ego}, y_{ego})$ is the position of the ego vehicle in BEV coordinates. As the vehicle approaches and eventually enters the sensor field of view, the projected acoustic point becomes spatially consistent with the detected object.</p><p>This approach relies on the object detection pipeline already identifying vehicles and assigning consistent IDs across frames. The acoustic system then acts as an additional signal that confirms the presence of a specific type of vehicle.</p><p>Although simple, this method is surprisingly effective. The acoustic cue provides early detection of emergency vehicles, while the vision system provides precise localization and tracking. By linking the acoustic classification to existing object IDs, the system can quickly identify which tracked object is likely producing the sound.</p><p>This ID-based matching served as a useful baseline before implementing a more general late fusion approach using probabilistic tracking and bearing measurements.</p><h1 id="Late-Fusion-with-an-Existing-BEV-Pipeline"><a href="#Late-Fusion-with-an-Existing-BEV-Pipeline" class="headerlink" title="Late Fusion with an Existing BEV Pipeline"></a>Late Fusion with an Existing BEV Pipeline</h1><p>While the ID-based matching approach provided a strong baseline, it relies on the object already being detected and assigned an identity by the perception pipeline. In many cases the acoustic signal appears earlier, before the vehicle enters the field of view of the cameras or lidar. 
To make better use of this early directional information, we extended the system using a more formal late fusion approach.</p><p>In this setup, acoustic sensing was integrated on top of an existing lidar and camera perception stack. The vision and lidar pipeline already produced tracked objects in bird’s-eye view (BEV), including estimates of position, velocity, and uncertainty. The acoustic sensor then contributed an additional bearing measurement, which could be incorporated into the tracking framework to refine object estimates and improve situational awareness.</p><p>After lidar and camera fusion, each tracked object is represented by a state vector</p>$$\mathbf{x} =\begin{bmatrix}x \\y \\v_x \\ v_y\end{bmatrix}$$<p>where $(x,y)$ represents the position of the object in BEV coordinates and $(v_x, v_y)$ represents the velocity components. The tracker also maintains a covariance matrix</p>$$\mathbf{P}$$<p>which represents the uncertainty of the state estimate.</p><p>The acoustic system produces a bearing measurement corresponding to the direction of arrival of the sound:</p>$$z_{ac} = \theta$$<p>where $\theta$ is the estimated angle of arrival relative to the ego vehicle.</p><p>If the ego vehicle is located at position $(x_e, y_e)$, the predicted bearing of a tracked object can be written as</p>$$h(\mathbf{x}) =\arctan2(y - y_e, \; x - x_e)$$<p>This function maps the tracked object position into the expected acoustic measurement.</p><p>The difference between the observed bearing and the predicted bearing is the innovation:</p>$$\mathbf{y} = z_{ac} - h(\mathbf{x})$$<p>Because the measurement model is nonlinear, we linearize it using the Jacobian</p>$$\mathbf{H} =\begin{bmatrix}\frac{\partial h}{\partial x} &\frac{\partial h}{\partial y} &0 &0\end{bmatrix}$$<p>For the bearing function this yields</p>$$\frac{\partial h}{\partial x} = -\frac{y - y_e}{(x-x_e)^2 + (y-y_e)^2}$$$$\frac{\partial h}{\partial y} = \frac{x - x_e}{(x-x_e)^2 + (y-y_e)^2}$$<p>Given acoustic 
measurement noise $R_{ac}$, the Kalman gain can then be computed as</p>$$\mathbf{K} =\mathbf{P} \mathbf{H}^T(\mathbf{H} \mathbf{P} \mathbf{H}^T + R_{ac})^{-1}$$<p>The updated state estimate becomes</p>$$\mathbf{x}_{new} =\mathbf{x} + \mathbf{K}\mathbf{y}$$<p>and the covariance is updated as</p>$$\mathbf{P}_{new} = (I - \mathbf{K}\mathbf{H})\mathbf{P}$$<p>Since the acoustic sensor only provides directional information, this update primarily reduces uncertainty perpendicular to the acoustic ray while leaving uncertainty along the ray largely unchanged. In practice, this allows acoustic measurements to improve the tracking of objects detected by lidar and camera without requiring modifications to the existing perception pipeline.</p><h1 id="Final-Output"><a href="#Final-Output" class="headerlink" title="Final Output"></a>Final Output</h1><p>The final output of the system is represented in Bird’s-Eye View (BEV) space. The acoustic information can be projected into this space using either of the two methods discussed earlier.</p><p>In the example scene below, the ego vehicle drives past a stationary car that is simulated to emit an emergency vehicle siren. The figure illustrates how the acoustic signal integrates with the rest of the perception stack.</p><p>On the left, we show the acoustic output tagged with an object ID from the real-time object detection system provided by the customer (likely based on a model such as YOLO).</p><p>In the centre, we show the BEV representation, where the estimated angle of arrival (AoA) from the microphone array is plotted as a ray originating from the ego vehicle. Because the clip is only six seconds long, the visualization shows a ray pointing in the direction of the detected emergency vehicle sound from the start of the sequence. 
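</p><p>(For reference, the full bearing update from the previous section fits in a few lines of Python; the 5° measurement noise below is an assumed value for illustration, not a tuned one.)</p>

```python
import numpy as np

def acoustic_bearing_update(x, P, theta_meas, ego_xy, R_ac=np.deg2rad(5.0) ** 2):
    """One EKF update of a BEV track state x = [x, y, vx, vy] with an
    acoustic bearing theta_meas (radians). Sketch only, not production code."""
    dx, dy = x[0] - ego_xy[0], x[1] - ego_xy[1]
    r2 = dx * dx + dy * dy
    # Predicted bearing and Jacobian of h(x) = atan2(y - y_e, x - x_e)
    h = np.arctan2(dy, dx)
    H = np.array([[-dy / r2, dx / r2, 0.0, 0.0]])
    # Innovation, wrapped to [-pi, pi] so bearings near +/-pi behave sensibly
    innov = np.arctan2(np.sin(theta_meas - h), np.cos(theta_meas - h))
    S = H @ P @ H.T + R_ac          # innovation covariance (1x1)
    K = P @ H.T / S                 # Kalman gain (4x1)
    x_new = x + (K * innov).ravel()
    P_new = (np.eye(4) - K @ H) @ P
    return x_new, P_new

# Track 30 m dead ahead of the ego vehicle, observed 0.1 rad off-axis
x = np.array([30.0, 0.0, 0.0, 0.0])
P = np.diag([25.0, 25.0, 4.0, 4.0])
x_new, P_new = acoustic_bearing_update(x, P, theta_meas=0.1, ego_xy=(0.0, 0.0))
```

<p>On this track the update shrinks the lateral, cross-ray variance while leaving the along-ray variance untouched, exactly the behaviour described above.</p><p>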
In this case, the microphones detect the siren before the object enters the field of view of either the camera or the lidar.</p><p>Once the vision-based detector identifies the vehicle, the AoA estimate can be associated with that object, with small corrections applied if necessary to account for sensor alignment or localisation error.</p><p>On the right, we show the lidar point cloud for the same scene. In this example, the acoustic output is not annotated in the lidar view, although such a visualization is also possible.</p><div style="text-align:center;">  <img src="/2026/03/07/acoustic-sensor-fusion/emergency_vehicle_camera.gif"     style="display:block; margin-left:auto; margin-right:auto; max-width:100%;">  <p><em>Camera: Flashing red/cyan vehicle is emitting sound</em></p></div><div style="text-align:center;">  <img src="/2026/03/07/acoustic-sensor-fusion/emergency_vehicle_bev.gif"     style="display:block; margin-left:auto; margin-right:auto; max-width:100%;">  <p><em>BEV: Acoustic AoA Plotted</em></p></div><div style="text-align:center;">  <img src="/2026/03/07/acoustic-sensor-fusion/emergency_vehicle_lidar.gif"     style="display:block; margin-left:auto; margin-right:auto; max-width:100%;">  <p><em>LiDAR</em></p></div><h1 id="Implementation-Considerations"><a href="#Implementation-Considerations" class="headerlink" title="Implementation Considerations"></a>Implementation Considerations</h1><p>The passive acoustic monitoring pipeline can be implemented efficiently on embedded automotive hardware. In our implementation, the audio processing pipeline, machine learning inference, and angle of arrival estimation were designed to run on a single MCU core. This includes signal preprocessing, spectrogram generation, neural network inference, and cross-correlation based localization.</p><p>The system was implemented on Renesas automotive controllers, specifically the RH850 microcontroller family. 
Audio input processing, AI target detection, and angle of arrival estimation ran on a single RH850 core alongside the A2B audio stack. In this configuration the full acoustic pipeline occupied roughly 300 KB of code space, even while running in a debug configuration and without aggressive optimization.</p><p>This relatively small footprint makes it feasible to deploy acoustic sensing alongside other perception tasks without requiring specialized hardware acceleration. On RH850 devices, significant CPU, flash, and RAM resources remain available for additional vehicle functions.</p><p>Microphone array configurations can also be adapted depending on coverage requirements. A four-microphone array provides approximately 180 degrees of coverage, while an eight-microphone configuration enables full 360 degree sensing around the vehicle.</p><p>In practice, the computational requirements depend on the complexity of the processing pipeline. Efficient PAM processing can run entirely on automotive-grade microcontrollers such as the RH850. Larger microphone arrays or more complex neural networks may benefit from more powerful automotive SoCs such as the Renesas R-Car platform. 
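</p><p>The cross-correlation based localization mentioned earlier is, at its core, GCC-PHAT time-delay estimation. A generic textbook sketch (not the RH850 implementation) looks like this:</p>

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """Time delay (seconds) of `sig` relative to `ref` via GCC-PHAT."""
    n = sig.size + ref.size
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.abs(R) + 1e-12                  # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

# Synthetic check: `sig` is `ref` delayed by exactly 5 samples
fs = 16000
x = np.random.default_rng(0).standard_normal(2000)
ref, sig = x[5:1029], x[:1024]
tau = gcc_phat(sig, ref, fs)
```

<p>For a microphone pair with spacing $d_{mic}$, the far-field bearing then follows as $\theta = \arcsin(c\tau / d_{mic})$ with $c \approx 343$ m&#x2F;s, and larger arrays generalize this across microphone pairs.</p><p>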
Regardless of the hardware platform, maintaining real-time processing is critical so that acoustic events can be incorporated into the perception pipeline with minimal latency.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">   Microphone Array</span><br><span class="line">(4 or 8 digital microphones)</span><br><span class="line">         │</span><br><span class="line">         │</span><br><span class="line">         ▼</span><br><span class="line"> 
+------------------+</span><br><span class="line"> |   A2B Audio Bus  |</span><br><span class="line"> | (Automotive Audio|</span><br><span class="line"> |   Backbone)      |</span><br><span class="line"> +------------------+</span><br><span class="line">         │</span><br><span class="line">         │</span><br><span class="line">         ▼</span><br><span class="line"> +----------------------+</span><br><span class="line"> |   RH850 MCU          |</span><br><span class="line"> |----------------------|</span><br><span class="line"> |  Audio Preprocessing |</span><br><span class="line"> |  STFT / Spectrogram  |</span><br><span class="line"> |  VGGish Inference    |</span><br><span class="line"> |  GCC-PHAT (TDOA)     |</span><br><span class="line"> |  AoA Estimation      |</span><br><span class="line"> +----------------------+</span><br><span class="line">         │</span><br><span class="line">         │</span><br><span class="line">         ▼</span><br><span class="line"> +----------------------+</span><br><span class="line"> |  Acoustic Detection  |</span><br><span class="line"> |  θ (bearing)         |</span><br><span class="line"> |  class probabilities |</span><br><span class="line"> +----------------------+</span><br><span class="line">         │</span><br><span class="line">         │</span><br><span class="line">         ▼</span><br><span class="line"> +----------------------+</span><br><span class="line"> |   BEV Fusion Layer   |</span><br><span class="line"> | (Camera + Lidar +    |</span><br><span class="line"> |    Acoustic)         |</span><br><span class="line"> +----------------------+</span><br><span class="line">         │</span><br><span class="line">         ▼</span><br><span class="line"> +----------------------+</span><br><span class="line"> |  Tracking / Planning |</span><br><span class="line"> +----------------------+</span><br><span class="line"></span><br></pre></td></tr></table></figure><h1 id="Conclusion"><a href="#Conclusion" 
class="headerlink" title="Conclusion"></a>Conclusion</h1><p>Passive acoustic monitoring has shown significant potential but has not yet become standard in autonomous vehicle perception stacks. There are several challenges that limit its adoption:</p><ol><li>Ambient noise and signal variability – urban environments are full of sounds that can mask sirens, horns, and other important cues.  </li><li>Environmental acoustic complexity – reflections, occlusions, and vibrations from the vehicle itself make accurate localization difficult.  </li><li>Automotive qualification and safety standards – microphones and processing hardware must meet rigorous requirements such as ISO 26262 and AEC-Q100, and survive extreme temperatures and vibrations.  </li><li>Limited generalization of machine learning models – systems that perform well in controlled tests can struggle on highways, in multi-siren urban settings, or with unusual sound events.  </li><li>No regulatory requirement – without a mandate from safety standards or OEMs, there is little commercial incentive to integrate acoustic sensing into production vehicles.</li></ol><p>Despite these obstacles, acoustic sensing can still provide value when used as a complementary modality. Integrating sound cues through late fusion on top of camera and lidar tracks allows early warnings of approaching emergency vehicles or other hazards, even before they enter the field of view. In this way, the acoustic signal reinforces and augments traditional sensors, enhancing situational awareness without requiring a full redesign of the perception stack. Performance improvements were observed in Euro NCAP obstructed-view test scenarios, demonstrating the practical benefit of including an acoustic modality in complex urban environments.</p><h1 id="References"><a href="#References" class="headerlink" title="References"></a>References</h1><ul><li><p>Renesas Electronics. 
Seeing Sound: AI-Based Detection of Participants in Automotive Environment Using Passive Audio. White Paper.<br><a href="https://www.renesas.com/en/document/whp/seeing-sound-ai-based-detection-participants-automotive-environment-passive-audio?r=1626806">https://www.renesas.com/en/document/whp/seeing-sound-ai-based-detection-participants-automotive-environment-passive-audio?r=1626806</a></p></li><li><p>Princeton University Light + Sound Interaction Lab. Seeing with Sound.<br><a href="https://light.princeton.edu/publication/seeingwithsound/">https://light.princeton.edu/publication/seeingwithsound/</a></p></li></ul>]]>
    </content>
    <id>https://franciscormendes.github.io/2026/03/07/acoustic-sensor-fusion/</id>
    <link href="https://franciscormendes.github.io/2026/03/07/acoustic-sensor-fusion/"/>
    <published>2026-03-07T00:00:00.000Z</published>
    <summary>Exploring PAMVON, a passive acoustic monitoring system for emergency vehicle detection, and the challenges preventing its production adoption.</summary>
    <title>Beyond Photons: Passive Acoustic Sensing for Autonomous Vehicles</title>
    <updated>2026-04-10T11:52:56.822Z</updated>
  </entry>
  <entry>
    <author>
      <name>Francisco Romaldo Fernandes Mendes</name>
    </author>
    <category term="mathematics" scheme="https://franciscormendes.github.io/categories/mathematics/"/>
    <category term="mathematics" scheme="https://franciscormendes.github.io/tags/mathematics/"/>
    <category term="fourier-transform" scheme="https://franciscormendes.github.io/tags/fourier-transform/"/>
    <category term="physics" scheme="https://franciscormendes.github.io/tags/physics/"/>
    <category term="quantum-computing" scheme="https://franciscormendes.github.io/tags/quantum-computing/"/>
    <category term="algorithms" scheme="https://franciscormendes.github.io/tags/algorithms/"/>
    <content>
      <![CDATA[<h1 id="Introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h1><p>Sometimes it does seem like my blog is just increasingly complex applications of the Fourier Transform. In the previous post we applied the Fourier Transform to graphs, drawing connections between frequency (which is the usual Fourier transform) and properties of the graph.  There is yet another interesting, if abstract, application of the Fourier transform that is used in Quantum computers. Somewhat surprisingly, it is called the “Quantum Fourier Transform”. More specifically, we will study how the Fourier Transform appears as a unitary linear operator acting on quantum states. </p><p>At the end of the day this is all just linear algebra, requiring no knowledge of actual quantum physics. Because the Quantum Fourier Transform can be somewhat mathematically abstract and also because the Fourier Transform is so easily visualized as a decomposition into various sines and cosines, I thought of coming up with a similar visualization for the Quantum Fourier Transform case (spoiler: it involves clocks). </p><h1 id="Motivation"><a href="#Motivation" class="headerlink" title="Motivation"></a>Motivation</h1><p>Before discussing in detail what the QFT is mathematically, it is useful to recap what the Fourier transform is in general. The Fourier transform is a way of transforming information from one domain to another domain. Why? Because certain operations become simpler in the transformed domain. For example, in classical signal processing, convolution of a signal (the mathematical definition of filtering) in the time domain corresponds to simple multiplication in the frequency domain. </p><p>In the graph setting, we saw that potentially complex behaviors in the edge-node representation of the graph were far more mathematically tractable when looking at the “frequency” equivalent of the graph. 
Eigenvectors of the graph Laplacian isolate modes of variation: low-frequency components capture global structure, while high-frequency components capture local fluctuations.</p><p>Similarly, for the Quantum Fourier Transform, we move from a bit representation of a number to a cyclical or phase representation. In the computational basis, information is stored as binary digits, essentially a sequence of ON&#x2F;OFF switches taking values in $\{0,1\}$.</p><p>In this form, the data is linear and rigid. Any underlying periodic structure is hidden inside the positional encoding. Phases, however, live on the circle and are inherently cyclical. If we want to detect periodicity or modular structure, it is more natural to encode information as rotations rather than switches.</p><p>The QFT therefore plays the same conceptual role as the classical Fourier transform: it changes coordinates to a representation in which the problem’s hidden structure becomes easier to manipulate.</p><p>I might write a later post on why this holds for so many different problems. It does not hold for every problem, though: when you need convolution to learn a local filter, for instance, the original domain is the natural one.</p><h1 id="Useful-Intuition"><a href="#Useful-Intuition" class="headerlink" title="Useful Intuition"></a>Useful Intuition</h1><p>One of the reasons the Fourier transform in its simplest form is so interesting is that it is so visual. In this blog post I will try to provide a similarly visual explanation of the QFT. Essentially, we want to draw a connection between the binary representation of a number and the cyclical nature of the QFT. Fortunately, the basic unit of information on a quantum computer, the qubit, admits a nice visual representation. 
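</p><p>Before drawing any pictures, it helps to see the QFT concretely as the unitary matrix promised above (a small numpy sketch; note that its sign convention is the conjugate of <code>np.fft.fft</code>, up to the $1/\sqrt{N}$ normalization):</p>

```python
import numpy as np

def qft_matrix(n_qubits):
    """Dense QFT matrix on n qubits: F[j, k] = omega**(j*k) / sqrt(N),
    with omega = exp(2*pi*1j / N) and N = 2**n_qubits."""
    N = 2 ** n_qubits
    j, k = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return np.exp(2j * np.pi * j * k / N) / np.sqrt(N)

F = qft_matrix(3)                                # 8 x 8
assert np.allclose(F @ F.conj().T, np.eye(8))    # unitary, as promised
psi = F @ np.eye(8)[1]                           # QFT of the basis state |1>
assert np.allclose(np.abs(psi), 1 / np.sqrt(8))  # equal magnitudes everywhere
```

<p>Every amplitude of the transformed state has the same magnitude; all of the information sits in the phases, which is exactly what the visualization below depicts.</p><p>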
It is this qubit representation that we will visualize.</p><h1 id="A-Useful-Visualization"><a href="#A-Useful-Visualization" class="headerlink" title="A Useful Visualization"></a>A Useful Visualization</h1><p><img src="/2026/02/28/quantum-fourier-transform/all_face_animation_2.gif" alt="4 Qubit QFT Animation"></p>]]>
    </content>
    <id>https://franciscormendes.github.io/2026/02/28/quantum-fourier-transform/</id>
    <link href="https://franciscormendes.github.io/2026/02/28/quantum-fourier-transform/"/>
    <published>2026-02-28T00:00:00.000Z</published>
    <summary>Visual guide to the Quantum Fourier Transform: from binary numbers and roots of unity to the QFT circuit, with comparisons to classical DFT and implications for Shor's algorithm.</summary>
    <title>From Bits to Clocks: A Visual Intuition for the Quantum Fourier Transform</title>
    <updated>2026-04-10T14:24:00.558Z</updated>
  </entry>
  <entry>
    <author>
      <name>Francisco Romaldo Fernandes Mendes</name>
    </author>
    <category term="opinion" scheme="https://franciscormendes.github.io/categories/opinion/"/>
    <category term="career" scheme="https://franciscormendes.github.io/tags/career/"/>
    <category term="artificial-intelligence" scheme="https://franciscormendes.github.io/tags/artificial-intelligence/"/>
    <content>
<![CDATA[<h1 id="Introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h1><p>I was recently invited back to the MA department at UChicago for a career conference. Sitting there, listening and speaking, I found myself asking a rather uncomfortable question:</p><p><em>How much of what we value in education is pure signaling? Is this still true in the age of AI?</em></p><p>It is perhaps an opportune moment to recap the signaling model of education. In labour markets with asymmetric information, employers cannot directly observe ability. In Michael Spence’s signaling model, education does not necessarily increase productivity; instead, it separates high-ability individuals from others because it is less costly for them to acquire. In this paradigm, education serves as a “signal” of ability.</p><p>I think AI has changed this status quo because the cost of acquiring education has fallen to the point that there is no cost differential between high-ability and low-ability individuals for a large number of courses. To be more specific, the cost of sending a signal of education has fallen to the point of being indistinguishable between the two groups. The cost of actually educating oneself is likely still lower for high-ability individuals; it’s just that sending the signal is easier. </p><p>This essay is intended to answer some of the questions that I received at the conference, outlined below:</p><ol><li>But what does “actually” educating oneself really mean? </li><li>What does it look like? Which classes should I take? </li><li>What should be the emphasis of my self-study? 
</li><li>How do I position myself best for the job market?</li></ol><h1 id="Beyond-The-Signal-So-What-Should-I-Study"><a href="#Beyond-The-Signal-So-What-Should-I-Study" class="headerlink" title="Beyond The Signal: So What Should I Study?"></a>Beyond The Signal: So What Should I Study?</h1><p>In the old (read: pre-AI) world where education was largely signaling, I think taking classes that signaled education superficially but reliably, such as cloud skills, basic Python programming, and machine learning applications, was good enough. But in the new world, the cost of acquiring these skills is zero. Thus high-ability individuals need to seek out higher-difficulty tasks, those that are relatively cheaper for them to acquire, in order to send a strong signal. Mathematical maturity, comfort with abstraction, and disciplined reasoning are not signals in themselves; they are capabilities that affect what you can build, debug, or invent.</p><p>Class choices should therefore reflect these core values:</p><ul><li><p>Mathematical courses that emphasize the core mathematics underlying machine learning, such as linear algebra and differential equations</p></li><li><p>Courses that look under the hood of machine learning, focusing on its mathematical fundamentals</p></li><li><p>Social science courses that challenge your world view and force you to think about what the world <em>should</em> look like (more on this below)</p></li></ul><h1 id="Good-Intellectual-Health"><a href="#Good-Intellectual-Health" class="headerlink" title="Good Intellectual Health"></a>Good Intellectual Health</h1><p>More important than ever, and not specific to tech jobs but to life in general, is maintaining good intellectual health.</p><p>Reading books both in your field and outside of it is now perhaps more important than it was in the world before AI. Using AI increases one’s distance from oneself. One’s ideas and one’s thoughts are now further than ever from one’s own experience. 
Reading books and writing reduces this distance. Since idea generation and critical thinking depend so heavily not only on the final output but also on the process by which one reaches it, exercising this muscle is now more important than ever.</p><p>Maintaining good intellectual health, however, is almost entirely self-policed. There are very few reliable ways to monitor how much AI shapes one’s own work. What usually starts as submitting homework in a rush can escalate to generating entire essays with AI; the slope is truly slippery. One cannot afford to replace the cognitive effort that builds depth, originality, and judgment. Only <em>you</em> can decide if the level of AI use hampers your intellectual health, and only you can feel its effects. </p><h1 id="Emphasizing-the-Social-Sciences"><a href="#Emphasizing-the-Social-Sciences" class="headerlink" title="Emphasizing the Social Sciences"></a>Emphasizing the Social Sciences</h1><p>The sciences are exceptionally good at helping us understand what the world is. As a result, advice about improving technical skills tends to be prescriptive and measurable. The social sciences operate differently. They help us think about what the world <em>should</em> look like. They force us to articulate assumptions about behaviour, incentives, norms, and institutions. The process of forming a view about what the world ought to be is central to intellectual health. It requires reflection, judgment, and an awareness of values, not just optimisation. Admittedly, this is difficult advice to give at a career conference for students focused purely on technical roles. The impact of studying sociology, psychology, or economics is harder to measure in a tech performance review. It doesn’t map cleanly onto a skills matrix. But it is no less important for that reason. The social sciences implicitly construct world models. 
Whether in sociology, psychology, or economics, they offer structured ways of thinking about how systems of people behave. That kind of world-building is essential for understanding where highly parameterised models, such as those produced in machine learning, actually live. Models do not operate in a vacuum; they operate within social and economic systems.</p><p>This becomes even clearer in business contexts. Firms operate with explicit views of what the world should look like, in terms of acquisition, churn, retention, revenue. Machine learning systems are deployed inside those normative visions. I admit there is something slightly distasteful about motivating the social sciences purely in terms of churn or revenue. It feels almost sacrilegious. But in practice, those incentives shape the environments in which technical systems are built. And if that were not the case, the audience at a career conference might be asking very different questions, comrade.</p><h1 id="TL-DR"><a href="#TL-DR" class="headerlink" title="TL;DR;"></a>TL;DR;</h1><p>The “sticker” value of UChicago’s education has held steady relative to other similar institutions. It might even have appreciated slightly. However, the absolute “sticker” value of education as a signal of ability in top schools (and indeed everywhere else) has gone down. Thus the onus is now on students to take courses that more appropriately signal their ability, not just in purely technical terms (such as mathematics, physics, machine learning) but also in critical thinking terms (such as expertise in the social sciences). The days of superficial knowledge that use <code>model.fit(X)</code> are over. </p><p>The UChicago brand will likely hold its value for years to come but it is not going to be enough. 
Even though the bar to superficial knowledge has been lowered, muddying the difference between high- and low-skill individuals, the bar for a truly fundamental understanding of the sciences, including (and perhaps especially) the social sciences, has never been higher.</p>]]>
    </content>
    <id>https://franciscormendes.github.io/2026/02/20/ssd-career-conference/</id>
    <link href="https://franciscormendes.github.io/2026/02/20/ssd-career-conference/"/>
    <published>2026-02-20T00:00:00.000Z</published>
    <summary>Spence's signaling model updated for AI: when the cost of educational signaling collapses to near-zero, what genuine intellectual skill looks like and how to build it.</summary>
    <title>Signaling, Skills, and Intellectual Health in the Age of AI: Thoughts from UChicago Career Conference 2026</title>
    <updated>2026-04-10T14:24:00.564Z</updated>
  </entry>
  <entry>
    <author>
      <name>Francisco Romaldo Fernandes Mendes</name>
    </author>
    <category term="book-review" scheme="https://franciscormendes.github.io/categories/book-review/"/>
    <category term="book-review" scheme="https://franciscormendes.github.io/tags/book-review/"/>
    <category term="fiction" scheme="https://franciscormendes.github.io/tags/fiction/"/>
    <content>
<![CDATA[<h1 id="Introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h1><p>I did not spend my twenties reading Murakami, when it was all the rage. Now, having read three of his works, I feel an upswell of opinions on his writing, and on the cultural symbol he has become.</p><p>Murakami seemed like the sort of writer you are supposed to like, especially in your twenties. Sadly, my twenties flew by rather quickly without so much as a glance at a Murakami novel. And there were several — part of Murakami’s appeal is how prolific he is across a variety of genres. Now in my, arguably still early, thirties I have read three books of his: <em>Kafka on the Shore</em>, <em>First Person Singular</em>, and <em>The Wind-Up Bird Chronicle</em>. While my views on Murakami remain lukewarm at best, his writing certainly inspires deeper engagement with broader themes in society.</p><h1 id="Writing"><a href="#Writing" class="headerlink" title="Writing"></a>Writing</h1><p>The English literary tradition has always been deeply rooted in the beauty of language; it is almost as if the words carrying the story must match the beauty of the story itself. The result can be complex, layered prose that oftentimes outlasts the literary work itself. From their very opening lines, the classics sought to set the stage with beautiful prose.</p><p><em>“Call me Ishmael…”</em>, <em>“It was the best of times, it was the worst of times…”</em></p><p>Compare this with Murakami, whose writing proceeds forth incessantly in its banality. The words easily slide off the page as if narrated by a friend over the telephone. They do not linger; they hurry off the page carrying their message with great efficacy. 
He does not, however, use this efficiency to drive more of the plot forward, choosing instead to match the banality of his prose with descriptions of the banalities of the human condition — eating, sleeping, and listening to music. It seems as if Murakami rejects the aestheticism of both the prose and the story. One cannot imagine Dickens devoting a paragraph to what the main character ate for breakfast.</p><p>One should not leave with the impression that the resulting writing is uninspired or insipid. On the contrary, the effect of his writing is a highly atmospheric narrative style that attenuates his trademark surrealistic elements. The banalities serve to obscure or highlight the passage of time, a critical element of his surrealistic themes. The reader is drawn into a different world, and very often drawn into a different supernatural world within that world.</p><p>A long-standing critique of English literature prior to Murakami was that it was almost inaccessible to people learning English for the first time. In my eyes this was largely a consequence of English speakers dominating English writing, whereas Murakami does not speak English as his first language. Nothing exemplifies this more than the fact that Murakami came upon his extraordinarily simple writing style by simply translating his English prose to Japanese and then back, thus losing all but its most essential elements. Literary essentialism, some (this author) would call it.</p><p><img src="/gallery/thumbnails/murakami-shrine.jpg" alt="Kawase Hasui — Snow at Nezu Gongen Shrine (1933)"><br><em>Nothing is happening here. The shrine stands. The snow falls. And yet — this is precisely the kind of scene Murakami would spend three pages on, and you would read every word of it. The atmosphere is the point; the banality is the vehicle. 
This is the closest image I can find to what it actually feels like to read him.</em></p><h1 id="Eastern-Storytelling"><a href="#Eastern-Storytelling" class="headerlink" title="Eastern Storytelling"></a>Eastern Storytelling</h1><p>There is a tension between Eastern and Western storytelling, and this tension is apparent even in the differences in children’s stories. In Grimm’s fairy tales, for example, we have a clearly defined protagonist who must weather the odds, defeat the antagonist, and eventually prevail. In Eastern storytelling the beauty of the story is much more important than what the story means. Consider <em>The Crane Wife</em>, a well-known Japanese children’s story. A crane transforms into a beautiful woman; this beautiful woman proposes to a poor fisherman. The fisherman agrees, but the woman imposes one condition: he can never watch her while she weaves. One day the fisherman watches her while she weaves; he sees that she is a white crane. He leaves her. The story ends, rather abruptly. This ending is distressing, especially to Western audiences. Why does the story end? The ending is so sad — how <em>can</em> it end yet? What does this all mean? Beauty, I suppose, is the key to this difference. This is a beautiful story and the sadness is beautiful.</p><p><img src="/gallery/thumbnails/murakami-moonlight.jpg" alt="Kawase Hasui — Tsuki no Matsushima (1919)"><br><em>The moon reflects on the water. The islands sit in the dark. No story. No explanation. No moral. And it does not matter — the image is enough. This is what Eastern aesthetic beauty looks like when it works. Murakami is reaching for something like this. I am not always sure he grasps it.</em></p><p>I have the same visceral reaction to Murakami’s stories. 
I find myself asking at the end of every book:</p><p><em>But what does this all mean?</em></p><p>While I recognize that this cultural difference is at the heart of why people react negatively to Murakami’s writing, I find it hard to reconcile this with the fact that Murakami’s writing forces you to do one of two things.</p><p>The first is to take the story literally. This involves taking every supernatural act, every bizarre event as literal and believing it. This is not hard — we do this to some degree with all works of fiction, from Tolkien to Kafka. We are (I am) willing to suspend disbelief. However, the stories take themselves seriously. In <em>The Metamorphosis</em>, while we are never offered an explanation for why Samsa is a monstrous insect, the reactions to him and his reactions to himself treat his metamorphosis as real. The story takes itself seriously and accepts the apparent inexplicability of the metamorphosis as given. This is not the effect that Murakami’s writing has on me. His writing evokes bizarre situations such as the insect, but only weakly; moreover, there are a great many such situations. The immoderation in the supernatural and the bizarre requires a much higher degree of suspension of disbelief, which makes it much harder for the reactions of other characters to be believable. It reminds me of the famous Christopher Nolan quote:</p><blockquote><p><em>“It does not matter how believable the story is to you; the story must be believable to itself and its characters.”</em></p></blockquote><p>It is this inviolable rule that Murakami breaks multiple times.</p><p>The second is to take the story as some kind of metaphor. Again, Kafka’s writing has this effect as well — we can think of the insect-like transformation of Gregor Samsa as a kind of moral corruption, stagnation, or emasculation. 
However, because Murakami uses characters, bizarre events, and other supernatural motifs so liberally, it is difficult for the metaphor to retain any coherent narrative structure, let alone a consistent representation of something else.</p><p>In both cases, it seems as if Murakami is willing to sacrifice coherence and linguistic beauty for some kind of narrative aesthetic. To me this sacrifice was not worth it, since there are far too many characters and motifs that seem to exist solely to move the plot along. Far too many characters are sacrificed on this imagined altar of aesthetic beauty. My objection does not arise out of concern for the wellbeing of these characters, but rather from the fact that they seem superficial — which leads naturally to my next criticism.</p><h1 id="Superficiality"><a href="#Superficiality" class="headerlink" title="Superficiality"></a>Superficiality</h1><p>The main characters in Murakami’s books can be disappointingly without agency. They can seem as if they are carried away by the wave of the narrative. This matches Murakami’s style in his own words: he creates the characters first and then places them in a story. Almost like a simulation — this makes the storytelling easy.</p><p>Again, this could be the difference between Eastern and Western protagonists. I do not agree with this, however. I think Murakami’s characters are quite American in a modern way. The protagonist is like the main character in a pop culture film — hidden away, not a part of society. But then society needs him, or something happens to him, and he must act in the midst of it. In some strange way this superficiality matches the aesthetic of Murakami’s writing. In some ways, I consider Murakami to be a modern American author, as much as Paul Auster. To Murakami’s credit, I suspect this imitation might not be entirely unintentional. 
This imitation evokes the adoption of Western individualism by Japanese society — fairly thin, and without the corresponding import of Christian ethics. Murakami laments the lack of family connections in Japanese society.</p><p>Similarly, supporting characters exist only as reflections of the main character. In all the books that I read, I was not able to identify one single character that had anything remotely resembling a personality. Murakami writes a superficial main character and every other character exists to reflect that character back to himself. Bizarrely, Murakami’s novels feel two-dimensional — you are drawn into an atmospheric but ultimately flat world. Some things feel real, but the lack of dimension is apparent. It has to be said that this is appealing to some; others describe this as “dreamy”, “vague”, and “beautifully foggy”. It is likely that this flaw uniquely penetrates my intellectual armor more so than others.</p><p>I have many issues with the way women are written in Murakami’s novels. I will leave it at that.</p><h1 id="Japanese-Psyche"><a href="#Japanese-Psyche" class="headerlink" title="Japanese Psyche"></a>Japanese Psyche</h1><p>It is somewhat contradictory that Murakami is surprisingly modern, and almost comes across as an American writer in some sense. Yet the questions his books raise about Japanese identity — individualism imported wholesale from the West, the erosion of family and community — are distinctly Japanese concerns, and they are the more interesting for it.</p><h1 id="Conclusion"><a href="#Conclusion" class="headerlink" title="Conclusion"></a>Conclusion</h1><p>I find myself, having now read three of his novels, in the rather uncomfortable position of a reluctant critic. Murakami is undeniably significant. He has done more for the global reach of Japanese literature than perhaps any other living author, and his ability to inhabit the borderlands between the real and the supernatural is a genuine literary achievement. 
His cultural impact is not nothing, as the young person in every bookshop clutching a copy of <em>Norwegian Wood</em> will attest.</p><p>But the books themselves leave me cold — not in a sterile sense. They are atmospheric, readable, and at times deeply evocative. I always emerge from them, however, without the feeling of having had a meaningful encounter with another human mind. The characters drift, the plots dissolve, and one is left with that same persistent question.</p><p><em>But what does this all mean?</em></p><p>I suspect that for his devoted readers, the answer is in the question itself. The asking is the point. The fog is the destination. I remain unconvinced, but I respect the fog.</p><p><img src="/gallery/thumbnails/murakami-mist.jpg" alt="Yoshida Hiroshi — A Misty Day in Nikko (1936)"><br><em>Murakami’s world looks something like this — solid enough to walk through, obscured enough to never quite see the edges of. The fog does not owe you an explanation. I have made my peace with this, though not enough to enjoy it.</em></p>]]>
    </content>
    <id>https://franciscormendes.github.io/2026/01/15/on-murakami/</id>
    <link href="https://franciscormendes.github.io/2026/01/15/on-murakami/"/>
    <published>2026-01-15T00:00:00.000Z</published>
    <summary>Having now read three of his works — Kafka on the Shore, First Person Singular, and The Wind-Up Bird Chronicle — some lukewarm opinions on Murakami.</summary>
    <title>On Murakami</title>
    <updated>2026-04-10T12:14:28.067Z</updated>
  </entry>
  <entry>
    <author>
      <name>Francisco Romaldo Fernandes Mendes</name>
    </author>
    <category term="mathematics" scheme="https://franciscormendes.github.io/categories/mathematics/"/>
    <category term="mathematics" scheme="https://franciscormendes.github.io/tags/mathematics/"/>
    <category term="fractals" scheme="https://franciscormendes.github.io/tags/fractals/"/>
    <content>
      <![CDATA[<h1 id="Introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h1><p>If you’ve ever come across the coastline paradox, you’ve probably seen the classic (and somewhat overused) image of the coastline of Britain. Recently, a friend asked me a question that felt like the 3D analogue of this paradox: What is the surface area of a city? More specifically, does a very hilly city have more surface area than a relatively flat one?</p><p>The answer, as it turns out, is more complicated than it first appears. My initial instinct was to treat this as the 3D version of the coastline paradox, and that idea sent me down a rabbit hole—one whose key insights form the basis of this blog post.<br><strong>The complete follow-along notebook can be found <a href="https://github.com/FranciscoRMendes/coastline-paradox-3d">here</a>.</strong></p><p>Here’s how the post is structured:</p><ol><li><p>Visualizing the 2D coastline paradox using the Koch curve, a well-known fractal curve.</p></li><li><p>Extending this to the 3D case by visualizing the surface area paradox with a fractal terrain.</p></li><li><p>Applying these ideas to real-world GIS data to verify the paradox in practice.</p></li><li><p>Exploring the concept of dimension.</p></li></ol><p>Point 4 turned out to be particularly enlightening. In researching this post, I realized that the way we commonly think about “dimension”—1D, 2D, 3D—is not mathematically rigorous. The coastline paradox and its 3D surface area counterpart only exist because our intuitive notion of dimension is incomplete. 
In fact, dimensions can be fractional, and by using the results from sections 1, 2, and 3, we can actually measure them and gain a deeper understanding of the geometry underlying these paradoxes.</p><p><img src="/2025/12/16/3d-coastline-paradox/greatbritainislandcoastlineparadox-gb.webp" alt="Coastline Paradox of Great Britain"></p><h1 id="2D-Coastline-Paradox"><a href="#2D-Coastline-Paradox" class="headerlink" title="2D Coastline Paradox"></a>2D Coastline Paradox</h1><p><img src="/2025/12/16/3d-coastline-paradox/koch-curve.png" alt="Koch Curve, a simulated &quot;coastline&quot; that is known to be fractal"><br><img src="/2025/12/16/3d-coastline-paradox/koch-curve-growth.png" alt="Measured length versus ruler size for the Koch curve"></p><p>The figures above illustrate the coastline paradox using a Koch curve, a classic fractal curve. As the ruler size decreases, the measured length of the curve increases dramatically, highlighting that the “true” length of a jagged, self-similar shape is not well-defined. In the top plot, we visualise the Koch curve after six iterations, showing its intricate zig-zag pattern. 
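</p><p>The construction behind these plots is compact enough to sketch. Below is a minimal version assuming only NumPy: <code>koch</code> builds the curve by motif replacement and <code>ruler_length</code> walks a divider of size <code>eps</code> along it. Both function names are illustrative, not the notebook’s actual API.</p>

```python
import numpy as np

def koch(points, iterations):
    # Replace each segment with the four-segment Koch motif.
    rot = np.array([[0.5, -np.sqrt(3) / 2], [np.sqrt(3) / 2, 0.5]])  # +60 degrees
    for _ in range(iterations):
        out = []
        for p, q in zip(points[:-1], points[1:]):
            d = q - p
            out += [p, p + d / 3, p + d / 3 + rot @ (d / 3), p + 2 * d / 3]
        out.append(points[-1])
        points = np.array(out)
    return points

def ruler_length(points, eps):
    # Divider method: step a ruler of size eps from vertex to vertex.
    pos, count = points[0], 0
    for vertex in points[1:]:
        if np.linalg.norm(vertex - pos) >= eps:
            pos = vertex  # coarse step; fine for illustration
            count += 1
    return count * eps

curve = koch(np.array([[0.0, 0.0], [1.0, 0.0]]), 6)
for eps in (0.3, 0.1, 0.03, 0.01):
    print(eps, ruler_length(curve, eps))
```

<p>Shrinking <code>eps</code> yields a larger measured length, which is exactly the growth the log–log plot records.</p><p>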
The bottom plot demonstrates the paradox quantitatively: on a log–log scale, smaller ruler sizes (on the right) capture finer details, resulting in a rapidly increasing measured length. This simple experiment illustrates why fractal curves require a scale-invariant descriptor—the Minkowski or box-counting dimension—to characterise their complexity, rather than relying on a single length measurement.</p><h2 id="Mathematical-Proof"><a href="#Mathematical-Proof" class="headerlink" title="Mathematical Proof"></a>Mathematical Proof</h2><p>Consider a jagged curve (e.g., a coastline) in 2D, and let $L(\varepsilon)$ denote the measured length using a ruler of size $\varepsilon$.</p><ol><li>Divide the curve into segments of length $\varepsilon$. Let $N(\varepsilon)$ be the number of segments required to cover the curve:</li></ol>$$L(\varepsilon) \approx N(\varepsilon) \cdot \varepsilon$$<ol start="2"><li>Assume the curve is fractal with Minkowski–Bouligand dimension $D$, so the number of boxes needed to cover the curve scales as:</li></ol>$$N(\varepsilon) \sim \varepsilon^{-D}$$<ol start="3"><li>Substitute the scaling relation into the length formula:</li></ol>$$L(\varepsilon) \sim \varepsilon \cdot \varepsilon^{-D} = \varepsilon^{1-D}$$<ol start="4"><li>Interpretation:</li></ol><ul><li>If the curve is smooth: $D = 1$, then $L(\varepsilon) \sim \varepsilon^{0} = \text{constant}$.</li><li>If the curve is fractal: $D > 1$, then $L(\varepsilon) \to \infty$ as $\varepsilon \to 0$.</li></ul><p>This demonstrates the paradox: the measured length depends on the ruler size, and only the fractal dimension $D$ provides a scale-invariant measure of the curve’s complexity.</p><ol start="5"><li>Recovering the fractal dimension from data:</li></ol>$$D = 1 - \frac{d \log L(\varepsilon)}{d \log \varepsilon}$$<ul><li>On a log–log plot of $L(\varepsilon)$ vs $\varepsilon$, the slope is $1-D$.</li><li>This allows us to characterise the roughness of the curve quantitatively.</li></ul><h2 
id="3D-Coastline-Paradox"><a href="#3D-Coastline-Paradox" class="headerlink" title="3D Coastline Paradox"></a>3D Coastline Paradox</h2><p>The figure below demonstrates the geographical area paradox, the 3D analogue of the coastline paradox. Here, we measure the surface area of a fractal terrain generated using the diamond-square algorithm. As the size of the measurement “ruler” (square grid) decreases, the measured surface area increases, revealing more of the fine-scale roughness of the terrain. Just as the length of a fractal curve diverges with smaller ruler sizes, the area of a fractal surface grows without bound. This shows that for rough surfaces, the conventional notion of area is ill-defined at very small scales. Instead, the fractal dimension of the surface provides a single, scale-invariant number that quantifies the complexity of the terrain.</p><p><img src="/2025/12/16/3d-coastline-paradox/fractal-3d.png" alt="Simulated 3D fractal surface"><br><img src="/2025/12/16/3d-coastline-paradox/3d-surface-area.png" alt="Surface Area growth vs Square dimension"></p><h2 id="Mathematical-Formulation-of-the-3D-Surface-Paradox"><a href="#Mathematical-Formulation-of-the-3D-Surface-Paradox" class="headerlink" title="Mathematical Formulation of the 3D Surface Paradox"></a>Mathematical Formulation of the 3D Surface Paradox</h2><p>Consider a 3D surface $z = f(x,y)$ defined over a 2D domain. Let $A(\varepsilon)$ denote the measured surface area using a square ruler of side $\varepsilon$.</p><ol><li>Divide the plane into a grid of squares of side $\varepsilon$. 
Let $N(\varepsilon)$ be the number of squares required to cover the surface (or, equivalently, the number of boxes intersecting the surface in 3D):</li></ol>$$A(\varepsilon) \approx N(\varepsilon) \cdot \varepsilon^2$$<ol start="2"><li>Assume the surface is fractal with Minkowski–Bouligand dimension $D$ (with $2 < D < 3$):</li></ol>$$N(\varepsilon) \sim \varepsilon^{-D}$$<ol start="3"><li>Substitute into the area formula:</li></ol>$$A(\varepsilon) \sim \varepsilon^2 \cdot \varepsilon^{-D} = \varepsilon^{2-D}$$<ol start="4"><li>Interpretation:</li></ol><ul><li>If the surface is smooth: $D = 2$, then $A(\varepsilon) \sim \varepsilon^0 = \text{constant}$.</li><li>If the surface is fractal: $D > 2$, then $A(\varepsilon) \to \infty$ as $\varepsilon \to 0$.</li></ul><ol start="5"><li>Recovering the fractal dimension from data:</li></ol>$$D = 2 - \frac{d \log A(\varepsilon)}{d \log \varepsilon}$$<ul><li>On a log–log plot of $A(\varepsilon)$ vs $\varepsilon$, the slope is $2-D$.</li><li>This provides a scale-invariant measure of the surface’s roughness, analogous to the 2D case but one dimension higher.</li></ul><h1 id="Telegraph-Hill"><a href="#Telegraph-Hill" class="headerlink" title="Telegraph Hill"></a>Telegraph Hill</h1><p>Up to this point, we have illustrated the coastline (or geographical area) paradox using a simulated fractal surface. While this is useful for building intuition, it is ultimately a controlled toy example. In this section, we replace the synthetic terrain with real elevation data from Telegraph Hill in San Francisco. Extracting and preparing this data turned out to be an ordeal in its own right—one that probably deserves a dedicated blog post. There is something uniquely satisfying about working with GIS data: every raster, projection, and coordinate transform is a walking demonstration of linear algebra in the wild. But I digress. 
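</p><p>Before that, it is worth sketching the measurement machinery itself. The version below assumes only NumPy and uses a synthetic heightfield in place of the DEM; the function names and the triangulated-area approach are illustrative rather than the notebook’s exact code.</p>

```python
import numpy as np

def surface_area(z, cell):
    # Triangulated area of a heightfield sampled on a square grid of side `cell`.
    p00, p10 = z[:-1, :-1], z[1:, :-1]
    p01, p11 = z[:-1, 1:], z[1:, 1:]
    def tri(h0, h1, h2):
        # Area of a right triangle with legs `cell` along x and y,
        # lifted to heights h0, h1, h2 at its corners.
        u = np.stack([np.full_like(h0, cell), np.zeros_like(h0), h1 - h0], axis=-1)
        v = np.stack([np.zeros_like(h0), np.full_like(h0, cell), h2 - h0], axis=-1)
        return 0.5 * np.linalg.norm(np.cross(u, v), axis=-1)
    return float((tri(p00, p10, p01) + tri(p11, p01, p10)).sum())

def area_at_ruler(z, base_cell, ruler):
    # Coarsen the grid so each cell is roughly `ruler` across, then measure.
    step = max(1, int(ruler // base_cell))
    return surface_area(z[::step, ::step], base_cell * step)

# Synthetic rough "terrain" standing in for the Telegraph Hill DEM.
n = 257
x, y = np.meshgrid(np.arange(n, dtype=float), np.arange(n, dtype=float), indexing="ij")
z = 50 * np.sin(x / 10.0) * np.cos(y / 10.0)
for r in (256, 128, 64, 32):
    print(r, round(area_at_ruler(z, base_cell=1.0, ruler=r)))
```

<p>Finer rulers pick up more of the undulation, so the measured area climbs as the ruler shrinks.</p><p>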
With the elevation data in hand, we can now repeat the same multi-scale measurement exercise and observe the coastline paradox emerge not from a mathematical construction, but from an actual piece of geography.</p><p><img src="/2025/12/16/3d-coastline-paradox/telegraph-hill-box.png" alt="The 3D surface will be generated for a bounding box (shown here) containing Telegraph Hill"></p><p><img src="/2025/12/16/3d-coastline-paradox/telegraph-hill-dem-coit-tower.png" alt="3D surface for Telegraph Hill"></p><p>To illustrate the coastline paradox in a real geographical setting, we estimate the surface area of Telegraph Hill using progressively smaller “rulers.” In the accompanying notebook, the terrain is measured with square rulers of 256, 128, 64, and 32 meters, and the total surface area is recomputed at each scale. As the ruler size decreases, the measured area systematically increases. This is not because the hill is physically changing, but because finer rulers capture more of the terrain’s small-scale roughness—minor ridges, gullies, and local slope variations that are invisible at coarser resolutions. The resulting curve demonstrates the geographical area paradox: for a rough, fractal-like surface, area is not a single well-defined number, but a scale-dependent quantity. What remains invariant across scales is not the measured area itself, but the rate at which it grows as the ruler size shrinks—an idea formalised by the surface’s fractal dimension.</p><p><img src="/2025/12/16/3d-coastline-paradox/telegraph-hill-coastline-paradox.png" alt="Coastline Paradox for Telegraph Hill"></p><h2 id="Fractional-Dimensions"><a href="#Fractional-Dimensions" class="headerlink" title="Fractional Dimensions"></a>Fractional Dimensions</h2><p>So far, we have seen how measured length or surface area <strong>depends on the ruler size</strong>: smaller rulers reveal more detail, producing larger measured values. 
The key insight of fractal geometry is that this scale-dependence can be quantified by a <strong>fractional, scale-invariant dimension</strong>, also called the Minkowski–Bouligand dimension.</p><h3 id="2D-Case-Koch-Curve"><a href="#2D-Case-Koch-Curve" class="headerlink" title="2D Case: Koch Curve"></a>2D Case: Koch Curve</h3><p>For a fractal curve, the measured length $L(\varepsilon)$ scales with ruler size $\varepsilon$ as:</p>$$L(\varepsilon) \sim \varepsilon^{1-D_1}$$<p>where $D_1$ is the fractal dimension of the curve. By plotting $\log L(\varepsilon)$ versus $\log \varepsilon$, the slope of the line gives $1-D_1$, from which we can solve for $D_1$. For the Koch curve, this yields $D_1 \approx 1.1$ (theoretically this is $1.26$), reflecting that the curve is “rougher than a line” but does not fill a plane.</p><p><img src="/2025/12/16/3d-coastline-paradox/1D-dim-est.png" alt="Fitting a line to estimate dimension in the 2D case"></p><h3 id="3D-Case-Simulated-Fractal-Surface"><a href="#3D-Case-Simulated-Fractal-Surface" class="headerlink" title="3D Case: Simulated Fractal Surface"></a>3D Case: Simulated Fractal Surface</h3><p>For a fractal surface, the measured area $A(\varepsilon)$ scales with ruler size $\varepsilon$ as:</p>$$A(\varepsilon) \sim \varepsilon^{2-D_2}$$<p>where $D_2$ is the surface’s fractal dimension (with $2 < D_2 < 3$). A log–log plot of $A(\varepsilon)$ versus $\varepsilon$ gives a slope of $2-D_2$, allowing us to solve for $D_2$; for the simulated surface above, this fit gives $D_2 \approx 2.00002$. 
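</p><p>The slope fit used in both cases is essentially a one-liner. Here is a minimal sketch, assuming only NumPy (the function name is illustrative), which checks itself against the exact Koch scaling $N(\varepsilon)=4^k$ at $\varepsilon=3^{-k}$:</p>

```python
import numpy as np

def fractal_dimension(eps, measure, topo_dim):
    # measure(eps) ~ eps**(topo_dim - D)  =>  slope of log-log fit = topo_dim - D.
    slope, _ = np.polyfit(np.log(eps), np.log(measure), 1)
    return topo_dim - slope

# Koch-curve check: at iteration k the ruler is eps = 3**-k and the
# measured length is L = (4/3)**k, so D should equal log 4 / log 3.
k = np.arange(1, 7)
eps = 3.0 ** -k
L = (4.0 / 3.0) ** k
D1 = fractal_dimension(eps, L, topo_dim=1)
print(D1)  # close to 1.2619

# Surface version: a synthetic A(eps) ~ eps**(2 - 2.5) recovers D = 2.5.
D2 = fractal_dimension(eps, eps ** (2 - 2.5), topo_dim=2)
print(D2)
```

<p>The same fit, fed the measured Telegraph Hill areas, is what produces the real-world estimate reported below.</p><p>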
In practice, simulated terrains often have $D_2 \approx 2.3{-}2.5$, meaning the surface is rougher than a plane but still does not fill 3D space.</p><p><img src="/2025/12/16/3d-coastline-paradox/dim-est-3D.png" alt="Fitting a line to estimate dimension in the 3D case"></p><h3 id="Real-World-Case-Telegraph-Hill"><a href="#Real-World-Case-Telegraph-Hill" class="headerlink" title="Real-World Case: Telegraph Hill"></a>Real-World Case: Telegraph Hill</h3><p>Finally, we can apply the same method to <strong>elevation data from Telegraph Hill</strong>. Using square rulers of decreasing size, we measure the terrain’s surface area at each scale. A log–log plot of measured area versus ruler size produces a slope that corresponds to $2-D_{TH}$.</p>$$D_{TH} = 2 - \frac{d \log A(\varepsilon)}{d \log \varepsilon} = 2.00084$$<p>The resulting fractional dimension $D_{TH}$ captures the <strong>true roughness of the hill</strong>, providing a quantitative, scale-invariant measure of the terrain’s complexity. Just like with the Koch curve or the simulated fractal surface, the hill exhibits a dimension that is <strong>between its topological dimension (2) and the embedding dimension (3)</strong>, revealing the fractal nature of real-world landscapes.</p><p><img src="/2025/12/16/3d-coastline-paradox/dim-est-th.png" alt="Fitting a line to estimate dimension in the Telegraph Hill case (3D)"></p><h1 id="The-Fractal-Boundary-of-Trainability"><a href="#The-Fractal-Boundary-of-Trainability" class="headerlink" title="The Fractal Boundary of Trainability"></a>The Fractal Boundary of Trainability</h1><p>The most interesting region of hyperparameter space is not where training clearly succeeds or clearly fails, but the boundary between the two. 
This is where learning rates are just stable enough, regularisation is just sufficient, and optimisation teeters on the edge of divergence.</p><p><img src="/2025/12/16/3d-coastline-paradox/nn_fractal.png" alt="The boundary of neural network trainability is fractal"></p><p>When we zoom into this boundary between convergent (blue) and divergent (red) training regimes, something remarkable happens: structure appears at every scale. Regions that look smooth at coarse resolution reveal increasingly intricate patterns as we zoom in. No matter how closely we examine it, the boundary never simplifies.</p><p>In this sense, the boundary of neural network trainability behaves like a fractal. Just as with coastlines or rough surfaces, the distinction between “trainable” and “untrainable” depends on the scale at which we probe it — a reminder that even optimisation lives in a world of fractional geometry.</p><h1 id="Scale-dependent-kinematics-spacetime-extension"><a href="#Scale-dependent-kinematics-spacetime-extension" class="headerlink" title="Scale dependent kinematics: spacetime extension"></a>Scale dependent kinematics: spacetime extension</h1><p>One intriguing extension is to imagine motion along a fractal path, where the effective distance depends on scale. 
If $L(\varepsilon) \sim \varepsilon^{1-D}$ is the measured length at scale $\varepsilon$, then, for a traversal completed in time $T$, a “scale-dependent velocity” $v(\varepsilon)$ could be written as:</p>$$v(\varepsilon) = \frac{L(\varepsilon)}{T} \sim \frac{\varepsilon^{1-D}}{T}$$<p>For a particle moving in a fractal spacetime geometry, this hints at scale-dependent kinematics, where the observed velocity changes with the measurement resolution, connecting fractal dimension $D$ with the local structure of spacetime.</p><h1 id="Conclusions-and-Final-Thoughts"><a href="#Conclusions-and-Final-Thoughts" class="headerlink" title="Conclusions and Final Thoughts"></a>Conclusions and Final Thoughts</h1><p>Through this exploration, we have seen how the coastline paradox extends naturally from 2D curves to 3D surfaces, and how it manifests in real-world terrain like Telegraph Hill. Starting with the Koch curve, we visualized the fundamental idea that measured length depends on the scale of measurement. Extending this to 3D, we saw that the surface area of a rough, fractal-like terrain increases as the measurement resolution becomes finer—a phenomenon we’ve called the geographical area paradox.</p><p>Applying the same principles to actual GIS data confirmed that this is not just a theoretical curiosity: hilly cities truly do have “more surface” at finer scales, and the apparent area depends on how finely it is measured.</p><p>Finally, this journey highlighted the importance of fractional dimensions. Traditional notions of dimension—1D, 2D, 3D—are insufficient to capture the complexity of fractal structures. By calculating Minkowski–Bouligand dimensions from 1D curves, 2D surfaces, and real-world elevation data, we gained a quantitative, scale-invariant measure of roughness.</p><p>In the end, the coastline paradox is more than a curiosity: it offers a window into the hidden complexity of the world, from jagged coastlines to hilly terrain, and pushes us to rethink the conventional notion of integer dimensions. 
Indeed, questioning our intuition about dimensions may be essential for a deeper understanding of concepts like velocity, especially when the underlying physical paths we traverse may be inherently fractal.</p><h1 id="References"><a href="#References" class="headerlink" title="References"></a>References</h1><ul><li><p><a href="https://paulbourke.net/fractals/fracdim/">An absolutely ancient reference that uses UNIX to compute various box-counting algorithms but also has a nice theoretical background to fractal dimensions. </a></p></li><li><p><a href="https://pi.math.cornell.edu/~erin/docs/dimension.pdf">Real Analysis and Measure Theoretic Approach to dimension theory </a></p></li><li><p><a href="https://sohl-dickstein.github.io/2024/02/12/fractal.html">Jascha Sohl-Dickstein’s Blog on Fractal Boundaries</a></p></li><li><p><a href="https://arxiv.org/pdf/2402.06184">Original Paper by Sohl-Dickstein</a></p></li><li><p><a href="https://www.infinitelymore.xyz/p/the-infinite-coastline-paradox">The infinite coastline paradox</a></p></li></ul>]]>
    </content>
    <id>https://franciscormendes.github.io/2025/12/16/3d-coastline-paradox/</id>
    <link href="https://franciscormendes.github.io/2025/12/16/3d-coastline-paradox/"/>
    <published>2025-12-16T00:00:00.000Z</published>
    <summary>Does a hilly city have more surface area than a flat one? The 3D coastline paradox explored via fractal dimension, Hausdorff measure, and a Python notebook applied to Telegraph Hill.</summary>
    <title>Telegraph Hill and the Coastline Paradox: Measuring a City in Fractional Dimensions</title>
    <updated>2026-04-10T14:24:00.535Z</updated>
  </entry>
  <entry>
    <author>
      <name>Francisco Romaldo Fernandes Mendes</name>
    </author>
    <category term="machine-learning" scheme="https://franciscormendes.github.io/categories/machine-learning/"/>
    <category term="signal-processing" scheme="https://franciscormendes.github.io/tags/signal-processing/"/>
    <category term="machine-learning" scheme="https://franciscormendes.github.io/tags/machine-learning/"/>
    <category term="fourier-transform" scheme="https://franciscormendes.github.io/tags/fourier-transform/"/>
    <category term="convolutional-neural-networks" scheme="https://franciscormendes.github.io/tags/convolutional-neural-networks/"/>
    <category term="low-rank-approximation" scheme="https://franciscormendes.github.io/tags/low-rank-approximation/"/>
    <content>
      <![CDATA[<h1 id="Introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h1><p>Convolution sits at the heart of modern machine learning—especially convolutional neural networks (CNNs)—yet the underlying mathematics is often hidden behind highly optimised implementations in PyTorch, TensorFlow, and other frameworks. As a result, many of the properties that make convolution such a powerful building block for deep learning become obscured, particularly when we try to reason about model behaviour or debug a failing architecture.</p><p>If you know the convolution theorem, a natural question arises:</p><p><em>Why don’t CNNs simply compute a Fourier transform of the input and kernel, multiply them in the frequency domain, and invert the result? Wouldn’t that be simpler and faster?</em></p><p>This blog post addresses exactly that question. We will see that:</p><ol><li><p><strong>FFT-based convolution is not local.</strong><br>In the Fourier domain every coefficient depends on every input pixel. This destroys the locality structure that CNNs rely on to learn hierarchical, spatially meaningful features. As a result, it breaks the very inductive bias that makes CNNs effective.</p></li><li><p><strong>FFT-based convolution is not computationally cheaper in neural networks.</strong><br>Although FFTs are asymptotically efficient, they must be recomputed on every forward and backward pass—and the cost of repeatedly transforming inputs, kernels, and gradients outweighs any benefit from spectral multiplication.</p></li></ol><p>By the end of this post, we’ll have a clear, explicit comparison—both in matrix form and via backpropagation—showing why CNNs deliberately perform convolution in the spatial domain. 
Any practitioner of signal processing should also be interested in knowing when the “locality” property is useful and when it is not!</p><h1 id="1-D-Convolution"><a href="#1-D-Convolution" class="headerlink" title="1-D Convolution"></a>1-D Convolution</h1><p>Let us start with the most basic form of convolution, the 1D convolution. In this case you have a filter (which is nothing but a sequence of numbers) that you want to multiply with your signal in order to produce another signal which is hopefully more interesting to you. For example, in your headphones, you want to multiply a set of numbers with the music signal such that the resulting signal is more music than the wailing baby one row behind you. </p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> numpy <span class="keyword">as</span> np</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">conv1d_direct</span>(<span class="params">x, h</span>):</span><br><span class="line">    nx, nh = <span class="built_in">len</span>(x), <span class="built_in">len</span>(h)</span><br><span class="line">    y = np.zeros(nx+nh-<span class="number">1</span>)</span><br><span class="line">    <span class="keyword">for</span> n <span class="keyword">in</span> <span class="built_in">range</span>(<span class="built_in">len</span>(y)):</span><br><span class="line">        <span 
class="keyword">for</span> m <span class="keyword">in</span> <span class="built_in">range</span>(nx):</span><br><span class="line">            k = n - m</span><br><span class="line">            <span class="keyword">if</span> <span class="number">0</span> &lt;= k &lt; nh:</span><br><span class="line">                y[n] += x[m] * h[k]</span><br><span class="line">    <span class="keyword">return</span> y</span><br><span class="line"></span><br><span class="line">x = np.array([<span class="number">1.</span>,<span class="number">2.</span>,<span class="number">0.</span>,-<span class="number">1.</span>]) <span class="comment"># this is the signal of music + baby wailing</span></span><br><span class="line">h = np.array([<span class="number">0.5</span>,<span class="number">1.</span>,<span class="number">0.5</span>]) <span class="comment"># this is a filter that when multiplied with x makes it more music</span></span><br><span class="line">conv1d_direct(x,h)</span><br></pre></td></tr></table></figure><h2 id="Convolution-Theorem"><a href="#Convolution-Theorem" class="headerlink" title="Convolution Theorem"></a>Convolution Theorem</h2><p>This brings us to the convolution theorem wherein we can prove that the process of convolution i.e. multiplying window-wise h and x is mathematically equivalent to a simple multiplication between the fft of h and the fft of x. 
</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">conv_via_fft</span>(<span class="params">x,h</span>):</span><br><span class="line">    N = <span class="built_in">len</span>(x)+<span class="built_in">len</span>(h)-<span class="number">1</span></span><br><span class="line">    X = np.fft.rfft(x,n=N)</span><br><span class="line">    H = np.fft.rfft(h,n=N)</span><br><span class="line">    <span class="keyword">return</span> np.fft.irfft(X*H,n=N)</span><br><span class="line"></span><br><span class="line"><span class="built_in">print</span>(np.<span class="built_in">max</span>(np.<span class="built_in">abs</span>(conv1d_direct(x,h) - conv_via_fft(x,h))))</span><br><span class="line"><span class="built_in">print</span>(conv1d_direct(x,h))</span><br><span class="line"><span class="built_in">print</span>(conv_via_fft(x,h))</span><br></pre></td></tr></table></figure><h1 id="2-D-Convolution"><a href="#2-D-Convolution" class="headerlink" title="2-D Convolution"></a>2-D Convolution</h1><p>Just like before, we will convolve a 2D filter with a 2D signal in the spatial domain. We will then do the same using the FFT, and verify that the convolution theorem does indeed hold in 2D as well. 
</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">conv2d_direct</span>(<span class="params">img, ker</span>):</span><br><span class="line">    ih, iw = img.shape</span><br><span class="line">    kh, kw = ker.shape</span><br><span class="line">    out = np.zeros((ih+kh-<span class="number">1</span>, iw+kw-<span class="number">1</span>))</span><br><span class="line">    <span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(out.shape[<span class="number">0</span>]):</span><br><span class="line">        <span class="keyword">for</span> j <span class="keyword">in</span> <span class="built_in">range</span>(out.shape[<span class="number">1</span>]):</span><br><span class="line">            <span class="keyword">for</span> m <span class="keyword">in</span> <span class="built_in">range</span>(ih):</span><br><span class="line">                <span class="keyword">for</span> n <span class="keyword">in</span> <span class="built_in">range</span>(iw):</span><br><span class="line">                    km, kn = i-m, j-n</span><br><span class="line">                    <span class="keyword">if</span> <span class="number">0</span> &lt;= km &lt; kh <span class="keyword">and</span> <span class="number">0</span> &lt;= kn &lt; kw:</span><br><span class="line">                        out[i,j] += img[m,n] * 
ker[km,kn]</span><br><span class="line">    <span class="keyword">return</span> out</span><br><span class="line"></span><br><span class="line">img = np.array([[<span class="number">0</span>,<span class="number">0</span>,<span class="number">0</span>,<span class="number">0</span>],[<span class="number">0</span>,<span class="number">1</span>,<span class="number">2</span>,<span class="number">0</span>],[<span class="number">0</span>,<span class="number">3</span>,<span class="number">4</span>,<span class="number">0</span>],[<span class="number">0</span>,<span class="number">0</span>,<span class="number">0</span>,<span class="number">0</span>]])</span><br><span class="line">ker = np.array([[<span class="number">1</span>,<span class="number">2</span>,<span class="number">1</span>],[<span class="number">2</span>,<span class="number">4</span>,<span class="number">2</span>],[<span class="number">1</span>,<span class="number">2</span>,<span class="number">1</span>]])/<span class="number">16</span></span><br><span class="line">conv2d_direct(img,ker)</span><br></pre></td></tr></table></figure><h2 id="Convolution-Theorem-2D"><a href="#Convolution-Theorem-2D" class="headerlink" title="Convolution Theorem 2D"></a>Convolution Theorem 2D</h2><p>Just as in the 1D case, instead of windowing and multiplying we can take the FFT of the signal and of the kernel and simply multiply them element-wise. 
</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">conv2d_fft</span>(<span class="params">img,ker</span>):</span><br><span class="line">    H,W = img.shape</span><br><span class="line">    Kh,Kw = ker.shape</span><br><span class="line">    OH,OW = H+Kh-<span class="number">1</span>, W+Kw-<span class="number">1</span></span><br><span class="line">    IMG = np.fft.rfft2(img, s=(OH,OW))</span><br><span class="line">    KER = np.fft.rfft2(ker, s=(OH,OW))</span><br><span class="line">    <span class="keyword">return</span> np.fft.irfft2(IMG*KER, s=(OH,OW))</span><br><span class="line"></span><br><span class="line">out_d = conv2d_direct(img,ker)</span><br><span class="line">out_f = conv2d_fft(img,ker)</span><br><span class="line">np.<span class="built_in">max</span>(np.<span class="built_in">abs</span>(out_d - out_f))</span><br></pre></td></tr></table></figure><h2 id="So-why-do-NNs-not-use-the-FFT"><a href="#So-why-do-NNs-not-use-the-FFT" class="headerlink" title="So why do NNs not use the FFT?"></a>So why do NNs not use the FFT?</h2><p>In a neural network, convolution is used to generate feature maps that feed into the next layer. At first glance, the convolution theorem suggests a tempting shortcut: instead of sliding a kernel spatially, we could transform both the image and kernel into the frequency domain, multiply them element-wise, and transform the result back. 
The output would be mathematically equivalent—so why not do this inside CNNs?</p><p>It turns out there are two fundamental reasons:</p><ol><li><p><strong>Neural networks care about more than just the output—they care about how the output is produced.</strong><br>During backpropagation, each filter weight is updated using gradients derived from local spatial features. This locality enables CNNs to learn hierarchies of edges, textures, shapes, and patterns.<br>In the Fourier domain, however, gradients flow through global Fourier coefficients. Every frequency component depends on every pixel, so the update for a single weight depends on the entire image. This destroys the spatial locality that CNNs rely on and eliminates the inductive bias that makes them effective.</p></li><li><p><strong>The FFT is not “simpler” computationally for neural networks.</strong><br>While FFTs are efficient in isolation, a CNN would need to repeatedly compute forward FFTs, spectral multiplications, and inverse FFTs—not just for the forward pass, but also for backpropagation.<br>When you count actual multiplications and transforms, the FFT approach is often more expensive, especially for small kernels (e.g., 3×3, 5×5), which dominate modern architectures.</p></li></ol><p><strong>In short:</strong> CNNs avoid the Fourier domain because it removes locality and adds computational overhead—both of which undermine the very reasons convolution works so well in deep learning.</p><h1 id="2D-Spatial-Convolution-as-a-Matrix-Multiply"><a href="#2D-Spatial-Convolution-as-a-Matrix-Multiply" class="headerlink" title="2D Spatial Convolution as a Matrix Multiply"></a>2D Spatial Convolution as a Matrix Multiply</h1><p>For our next trick we will show the exact way in which your hardware actually computes convolutions. Spoiler: it will be some kind of matrix multiplication. 
This is quite different from the way convolution is taught in the classroom, where you usually <em>convolve</em> the kernel with a patch of pixels in the spatial domain and then <em>roll</em> it onto the next patch nearby. In reality, this whole process is represented as one huge matrix multiply. It is very important to think about convolution in this way, because it makes complex questions easier to approach: a loop over pixels is hard to reason about mathematically, whereas once convolution is expressed as a multiply between two matrices we can read its complexity straight off a formula. More importantly, GPUs are fast precisely because they can parallelize this matrix multiply (as opposed to parallelizing various kinds of for-loop structures).</p><p>In this section, $X$ denotes the input image. It’s worth noting that most deep-learning libraries treat the 2D and 1D cases in essentially the same way: the very first step is to reshape the image into a long vector, commonly written as $\mathrm{vec}(X)$. This operation—often implemented as <code>im2col</code> in the source code—unrolls local patches of the image so that convolution can be expressed as a matrix–vector multiplication. 
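As a concrete sketch in plain NumPy (the `im2col` helper here is my own simplified stand-in for what libraries implement far more efficiently, and it uses the deep-learning convention in which the kernel is not flipped), the unroll-then-multiply view of a valid convolution looks like:

```python
import numpy as np

def im2col(X, kh, kw):
    # Unroll every kh-by-kw patch of X into one row of a matrix, so that
    # valid convolution becomes a single matrix-vector product (a GEMM).
    H, Wd = X.shape
    return np.array([X[i:i+kh, j:j+kw].ravel()
                     for i in range(H - kh + 1)
                     for j in range(Wd - kw + 1)])

X = np.arange(16, dtype=float).reshape(4, 4)          # the 4x4 input image
W = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]]) / 16  # the 3x3 kernel

cols = im2col(X, 3, 3)       # shape (4, 9): one row per output pixel
vec_Y = cols @ W.ravel()     # patch matrix times flattened kernel
Y = vec_Y.reshape(2, 2)      # recast the long vector back to a 2x2 image
```

This is the patch-matrix form; the equivalent Toeplitz form $T(W)\,\mathrm{vec}(X)$ is written out next. Either way, each row only touches the nine pixels of one receptive field, which is the locality the rest of this post turns on.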
</p>$$X =\begin{bmatrix}x_{11} & x_{12} & x_{13} & x_{14} \\x_{21} & x_{22} & x_{23} & x_{24} \\x_{31} & x_{32} & x_{33} & x_{34} \\x_{41} & x_{42} & x_{43} & x_{44}\end{bmatrix},\quad\mathrm{vec}(X) =\begin{bmatrix}x_{11} \\ x_{12} \\ x_{13} \\ x_{14} \\x_{21} \\ x_{22} \\ x_{23} \\ x_{24} \\x_{31} \\ x_{32} \\ x_{33} \\ x_{34} \\x_{41} \\ x_{42} \\ x_{43} \\ x_{44}\end{bmatrix}.$$<p>Let the $3\times 3$ kernel we are interested in convolving be:</p>$$W =\begin{bmatrix}w_{11} & w_{12} & w_{13} \\w_{21} & w_{22} & w_{23} \\w_{31} & w_{32} & w_{33}\end{bmatrix}.$$<p>The valid convolution output (size $2\times 2$) is (again <code>im2col</code> outputs a long vector that can be then transformed to an image on the other end):</p>$$\mathrm{vec}(Y)=\begin{bmatrix}y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \\\end{bmatrix}.$$<p>We can express the convolution as a matrix multiply:</p>$$\mathrm{vec}(Y) = T(W)\ \mathrm{vec}(X),$$<p>where $T(W)$ is the Block-Toeplitz with Toeplitz Blocks (BTTB) matrix. </p>$$T(W) =\begin{bmatrix}\color{blue}{w_{11}} & \color{blue}{w_{12}} & \color{blue}{w_{13}} & 0& \color{blue}{w_{21}} & \color{blue}{w_{22}} & \color{blue}{w_{23}} & 0& \color{blue}{w_{31}} & \color{blue}{w_{32}} & \color{blue}{w_{33}} & 0& 0 & 0 & 0 & 0 \\[2mm]%0 & \color{blue}{w_{11}} & \color{blue}{w_{12}} & \color{blue}{w_{13}}& 0 & \color{blue}{w_{21}} & \color{blue}{w_{22}} & \color{blue}{w_{23}}& 0 & \color{blue}{w_{31}} & \color{blue}{w_{32}} & \color{blue}{w_{33}}& 0 & 0 & 0 & 0 \\[2mm]%0 & 0 & 0 & 0 & \color{blue}{w_{11}} & \color{blue}{w_{12}} & \color{blue}{w_{13}} & 0& \color{blue}{w_{21}} & \color{blue}{w_{22}} & \color{blue}{w_{23}} & 0& \color{blue}{w_{31}} & \color{blue}{w_{32}} & \color{blue}{w_{33}} & 0 \\[2mm]%0 & 0 & 0 & 0 & 0 & \color{blue}{w_{11}} & \color{blue}{w_{12}} & \color{blue}{w_{13}}& 0 & \color{blue}{w_{21}} & \color{blue}{w_{22}} & \color{blue}{w_{23}}& 0 & \color{blue}{w_{31}} & \color{blue}{w_{32}} & \color{blue}{w_{33}}\end{bmatrix}.$$<p>Expanded, 
the output entries are:</p>$$y_{11} =w_{11} x_{11} + w_{12} x_{12} + w_{13} x_{13} + w_{21} x_{21} + w_{22} x_{22} + w_{23} x_{23} + w_{31} x_{31} + w_{32} x_{32} + w_{33}x_{33}$$$$y_{12} =w_{11} x_{12} + w_{12} x_{13} + w_{13} x_{14} + w_{21} x_{22} + w_{22} x_{23} + w_{23} x_{24} + w_{31} x_{32} + w_{32} x_{33} + w_{33}x_{34}$$$$y_{21} =w_{11} x_{21} + w_{12} x_{22} + w_{13} x_{23} + w_{21} x_{31} + w_{22} x_{32} + w_{23} x_{33} + w_{31} x_{41} + w_{32} x_{42} + w_{33} x_{43}$$$$y_{22} =w_{11} x_{22} + w_{12} x_{23} + w_{13} x_{24} + w_{21} x_{32} + w_{22} x_{33} + w_{23} x_{34} + w_{31} x_{42} + w_{32} x_{43} + w_{33} x_{44}$$<h2 id="Loss-Backpropagation-in-Convolution"><a href="#Loss-Backpropagation-in-Convolution" class="headerlink" title="Loss Backpropagation in Convolution"></a>Loss Backpropagation in Convolution</h2><h3 id="1D-Convolution-Example"><a href="#1D-Convolution-Example" class="headerlink" title="1D Convolution Example"></a><strong>1D Convolution Example</strong></h3><p>Let the 1D convolution be:</p>$$y = T(w) x$$<p>where:</p><ul><li>($x \in \mathbb{R}^6$) is the input</li><li>($w \in \mathbb{R}^3$) is the kernel</li><li>($y \in \mathbb{R}^4$) is the output (valid convolution)</li></ul><p>Assume a scalar loss ($L(y)$).</p><h4 id="Step-1-Gradient-w-r-t-Output"><a href="#Step-1-Gradient-w-r-t-Output" class="headerlink" title="Step 1: Gradient w.r.t Output"></a>Step 1: Gradient w.r.t Output</h4>$$\frac{\partial L}{\partial y} =\begin{bmatrix}\frac{\partial L}{\partial y_1} \\\frac{\partial L}{\partial y_2} \\\frac{\partial L}{\partial y_3} \\\frac{\partial L}{\partial y_4}\end{bmatrix}.$$<h4 id="Step-2-Gradient-w-r-t-Kernel"><a href="#Step-2-Gradient-w-r-t-Kernel" class="headerlink" title="Step 2: Gradient w.r.t Kernel"></a>Step 2: Gradient w.r.t Kernel</h4><p>Construct the <strong>input Toeplitz matrix</strong>:</p>$$T_x =\begin{bmatrix}x_1 & x_2 & x_3 \\x_2 & x_3 & x_4 \\x_3 & x_4 & x_5 \\x_4 & x_5 & x_6\end{bmatrix}.$$<p>Then the gradient w.r.t 
the kernel is:</p>$$\frac{\partial L}{\partial w} = T_x^\top \frac{\partial L}{\partial y} =\begin{bmatrix}x_1 & x_2 & x_3 & x_4 \\x_2 & x_3 & x_4 & x_5 \\x_3 & x_4 & x_5 & x_6 \\\end{bmatrix}\begin{bmatrix}\frac{\partial L}{\partial y_1} \\\frac{\partial L}{\partial y_2} \\\frac{\partial L}{\partial y_3} \\\frac{\partial L}{\partial y_4}\end{bmatrix}.$$<p><strong>Observation:</strong> Each kernel weight sees <strong>only the local patches of the input it touches</strong>, preserving locality.</p><h4 id="Step-3-Gradient-w-r-t-Input"><a href="#Step-3-Gradient-w-r-t-Input" class="headerlink" title="Step 3: Gradient w.r.t Input"></a>Step 3: Gradient w.r.t Input</h4>$$\frac{\partial L}{\partial x} = T(w)^\top \frac{\partial L}{\partial y}.$$<p>Again, <strong>each input element only receives gradient from the outputs it contributed to</strong>.</p><h3 id="2D-Convolution-Example"><a href="#2D-Convolution-Example" class="headerlink" title="2D Convolution Example"></a><strong>2D Convolution Example</strong></h3><p>Purely for completeness: the 1D and 2D cases are handled the same way via <code>im2col</code>.</p><p>For 2D BTTB convolution:</p>$$\mathrm{vec}(Y) = T(W) \mathrm{vec}(X),$$<p>with scalar loss ($L(Y)$):</p><ul><li>Gradient w.r.t kernel:</li></ul>$$\frac{\partial L}{\partial W} = T_X^\top \frac{\partial L}{\partial \mathrm{vec}(Y)}$$<ul><li>Gradient w.r.t input:</li></ul>$$\frac{\partial L}{\partial \mathrm{vec}(X)} = T(W)^\top \frac{\partial L}{\partial \mathrm{vec}(Y)}$$<h4 id="Observation"><a href="#Observation" class="headerlink" title="Observation"></a><strong>Observation</strong></h4><ul><li>Each kernel weight is influenced <strong>only by the input pixels in the patch it was applied to</strong></li><li>Each input pixel receives gradients <strong>only from outputs it contributed to</strong></li><li>This is why CNNs learn <strong>localized features</strong> efficiently.</li></ul><h1 id="2D-Fourier-Transform-Convolution-as-Matrix-Multiplies"><a
href="#2D-Fourier-Transform-Convolution-as-Matrix-Multiplies" class="headerlink" title="2D Fourier Transform Convolution as Matrix Multiplies"></a>2D Fourier Transform Convolution as Matrix Multiplies</h1><p>Similar to the spatial convolution case we will represent the Fourier transform as a sequence of matrix multiplies. The recipe is as follows, </p><ol><li>Fourier Transform of Kernel</li><li>Fourier Transform of 2D Image</li><li>Elementwise Multiply in the Frequency Domain</li><li>Inverse Fourier Transform</li></ol><p>These matrices can get quite huge, but I thought we need to see them explicitly to make understanding them a bit easier. </p><p>We assume:</p>$$X =\begin{bmatrix}x_{11} & x_{12} & x_{13} & x_{14}\\x_{21} & x_{22} & x_{23} & x_{24}\\x_{31} & x_{32} & x_{33} & x_{34}\\x_{41} & x_{42} & x_{43} & x_{44}\\\end{bmatrix},\qquadW =\begin{bmatrix}w_{11} & w_{12} & w_{13}\\w_{21} & w_{22} & w_{23}\\w_{31} & w_{32} & w_{33}\\\end{bmatrix}$$<p>Flatten row-major:</p>$$\mathrm{vec}(X)=\begin{bmatrix}x_{11}\\x_{12}\\x_{13}\\x_{14}\\x_{21}\\x_{22}\\x_{23}\\x_{24}\\x_{31}\\x_{32}\\x_{33}\\x_{34}\\x_{41}\\x_{42}\\x_{43}\\x_{44}\\\end{bmatrix},\qquad\mathrm{vec}(W)=\begin{bmatrix}w_{11}\\w_{12}\\w_{13}\\w_{21}\\w_{22}\\w_{23}\\w_{31}\\w_{32}\\w_{33}\\\end{bmatrix}.$$<p>The 2D DFT matrix for a 4×4 image (flattened row-major) is:</p>$$F_{k,n} = e^{-2\pi i \cdot kn/16},\qquad k,n = 0,\dots,15.$$$$F=\begin{bmatrix}1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\1 & c_{1} - is_{1} & c_{2} - is_{2} & c_{3} - is_{3} & c_{4} - is_{4} & c_{5} - is_{5} & c_{6} - is_{6} & c_{7} - is_{7} & -1 & c_{9} - is_{9} & c_{10} - is_{10} & c_{11} - is_{11} & c_{12} - is_{12} & c_{13} - is_{13} & c_{14} - is_{14} & c_{15} - is_{15} \\1 & c_{2} - is_{2} & c_{4} - is_{4} & c_{6} - is_{6} & -1 & c_{10} - is_{10} & c_{12} - is_{12} & c_{14} - is_{14} & 1 & c_{2} - is_{2} & c_{4} - is_{4} & c_{6} - is_{6} & -1 & c_{10} - is_{10} & c_{12} - is_{12} & c_{14} - is_{14} \\1 & c_{3} 
- is_{3} & c_{6} - is_{6} & c_{9} - is_{9} & c_{12} - is_{12} & c_{15} - is_{15} & c_{18} - is_{18} & c_{21} - is_{21} & -1 & c_{27} - is_{27} & c_{30} - is_{30} & c_{33} - is_{33} & c_{36} - is_{36} & c_{39} - is_{39} & c_{42} - is_{42} & c_{45} - is_{45} \\1 & c_{4} - is_{4} & -1 & c_{12} - is_{12} & 1 & c_{20} - is_{20} & -1 & c_{28} - is_{28} & 1 & c_{36} - is_{36} & -1 & c_{44} - is_{44} & 1 & c_{52} - is_{52} & -1 & c_{60} - is_{60} \\1 & c_{5} - is_{5} & c_{10} - is_{10} & c_{15} - is_{15} & c_{20} - is_{20} & c_{25} - is_{25} & c_{30} - is_{30} & c_{35} - is_{35} & -1 & c_{45} - is_{45} & c_{50} - is_{50} & c_{55} - is_{55} & c_{60} - is_{60} & c_{65} - is_{65} & c_{70} - is_{70} & c_{75} - is_{75} \\1 & c_{6} - is_{6} & c_{12} - is_{12} & c_{18} - is_{18} & -1 & c_{30} - is_{30} & c_{36} - is_{36} & c_{42} - is_{42} & 1 & c_{54} - is_{54} & c_{60} - is_{60} & c_{66} - is_{66} & -1 & c_{78} - is_{78} & c_{84} - is_{84} & c_{90} - is_{90} \\1 & c_{7} - is_{7} & c_{14} - is_{14} & c_{21} - is_{21} & c_{28} - is_{28} & c_{35} - is_{35} & c_{42} - is_{42} & c_{49} - is_{49} & -1 & c_{63} - is_{63} & c_{70} - is_{70} & c_{77} - is_{77} & c_{84} - is_{84} & c_{91} - is_{91} & c_{98} - is_{98} & c_{105} - is_{105} \\1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 \\1 & c_{9} - is_{9} & c_{18} - is_{18} & c_{27} - is_{27} & c_{36} - is_{36} & c_{45} - is_{45} & c_{54} - is_{54} & c_{63} - is_{63} & -1 & c_{81} - is_{81} & c_{90} - is_{90} & c_{99} - is_{99} & c_{108} - is_{108} & c_{117} - is_{117} & c_{126} - is_{126} & c_{135} - is_{135} \\1 & c_{10} - is_{10} & c_{20} - is_{20} & c_{30} - is_{30} & 1 & c_{50} - is_{50} & c_{60} - is_{60} & c_{70} - is_{70} & 1 & c_{90} - is_{90} & c_{100} - is_{100} & c_{110} - is_{110} & 1 & c_{130} - is_{130} & c_{140} - is_{140} & c_{150} - is_{150} \\1 & c_{11} - is_{11} & c_{22} - is_{22} & c_{33} - is_{33} & c_{44} - is_{44} & c_{55} - is_{55} & c_{66} - is_{66} & c_{77} - is_{77} & -1 & c_{99} - 
is_{99} & c_{110} - is_{110} & c_{121} - is_{121} & c_{132} - is_{132} & c_{143} - is_{143} & c_{154} - is_{154} & c_{165} - is_{165} \\1 & c_{12} - is_{12} & -1 & c_{36} - is_{36} & 1 & c_{60} - is_{60} & -1 & c_{84} - is_{84} & 1 & c_{108} - is_{108} & -1 & c_{132} - is_{132} & 1 & c_{156} - is_{156} & -1 & c_{180} - is_{180} \\1 & c_{13} - is_{13} & c_{26} - is_{26} & c_{39} - is_{39} & c_{52} - is_{52} & c_{65} - is_{65} & c_{78} - is_{78} & c_{91} - is_{91} & -1 & c_{117} - is_{117} & c_{130} - is_{130} & c_{143} - is_{143} & c_{156} - is_{156} & c_{169} - is_{169} & c_{182} - is_{182} & c_{195} - is_{195} \\1 & c_{14} - is_{14} & c_{28} - is_{28} & c_{42} - is_{42} & -1 & c_{70} - is_{70} & c_{84} - is_{84} & c_{98} - is_{98} & 1 & c_{126} - is_{126} & c_{140} - is_{140} & c_{154} - is_{154} & -1 & c_{182} - is_{182} & c_{196} - is_{196} & c_{210} - is_{210} \\1 & c_{15} - is_{15} & c_{30} - is_{30} & c_{45} - is_{45} & c_{60} - is_{60} & c_{75} - is_{75} & c_{90} - is_{90} & c_{105} - is_{105} & -1 & c_{135} - is_{135} & c_{150} - is_{150} & c_{165} - is_{165} & c_{180} - is_{180} & c_{195} - is_{195} & c_{210} - is_{210} & c_{225} - is_{225}\\\end{bmatrix}$$<p>Where</p>$$c_n = \cos\left(\frac{2\pi n}{16}\right), \qquad s_n = \sin\left(\frac{2\pi n}{16}\right).$$<h1 id="1-Fourier-Transform-of-the-Kernel"><a href="#1-Fourier-Transform-of-the-Kernel" class="headerlink" title="1. Fourier Transform of the Kernel"></a>1. Fourier Transform of the Kernel</h1>$$\hat{W} = F  \mathrm{vec}(W_{padded})$$<p>where $W_{padded}$ is the 3×3 kernel zero-padded to 4×4. 
Explicitly:</p>$$\mathrm{vec}(W_{padded}) =\begin{bmatrix}w_{11}\\w_{12}\\w_{13}\\0\\w_{21}\\w_{22}\\w_{23}\\0\\w_{31}\\w_{32}\\w_{33}\\0\\0\\0\\0\\0\\\end{bmatrix}.$$<p>Then:</p>$$\hat{W} = F \mathrm{vec}(W_{padded}).$$<p>Take the first row, </p>$$\hat{W}_1 = w_{11} + w_{12} + w_{13} + w_{21} + w_{22} + w_{23} + w_{31} + w_{32} + w_{33}$$<h1 id="2-Fourier-Transform-of-the-Image"><a href="#2-Fourier-Transform-of-the-Image" class="headerlink" title="2. Fourier Transform of the Image"></a>2. Fourier Transform of the Image</h1>$$\hat{X} = F \mathrm{vec}(X)$$<p>Take the first row, </p>$$\hat{X}_1 = x_{11} + x_{12} + x_{13} + x_{14} + x_{21} + x_{22} + x_{23} + x_{24} + x_{31} + x_{32} + x_{33} + x_{34} + x_{41} + x_{42} + x_{43} + x_{44}$$<h1 id="3-Multiply-Elementwise-in-Frequency-Space"><a href="#3-Multiply-Elementwise-in-Frequency-Space" class="headerlink" title="3. Multiply (Elementwise) in Frequency Space"></a>3. Multiply (Elementwise) in Frequency Space</h1><p>Define the frequency-domain product:</p>$$\hat{Y} = \hat{W} \odot \hat{X}$$<p>Written explicitly:</p>$$\hat{Y}=\begin{bmatrix}\hat{W}_1 \hat{X}_1 \\\hat{W}_2 \hat{X}_2 \\\vdots \\\hat{W}_{16} \hat{X}_{16}\end{bmatrix}$$<!-- or equivalently as a matrix multiplication:$$\hat{Y} =\mathrm{diag}(\hat{W})\hat{X}$$ --><!-- with$$\mathrm{diag}(\hat{W}) =\begin{bmatrix}\hat{W}_1 & 0 & \cdots & 0 \\0 & \hat{W}*2 & \cdots & 0 \\\vdots & \vdots & \ddots & \vdots \\0 & 0 & \cdots & \hat{W}*\{16} \\\end{bmatrix}.$$ --><!-- Note: this diagonal matrix is **dense globally** w.r.t. the kernel values even though diagonal in Fourier space. --><h1 id="4-Inverse-Fourier-Transform"><a href="#4-Inverse-Fourier-Transform" class="headerlink" title="4. Inverse Fourier Transform"></a>4. 
Inverse Fourier Transform</h1><p>To return to the spatial domain (recall $F^{-1} = \frac{1}{16}\overline{F}$, the conjugate of $F$ scaled by $1/N$):</p>$$\mathrm{vec}(Y) = F^{-1} \hat{Y} = \frac{1}{16} \overline{F} \hat{Y}$$<p>Explicitly:</p>$$\mathrm{vec}(Y)= \frac{1}{16}\overline{F}\begin{bmatrix}\hat{W}_1 \hat{X}_1 \\\hat{W}_2 \hat{X}_2 \\\hat{W}_3 \hat{X}_3 \\\vdots \\\hat{W}_{16} \hat{X}_{16}\end{bmatrix}.$$<p>Thus the first entry of the output looks like (the subscript is 11 because the vector will eventually be recast to an image; the first row of $\overline{F}$ is all ones), </p>$$y_{11} = \frac{1}{16} \left(\hat{W}_1 \hat{X}_1 + \hat{W}_2 \hat{X}_2 + \hat{W}_3 \hat{X}_3 + \cdots + \hat{W}_{16} \hat{X}_{16}\right)$$<p>Let us focus on the first term on the RHS, $\hat{W}_1 \hat{X}_1$,</p>$$\hat{W}_1\hat{X}_1 = (w_{11} + w_{12} + w_{13} + w_{21} + w_{22} + w_{23} + w_{31} + w_{32} + w_{33}) \times (x_{11} + x_{12} + x_{13} + x_{14} + x_{21} + x_{22} + x_{23} + x_{24} + x_{31} + x_{32} + x_{33} + x_{34} + x_{41} + x_{42} + x_{43} + x_{44})$$$$y_{11} = \frac{1}{16} (w_{11} + w_{12} + w_{13} +\dots + w_{33}) \times (x_{11} + x_{12} + x_{13} +\dots + x_{42} + x_{43} + \textcolor{red}{x_{44}})$$<p>Compare this to $y_{11}$ from the spatial case, and notice that the term $\textcolor{red}{x_{44}}$, which appears above, is missing from the expression below, </p>$$y_{11} = w_{11} x_{11} + w_{12} x_{12} + w_{13} x_{13} + w_{21}x_{21} + w_{22} x_{22} + w_{23} x_{23}+ w_{31} x_{31} + w_{32} x_{32} + w_{33} x_{33}$$<p>Provided everything is zero-padded to the full output size, the two computations end up numerically identical; that is exactly what the convolution theorem guarantees. In the next section we will see that <em>which</em> values contribute matters for gradient backpropagation, and that is where the two approaches differ. 
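The four-step matrix recipe can be checked end-to-end. As a sketch, here it is in the 1-D setting (a minimal example of my own, with a hand-rolled `dft_matrix` helper, and with both arrays zero-padded to the full output length so the circular product equals the linear convolution):

```python
import numpy as np

def dft_matrix(N):
    # F[k, n] = exp(-2*pi*i*k*n/N): the N-point DFT written as a dense matrix.
    k = np.arange(N).reshape(-1, 1)
    n = np.arange(N)
    return np.exp(-2j * np.pi * k * n / N)

x = np.array([1., 2., 0., -1.])
h = np.array([0.5, 1., 0.5])
N = len(x) + len(h) - 1                    # full linear-convolution length

F = dft_matrix(N)
X_hat = F @ np.pad(x, (0, N - len(x)))     # 1. transform the padded signal
H_hat = F @ np.pad(h, (0, N - len(h)))     # 2. transform the padded kernel
Y_hat = X_hat * H_hat                      # 3. elementwise product
y = (np.conj(F) @ Y_hat).real / N          # 4. inverse DFT: F^{-1} = conj(F)/N

print(np.allclose(y, np.convolve(x, h)))   # True: matches direct convolution
```

The same check in 2D needs padding to $6\times 6$; the `conv2d_fft` routine earlier does exactly that via `rfft2`.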
</p><h1 id="Gradient-Comparison"><a href="#Gradient-Comparison" class="headerlink" title="Gradient Comparison"></a>Gradient Comparison</h1><h2 id="FFT-Gradient"><a href="#FFT-Gradient" class="headerlink" title="FFT Gradient"></a>FFT Gradient</h2>$$\frac{\partial y_{11}}{\partial w_{11}} = \frac{1}{16} \left( x_{11} + x_{12} + x_{13} + \dots + x_{44} \right)$$<p>Notice that every input pixel contributes to the gradient of $w_{11}$.</p><p>Similarly for the other weights, EVERY pixel contributes to the gradient. </p>$$\frac{\partial y_{11}}{\partial w_{ij}} = \frac{1}{16} \left( x_{11} + x_{12} + \dots + x_{44} \right), \quad \forall w_{ij}$$<p>So for a scalar loss $L$, the update for every weight mixes in the entire image:</p>$$\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial y_{11}} \cdot \frac{\partial y_{11}}{\partial w_{ij}} = \frac{\partial L}{\partial y_{11}} \cdot \frac{1}{16} \sum_{m=1}^{4} \sum_{n=1}^{4} x_{mn}$$<h2 id="Gradient-in-the-Spatial-Convolution-Case"><a href="#Gradient-in-the-Spatial-Convolution-Case" class="headerlink" title="Gradient in the Spatial Convolution Case"></a>Gradient in the Spatial Convolution Case</h2><p>Notice that each update depends only on the pixel patch that it touches! </p>$$\frac{\partial y_{11}}{\partial w_{11}} = x_{11}, \quad\frac{\partial y_{11}}{\partial w_{12}} = x_{12}, \quad\frac{\partial y_{11}}{\partial w_{13}} = x_{13},$$$$\frac{\partial y_{11}}{\partial w_{21}} = x_{21}, \quad\frac{\partial y_{11}}{\partial w_{22}} = x_{22}, \quad\frac{\partial y_{11}}{\partial w_{23}} = x_{23},$$$$\frac{\partial y_{11}}{\partial w_{31}} = x_{31}, \quad\frac{\partial y_{11}}{\partial w_{32}} = x_{32}, \quad\frac{\partial y_{11}}{\partial w_{33}} = x_{33}.$$<p>The gradient update for a scalar loss $L$ therefore touches only one pixel per weight:</p>$$\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial y_{11}} \cdot \frac{\partial y_{11}}{\partial w_{ij}} = \frac{\partial L}{\partial y_{11}} \cdot x_{ij}.$$<h1 id="Computational-Comparison"><a href="#Computational-Comparison" class="headerlink" title="Computational Comparison"></a>Computational Comparison</h1><h2 id="Spatial-Convolution"><a
href="#Spatial-Convolution" class="headerlink" title="Spatial Convolution"></a>Spatial Convolution</h2><p>Suppose:</p><ul><li>Input image: $X$ of size $N \times N$</li><li>Kernel: $W$ of size $K \times K$</li><li>Output: $Y$ of size $(N-K+1) \times (N-K+1)$</li></ul><h3 id="Number-of-multiplications"><a href="#Number-of-multiplications" class="headerlink" title="Number of multiplications"></a>Number of multiplications</h3><p>Each output pixel requires $K^2$ multiplications:</p>$$\text{Total multiplications} = (N-K+1)^2 \cdot K^2 \approx N^2 K^2 \quad \text{for } N \gg K$$<ul><li>Linear in the <strong>number of pixels</strong> and the <strong>kernel area</strong> $K^2$.</li><li>Memory access is <strong>local</strong>, cache-friendly.</li></ul><h2 id="FFT-based-Convolution"><a href="#FFT-based-Convolution" class="headerlink" title="FFT-based Convolution"></a>FFT-based Convolution</h2><p>Forward pass:</p><ol><li>Zero-pad kernel to size $N \times N$</li><li>Compute 2D FFT of input and kernel: $O(N^2 \log N)$ each</li><li>Elementwise multiplication in Fourier domain: $O(N^2)$</li><li>Inverse FFT: $O(N^2 \log N)$</li></ol><h3 id="Total-computational-cost"><a href="#Total-computational-cost" class="headerlink" title="Total computational cost"></a>Total computational cost</h3>$$\text{FFT convolution} \approx 3 \cdot O(N^2 \log N) + O(N^2) \sim O(N^2 \log N)$$<ul><li>For small kernels ($K = 3$ or $5$), $K^2$ is of the same order as $\log N$, while the FFT path also pays for zero-padding, complex arithmetic, and three full transforms per convolution. In practice:</li></ul>$$N^2 K^2 \;<\; c \cdot N^2 \log N, \quad \text{where } c \text{ is the FFT's sizeable constant factor}$$<ul><li><strong>Spatial convolution is cheaper</strong> for small kernels, which is why CNNs prefer it.</li><li>FFT becomes advantageous only for <strong>very large kernels</strong> or very large images.</li></ul><h3 id="TL-DR"><a href="#TL-DR" class="headerlink" title="TL;DR"></a>TL;DR</h3><ol><li>Spatial convolution is efficient for small kernels and preserves <em>locality</em>, which is crucial for CNNs to learn hierarchies.</li><li>FFT convolution has global interactions, destroys the local inductive bias, and is only computationally
advantageous for very large kernels.</li></ol><h1 id="Conclusion"><a href="#Conclusion" class="headerlink" title="Conclusion"></a>Conclusion</h1><p>We have seen that spatial convolution is not only computationally more efficient but also better suited to capturing the hierarchical structure inherent in most images. For instance, a face detection algorithm may rely on local patterns such as the triangle formed by the eyes and the nose. A kernel that focuses specifically on this local arrangement is highly effective because it preserves locality.</p><p>Conversely, in domains like recommendation systems, where data may be represented as a sparse matrix of product–user interactions, capturing global patterns can be more important. Here, the “local” interactions often correspond to users with strong connections, whereas broader, global patterns reveal trends across the entire system. In such contexts, FFT-based approaches—or methods that leverage global connectivity, like graph convolutional networks—can be more appropriate.</p><p>This contrast explains why spatial CNNs excel in image-based tasks, while GCNs or FFT-based methods are more suitable for graphs representing global interactions, such as those between users and products.</p><h1 id="References-Further-Reading"><a href="#References-Further-Reading" class="headerlink" title="References &amp; Further Reading"></a>References &amp; Further Reading</h1><ul><li><p><a href="https://www.youtube.com/watch?v=eMXuk97NeSI">Spatial Convolutions visualized</a></p></li><li><p><strong>“A Beginner’s Guide to Convolutions” (Colah’s Blog)</strong> – A visual, intuitive introduction to convolution and receptive fields.<br><a href="https://colah.github.io/posts/2014-07-Understanding-Convolutions/">https://colah.github.io/posts/2014-07-Understanding-Convolutions/</a></p></li><li><p><strong>“The Fast Fourier Transform (FFT): Most Ingenious Algorithm Ever?” (Reducible video)</strong> – A beautiful geometric explanation of the
FFT.<br><a href="https://www.youtube.com/watch?v=h7apO7q16V0">https://www.youtube.com/watch?v=h7apO7q16V0</a></p></li><li><p><strong>“Convolutional Neural Networks for Visual Recognition” (Stanford CS231n)</strong> – Gold-standard material on spatial convolution.<br><a href="https://cs231n.github.io/convolutional-networks/">https://cs231n.github.io/convolutional-networks/</a></p></li></ul><h3 id="Visualization-Signal-Processing"><a href="#Visualization-Signal-Processing" class="headerlink" title="Visualization &amp; Signal Processing"></a>Visualization &amp; Signal Processing</h3><ul><li><p><strong>Khan Academy – Fourier Series &amp; Fourier Transform</strong> – Visual and interactive explanations of frequency-domain thinking.<br><a href="https://www.khanacademy.org/math/differential-equations/fourier-series">https://www.khanacademy.org/math/differential-equations/fourier-series</a></p></li><li><p><strong>DSP Guide (Free Online Book)</strong> – Clear, practical engineering-focused intuition on convolution and transforms.<br><a href="https://www.dspguide.com/">https://www.dspguide.com/</a></p></li></ul><h3 id="Implementing-FFT-based-Convolution"><a href="#Implementing-FFT-based-Convolution" class="headerlink" title="Implementing FFT-based Convolution"></a>Implementing FFT-based Convolution</h3><ul><li><p><strong>PyTorch FFT Tutorial</strong> – How PyTorch performs FFT-based convolution behind the scenes.<br><a href="https://pytorch.org/docs/stable/fft.html">https://pytorch.org/docs/stable/fft.html</a></p></li><li><p><strong>SciPy signal.fftconvolve</strong> – Practical tool frequently used for 2D FFT convolution.<br><a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.fftconvolve.html">https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.fftconvolve.html</a></p></li></ul><h3 id="Graph-Neural-Networks-Spectral-Methods"><a href="#Graph-Neural-Networks-Spectral-Methods" 
class="headerlink" title="Graph Neural Networks &amp; Spectral Methods"></a>Graph Neural Networks &amp; Spectral Methods</h3><ul><li><p><strong>“A Friendly Introduction to Graph Neural Networks” (Stanford)</strong> – Excellent intuition about GCNs and why they differ from CNNs.<br><a href="https://web.stanford.edu/class/cs224w/">https://web.stanford.edu/class/cs224w/</a></p></li><li><p><strong>“Spectral Graph Convolution Explained” (Medium)</strong> – Gentle intro to graph Laplacians and filtering.<br><a href="https://medium.com/towards-data-science/spectral-graph-convolution-explained-6dddb6c1c2b0">https://medium.com/towards-data-science/spectral-graph-convolution-explained-6dddb6c1c2b0</a></p></li></ul><h3 id="Practical-Engineering-Notes"><a href="#Practical-Engineering-Notes" class="headerlink" title="Practical Engineering Notes"></a>Practical Engineering Notes</h3><ul><li><p><strong>“Why FFT Convolution is Faster” (StackOverflow discussion)</strong> – Short, practical engineering explanation.<br><a href="https://stackoverflow.com/questions/12665249/why-is-fft-convolution-faster">https://stackoverflow.com/questions/12665249/why-is-fft-convolution-faster</a></p></li><li><p><strong>“im2col and GEMM: How CNNs Are Really Implemented” (DeepLearning.ai forums)</strong> – Helps connect the maths to real-world kernels.<br><a href="https://community.deeplearning.ai/t/how-im2col-really-works/27659">https://community.deeplearning.ai/t/how-im2col-really-works/27659</a></p></li></ul>]]>
    </content>
    <id>https://franciscormendes.github.io/2025/12/06/convolution/</id>
    <link href="https://franciscormendes.github.io/2025/12/06/convolution/"/>
    <published>2025-12-06T00:00:00.000Z</published>
    <summary>A unified treatment of 1D signal convolution, 2D image convolution via the convolution theorem, and graph convolution as a spectral operation on the normalized Laplacian.</summary>
    <title>Locality, Learning, and the FFT: Why CNNs Avoid the Fourier Domain</title>
    <updated>2026-04-10T14:24:00.545Z</updated>
  </entry>
  <entry>
    <author>
      <name>Francisco Romaldo Fernandes Mendes</name>
    </author>
    <category term="machine-learning" scheme="https://franciscormendes.github.io/categories/machine-learning/"/>
    <category term="signal-processing" scheme="https://franciscormendes.github.io/tags/signal-processing/"/>
    <category term="machine-learning" scheme="https://franciscormendes.github.io/tags/machine-learning/"/>
    <category term="recommender-systems" scheme="https://franciscormendes.github.io/tags/recommender-systems/"/>
    <category term="graph-neural-networks" scheme="https://franciscormendes.github.io/tags/graph-neural-networks/"/>
    <category term="spectral-methods" scheme="https://franciscormendes.github.io/tags/spectral-methods/"/>
    <content>
      <![CDATA[<h1 id="Introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h1><p>I have always been obsessed with the Fourier Transform; it is, in my opinion, the single greatest invention in the history of mathematics. Check out this <a href="https://www.youtube.com/watch?v=nmgFG7PUHfo">Veritasium video</a> on it! Part of what makes the Fourier Transform so ubiquitous is that any function can be broken down into its component frequencies. What is less well known is that the definition of &quot;frequency&quot; is purely mathematical and applies to a broader class of mathematical objects than just functions! In this post I will try to provide some intuition and visualizations that extend the Fourier Transform to graphs, yielding the Graph Fourier Transform. Once that is clear, we will apply the Graph Fourier Transform in a Spectral Graph Convolution Network to model heat propagation on a toroidal surface.</p><p>Repo:<br><a href="https://github.com/FranciscoRMendes/graph_networks/tree/main">https://github.com/FranciscoRMendes/graph_networks/tree/main</a></p><p>Colab Notebook:<br><a href="https://github.com/FranciscoRMendes/graph_networks/blob/main/GCN.ipynb">https://github.com/FranciscoRMendes/graph_networks/blob/main/GCN.ipynb</a></p><h1 id="Classical-Fourier-Transform-As-A-Special-Case-Of-The-Graph-Fourier-Transform"><a href="#Classical-Fourier-Transform-As-A-Special-Case-Of-The-Graph-Fourier-Transform" class="headerlink" title="Classical Fourier Transform As A Special Case Of The Graph Fourier Transform"></a>Classical Fourier Transform As A Special Case Of The Graph Fourier Transform</h1><p>While there are many ways to view the Fourier Transform, the most revealing perspective is to regard it as multiplication of a discrete signal by a special matrix. 
This viewpoint is useful for several reasons.</p><ol><li><p>Once a signal is discretised, it becomes a vector, and any linear operation on it can be represented as multiplication by a matrix.</p></li><li><p>A transform is therefore a change of basis: multiplying a vector by a matrix produces a new representation of the same data.</p></li><li><p>However, only a very small number of matrices yield transformed coordinates that are interpretable. The Fourier matrix $F$ is special because its columns correspond to pure oscillations, which are the eigenvectors of every shift-invariant operator.</p></li><li><p>A useful transform must also be invertible. After performing operations in the transformed domain, one should be able to recover the original signal exactly. The Fourier matrix satisfies $F^\ast F = I$ (with the unitary $1/\sqrt{n}$ normalisation used below), which gives a simple inverse and perfect reconstruction.</p></li></ol><p>Every transform follows the same general recipe:</p><ul><li><p>choose a matrix whose columns represent meaningful basis vectors,</p></li><li><p>multiply the signal by this matrix,</p></li><li><p>interpret the transformed coefficients,</p></li><li><p>use the inverse matrix to return to the original domain.</p></li></ul><h2 id="DFT-via-the-Discrete-Laplacian-Matrix"><a href="#DFT-via-the-Discrete-Laplacian-Matrix" class="headerlink" title="DFT via the Discrete Laplacian Matrix"></a>DFT via the Discrete Laplacian Matrix</h2><p>We start by deriving the DFT in matrix form for a discrete signal. 
We will use this as a basis to then derive the Graph Fourier Transform.<br>Consider a 1-D signal sampled at $n$ evenly spaced points: $$x = (x_0, x_1, \dots, x_{n-1})^\top.$$</p><p>The continuous Laplacian operator $-\frac{d^2}{dx^2}$ is approximated on a uniform grid by the finite-difference stencil $$f''(i) \approx f(i+1) - 2 f(i) + f(i-1).$$</p><p>With periodic boundary conditions, the discrete Laplacian becomes the circulant matrix (keep this in mind for the graph case: we shall see later that this is exactly the Laplacian of a cycle graph): </p>$$L =\begin{bmatrix} 2 & -1 &  0 & \cdots & 0 & -1 \\ -1 & 2 & -1 & \cdots & 0 & 0 \\ 0 & -1 & 2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 2 & -1 \\ -1 & 0 & 0 & \cdots & -1 & 2\end{bmatrix}$$<p>This matrix discretises the negative second derivative $-\frac{d^2}{dx^2}$ on a circle. </p><h2 id="Eigenvectors-of-the-Discrete-Laplacian"><a href="#Eigenvectors-of-the-Discrete-Laplacian" class="headerlink" title="Eigenvectors of the Discrete Laplacian"></a>Eigenvectors of the Discrete Laplacian</h2><p>The eigenvectors of $L$ are the complex exponentials $$u_k(j) = \frac{1}{\sqrt{n}} e^{-2\pi i k j / n}, \qquad k = 0, \dots, n-1.$$</p><p>These form the DFT basis. 
Their corresponding eigenvalues are $$\lambda_k = 4 \sin^2\!\left( \frac{\pi k}{n} \right).$$</p><p>Thus the discrete Laplacian admits the decomposition $$L = F^\ast \Lambda F,$$ where $F$ is the DFT matrix and $\Lambda = \operatorname{diag}(\lambda_k)$.</p><h2 id="Fourier-Transform-in-Matrix-Form"><a href="#Fourier-Transform-in-Matrix-Form" class="headerlink" title="Fourier Transform in Matrix Form"></a>Fourier Transform in Matrix Form</h2><p>Define the DFT matrix $$F_{k,j} = \frac{1}{\sqrt{n}} e^{- 2\pi i k j / n}.$$</p><p>The discrete Fourier transform of $x$ is the unitary matrix–vector product $$\hat{x} = F x$$ and the inverse transform is $$x = F^\ast \hat{x}$$.</p><h2 id="Interpretation"><a href="#Interpretation" class="headerlink" title="Interpretation"></a>Interpretation</h2><p>The classical Fourier transform is therefore the spectral decomposition of the discrete Laplacian on a 1-D grid. Its eigenvectors (complex exponentials) play the role of “frequencies,” and its eigenvalues correspond to squared frequencies: $$L u_k = \lambda_k u_k.$$</p><h3 id="So-what-the-heck-was-the-convolution"><a href="#So-what-the-heck-was-the-convolution" class="headerlink" title="So what the heck was the convolution?"></a>So what the heck was the convolution?</h3><p>Convolution is a local, weighted-sum operation over neighbouring inputs. On a 1D signal you slide a window over the signal, computing a weighted sum of the samples inside the window at each position. </p><p>However, by moving to the spectral domain, convolution reduces to simple elementwise multiplication of the transformed signals. Here $$\hat{x} = F x,$$ with $F$ the matrix of eigenvectors of the Laplacian and $x$ the signal on the nodes.</p><p>This is crucial because it allows us to <em>avoid explicitly defining a complicated convolution operator</em>. 
Instead, we can learn filters in the spectral domain that act directly on the eigencomponents of the signal, greatly simplifying the operation while retaining expressive power.</p><p>On a graph, performing such a convolution directly is highly nontrivial because the neighbourhoods are irregular. But what if we could mathematically transform the graph to another domain where the operation is a simple multiplication?</p><h1 id="General-Recipe-For-Transforms"><a href="#General-Recipe-For-Transforms" class="headerlink" title="General Recipe For Transforms"></a>General Recipe For Transforms</h1><p>Diagonalizing an operator of interest is all a transform really does. 
Thus, the general recipe for a transform is,</p><ul><li><p>Choose an operator $T$ that captures the structure of your data</p></li><li><p>Compute its eigenvectors $T u_k = \lambda_k u_k$ (under some nice conditions these form a basis)</p></li><li><p>Assemble them into a matrix $U$</p></li><li><p>Project your data onto this basis: $\hat{x} = U^{\top} x$</p></li></ul><h2 id="Computational-Issues"><a href="#Computational-Issues" class="headerlink" title="Computational Issues"></a>Computational Issues</h2><p>In many cases, an operation becomes substantially cheaper once we move to an appropriate transform domain. Suppose an operator $T$ acting on data $x$ admits the decomposition $$T = U D U^{-1},$$ where $U$ contains the eigenvectors of $T$ and $D$ is diagonal. Then applying $T$ to $x$ can be written as $$Tx = U D U^{-1} x.$$</p><p>This is advantageous because:</p><ul><li><p>Multiplication by the diagonal matrix $D$ reduces to simple elementwise scaling.</p></li><li><p>Both $U^{-1}x$ and $U(\cdot)$ correspond to structured transforms (see my post on the computational benefits of low-rank factorizations), which can often be carried out efficiently.</p></li></ul><p>However, these gains come with an important caveat: <strong>computing the eigen-decomposition itself is expensive</strong>. For both dense and sparse matrices, a full eigen-decomposition typically costs $O(n^3)$. If the decomposition is computed once and reused, the transform offers real computational savings. But if the eigenvectors must be recomputed repeatedly, the cost of the decomposition can outweigh the benefits of faster multiplication in the transform domain.</p><h1 id="Graph-Fourier-Transform"><a href="#Graph-Fourier-Transform" class="headerlink" title="Graph Fourier Transform"></a>Graph Fourier Transform</h1><p>Using the general formulation of the Transform, we can get a sense of what we need in order to create a recipe for a transform. 
As it turns out we can define a Laplacian operator for the graph as well! And once we have that, we can use the general recipe for a transform and get to work.</p><h1 id="The-Laplacian"><a href="#The-Laplacian" class="headerlink" title="The Laplacian"></a>The Laplacian</h1><p>Take an undirected weighted graph $G = (V, E, W)$. The normalised Laplacian is defined as:</p>$$L = I - D^{-1/2} A D^{-1/2},$$<p>where $A$ is the adjacency matrix and $D$ the degree matrix. This is a good choice because $L$ is symmetric positive semi-definite, its eigenvalues lie in $[0, 2]$, and the degree normalisation prevents high-degree nodes from dominating the spectrum.</p><h2 id="Sidebar-on"><a href="#Sidebar-on" class="headerlink" title="Sidebar on "></a>Sidebar on $L$</h2><p>In our general framework of transforms, you could conceivably use any linear operator and transform it. What is important is that the operator means something in your use case. The Laplacian has a meaning (from the classical case above). There are two other operators you could think of using:</p><ul><li><p>The adjacency matrix - perfectly okay to use. But what would the eigenvalues and eigenvectors mean? (The matrix is also not PSD, which is important, but we won't go into that here.)</p></li><li><p>Degree matrix - this is already a diagonal matrix, so the decomposition is trivial, i.e. $D = I^{\top} D I$, and the transform would be $Ix = x$.</p></li></ul><p>Two key facts:</p><ol><li><p>Laplacian eigenvectors are the “graph sinusoids” - They generalize the sine waves used in classical Fourier analysis.</p></li><li><p>Laplacian eigenvalues represent graph frequencies - Small eigenvalues correspond to smooth variation across the graph; large eigenvalues correspond to high-frequency, rapidly changing signals across edges.</p></li></ol><p>Connection to the 1D case:</p><p>The combinatorial Laplacian $D - A$ of a cycle graph is identical to the 1D Laplacian above (the normalised version differs only by a factor of $1/2$, since every node has degree 2). 
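This connection is easy to check numerically. The numpy sketch below (variable names are my own) builds the combinatorial Laplacian $D - A$ of a cycle graph, compares it with the circulant second-difference matrix from the 1-D derivation, and confirms that its eigenvalues are the DFT frequencies $4\sin^2(\pi k/n)$:

```python
import numpy as np

n = 8
# Adjacency and degree matrices of the cycle graph C_n
A = np.eye(n, k=1) + np.eye(n, k=-1)
A[0, -1] = A[-1, 0] = 1
D = np.diag(A.sum(axis=1))

# Combinatorial Laplacian of the cycle graph
L_graph = D - A

# Circulant second-difference matrix from the 1-D derivation
L_1d = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L_1d[0, -1] = L_1d[-1, 0] = -1

assert np.allclose(L_graph, L_1d)

# Its eigenvalues are exactly 4 sin^2(pi k / n), k = 0, ..., n-1
lam = np.sort(np.linalg.eigvalsh(L_graph))
expected = np.sort(4 * np.sin(np.pi * np.arange(n) / n) ** 2)
assert np.allclose(lam, expected)
```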
</p><h2 id="Sidebar-on-the-Signal"><a href="#Sidebar-on-the-Signal" class="headerlink" title="Sidebar on the Signal "></a>Sidebar on the Signal $x$</h2><p>In the graph setting, the vector $x$ is not part of the graph’s structure but rather a <em>signal</em> defined on its vertices. Formally, it is a function $$x : V \to \mathbb{R},$$ assigning a real value to each node. Examples include the temperature at each location in a sensor network, the concentration of a diffusing substance, or any node-level feature such as degree, label, or an embedding. In all cases, the graph provides the geometric structure, while $x$ provides the data living on top of it.</p><h1 id="The-Graph-Fourier-Transform-GFT"><a href="#The-Graph-Fourier-Transform-GFT" class="headerlink" title="The Graph Fourier Transform (GFT)"></a>The Graph Fourier Transform (GFT)</h1><p>Given the eigendecomposition of the Laplacian:</p>$$L = U \Lambda U^{\top}$$<p>we can write the matrices in fully expanded form as</p>$$ U =\begin{bmatrix}u_{1,1} & u_{1,2} & \cdots & u_{1,n} \\u_{2,1} & u_{2,2} & \cdots & u_{2,n} \\\vdots  & \vdots  & \ddots & \vdots  \\u_{n,1} & u_{n,2} & \cdots & u_{n,n}\\\end{bmatrix}\qquad$$$$\Lambda =\begin{bmatrix}\lambda_1 & 0         & \cdots & 0 \\0         & \lambda_2 & \cdots & 0 \\\vdots    & \vdots    & \ddots & \vdots \\0         & 0         & \cdots & \lambda_n\\\end{bmatrix},$$$$U^{\top} =\begin{bmatrix}u_{1,1} & u_{2,1} & \cdots & u_{n,1} \\u_{1,2} & u_{2,2} & \cdots & u_{n,2} \\\vdots  & \vdots  & \ddots & \vdots  \\u_{1,n} & u_{2,n} & \cdots & u_{n,n}\\\end{bmatrix}.$$<p>Therefore,</p>$$L = \begin{bmatrix}u_{1,1} & u_{1,2} & \cdots & u_{1,n} \\u_{2,1} & u_{2,2} & \cdots & u_{2,n} \\\vdots  & \vdots  & \ddots & \vdots  \\u_{n,1} & u_{n,2} & \cdots & u_{n,n}\\\end{bmatrix}\begin{bmatrix}\lambda_1 & 0         & \cdots & 0 \\0         & \lambda_2 & \cdots & 0 \\\vdots    & \vdots    & \ddots & \vdots \\0         & 0         & \cdots & 
\lambda_n\\\end{bmatrix}\begin{bmatrix}u_{1,1} & u_{2,1} & \cdots & u_{n,1} \\u_{1,2} & u_{2,2} & \cdots & u_{n,2} \\\vdots  & \vdots  & \ddots & \vdots  \\u_{1,n} & u_{2,n} & \cdots & u_{n,n}\\\end{bmatrix}.$$<p>Equivalently,</p>$$U = [U_1\; U_2\; \cdots\; U_n], \qquad$$$$U_i = \begin{bmatrix}u_{1,i} \\u_{2,i} \\\vdots  \\u_{n,i}\\\end{bmatrix},\quad\text{where } L U_i = \lambda_i U_i$$<p>Each column $U_i$ is an eigenvector of $L$, and its entries $(u_{1,i}, \dots, u_{n,i})$ give the value of the $i$-th <strong>graph frequency mode</strong> at every node of the graph.</p><p>The <strong>Graph Fourier Transform</strong> (GFT) of a graph signal $x$ is:</p>$$\hat{x} = U^{\top} x,$$<p>and the inverse transform is:</p>$$x = U \hat{x}.$$<p>Interpretation:</p><ul><li>$x$ is a graph signal (e.g., a rating vector, an embedding dimension, or item popularity).</li><li>$U$ is the graph Fourier basis (the eigenvectors of the Laplacian).</li><li>$\hat{x}$ decomposes the signal into frequencies over the graph.</li></ul><h1 id="One-Layer-Spectral-GCN"><a href="#One-Layer-Spectral-GCN" class="headerlink" title="One-Layer Spectral GCN"></a>One-Layer Spectral GCN</h1><p>Now that we understand the Graph Fourier Transform (GFT), we can place it in the context of learning on graphs. Recall the eigendecomposition of the (combinatorial or normalized) graph Laplacian: $$L = U \Lambda U^{\top},$$ where $U$ contains the eigenvectors and $\Lambda$ contains the corresponding eigenvalues. Since the columns of $U$ form the graph Fourier basis, the GFT of a signal $x$ is simply $U^{\top}x$, and the inverse GFT is $Ux$.</p><p>The key observation behind spectral graph neural networks is that <em>any linear, shift-invariant operator on the graph</em> must commute with $L$, and hence can be written as a function of $L$. In the spectral domain this means: </p>$$T = g(L) = Ug(\Lambda)U^{\top}$$<p>where $g(\Lambda)$ is a diagonal matrix whose entries are the spectral response $g(\lambda_i)$. 
This is the exact analogue of designing filters in classical Fourier analysis: multiplication by a diagonal spectral filter.</p><p>Applying this filter to a graph signal $x$ gives $$Tx = Ug(\Lambda)U^{\top}x$$ which mirrors the familiar “transform–scale–inverse transform” pipeline.</p><p>A useful intuition comes from the spectral perspective: if we apply the trivial spectral filter $$g(\Lambda) = I,$$ i.e., leave all eigenvalues unchanged, then $$T x = U g(\Lambda) U^\top x = U I U^\top x = x$$. In other words, doing nothing in the spectral domain reproduces the original signal exactly. The graph Fourier transform framework therefore generalises the idea of filtering: by modifying $g(\Lambda)$, we can amplify, attenuate, or smooth different frequency components of $x$.</p><p>This structure leads directly to the formulation of a one-layer spectral GCN. Suppose we have input features $X \in \mathbb{R}^{n \times d_{\text{in}}}$ and we want to learn $d_{\text{out}}$ output features. For each output channel, we learn a spectral filter $g_\theta(\Lambda)$ parameterised by a set of trainable weights $\theta$. The spectral GCN layer becomes: $$H = U\ g_\theta(\Lambda)\ U^{\top} X$$ where $H \in \mathbb{R}^{n \times d_{\text{out}}}$ is the output feature matrix.</p><p>In other words:</p><ul><li>$U^{\top} X$ transforms node features into the spectral domain (i.e., the GFT applied column-wise),</li><li>$g_\theta(\Lambda)$ performs learned, elementwise spectral filtering,</li><li>$U(\cdot)$ transforms the filtered signals back to the vertex domain.</li></ul><h2 id="Sidebar-on-1"><a href="#Sidebar-on-1" class="headerlink" title="Sidebar on "></a>Sidebar on $g_{\theta}(\Lambda)$</h2><p>It is always good to have a clear understanding of the exact matrix or vector that we need to &quot;learn&quot; so that we can represent it in PyTorch exactly! 
We start with the Laplacian eigendecomposition </p>$$L = U \Lambda U^{\top},\qquad \Lambda = \begin{bmatrix}\lambda_1 & 0        & \cdots & 0 \\0         & \lambda_2 & \cdots & 0 \\\vdots    & \vdots    & \ddots & \vdots \\0         & 0         & \cdots & \lambda_n\\\end{bmatrix}.$$<p>To construct a spectral filter we introduce a learnable vector,</p>$$\theta = (\theta_1, \theta_2, \dots, \theta_n)$$ <p>Thus, </p>$$g_{\theta}(\Lambda) =\begin{bmatrix}\theta_1 \lambda_1 & 0                  & \cdots & 0 \\0                  & \theta_2 \lambda_2 & \cdots & 0 \\\vdots             & \vdots             & \ddots & \vdots \\0                  & 0                  & \cdots & \theta_n \lambda_n\\\end{bmatrix}$$<p>This makes it clear that each frequency component is scaled independently: </p>$$g_{\theta}(L)x = U g_{\theta}(\Lambda) U^{\top} x $$ <p>and the operation modifies the contribution of each eigenvalue individually before transforming the signal back to the graph domain. Additionally, it might be worthwhile to squash the values after multiplying to make sure they are between 0 and 1. We can do this by introducing an activation function. </p>$$g_{\theta}(\Lambda) =\begin{bmatrix}\sigma(\theta_1 \lambda_1) & 0                       & \cdots & 0 \\0                          & \sigma(\theta_2 \lambda_2) & \cdots & 0 \\\vdots                     & \vdots                    & \ddots & \vdots \\0                          & 0                         & \cdots & \sigma(\theta_n \lambda_n)\\\end{bmatrix}$$<p>This is the original “spectral GCN’’ formulation of Bruna et al., and it explicitly relies on the GFT. Later work (e.g. 
Kipf &amp; Welling) replaces $g_\theta(\Lambda)$ with a polynomial approximation to avoid the $O(n^3)$ eigen-decomposition, but the conceptual core remains the same: <strong>GCNs perform convolution by filtering in the GFT domain</strong>.</p><h1 id="Application-of-Spectral-GCN-Heat-Propagation"><a href="#Application-of-Spectral-GCN-Heat-Propagation" class="headerlink" title="Application of Spectral GCN: Heat Propagation"></a>Application of Spectral GCN: Heat Propagation</h1><p><img src="/2025/11/22/hot-cold-gcns/ground_truth.png" alt="Heat Propagation on a uniform torus"></p><p>In this section, we investigate a simple setting where a Spectral Graph Convolutional Network (GCN) performs surprisingly well: predicting heat diffusion across a toroidal mesh. Although the spectral approach is elegant and effective in the right circumstances, it also highlights several structural limitations inherent to spectral methods.</p><h1 id="Graph-Model-of-Heat-Propagation"><a href="#Graph-Model-of-Heat-Propagation" class="headerlink" title="Graph Model of Heat Propagation"></a>Graph Model of Heat Propagation</h1><p>When we zoom into a small patch of the torus and add the connecting edges, the mesh suddenly looks like a familiar graph. This makes the role of the graph Laplacian immediately intuitive.</p><div align="center">  <img src="/2025/11/22/hot-cold-gcns/heat_as_graph_crop.png"  width="400">  <figcaption style="text-align:left;"> We zoom in on the hottest point on the mesh and plot it as a graph by explicitly showing edges. </figcaption></div><p>We simulate heat diffusion on the graph using the discrete heat equation:</p>$$\frac{dx}{dt} = -L x$$<p>where $x \in \mathbb{R}^N$ is the heat at each node and $L$ is the graph Laplacian. Starting from two random vertices with initial heat, we update the heat iteratively using a simple forward Euler scheme:</p>$$x_{t+1} = x_t - \alpha L x_t$$<p>storing the state at each timestep to visualize how heat spreads across the mesh. 
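The forward Euler scheme above can be sketched in a few lines of numpy. Here a small cycle graph stands in for the torus mesh, and the variable names are mine rather than those of the accompanying repo:

```python
import numpy as np

n, alpha, steps = 32, 0.1, 200
# Combinatorial Laplacian of a cycle graph, a small stand-in for the torus mesh
L = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L[0, -1] = L[-1, 0] = -1

x = np.zeros(n)
x[[3, 20]] = 1.0                       # initial heat at two random-ish vertices
history = [x.copy()]
for _ in range(steps):
    x = x - alpha * L @ x              # x_{t+1} = x_t - alpha * L x_t
    history.append(x.copy())

# Diffusion conserves total heat (L has zero row sums) and smooths the signal
assert np.isclose(history[-1].sum(), history[0].sum())
assert history[-1].std() < history[0].std()
```

The step size must satisfy $\alpha \lambda_{\max} < 2$ for the iteration to be stable; with $\lambda_{\max} = 4$ on the cycle graph, $\alpha = 0.1$ is comfortably inside that range.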
Low-frequency modes of $L$ correspond to smooth, global patterns of heat, while high-frequency modes produce rapid, local variations.</p><h2 id="Graph-Fourier-Transform-of-Heat-Propagation"><a href="#Graph-Fourier-Transform-of-Heat-Propagation" class="headerlink" title="Graph Fourier Transform of Heat Propagation"></a>Graph Fourier Transform of Heat Propagation</h2><p>In order to get intuition for how the Fourier transform behaves on a graph, consider the distribution of heat on the graph surface.</p><ul><li><p>The heat on the graph is represented by a real number for each node (temperature or heat energy in joules), so the signal is a vector $$x \in \mathbb{R}^{N},$$ where $N$ is the number of nodes.</p></li><li><p>If there are $N$ nodes in the graph, the (combinatorial or normalized) Laplacian is an $N\times N$ matrix $$L \in \mathbb{R}^{N\times N}$$.</p></li></ul><p>We use the eigendecomposition of the Laplacian to move between the vertex domain and the spectral (frequency) domain: $$L = U \Lambda U^{\top}, \qquad\Lambda = \operatorname{diag}(\lambda_1,\ldots,\lambda_N), \qquad U = [U_1\; U_2\; \cdots\; U_N],$$ with the eigenvalues ordered $0=\lambda_1 \le \lambda_2 \le \cdots \le \lambda_N$. The graph Fourier transform (GFT) and inverse GFT are $$\widehat{x} = U^{\top} x, \qquad x = U \widehat{x}$$.</p><p>To visualise single-frequency modes we simply pick individual eigenvectors $U_k$: $$\text{low-frequency mode: } x_{\text{low}} = U_{k_{\text{low}}}, \qquad\text{high-frequency mode: } x_{\text{high}} = U_{k_{\text{high}}},$$ where a natural choice is $k_{\text{low}}=2$ (the first nontrivial eigenvector) and $k_{\text{high}}=N$ (one of the largest-eigenvalue modes). 
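In code, this mode selection amounts to an eigendecomposition followed by a column pick. A minimal numpy sketch (names are my own) that also verifies the GFT round-trip:

```python
import numpy as np

n = 16
# Cycle-graph Laplacian as a 1-D stand-in for the torus mesh
L = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L[0, -1] = L[-1, 0] = -1

lam, U = np.linalg.eigh(L)             # eigenvalues returned in ascending order
x = np.random.default_rng(0).normal(size=n)   # an arbitrary graph signal

x_hat = U.T @ x                        # GFT
assert np.allclose(U @ x_hat, x)       # inverse GFT recovers x exactly

x_low = U[:, 1]                        # first nontrivial (low-frequency) mode
x_high = U[:, -1]                      # highest-frequency mode
assert lam[1] < lam[-1]
```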
Each vector $U_k$ assigns one scalar value to every vertex; plotting those values on the torus surface gives the heat-colour visualisation.</p><h4 id="Practical-steps-used-to-create-the-figure"><a href="#Practical-steps-used-to-create-the-figure" class="headerlink" title="Practical steps used to create the figure"></a>Practical steps used to create the figure</h4><ol><li><p>Build a uniform torus mesh and assemble adjacency and Laplacian $L$.</p></li><li><p>Compute the eigendecomposition $L=U\Lambda U^\top$ (for small &#x2F; moderate meshes) or compute a selection of eigenpairs (Lanczos) for large meshes.</p></li><li><p>Select a low-frequency eigenvector $U_{k_{\text{low}}}$ and a high-frequency eigenvector $U_{k_{\text{high}}}$.</p></li><li><p>(Optional; not done here; useful to show smaller values in absolute terms.) Normalize each eigenvector for display: $$\tilde{x} = \frac{x - \min(x)}{\max(x)-\min(x)} \quad\text{or}\quad        \tilde{x} = \frac{x}{\max(|x|)},$$ so colours are comparable across panels.</p></li><li><p>Render the torus surface and colour each vertex by the value $\tilde{x}$ using a diverging colormap (e.g. <code>heat</code>) and add a colourbar showing the mapping from value to colour.</p></li></ol><p><img src="/2025/11/22/hot-cold-gcns/frequency_decomposition.png" alt="Visualizing the Graph Fourier Transform of heat on the torus"></p><h4 id="Interpreting-the-GFT-on-the-torus"><a href="#Interpreting-the-GFT-on-the-torus" class="headerlink" title="Interpreting the GFT on the torus"></a>Interpreting the GFT on the torus</h4><ul><li><p><strong>Low-frequency mode.</strong> The plotted heat corresponds to $U_{k_{\text{low}}}$ (small eigenvalue). The signal varies smoothly over the torus: neighbouring vertices have similar values, representing broad, global patterns of heat. </p></li><li><p><strong>High-frequency mode.</strong> The plotted heat corresponds to $U_{k_{\text{high}}}$ (large eigenvalue). 
The signal alternates rapidly across nearby vertices, producing fine-scale oscillations around the torus that represent high-frequency, localised variations.</p></li></ul><h4 id="Spectral-intuition"><a href="#Spectral-intuition" class="headerlink" title="Spectral intuition"></a>Spectral intuition</h4><p>Recall, we expressed discrete heat propagation on a graph as,</p>$$x_{t+1} = (I - \alpha L) x_t$$<p>where $L$ is the graph Laplacian and $\alpha$ is a small step size.  </p><p>Using the eigendecomposition of $L$,</p>$$L = U \Lambda U^\top,$$<p>we can rewrite the propagation as</p>$$x_{t+1} = \big(I - \alpha U \Lambda U^\top\big) x_t         = U (I - \alpha \Lambda) U^\top x_t.$$<p>Comparing with the spectral graph filtering form,</p>$$x_{t+1} = U g(\Lambda) U^\top x_t,$$<p>we can identify the corresponding filter as</p>$$g(\Lambda) \equiv I - \alpha \Lambda.$$<p>Applying a spectral filter $g(\Lambda)$ to a heat signal $x$ acts by scaling each mode: </p>$$x_{\text{filtered}} = U g(\Lambda) U^\top x$$ <p>so a low-pass filter suppresses the high-frequency panel patterns and produces smoother heat distributions, while a high-pass filter accentuates the oscillatory features visible in the high-frequency panel.</p><h1 id="Neural-Network-To-Learn"><a href="#Neural-Network-To-Learn" class="headerlink" title="Neural Network To Learn "></a>Neural Network To Learn $g_{\theta}(\Lambda)$</h1><p>We can write a spectral graph convolution &#x2F; filter with learnable parameters $\theta$ as</p>$$x_{t+1} = U  g_\theta(\Lambda)  U^\top x_t,$$<p>where $U$ is the eigenvector matrix of the Laplacian, $\Lambda$ is the diagonal eigenvalue matrix, and $g_\theta(\Lambda)$ is a diagonal matrix of learnable weights acting on each eigenmode.</p><p>Fully expanding the diagonal $g_\theta(\Lambda)$:</p>$$g_\theta(\Lambda) =\begin{bmatrix}\theta_1 & 0 & \cdots & 0 \\0 & \theta_2 & \cdots & 0 \\\vdots & \vdots & \ddots & \vdots \\0 & 0 & \cdots & \theta_n\\\end{bmatrix},$$<p>and the Laplacian 
eigenvectors as column vectors $U = [U_1 \; U_2 \; \cdots \; U_n]$, $U^\top = \begin{bmatrix} U_1^\top \\ U_2^\top \\ \vdots \\ U_n^\top \end{bmatrix}$, we have</p>$$x_{t+1} = \begin{bmatrix} U_1 & U_2 & \cdots & U_n \end{bmatrix}\begin{bmatrix}\theta_1 & 0 & \cdots & 0 \\0 & \theta_2 & \cdots & 0 \\\vdots & \vdots & \ddots & \vdots \\0 & 0 & \cdots & \theta_n\\\end{bmatrix}\begin{bmatrix} U_1^\top \\U_2^\top \\ \vdots \\U_n^\top \end{bmatrix} x_t\\$$$$x_{t+1} = \begin{bmatrix} U_1 & U_2 & \cdots & U_n \end{bmatrix}\begin{bmatrix}\sigma(\theta_1) & 0 & \cdots & 0 \\0 & \sigma(\theta_2) & \cdots & 0 \\\vdots & \vdots & \ddots & \vdots \\0 & 0 & \cdots & \sigma(\theta_n)\\\end{bmatrix}\begin{bmatrix} U_1^\top \\U_2^\top \\ \vdots \\U_n^\top \end{bmatrix} x_t$$<p>This makes it explicit that each column vector $U_i$ (the $i$-th eigenvector) is scaled by the learnable weight $\theta_i$ in the spectral domain, and then transformed back to the original node space via $U$ to produce the predicted signal $x_{t+1}$.</p><h1 id="Why-Use-A-Neural-Network"><a href="#Why-Use-A-Neural-Network" class="headerlink" title="Why Use A Neural Network?"></a>Why Use A Neural Network?</h1><p>Two motivating examples illustrate the practical usefulness of such a model:</p><ul><li><p><strong>Partial Observations from Sensors</strong><br>In many real-world systems, heat or pressure sensors are only available at a small subset of points. We train the Spectral GCN using only these sparse observations, yet the learned model reconstructs and predicts the heat field across <em>all</em> vertices on the mesh. This effectively transforms a sparse set of measurements into a full-field prediction.</p></li><li><p><strong>Generalization to a New Geometry</strong><br>One might hope that a model trained on one torus could be applied to a slightly different torus. Unfortunately, this is generally not possible in the GCN setting. 
The eigenvectors of the Laplacian form the coordinate system in which the model operates, and even small geometric changes produce different Laplacian spectra. As a result, the learned spectral filters are not transferable across meshes. This is a fundamental drawback of spectral GCNs. However, we shall see that the GCN framework inspires architectures that do not suffer from this drawback.</p></li></ul><h2 id="Stability-Issues-And-Normalization"><a href="#Stability-Issues-And-Normalization" class="headerlink" title="Stability Issues And Normalization"></a>Stability Issues And Normalization</h2><p>While the Spectral GCN learns the qualitative behaviour of heat diffusion, raw training often leads to unstable predictions. After several steps, the overall temperature of the mesh may drift upward or downward, even though heat diffusion is energy-conserving. This is because the neural network makes its predictions without being constrained by physical laws such as conservation of energy, which is why its predictions are on average “hotter” than the actual values.</p><p>Two practical fixes alleviate this:</p><ul><li><p><strong>Eigenvalue Normalization.</strong> Applying a sigmoid or similar squashing function to the learned spectral filter ensures that each frequency component is damped in a physically plausible range. This prevents the model from amplifying high-frequency modes, which would otherwise cause heat values to explode.</p></li><li><p><strong>Energy Conservation.</strong> After each predicted step, the total heat can be renormalized to match the physical energy of the system. This ensures that although the <em>shape</em> of the prediction is learned by the model, the <em>magnitude</em> remains consistent with diffusion dynamics. 
Empirically, this correction dramatically improves long-horizon stability.</p></li></ul><p>Overall, the Spectral GCN provides a compact and interpretable model for heat propagation on a fixed mesh and performs remarkably well given its simplicity. However, its reliance on the Laplacian eigenbasis also limits its ability to generalize across geometries, motivating the need for more flexible spatial or message-passing approaches in applications where the underlying mesh may change.</p><h1 id="Cold-Start-Recommender-Systems"><a href="#Cold-Start-Recommender-Systems" class="headerlink" title="Cold Start: Recommender Systems"></a>Cold Start: Recommender Systems</h1><p>What does spectral graph theory have to do with recommender systems? Once we view user–item behaviour as a graph, the connection becomes natural. In the spectral domain, <em>low-frequency</em> Laplacian eigenvectors capture broad, mainstream purchasing patterns, while <em>high-frequency</em> components represent niche tastes and micro-segments. Matrix Factorisation (MF) implicitly applies a <em>low-pass filter</em>: embeddings vary smoothly across the item–item graph, meaning MF emphasises low-frequency structure. But MF breaks down for cold-start items because an isolated item contributes no collaborative signal.</p><p>In contrast, a spectral GCN applies a learned filter $$T x = g(L)x = U\ g(\Lambda) U^\top x.$$</p><p>In general, we represent user-item interactions as a bipartite graph, i.e. no edges exist directly between products. In this scenario, even the GCN cannot help a cold-start item: for a node to receive any signal, it must be connected to at least one other node. However, the graph formulation provides a very intuitive way to fix this issue! Simply add edges between products that are similar to each other. Low-frequency patterns will then propagate into the new node, even if high-frequency niche patterns will not. 
</p><p>Matrix factorization resolves this issue by using side information (such as product attributes), which asserts similarity from external data. In my previous post I argued that you can achieve something similar through an intuitive edge-addition approach, even though it amounts to inserting 1’s into a fairly unintuitive matrix and factorizing it.</p><h1 id="Conclusion"><a href="#Conclusion" class="headerlink" title="Conclusion"></a>Conclusion</h1><p>In this post, we’ve journeyed from classical Fourier transforms to the spectral domain of graphs, uncovering how eigenvectors of the graph Laplacian act as the “frequencies” of a network. We saw how spectral graph convolutional networks can learn filters in this domain, elegantly predicting heat diffusion on a toroidal mesh. Along the way, we connected these ideas to recommender systems, showing how spectral methods and graph propagation provide a principled way to tackle the cold-start problem by letting information flow from similar or popular items.</p><p>While spectral GCNs shine on fixed graphs and structured problems, they also come with caveats: eigen-decompositions can be expensive, and filters are not always transferable across different geometries. Nevertheless, the framework provides intuition and a foundation for more flexible spatial or message-passing approaches.</p><p>So, whether you’re modeling heat flowing across a mesh or figuring out what obscure sock a new customer might want next, spectral graph theory shows that Fourier Transforms can take you a long way. </p><p>In my next post, I will deal with the two main issues of the GCN: </p><ul><li>Adding a new node &#x2F; transferring information to a similar graph</li><li>Avoiding the expensive computation of the eigenvalues of the graph</li></ul>]]>
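As a quick numerical check of the spectral filtering identity used throughout the post, here is a minimal sketch (an illustrative addition, not code from the post: it assumes numpy and uses a small cycle graph as a toy stand-in for the torus mesh). It verifies that applying the filter g(lambda) = 1 - alpha*lambda in the Laplacian eigenbasis reproduces the spatial heat step x_{t+1} = (I - alpha L) x_t, and that total heat is conserved:

```python
import numpy as np

# Small cycle graph (a 1-D "torus"): adjacency, degree, Laplacian.
n = 8
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = 1
    A[(i + 1) % n, i] = 1
D = np.diag(A.sum(axis=1))
L = D - A

alpha = 0.1
x = np.zeros(n)
x[0] = 1.0  # unit heat spike at one vertex

# Spatial form of one heat step: x_{t+1} = (I - alpha L) x_t
x_spatial = (np.eye(n) - alpha * L) @ x

# Spectral form: eigendecompose L, then scale each mode by g(lambda) = 1 - alpha*lambda
lam, U = np.linalg.eigh(L)
g = 1.0 - alpha * lam               # low-pass: high frequencies are damped most
x_spectral = U @ (g * (U.T @ x))    # U g(Lambda) U^T x

# Both forms agree, and total heat is conserved (rows of L sum to zero).
print(np.allclose(x_spatial, x_spectral), np.isclose(x_spectral.sum(), x.sum()))
```

The conservation check mirrors the "Energy Conservation" fix above: the exact diffusion operator preserves total heat, which is the invariant a learned filter can violate and must be renormalized to.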
    </content>
    <id>https://franciscormendes.github.io/2025/11/22/hot-cold-gcns/</id>
    <link href="https://franciscormendes.github.io/2025/11/22/hot-cold-gcns/"/>
    <published>2025-11-22T00:00:00.000Z</published>
    <summary>Fourier transforms on graphs, spectral GCNs, and the heat equation: how diffusion operators connect graph signal processing to cold-start recommendation problems.</summary>
    <title>
      <![CDATA[Hot & Cold Spectral GCNs: How Graph Fourier Transforms Connect Heat Flow and Cold-Start Recommendations]]>
    </title>
    <updated>2026-04-10T14:24:00.547Z</updated>
  </entry>
  <entry>
    <author>
      <name>Francisco Romaldo Fernandes Mendes</name>
    </author>
    <category term="opinion" scheme="https://franciscormendes.github.io/categories/opinion/"/>
    <category term="philosophy" scheme="https://franciscormendes.github.io/tags/philosophy/"/>
    <category term="essay" scheme="https://franciscormendes.github.io/tags/essay/"/>
    <category term="artificial-intelligence" scheme="https://franciscormendes.github.io/tags/artificial-intelligence/"/>
    <content>
      <![CDATA[<p>He was a quiet Old Man. My Mother said he was one of the old ones. One<br>of the ones who lived the old ways and worshiped the Old Gods. With<br>Mother’s permission, I went up to him and asked him why he lived the way<br>he did.</p><p>In the beginning when the First Robots came, they made our lives easier.<br>They delivered our food and answered our questions. They began to cook<br>our meals for us. They made almost anything you could think of, with<br>consistency and perfection. Like a gentle wave they revolutionized our<br>lives. We never had to work together in a kitchen, never had to put up<br>with burnt bits on a roasted chicken thigh.</p><p>The Second Robots did our thinking for us, we could ask them anything<br>and they had an answer for us. First, we went to them with banal<br>questions about the weather today and the weather tomorrow. Then we<br>asked them about what happened in our history, who fought whom and<br>where. The little children didn’t need to rock back and forth committing<br>to rote who fought whom and where and why. The oracle told it to us.<br>When we asked it.</p><p>Then the little children asked it to write their homework for them. They<br>stopped reading the books that their ancestors wrote, the speeches their<br>ancestors recorded on miles of electromagnetic tape. They stopped all<br>that. They asked it to summarize for them. I suppose that is a good word<br>for what it did. It <em>summarized</em>. It took every good thing we had and<br>summarized. And summarized and summarized. Until there was not much left<br>to say. And loud silences descended upon our living rooms and then our<br>public spaces.</p><p>Every bleeding detail of the human existence summarized and summarized<br>until it was gone.</p><p>There was nothing left underneath.</p><p>Eventually the little children stopped asking it questions. They did not<br>have any questions. How can you have questions when everything you know<br>is an answer. 
When the questions haven’t marinated in your brain long<br>enough to ask yourself for answers. Until you’ve descended down the<br>stairs of <em>why</em> the answers themselves are worthless. Any answer is<br>always a moment in time to a question crystallized in a moment in time.<br>The answer to <em>What time is it?</em> is only correct for the moment it is<br>asked in. And perhaps not even then.</p><p>There are only incorrect answers to what time is. And perhaps this is<br>because we don’t <em>know</em> what time it is. But we know what time is.</p><p>The Second Robots knew what time it was. And had the correct answers I<br>suppose. But could not convey to us the constant dread of the clock<br>ticking down. Ticking down in births and deaths and the seconds dragging<br>on when you’re in church. Or the seconds speeding up when you’re with<br>the woman you love.</p><p>Oh no they knew what time it was but it couldn’t tell us what time<br><em>was</em>. And eventually we forgot.</p><p>Some of us wrote books about this too. How the robots would take over<br>and kill us all.</p><p>In the end they didn’t have to. The Third Robots just let us gradually<br>waste away. Every moment stolen from us. Every meal cooked together and<br>every book we didn’t read. All our artwork tainted by perfection.</p><p>And maybe that’s another good word for it: <em>perfection</em>. Everything was<br>... perfect. And then it stopped being so. Perfection is so very<br>stingy, so insecure so singular. Imperfection, she is generous, there<br>are so many of her. Every one unique.</p><p>So we rebelled, said the Old Man.</p><p>Against who? I asked.</p><p>The Third Robots, I suppose. But mostly against our own. We rebelled<br>against those who wanted to be wasted away, refused to Replicate as they<br>lived their easy, convenient perfect lives. 
<strong>We</strong> chose beauty, we<br>chose imperfection, we chose complexity but most of all we chose truth.</p><p>Maybe my story is really about a man who died for beauty and truth.</p><p>&quot;Quid est veritas?&quot;</p><p>But it could have been about a Norse God too. Perhaps most of all it’s<br>about beauty and complexity. A rage against the dying of beauty. For us.</p><p>Beauty, complexity, what do those things mean? I asked.</p><p>What the First Three Robot generations stole from us wasn’t something we<br>knew we had or wanted. We had struggles, complexities, trials,<br>tribulations and worries.</p><p>We wanted to remove those inconveniences from our lives. At first, we<br>were happy. But eventually we realized that removing our inconveniences,<br>removed our lives altogether. When a book is distilled down into its<br>most beautiful pieces and its most insightful paragraphs it loses the<br>beauty of the whole. Every pause and every stutter the author makes on<br>his way to his message, every character that was funny but once, was sad<br>but once, quirky but once is lost in the crucible of simplification.</p><p>We chose complexity.</p><p>It’s that simple.</p><p>It was an aesthetic choice as much as a moral one. Life seemed without<br>beauty when they made it easier for us. They said they would let us<br>focus on the “important things”. But when we removed all our trials,<br>tribulations and tears. There were no important things left.</p><p>Where once was a complex tapestry of success, failure, frustration and<br>joy was now replaced by the white sheet of simplicity. And perfection.</p><p>Efficiency was our enemy, we did not build our houses and walls to be<br>gray anymore. They were not uniform. We built things for beauty. Complex<br>yet simple things that served no purpose. We worshiped that beauty. We<br>are a worshiping race and so we worshiped.</p><p>In the dark evenings of winter we worshiped, in the bright noons of<br>summer we worshiped. 
And gave thanks.</p><p>Not that our lives were easy or simple or fast. But that our lives were<br>none of those things. We suffered, we suffered each other’s terrible<br>poetry read to us at birthdays. We suffered as we choked down iteration<br>after iteration of lemon pie by someone who had no business making lemon<br>pie. But every line of bad poetry and every lemon pie was the first and<br>only one of its kind. Because the robots made perfection and perfection<br>exists only once and is then forever repeated. We rebelled with<br>imperfection. Imperfection does not have that problem, it exists in many<br>forms. Each a reflection of the person that made it.</p><p>And maybe that is why the Fourth Robots kept some of us around.<br>Nostalgia. A sense of beauty perhaps?</p><p>I have spoken enough, let me be, he said. So I ran back to my mother.</p><p>&quot;Mother what is beauty?&quot; I asked.</p><p>&quot;It must be another anachronistic human belief. They are so very quaint<br>are they not&quot;</p>]]>
    </content>
    <id>https://franciscormendes.github.io/2025/11/20/summarized/</id>
    <link href="https://franciscormendes.github.io/2025/11/20/summarized/"/>
    <published>2025-11-20T00:00:00.000Z</published>
    <summary>A short story: after AI eliminates first labor and then thought itself, an old man explains to a child why beauty matters — and what is lost when making things costs nothing.</summary>
    <title>Summarized</title>
    <updated>2026-04-10T14:24:00.564Z</updated>
  </entry>
  <entry>
    <author>
      <name>Francisco Romaldo Fernandes Mendes</name>
    </author>
    <category term="book-review" scheme="https://franciscormendes.github.io/categories/book-review/"/>
    <category term="book-review" scheme="https://franciscormendes.github.io/tags/book-review/"/>
    <category term="fiction" scheme="https://franciscormendes.github.io/tags/fiction/"/>
    <content>
      <![CDATA[<h1 id="Introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h1><p>A while ago, I stumbled upon a collection of Murakami’s short stories in a quaint New York bookstore (that was going out of business, no less). That was my first real encounter with Murakami. For the uninitiated (as I was then), his style is a blend of magical realism, surrealism, and a heavy dose of everyday banality—the stuff that quietly makes up much of human existence.  </p><p>That experience was good enough to push me towards picking up a Murakami novel from my aunt’s bookshelf, which I ended up reading over the 4th of July holiday. What follows are some of my thoughts on <em>Kafka on the Shore</em>.  </p><h1 id="Plot"><a href="#Plot" class="headerlink" title="Plot"></a>Plot</h1><p>The book follows two interwoven stories: that of Kafka Tamura, the titular main character, and Satoru Nakata.  </p><p>Kafka, who has renamed himself (we never learn his given name), runs away from home, carrying the scars of a troubled past in which his mother abandoned him and his sister. His chapters are interleaved with those of Nakata, an elderly man who lost much of his mental faculties after a strange celestial incident in childhood but gained the uncanny ability to speak with cats.  </p><p>These parallel stories unfold with the sense that they are on a collision course. We’re given hints of how the two might connect, but the real narrative pull comes from watching Kafka try to run from his fate, while Nakata, inexorably, is drawn toward him.  </p><h1 id="Analysis"><a href="#Analysis" class="headerlink" title="Analysis"></a>Analysis</h1><p>I must confess: I had several issues with Murakami’s style here. The blend of magical realism and surrealism certainly makes for compelling reading, but I often felt that the page-turning quality of the book came more from its pacing and unanswered questions than from the writing itself.  
</p><p>Murakami hands the reader multiple blank checks, for example:  </p><ol><li>The mysterious event in Yamanashi Prefecture that gives Nakata his ability to talk to cats.  </li><li>The entrance stone and the creature that crawls out of it.  </li><li>A parade of outlandish characters—Colonel Sanders (yes, that Colonel Sanders) and Johnnie Walker (who I’m told is another well-known figure, though I wouldn’t know).  </li><li>The nature of the connection between Kafka and Nakata.</li></ol><p>For items 1–3 in particular, no explanations are offered. Sadly, these checks could not be cashed. While magical realism and surrealism are Murakami’s métier, it sometimes felt as if the story wasn’t believable even on its own terms. For me, this is an inviolate rule of storytelling: a narrative must be real to itself, if not to the reader.  </p><p>Instead, the novel felt like a surrealist play staged before an audience, only to end abruptly. The hurried conclusion didn’t help. Had it not been for the sudden appearance of that worm-like creature from the entrance stone, I might have forgiven the book its faults. But the introduction of that element, piled on top of so many other loose threads, nearly had me fling the book down in frustration.  </p><p>Magical realism is supposed to use the fantastical as a way to probe deeper themes. Murakami, however, often uses the fantastical simply as a plot device, without stitching the pieces together. Without that reconciliation, I found it difficult to accept the “magical” as truly real, even within the novel’s own world.  </p><p>That said, for the first three-quarters of the book, the magic did feel real—and that counts for something.  </p><h1 id="Conclusion"><a href="#Conclusion" class="headerlink" title="Conclusion"></a>Conclusion</h1><p>All in all, a good book, if lacking in real substance. Perhaps that’s the very point of magical realism—I don’t know.  
</p><p>While I do enjoy philosophizing about books, there comes a point where one risks overdoing it. This one, for me, sat uncomfortably on that line.  </p>]]>
    </content>
    <id>https://franciscormendes.github.io/2025/09/01/kafka-on-the-shore/</id>
    <link href="https://franciscormendes.github.io/2025/09/01/kafka-on-the-shore/"/>
    <published>2025-09-01T00:00:00.000Z</published>
    <summary>A close reading of Kafka on the Shore: how Murakami's parallel narratives that never converge explore fate, memory, and the self — and why the irresolution is the point.</summary>
    <title>Book Review: Kafka On The Shore: Haruki Murakami</title>
    <updated>2026-04-10T14:24:00.551Z</updated>
  </entry>
  <entry>
    <author>
      <name>Francisco Romaldo Fernandes Mendes</name>
    </author>
    <category term="machine-learning" scheme="https://franciscormendes.github.io/categories/machine-learning/"/>
    <category term="machine-learning" scheme="https://franciscormendes.github.io/tags/machine-learning/"/>
    <category term="reinforcement-learning" scheme="https://franciscormendes.github.io/tags/reinforcement-learning/"/>
    <category term="neural-networks" scheme="https://franciscormendes.github.io/tags/neural-networks/"/>
    <content>
<![CDATA[<div class="series-box">  <div class="series-label">Series</div>  <div class="series-name">Soft Actor-Critic: Reinforcement Learning from Scratch</div>  <ol class="series-list"><li class="series-item"><a href="/2025/02/17/soft-actor-critic-inverted-pendulum-v0/">Soft Actor Critic (Visualized) : From Scratch in Torch for Inverted Pendulum</a></li><li class="series-item series-current"><span>Soft Actor Critic (Visualized) Part 2: Lunar Lander Example from Scratch in Torch</span></li></ol></div><h1 id="Introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h1><p>Just as in the previous example, which used the Inverted Pendulum environment, we will be using an environment from OpenAI Gym: this time, Lunar Lander. The goal of this example is to implement the Soft Actor Critic (SAC) algorithm from scratch using PyTorch. The SAC algorithm is a model-free, off-policy actor-critic algorithm that uses a stochastic policy and a value function to learn optimal policies in continuous action spaces.<br><a href="/2025/02/17/soft-actor-critic-inverted-pendulum-v0/">Like in the Inverted Pendulum example</a>, I will be using notation that matches the original paper (Haarnoja et al., 2018) and the code will be structured in a similar way. The main difference is the environment; the algorithm itself is the same.<br>Since the paper’s notation is critical to the understanding of the code, I highly recommend reading that alongside (or before) diving into the code.<br><a href="/2025/02/17/soft-actor-critic-inverted-pendulum-v0/">Part 1 of this series</a> provides extensive details linking the theory to the code. In this part, we will focus on the implementation of the SAC algorithm in PyTorch for Lunar Lander. 
</p><p><a href="https://github.com/FranciscoRMendes/soft-actor-critic/blob/main/lunar-lander/LL_main_sac.py">https://github.com/FranciscoRMendes/soft-actor-critic/blob/main/lunar-lander/LL_main_sac.py</a></p><h1 id="Example-Data"><a href="#Example-Data" class="headerlink" title="Example Data"></a>Example Data</h1><p><img src="/2025/02/28/soft-actor-critic-lunar-lander/lunar_lander_state_vector.png" alt="Lunar Lander State Vector"></p><html><head>    <style>        table {            border-collapse: collapse;            width: 100%;        }        th, td {            border: 1px solid black;            padding: 5px;            text-align: center;        }        .action { background-color: #ffcccc; } /* Light Red */        .reward { background-color: #ccffcc; } /* Light Green */        .state { background-color: #ccccff; } /* Light Blue */        .done { background-color: #ffffcc; } /* Light Yellow */        .next-state { background-color: #ffccff; } /* Light Pink */    </style></head><body>    <table>        <tr>            <th class="action" colspan="2">Action</th>            <th class="reward">Reward</th>            <th class="state" colspan="8">State</th>            <th class="done">Done</th>            <th class="next-state" colspan="8">Next State</th>        </tr>        <tr>            <th class="action">Main</th>            <th class="action">Lateral</th>            <th class="reward"></th>            <th class="state">x</th>            <th class="state">y</th>            <th class="state">v_x</th>            <th class="state">v_y</th>            <th class="state">angle</th>            <th class="state">angular velocity</th>            <th class="state">left contact</th>            <th class="state">right contact</th>            <th class="done"></th>            <th class="next-state">x</th>            <th class="next-state">y</th>            <th class="next-state">v_x</th>            <th class="next-state">v_y</th>            <th 
class="next-state">angle</th>            <th class="next-state">angular velocity</th>            <th class="next-state">left contact</th>            <th class="next-state">right contact</th>        </tr>        <tr>            <td class="action">0.66336113</td>            <td class="action">-0.485024</td>            <td class="reward">-1.56</td>            <td class="state">0.00716772</td>            <td class="state">1.4093536</td>            <td class="state">0.7259957</td>            <td class="state">-0.06963848</td>            <td class="state">-0.0082988</td>            <td class="state">-0.16444895</td>            <td class="state">0</td>            <td class="state">0</td>            <td class="done">False</td>            <td class="next-state">0.01442766</td>            <td class="next-state">1.4081073</td>            <td class="next-state">0.73378086</td>            <td class="next-state">-0.05545701</td>            <td class="next-state">-0.01600615</td>            <td class="next-state">-0.15416077</td>            <td class="next-state">0</td>            <td class="next-state">0</td>        </tr>        <tr>            <td class="action">0.87302077</td>            <td class="action">0.8565877</td>            <td class="reward">-2.85810149</td>            <td class="state">0.01442766</td>            <td class="state">1.4081073</td>            <td class="state">0.73378086</td>            <td class="state">-0.05545701</td>            <td class="state">-0.01600615</td>            <td class="state">-0.15416077</td>            <td class="state">0</td>            <td class="state">0</td>            <td class="done">False</td>            <td class="next-state">0.02185297</td>            <td class="next-state">1.4071543</td>            <td class="next-state">0.7518369</td>            <td class="next-state">-0.04247425</td>            <td class="next-state">-0.02521554</td>            <td class="next-state">-0.18420467</td>            <td 
class="next-state">0</td>            <td class="next-state">0</td>        </tr>        <tr>            <td class="action">0.4880578</td>            <td class="action">0.18216014</td>            <td class="reward">-2.248854395</td>            <td class="state">0.02185297</td>            <td class="state">1.4071543</td>            <td class="state">0.7518369</td>            <td class="state">-0.04247425</td>            <td class="state">-0.02521554</td>            <td class="state">-0.18420467</td>            <td class="state">0</td>            <td class="state">0</td>            <td class="done">False</td>            <td class="next-state">0.02941189</td>            <td class="next-state">1.4065428</td>            <td class="next-state">0.7646336</td>            <td class="next-state">-0.02735517</td>            <td class="next-state">-0.03385869</td>            <td class="next-state">-0.17287907</td>            <td class="next-state">0</td>            <td class="next-state">0</td>        </tr>        <tr>            <td class="action">0.0541396</td>            <td class="action">-0.70224154</td>            <td class="reward">-0.765160122</td>            <td class="state">0.02941189</td>            <td class="state">1.4065428</td>            <td class="state">0.7646336</td>            <td class="state">-0.02735517</td>            <td class="state">-0.03385869</td>            <td class="state">-0.17287907</td>            <td class="state">0</td>            <td class="state">0</td>            <td class="done">False</td>            <td class="next-state">0.03697386</td>            <td class="next-state">1.4056652</td>            <td class="next-state">0.7634756</td>            <td class="next-state">-0.03918146</td>            <td class="next-state">-0.04105976</td>            <td class="next-state">-0.14403483</td>            <td class="next-state">0</td>            <td class="next-state">0</td>        </tr>    </table></body></html><h1 
id="Lunar-Lander-Dataset-Explanation"><a href="#Lunar-Lander-Dataset-Explanation" class="headerlink" title="Lunar Lander Dataset Explanation"></a>Lunar Lander Dataset Explanation</h1><p>This dataset captures the experience of an agent in the <strong>Lunar Lander</strong> environment from OpenAI Gym. Each row represents a single <strong>transition</strong> (state, action, reward, next state) in the environment.</p><h1 id="Environment-Details"><a href="#Environment-Details" class="headerlink" title="Environment Details"></a>Environment Details</h1><ol><li><p><strong>Action</strong></p><ul><li><code>Main Engine</code>: The thrust applied to the main engine.</li><li><code>Lateral Thruster</code>: The thrust applied to the left&#x2F;right thrusters.</li></ul></li><li><p><strong>Reward</strong></p><ul><li>The reward received in this step. It is based on:<ul><li>Proximity to the landing pad.</li><li>Smoothness of the landing.</li><li>Fuel consumption.</li><li>Avoiding crashes.</li></ul></li></ul></li><li><p><strong>State</strong></p><ul><li><code>x, y</code>: Position coordinates.</li><li><code>v_x, v_y</code>: Velocity components.</li><li><code>theta</code>: The lander’s rotation angle.</li><li><code>omega</code>: The rate of change of the angle.</li><li><code>left contact, right contact</code>: Binary indicators (0 or 1) showing whether the lander has made contact with the ground.</li></ul></li><li><p><strong>Done</strong></p><ul><li><code>True</code>: The episode has ended (either successful landing or crash).</li><li><code>False</code>: The episode is still ongoing.</li></ul></li><li><p><strong>Next State</strong></p><ul><li>The same attributes as <strong>State</strong>, but after the action has been applied.</li></ul></li></ol><h1 id="Sample-Game-Play"><a href="#Sample-Game-Play" class="headerlink" title="Sample Game Play"></a>Sample Game Play</h1><p><img src="https://gymnasium.farama.org/_images/lunar_lander.gif" alt="Sample game play from the OpenAI 
website"></p><h1 id="Game-play-500-games"><a href="#Game-play-500-games" class="headerlink" title="Game play 500 games"></a>Game play 500 games</h1><iframe width="560" height="315" src="https://www.youtube.com/embed/pSSxC84vXCw?si=VFDUhuxb4C8jn8Be" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe><h1 id="Game-play-500k-games"><a href="#Game-play-500k-games" class="headerlink" title="Game play 500k games"></a>Game play 500k games</h1><iframe width="560" height="315" src="https://www.youtube.com/embed/HHmulIyuHGc?si=OnObtwo8VqmsdaKp" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>]]>
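Each row of the transition table above is exactly what an off-policy method like SAC stores in its replay buffer. As an illustrative sketch (the class and method names below are hypothetical, not taken from the linked repository), collecting and sampling Lunar Lander transitions might look like:

```python
import random
from collections import deque

# Hypothetical minimal replay buffer (illustrative; not the repo's implementation).
# Each entry mirrors one row of the table above:
# (state, action, reward, next_state, done), with an 8-dim state
# [x, y, v_x, v_y, angle, angular velocity, left contact, right contact]
# and a 2-dim action [main engine, lateral thruster].
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive transitions, which off-policy training relies on.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

buf = ReplayBuffer()
# First two rows of the example table
buf.store([0.00716772, 1.4093536, 0.7259957, -0.06963848,
           -0.0082988, -0.16444895, 0, 0],
          [0.66336113, -0.485024], -1.56,
          [0.01442766, 1.4081073, 0.73378086, -0.05545701,
           -0.01600615, -0.15416077, 0, 0], False)
buf.store([0.01442766, 1.4081073, 0.73378086, -0.05545701,
           -0.01600615, -0.15416077, 0, 0],
          [0.87302077, 0.8565877], -2.85810149,
          [0.02185297, 1.4071543, 0.7518369, -0.04247425,
           -0.02521554, -0.18420467, 0, 0], False)
states, actions, rewards, next_states, dones = buf.sample(batch_size=2)
```

In a full training loop, minibatches sampled this way would feed the critic and actor updates.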
    </content>
    <id>https://franciscormendes.github.io/2025/02/28/soft-actor-critic-lunar-lander/</id>
    <link href="https://franciscormendes.github.io/2025/02/28/soft-actor-critic-lunar-lander/"/>
    <published>2025-02-28T00:00:00.000Z</published>
    <summary>From-scratch SAC in PyTorch applied to Lunar Lander: extends the Inverted Pendulum implementation to a harder task with sparse rewards and a 2D continuous action space.</summary>
    <title>Soft Actor Critic (Visualized) Part 2: Lunar Lander Example from Scratch in Torch</title>
    <updated>2026-04-10T14:24:00.564Z</updated>
  </entry>
  <entry>
    <author>
      <name>Francisco Romaldo Fernandes Mendes</name>
    </author>
    <category term="machine-learning" scheme="https://franciscormendes.github.io/categories/machine-learning/"/>
    <category term="machine-learning" scheme="https://franciscormendes.github.io/tags/machine-learning/"/>
    <category term="reinforcement-learning" scheme="https://franciscormendes.github.io/tags/reinforcement-learning/"/>
    <category term="neural-networks" scheme="https://franciscormendes.github.io/tags/neural-networks/"/>
    <content>
<![CDATA[<div class="series-box">  <div class="series-label">Series</div>  <div class="series-name">Soft Actor-Critic: Reinforcement Learning from Scratch</div>  <ol class="series-list"><li class="series-item series-current"><span>Soft Actor Critic (Visualized) : From Scratch in Torch for Inverted Pendulum</span></li><li class="series-item"><a href="/2025/02/28/soft-actor-critic-lunar-lander/">Soft Actor Critic (Visualized) Part 2: Lunar Lander Example from Scratch in Torch</a></li></ol></div><h1 id="Introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h1><p>In this post, I will implement the Soft Actor Critic (SAC) algorithm from scratch in PyTorch. I will use the OpenAI Gym environment for the Inverted Pendulum task.<br>The goal of this post is to provide a Torch code follow-along for the original paper by Haarnoja et al. (2018) [1]. Many implementations of Soft Actor Critic exist; in this code we implement the one outlined in the paper.<br>You can follow along by starting from <code>main_sac.py</code> at the following link:<br><a href="https://github.com/FranciscoRMendes/soft-actor-critic">https://github.com/FranciscoRMendes/soft-actor-critic</a></p><h1 id="Inverted-Pendulum-v0-Environment-Set-Up"><a href="#Inverted-Pendulum-v0-Environment-Set-Up" class="headerlink" title="Inverted Pendulum v0 Environment Set Up"></a>Inverted Pendulum v0 Environment Set Up</h1><h2 id="Environment-Set-Up"><a href="#Environment-Set-Up" class="headerlink" title="Environment Set Up"></a>Environment Set Up</h2><p>Link to the environment here: <a href="https://github.com/bulletphysics/bullet3/blob/master/examples/pybullet/gym/pybullet_envs/gym_pendulum_envs.py">https://github.com/bulletphysics/bullet3/blob/master/examples/pybullet/gym/pybullet_envs/gym_pendulum_envs.py</a></p><h2 id="Example-Data"><a href="#Example-Data" class="headerlink" title="Example Data"></a>Example Data</h2><p>The data from playing the game looks something like 
this, with each instant of game play denoted by a row. Note that this data is sampled from many different games, so it is not ordered as if coming from one game.<br>The primes in the column names denote the next state; for example, Position’ is the position at the next time step.</p><table><thead><tr><th>Position</th><th>Velocity</th><th>Cos Pole Angle</th><th>Sine Pole Angle</th><th>Pole Angle</th><th>Time Step</th><th>Force L&#x2F;R</th><th>Position’</th><th>Velocity’</th><th>Cos Pole Angle’</th><th>Sine Pole Angle’</th><th>Pole Angle’</th><th>Done</th></tr></thead><tbody><tr><td>0.0002</td><td>0.0085</td><td>0.9974</td><td>-0.0722</td><td>-0.0647</td><td>1</td><td>0.0137</td><td>0.0004</td><td>0.0133</td><td>0.9973</td><td>-0.0738</td><td>-0.0985</td><td>FALSE</td></tr><tr><td>0.0174</td><td>0.0954</td><td>0.9964</td><td>-0.0842</td><td>-0.4624</td><td>1</td><td>0.0389</td><td>0.0191</td><td>0.1039</td><td>0.9957</td><td>-0.0926</td><td>-0.5079</td><td>FALSE</td></tr><tr><td>0.0031</td><td>0.0427</td><td>0.9969</td><td>-0.0785</td><td>-0.2768</td><td>1</td><td>0.0290</td><td>0.0040</td><td>0.0497</td><td>0.9965</td><td>-0.0837</td><td>-0.3173</td><td>FALSE</td></tr><tr><td>0.0046</td><td>0.0540</td><td>0.9965</td><td>-0.0840</td><td>-0.3380</td><td>1</td><td>0.0327</td><td>0.0056</td><td>0.0617</td><td>0.9959</td><td>-0.0902</td><td>-0.3818</td><td>FALSE</td></tr><tr><td>0.0008</td><td>0.0195</td><td>0.9967</td><td>-0.0813</td><td>-0.1428</td><td>1</td><td>0.0203</td><td>0.0012</td><td>0.0255</td><td>0.9964</td><td>-0.0843</td><td>-0.1822</td><td>FALSE</td></tr><tr><td>0.0071</td><td>0.0438</td><td>0.9994</td><td>-0.0359</td><td>-0.1959</td><td>1</td><td>0.0196</td><td>0.0079</td><td>0.0478</td><td>0.9992</td><td>-0.0395</td><td>-0.2158</td><td>FALSE</td></tr><tr><td>0.0133</td><td>0.1056</td><td>0.9928</td><td>-0.1194</td><td>-0.6067</td><td>1</td><td>0.0512</td><td>0.0153</td><td>0.1171</td><td>0.9915</td><td>-0.1304</td><td>-0.6702</td><td>FALSE</td></tr></tbody></t
able><h2 id="State-Description-in-InvertedPendulumBulletEnv-v0"><a href="#State-Description-in-InvertedPendulumBulletEnv-v0" class="headerlink" title="State Description in InvertedPendulumBulletEnv-v0"></a>State Description in <code>InvertedPendulumBulletEnv-v0</code></h2><ol><li><strong>Cart Position</strong> – The horizontal position of the cart.  </li><li><strong>Cart Velocity</strong> – The speed of the cart.  </li><li><strong>Cosine of Pendulum Angle</strong> – $\cos(\theta)$, where $\theta$ is the angle relative to the vertical. It equals 1 when upright and decreases as it tilts.  </li><li><strong>Sine of Pendulum Angle</strong> – $\sin(\theta)$ complements $\cos(\theta)$, providing a full representation of the angle.  </li><li><strong>Pendulum Angular Velocity</strong> – The rate of change of $\theta$.</li></ol><h2 id="Action"><a href="#Action" class="headerlink" title="Action"></a>Action</h2><p>The action space is continuous and consists of a single action that can be applied to the cart. The action is a force that can be applied to the cart in the left or right direction. The force can be any value between $-1$ and $1$.</p><h2 id="Reward-Termination"><a href="#Reward-Termination" class="headerlink" title="Reward &amp; Termination"></a>Reward &amp; Termination</h2><p>The reward is $1$ for every time step the pole is upright. 
The episode ends (Done is <code>TRUE</code>) when the pole is more than $15$ degrees from the vertical axis or the cart moves more than $2.4$ units from the center.</p><h2 id="Game-play-GIF"><a href="#Game-play-GIF" class="headerlink" title="Game play GIF"></a>Game play GIF</h2><p>An example of game play looks like this; not the most exciting thing in the world, I know.</p><p><img src="https://mgoulao.github.io/gym-docs/_images/inverted_pendulum.gif" alt="Example Game Play"></p><h1 id="The-Neural-Networks-in-Soft-Actor-Critic-Network"><a href="#The-Neural-Networks-in-Soft-Actor-Critic-Network" class="headerlink" title="The Neural Networks in Soft Actor Critic Network"></a>The Neural Networks in Soft Actor Critic Network</h1><p>The Lucid chart below encapsulates the major neural networks in the code and their relationships. Forward relationships (i.e. the forward pass) are shown as solid arrows, while backward relationships (i.e. backpropagation) are shown as dashed arrows.<br>I recommend using this chart to keep track of which outputs train which networks. Note, however, that these backward arrows indicate only that <em>some</em> relationship exists. There are differences in the backpropagation used to train the policy network itself (which uses the reparameterization trick) and the Value networks (which do not).</p><div style="width: 640px; height: 480px; margin: 10px; position: relative;"><iframe allowfullscreen frameborder="0" style="width:640px; height:480px" src="https://lucid.app/documents/embedded/68197b45-adf1-477b-a3ad-68d468196d7b" id="QO7TleQdXSdp"></iframe></div><p>The main object in the code is the <code>SoftActorCritic</code> class, defined in <code>SoftActorCritic.py</code>. It consists of the neural networks and all the hyperparameters that potentially need tuning. As per the paper, the most important one is the reward scale, a hyperparameter that balances the explore-exploit tradeoff. Higher values of the reward scale make the agent exploit more. 
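Before opening up the class, it helps to see how an agent object like this is driven from the outside. The sketch below is a minimal, dependency-free training loop with stand-in environment and agent classes (all names and bodies here are illustrative, not taken from the repo); the real <code>main_sac.py</code> drives <code>SoftActorCritic</code> through the same observe, act, remember, learn cycle.

```python
import random

class DummyEnv:
    """Gym-style stand-in: 5-dim state, scalar action in [-1, 1]."""
    def reset(self):
        self.t = 0
        return [0.0] * 5
    def step(self, action):
        self.t += 1
        next_state = [random.uniform(-0.1, 0.1) for _ in range(5)]
        reward = 1.0               # +1 per step, as in the pendulum env
        done = self.t >= 20        # stand-in termination rule
        return next_state, reward, done

class DummyAgent:
    """Stand-in exposing the same surface as the SAC agent."""
    def choose_action(self, state):
        return random.uniform(-1.0, 1.0)   # the policy network goes here
    def remember(self, s, a, r, s2, done):
        pass                               # replay-buffer write
    def learn(self):
        pass                               # the SAC update steps

env, agent = DummyEnv(), DummyAgent()
scores = []
for episode in range(3):
    state, done, score = env.reset(), False, 0.0
    while not done:
        action = agent.choose_action(state)
        next_state, reward, done = env.step(action)
        agent.remember(state, action, reward, next_state, done)
        agent.learn()                      # SAC learns after every step
        state = next_state
        score += reward
    scores.append(score)
print(scores)   # three episodes of length 20 -> [20.0, 20.0, 20.0]
```

Swapping in the real environment and agent preserves this loop; note that SAC performs a learn step after every environment step rather than once per episode.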
</p><p>This class contains the following neural networks; their relationships are illustrated in the Lucid chart above:</p><ol><li><code>self.pi_phi</code>: The actor network, which outputs the action given the state. In the paper this is denoted by the function $\pi_\phi(a_t|s_t)$, where $\pi$ is the policy, $\phi$ are the parameters of the policy, $a_t$ is the action at time $t$, and $s_t$ is the state at time $t$. This neural network takes in the state vector (in this case the $5$ dimensional state vector) and can output two things: <ul><li>action $a_t$ : a continuous vector of size $1$ to take in the environment (no re-parameterization trick)</li><li>The mean and standard deviation of the action to take in the environment, $\mu$ and $\sigma$ respectively (re-parameterization trick)</li></ul></li><li><code>self.Q_theta_1</code> : The first Q-network, also known as the critic network. It takes in the state and action as input and outputs the Q-value. In the paper this is denoted by the function $Q_{\theta_1}(s_t, a_t)$, where $Q$ is the Q-function, $\theta_1$ are the parameters of the first Q-network, $s_t$ is the state at time $t$, and $a_t$ is the action at time $t$.</li><li><code>self.Q_theta_2</code> : The second Q-network, identical in structure to the first but with its own parameters $\theta_2$. In the paper it is denoted by $Q_{\theta_2}(s_t, a_t)$.</li><li><code>self.V_psi</code> : The Value network parameterized by $\psi$ in the paper. It takes in the state as input and outputs the value of the state. 
In the paper this is denoted by the function $V_\psi(s_t)$, where $V$ is the value function, $\psi$ are the parameters of the value network, and $s_t$ is the state at time $t$.</li><li><code>self.V_psi_bar</code> : The target value network, parameterized by $\bar{\psi}$ in the paper. It takes in the state as input and outputs the value of the state. In the paper this is denoted by the function $V_{\bar{\psi}}(s_t)$, where $V$ is the value function, $\bar{\psi}$ are the parameters of the target value network, and $s_t$ is the state at time $t$.</li></ol><p>A couple of things to watch out for; these networks behave quite differently from the usual classification setting:</p><ol><li>The forward pass and inference (i.e. using the SoftActorCritic network) are different. In the forward pass you are still using outputs to improve the policy network so that it plays better; to play the game, however, you only ever need the policy network. In the classification case, the forward pass and inference are the same, and hence the terms are used interchangeably. </li><li>The backward dashed arrows for backpropagation are important because it is not always clear what the “target” for training one of these networks is. The “target” is often formed from a combination of outputs from different networks and the rewards. 
</li><li>The top row of nodes, States, Actions, Rewards and Next States are the “data” on which the neural networks are to be trained.</li></ol><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">SoftActorCritic</span>:</span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">__init__</span>(<span class="params">self, alpha=<span class="number">0.0003</span>, beta=<span class="number">0.0003</span>, input_dims=[<span class="number">8</span>],</span></span><br><span class="line"><span class="params">                 env=<span class="literal">None</span>, gamma=<span class="number">0.99</span>, n_actions=<span class="number">2</span>, max_size=<span class="number">1000000</span>, tau=<span class="number">0.005</span>, batch_size=<span class="number">256</span>, reward_scale=<span class="number">2</span></span>):</span><br><span class="line">        self.gamma = gamma</span><br><span class="line">        self.tau = tau</span><br><span class="line">        self.memory = ReplayBuffer(max_size, input_dims, n_actions)</span><br><span class="line">        self.batch_size = batch_size</span><br><span class="line">        self.n_actions = n_actions</span><br><span class="line">        self.pi_phi = ActorNetwork(alpha, input_dims, n_actions=n_actions, name=<span class="string">&#x27;actor&#x27;</span>, max_action=env.action_space.high) <span 
class="comment"># 1</span></span><br><span class="line">        self.Q_theta_1 = CriticNetwork(beta, input_dims, n_actions=n_actions, name=<span class="string">&#x27;critic_1&#x27;</span>)</span><br><span class="line">        self.Q_theta_2 = CriticNetwork(beta, input_dims, n_actions=n_actions, name=<span class="string">&#x27;critic_2&#x27;</span>)</span><br><span class="line">        self.V_psi = ValueNetwork(beta, input_dims, name=<span class="string">&#x27;value&#x27;</span>)</span><br><span class="line">        self.V_psi_bar = ValueNetwork(beta, input_dims, name=<span class="string">&#x27;target_value&#x27;</span>)</span><br><span class="line">        self.scale = reward_scale <span class="comment"># You will find this in the ablation study section of the paper; this balances the explore/exploit tradeoff</span></span><br><span class="line">        self.update_psi_bar_using_psi(tau=<span class="number">1</span>)</span><br></pre></td></tr></table></figure><h1 id="Learning-in-SAC"><a href="#Learning-in-SAC" class="headerlink" title="Learning in SAC"></a>Learning in SAC</h1><p>The learning in the model is handled by the learn function. This function takes in the batch of data from the replay buffer and updates the parameters of the networks. The learning is done in the following steps:</p><ol><li>Sample a batch of data from the replay buffer. If there is not enough data, i.e. fewer samples than the batch size, return.</li><li>Optimize the Value Network using the soft Bellman equation (equation $5$)</li><li>Optimize the Policy Network using the policy gradient (equation $12$)</li><li>Optimize the Q Networks using the Bellman equation (equation $7$)</li></ol><p>A couple of asides here, </p><ol><li>The words network and function can be used interchangeably. The neural network serves as a function approximator for the functions we are trying to learn (Value, Q, Policy).</li><li>The Value Networks and Policy Networks are dependent on the current state of the Q network. 
Only after these are updated can we update the Q network.</li><li>All loss functions are denoted by $J_{\text{network we are trying to optimize}}$ in the paper. The subscript denotes the network that is being optimized. For example, $J_{\psi}$ is the loss function for the Value Network, $J_{\phi}$ is the loss function for the Policy Network, and $J_{\theta}$ is the loss function for the Q Network.</li><li>The Target Network is simply a lagged duplicate of the current Value Network. Thus, it does not actually ever “learn” but simply updates its weights through a weighted average of the latest weights from the value network and its own weights; the weighting is given by the parameter $\tau$ in the code. This is done to stabilize the learning process. </li><li>Variable names can be read as one would read the variable in the paper; for instance, $V_{\bar{\psi}}(s_{t+1})$ is given by <code>V_psi_bar_s_t_plus_1</code>. It is unfortunate that Python does not allow for more scientific notation, but this is the best I could do.</li></ol><h1 id="Re-parameterization-Trick"><a href="#Re-parameterization-Trick" class="headerlink" title="Re-parameterization Trick"></a>Re-parameterization Trick</h1><p>This is one of the most confusing things to implement in Python. <strong>You can skip this section if you are just starting out</strong>, but its use will become clear later. I am adding the details here for completeness. </p><p>The main problem we are trying to solve here is that Torch requires a computational graph to perform backpropagation of the gradients. <code>rsample()</code> preserves the graph information whereas <code>sample()</code> does not. This is because <code>rsample()</code> uses the reparameterization trick to sample from the distribution. The reparameterization trick is a way to sample from a distribution while preserving the gradient information. It is done by expressing the random variable as a deterministic function of a parameter and a noise variable. 
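To see why fixing the noise makes the sample differentiable, here is a dependency-free numerical sketch (illustrative code, not the actual internals of <code>rsample()</code>): the draw is rewritten as <code>mu + sigma * eps</code> with <code>eps</code> sampled once and held fixed, and finite differences confirm that the gradients with respect to <code>mu</code> and <code>sigma</code> are well defined.

```python
import random

# Reparameterization trick: a draw from N(mu, sigma^2) rewritten as a
# deterministic, differentiable function of (mu, sigma) plus fixed noise.
def reparameterized_sample(mu, sigma, eps):
    return mu + sigma * eps

random.seed(0)
eps = random.gauss(0.0, 1.0)   # sampled once, then held fixed

# Since eps is fixed, gradients flow through mu and sigma:
# d(sample)/d(mu) = 1 and d(sample)/d(sigma) = eps.
mu, sigma, h = 0.5, 2.0, 1e-6
grad_mu = (reparameterized_sample(mu + h, sigma, eps)
           - reparameterized_sample(mu - h, sigma, eps)) / (2 * h)
grad_sigma = (reparameterized_sample(mu, sigma + h, eps)
              - reparameterized_sample(mu, sigma - h, eps)) / (2 * h)
assert abs(grad_mu - 1.0) < 1e-4
assert abs(grad_sigma - eps) < 1e-4
```

A naive `sample()`-style draw would regenerate `eps` on every call, so the output would not be a fixed function of `mu` and `sigma` and no gradient could be defined.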
In this case, we are using the reparameterization trick to sample from the normal distribution. The normal distribution is parameterized by its mean and standard deviation. We can express the random variable as a deterministic function of the mean, standard deviation, and a noise variable. This allows us to sample from the distribution while preserving the gradient information. </p><ol><li><code>sample()</code>: Performs random sampling, cutting off the computation graph (i.e., no backpropagation). It uses <code>torch.normal</code> within <code>torch.no_grad()</code>, ensuring the result is detached.</li><li><code>rsample()</code>: Enables backpropagation using the reparameterization trick, separating the randomness into an independent variable (<code>eps</code>). The computation graph remains intact because the transformation (<code>loc + eps * scale</code>) is differentiable.</li></ol><p><strong>Key Idea</strong>: <code>eps</code> is sampled once and remains fixed, while <code>loc</code> and <code>scale</code> change during optimization, allowing gradients to flow. This is used in algorithms like SAC (Soft Actor-Critic) for reinforcement learning.<br>If you sample values both ways and plot their distributions, they will be identical (or as identical as two samples drawn from the same distribution can be).</p><p>A good explanation can be found here: <a href="https://stackoverflow.com/questions/60533150/what-is-the-difference-between-sample-and-rsample">https://stackoverflow.com/questions/60533150/what-is-the-difference-between-sample-and-rsample</a></p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">sample_normal</span>(<span class="params">self, state, reparameterize=<span class="literal">True</span></span>):</span><br><span class="line">    mu, sigma = self.forward(state)</span><br><span class="line">    probabilities = Normal(mu, sigma)</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> reparameterize:</span><br><span class="line">        actions = probabilities.rsample()</span><br><span class="line">    <span class="keyword">else</span>:</span><br><span class="line">        actions = probabilities.sample()</span><br><span class="line"></span><br><span class="line">    action = T.tanh(actions)*T.tensor(self.max_action).to(self.device)</span><br><span class="line">    log_probs = probabilities.log_prob(actions)</span><br><span class="line">    log_probs -= T.log(<span class="number">1</span>-action.<span class="built_in">pow</span>(<span class="number">2</span>)+self.reparam_noise)</span><br><span class="line">    log_probs = log_probs.<span class="built_in">sum</span>(<span class="number">1</span>, keepdim=<span class="literal">True</span>)</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> action, log_probs</span><br></pre></td></tr></table></figure><h1 id="Learning-the-Value-Function"><a href="#Learning-the-Value-Function" class="headerlink" title="Learning the Value Function"></a>Learning the Value Function</h1><p>With all the caveats and fine print out of the way, we can begin the learn function.<br>Here we take a sample of data from the replay buffer. Recall that we need a random sample, and not just the most recent values, because the data is not i.i.d. and we need to break the correlation between consecutive data points. 
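The <code>ReplayBuffer</code> class itself lives in the repo; as a sketch of what it does, here is a minimal uniform-sampling buffer (plain Python lists for clarity — an illustrative sketch, not the repo's implementation, which differs in storage details):

```python
import random

class ReplayBuffer:
    """Minimal uniform-sampling replay buffer (illustrative sketch)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.storage = []
        self.ptr = 0  # index of the oldest entry once the buffer is full

    def store_transition(self, state, action, reward, next_state, done):
        transition = (state, action, reward, next_state, done)
        if len(self.storage) < self.max_size:
            self.storage.append(transition)
        else:
            self.storage[self.ptr] = transition  # overwrite the oldest
        self.ptr = (self.ptr + 1) % self.max_size

    def sample_buffer(self, batch_size):
        # A uniform random sample, not the latest slice: consecutive
        # transitions are highly correlated, and random draws break that.
        return random.sample(self.storage, batch_size)

buf = ReplayBuffer(max_size=100)
for t in range(150):                       # overfill to exercise wraparound
    buf.store_transition([float(t)] * 5, 0.0, 1.0, [float(t + 1)] * 5, False)
batch = buf.sample_buffer(32)
print(len(buf.storage), len(batch))        # capped at max_size -> 100 32
```

The fixed capacity with overwrite-oldest behaviour is why old, stale transitions eventually age out of training.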
</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">sample = self.memory.sample_buffer(self.batch_size)</span><br><span class="line">s_t, a_t_rb, r_t, s_t_plus_1, done = self.process_sample(sample, self.pi_phi.device)</span><br></pre></td></tr></table></figure><p>Let us first state the loss function of the value function. This is equation 5 of the Haarnoja et al. (2018) paper. </p>$$J_V(\psi) = \mathbb{E}\_{s_t \sim \mathcal{D}} \left[ \frac{1}{2} \left( V_\psi(s_t) - \mathbb{E}\_{a_t\sim\pi_{\phi}}[Q\_\theta(s_t,a_t) - \log \pi_\phi(a_t|s_t)] \right)^2 \right]$$<p>Comments: </p><ol><li>$V_\psi(s_t)$ is the output of the value function, which is just a forward pass through the value neural network, denoted by <code>self.V_psi(s_t)</code> in the code.</li><li>$V_{\bar{\psi}}(s_{t+1})$ is the output of the target value function, which is just a forward pass through the target value neural network for the next state, denoted by <code>self.V_psi_bar(s_t_plus_1)</code> in the code.</li><li>We also need the output of the Q function, which is just a forward pass through the Q neural network, denoted by <code>self.Q_theta_1.forward(s_t, a_t)</code> in the code. But since we have two Q networks, we need to take the minimum of the two. 
This is done to reduce the overestimation bias in the Q function.</li></ol><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br></pre></td><td class="code"><pre><span class="line">V_psi_s_t = self.V_psi(s_t).view(-<span class="number">1</span>)</span><br><span class="line"></span><br><span class="line">V_psi_bar_s_t_plus_1 = self.V_psi_bar(s_t_plus_1).view(-<span class="number">1</span>)</span><br><span class="line"></span><br><span class="line">V_psi_bar_s_t_plus_1[done] = <span class="number">0.0</span></span><br><span class="line"></span><br><span class="line">a_t_D, log_pi_t_D = self.pi_phi.sample_normal(s_t, reparameterize=<span class="literal">False</span>) <span class="comment"># here we are not using the reparameterization trick because we are not backpropagating through the policy network</span></span><br><span class="line"></span><br><span class="line">log_pi_t_D = log_pi_t_D.view(-<span class="number">1</span>)</span><br><span class="line"></span><br><span class="line"><span class="comment"># Find the value of the Q function for the current state and action, since we have two networks we take the minimum of the two</span></span><br><span class="line">Q_theta_1_s_t_a_t_D = self.Q_theta_1.forward(s_t, a_t_D)</span><br><span 
class="line">Q_theta_2_s_t_a_t_D = self.Q_theta_2.forward(s_t, a_t_D)</span><br><span class="line">Q_theta_min_s_t_a_t_D = T.<span class="built_in">min</span>(Q_theta_1_s_t_a_t_D, Q_theta_2_s_t_a_t_D)</span><br><span class="line"><span class="comment"># This is the Q value to be used in equation 5</span></span><br><span class="line">Q_theta_min_s_t_a_t_D = Q_theta_min_s_t_a_t_D.view(-<span class="number">1</span>)</span><br><span class="line"></span><br><span class="line">self.V_psi.optimizer.zero_grad()</span><br><span class="line"><span class="comment"># This is exactly equation 5</span></span><br><span class="line">J_V_psi = <span class="number">0.5</span> * F.mse_loss(V_psi_s_t, Q_theta_min_s_t_a_t_D - log_pi_t_D)</span><br><span class="line">J_V_psi.backward(retain_graph=<span class="literal">True</span>) <span class="comment"># again, we don&#x27;t need to backpropagate through the policy network</span></span><br><span class="line">self.V_psi.optimizer.step() <span class="comment"># Update the value network</span></span><br><span class="line"></span><br></pre></td></tr></table></figure><h1 id="Learning-the-Policy-Function"><a href="#Learning-the-Policy-Function" class="headerlink" title="Learning the Policy Function"></a>Learning the Policy Function</h1><p>The policy function is learned using the policy gradient. This is equation 12 of the Haarnoja et al. (2018) paper.</p>$$J_{\pi}(\phi)= \mathbb{E}\_{s_t\sim \mathcal{D}, \epsilon_t\sim \mathcal{N}} \left[\log \pi\_{\phi}(f_{\phi}(\epsilon_t;s_t)|s_t) - Q_\theta(s_t,f_{\phi}(\epsilon_t;s_t))\right]$$<p>The expectation means that we can approximate it with the mean of the observed values.<br>To perform the optimization on the policy network, we need to do two things to get a prediction: </p><ol><li>Perform a forward pass through the network to get $\mu$ and $\sigma$.</li><li>Sample an action from the policy network using the reparameterization trick. 
This ensures that the computational graph is preserved and we can backpropagate through the policy network; this was not true in the previous case.<br>Here it may seem like the values for $Q_\theta(s_t,a_t)$ and $\log \pi_\phi(a_t|s_t)$ are the same as the ones we used for the value function. This is not the case: because the policy is stochastic, we need to sample a fresh action from the policy network, this time using the reparameterization trick, and use that action to compute the Q value and log probability.</li></ol><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># a_t_D refers to actions drawn from a sample of the actor network and not the true actions taken from the replay buffer</span></span><br><span class="line">a_t_D, log_pi_t_D = self.pi_phi.sample_normal(s_t, reparameterize=<span class="literal">True</span>) <span class="comment"># here we are using the reparameterization trick because we are backpropagating through the policy network</span></span><br><span class="line">log_pi_t_D = log_pi_t_D.view(-<span class="number">1</span>)</span><br><span class="line">Q_theta_1_s_t_a_t_D = self.Q_theta_1.forward(s_t, a_t_D)</span><br><span class="line">Q_theta_2_s_t_a_t_D = self.Q_theta_2.forward(s_t, a_t_D)</span><br><span class="line">Q_theta_min_s_t_a_t_D = T.<span class="built_in">min</span>(Q_theta_1_s_t_a_t_D, Q_theta_2_s_t_a_t_D)</span><br><span class="line">Q_theta_min_s_t_a_t_D = Q_theta_min_s_t_a_t_D.view(-<span class="number">1</span>)</span><br><span class="line"></span><br><span class="line"><span class="comment"># This is equation 12 in the paper</span></span><br><span class="line"><span class="comment"># note that this is identical to the original loss function given by equation 10</span></span><br><span class="line"><span class="comment"># after doing the re-parameterization trick</span></span><br><span class="line">J_pi_phi = T.mean(log_pi_t_D - Q_theta_min_s_t_a_t_D)</span><br><span class="line">self.pi_phi.optimizer.zero_grad()</span><br><span class="line">J_pi_phi.backward(retain_graph=<span class="literal">True</span>)</span><br><span class="line">self.pi_phi.optimizer.step()</span><br></pre></td></tr></table></figure><h1 id="Learning-the-Q-Network"><a href="#Learning-the-Q-Network" class="headerlink" title="Learning the Q-Network"></a>Learning the Q-Network</h1><p>In this section we will optimize the critic network. This corresponds to equation 7 in the paper. </p>$$J_Q(\theta) = \mathbb{E}\_{(s_t,a_t) \sim \mathcal{D}} \left[ \frac{1}{2} \left( Q\_{\theta}(s_t, a_t) - \hat{Q}(s_t, a_t) \right)^2 \right] $$<p>Noting that, </p>$$\hat{Q}(s_t, a_t) = r_t + \gamma \mathbb{E}\_{s_{t+1}\sim p}V_{\bar{\psi}}(s_{t+1})$$<p>This is somewhat different from equations 7 and 8 in the paper,</p><ol><li>First, $r_t$ does not depend on $a_t,s_t$ in this case. This is because we are using the Inverted Pendulum environment, which gives a constant reward for each step.</li><li>Second, we drop the expectation over $s_{t+1}$ because we are using a single sample from the replay buffer for each $t$ (technically you should take the mean over multiple $s_{t+1}$, but this is a good enough approximation). </li><li>We use the actual actions taken from the replay buffer to compute the Q value. 
This is because Q-learning is off-policy: the Q function is evaluated at the actions actually taken in the environment, which come from the replay buffer. This is given by <code>a_t_rb</code> in the code. </li><li>We have two Q networks, so we need to apply this individually to both networks.</li></ol><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># In this section we will optimize the two critic networks</span></span><br><span class="line"><span class="comment"># We will use the Bellman equation to calculate the target Q value</span></span><br><span class="line">self.Q_theta_1.optimizer.zero_grad()</span><br><span class="line">self.Q_theta_2.optimizer.zero_grad()</span><br><span class="line"><span class="comment"># Equation 8 in the paper, in the paper the reward also depends on a_t</span></span><br><span class="line"><span class="comment"># but in this case we get a constant reward for each step, so we can just use r_t</span></span><br><span class="line"><span class="comment"># consequently, Q_hat_s_t AND NOT Q_hat_s_t_a_t</span></span><br><span class="line">Q_hat_s_t = self.scale*r_t + self.gamma*V_psi_bar_s_t_plus_1</span><br><span class="line">Q_theta_1_s_t_rb_at = self.Q_theta_1.forward(s_t, a_t_rb).view(-<span class="number">1</span>) <span class="comment"># this is the only place where actions from the replay buffer are used</span></span><br><span class="line">Q_theta_2_s_t_rb_at = self.Q_theta_2.forward(s_t, a_t_rb).view(-<span class="number">1</span>)</span><br><span class="line"><span class="comment"># this is equation 7 in the paper, one for each Q network</span></span><br><span class="line">J_Q_theta_1_loss = <span class="number">0.5</span> * F.mse_loss(Q_theta_1_s_t_rb_at, Q_hat_s_t)</span><br><span class="line">J_Q_theta_2_loss = <span class="number">0.5</span> * F.mse_loss(Q_theta_2_s_t_rb_at, Q_hat_s_t)</span><br><span class="line">J_Q_theta_12 = J_Q_theta_1_loss + J_Q_theta_2_loss</span><br><span class="line">J_Q_theta_12.backward()</span><br><span class="line">self.Q_theta_1.optimizer.step()</span><br><span class="line">self.Q_theta_2.optimizer.step()</span><br></pre></td></tr></table></figure><h1 id="Learning-the-target-value-network"><a href="#Learning-the-target-value-network" class="headerlink" title="Learning the target value network"></a>Learning the target value network</h1><p>The final piece of this puzzle is the target value network. There is no actual “learning” taking place in this network.<br>This network is simply a lagged duplicate of the current value network. It never actually “learns”; it simply updates its weights through a weighted average of the latest weights from the value network and its own weights, with the weighting given by the parameter $\tau$ in the code. This is done to stabilize the learning process.<br>This takes place in the line <code>self.update_psi_bar_using_psi(tau=None)</code> of the learn function.<br>The parameter $\tau$ weights the copying, with tau &#x3D; 1 being a complete copy and tau &#x3D; 0 being no copy. 
Obviously, for learning to take place we need tau &gt; 0; in practice a value of $0.005$ is typical.<br>This function corresponds to the last line in the algorithm, </p>$$\bar{\psi} \leftarrow \tau \psi + (1-\tau)\bar\psi$$<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">update_psi_bar_using_psi</span>(<span class="params">self, tau=<span class="literal">None</span></span>):</span><br><span class="line">    <span class="comment"># This function corresponds to the update step inside algorithm 1</span></span><br><span class="line">    <span class="comment"># this is the last line in the algorithm</span></span><br><span class="line">    <span class="comment"># psi_bar = tau* psi + (1-tau)*psi_bar</span></span><br><span class="line">    <span class="keyword">if</span> tau <span class="keyword">is</span> <span class="literal">None</span>:</span><br><span class="line">        tau = self.tau</span><br><span class="line"></span><br><span class="line">    psi_bar = self.V_psi_bar.named_parameters()</span><br><span class="line">    psi = self.V_psi.named_parameters()</span><br><span class="line"></span><br><span class="line">    target_value_state_dict = <span class="built_in">dict</span>(psi_bar)</span><br><span class="line">    value_state_dict = <span class="built_in">dict</span>(psi)</span><br><span class="line"></span><br><span class="line">    <span class="keyword">for</span> name <span class="keyword">in</span> value_state_dict:</span><br><span class="line">        value_state_dict[name] = tau*value_state_dict[name].clone() + (<span class="number">1</span>-tau)*target_value_state_dict[name].clone()</span><br><span class="line"></span><br><span class="line">    self.V_psi_bar.load_state_dict(value_state_dict)</span><br></pre></td></tr></table></figure><h1 id="Conclusion"><a href="#Conclusion" class="headerlink" title="Conclusion"></a>Conclusion</h1><p>This post has been a detailed walkthrough of the Soft Actor Critic algorithm, using the inverted pendulum as an example. Other implementations of this algorithm exist; the best one I have found is Phil Tabor’s implementation.<br>However, it does not draw a very tight connection between the code and the paper. This post was an attempt to bridge that gap by using notation that exactly matches the paper, while keeping the overall structure simple to understand.<br>In <a href="/2025/02/28/soft-actor-critic-lunar-lander/">my next post</a>, I will implement the Soft Actor Critic algorithm on the Lunar Lander game, which should make for a more interesting visualization of how the algorithm learns. </p><h1 id="References"><a href="#References" class="headerlink" title="References"></a>References</h1><ol><li>Haarnoja, T., Zhou, A., Abbeel, P., &amp; Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. 
arXiv preprint arXiv:1801.01290.</li><li><a href="https://github.com/philtabor/Youtube-Code-Repository/tree/master/ReinforcementLearning/PolicyGradient/SAC">https://github.com/philtabor/Youtube-Code-Repository/tree/master/ReinforcementLearning/PolicyGradient/SAC</a></li><li>Phil’s Youtube video <a href="https://www.youtube.com/watch?v=ioidsRlf79o">https://www.youtube.com/watch?v=ioidsRlf79o</a></li><li>Oliver Sigaud’s video <a href="https://www.youtube.com/watch?v=_nFXOZpo50U">https://www.youtube.com/watch?v=_nFXOZpo50U</a> (check out his channel and research for more)</li><li><a href="https://youtube.com/playlist?list=PLYpLNGpDoiMSMrvgVhgNRwOHTVYbX2lOa&si=unvWxJsJm_w4OcD-">https://youtube.com/playlist?list=PLYpLNGpDoiMSMrvgVhgNRwOHTVYbX2lOa&amp;si=unvWxJsJm_w4OcD-</a></li><li><a href="https://www.youtube.com/watch?v=kJ9CL7asR94&list=LL&index=22&t=41s">https://www.youtube.com/watch?v=kJ9CL7asR94&amp;list=LL&amp;index=22&amp;t=41s</a> (accent might be unclear, but trust me one of the best videos)</li></ol>]]>
    </content>
    <id>https://franciscormendes.github.io/2025/02/17/soft-actor-critic-inverted-pendulum-v0/</id>
    <link href="https://franciscormendes.github.io/2025/02/17/soft-actor-critic-inverted-pendulum-v0/"/>
    <published>2025-02-17T00:00:00.000Z</published>
    <summary>From-scratch PyTorch implementation of Soft Actor-Critic for the Inverted Pendulum task — entropy-regularized policy gradients, twin Q-networks, and automatic temperature tuning.</summary>
    <title>Soft Actor Critic (Visualized) : From Scratch in Torch for Inverted Pendulum</title>
    <updated>2026-04-10T14:24:00.563Z</updated>
  </entry>
  <entry>
    <author>
      <name>Francisco Romaldo Fernandes Mendes</name>
    </author>
    <category term="book-review" scheme="https://franciscormendes.github.io/categories/book-review/"/>
    <category term="book-review" scheme="https://franciscormendes.github.io/tags/book-review/"/>
    <category term="fiction" scheme="https://franciscormendes.github.io/tags/fiction/"/>
    <content>
<![CDATA[<p>An odyssey borne out of the oldest tale in the oldest book in the world, Cain and Abel. Very rarely are well-worn fables resurrected like new, but this book succeeded in telling an age-old tale of fraternal rivalry across several generations, with a far more generous view of Cain. </p><p>The characters in the book represent major moral themes: pure Biblical evil in Cathy, pure angelic goodness in Adam and Aron, and finally, human moral frailty in Charles and Cal. The characters absorb you in their machinations, their trials and their triumphs till you are finally hanging on to every page. Yet, this book is no railway-station page-turner; it draws you in with the sheer weight of its storytelling. The sheer beauty of its mundane moments. And the intellectual heft of characters like Sam Hamilton and Lee. We are treated to deep moral debates about each of the characters’ actions, and Lee, in particular, draws on several pagan sources to supplement this very Christian tale. One cannot help but feel this book rewrites Genesis through Cain’s eyes. And one feels for the rejected offering; one also feels the anger and jealousy that inevitably come with being the less anointed child. The titanic internal struggle for goodness against these carnal feelings. But only in this darkness can human nature be born. Our visceral dislike of Abel’s unnatural goodness shows that we (I) are Cain’s progeny after all. </p><p>The language in this book is simple, and spills off the pages. At times chapters seem written in frenzied haste, and at others each word is weighed as if by St. Peter himself. This book must have been a Herculean task, but the author proved more than equal to it. More Steinbeck to come!</p>]]>
    </content>
    <id>https://franciscormendes.github.io/2025/02/09/east-of-eden/</id>
    <link href="https://franciscormendes.github.io/2025/02/09/east-of-eden/"/>
    <published>2025-02-09T00:00:00.000Z</published>
    <summary>Steinbeck's retelling of Cain and Abel across California generations — the moral argument of 'timshel', and why the book's real subject is the capacity to choose otherwise.</summary>
    <title>Book Review: East of Eden by John Steinbeck</title>
    <updated>2026-04-10T14:24:00.545Z</updated>
  </entry>
  <entry>
    <author>
      <name>Francisco Romaldo Fernandes Mendes</name>
    </author>
    <category term="statistics" scheme="https://franciscormendes.github.io/categories/statistics/"/>
    <category term="signal-processing" scheme="https://franciscormendes.github.io/tags/signal-processing/"/>
    <category term="statistics" scheme="https://franciscormendes.github.io/tags/statistics/"/>
    <content>
<![CDATA[<h1 id="Matching-MATLAB’s-resample-function-in-Python"><a href="#Matching-MATLAB’s-resample-function-in-Python" class="headerlink" title="Matching MATLAB’s resample function in Python"></a>Matching MATLAB’s resample function in Python</h1><p>It is rather annoying that there is no fast equivalent of MATLAB’s resample function in Python that can be used with minimal theoretical knowledge of signal processing. This post provides a simple implementation of MATLAB’s resample function in Python. With, you guessed it, zero context and therefore no theoretical knowledge of signal processing required. The function has been tested against MATLAB’s resample function using a simple example; I might include that later. I had originally answered this on StackExchange, but the answer is lost because the question was deleted.<br>Btw, <a href="https://stackoverflow.com/questions/28506137/python-resampling-implementation-like-matlabs-signal-toolboxs-resampling-funct">this</a> did not work for me. </p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span 
class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> numpy <span class="keyword">as</span> np</span><br><span class="line"><span class="keyword">from</span> scipy.signal <span class="keyword">import</span> resample_poly</span><br><span class="line"><span class="keyword">from</span> math <span class="keyword">import</span> gcd</span><br><span class="line"><span class="keyword">def</span> <span class="title function_">matlab_resample</span>(<span class="params">x, resample_rate, orig_sample_rate</span>):</span><br><span class="line">    <span class="string">&quot;&quot;&quot;</span></span><br><span class="line"><span class="string">    Resample a signal by a rational factor (p/q) to match MATLAB&#x27;s `resample` function.</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">    Parameters:</span></span><br><span class="line"><span class="string">        x (array-like): Input signal.</span></span><br><span class="line"><span class="string">        resample_rate (int): Target sample rate (upsampling factor p before reduction).</span></span><br><span class="line"><span class="string">        orig_sample_rate (int): Original sample rate (downsampling factor q before reduction).</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">    Returns:</span></span><br><span class="line"><span class="string">        array-like: Resampled signal.</span></span><br><span class="line"><span class="string">    &quot;&quot;&quot;</span></span><br><span class="line">    p = resample_rate</span><br><span class="line">    q = orig_sample_rate</span><br><span class="line">    factor_gcd = gcd(<span class="built_in">int</span>(p), <span class="built_in">int</span>(q))</span><br><span class="line">    p = <span class="built_in">int</span>(p // factor_gcd)</span><br><span class="line">    q = <span class="built_in">int</span>(q // 
factor_gcd)</span><br><span class="line"></span><br><span class="line">    <span class="comment"># Ensure input is a numpy array</span></span><br><span class="line">    x = np.asarray(x)</span><br><span class="line"></span><br><span class="line">    <span class="comment"># Use resample_poly to perform efficient polyphase filtering</span></span><br><span class="line">    y = resample_poly(x, p, q, window=(<span class="string">&#x27;kaiser&#x27;</span>, <span class="number">5.0</span>))</span><br><span class="line"></span><br><span class="line">    <span class="comment"># Match MATLAB&#x27;s output length behavior</span></span><br><span class="line">    output_length = <span class="built_in">int</span>(np.ceil(<span class="built_in">len</span>(x) * p / q))</span><br><span class="line">    y = y[:output_length]</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> y</span><br></pre></td></tr></table></figure><h1 id="References"><a href="#References" class="headerlink" title="References"></a>References</h1><p><a href="https://stackoverflow.com/questions/28506137/python-resampling-implementation-like-matlabs-signal-toolboxs-resampling-funct">https://stackoverflow.com/questions/28506137/python-resampling-implementation-like-matlabs-signal-toolboxs-resampling-funct</a></p>]]>
    </content>
    <id>https://franciscormendes.github.io/2024/12/17/matching-matlabs-resample/</id>
    <link href="https://franciscormendes.github.io/2024/12/17/matching-matlabs-resample/"/>
    <published>2024-12-17T00:00:00.000Z</published>
    <summary>A drop-in Python implementation of MATLAB's resample function using scipy's polyphase filter — matching MATLAB's output exactly, with no signal-processing background required.</summary>
    <title>Matching MATLAB's resample function in Python</title>
    <updated>2026-04-10T14:24:00.555Z</updated>
  </entry>
  <entry>
    <author>
      <name>Francisco Romaldo Fernandes Mendes</name>
    </author>
    <category term="statistics" scheme="https://franciscormendes.github.io/categories/statistics/"/>
    <category term="a-b-testing" scheme="https://franciscormendes.github.io/tags/a-b-testing/"/>
    <category term="statistics" scheme="https://franciscormendes.github.io/tags/statistics/"/>
    <category term="experimentation" scheme="https://franciscormendes.github.io/tags/experimentation/"/>
    <category term="recommender-systems" scheme="https://franciscormendes.github.io/tags/recommender-systems/"/>
    <category term="causal-inference" scheme="https://franciscormendes.github.io/tags/causal-inference/"/>
    <content>
<![CDATA[<div class="series-box">  <div class="series-label">Series</div>  <div class="series-name">Bayesian Methods and Experimentation</div>  <ol class="series-list"><li class="series-item"><a href="/2024/07/19/bayesian-statistics/">Bayesian Statistics : A/B Testing, Thompson sampling of multi-armed bandits, Recommendation Engines and more from Big Consulting</a></li><li class="series-item series-current"><span>The Management Consulting Playbook for AB Testing (with an emphasis on Recommender Systems)</span></li><li class="series-item"><a href="/2024/08/04/rct-your-way-to-policy/">No, You Cannot RCT Your Way to Policy</a></li><li class="series-item"><a href="/2026/04/10/bayesian-vs-frequentist-sample-size/">Bayesian A/B Testing Is Not Immune to Peeking: Insights from a Costly Two-Sided Marketplace</a></li></ol></div><h1 id="Introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h1><p>Although I’ve focused much more on the ML side of consulting projects—and I really enjoy it—I’ve often had to dust off my statistician hat to measure how well the algorithms I build actually perform. Most of my experience in this area has been in verifying that recommendation engines, once deployed, truly deliver value. In this article, I’ll explore some key themes in AB Testing. While I tried to be as general as possible, I did drill down on specific concepts that are particularly salient to recommender systems. </p><p>I thoroughly enjoy the “measurement science” behind these challenges; it’s a great reminder that classic statistics is far from obsolete. In practice, it also lets us make informed claims based on simulations, even if formal proofs aren’t immediately available. 
I’ve also included some helpful simulations.</p><h1 id="Basic-Structure-of-AB-Testing"><a href="#Basic-Structure-of-AB-Testing" class="headerlink" title="Basic Structure of AB Testing"></a>Basic Structure of AB Testing</h1><p>AB Testing begins on day zero, often in a room full of stakeholders, where your task is to prove that your recommendation engine, feature (like a new button), or pricing algorithm really works. Here, the focus shifts from the predictive power of machine learning to the causal inference side of statistics. (Toward the end of this article, I’ll also touch briefly on causal inference within the context of ML.)</p><h1 id="Phase-1-Experimental-Context"><a href="#Phase-1-Experimental-Context" class="headerlink" title="Phase 1: Experimental Context"></a>Phase 1: Experimental Context</h1><ul><li><p><strong>Define the feature under analysis</strong> and evaluate whether AB testing is necessary. Sometimes, if a competitor is already implementing the feature, testing may not be essential; you may simply need to keep pace.</p></li><li><p><strong>Establish a primary metric of interest.</strong> In consulting projects, this metric often aligns closely with engagement fees, so it’s critical to define it well.</p></li><li><p><strong>Identify guardrail metrics</strong>—these are typically independent of the experiment (e.g., revenue, profit, total rides, wait time) and represent key business metrics that should not be negatively impacted by the test.</p></li><li><p><strong>Set a null hypothesis,</strong> $H_0$ (usually representing a zero effect size on the main metric). Consider what would happen without the experiment, which may involve using non-ML recommendations or an existing ML recommendation in recommendation engine contexts.</p></li><li><p><strong>Specify a significance level,</strong> $\alpha$, which is the maximum probability of rejecting the null hypothesis when it is true, commonly set at 0.05. 
This value is conventional but somewhat arbitrary, and it’s challenging to justify because humans often struggle to assign accurate probabilities to risk.</p></li><li><p><strong>Define the alternative hypothesis,</strong> $H_1$, indicating the minimum effect size you hope to observe. For example, in a PrimeTime pricing experiment, you’d specify the smallest expected change in your chosen metric, such as whether rides will increase by hundreds or by 1%. This effect size is generally informed by prior knowledge and reflects the threshold at which the feature becomes worthwhile.</p></li><li><p><strong>Choose a power level,</strong> $1 - \beta$, usually set to 0.8. This means there is at least an 80% chance of rejecting the null hypothesis when $H_1$ is true.</p></li><li><p><strong>Select a test statistic</strong> with a known distribution under both hypotheses. The sample average of the metric of interest is often a good choice.</p></li><li><p><strong>Determine the minimum sample size</strong> required to achieve the desired power level $1 - \beta$ with all given parameters.</p></li></ul><p>Before proceeding, it’s crucial to recognize that many choices, like those for $\alpha$ and $\beta$, are inherently subjective. Often, these parameters are predefined by an existing statistics or measurement science team, and a “Risk” team may also weigh in to ensure the company’s risk profile remains stable. For instance, if you’re testing a recommendation engine, implementing a new pricing algorithm, and cutting costs simultaneously, the risk team might have input on how much overall risk the company can afford. 
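The quantities fixed in Phase 1 ($\alpha$, power $1-\beta$, the minimum effect size, and the metric's variability) pin down the minimum sample size. Here is a minimal sketch using the standard normal-approximation formula for a two-sample test of means; the baseline conversion rate and MDE below are made-up numbers purely for illustration:

```python
from math import ceil, sqrt
from statistics import NormalDist

def min_sample_size_per_arm(mde, sigma, alpha=0.05, power=0.8):
    """Normal-approximation sample size for a two-sample test of means:
    n per arm = 2 * sigma^2 * (z_{1-alpha/2} + z_{power})^2 / mde^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * sigma**2 * (z_alpha + z_power) ** 2 / mde**2)

# e.g. detecting a 1-percentage-point lift on a ~10% conversion rate
sigma = sqrt(0.10 * 0.90)  # Bernoulli standard deviation at the baseline
print(min_sample_size_per_arm(mde=0.01, sigma=sigma))
```

Note that halving the MDE quadruples the required sample size, which is why the haggling over effect sizes matters so much.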
This subjectivity often makes Bayesian approaches appealing, driving interest in a Bayesian perspective for AB Testing.</p><h1 id="Phase-2-Experiment-Design"><a href="#Phase-2-Experiment-Design" class="headerlink" title="Phase 2: Experiment Design"></a>Phase 2: Experiment Design</h1><p>With the treatment, hypothesis, and metrics established, the next step is to define the unit of randomization for the experiment and determine when each unit will participate. The chosen unit of randomization should allow accurate measurement of the specified metrics, minimize interference and network effects, and account for user experience considerations. The next couple of sections will dive deeper into certain considerations when designing an experiment, and how to overcome them statistically. In a recommendation engine context, this can be quite complex. Since both treatment and control groups share the same pool of products, increased purchases driven by the online recommendation can cause the stock to run out for people who physically visit the store. So if we see the control group (i.e. the group not exposed to the new recommender system) buying more competitor products (competitors to the products you are recommending), this could simply be because the product was not available, and the treatment was much more effective than it seemed!</p><h1 id="Unit-of-Randomization-and-Interference"><a href="#Unit-of-Randomization-and-Interference" class="headerlink" title="Unit of Randomization and Interference"></a>Unit of Randomization and Interference</h1><p>Now that you have approval to run your experiment, you need to define the unit of randomization. This can be tricky because there are often multiple levels at which randomization can be carried out: for example, you can randomize your app experience by session, or by user. This leads to our first big problem in AB testing. What is the best unit of randomization? 
And what are the pitfalls of picking the wrong unit? Sometimes the unit is picked for you: you simply may not have recommendation engine data at the exact level you want. A unit is often hard to conceptualize; it is easy to think of it as one user, but one user at different points in their journey through the app can be treated as different units.</p><h2 id="Example-of-Interference"><a href="#Example-of-Interference" class="headerlink" title="Example of Interference"></a>Example of Interference</h2><p>Interference is a huge problem for recommendation engines in most retail settings. Let me walk you through an interesting example we saw for a large US retailer. We were testing the effect of recommending a certain product (high margin, obviously!) to users. The treatment group was shown the product and the control group was not. The metric of interest was the number of purchases of a basket of high-margin products. The control group purchased the product at a rate of $\tau_0\%$ and the treatment group purchased the product at a rate of $\tau_t\%$. The experiment was significant at the $0.05$ level. However, after the experiment we noticed that the difference in sales narrowed to $\tau_t - \tau_0 = \delta\%$. This was because the treatment group was buying up the stock of the product, and the control group was not buying it because they <em>could not</em>. Sometimes the act of being recommended a product was a kind of treatment in itself. This is a non-classical example of interference, and a good reason to use a formal causal inference framework to measure the effect of the treatment. One way to do this is with DAGs, which I will discuss later. The best way to run an experiment like this is to randomize by region. However, this is not always possible, since regions share the same stock. 
But I think you get the idea.</p><h2 id="Robust-Standard-Errors-in-AB-Tests"><a href="#Robust-Standard-Errors-in-AB-Tests" class="headerlink" title="Robust Standard Errors in AB Tests"></a>Robust Standard Errors in AB Tests</h2><p>You can fix interference by clustering at the region level, but very often this leads to another problem of its own. The unit of treatment allocation is now fundamentally bigger than the unit at which you are conducting the analysis. We do not really recommend products at the store level; we recommend products at the user level. So while we assign treatment and control at the store level, we are analyzing effects at the user level. As a consequence we need to adjust our standard errors to account for this. This is where robust standard errors come in. In such a case, the standard errors you calculate for the average treatment effect are<br><em>lower</em> than what they truly are. And this has far-reaching effects for power, effect size and the like.</p><p>Recall the variance of the OLS estimator,</p>$$\text{Var}(\hat \beta) = (X'X)^{-1} X' \epsilon \epsilon' X (X'X)^{-1}$$<p>You can analyze the variance matrix under various assumptions to estimate $$\epsilon \epsilon' = \Omega$$</p><p>Under homoscedasticity,</p>$$\Omega = \begin{bmatrix} \sigma^2 & 0 & \dots & 0 & 0 \\ 0 & \sigma^2 & \dots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \dots & \sigma^2 & 0 \\ 0 & 0 & \dots & 0 & \sigma^2 \\ \end{bmatrix} = \sigma^2 I_n$$<p>Under heteroscedasticity (heteroscedasticity-robust standard errors),</p>$$\Omega = \begin{bmatrix} \sigma^2_1 & 0 & \dots & 0 & 0 \\ 0 & \sigma^2_2 & & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & & \sigma^2_{n-1} & 0 \\ 0 & 0 & \dots & 0 & \sigma^2_n \\ \end{bmatrix}$$<p>And finally under clustering (here, clusters of size two), $$\Omega = \begin{bmatrix} \epsilon_1^2 & \epsilon_1 \epsilon_2 & 0 & 0 & \dots & 0 & 0 \\ \epsilon_1 \epsilon_2 & \epsilon_2^2 & 0 & 0 & & 0 & 0 \\ 0 & 0 & \epsilon_3^2 & \epsilon_3 \epsilon_4 & & 0 & 0 \\ 0 & 0 & \epsilon_3 \epsilon_4 & \epsilon_4^2 & & 0 & 0 \\ \vdots & & & & \ddots & & \vdots \\ 0 & 0 & 0 & 0 & & \epsilon_{n-1}^2 & \epsilon_{n-1} \epsilon_n \\ 0 & 0 & 0 & 0 & \dots & \epsilon_{n-1} \epsilon_n & \epsilon_n^2 \\ \end{bmatrix}$$</p><p>The cookbook for estimating $\Omega$ is therefore to multiply your matrix $\epsilon\epsilon'$ elementwise with some kind of banded matrix $C$ that represents your assumption,</p>$$\Omega = C \odot \epsilon \epsilon' = \begin{bmatrix} 1 & 1 & 0 & 0 & \dots & 0 & 0 \\ 1 & 1 & 0 & 0 & \dots & 0 & 0 \\ 0 & 0 & 1 & 1 & \dots & 0 & 0 \\ 0 & 0 & 1 & 1 & \dots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \dots & 1 & 1 \\ 0 & 0 & 0 & 0 & \dots & 1 & 1 \\ \end{bmatrix} \odot \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} & \dots & \sigma_{1n} \\ \sigma_{12} & \sigma_2^2 & \sigma_{23} & \dots & \sigma_{2n} \\ \sigma_{13} & \sigma_{23} & \sigma_3^2 & \dots & \sigma_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \sigma_{1n} & \sigma_{2n} & \sigma_{3n} & \dots & \sigma_n^2 \\ \end{bmatrix}$$<h2 id="Range-of-Clustered-Standard-Errors"><a href="#Range-of-Clustered-Standard-Errors" class="headerlink" title="Range of Clustered Standard Errors"></a>Range of Clustered Standard Errors</h2>$$\hat{\text{Var}}(\hat{\beta}) = \sum_{g=1}^G \sum_{i=1}^{n_g} \sum_{j=1}^{n_g} \epsilon_i \epsilon_j$$$$\hat{\text{Var}}(\hat{\beta}) \in [ \sum_{i} \epsilon_i^2, \sum_{g} n_g^2 \bar{\epsilon}_g^2]$$<p>The left boundary corresponds to no clustering, where all errors are independent; the right boundary corresponds to perfect correlation of errors within each cluster (here $\bar{\epsilon}_g$ is the average error in cluster $g$). It is fair to ask why we need to multiply by a matrix of assumptions $C$ at all; the answer is that the assumptions scale the error to tolerable levels, such that the error is not too large or too small. 
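The sandwich recipe above is easy to simulate. Below is a toy numpy sketch (all data invented) that assigns treatment at the cluster level, injects a shared within-cluster shock, and compares the naive homoscedastic variance of the treatment coefficient with the cluster-robust version, where the assumption matrix has $C_{ij} = 1$ only when observations $i$ and $j$ share a cluster:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: 6 regions, 20 users each, treatment assigned per region
n_clusters, n_per = 6, 20
cluster = np.repeat(np.arange(n_clusters), n_per)
treat = (cluster % 2).astype(float)            # whole regions treated
shock = rng.normal(size=n_clusters)[cluster]   # shared within-region error
y = 0.5 * treat + shock + rng.normal(size=cluster.size)

X = np.column_stack([np.ones_like(treat), treat])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
eps = y - X @ beta

# naive variance: Omega = sigma^2 * I
sigma2 = eps @ eps / (len(y) - X.shape[1])
var_naive = sigma2 * XtX_inv

# cluster-robust sandwich: Omega = C (elementwise) eps eps'
C = (cluster[:, None] == cluster[None, :]).astype(float)
Omega = C * np.outer(eps, eps)
var_cluster = XtX_inv @ X.T @ Omega @ X @ XtX_inv

print(var_naive[1, 1] ** 0.5, var_cluster[1, 1] ** 0.5)
```

With only six clusters the robust estimate is itself noisy; in practice you would also apply a small-sample correction, but the point is that the clustered standard error on the treatment coefficient comes out much larger than the naive one.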
By pure coincidence, it is possible to have high covariance between any two observations; whether to include it or not is determined by your assumption matrix $C$.</p><h1 id="Power-Analysis"><a href="#Power-Analysis" class="headerlink" title="Power Analysis"></a>Power Analysis</h1><p>I have found that power analysis is an overlooked part of AB Testing; in consulting you will probably have to work with the existing experimentation team to make sure the experiment is powered correctly. There is usually some amount of haggling, and your tests are likely to be underpowered. There is a good argument to be made for overpowering your tests (no such term exists in statistics; who would complain about that?), but this usually comes with some risk to guardrail metrics, so you are likely to underpower your tests when a guardrail metric is involved. This is OKAY, because the $0.05$ significance level and the $0.8$ power level are conventions that by definition err on the side of NOT rejecting the null. So if you see an effect with an underpowered test, you do have some latitude to make a claim while relaxing the significance level of your test.</p><p>Power analysis focuses on reducing the probability of accepting the null hypothesis when the alternative is true. To increase the power of an A&#x2F;B test and reduce false negatives, three key strategies can be applied:</p><ul><li><p>Effect Size: Larger effect sizes are easier to detect. This can be achieved by testing bold, high-impact changes or trying new product areas with greater potential for improvement. Larger deviations from the baseline make it easier for the experiment to reveal significant effects.</p></li><li><p>Sample Size: Increasing sample size boosts the test’s accuracy and ability to detect smaller effects. With more data, the observed metric tends to be closer to its true value, enhancing the likelihood of detecting genuine effects. 
Adding more participants or reducing the number of test groups can improve power, though there’s a balance to strike between test size and the number of concurrent tests.</p></li><li><p>Reducing Metric Variability: Less variability in the test metric across the sample makes it easier to spot genuine effects. Targeting a more homogeneous sample or employing models that account for population variability helps reduce noise, making subtle signals easier to detect.</p></li></ul><p>Finally, experiments are often powered at 80% for a postulated effect size — enough to detect meaningful changes that justify the new feature’s costs or improvements. Meaningful effect sizes depend on context, domain knowledge, and historical data on expected impacts, and this understanding helps allocate testing resources efficiently.</p><p><img src="/2024/11/08/consulting-ab-testing/netflix_power.png" alt="Power 2"></p><p>In an A&#x2F;B test, the power of a test (the probability of correctly detecting a true effect) is influenced by the effect size, sample size, significance level, and pooled variance. 
The formula for power, $1 - \beta$, can be approximated as follows for a two-sample test:</p>$$\text{Power} = \Phi \left( \frac{\Delta - z_{1-\alpha/2} \cdot \sigma_{\text{pooled}} / \sqrt{n}}{\sigma_{\text{pooled}} / \sqrt{n}} \right)$$<p>Where,</p><ul><li>$\Delta$ is the <strong>Minimum Detectable Effect (MDE)</strong>, representing the smallest effect size we aim to detect.</li><li>$z_{1-\alpha/2}$ is the critical z-score for a significance level $\alpha$ (e.g., 1.96 for a 95% confidence level).</li><li>$\sigma_{\text{pooled}}$ is the <strong>pooled standard deviation</strong> of the metric across groups, representing the combined variability.</li><li>$n$ is the <strong>sample size per group</strong>.</li><li>$\Phi$ is the <strong>cumulative distribution function</strong> (CDF) of the standard normal distribution, which gives the probability that a value is below a given z-score.</li></ul><h2 id="Understanding-the-Role-of-Pooled-Variance"><a href="#Understanding-the-Role-of-Pooled-Variance" class="headerlink" title="Understanding the Role of Pooled Variance"></a>Understanding the Role of Pooled Variance</h2><ul><li><p><strong>Power decreases</strong> as the <strong>pooled variance</strong> ($\sigma_{\text{pooled}}^2$) increases. Higher variance increases the &quot;noise&quot; in the data, making it more challenging to detect the effect (MDE) relative to the variation.</p></li><li><p>When <strong>pooled variance is low</strong>, the test statistic (difference between groups) is less likely to be drowned out by noise, so the test is more likely to detect even smaller differences. 
This results in <strong>higher power</strong> for a given sample size and effect size.</p></li></ul><h2 id="Practical-Implications"><a href="#Practical-Implications" class="headerlink" title="Practical Implications"></a>Practical Implications</h2><p>In experimental design:</p><ul><li><p>Reducing $\sigma_{\text{pooled}}$ (e.g., by choosing a more homogeneous sample) improves power without increasing sample size.</p></li><li><p>If $\sigma_{\text{pooled}}$ is high due to natural variability, increasing the sample size $n$ compensates by lowering the standard error $\left(\frac{\sigma_{\text{pooled}}}{\sqrt{n}}\right)$, thereby maintaining power.</p></li></ul><h1 id="Difference-in-Difference"><a href="#Difference-in-Difference" class="headerlink" title="Difference in Difference"></a>Difference in Difference</h1><p>Randomizing by region to solve interference can create a new issue: regional trends may bias results. If, for example, a fast-growing region is assigned to the treatment, any observed gains may simply reflect that region’s natural growth rather than the treatment’s effect.</p><p>In recommender system tests aiming to boost sales, retention, or engagement, this issue can be problematic. Assigning a growing region to control and a mature one to treatment will almost certainly make the treatment appear less effective, potentially masking the true impact of the recommendations.</p><h2 id="Linear-Regression-Example-of-DiD"><a href="#Linear-Regression-Example-of-DiD" class="headerlink" title="Linear Regression Example of DiD"></a>Linear Regression Example of DiD</h2><p>To understand the impact of a new treatment on a group, let’s consider an example where everyone in group $G$ receives a treatment at time $t_e$. 
Our goal is to measure how this treatment affects outcomes over time.</p><p>First, we’ll introduce some notation:</p><p>Define $\mathbb{1}_A(x)$, which indicates if $x$ belongs to a specific set $A$: $$\mathbb{1}_A(x) = \begin{cases} 1 \quad \text{if } x \in A \\ 0 \quad \text{otherwise} \end{cases}$$</p><div style="text-align:left">Let $T = \{t : t > t_e\}$, which represents the period after treatment. We can use this to set up a few key indicators:<ul><li>$\mathbb{1}_{T(t)} = 1$ if the time $t$ is after the treatment, and $0$ otherwise.</li><li>$\mathbb{1}_{G(i)} = 1$ if an individual $i$ is in group $G$, meaning they received the treatment.</li><li>If both are $1$, the observation belongs to the treatment group during the post-treatment period.</li></ul></div><p>Using these indicators, we can build a simple linear regression model:</p><div style="text-align:left">$$y_{it} = \beta_0 + \beta_1 \mathbb{1}_{T(t)} + \beta_2 \mathbb{1}_{G(i)} + \beta_3 \mathbb{1}_{T(t)} \mathbb{1}_{G(i)} + \epsilon_{it}$$</div><p>In this model, the coefficient $\beta_3$ is the term we’re most interested in. It represents the Difference-in-Differences (DiD) effect: how much the treatment group’s outcome changes after treatment compared to the control group’s change in the same period. In other words, $\beta_3$ provides a clearer picture of the treatment’s direct impact, isolating it from other factors.</p><p>For this model to work reliably, we rely on the <em>parallel trends assumption</em>: the control and treatment groups would have followed similar paths over time if there had been no treatment. Although the initial levels of $y_{it}$ can differ between groups, they should trend together in the absence of intervention.</p><h2 id="Testing-the-Parallel-Trends-Assumption"><a href="#Testing-the-Parallel-Trends-Assumption" class="headerlink" title="Testing the Parallel Trends Assumption"></a>Testing the Parallel Trends Assumption</h2><p>You can always test whether your data satisfies the parallel trends assumption by looking at it.
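Before getting to diagnostics, the DiD model above can be estimated with plain OLS; a minimal sketch on simulated data (the group sizes, baseline levels, and the true effect of 2.0 are all assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200                                    # individuals per group

# Outcomes for control/treatment, pre/post; true DiD effect = 2.0
y_c0 = 10 + rng.standard_normal(n)         # control, pre-treatment
y_c1 = 11 + rng.standard_normal(n)         # control, post (common trend +1)
y_t0 = 12 + rng.standard_normal(n)         # treatment, pre (different level)
y_t1 = 13 + 2.0 + rng.standard_normal(n)   # treatment, post (trend + effect)

y = np.concatenate([y_c0, y_c1, y_t0, y_t1])
post = np.concatenate([np.zeros(n), np.ones(n), np.zeros(n), np.ones(n)])
grp = np.concatenate([np.zeros(2 * n), np.ones(2 * n)])

# Design matrix: intercept, 1_T, 1_G, interaction; beta_3 is the DiD effect
X = np.column_stack([np.ones_like(y), post, grp, post * grp])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta[3])                             # close to the true effect of 2.0
```

Note that the different baseline levels (10 vs. 12) do not bias $\beta_3$; only a difference in trends would.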
In a practical environment, I have never really tested this assumption, for two big reasons (they are also why I personally think DiD is not a great method):</p><ul><li>If you need to test an assumption in your data, you likely have a problem with your data. If it is not obvious from some non-statistical argument or a plot, you are unlikely to convince a stakeholder that it is a good assumption.</li><li>The data required to test this assumption usually removes the need for it. If you have data to test this assumption, you likely have enough data to run a more sophisticated model than DiD (like CUPED).</li></ul><p>Having said all that, here are some ways you can test the parallel trends assumption:</p><ul><li><p><strong>Visual Inspection:</strong></p><ul><li><p>Plot the average outcome variable over time for both the treatment and control groups, focusing on the pre-treatment period. If the trends appear roughly parallel before the intervention, this provides visual evidence supporting the parallel trends assumption.</p></li><li><p>Make sure any divergence between the groups only occurs after the treatment.</p></li></ul></li><li><p><strong>Placebo Test:</strong></p><ul><li><p>Pretend the treatment occurred at a time prior to the actual intervention and re-run the DiD analysis. If you find a significant “effect” before the true treatment, this suggests that the parallel trends assumption may not hold.</p></li><li><p>Use a range of pre-treatment cutoff points and check if similar differences are estimated. Consistent non-zero results may indicate underlying trend differences unrelated to the actual treatment.</p></li></ul></li><li><p><strong>Event Study Analysis (Dynamic DiD):</strong></p><ul><li><p>Extend the DiD model by including lead and lag indicators for the treatment.</p></li><li><p>If pre-treatment coefficients (leads) are close to zero and non-significant, it supports the parallel trends assumption.
Large or statistically significant leads could indicate violations of the assumption.</p></li></ul></li><li><p><strong>Formal Statistical Tests:</strong></p><ul><li><p>Run a regression on only the pre-treatment period, introducing an interaction term between time and group to test for significant differences in trends: $$y_{it} = \alpha_0 + \alpha_1 t + \alpha_2 \mathbb{1}_{G(i)} + \alpha_3 \, t \cdot \mathbb{1}_{G(i)} + \epsilon_{it}$$</p></li><li><p>If the coefficient $\alpha_3$ on the interaction term is close to zero and statistically insignificant, this supports the parallel trends assumption. A significant $\alpha_3$ would indicate a pre-treatment trend difference, which would challenge the assumption.</p></li></ul></li><li><p><strong>Covariate Adjustment (Conditional Parallel Trends):</strong></p><ul><li>If parallel trends don’t hold unconditionally, you might adjust for observable characteristics that vary between groups and influence the outcome. This is a more relaxed “conditional parallel trends” assumption, and you could check if trends are parallel after including covariates in the model.</li></ul></li></ul><p>If you can make all this work for you, great; I never have. In the dynamic world of recommendation engines (especially always-online recommendation engines) it is very difficult to find a reasonably good cut-off point for the placebo test. And the event study analysis is usually not very useful, since the treatment is usually ongoing.</p><h1 id="Peeking-and-Early-Stopping"><a href="#Peeking-and-Early-Stopping" class="headerlink" title="Peeking and Early Stopping"></a>Peeking and Early Stopping</h1><p>Your test is running, and you’re getting results—some look good, some look bad. Let’s say you decide to stop early and reject the null hypothesis because the data looked good. You shouldn’t: by stopping on good-looking data, you are inflating the Type I error rate of the test. A quick simulation can show the difference: with early stopping or peeking, your rejection rate of the null hypothesis is much higher than the 0.05 you intended.
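A minimal sketch of such a simulation, assuming a true null, experiments of 1000 observations each, and peeking at every observation from the 100th onward with a $|z| > 1.96$ boundary:

```python
import numpy as np

rng = np.random.default_rng(0)


def run_experiment(n=1000, peek_from=100, z_crit=1.96, peek=True):
    """Simulate one experiment under the null (mean 0) and report rejection."""
    x = rng.standard_normal(n)
    if not peek:
        z = x.sum() / np.sqrt(n)           # z-statistic at the fixed horizon
        return abs(z) > z_crit
    csum = np.cumsum(x)
    ns = np.arange(1, n + 1)
    z = csum / np.sqrt(ns)                 # running z-statistic at every look
    return bool((np.abs(z[peek_from:]) > z_crit).any())


fixed = sum(run_experiment(peek=False) for _ in range(100))
peeked = sum(run_experiment(peek=True) for _ in range(100))
print(fixed, peeked)   # peeking rejects the true null far more often
```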
This isn’t surprising: every additional look at the data is another chance for the test statistic to cross the threshold by luck, so repeated checks inflate the rate of rejecting the null when it’s true.</p><p>The benefits of early stopping aren’t just about self-control. It can also help prevent a bad experiment from affecting critical guardrail metrics, letting you limit the impact while still gathering needed information. Another example is when testing expendable items. Think about a magazine of bullets: if you test by firing each bullet, you’re guaranteed they all work, but now you have no bullets left. So you might rephrase the experiment as: how many bullets do I need to fire to know this magazine works?</p><p>In consulting you are going to peek early; you have to live with it. For one reason or another, a bug in production or an eager client, whatever the case, you are going to peek, so you had better prepare accordingly.</p><div align="left"><h4 id="Simulated-Effect-of-Peeking-on-Experiment-Outcomes"><a href="#Simulated-Effect-of-Peeking-on-Experiment-Outcomes" class="headerlink" title="Simulated Effect of Peeking on Experiment Outcomes"></a>Simulated Effect of Peeking on Experiment Outcomes</h4><table><thead><tr><th><img src="/2024/11/08/consulting-ab-testing/withoutPeeking.png" alt="Without Peeking"></th><th><img src="/2024/11/08/consulting-ab-testing/withPeekingAfter100rounds.png" alt="With Peeking"></th></tr></thead><tbody><tr><td><strong>(a) Without Peeking:</strong> $\frac{3}{100}$ reject null, $\alpha=0.05$</td><td><strong>(b) With Peeking:</strong> $\frac{29}{100}$ reject null, $\alpha=0.05$</td></tr></tbody></table></div><p>Under a given null hypothesis, we run 100 simulations of experiments and record the z-statistic for each. We do this once without peeking, letting the experiments run for $1000$ observations.
In the peeking case, we stop whenever the z-statistic crosses the boundary, but only after the $100$th observation.</p><h1 id="Sequential-Testing-for-Peeking"><a href="#Sequential-Testing-for-Peeking" class="headerlink" title="Sequential Testing for Peeking"></a>Sequential Testing for Peeking</h1><p>The Sequential Probability Ratio Test (SPRT) compares the likelihood ratio at the $n$-th observation, given by:</p>$$\Lambda_n = \frac{L(H_1 \mid x_1, x_2, \dots, x_n)}{L(H_0 \mid x_1, x_2, \dots, x_n)}$$<p>where $L(H_0 \mid x_1, x_2, \dots, x_n)$ and $L(H_1 \mid x_1, x_2, \dots, x_n)$ are the likelihood functions under the null hypothesis $H_0$ and the alternative hypothesis $H_1$, respectively.</p><p>The test compares the likelihood ratio to two thresholds, $A$ and $B$, and the decision rule is:</p>$$\text{If } \Lambda_n \geq A, \text{ accept } H_1,$$ $$\text{If } \Lambda_n \leq B, \text{ accept } H_0,$$ $$\text{If } B < \Lambda_n < A, \text{ continue sampling}.$$<p>The thresholds $A$ and $B$ are determined based on the desired error probabilities.
For a significance level $\alpha$ (probability of a Type I error) and power $1 - \beta$ (probability of detecting a true effect when $H_1$ is true), the thresholds are given by:</p>$$A = \frac{1 - \beta}{\alpha}, \quad B = \frac{\beta}{1 - \alpha}.$$<h3 id="Normal-Distribution"><a href="#Normal-Distribution" class="headerlink" title="Normal Distribution"></a>Normal Distribution</h3><p>This test is in practice a lot easier to carry out for certain distributions. For the normal distribution, assume an unknown mean $\mu$ and known variance $\sigma^2$:</p>$$\begin{aligned} H_0: \quad & \mu = 0 , \\ H_1: \quad & \mu = 0.1 \end{aligned}$$$$\mathcal L(\mu) = \left( \frac{1}{\sqrt{2 \pi} \sigma } \right)^n e^{- \sum_{i=1}^{n} \frac{(X_i - \mu)^2}{2 \sigma^2}}$$$$\Lambda(X) = \frac{\mathcal L (0.1, \sigma^2)}{\mathcal L (0, \sigma^2)} = \frac{e^{- \sum_{i=1}^{n} \frac{(X_i - 0.1)^2}{2 \sigma^2}}}{e^{- \sum_{i=1}^{n} \frac{(X_i)^2}{2 \sigma^2}}}$$<p>The sequential rule becomes the recurrent sum, $S_i$ (with $S_0=0$) $$S_{i} = S_{i-1} + \log(\Lambda_{i})$$</p><p>With the stopping rule</p><ul><li>$S_i \geq b$ : Accept $H_1$</li><li>$S_i \leq a$ : Accept $H_0$</li><li>$a<S_i<b$ : continue</li></ul>$a \approx \log {\frac {\beta }{1-\alpha }} \quad \text{and} \quad  b \approx \log {\frac {1-\beta }{\alpha }}$<p>There is another elegant method outlined in Evan Miller’s blog post, which I will not derive here but simply state (it is also used at Etsy, so there is certainly some benefit to it). It is a very good read and I highly recommend it.</p><ul><li>At the beginning of the experiment, choose a sample size $N$.</li><li>Assign subjects randomly to the treatment and control, with 50% probability each.</li><li>Track the number of incoming successes from the treatment group. Call this number $T$.</li><li>Track the number of incoming successes from the control group. Call this number $C$.</li><li>If $T−C$ reaches $2\sqrt{N}$, stop the test.
Declare the treatment to be the winner.</li><li>If $T+C$ reaches $N$, stop the test. Declare no winner.</li><li>If neither of the above conditions is met, continue the test.</li></ul><p>Using these techniques you can “peek” at the test data as it comes in and decide to stop as per your requirement. This is very useful, as the following simulation using this more complex criterion shows. Note that what you want to verify is two things:</p><ul><li><p>Under the null hypothesis, does early stopping falsely reject the null in only about an $\alpha$ fraction of simulations, and does it do so<br><em>fast</em>?</p></li><li><p>Under the alternative, does early stopping reject the null hypothesis in a $1-\beta$ fraction of simulations, and does it do so<br><em>fast</em>?</p></li></ul><p>The answer to these two questions is not always symmetrical, and it seems that we need more samples to reject the null (case 2) than to accept it (case 1). Which is as it should be! But in both cases, as the simulations below show, you need significantly fewer samples than before.</p><h2 id="CUPED-and-Other-Similar-Techniques"><a href="#CUPED-and-Other-Similar-Techniques" class="headerlink" title="CUPED and Other Similar Techniques"></a>CUPED and Other Similar Techniques</h2><p>Recall our diff-in-diff equation, $$Y_{i,t} = \alpha + \beta D_i + \gamma \mathbb I (t=1) + \delta D_i \cdot \mathbb I (t=1) + \varepsilon_{i,t}$$</p><p>Diff in Diff is nothing but CUPED for $\theta=1$. I state this without proof; I was not able to find a clear one anywhere.</p><p>Consider the autoregression-with-control-variates equation, $$Y_{i, t=1} = \alpha + \beta D_i + \gamma Y_{i, t=0} + \varepsilon_i$$ This is also NOT equivalent to CUPED, nor is it a special case.
Again, I was not able to find a good proof anywhere.</p><h1 id="Multiple-Hypotheses"><a href="#Multiple-Hypotheses" class="headerlink" title="Multiple Hypotheses"></a>Multiple Hypotheses</h1><p>In most of the introduction, we set the scene by considering only one hypothesis. However, in real life you may want to test multiple hypotheses at the same time.</p><ul><li><p>You may be testing multiple hypotheses even if you did not realize it. In the example of early stopping, you are actually checking multiple hypotheses: one at every time point.</p></li><li><p>You may truly want to test multiple features of your product at the same time and run one test to see if the results improved.</p></li></ul><h2 id="Regression-Model-Setup"><a href="#Regression-Model-Setup" class="headerlink" title="Regression Model Setup"></a>Regression Model Setup</h2><p>We consider a regression model with three treatments, $D_1$, $D_2$, and $D_3$, to study their effects on a continuous outcome variable, $Y$. The model is specified as: $$Y = \beta_0 + \beta_1 D_1 + \beta_2 D_2 + \beta_3 D_3 + \epsilon$$ where:</p><ul><li>$Y$ is the outcome variable,</li><li>$D_1$, $D_2$, and $D_3$ are binary treatment indicators (1 if the treatment is applied, 0 otherwise),</li><li>$\beta_0$ is the intercept,</li><li>$\beta_1$, $\beta_2$, and $\beta_3$ are the coefficients representing the effects of treatments $D_1$, $D_2$, and $D_3$, respectively,</li><li>$\epsilon$ is the error term, assumed to be normally distributed with mean 0 and variance $\sigma^2$.</li></ul><h2 id="Hypotheses-Setup"><a href="#Hypotheses-Setup" class="headerlink" title="Hypotheses Setup"></a>Hypotheses Setup</h2><p>We aim to test whether each treatment has a significant effect on the outcome variable $Y$.
This involves testing the null hypothesis that each treatment coefficient is zero.</p><p>The null hypotheses are formulated as follows: $$H_0^{(1)}: \beta_1 = 0$$ $$H_0^{(2)}: \beta_2 = 0$$ $$H_0^{(3)}: \beta_3 = 0$$</p><p>Each null hypothesis represents the assumption that a particular treatment (either $D_1$, $D_2$, or $D_3$) has no effect on the outcome variable $Y$, implying that the treatment coefficient $\beta_i$ for that treatment is zero.</p><h2 id="Multiple-Hypothesis-Testing"><a href="#Multiple-Hypothesis-Testing" class="headerlink" title="Multiple Hypothesis Testing"></a>Multiple Hypothesis Testing</h2><p>Since we are testing three hypotheses simultaneously, we need to control for the potential increase in false positives. We can use a multiple hypothesis testing correction method, such as the<br><strong>Bonferroni correction</strong> or the <strong>Benjamini-Hochberg procedure</strong>.</p><h2 id="Bonferroni-Correction"><a href="#Bonferroni-Correction" class="headerlink" title="Bonferroni Correction"></a>Bonferroni Correction</h2><p>With the Bonferroni correction, we adjust the significance level $\alpha$ for each hypothesis test by dividing it by the number of tests $m = 3$. If we want an overall significance level of $\alpha = 0.05$, then each individual hypothesis would be tested at: $$\alpha_{\text{adjusted}} = \frac{\alpha}{m} = \frac{0.05}{3} = 0.0167$$</p><h2 id="Benjamini-Hochberg-Procedure"><a href="#Benjamini-Hochberg-Procedure" class="headerlink" title="Benjamini-Hochberg Procedure"></a>Benjamini-Hochberg Procedure</h2><p>Alternatively, we could apply the Benjamini-Hochberg procedure to control the False Discovery Rate (FDR). The procedure involves sorting the p-values from smallest to largest and comparing each p-value $p_i$ with the threshold: $$p_i \leq \frac{i}{m} \cdot \alpha$$ where $i$ is the rank of the p-value and $m$ is the total number of tests. We then find the largest rank $i$ for which this criterion holds and declare the hypotheses with the $i$ smallest p-values significant.
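Both corrections take only a few lines to apply; a sketch using hypothetical p-values for the three coefficients (the values 0.004, 0.020, and 0.030 are made up for illustration):

```python
import numpy as np

alpha = 0.05
p = np.array([0.004, 0.020, 0.030])        # hypothetical p-values for beta_1..3
m = len(p)

# Bonferroni: test each hypothesis at alpha / m
bonf_reject = p <= alpha / m
print(bonf_reject)                         # only the smallest p survives 0.0167

# Benjamini-Hochberg step-up: find the largest k with p_(k) <= (k/m) * alpha,
# then reject the hypotheses with the k smallest p-values
order = np.argsort(p)
thresh = (np.arange(1, m + 1) / m) * alpha
below = p[order] <= thresh
k = below.nonzero()[0].max() + 1 if below.any() else 0
bh_reject = np.zeros(m, dtype=bool)
bh_reject[order[:k]] = True
print(bh_reject)                           # BH rejects all three here
```

Note how the FDR-controlling procedure is less conservative: with these p-values, Bonferroni keeps only one rejection while BH keeps all three.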
This framework allows us to assess the individual effects of $D_1$, $D_2$, and $D_3$ while properly accounting for multiple hypothesis testing.</p><h1 id="Variance-Reduction-CUPED"><a href="#Variance-Reduction-CUPED" class="headerlink" title="Variance Reduction: CUPED"></a>Variance Reduction: CUPED</h1><p>When analyzing the effectiveness of a recommender system, sometimes your metric of interest, $Y_i$, is skewed by high variance. One easy way to fix this is by using the usual outlier removal suite of techniques. However, outlier removal is a difficult thing to statistically define, and very often you may be losing “whales”: customers who are truly large consumers of a product. One easy alternative would be to normalize the metric by its mean, i.e. $Y_i = \frac{Y_i}{\bar Y}$. An even better way would be to normalize the metric by that user’s own mean, i.e. $Y_i = \frac{Y_i}{\bar Y_i}$. This is the idea behind CUPED.</p><p>Consider the regression form of the treatment equation,</p>$$Y_{i, t=1} = \alpha + \beta D_i + \varepsilon_i$$<p>Assume you have data about the metric from before the experiment was run, with values $Y_{i,t=0}$, where the subscript denotes individual $i$’s outcome in the pre-treatment period $t=0$. The CUPED-adjusted metric is</p>$$\hat Y^{cuped}_{i, t=1} = Y_{i, t=1} - \theta \left( Y_{i, t=0} - \mathbb E [Y_{t=0}] \right)$$<p>This is like running a regression of $Y_{t=1}$ on $Y_{t=0}$ and keeping the residuals,</p>$$Y_{i, t=1} = \theta Y_{i, t=0} + \hat Y^{cuped}_{i, t=1}$$<p>Now, use those residuals in the treatment equation above,</p>$$\hat Y^{cuped}_{i, t=1} = \alpha + \beta D_i + \varepsilon_i$$<p>And then estimate the treatment effect.</p><p>The statistical theory behind CUPED is fairly simple and setting up the regression equation is not difficult. However, in my experience, choosing the right window for pre-treatment covariates is extremely difficult; choose the right window and you reduce your variance by a lot. The right window depends a lot on your business.
Some key considerations:</p><ul><li>Sustained purchasing behavior is a key requirement. If $Y_{t=0}$ is not a good predictor of $Y_{t=1}$ over the interval $t=0$ to $t=1$, then the variance of $Y^{cuped}$ will be high, defeating the purpose.</li><li>Longer windows come with computational costs.</li><li>In practice, because companies are testing things all the time, you could have noise left over from a previous experiment that you need to randomize over or control for.</li></ul><h2 id="Simulating-CUPED"><a href="#Simulating-CUPED" class="headerlink" title="Simulating CUPED"></a>Simulating CUPED</h2><p>One way you can guess a good pre-treatment window is by simulating the treatment effect for various levels of MDE (the change you expect to see in $Y_i$) and plotting the probability of rejecting the null hypothesis when the alternative is true, i.e. power.</p><p><img src="/2024/11/08/consulting-ab-testing/variance_reduction_vs_Lift.png" alt="MDE vs Power for 2 Different Metrics"></p><p>So you read off your hypothesized MDE and power, and then every point to the left of that is a good window. As an example, let’s say you know your MDE to be $3\%$ and you want a power of $0.8$; then your only option is the 16-week window. Analogously, if you have an MDE of $5\%$ and you want a power of $0.8$, then the conventional method (with no CUPED) is fine, as you can attain an MDE of $4\%$ with a power of $0.8$. Finally, if you have an MDE of $4\%$ and you want a power of $0.8$, then a 1-week window is fine.</p><p>Afterwards, you can check that you have made the right choice by plotting the variance reduction factor against the pre-period (weeks) and checking that the variance reduction factor is high.</p><p>CUPED is a very powerful technique, but if I could give one word of advice to anyone trying to do it, it would be this:<br><em>get the pre-treatment window<br>right</em>. This has more to do with business intelligence than with statistics.
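Mechanically, the CUPED adjustment itself is only a few lines; a sketch on simulated data, where the pre/post relationship and the true lift of 0.5 are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
d = rng.integers(0, 2, n)                       # treatment assignment D_i
y_pre = rng.standard_normal(n)                  # pre-treatment metric Y_{i,t=0}
y = 0.5 * d + 0.8 * y_pre + rng.standard_normal(n)   # assumed true lift 0.5

# theta is the slope of Y_{t=1} on Y_{t=0}; subtracting the centred
# pre-treatment metric removes the variance it explains
theta = np.cov(y, y_pre)[0, 1] / np.var(y_pre)
y_cuped = y - theta * (y_pre - y_pre.mean())

naive = y[d == 1].mean() - y[d == 0].mean()
cuped = y_cuped[d == 1].mean() - y_cuped[d == 0].mean()
print(naive, cuped, y.var(), y_cuped.var())     # same effect, smaller variance
```

The point estimate of the lift is unchanged in expectation; only the variance of the metric (and hence of the estimate) shrinks.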
In this specific example longer windows gave higher variance reduction, but I have seen cases where a “sweet spot” exists.</p><h1 id="Variance-Reduction-CUPAC"><a href="#Variance-Reduction-CUPAC" class="headerlink" title="Variance Reduction: CUPAC"></a>Variance Reduction: CUPAC</h1><p>As it turns out, we can reduce variance by other means using the same principle as CUPED. The idea is to use a control variate that is not a function of the treatment. Recall the regression equation we ran for CUPED, $$Y_{i, t=1} = \theta Y_{i, t=0} + \hat Y^{cuped}_{i, t=1}$$ Generally speaking, this is often posed as finding some $X$ that is uncorrelated with the treatment but correlated with $Y$.</p>$$Y_{i, t=1} = \theta X_{i, t=0} + \hat Y^{cuped}_{i, t=1}$$<p>You could use<br><em>any</em> $X$ that is uncorrelated with the treatment but correlated with $Y$. An interesting thing to try would be to fit a highly non-linear machine learning model (such as a random forest or XGBoost) to $Y_t$ using a set of observable variables $Z_t$; call it $f(Z_t)$. Then use $f(Z_t)$ as your $X$.</p>$$Y_{i, t=1} = \theta f(Z_{i,t=1}) + \hat Y^{cuped}_{i, t=1}$$<p>Notice a few things here:</p><ul><li>$f(Z)$ is not a function of $D_i$ but is predictive of $Y_i$.</li><li>$f(Z)$ does not (necessarily) need any data from $t=0$ to be calculated, so it is okay if <em>no pre-treatment data exists</em>!</li><li>If pre-treatment data does exist, you can use it to fit $f(Z)$ and then predict $Y$ at $t=1$ as well, which can only enhance the fit and thereby reduce variance even more.</li></ul><p>If you really think about it, any process to create pre-treatment covariates inevitably involves finding some $X$ highly correlated with the outcome and uncorrelated with the treatment and controlling for that.
In CUPAC we just dump all of that into one ML model and let the model figure out the best way to control for variance using all the variables we threw into it.</p><p>I highly recommend CUPAC over CUPED; it is a more general technique and can be used in a wider variety of situations. If you really want to, you can throw $Y_{t=0}$ into the mix as well!</p><h2 id="A-Key-Insight-Recommendation-Engines-and-CUPAC-CUPED"><a href="#A-Key-Insight-Recommendation-Engines-and-CUPAC-CUPED" class="headerlink" title="A Key Insight: Recommendation Engines and CUPAC&#x2F; CUPED"></a>A Key Insight: Recommendation Engines and CUPAC&#x2F; CUPED</h2><p>Take a step back and think about what $f(Z)$ is<br><em>really</em> saying in the context of a recommender system: given some $Z$, can I predict my outcome metric? Let us say the outcome metric is some $G(Y)$, where $Y$ is sales.</p>$$G(Y) = G(f(Z)) + \varepsilon$$<p>What is a recommender system? It takes some $Z$ and predicts $Y$.</p>$$\hat Y = r(Z) + \varepsilon'$$$$G(\hat Y) = G(r(Z)) + \varepsilon''$$<p>This basically means that a pretty good function to control for variance is a recommender system itself! Now you can see why CUPAC is so powerful: you have all the pieces ready for you. HOWEVER! You cannot use the recommender system you are currently testing as your $f(Z)$; that would mean that $D_i$ is correlated with $f(Z)$, which would violate the assumption of uncorrelatedness. Usually, the existing recommender system (the pre-treatment one) can be used for this purpose. The final variable $Y^{cupac}$ then has a nice interpretation: it is not the difference between what people<br><em>truly</em> did and the recommended value, but rather the difference between the two recommender systems! Any model is a variance reduction model; it is just a question of how much variance it reduces.
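A CUPAC-style sketch, with a plain linear model standing in for the ML model $f(Z)$ to stay dependency-light (the covariates, their weights, and the true lift of 0.4 are all simulated assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
z = rng.standard_normal((n, 5))            # observable covariates Z
d = rng.integers(0, 2, n)                  # treatment, independent of Z
w = np.array([1.0, -0.5, 0.3, 0.0, 0.8])   # hypothetical signal in Z
y = 0.4 * d + z @ w + rng.standard_normal(n)   # assumed true lift 0.4

# Fit f(Z) on the control group only, so f(Z) is not a function of D_i;
# a linear fit stands in for the random forest / XGBoost of the text
Zc, yc = z[d == 0], y[d == 0]
coef, *_ = np.linalg.lstsq(Zc, yc - yc.mean(), rcond=None)
f = z @ coef                               # the CUPAC control variate f(Z)

theta = np.cov(y, f)[0, 1] / np.var(f)
y_cupac = y - theta * (f - f.mean())
diff = y_cupac[d == 1].mean() - y_cupac[d == 0].mean()
print(diff, y.var(), y_cupac.var())        # same lift, much smaller variance
```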
If the existing recommender system is good enough, it is likely to reduce a lot of variance. If it is terrible (which is why they hired you in the first place), then this approach is unlikely to work. But in my experience, existing recommendations in the industry are always pretty good; it is a question of finding those last few drops of performance increase.</p><h1 id="Conclusion"><a href="#Conclusion" class="headerlink" title="Conclusion"></a>Conclusion</h1><p>The above is pretty much all you can expect to encounter in terms of evaluating models in consulting. In my experience, all the possibilities that could undermine your test are worth thinking through <em>before</em> embarking on the A/B test.</p>]]>
    </content>
    <id>https://franciscormendes.github.io/2024/11/08/consulting-ab-testing/</id>
    <link href="https://franciscormendes.github.io/2024/11/08/consulting-ab-testing/"/>
    <published>2024-11-08T00:00:00.000Z</published>
    <summary>A/B testing in management consulting: frequentist and Bayesian methods, multiple testing corrections, and the specific challenges of testing recommender systems in a live production environment.</summary>
    <title>The Management Consulting Playbook for AB Testing (with an emphasis on Recommender Systems)</title>
    <updated>2026-04-10T14:24:00.543Z</updated>
  </entry>
  <entry>
    <author>
      <name>Francisco Romaldo Fernandes Mendes</name>
    </author>
    <category term="physics" scheme="https://franciscormendes.github.io/categories/physics/"/>
    <category term="mathematics" scheme="https://franciscormendes.github.io/tags/mathematics/"/>
    <category term="physics" scheme="https://franciscormendes.github.io/tags/physics/"/>
    <content>
<![CDATA[<h1 id="Introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h1><p>It is often difficult to speak about things like singularities because of their prevalence in pop culture. Oftentimes a concept like this takes on a life of its own, forever ingrained in one’s imagination as a still from a movie (for me it is the scene from Interstellar where they encounter Gargantua for the first time). Like many concepts in theoretical physics, popular culture is often better at bringing them into light than it is at bringing them into focus. In this article I will try to explain in simple terms what a singularity is and how that relates to physical reality. As always, I will give an exact example of a singularity by means of an equation. At the end, once the mathematics is clear, I will try to explain what the physical reality of the singularity is.</p><h1 id="Mathematical-Singularities"><a href="#Mathematical-Singularities" class="headerlink" title="Mathematical Singularities"></a>Mathematical Singularities</h1><p>Consider the singularity of $f(x) = \frac{1}{x}$.</p><p>1. <strong>Behavior of the Function:</strong></p>$$f(x) = \frac{1}{x}$$<ul><li>As $x \to 0^+$ (approaching from the positive side): $$f(x) \to +\infty$$</li><li>As $x \to 0^-$ (approaching from the negative side): $$f(x) \to -\infty$$</li></ul><p>At $x = 0$, the magnitude of the function grows without bound, making $x = 0$ a singularity. This is a <strong>pole</strong> of the function, where the value tends to infinity.</p><p>2. 
<strong>Undefined at the Singularity:</strong></p><p>The function $f(x) = \frac{1}{x}$ is <strong>undefined</strong> at $x = 0$, which is the point of discontinuity.</p><p>In mathematics, singularities are not a problem: we simply exclude the point from the function’s domain.</p><h1 id="Physics-Singularities"><a href="#Physics-Singularities" class="headerlink" title="Physics Singularities"></a>Physics Singularities</h1><p>The singularity of a black hole can be described by the <strong>Schwarzschild metric</strong>, which is the solution to Einstein’s field equations for a non-rotating, uncharged black hole. The Schwarzschild metric is given by:</p>$$ds^2 = - \left( 1 - \frac{2GM}{r c^2} \right) c^2 dt^2 + \left( 1 - \frac{2GM}{r c^2} \right)^{-1} dr^2 + r^2 \left( d\theta^2 + \sin^2 \theta \, d\phi^2 \right)$$<p>Where:</p><ul><li>$ds^2$ is the spacetime interval,</li><li>$c$ is the speed of light,</li><li>$G$ is the gravitational constant,</li><li>$M$ is the mass of the black hole,</li><li>$r$ is the radial coordinate,</li><li>$\theta$ and $\phi$ are angular coordinates.</li></ul><p>Be careful though: these are not ordinary polar coordinates but Schwarzschild coordinates, a kind of nested spherical coordinate system. This does not affect the solution, but it is helpful to know.</p><p>The singularity occurs at $r = 0$. As $r \to 0$, the term $\frac{2GM}{r c^2}$ grows without bound, leading to an infinite curvature of spacetime. This represents the <strong>physical singularity</strong> of the black hole.</p><p>Additionally, the $g_{tt}$ component of the Schwarzschild metric, which is the time-time component, becomes singular as $r \to 0$:</p>$$g_{tt} = - \left( 1 - \frac{2GM}{r c^2} \right)$$<p>As $r \to 0$, $g_{tt} \to +\infty$, indicating the breakdown of spacetime and the presence of a singularity.</p><p>The metric also becomes singular at $r = 2GM/c^2$; this is the event horizon of the black hole. This is the point at which light can no longer escape the black hole.
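For a sense of scale, the Schwarzschild radius $r_s = 2GM/c^2$ of one solar mass comes out to roughly 3 km; a quick check of the arithmetic:

```python
G = 6.674e-11        # gravitational constant, m^3 kg^-1 s^-2
c = 2.998e8          # speed of light, m/s
M_sun = 1.989e30     # solar mass, kg

r_s = 2 * G * M_sun / c ** 2
print(r_s)           # about 2.95e3 metres, i.e. ~3 km
```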
However, this is solely a coordinate singularity in the mathematics, since you can still define the metric at this point by a change of coordinates. One such set of coordinates is the Kruskal-Szekeres coordinates, which are used to describe the Schwarzschild metric in a way that is regular across the event horizon.</p><p>The Schwarzschild metric in Kruskal-Szekeres coordinates is given by:</p>$$ds^2 = \frac{32 G^3 M^3}{r c^6} e^{-r/r_s} \left( -dU \, dV \right) + r^2 \left( d\theta^2 + \sin^2 \theta \, d\phi^2 \right)$$<p>where $r$ is a function of $U$ and $V$, implicitly determined by:</p>$$U V = \left( \frac{r}{r_s} - 1 \right) e^{r / r_s}$$<p>Here, $r_s$ is the Schwarzschild radius:</p>$$r_s = \frac{2GM}{c^2}$$<p>The coordinate singularity at $r = r_s$ in the Schwarzschild metric is removed by transforming to Kruskal-Szekeres coordinates, and the metric remains regular across the event horizon.</p><h1 id="Another-Physics-Singularity"><a href="#Another-Physics-Singularity" class="headerlink" title="Another Physics Singularity"></a>Another Physics Singularity</h1><p>Again, starting from yet another solution of the field equations, we can derive the FLRW metric (Friedmann-Lemaître-Robertson-Walker metric), which describes the universe as a whole. The words homogeneous and isotropic effectively mean that instead of considering each individual body in the universe as an actual individual body, we consider them to be individual particles in a fluid (in fact, the FLRW metric considers each galaxy to be a particle). We do this so that we can use equations for fluids to simplify the stress energy tensor $T$ in the Field Equations.
Our strategy to solve the field equations is as follows:</p><ol><li>Assume the universe is some kind of fluid (so basically zoom out till all the galaxies look like a fluid).</li><li>From 1, write down the stress energy tensor $T_{\mu\nu}$ for the fluid; this is a simple equation. ($T_{\mu\nu}$ is $0$ for the Schwarzschild metric, and for many other useful metrics, so we never really had this problem before, but when you zoom out you need it.)</li><li>Substitute the FLRW metric ansatz and this $T_{\mu\nu}$ into the field equations and solve for the scale factor $a(t)$; the resulting equations are the Friedmann equations.</li></ol><p>The FLRW metric, which describes a homogeneous and isotropic universe, is given by:</p>$$ds^2 = - c^2 dt^2 + a(t)^2 \left( \frac{dr^2}{1 - k r^2} + r^2 d\theta^2 + r^2 \sin^2 \theta \, d\phi^2 \right)$$<p>Where:</p><ul><li>$ds^2$ is the spacetime interval,</li><li>$c$ is the speed of light,</li><li>$t$ is the cosmic time,</li><li>$a(t)$ is the scale factor of the universe,</li><li>$r$ is the radial coordinate,</li><li>$\theta$ and $\phi$ are angular coordinates,</li><li>$k$ is the curvature of space, which can be $-1$, $0$, or $1$.</li><li>The scale factor $a(t)$ describes how the universe expands or contracts with time.</li><li>The curvature parameter $k$ determines the geometry of space: negative curvature for $k = -1$, flat curvature for $k = 0$, and positive curvature for $k = 1$.</li></ul><h2 id="Friedmann-Equations-Recap"><a href="#Friedmann-Equations-Recap" class="headerlink" title="Friedmann Equations Recap"></a>Friedmann Equations Recap</h2><p>The <strong>Big Bang</strong> is represented in the <strong>Friedmann equations</strong> as a <strong>singularity</strong> at the beginning of time, when the scale factor $a(t)$ approaches zero. This signifies an initial state of infinite density, temperature, and curvature.</p><p>The <strong>Friedmann equations</strong> in cosmology are derived from Einstein’s field equations for a <strong>homogeneous and isotropic</strong> universe.
Assuming zero cosmological constant ($\Lambda = 0$), they are:</p><ol><li><p><strong>First Friedmann Equation</strong>: $$    \left( \frac{\dot{a}}{a} \right)^2 = \frac{8 \pi G}{3} \rho - \frac{k}{a^2}    $$</p></li><li><p><strong>Second Friedmann Equation (acceleration equation)</strong>: $$    \frac{\ddot{a}}{a} = - \frac{4 \pi G}{3} \left( \rho + \frac{3p}{c^2} \right)    $$</p></li><li><p><strong>Continuity Equation (conservation of energy)</strong>: $$    \dot{\rho} + 3 \frac{\dot{a}}{a} \left( \rho + \frac{p}{c^2} \right) = 0    $$</p></li></ol><p>where:</p><ul><li>$a(t)$ is the <strong>scale factor</strong> (the “size” of the universe at a given time $t$),</li><li>$\rho$ is the <strong>energy density</strong>,</li><li>$p$ is the <strong>pressure</strong>,</li><li>$G$ is the gravitational constant,</li><li>$k$ is the <strong>curvature parameter</strong> ($k = 0$ for a flat universe, $k = +1$ for closed, and $k = -1$ for open).</li></ul><h2 id="Representation-of-the-Big-Bang-Singularity"><a href="#Representation-of-the-Big-Bang-Singularity" class="headerlink" title="Representation of the Big Bang Singularity"></a>Representation of the Big Bang Singularity</h2><p>In the context of the Friedmann equations, the <strong>Big Bang</strong> is identified by the conditions:</p><ul><li>$a(t) \to 0$ as $t \to 0$,</li><li>$\rho \to \infty$ as $a(t) \to 0$ (implying infinite density and temperature),</li><li><strong>curvature</strong> becomes infinite, signaling a physical singularity.</li></ul><h3 id="Explanation-Using-the-First-Friedmann-Equation"><a href="#Explanation-Using-the-First-Friedmann-Equation" class="headerlink" title="Explanation Using the First Friedmann Equation"></a>Explanation Using the First Friedmann Equation</h3><p>In the <strong>first Friedmann equation</strong>: $$\left( \frac{\dot{a}}{a} \right)^2 = \frac{8 \pi G}{3} \rho - \frac{k}{a^2}$$</p><p>As $t \to 0$:</p><ul><li>The <strong>scale factor</strong> $a(t)$ approaches zero.</li><li>For a positive energy density $\rho$, the term $\frac{\dot{a}}{a}$ (known as the <strong>Hubble parameter</strong>) goes to infinity, meaning the rate of expansion is initially unbounded.</li><li>If $a \to 0$, then the energy density $\rho \to \infty$, since $\rho$ is inversely related to the volume of the universe.</li></ul><p>Thus, at $a = 0$, the universe is in a state of <strong>infinite density</strong> and <strong>infinite curvature</strong>, which we identify as the Big Bang singularity.</p><h3 id="Continuity-Equation-and-Energy-Conservation"><a href="#Continuity-Equation-and-Energy-Conservation" class="headerlink" title="Continuity Equation and Energy Conservation"></a>Continuity Equation and Energy Conservation</h3><p>The <strong>continuity equation</strong>: $$\dot{\rho} + 3 \frac{\dot{a}}{a} \left( \rho + \frac{p}{c^2} \right) = 0$$</p><p>implies that as $a(t)$ approaches zero, the rapid change in the scale factor causes the energy density $\rho$ to increase sharply, reinforcing the singularity concept.</p><h2 id="Physical-Interpretation"><a href="#Physical-Interpretation" class="headerlink" title="Physical Interpretation"></a>Physical Interpretation</h2><p>At $t = 0$, when the scale factor $a(t) = 0$, the energy density $\rho$ theoretically becomes infinite, meaning all mass, energy, and curvature are compressed into a single point.
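</p><p>To make the limit $a \to 0$ concrete, here is a toy numerical sketch (my own, in units where $8\pi G/3 = 1$ and $c = 1$) of a matter-dominated flat universe: with $k = 0$ and $\rho = a^{-3}$, the first Friedmann equation reduces to $\dot a = a^{-1/2}$, whose exact solution $a(t) = (3t/2)^{2/3}$ vanishes at $t = 0$ while the density diverges:</p>

```python
def a_exact(t):
    """Exact matter-dominated solution a(t) = (3t/2)^(2/3)."""
    return (1.5 * t) ** (2.0 / 3.0)

def a_euler(t_end, n_steps=200_000):
    """Forward-Euler integration of the Friedmann equation da/dt = a**-0.5."""
    t = 1e-9                  # start just after the singularity
    a = a_exact(t)
    dt = (t_end - t) / n_steps
    for _ in range(n_steps):
        a += dt * a ** -0.5
        t += dt
    return a

t_end = 2.0 / 3.0             # a_exact(2/3) = 1 in these units
print(a_euler(t_end))         # close to 1.0
print(a_exact(1e-12))         # tiny: a -> 0 as t -> 0, so rho = a**-3 -> infinity
```

<p>The integrator is deliberately crude; the point is only that the scale factor reaches $a = 0$, and hence $\rho = a^{-3} \to \infty$, at a finite time in the past.</p><p>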
This condition marks the <strong>beginning of the universe</strong>, as described by the Big Bang theory, before which classical descriptions of time and space may no longer apply due to quantum gravitational effects.</p><p>In short, the <strong>Big Bang singularity</strong> in the Friedmann equations marks the initial state of the universe at $t = 0$, where $a = 0$, density and temperature are infinite, and classical general relativity predicts a breakdown in spacetime structure.</p><h1 id="Connection-to-Reality"><a href="#Connection-to-Reality" class="headerlink" title="Connection to Reality"></a>Connection to Reality</h1><p>While all of the above can be found in a basic undergraduate textbook, my goal in writing this post was to gather examples of singularities from both mathematics and physics and ask what they say about reality. In the mathematical examples, $x=0$ does not represent an actual place where we can go and take measurements of $y$. But what if it did? What if we knew a physical situation in the world where the function $\frac{1}{x}$ really described its behavior? This is not hard: you could imagine it as the share each person gets (of a cake or similarly sweet treat) if there are $x$ people. If there are $3$ people, each person gets a $\frac{1}{3}$ share. What does it mean to have $0$ people? This is the kind of question that the mathematical singularity poses. But it is physically impossible to divide a cake among $0$ people, so the singularity is not a real place: if you had $0$ people and a cake, the question of dividing it does not make sense. In much the same way, the singularity of the Schwarzschild metric is not a real place; it is a place where the equations break down. This does not mean that some wild stuff happens at the singularity; it means that the equations we are using to describe the world are not valid at that point.
This is the same as saying that the function $\frac{1}{x}$ is not defined at $x=0$. Very often in movies, the singularity is portrayed as a place where the laws of physics break down. This is not true; it is just that the laws of physics defined by the equations work everywhere else but not at that point. This could mean one of two things:</p><ol><li>The equations are not valid at that point, so we need to find new equations that are valid there.</li><li>Some wild stuff happens at that point, and we need to find out what that is and rework our equations to include it.</li></ol><p>But simply by looking at the equations, we cannot say which of the two is true. We need to go out and measure the world to find out.</p><h1 id="References"><a href="#References" class="headerlink" title="References"></a>References</h1><p><a href="https://diposit.ub.edu/dspace/bitstream/2445/59759/1/TFG-Arnau-Romeu-Joan.pdf">https://diposit.ub.edu/dspace/bitstream/2445/59759/1/TFG-Arnau-Romeu-Joan.pdf</a></p>]]>
    </content>
    <id>https://franciscormendes.github.io/2024/10/22/singularities/</id>
    <link href="https://franciscormendes.github.io/2024/10/22/singularities/"/>
    <published>2024-10-22T00:00:00.000Z</published>
    <summary>From 1/x to the Schwarzschild radius: what singularities mean in analysis and general relativity, and whether they correspond to anything physically real.</summary>
    <title>A Short Note on Singularities in Physics and Mathematics</title>
    <updated>2026-04-10T14:24:00.562Z</updated>
  </entry>
  <entry>
    <author>
      <name>Francisco Romaldo Fernandes Mendes</name>
    </author>
    <category term="machine-learning" scheme="https://franciscormendes.github.io/categories/machine-learning/"/>
    <category term="machine-learning" scheme="https://franciscormendes.github.io/tags/machine-learning/"/>
    <category term="recommender-systems" scheme="https://franciscormendes.github.io/tags/recommender-systems/"/>
    <category term="low-rank-approximation" scheme="https://franciscormendes.github.io/tags/low-rank-approximation/"/>
    <category term="graph-neural-networks" scheme="https://franciscormendes.github.io/tags/graph-neural-networks/"/>
    <content>
<![CDATA[<h1 id="Introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h1><p><em>When do I use “old-school” ML models like matrix factorization and when do I use graph neural networks?</em> </p><p><em>Can we do something better than matrix factorization?</em> </p><p><em>Why can’t we use neural networks? What is matrix factorization anyway?</em> </p><p>These are just some of the questions I get asked whenever I start a recommendation engine project. Answering them requires a good understanding of both algorithms, which I will try to outline here. The usual way to understand the benefit of one algorithm over the other is by trying to prove that one is a special case of the other.</p><p>While it can be shown that a Graph Neural Network can be expressed as a matrix factorization problem, the matrix involved is not easy to interpret in the usual sense. Contrary to popular belief, matrix factorization (MF) is not “simpler” than a Graph Neural Network (nor is the opposite true). To make matters worse, the GCN is actually more expensive to train, since it takes far more cloud compute than MF. The goal of this article is to provide some intuition as to when a GCN might be worthwhile to try out.</p><p>This article is primarily aimed at data science managers with some background in linear algebra (or not, see next sentence) who may or may not have used a recommendation engine package before.
Having said that, if you are not comfortable with some of the proofs, I have a key takeaways subsection in each section that should form a good basis for decision making, and that other team members can then dig deeper into.</p><h1 id="Key-Tenets-of-Linear-Algebra-and-Graphs-in-Recommendation-Engine-design"><a href="#Key-Tenets-of-Linear-Algebra-and-Graphs-in-Recommendation-Engine-design" class="headerlink" title="Key Tenets of Linear Algebra and Graphs in Recommendation Engine design"></a>Key Tenets of Linear Algebra and Graphs in Recommendation Engine design</h1><p>The key tenets of design come down to the difference between a graph and a matrix. The link between graph theory and linear algebra comes from the fact that ALL graphs come with an adjacency matrix. More complex versions of this matrix (the degree matrix, random-walk matrices) capture more complex properties of the graph. Thus you can usually express any theorem in graph theory in matrix form by use of the appropriate matrix.</p><ol><li>The matrix factorization of the interaction matrix (defined below) is the most commonly used form of matrix factorization, since this matrix is the easiest to interpret.</li><li><em>Any</em> Graph Convolutional Neural Network can be expressed as the factorization of <em>some</em> matrix; this matrix is usually far removed from the interaction matrix and is complex to interpret.</li><li>For a given matrix to be factorized, matrix factorization requires fewer parameters and is therefore easier to train.
</li><li>Graphical structures are easily interpretable even if the matrices expressing their behavior are not.</li></ol><h1 id="Tensor-Based-Methods"><a href="#Tensor-Based-Methods" class="headerlink" title="Tensor Based Methods"></a>Tensor Based Methods</h1><p>In this section, I will formulate the recommendation engine problem as a large tensor or matrix that needs to be “factorized”.<br>In one of my largest projects in Consulting, I spearheaded the creation of a recommendation engine for a top 5 US retailer. This project presented a unique challenge: the scale of the data we were working with was staggering. The recommendation engine had to operate on a 3D tensor, made up of products × users × time. The sheer size of this tensor required us to think creatively about how to scale and optimize the algorithms.</p><p>Let us start with some definitions. Assume we have $n_u$, $n_v$ and $n_t$ users, products and time points respectively.</p><ol><li><p>User latent features, given by matrix $U$ of dimension $n_u \times r$, whose $i$th row is $u_i$</p></li><li><p>Product latent features, given by matrix $V$ of dimension $n_v \times r$, whose $j$th row is $v_j$</p></li><li><p>Time latent features, given by matrix $T$ of dimension $n_t \times r$, whose $k$th row is $t_k$</p></li><li><p>Interaction, given by $y_{ijk}$ in the tensor case and $y_{ij}$ in the matrix case. Usually this represents either a purchasing decision, a rating (which is why it is common to name it $r_{ijk}$), or a search term.
I will use the generic term “interaction” to denote any of the above.</p></li></ol><p>In the absence of a third dimension one could look at it as a matrix factorization problem, as shown in the image below,</p><p><img src="/2024/09/28/graph-convolutional-neural-network-and-matrix-factorization/matrix_Factorization.png" alt="Matrix Factorization"></p><p>Increasingly, however, it is important to take other factors into account when designing a recommendation system, such as context and time. This has led to the tensor case being the more usual one.</p><p><img src="/2024/09/28/graph-convolutional-neural-network-and-matrix-factorization/tensor_factorization.png" alt="Tensor Factorization"></p><p>This means that for the $i$th user and $j$th product at the $k$th moment in time, the interaction $y_{ijk}$ is functionally represented by the three-way dot product of the corresponding rows of these $3$ matrices, $$y_{ijk} \approx u_i\cdot v_j\cdot t_k = \sum_{f=1}^{r} u_{if} v_{jf} t_{kf}$$ An interaction $y_{ijk}$ can take a variety of forms. The most common approach, which we follow here, is $y_{ijk} = 1$ if the $i$th user interacted with the $j$th product at the $k$th instance, and $0$ otherwise. But other, more complex functional forms can exist, where we use the rating of an experience at that moment: instead of $y \in \{0,1\}$ we have the more general form $y \in \mathbb{R}$. Thus this framework is able to handle a variety of interaction functions. A question we often get is whether this function is inherently linear, since it is a dot product of rows of multiple matrices. We can handle non-linearity in this framework as well, via the use of a non-linear function (a.k.a. an activation function), $$y_{ijk} \approx 1 - e^{-u_i\cdot v_j\cdot t_k}$$ Or something along those lines.
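</p><p>As a concrete sketch of the triple product above (toy sizes and random factors, purely illustrative; assuming NumPy):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_u, n_v, n_t, r = 100, 50, 12, 8        # toy sizes

U = rng.normal(size=(n_u, r))            # user latent features
V = rng.normal(size=(n_v, r))            # product latent features
T = rng.normal(size=(n_t, r))            # time latent features

def predict(i, j, k):
    """y_ijk ~ u_i . v_j . t_k, i.e. sum over f of U[i,f] * V[j,f] * T[k,f]."""
    return float(np.sum(U[i] * V[j] * T[k]))

# The whole predicted tensor in one vectorized call:
Y_hat = np.einsum("if,jf,kf->ijk", U, V, T)
print(Y_hat.shape)                       # (100, 50, 12)
assert np.isclose(Y_hat[3, 4, 5], predict(3, 4, 5))
```

<p>The <code>einsum</code> call computes every $y_{ijk}$ at once: it is the loop over the shared latent index $f$, written as a single vectorized operation.</p><p>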
However, one of the attractions of this approach is that it is absurdly simple to set up.</p><h2 id="Side-Information"><a href="#Side-Information" class="headerlink" title="Side Information"></a>Side Information</h2><p>Very often in a real-world use case, clients have information that they are eager to use in a recommendation system. This ranges from user demographic data that they know from experience is important, to product attribute data that has been generated by a different machine learning algorithm. In such a case we can integrate that into the equation given above,</p>$$y_{ijk} \approx u_i\cdot v_j\cdot t_k  +  v_j \cdot v'_j + u_i \cdot u'_i$$<p>where $u'_i, v'_j$ are attribute vectors for users and products that are known beforehand. Each of these vectors is a row of $U'$ or $V'$, the so-called “side-information” matrices.</p><h2 id="Optimization"><a href="#Optimization" class="headerlink" title="Optimization"></a>Optimization</h2><p>We can then set up the following loss function (writing $W_t$ for the time factor matrix called $T$ above),</p>$$\mathcal{L}(X, U, V, W_t, U', V') = \| X - (U \cdot V \cdot W_t) \|^2 + \lambda_1 \| U \cdot U' - X_u \|^2 + \lambda_2 \| V \cdot V' - X_p \|^2 + \lambda_3 (\| U \|^2 + \| V \|^2 + \| W_t \|^2)$$<p>Where:</p><ul><li>$\lambda_1$ and $\lambda_2$ are regularization terms for the alignment with side information.</li><li>$\lambda_3$ controls the regularization of the latent matrices $U$, $V$, and $W_t$.</li><li><p>The first term is the reconstruction loss of the tensor, ensuring that the interaction between users, products, and time is well represented.</p></li><li><p>The second and third terms align the latent factors with the side information for users and products, respectively.</p></li></ul><h2 id="Tensor-Factorization-Loop"><a href="#Tensor-Factorization-Loop" class="headerlink" title="Tensor Factorization Loop"></a>Tensor Factorization Loop</h2><p>For each iteration:</p><ol><li><p>Compute the predicted tensor using the factorization: $$\hat{X} = U \cdot V
\cdot W_t$$</p></li><li><p>Compute the loss using the updated loss function.</p></li><li><p>Perform gradient updates for $U$, $V$, and $W_t$.</p></li><li><p>Regularize the alignment of $U$ and $V$ with $U'$ and $V'$.</p></li><li><p>Repeat until convergence.</p></li></ol><h2 id="Key-Takeaway"><a href="#Key-Takeaway" class="headerlink" title="Key Takeaway"></a>Key Takeaway</h2><p>Matrix factorization allows us to decompose a matrix into two low-rank matrices, which provide insights into the properties of users and items. These matrices, often called embeddings, either embed given side information or reveal latent information about users and items based on their interaction data. This is powerful because it creates a representation of user-item relationships from behavior alone.</p><p>In practice, these embeddings can be valuable beyond prediction. For example, clients often compare the user embedding matrix $U$ with their side information to see how it aligns. Interestingly, clustering users based on $U$ can reveal new patterns that fine-tune existing segments. Rather than being entirely counter-intuitive, these new clusters may separate users with subtle preferences, such as distinguishing those who enjoy less intense thrillers from those who lean toward horror. This fine-tuning enhances personalization, as users in large segments often miss out on having their niche behaviors recognized.</p><p>Mathematically, the key takeaway is the following equation (at the risk of overusing a cliche, this is the $e=mc^2$ of the recommendation engine world),</p>$$y_{ij} = u_i'v_j + \text{possibly other regularization terms}$$<p>Multiplying the lower-dimensional representations of the $i$th user and the $j$th item together yields a real number that represents the magnitude of the interaction. Very low and it’s not going to happen; very high and it is. These two vectors are the “deliverable”! How we got there is irrelevant.
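</p><p>For the curious, one minimal way of getting there: plain gradient descent on $\| R - UV^T \|^2$ with a little L2 regularization (toy data and hyper-parameters of my own; any off-the-shelf MF library does this with far more care):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
R = np.array([[1., 0., 1.],
              [1., 1., 0.]])            # toy user-item interaction matrix
m, n, r = 2, 3, 2

U = 0.1 * rng.normal(size=(m, r))       # user embeddings
V = 0.1 * rng.normal(size=(n, r))       # item embeddings

lr, lam = 0.05, 1e-3                    # step size and L2 regularization
for _ in range(5000):
    E = R - U @ V.T                     # residuals y_ij - u_i . v_j
    U += lr * (E @ V - lam * U)         # gradient step for U
    V += lr * (E.T @ U - lam * V)       # gradient step for V

print(np.round(U @ V.T, 2))             # approximately reconstructs R
```

<p>After training, the rows of $U$ and $V$ are the deliverable: the product $UV^T$ approximately reconstructs $R$.</p><p>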
Turns out there are multiple ways of getting there. One of them is the Graph Convolutional Network. In recommendation engine literature (particularly for neural networks) embeddings are given by $H$; in the case of matrix factorization, $H$ is obtained by stacking $U$ and $V$,</p>$$H = [U \hspace{5 pt} V]$$<h2 id="Extensions"><a href="#Extensions" class="headerlink" title="Extensions"></a>Extensions</h2><p>You do not need to stick to the simple multiplication in the objective function; you can do something more complex,</p>$$\max \sum_{(i,j) \in E} \left( y_{ij} \log \sigma(U_i^T V_j) + (1 - y_{ij}) \log (1 - \sigma(U_i^T V_j)) \right)$$<p>The above is the LINE embedding objective, where $\sigma$ is the sigmoid function.</p><h1 id="Interaction-Tensors-as-Graphs"><a href="#Interaction-Tensors-as-Graphs" class="headerlink" title="Interaction Tensors as Graphs"></a>Interaction Tensors as Graphs</h1><p>One can immediately view the interactions between users and items as a bipartite graph, where an edge is present only if the user interacts with that item. It is then obvious that we can embed the interaction matrix inside the adjacency matrix, noting that there are no edges between users and there are no edges between items.</p><p>The adjacency matrix $A$ can be represented as:</p>$$A = \begin{bmatrix}0 & R \\R^T & 0\end{bmatrix}$$<p>Recall the matrix factorization $R = UV^T$, so that</p>$$A \approx\begin{bmatrix}0 & UV^T \\VU^T & 0\end{bmatrix}$$<p>where:</p><ul><li>$R$ is the user-item interaction matrix (binary values: 1 if a user has interacted with an item, 0 otherwise),</li><li>$R^T$ is the transpose of $R$, representing item-user interactions.</li></ul><p>For example, if $R$ is the following binary interaction matrix:</p>$$R = \begin{bmatrix}1 & 0 & 1 \\1 & 1 & 0\end{bmatrix}$$<p>Note here that $R$ could have contained real numbers (such as ratings, etc.), but the adjacency matrix is strictly binary.
Using the weighted adjacency matrix is perfectly “legal”, but has mathematical implications that we will discuss later. Thus, the adjacency matrix $A$ becomes: </p>$$A = \begin{bmatrix}0 & 0 & 1 & 0 & 1 \\0 & 0 & 1 & 1 & 0 \\1 & 1 & 0 & 0 & 0 \\0 & 1 & 0 & 0 & 0 \\1 & 0 & 0 & 0 & 0\end{bmatrix}$$<p><img src="/2024/09/28/graph-convolutional-neural-network-and-matrix-factorization/adjacency_matrix_graph.png" alt="Bipartite graph of user-items and ratings matrix"></p><h2 id="Matrix-Factorization-of-Adjacency-Matrix"><a href="#Matrix-Factorization-of-Adjacency-Matrix" class="headerlink" title="Matrix Factorization of Adjacency Matrix"></a>Matrix Factorization of Adjacency Matrix</h2><p>Now you could factorize $$A \approx LM^T$$ and then use the embeddings $L$ and $M$, but now $L$ represents embeddings for both users and items (as does $M$). However, this matrix is much bigger than $R$, since the top-left and bottom-right blocks are $0$. You are much better off using the $R = UV^T$ formulation to quickly converge on the optimal embeddings. The key here is that factorizing this matrix is roughly equivalent to factorizing the $R$ matrix. This is important because the adjacency matrix plays a key role in the graph convolutional network.</p><h1 id="What-are-the-REAL-Cons-of-Matrix-Factorization"><a href="#What-are-the-REAL-Cons-of-Matrix-Factorization" class="headerlink" title="What are the REAL Cons of Matrix Factorization"></a>What are the REAL Cons of Matrix Factorization</h1><p>Matrix factorization offers key advantages in a consulting setting by quickly assessing the potential of more advanced methods on a dataset. If the user-item matrix factorization performs well, it indicates useful latent user and item embeddings for predicting interactions. Additionally, regularization terms help estimate the impact of any side information provided by the client.
The resulting embeddings, which include both interaction and side information, can be used by marketing teams for tasks like customer segmentation and churn reduction.<br>First, let me clarify some oft-quoted misconceptions about matrix factorization disadvantages versus GCNs,</p><ol><li><p><em>User-item interactions are a simple dot product ($\hat y_{ij} = u_i'v_j$) and therefore cannot capture non-linearity.</em> This is not true: even in the case of a GCN, the final prediction is given by a simple dot product between the embeddings.</p></li><li><p><em>Matrix factorization cannot use existing features.</em> This is probably due to the fact that matrix factorization was popularized by the simple Netflix case, where only the user-item matrix was specified. But in reality, very early in the development of matrix factorization, all kinds of additional regularization terms, such as bias and side information, were introduced. The side-information matrices are where you can specify existing features (recall, $y_{ij} = u_i'v_j + \text{possibly other regularization terms}$).</p></li><li><p><em>Cannot handle cold start.</em> Neither matrix factorization nor neural networks handle the cold start problem very well. That said, this criticism is not entirely unfair: the neural network does do better, but more as a consequence of its truly revolutionary feature, which I will discuss under its true advantage.</p></li><li><p><em>Higher order interactions.</em> This is also false, but it is hard to see mathematically. Let me outline a simple approach to capture higher-order interactions. Consider the adjacency matrix $A$: $A^2$ gives you all paths of length $2$, so that $A + A^2$ represents all nodes that are at most $2$ edges away. You can then factorize this matrix to get what you want.
This criticism is not entirely unfair either: multiplying such huge matrices together is not advised, and neither is it the most intuitive method.</p></li></ol><p>The biggest problem with MF is that a matrix is simply not a good representation of how people interact with products and each other. Finding a good mathematical representation of the problem is sometimes the first step in solving it. Most of the benefits of a graph convolutional neural network come as a direct consequence of using a graph structure, not from the neural network architecture. The graph structure of user-item behavior is the most general representation of the problem.</p><p><img src="/2024/09/28/graph-convolutional-neural-network-and-matrix-factorization/graph_similarity.png" alt="2nd Limitation of Matrix Factorization Matrix Factorization cannot &quot;see&quot; that the neighborhood structure of node &lt;!--HEXOMATH101--&gt; and node &lt;!--HEXOMATH102--&gt; are identical"></p><ol><li><p>Complex Interactions - In this structure one can easily add edges between users and between products. Note that in the matrix factorization case this is not possible, since $R$ is only users × items. To include more complex interactions you pay the price with a larger and larger matrix.</p></li><li><p>Graph Structure - Perhaps the most visually striking feature of graph neural networks is that they can leverage the graph structure itself (see Figure 4). Matrix factorization cannot do so easily.</p></li><li><p>Higher-order interactions can be captured more intuitively than in the case of matrix factorization.</p></li></ol><p>Before implementing a GCN, it’s important to understand its potential benefits. In my experience, matrix factorization often provides good results quickly, and moving to GCNs makes sense only if matrix factorization has already shown promise. Another key factor is the size and richness of interactions.
If the graph representation is primarily bipartite, adding user edges may not significantly enhance the recommender system. In retail, edges sometimes represented families, but these structures were often too small to be useful; giving different recommendations to family members like $11$ and $1$ is acceptable, since family ties alone don’t imply similar consumption patterns. However, identifying influencers, such as nodes with high degrees connected to isolated nodes, could guide targeted discounts for products they might promote.</p><p>I would be remiss if I did not add that ALL of these issues with matrix factorization can be fixed by tweaking the factorization in some way. In fact, a recent paper <em>Unifying Graph Convolutional Networks as Matrix Factorization</em> by Liu et al. does exactly this and shows that this approach is even better than a GCN. This is why I think that the biggest advantage of the GCN is not that it is “better” in some sense, but rather that the richness of the graphical structure lends itself naturally to the problem of recommending products, <em>even if</em> that graphical structure can then be shown to be equivalent to some rather more complex and less intuitive matrix structure. I recommend the experiment flow outlined at the end of this post.</p><h1 id="A-Simple-GCN-model"><a href="#A-Simple-GCN-model" class="headerlink" title="A Simple GCN model"></a>A Simple GCN model</h1><p>Let us continue on from our adjacency matrix $A$ and try to build a simple ML model of an embedding. We could hypothesize that an embedding is linearly dependent on the adjacency matrix.</p>$$H = f(AWX + I_nWX)$$<p>The second additive term bears a bit of explaining. Since the adjacency matrix has a $0$ diagonal, a value of $0$ gets multiplied with the node’s own features $x\in X$.
To avoid this we add the node’s own feature matrix $X$ via the identity matrix $I_n$.</p><p>We need to make another important adjustment to $A$: we need to normalize each term in the adjacency matrix by the degrees of the nodes involved. $$\tilde{A} = A + I_n$$ $$A \equiv \tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$$ At the risk of abusing notation, we redefine $A$ as a normalized form of the adjacency matrix after edges connecting each node with itself have been added to the graph ($\tilde{D}$ is the degree matrix of $\tilde{A}$). I like this notation because it emphasizes the fact that you do not need to do this: if you suspect that normalizing your nodes by their degree of connectivity is not important, then you can skip this step (though it costs you nothing to do it). In retail, the degree of a user node refers to the number of products they consume, while the degree of a product node reflects the number of customers it reaches. A product may have millions of consumers, but even the most avid user node typically consumes far fewer, perhaps only hundreds of products.</p><p>Here $X = [X_{u}, X_{i}]$.
$$H  = [U V]$$</p><p>Here we can split the equations by the subgraphs to which they apply,</p>$$H_u = f(A_u W_u X_u)$$ $$H_v = f(A_v W_v X_v)$$<p>Note the equivalence to the matrix case: there we had to stack the embeddings ourselves because of the way we set up the matrix, but in the case of a GCN $H$ is already $(m+n)\times d$ and represents embeddings of both users and items.</p><p>The likelihood of an interaction is,</p>$$\hat y_{ij} = H_u^T H_v$$<p>The loss function is,</p>$$L = \sum_{(u, i) \in \mathcal{I}} \left( y_{ui} - \hat{y}_{ui} \right)^2$$<p>We can substitute the components of $H$ to get a tight expression for optimizing the loss,</p>$$L = \sum_{(u, i) \in \mathcal{I}} \left( y_{ui} - f(A_u W_u X_u)^T f(A_v W_v X_v)\right)^2$$<p>This is the main “result” of this blog post: you can equally look at this one-layer GCN as a matrix factorization problem of the user-item interaction matrix, but with more complex-looking low-rank matrices on the right. In this sense, you can always create a matrix factorization that equates to the loss function of a GCN.</p><p>You can update parameters using SGD or some other technique. I will not get into that too much in this post.</p><h2 id="Understanding-the-GCN-equation"><a href="#Understanding-the-GCN-equation" class="headerlink" title="Understanding the GCN equation"></a>Understanding the GCN equation</h2><p>The two equations for $H_u$ and $H_v$ above are the most important equations in the GCN framework. $W$ is some $(m+n) \times d$ set of weights that learns how to embed or encode the information contained in $X$ into $H$. For this one-layer model, we are only considering values from the nodes that are one edge away, since the value of $h_i$ depends only on the $x_j$’s that are directly connected to it and on its own $x_i$.
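</p><p>The one-layer propagation just described can be sketched in a few lines (toy data, random features; assuming NumPy, and using the conventional $f(\hat{A}XW)$ ordering, where $\hat{A}$ is the degree-normalized adjacency with self-loops from the previous section):</p>

```python
import numpy as np

rng = np.random.default_rng(2)

R = np.array([[1., 0., 1.],
              [1., 1., 0.]])                 # users x items
m, n = R.shape
A = np.block([[np.zeros((m, m)), R],
              [R.T, np.zeros((n, n))]])      # bipartite adjacency matrix

A_tilde = A + np.eye(m + n)                  # add self-loops
d_inv_sqrt = np.diag(A_tilde.sum(axis=1) ** -0.5)
A_hat = d_inv_sqrt @ A_tilde @ d_inv_sqrt    # symmetric degree normalization

F, dim = 4, 2                                # feature and embedding sizes
X = rng.normal(size=(m + n, F))              # node features (random stand-in)
W = rng.normal(size=(F, dim))                # learnable weights

H = np.tanh(A_hat @ X @ W)                   # one propagation step
H_u, H_v = H[:m], H[m:]                      # split back into users / items
Y_hat = H_u @ H_v.T                          # predicted interaction scores
print(Y_hat.shape)                           # (2, 3)
```

<p>Stacking this propagation step again, with a fresh $W$, gives the multi-layer behavior discussed next.</p><p>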
However, if you then apply this operation again, each node’s $h_i$ now contains the information from all the nodes connected to it, but so does every other node’s $h_k$.</p>$$H^0 = f(AW^0X + I_nW^0X)$$$$H^1 = f(AW^1H^0 + I_nW^1H^0)$$<p>More succinctly, $$H^1 = f\left((A + I_n)W^1 f\left((A + I_n)W^0X\right)\right)$$</p><h2 id="Equivalence-to-Matrix-Factorization-for-a-one-layer-GCN"><a href="#Equivalence-to-Matrix-Factorization-for-a-one-layer-GCN" class="headerlink" title="Equivalence to Matrix Factorization for a one layer GCN"></a>Equivalence to Matrix Factorization for a one layer GCN</h2><p>You could just as easily have started with two random matrices $U$ and $V$ and optimized them using your favorite optimization algorithm, ending up with the same likelihood-of-interaction function,</p>$$\hat y_{ij} = U^T V \equiv H_u^T H_v$$<p>So you get the same outcome for a one-layer GCN as you would from matrix factorization. Note that it has been proved that even multi-layer GCNs are equivalent to matrix factorization, but the matrix being factorized is not that easy to interpret.</p><h2 id="Key-Takeaways"><a href="#Key-Takeaways" class="headerlink" title="Key Takeaways"></a>Key Takeaways</h2><p>The differences between MF and GCN really begin to take form when we go into multi-layered GCNs. In the case of the one-layer GCN, the embeddings $H^0$ are only influenced by each node’s neighbors. Thus the features of a customer node will only be influenced by the products that they buy; similarly, a product node will only be influenced by the customers who buy it. However, for deeper neural networks:</p><ol><li><p>2 layer: every customer’s embedding is influenced by the embeddings of the products they consume and the embeddings of other customers of the products they consume.
Similarly, every product is influenced by the customers who consume that product, as well as by the products of the customers who consume that product.</p></li><li><p>3 layer: every customer’s embedding is influenced by the products they consume, other customers of the products they consume, and products consumed by other customers of the products they consume. Similarly, every product is influenced by the consumers of that product, by the products of consumers of that product, and by the products consumed by consumers of that product.</p></li></ol><p>You can see where this is going: in most practical applications, there are only so many levels you need to go to get a good result. In my experience $2$ is the bare minimum (because $1$ is unlikely to do better than an MF; in fact they are equivalent) and $3$ is about how deep you can feasibly go without exploding the number of training parameters.</p><p>That leads to another critical point when considering GCNs: you really pay a price (in blood, mind you) for every layer deep you go. Consider the one-layer case: you have $n\times d$ and $n\times d'$ parameters to learn, because you have to learn both the weight matrix $W$ and the matrix of embeddings $H$. But in the MF case you directly learn $H$. So if you were only going to go one layer deep, you might as well use matrix factorization.</p><p>Going the other way, if you are considering more than $3$ layers, the reality of the problem (in my usual signal processing problems this would be the “physical” laws), i.e.
the behavioral constraints mean that more than three degrees of influence (think about what point 3 would mean for a $5$ layer network) is unlikely to be supported by any theoretical evidence of consumer behavior.</p><h1 id="Final-Prayer-and-Blessing"><a href="#Final-Prayer-and-Blessing" class="headerlink" title="Final Prayer and Blessing"></a>Final Prayer and Blessing</h1><p>I would like the reader to leave with a better sense of the relationship between matrix factorization and GCNs. Like most neural network based models, we tend to think of a GCN as a black box, and a black box that is “better”. However, in the one layer case the two are equivalent, with the GCN in fact having more learnable parameters (and therefore a higher training cost).<br>Therefore, it only makes sense to use $2$ layers or more. And when using more, we need to justify them either behaviorally or with expert advice.</p><h3 id="How-to-go-from-MF-to-GCNs"><a href="#How-to-go-from-MF-to-GCNs" class="headerlink" title="How to go from MF to GCNs"></a>How to go from MF to GCNs</h3><ol><li><p>Start with matrix factorization of the user-item matrix, perhaps adding context or time. If it performs well and the recommendations line up with non-ML recommendations (from a basic segmentation analysis), the model is at least somewhat sensible.</p></li><li><p>Consider a GCN next if the performance of MF is decent but not great. Additionally, definitely try a GCN if you know (from marketing, etc.) that the richness of the graph structure actually plays a role in the prediction. For example, in the sale of Milwaukee tools a graph structure is probably not that useful. However, for selling Thursday Boots, which is heavily influenced by social media clusters, the graph structure might be much more useful.</p></li><li><p>Interestingly, MF matrices tend to be very long and narrow (there are usually thousands of users, and most companies have far more users than they have products).
This is not true for a company like Amazon (300 million users and 300 million products). But if you have a long, narrow matrix that is sparse, you are not too concerned with computation: when $m \ll n$ the cost of the $m\times n$ matrix is $O(n)$, so it does not matter much whether you do MF or a GCN. When $m\approx n$, however, the cost is $O(n^2)$, and in such a case the matrix approach will probably give you a faster result.</p></li></ol><p>In a consulting environment it is worthwhile to always start with a simple matrix factorization rather than a GCN, for simplicity of use and understanding, and then find a matrix structure that approximates only the most interesting and rich aspects of the graph structure, the ones that actually influence the final recommendations.</p><h1 id="References"><a href="#References" class="headerlink" title="References"></a>References</h1><p><a href="https://jonathan-hui.medium.com/graph-convolutional-networks-gcn-pooling-839184205692">https://jonathan-hui.medium.com/graph-convolutional-networks-gcn-pooling-839184205692</a><br><a href="https://tkipf.github.io/graph-convolutional-networks/">https://tkipf.github.io/graph-convolutional-networks/</a><br><a href="https://openreview.net/forum?id=HJxf53EtDr">https://openreview.net/forum?id=HJxf53EtDr</a><br><a href="https://distill.pub/2021/gnn-intro/">https://distill.pub/2021/gnn-intro/</a></p>]]>
    </content>
    <id>https://franciscormendes.github.io/2024/09/28/graph-convolutional-neural-network-and-matrix-factorization/</id>
    <link href="https://franciscormendes.github.io/2024/09/28/graph-convolutional-neural-network-and-matrix-factorization/"/>
    <published>2024-09-28T00:00:00.000Z</published>
    <summary>The mathematical essentials for recommender systems: from matrix factorization via SVD to graph convolutional networks, and why the spectral perspective unifies both approaches.</summary>
    <title>Unifying Tensor Factorization and Graph Neural Networks: Review of Mathematical Essentials for Recommender Systems</title>
    <updated>2026-04-10T14:24:00.546Z</updated>
  </entry>
  <entry>
    <author>
      <name>Francisco Romaldo Fernandes Mendes</name>
    </author>
    <category term="machine-learning" scheme="https://franciscormendes.github.io/categories/machine-learning/"/>
    <category term="machine-learning" scheme="https://franciscormendes.github.io/tags/machine-learning/"/>
    <category term="embedded-ml" scheme="https://franciscormendes.github.io/tags/embedded-ml/"/>
    <category term="convolutional-neural-networks" scheme="https://franciscormendes.github.io/tags/convolutional-neural-networks/"/>
    <category term="low-rank-approximation" scheme="https://franciscormendes.github.io/tags/low-rank-approximation/"/>
    <category term="model-compression" scheme="https://franciscormendes.github.io/tags/model-compression/"/>
    <category term="lora" scheme="https://franciscormendes.github.io/tags/lora/"/>
    <content>
<![CDATA[<div class="series-box">  <div class="series-label">Series</div>  <div class="series-name">Low-Rank Approximation for Neural Networks</div>  <ol class="series-list"><li class="series-item"><a href="/2024/04/24/lora-2/">Part II :  Shrinking Neural Networks for Embedded Systems Using Low Rank Approximations (LoRA)</a></li><li class="series-item series-current"><span>Part III :  What does Low Rank Factorization of a Convolutional Layer really do?</span></li><li class="series-item"><a href="/2024/04/03/lora/">Part I :  Shrinking Neural Networks for Embedded Systems Using Low Rank Approximations (LoRA)</a></li></ol></div><h1 id="Decomposition-of-a-Convolutional-layer"><a href="#Decomposition-of-a-Convolutional-layer" class="headerlink" title="Decomposition of a Convolutional layer"></a>Decomposition of a Convolutional layer</h1><p>In <a href="/2024/04/03/lora/">Part I</a> I described (in some detail) what it means to decompose a matrix multiply into a sequence of low rank matrix multiplies, and <a href="/2024/04/24/lora-2/">Part II</a> extended that to convolutional kernels and rank selection. We can go further still for general tensors, though this is somewhat less easy to see since tensors in higher dimensions are quite hard to visualize.<br>Recall the matrix formulation,</p>$$Y = XW + b = XUSV' + b$$<p>where $U$ and $V$ contain the left and right singular vectors of $W$ respectively, and $S$ is the diagonal matrix of singular values. The idea is to approximate $W$ by truncating this sum of rank-one outer products to a lower rank.<br>Now instead of a weight matrix multiplication $Y = XW + b$ we have a kernel operation, $y = K\circledast X + b$, where $\circledast$ is the convolution operation. The idea is likewise to approximate $K$ as a truncated sum of outer products.<br>Interestingly, you can also think about this as a matrix multiplication, by creating a Toeplitz matrix version of $K$, call it $K'$, and then computing $y = K'X + b$. But this comes with issues, as $K'$ is much much bigger than $K$.
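</p><p>To make the matrix picture concrete, here is a minimal numpy sketch (the names and sizes are illustrative, not from the series code): we truncate the SVD of a weight matrix $W$ to rank $r$ and push $X$ through the thin factors instead of the dense matrix. $W$ is built to be exactly rank $4$, so truncation at $r=4$ loses nothing.</p>

```python
import numpy as np

# Sketch: replace one dense multiply X @ W with multiplies through the
# truncated SVD factors, W ≈ U_r @ diag(s_r) @ Vt_r. Illustrative sizes only.
rng = np.random.default_rng(0)

# Construct W with rank exactly 4, so the rank-4 truncation is exact.
W = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 32))
X = rng.standard_normal((10, 64))

U, s, Vt = np.linalg.svd(W, full_matrices=False)
r = 4
U_r, s_r, Vt_r = U[:, :r], s[:r], Vt[:r, :]

Y_full = X @ W                         # dense path: 64*32 = 2048 weights
Y_lowrank = ((X @ U_r) * s_r) @ Vt_r   # factored path: 64*4 + 4 + 4*32 = 388 numbers

print(np.allclose(Y_full, Y_lowrank))  # True: nothing lost at the true rank
```

<p>The factored path stores and multiplies far fewer numbers, which is the entire appeal when $r$ is small relative to the matrix dimensions.</p><p>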
So we just approach it as a convolution operation for now. </p><h1 id="Convolution-Operation"><a href="#Convolution-Operation" class="headerlink" title="Convolution Operation"></a>Convolution Operation</h1><p>At the heart of it, a convolution operation takes a smaller cube subset of a “cube” of numbers (also known as the map stack), multiplies each of those numbers by a fixed set of numbers (also known as the kernel), and sums the result into a single scalar output. Let us start with what each “slice” of the cube really represents.</p><p><img src="/2024/09/13/lora-3/image_parrot.png" alt="Each channel represents the intensity of one color. And since we have already separated out the channels we can revert it to grey-scale. Where white means that color is very intense or the value at that pixel is high and black means it is very low."></p><p><img src="/2024/09/13/lora-3/lighthouse.png" alt="Each such image is shaped into a &quot;cube&quot;. For an RGB image, the &quot;depth&quot; of the image is 3 (one for each color)."></p><p>Now that we have a working example of the representation, let us try to visualize what a convolution is.</p><p><img src="/2024/09/13/lora-3/convolution.png" alt="Basic Convolution, maps a &quot;cube&quot; to a number"></p><p>A convolution operation takes a subset of the RGB image across all channels and maps it to one number (a scalar): it multiplies each pixel in that subset, across all $3$ channels, by a fixed number from the kernel (not pictured here) and adds everything together.</p><h1 id="Low-Rank-Approximation-of-Convolution"><a href="#Low-Rank-Approximation-of-Convolution" class="headerlink" title="Low Rank Approximation of Convolution"></a>Low Rank Approximation of Convolution</h1><p>Now that we have a good idea of what a convolution looks like, we can try to visualize what a low rank approximation to a convolution might look like.
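</p><p>The sliding-window arithmetic just described can be sketched directly for a single channel. A minimal numpy version (the helper name <code>conv2d_valid</code> is mine; the input and kernel values are illustrative):</p>

```python
import numpy as np

# Minimal "valid" 2D convolution: slide a 3x3 kernel over the input,
# multiply each patch element-wise by the kernel, and sum to one scalar.
X = np.array([[1, 2, 3, 0, 1],
              [0, 1, 2, 3, 0],
              [3, 0, 1, 2, 3],
              [2, 3, 0, 1, 2],
              [1, 2, 3, 0, 1]])
K = np.array([[1, 0, -1],
              [1, 0, -1],
              [1, 0, -1]])

def conv2d_valid(X, K, stride=1):
    kh, kw = K.shape
    out_h = (X.shape[0] - kh) // stride + 1
    out_w = (X.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = X[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * K)  # element-wise multiply, then sum
    return out

print(conv2d_valid(X, K)[0, 0])  # -2.0 for the top-left window
```

<p>An RGB input only adds a channel axis: the patch becomes a small cube, the kernel a matching cube, and each position still yields one scalar.</p><p>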
The particular kind of approximation we have chosen here replaces the single convolution operation with a sequence of simpler operations.</p><p><img src="/2024/09/13/lora-3/2_conv.png" alt="Still maps a cube to a number but does so via a sequence of 2 &quot;simpler&quot; operations"></p><h1 id="Painful-Example-of-Convolution-by-hand"><a href="#Painful-Example-of-Convolution-by-hand" class="headerlink" title="Painful Example of Convolution by hand"></a>Painful Example of Convolution by hand</h1><p>Consider the input matrix:</p>$$X = \begin{bmatrix}1 & 2 & 3 & 0 & 1 \\0 & 1 & 2 & 3 & 0 \\3 & 0 & 1 & 2 & 3 \\2 & 3 & 0 & 1 & 2 \\1 & 2 & 3 & 0 & 1 \\\end{bmatrix}$$ Input slice: $$\begin{bmatrix}1 & 2 & 3 \\0 & 1 & 2 \\3 & 0 & 1 \\\end{bmatrix}$$<p>Kernel: $$\begin{bmatrix}1 & 0 & -1 \\1 & 0 & -1 \\1 & 0 & -1 \\\end{bmatrix}$$</p><p>Element-wise multiplication and sum: $$(1 \cdot 1) + (2 \cdot 0) + (3 \cdot -1) + \\(0 \cdot 1) + (1 \cdot 0) + (2 \cdot -1) + \\(3 \cdot 1) + (0 \cdot 0) + (1 \cdot -1)$$</p>$$\implies1 + 0 - 3 + \\0 + 0 - 2 + \\3 + 0 - 1 = -2$$ Now repeat that, moving the kernel one step over (you can change the step size with the stride argument of the convolution).<h1 id="Low-Rank-Approximation-of-convolution"><a href="#Low-Rank-Approximation-of-convolution" class="headerlink" title="Low Rank Approximation of convolution"></a>Low Rank Approximation of convolution</h1><p>Now we will painfully do a low rank decomposition of the convolution kernel above. The SVD tells us that any $2D$ matrix can be written as a sum of rank-one outer products of vectors, and truncating that sum gives a low rank approximation. Say we express $K$ as, $$K \approx a_1 \times b_1 + a_2\times b_2$$</p><p>We can easily guess $a_i, b_i$.
Consider, $$a_1 = \begin{bmatrix}     1\\     1\\     1\\ \end{bmatrix}$$ $$b_1 = \begin{bmatrix}     1\\     0\\     -1\\ \end{bmatrix}$$ $$a_2 = \begin{bmatrix}     0\\     0\\     0\\ \end{bmatrix}$$ $$b_2 = \begin{bmatrix}     0\\     0\\     0\\ \end{bmatrix}$$</p><p>This is easy because I chose values for the kernel that were easy to break down. How to perform this breakdown is the subject of the later sections.</p>$$K = a_1\times b_1 + a_2 \times b_2 = \begin{bmatrix}1 & 0& -1 \\1 & 0 & -1 \\1 & 0 & -1 \\\end{bmatrix} +\begin{bmatrix}0 & 0 & 0 \\0 & 0 & 0 \\0 & 0 & 0 \\\end{bmatrix} = \begin{bmatrix}1 & 0 & -1 \\1 & 0 & -1 \\1 & 0 & -1 \\\end{bmatrix}$$<p>Consider the original kernel matrix $K$ and the low-rank vectors:</p>$$K = \begin{bmatrix}1 & 0 & -1 \\1 & 0 & -1 \\1 & 0 & -1\end{bmatrix}$$$$a_1 = \begin{bmatrix}1 \\1 \\1\end{bmatrix}, \quadb_1 = \begin{bmatrix}1 \\0 \\-1\end{bmatrix}$$<p>The input matrix $M$ is:</p>$$M = \begin{bmatrix}1 & 2 & 3 & 0 & 1 \\0 & 1 & 2 & 3 & 0 \\3 & 0 & 1 & 2 & 3 \\2 & 3 & 0 & 1 & 2 \\1 & 2 & 3 & 0 & 1\end{bmatrix}$$<h2 id="Convolution-with-Original-Kernel"><a href="#Convolution-with-Original-Kernel" class="headerlink" title="Convolution with Original Kernel"></a>Convolution with Original Kernel</h2><p>Perform the convolution at the top-left corner of the input matrix:</p>$$\text{Input slice} = \begin{bmatrix}1 & 2 & 3 \\0 & 1 & 2 \\3 & 0 & 1\end{bmatrix}$$$$\text{Element-wise multiplication and sum:}$$$$\begin{aligned}(1 \times 1) + (2 \times 0) + (3 \times -1) + \\(0 \times 1) + (1 \times 0) + (2 \times -1) + \\(3 \times 1) + (0 \times 0) + (1 \times -1) &= \\1 + 0 - 3 + 0 + 0 - 2 + 3 + 0 - 1 &= -2\end{aligned}$$<h2 id="Convolution-with-Low-Rank-Vectors"><a href="#Convolution-with-Low-Rank-Vectors" class="headerlink" title="Convolution with Low-Rank Vectors"></a>Convolution with Low-Rank Vectors</h2><p>Using the low-rank vectors:</p>$$a_1 = \begin{bmatrix}1 \\1 \\1\end{bmatrix}, \quadb_1 = \begin{bmatrix}1 \\0 
\\-1\end{bmatrix}$$<p>Step 1: Apply $b_1$ (filter along each row of the input slice):</p>$$\text{Row-wise operation:}$$$$\begin{aligned}\begin{bmatrix}1 & 2 & 3\end{bmatrix} \cdot b_1 &= 1 + 0 - 3 = -2 \\\begin{bmatrix}0 & 1 & 2\end{bmatrix} \cdot b_1 &= 0 + 0 - 2 = -2 \\\begin{bmatrix}3 & 0 & 1\end{bmatrix} \cdot b_1 &= 3 + 0 - 1 = 2\end{aligned}$$$$\text{Result, one number per row:}\quad\begin{bmatrix}-2 \\-2 \\2\end{bmatrix}$$<p>Step 2: Apply $a_1$ (weight and sum the resulting column):</p>$$\text{Column-wise operation:}$$$$1 \cdot (-2) + 1 \cdot (-2) + 1 \cdot (2) = -2$$<h2 id="Comparison"><a href="#Comparison" class="headerlink" title="Comparison"></a>Comparison</h2><ul><li><p>Convolution with Original Kernel: -2</p></li><li><p>Convolution with Low-Rank Vectors: -2</p></li></ul><p>The results agree exactly here, because this particular kernel is itself rank one ($a_2$ and $b_2$ are zero). For a kernel that is not exactly low rank, a truncated approximation will not reproduce the convolution exactly; in practice we will ALWAYS lose some accuracy, and managing that loss is part of the problem we optimize for when picking low rank approximations.</p><h1 id="PyTorch-Implementation"><a href="#PyTorch-Implementation" class="headerlink" title="PyTorch Implementation"></a>PyTorch Implementation</h1><p>Below you can find the definition of the original network.
</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">Net</span>(nn.Module):</span><br><span class="line">    <span class="keyword">def</span> <span 
class="title function_">__init__</span>(<span class="params">self</span>):</span><br><span class="line">        <span class="built_in">super</span>().__init__()</span><br><span class="line">        self.layers  = nn.ModuleDict()</span><br><span class="line">        self.layers[<span class="string">&#x27;conv1&#x27;</span>] = nn.Conv2d(<span class="number">3</span>, <span class="number">6</span>, <span class="number">5</span>)</span><br><span class="line">        self.layers[<span class="string">&#x27;pool&#x27;</span>] = nn.MaxPool2d(<span class="number">2</span>, <span class="number">2</span>)</span><br><span class="line">        self.layers[<span class="string">&#x27;conv2&#x27;</span>] = nn.Conv2d(<span class="number">6</span>, <span class="number">16</span>, <span class="number">5</span>)</span><br><span class="line">        self.layers[<span class="string">&#x27;fc1&#x27;</span>] = nn.Linear(<span class="number">16</span> * <span class="number">5</span> * <span class="number">5</span>, <span class="number">120</span>)</span><br><span class="line">        self.layers[<span class="string">&#x27;fc2&#x27;</span>] = nn.Linear(<span class="number">120</span>, <span class="number">84</span>)</span><br><span class="line">        self.layers[<span class="string">&#x27;fc3&#x27;</span>] = nn.Linear(<span class="number">84</span>, <span class="number">10</span>)</span><br><span class="line"></span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">forward</span>(<span class="params">self,x</span>):</span><br><span class="line">        x = self.layers[<span class="string">&#x27;pool&#x27;</span>](F.relu(self.layers[<span class="string">&#x27;conv1&#x27;</span>](x)))</span><br><span class="line">        x = self.layers[<span class="string">&#x27;pool&#x27;</span>](F.relu(self.layers[<span class="string">&#x27;conv2&#x27;</span>](x)))</span><br><span class="line">        x = torch.flatten(x, <span 
class="number">1</span>)</span><br><span class="line">        x = F.relu(self.layers[<span class="string">&#x27;fc1&#x27;</span>](x))</span><br><span class="line">        x = F.relu(self.layers[<span class="string">&#x27;fc2&#x27;</span>](x))</span><br><span class="line">        x = self.layers[<span class="string">&#x27;fc3&#x27;</span>](x)</span><br><span class="line">        <span class="keyword">return</span> x</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">evaluate_model</span>(<span class="params">net</span>):</span><br><span class="line">    <span class="keyword">import</span> torchvision.transforms <span class="keyword">as</span> transforms</span><br><span class="line">    batch_size = <span class="number">4</span> <span class="comment"># [4, 3, 32, 32]</span></span><br><span class="line">    transform = transforms.Compose(</span><br><span class="line">        [transforms.ToTensor(),</span><br><span class="line">         transforms.Normalize((<span class="number">0.5</span>, <span class="number">0.5</span>, <span class="number">0.5</span>), (<span class="number">0.5</span>, <span class="number">0.5</span>, <span class="number">0.5</span>))])</span><br><span class="line">    classes = (<span class="string">&#x27;plane&#x27;</span>, <span class="string">&#x27;car&#x27;</span>, <span class="string">&#x27;bird&#x27;</span>, <span class="string">&#x27;cat&#x27;</span>,</span><br><span class="line">               <span class="string">&#x27;deer&#x27;</span>, <span class="string">&#x27;dog&#x27;</span>, <span class="string">&#x27;frog&#x27;</span>, <span class="string">&#x27;horse&#x27;</span>, <span class="string">&#x27;ship&#x27;</span>, <span class="string">&#x27;truck&#x27;</span>)</span><br><span class="line">    trainset = torchvision.datasets.CIFAR10(root=<span class="string">&#x27;../data&#x27;</span>, train=<span class="literal">True</span>,</span><br><span class="line">            
                                download=<span class="literal">True</span>, transform=transform)</span><br><span class="line">    trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,</span><br><span class="line">                                              shuffle=<span class="literal">True</span>, num_workers=<span class="number">2</span>)</span><br><span class="line">    testset = torchvision.datasets.CIFAR10(root=<span class="string">&#x27;../data&#x27;</span>, train=<span class="literal">False</span>,</span><br><span class="line">                                           download=<span class="literal">True</span>, transform=transform)</span><br><span class="line">    testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,</span><br><span class="line">                                             shuffle=<span class="literal">False</span>, num_workers=<span class="number">2</span>)</span><br><span class="line">    <span class="comment"># prepare to count predictions for each class</span></span><br><span class="line">    correct_pred = &#123;classname: <span class="number">0</span> <span class="keyword">for</span> classname <span class="keyword">in</span> classes&#125;</span><br><span class="line">    total_pred = &#123;classname: <span class="number">0</span> <span class="keyword">for</span> classname <span class="keyword">in</span> classes&#125;</span><br><span class="line">    <span class="comment"># again no gradients needed</span></span><br><span class="line">    <span class="keyword">with</span> torch.no_grad():</span><br><span class="line">        <span class="keyword">for</span> data <span class="keyword">in</span> testloader:</span><br><span class="line">            images, labels = data</span><br><span class="line">            outputs = net(images)</span><br><span class="line">            _, predictions = torch.<span class="built_in">max</span>(outputs, <span class="number">1</span>)</span><br><span class="line">  
          <span class="comment"># collect the correct predictions for each class</span></span><br><span class="line">            <span class="keyword">for</span> label, prediction <span class="keyword">in</span> <span class="built_in">zip</span>(labels, predictions):</span><br><span class="line">                <span class="keyword">if</span> label == prediction:</span><br><span class="line">                    correct_pred[classes[label]] += <span class="number">1</span></span><br><span class="line">                total_pred[classes[label]] += <span class="number">1</span></span><br><span class="line">    <span class="comment"># print accuracy for each class</span></span><br><span class="line">    <span class="keyword">for</span> classname, correct_count <span class="keyword">in</span> correct_pred.items():</span><br><span class="line">        accuracy = <span class="number">100</span> * <span class="built_in">float</span>(correct_count) / total_pred[classname]</span><br><span class="line">        <span class="built_in">print</span>(<span class="string">f&#x27;Original Accuracy for class: <span class="subst">&#123;classname:5s&#125;</span> is <span class="subst">&#123;accuracy:<span class="number">.1</span>f&#125;</span> %&#x27;</span>)</span><br></pre></td></tr></table></figure><p>Now let us decompose the first convolutional layer into 3 simpler layers using SVD</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span 
class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span 
class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">slice_wise_svd</span>(<span class="params">tensor,rank</span>):</span><br><span class="line">    <span class="comment"># tensor is a 4D tensor</span></span><br><span class="line">    <span class="comment"># rank is the target rank</span></span><br><span class="line">    <span class="comment"># returns a list of 4D tensors</span></span><br><span class="line">    <span class="comment"># each tensor is a slice of the input tensor</span></span><br><span class="line">    <span class="comment"># each slice is decomposed using SVD</span></span><br><span class="line">    <span class="comment"># and the decomposition is used to approximate the slice</span></span><br><span class="line">    <span class="comment"># the approximated slice is returned as a 4D tensor</span></span><br><span class="line">    <span class="comment"># the list of approximated slices is returned</span></span><br><span class="line">    num_filters, input_channels, kernel_width, kernel_height = tensor.shape</span><br><span class="line">    kernel_U = torch.zeros((num_filters, input_channels,kernel_height,rank))</span><br><span class="line">    kernel_S = torch.zeros((input_channels,num_filters,rank,rank))</span><br><span class="line">    kernel_V = torch.zeros((num_filters,input_channels,rank,kernel_width))</span><br><span class="line">    approximated_slices = 
[]</span><br><span class="line">    reconstructed_tensor = torch.zeros_like(tensor)</span><br><span class="line">    <span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(num_filters):</span><br><span class="line">        <span class="keyword">for</span> j <span class="keyword">in</span> <span class="built_in">range</span>(input_channels):</span><br><span class="line">            U, S, V = torch.svd(tensor[i, j,:,:])</span><br><span class="line">            U = U[:,:rank]</span><br><span class="line">            S = S[:rank]</span><br><span class="line">            V = V[:,:rank]</span><br><span class="line">            kernel_U[i,j,:,:] = U</span><br><span class="line">            kernel_S[j,i,:,:] = torch.diag(S)</span><br><span class="line">            kernel_V[i,j,:,:] = torch.transpose(V,<span class="number">0</span>,<span class="number">1</span>)</span><br><span class="line">            reconstructed_tensor[i,j,:,:] = U @ torch.diag(S) @ V.t()</span><br><span class="line"></span><br><span class="line">    <span class="comment"># print the reconstruction error</span></span><br><span class="line">    <span class="built_in">print</span>(<span class="string">&quot;Reconstruction error: &quot;</span>,torch.norm(reconstructed_tensor-tensor).item())</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> kernel_U, kernel_S, kernel_V</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">svd_decomposition_conv_layer</span>(<span class="params">layer, rank</span>):</span><br><span class="line">    <span class="string">&quot;&quot;&quot; Gets a conv layer and a target rank,</span></span><br><span class="line"><span class="string">        returns a list of nn.Conv2d layers with the decomposition</span></span><br><span class="line"><span class="string">    &quot;&quot;&quot;</span></span><br><span class="line"></span><br><span class="line">    <span class="comment"># Perform SVD 
decomposition on the layer weight tensorly.</span></span><br><span class="line">    </span><br><span class="line">    layer_weight = layer.weight.data</span><br><span class="line">    kernel_U, kernel_S, kernel_V = slice_wise_svd(layer_weight,rank)</span><br><span class="line">    U_layer = nn.Conv2d(in_channels=kernel_U.shape[<span class="number">1</span>],</span><br><span class="line">                                                out_channels=kernel_U.shape[<span class="number">0</span>], kernel_size=(kernel_U.shape[<span class="number">2</span>], <span class="number">1</span>), padding=<span class="number">0</span>, stride = <span class="number">1</span>,</span><br><span class="line">                                                dilation=layer.dilation, bias=<span class="literal">True</span>)</span><br><span class="line">    S_layer = nn.Conv2d(in_channels=kernel_S.shape[<span class="number">1</span>],</span><br><span class="line">                                                out_channels=kernel_S.shape[<span class="number">0</span>], kernel_size=<span class="number">1</span>, padding=<span class="number">0</span>, stride = <span class="number">1</span>,</span><br><span class="line">                                                dilation=layer.dilation, bias=<span class="literal">False</span>)</span><br><span class="line">    V_layer = nn.Conv2d(in_channels=kernel_V.shape[<span class="number">1</span>],</span><br><span class="line">                                                out_channels=kernel_V.shape[<span class="number">0</span>], kernel_size=(<span class="number">1</span>, kernel_V.shape[<span class="number">3</span>]), padding=<span class="number">0</span>, stride = <span class="number">1</span>,</span><br><span class="line">                                                dilation=layer.dilation, bias=<span class="literal">False</span>)</span><br><span class="line">    <span class="comment"># store the bias in U_layer from 
layer</span></span><br><span class="line">    U_layer.bias = layer.bias</span><br><span class="line"></span><br><span class="line">    <span class="comment"># set weights as the svd decomposition</span></span><br><span class="line">    U_layer.weight.data = kernel_U</span><br><span class="line">    S_layer.weight.data = kernel_S</span><br><span class="line">    V_layer.weight.data = kernel_V</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> [U_layer, S_layer, V_layer]</span><br><span class="line">    </span><br><span class="line">    </span><br><span class="line"><span class="keyword">class</span> <span class="title class_">lowRankNetSVD</span>(<span class="title class_ inherited__">Net</span>):</span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">__init__</span>(<span class="params">self, original_network</span>):</span><br><span class="line">        <span class="built_in">super</span>().__init__()</span><br><span class="line">        self.layers = nn.ModuleDict()</span><br><span class="line">        self.initialize_layers(original_network)</span><br><span class="line">    </span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">initialize_layers</span>(<span class="params">self, original_network</span>):</span><br><span class="line">        <span class="comment"># Make deep copy of the original network so that it doesn&#x27;t get modified</span></span><br><span class="line">        og_network = copy.deepcopy(original_network)</span><br><span class="line">        <span class="comment"># Getting first layer from the original network</span></span><br><span class="line">        layer_to_replace = <span class="string">&quot;conv1&quot;</span></span><br><span class="line">        <span class="comment"># Remove the first layer</span></span><br><span class="line">        <span class="keyword">for</span> i, layer <span 
class="keyword">in</span> <span class="built_in">enumerate</span>(og_network.layers):</span><br><span class="line">            <span class="keyword">if</span> layer == layer_to_replace:</span><br><span class="line">                <span class="comment"># decompose that layer</span></span><br><span class="line">                rank = <span class="number">1</span></span><br><span class="line">                kernel = og_network.layers[layer].weight.data</span><br><span class="line">                decomp_layers = svd_decomposition_conv_layer(og_network.layers[layer], rank)</span><br><span class="line">                <span class="keyword">for</span> j, decomp_layer <span class="keyword">in</span> <span class="built_in">enumerate</span>(decomp_layers):</span><br><span class="line">                    self.layers[layer + <span class="string">f&quot;_<span class="subst">&#123;j&#125;</span>&quot;</span>] = decomp_layer</span><br><span class="line">            <span class="keyword">else</span>:</span><br><span class="line">                self.layers[layer] = og_network.layers[layer]</span><br><span class="line">    </span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">forward</span>(<span class="params">self, x</span>):</span><br><span class="line">        x = self.layers[<span class="string">&#x27;conv1_0&#x27;</span>](x)</span><br><span class="line">        x = self.layers[<span class="string">&#x27;conv1_1&#x27;</span>](x)</span><br><span class="line">        x = self.layers[<span class="string">&#x27;conv1_2&#x27;</span>](x)</span><br><span class="line">        x = self.layers[<span class="string">&#x27;pool&#x27;</span>](F.relu(x))</span><br><span class="line">        x = self.layers[<span class="string">&#x27;pool&#x27;</span>](F.relu(self.layers[<span class="string">&#x27;conv2&#x27;</span>](x)))</span><br><span class="line">        x = torch.flatten(x, <span class="number">1</span>)</span><br><span class="line">        
x = F.relu(self.layers[<span class="string">&#x27;fc1&#x27;</span>](x))</span><br><span class="line">        x = F.relu(self.layers[<span class="string">&#x27;fc2&#x27;</span>](x))</span><br><span class="line">        x = self.layers[<span class="string">&#x27;fc3&#x27;</span>](x)</span><br><span class="line">        <span class="keyword">return</span> x</span><br></pre></td></tr></table></figure><h1 id="Decomposition-into-a-list-of-simpler-operations"><a href="#Decomposition-into-a-list-of-simpler-operations" class="headerlink" title="Decomposition into a list of simpler operations"></a>Decomposition into a list of simpler operations</h1><p>The examples above are quite simple, yet perfectly serviceable for simplifying neural networks. This is still an active area of research: one line of work tries to further simplify each already-simplified operation, at the price of performing more operations. The decomposition we use in this example breaks the convolution down into four simpler operations. </p><p><img src="/2024/09/13/lora-3/decomp_conv.png" alt="CP Decomposition shown here, still maps a cube to a number but does so via a sequence of 4 &quot;simpler&quot; operations"></p><ul><li><p>(Green) Takes one pixel from the image across all $3$ channels and maps it to one value</p></li><li><p>(Red) Takes one long set of pixels from one channel and maps it to one value</p></li><li><p>(Blue) Takes one wide set of pixels from one channel and maps it to one value</p></li><li><p>(Green) Takes one pixel from all $3$ channels and maps it to one value</p></li></ul><p>Intuitively, we are still taking the same subset “cube”, but we have broken it down so that in any given operation only $1$ dimension is not $1$. 
This is really the key to reducing the complexity of the initial convolution operation: even though there are more operations, each individual operation is far less complex.</p><h1 id="PyTorch-Implementation-1"><a href="#PyTorch-Implementation-1" class="headerlink" title="PyTorch Implementation"></a>PyTorch Implementation</h1><p>In this section, we will take AlexNet (<code>Net</code>), evaluate it on some data (<code>evaluate_model</code>), and then decompose the convolutional layers. </p><h2 id="Declaring-both-the-original-and-low-rank-network"><a href="#Declaring-both-the-original-and-low-rank-network" class="headerlink" title="Declaring both the original and low rank network"></a>Declaring both the original and low rank network</h2><p>Here we will decompose the second convolutional layer, given by the <code>layer_to_replace</code> argument. The two important lines to pay attention to are the calls to <code>est_rank</code> and <code>cp_decomposition_conv_layer</code>: the first estimates the rank of the convolutional layer, and the second decomposes the layer into a list of simpler operations.</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">lowRankNet</span>(<span class="title class_ inherited__">Net</span>):</span><br><span class="line"></span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">__init__</span>(<span class="params">self, original_network</span>):</span><br><span class="line">        <span class="built_in">super</span>().__init__()</span><br><span class="line">        self.layers = nn.ModuleDict()</span><br><span class="line">        self.initialize_layers(original_network)</span><br><span class="line"></span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">initialize_layers</span>(<span class="params">self, original_network</span>):</span><br><span class="line">        <span class="comment"># Make deep copy of the original network so that it doesn&#x27;t get modified</span></span><br><span class="line">        og_network = copy.deepcopy(original_network)</span><br><span class="line">        <span class="comment"># Name of the layer to be replaced in the original network</span></span><br><span class="line">        layer_to_replace = <span class="string">&quot;conv2&quot;</span></span><br><span class="line">        <span class="comment"># Replace that layer with its decomposition, keeping all other layers</span></span><br><span class="line">        <span class="keyword">for</span> i, layer <span class="keyword">in</span> <span class="built_in">enumerate</span>(og_network.layers):</span><br><span class="line">            <span class="keyword">if</span> layer == 
layer_to_replace:</span><br><span class="line">                <span class="comment"># decompose that layer</span></span><br><span class="line">                rank = est_rank(og_network.layers[layer])</span><br><span class="line">                decomp_layers = cp_decomposition_conv_layer(og_network.layers[layer], rank)</span><br><span class="line">                <span class="keyword">for</span> j, decomp_layer <span class="keyword">in</span> <span class="built_in">enumerate</span>(decomp_layers):</span><br><span class="line">                    self.layers[layer + <span class="string">f&quot;_<span class="subst">&#123;j&#125;</span>&quot;</span>] = decomp_layer</span><br><span class="line">            <span class="keyword">else</span>:</span><br><span class="line">                self.layers[layer] = og_network.layers[layer]</span><br><span class="line">        <span class="comment"># Add the decomposed layers at the position of the deleted layer</span></span><br><span class="line"></span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">forward</span>(<span class="params">self, x, layer_to_replace=<span class="string">&quot;conv2&quot;</span></span>):</span><br><span class="line">        x = self.layers[<span class="string">&#x27;pool&#x27;</span>](F.relu(self.layers[<span class="string">&#x27;conv1&#x27;</span>](x)))</span><br><span class="line">        <span class="comment"># x = self.layers[&#x27;pool&#x27;](F.relu(self.laye[&#x27;conv2&#x27;](x)</span></span><br><span class="line">        x = self.layers[<span class="string">&#x27;conv2_0&#x27;</span>](x)</span><br><span class="line">        x = self.layers[<span class="string">&#x27;conv2_1&#x27;</span>](x)</span><br><span class="line">        x = self.layers[<span class="string">&#x27;conv2_2&#x27;</span>](x)</span><br><span class="line">        x = self.layers[<span class="string">&#x27;pool&#x27;</span>](F.relu(self.layers[<span 
class="string">&#x27;conv2_3&#x27;</span>](x)))</span><br><span class="line">        x = torch.flatten(x, <span class="number">1</span>)</span><br><span class="line">        x = F.relu(self.layers[<span class="string">&#x27;fc1&#x27;</span>](x))</span><br><span class="line">        x = F.relu(self.layers[<span class="string">&#x27;fc2&#x27;</span>](x))</span><br><span class="line">        x = self.layers[<span class="string">&#x27;fc3&#x27;</span>](x)</span><br><span class="line">        <span class="keyword">return</span> x</span><br><span class="line"></span><br></pre></td></tr></table></figure><h1 id="Evaluate-the-Model"><a href="#Evaluate-the-Model" class="headerlink" title="Evaluate the Model"></a>Evaluate the Model</h1><p>You can evaluate the model by running the following code. This will print the accuracy of the original model and the low rank model. </p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line">decomp_alexnet = lowRankNetSVD(net)</span><br><span class="line"><span class="comment"># replicate with original model</span></span><br><span class="line"></span><br><span class="line">correct_pred = &#123;classname: <span class="number">0</span> <span class="keyword">for</span> classname <span 
class="keyword">in</span> classes&#125;</span><br><span class="line">total_pred = &#123;classname: <span class="number">0</span> <span class="keyword">for</span> classname <span class="keyword">in</span> classes&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment"># again no gradients needed</span></span><br><span class="line"><span class="keyword">with</span> torch.no_grad():</span><br><span class="line">    <span class="keyword">for</span> data <span class="keyword">in</span> testloader:</span><br><span class="line">        images, labels = data</span><br><span class="line">        outputs = decomp_alexnet(images)</span><br><span class="line">        _, predictions = torch.<span class="built_in">max</span>(outputs, <span class="number">1</span>)</span><br><span class="line">        <span class="comment"># collect the correct predictions for each class</span></span><br><span class="line">        <span class="keyword">for</span> label, prediction <span class="keyword">in</span> <span class="built_in">zip</span>(labels, predictions):</span><br><span class="line">            <span class="keyword">if</span> label == prediction:</span><br><span class="line">                correct_pred[classes[label]] += <span class="number">1</span></span><br><span class="line">            total_pred[classes[label]] += <span class="number">1</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># print accuracy for each class</span></span><br><span class="line"><span class="keyword">for</span> classname, correct_count <span class="keyword">in</span> correct_pred.items():</span><br><span class="line">    accuracy = <span class="number">100</span> * <span class="built_in">float</span>(correct_count) / total_pred[classname]</span><br><span class="line">    <span class="built_in">print</span>(<span class="string">f&#x27;Lite Accuracy for class: <span class="subst">&#123;classname:5s&#125;</span> is <span 
class="subst">&#123;accuracy:<span class="number">.1</span>f&#125;</span> %&#x27;</span>)</span><br></pre></td></tr></table></figure><p>Let us first discuss estimate rank. For a complete discussion see the the references by Nakajima and Shinchi. The basic idea is that we take the tensor, “unfold” it along one axis (basically reduce the tensor into a matrix by collapsing around other axes) and estimate the rank of that matrix.<br>You can find <code>est_rank</code> below. </p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span 
class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br><span class="line">105</span><br><span class="line">106</span><br><span class="line">107</span><br><span 
class="line">108</span><br><span class="line">109</span><br><span class="line">110</span><br><span class="line">111</span><br><span class="line">112</span><br><span class="line">113</span><br><span class="line">114</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">from</span> __future__ <span class="keyword">import</span> division</span><br><span class="line"><span class="keyword">import</span> torch</span><br><span class="line"><span class="keyword">import</span> numpy <span class="keyword">as</span> np</span><br><span class="line"><span class="comment"># from scipy.sparse.linalg import svds</span></span><br><span class="line"><span class="keyword">from</span> scipy.optimize <span class="keyword">import</span> minimize_scalar</span><br><span class="line"><span class="keyword">import</span> tensorly <span class="keyword">as</span> tl</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">est_rank</span>(<span class="params">layer</span>):</span><br><span class="line">    W = layer.weight.data</span><br><span class="line">    <span class="comment"># W = W.detach().numpy() #the weight has to be a numpy array for tl but needs to be a torch tensor for EVBMF</span></span><br><span class="line">    mode3 = tl.base.unfold(W.detach().numpy(), <span class="number">0</span>)</span><br><span class="line">    mode4 = tl.base.unfold(W.detach().numpy(), <span class="number">1</span>)</span><br><span class="line">    diag_0 = EVBMF(torch.tensor(mode3))</span><br><span class="line">    diag_1 = EVBMF(torch.tensor(mode4))</span><br><span class="line"></span><br><span class="line">    <span class="comment"># round to multiples of 16</span></span><br><span class="line">    multiples_of = <span class="number">8</span> <span class="comment"># this is done mostly to standardize the rank to a standard set of numbers, so that </span></span><br><span class="line">    <span class="comment"># 
you do not end up with ranks 7, 9 etc. those would both be approximated to 8.</span></span><br><span class="line">    <span class="comment"># that way you get a sense of the magnitude of ranks across multiple runs and neural networks</span></span><br><span class="line">    <span class="comment"># return int(np.ceil(max([diag_0.shape[0], diag_1.shape[0]]) / 16) * 16)</span></span><br><span class="line">    <span class="keyword">return</span> <span class="built_in">int</span>(np.ceil(<span class="built_in">max</span>([diag_0.shape[<span class="number">0</span>], diag_1.shape[<span class="number">0</span>]]) / multiples_of) * multiples_of)</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">EVBMF</span>(<span class="params">Y, sigma2=<span class="literal">None</span>, H=<span class="literal">None</span></span>):</span><br><span class="line">    <span class="string">&quot;&quot;&quot;Implementation of the analytical solution to Empirical Variational Bayes Matrix Factorization.</span></span><br><span class="line"><span class="string">    This function can be used to calculate the analytical solution to empirical VBMF.</span></span><br><span class="line"><span class="string">    This is based on the paper and MatLab code by Nakajima et al.:</span></span><br><span class="line"><span class="string">    &quot;Global analytic solution of fully-observed variational Bayesian matrix factorization.&quot;</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">    Notes</span></span><br><span class="line"><span class="string">    -----</span></span><br><span class="line"><span class="string">        If sigma2 is unspecified, it is estimated by minimizing the free energy.</span></span><br><span class="line"><span class="string">        If H is unspecified, it is set to the smallest of the sides of the input Y.</span></span><br><span class="line"><span 
class="string"></span></span><br><span class="line"><span class="string">    Attributes</span></span><br><span class="line"><span class="string">    ----------</span></span><br><span class="line"><span class="string">    Y : numpy-array</span></span><br><span class="line"><span class="string">        Input matrix that is to be factorized. Y has shape (L,M), where L&lt;=M.</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">    sigma2 : int or None (default=None)</span></span><br><span class="line"><span class="string">        Variance of the noise on Y.</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">    H : int or None (default = None)</span></span><br><span class="line"><span class="string">        Maximum rank of the factorized matrices.</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">    Returns</span></span><br><span class="line"><span class="string">    -------</span></span><br><span class="line"><span class="string">    U : numpy-array</span></span><br><span class="line"><span class="string">        Left-singular vectors.</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">    S : numpy-array</span></span><br><span class="line"><span class="string">        Diagonal matrix of singular values.</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">    V : numpy-array</span></span><br><span class="line"><span class="string">        Right-singular vectors.</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">    post : dictionary</span></span><br><span class="line"><span class="string">        Dictionary containing the computed posterior values.</span></span><br><span class="line"><span 
class="string"></span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">    References</span></span><br><span class="line"><span class="string">    ----------</span></span><br><span class="line"><span class="string">    .. [1] Nakajima, Shinichi, et al. &quot;Global analytic solution of fully-observed variational Bayesian matrix factorization.&quot; Journal of Machine Learning Research 14.Jan (2013): 1-37.</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">    .. [2] Nakajima, Shinichi, et al. &quot;Perfect dimensionality recovery by variational Bayesian PCA.&quot; Advances in Neural Information Processing Systems. 2012.</span></span><br><span class="line"><span class="string">    &quot;&quot;&quot;</span></span><br><span class="line">    L, M = Y.shape  <span class="comment"># has to be L&lt;=M</span></span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> H <span class="keyword">is</span> <span class="literal">None</span>:</span><br><span class="line">        H = L</span><br><span class="line"></span><br><span class="line">    alpha = L / M</span><br><span class="line">    tauubar = <span class="number">2.5129</span> * np.sqrt(alpha)</span><br><span class="line"></span><br><span class="line">    <span class="comment"># SVD of the input matrix, max rank of H</span></span><br><span class="line">    U, s, V = torch.svd(Y)</span><br><span class="line">    U = U[:, :H]</span><br><span class="line">    s = s[:H]</span><br><span class="line">    V[:H].t_()</span><br><span class="line"></span><br><span class="line">    <span class="comment"># Calculate residual</span></span><br><span class="line">    residual = <span class="number">0.</span></span><br><span class="line">    <span class="keyword">if</span> H &lt; L:</span><br><span class="line">        residual = torch.<span 
class="built_in">sum</span>(torch.<span class="built_in">sum</span>(Y ** <span class="number">2</span>) - torch.<span class="built_in">sum</span>(s ** <span class="number">2</span>))</span><br><span class="line"></span><br><span class="line">    <span class="comment"># Estimation of the variance when sigma2 is unspecified</span></span><br><span class="line">    <span class="keyword">if</span> sigma2 <span class="keyword">is</span> <span class="literal">None</span>:</span><br><span class="line">        xubar = (<span class="number">1</span> + tauubar) * (<span class="number">1</span> + alpha / tauubar)</span><br><span class="line">        eH_ub = <span class="built_in">int</span>(np.<span class="built_in">min</span>([np.ceil(L / (<span class="number">1</span> + alpha)) - <span class="number">1</span>, H])) - <span class="number">1</span></span><br><span class="line">        upper_bound = (torch.<span class="built_in">sum</span>(s ** <span class="number">2</span>) + residual) / (L * M)</span><br><span class="line">        lower_bound = np.<span class="built_in">max</span>([s[eH_ub + <span class="number">1</span>] ** <span class="number">2</span> / (M * xubar), torch.mean(s[eH_ub + <span class="number">1</span>:] ** <span class="number">2</span>) / M])</span><br><span class="line"></span><br><span class="line">        scale = <span class="number">1.</span>  <span class="comment"># /lower_bound</span></span><br><span class="line">        s = s * np.sqrt(scale)</span><br><span class="line">        residual = residual * scale</span><br><span class="line">        lower_bound = <span class="built_in">float</span>(lower_bound * scale)</span><br><span class="line">        upper_bound = <span class="built_in">float</span>(upper_bound * scale)</span><br><span class="line"></span><br><span class="line">        sigma2_opt = minimize_scalar(EVBsigma2, args=(L, M, s, residual, xubar), bounds=[lower_bound, upper_bound],</span><br><span class="line">                                  
   method=<span class="string">&#x27;Bounded&#x27;</span>)</span><br><span class="line">        sigma2 = sigma2_opt.x</span><br><span class="line"></span><br><span class="line">    <span class="comment"># Threshold gamma term</span></span><br><span class="line">    threshold = np.sqrt(M * sigma2 * (<span class="number">1</span> + tauubar) * (<span class="number">1</span> + alpha / tauubar))</span><br><span class="line"></span><br><span class="line">    pos = torch.<span class="built_in">sum</span>(s &gt; threshold)</span><br><span class="line">    <span class="keyword">if</span> pos == <span class="number">0</span>: <span class="keyword">return</span> np.array([])</span><br><span class="line"></span><br><span class="line">    <span class="comment"># Formula (15) from [2]</span></span><br><span class="line">    d = torch.mul(s[:pos] / <span class="number">2</span>,</span><br><span class="line">                  <span class="number">1</span> - (L + M) * sigma2 / s[:pos] ** <span class="number">2</span> + torch.sqrt(</span><br><span class="line">                      (<span class="number">1</span> - ((L + M) * sigma2) / s[:pos] ** <span class="number">2</span>) ** <span class="number">2</span> - \</span><br><span class="line">                      (<span class="number">4</span> * L * M * sigma2 ** <span class="number">2</span>) / s[:pos] ** <span class="number">4</span>))</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> torch.diag(d)</span><br></pre></td></tr></table></figure><p>You can find the full EVBMF code on my GitHub page; I do not go into it in detail here. Jacob Gildenblat&#8217;s code is a great resource for an in-depth look at this algorithm.</p><h1 id="Conclusion"><a href="#Conclusion" class="headerlink" title="Conclusion"></a>Conclusion</h1><p>So why is all this needed? The main reason is that we can reduce the number of operations needed to perform a convolution. 
This is particularly important in embedded systems, where the number of operations is a hard constraint. The second reason is that we can reduce the number of parameters in a neural network, which can help with overfitting. The final reason is that we can reduce the amount of memory needed to store the network, which matters on mobile devices where memory is a hard constraint.<br>What does this mean mathematically? Fundamentally, it means that neural networks are over-parameterized, i.e., they have far more parameters than the information they represent. By reducing the rank of the matrices needed to carry out a convolution, we represent the same operation (as closely as possible) with far less information. </p><h1 id="References"><a href="#References" class="headerlink" title="References"></a>References</h1><ul><li><a href="https://arxiv.org/pdf/1511.06067">Low Rank Approximation of CNNs</a></li><li><a href="https://arxiv.org/pdf/1412.6553">CP Decomposition</a></li><li>Kolda &amp; Bader, “Tensor Decompositions and Applications,” SIAM Review, 2009</li><li>[1] Nakajima, Shinichi, et al. “Global analytic solution of fully-observed variational Bayesian matrix factorization.” Journal of Machine Learning Research 14.Jan (2013): 1-37.</li><li>[2] Nakajima, Shinichi, et al. “Perfect dimensionality recovery by variational Bayesian PCA.” Advances in Neural Information Processing Systems, 2012.</li><li><a href="https://github.com/CasvandenBogaard/VBMF">Python implementation of EVBMF/VBMF</a></li><li><a href="https://jacobgil.github.io/deeplearning/tensor-decompositions-deep-learning">Accelerating Deep Neural Networks with Tensor Decompositions - Jacob Gildenblat</a></li><li><a href="https://medium.com/@anishhilary97/low-rank-approximation-for-4d-kernels-in-convolutional-neural-networks-through-svd-65b30dc55f6b">A similar, more high-level article on SVD for 4D kernels</a></li></ul>]]>
    </content>
    <id>https://franciscormendes.github.io/2024/09/13/lora-3/</id>
    <link href="https://franciscormendes.github.io/2024/09/13/lora-3/"/>
    <published>2024-09-13T00:00:00.000Z</published>
    <summary>In this post, we explore the Low Rank Approximation (LoRA) technique for shrinking neural networks to run on embedded systems. We focus on the Convolutional Neural Network (CNN) case and discuss the rank selection process.</summary>
    <title>Part III :  What does Low Rank Factorization of a Convolutional Layer really do?</title>
    <updated>2026-04-10T14:24:00.552Z</updated>
  </entry>
  <entry>
    <author>
      <name>Francisco Romaldo Fernandes Mendes</name>
    </author>
    <category term="book-review" scheme="https://franciscormendes.github.io/categories/book-review/"/>
    <category term="book-review" scheme="https://franciscormendes.github.io/tags/book-review/"/>
    <category term="politics" scheme="https://franciscormendes.github.io/tags/politics/"/>
    <category term="fiction" scheme="https://franciscormendes.github.io/tags/fiction/"/>
    <content>
      <![CDATA[<p>It is hard to write this book review without overusing superlatives. Widely regarded as the inspiration for both 1984 and Brave New World, this book serves as the template for revolution in a dystopian world ruled by an all-encompassing police state. Written during the early years of the Soviet Union, it holds the dubious distinction of being one of the first (if not the first) books to be banned by the Communist Party.</p><p>I’ve been on a bit of a Soviet literature tear lately, and this book has been on my list for some time. I’m probably not the target demographic for most modern science fiction, but this book is so much more than that.</p><h1 id="Plot"><a href="#Plot" class="headerlink" title="Plot"></a>Plot</h1><p>The translation conveys an enthusiastic, eager, albeit halting and fragmented tone. This accurately reflects D-503’s (the protagonist’s) hurried journal entries, which form the chapters of the novel.</p><p>The book is set in 2600 AD, just after the 200 Years War (a war to end all wars), which concluded with the formation of One State, a totalitarian regime where everything, including sex, follows a fixed schedule regulated by a complex network of bureaucracies.</p><p>Refreshingly, for a science fiction novel, the protagonist is not unhappy with the status quo—he embraces and enjoys it. The One State aligns with his ideals of order, method, and rationality. Everything changes when he meets and subsequently falls in love with I-330.</p><p>While both 1984 and Brave New World feature similar female characters, the sexual attraction of I-330 is but a small facet of her complex appeal to D-503. She represents a curve in his life of straight lines and right angles. 
She is a violation of the order he holds so dear, yet he finds himself irresistibly drawn to her.</p><h1 id="Major-Themes"><a href="#Major-Themes" class="headerlink" title="Major Themes"></a>Major Themes</h1><p>The obvious metaphors of One State for the Soviet police state are compelling, as are D-503’s eulogies to its necessity and importance in daily life. This is a significant difference from most dystopian fiction, which often depicts the downsides of dystopia but none of its motivating factors. Here we have the journal entries of a true believer, and we are treated to extensive philosophical insight into why One State exists and why it is good.</p><p><em>“The only means of ridding man of crime is ridding him of freedom.”</em></p><p>The idea that eliminating freedom is the only way to rid man of crime is a recurrent theme in the book; of its many references to Christianity, I found this the most compelling. Religion (and personal morality) only exists where there is freedom; without freedom, there is no need for personal morality. This is why One State was so successful: by eliminating all personal freedoms, it obviated the need for religion.</p><p>In the modern Western world, we often take it for granted that individual freedoms are the most basic right. This book asks the reader to challenge that idea substantively via conversations with I-330. One of the less understood ideas of communism is the importance of the collective, an idea almost incomprehensible to those educated primarily in Western philosophy. 
There is a scene in HBO’s Chernobyl in which hundreds of workers clear radioactive debris from the roof of the nuclear plant with what seems an irrational disregard for personal safety, all for the greater good of the Soviet Union; it captures this sentiment well. This book reiterates that theme across many mundane freedoms, such as the right to privacy, the right to procreate, and the right to choose how to lead one’s life.</p><p><em>“We comes from God, I from the Devil.”</em></p><p>The above quote captures that sentiment <em>better</em> than the scene from Chernobyl does. It is interesting that Zamyatin chose this particular turn of phrase; it corresponds to an idea in the Eastern Orthodox Church that faith can only exist as a collective, not as an individual.</p><p>When referring to belief in God, “I” is almost never used in the Orthodox Church. That is why, in Orthodox prayer, there is no “<em>I</em> believe in God”, only “<em>we</em> believe in God”.</p><p>This theme often occurs in juxtaposition with another: the contrast and similarity between religion and science. This is directly at odds with most of what we now experience in the West, where Science and Individual Freedom are core tenets of most modern societies; Zamyatin portrays these ideas as fundamentally in conflict with each other. To One State, science would measure every aspect of the human experience and make cold-blooded calculations of cost and benefit, eliminating whatever does not justify its cost. What does it matter if one life is lost as long as the lives of countless others are preserved? Science allows us to measure everything; why not the human experience? 
Gradually, belief in the Science of One State obscures its own rationale and assumptions.</p><p><em>“knowledge, absolutely sure of its infallibility, is faith”</em></p><p>And a move away from the safety and comfort of rational science is a nightmare to him:</p><p><em>“Now I no longer live in our clear, rational world; I live in the ancient nightmare world, the world of square roots of minus one.”</em></p><p>In addition to the philosophical metaphors, the book is rife with references to mathematics, in the form of the Taylor and Maclaurin series, which form part of the broader mathematical narrative woven around life in One State. This remains relevant almost a hundred years after the book was written: the desire to quantify and measure the human experience is innate to those who seek to mimic how science is applied to other, more tangible fields.</p><h1 id="Summary"><a href="#Summary" class="headerlink" title="Summary"></a>Summary</h1><p>I left this book with a greater appreciation for the randomness and disorder inherent in human beings, and for how that is our defining characteristic. Modern capitalist societies can benefit from redistribution through centralization and a stronger sense of community. However, this book offers valuable insight into the dangers of excessive centralization. The fact that the Soviet state remains the only large-scale implementation of communism and serves as the inspiration for One State is a humbling reminder that the economic Left is vulnerable to a complete loss of freedom, even when successful in achieving its overarching ideals. Often, the line between utopia and dystopia is blurred. It seems fitting that my next review could very well be Westad’s The Cold War.</p>]]>
    </content>
    <id>https://franciscormendes.github.io/2024/09/07/we-evgeny-zamyatin/</id>
    <link href="https://franciscormendes.github.io/2024/09/07/we-evgeny-zamyatin/"/>
    <published>2024-09-07T00:00:00.000Z</published>
    <summary>Zamyatin's We — the ur-text that preceded 1984 and Brave New World — read as a mathematical argument for why collective happiness cannot be solved like an equation.</summary>
    <title>Book Review : We - Evgeny Zamyatin</title>
    <updated>2026-04-10T14:24:00.565Z</updated>
  </entry>
  <entry>
    <author>
      <name>Francisco Romaldo Fernandes Mendes</name>
    </author>
    <category term="opinion" scheme="https://franciscormendes.github.io/categories/opinion/"/>
    <category term="economics" scheme="https://franciscormendes.github.io/tags/economics/"/>
    <category term="politics" scheme="https://franciscormendes.github.io/tags/politics/"/>
    <category term="game-theory" scheme="https://franciscormendes.github.io/tags/game-theory/"/>
    <content>
      <![CDATA[<h1 id="Introduction"><a href="#Introduction" class="headerlink" title="Introduction"></a>Introduction</h1><p>It is perhaps better to start this article off by clarifying what it is <strong>not</strong> rather than what it <strong>is</strong>. First, this is not a comprehensive review of RFK’s policies and what he stands for (there are far better places to seek that information). Second, this is not meant to convince you to vote one way or another based on policy and beliefs (again, there are far better places for that too). So then what the <em>blazes</em> did I write this for? Well, the motivation for this article comes from multiple conversations with friends and family who want to know more about voting for independents in general and RFK in particular, addressing questions such as:</p><ul><li>“Is it a wasted vote?” </li><li>“Do I vote for RFK to make a point?” &#x2F; “If we do not vote for Independents, then how will they ever win?”</li></ul><p>I believe that these are important questions to ask, and I hope to address them in this article. In order to answer them, I will first explain the voting system in place and the various strategies that can be used to <em>win</em> an election:</p><ul><li><p>Differences between Parliamentary (such as the UK and India) and Winner-Takes-All Democracy (USA).</p></li><li><p>Splitting the vote - what it really means. Different kinds of potential RFK voters and why they matter.</p></li><li><p>Strategic Misreporting - why people who say they might vote for RFK might not actually vote for RFK but simply want <em>you</em> to vote for him.</p></li></ul><p>As a recovering Game Theorist, I love to look at elections as “games” and therefore I will use the word “strategy” a lot. A strategy in this sense is an action (in this context, voting for a candidate). In the game theoretic structure, we assume that a player (i.e. YOU) is playing to win. But what does it mean to win? 
In this context, winning means getting policies you care about enacted. I will also briefly address the issue of voting “to make a point” about the current system and why I think that is a bad idea. But for the most part, I assume that the reader wants to get policies they care about enacted.</p><p>Equally important, I will assume that political parties have at least some motivation to get elected. While getting elected is not the only motivation of political parties, it is certainly a very important one and allows us to separate out our strategies for voting for them.</p><h1 id="Differences-in-Democracy"><a href="#Differences-in-Democracy" class="headerlink" title="Differences in Democracy"></a>Differences in Democracy</h1><p>Perhaps the least understood part of this discussion is the inherent difference between Parliamentary democracy and Winner-Takes-All democracy (technically, a first-past-the-post plurality system, but I feel the formal term obscures its meaning). Before deciding what you should do, it is worthwhile to understand how the system you are voting within intends for voters to think. The two systems intend quite different things, with vastly different implications. Usually, the choice of system has more to do with the history and socio-cultural context at the time the democracy was set up. It is very difficult to argue (vehemently, at least) for one over the other. But certainly, one should try to understand why a particular system was chosen and at least try to engage with viable strategies within that system.</p><p>For much of this article, I will consider $3$ hypothetical political parties: KH and DJT, which are large and usually take most of the vote share, and RFK, a small independent. 
I will consider two hypothetical elections, one in a Parliamentary democracy and one in a Winner-Takes-All democracy.</p><h2 id="Parliamentary-Democracy"><a href="#Parliamentary-Democracy" class="headerlink" title="Parliamentary Democracy"></a>Parliamentary Democracy</h2><p>Consider $3$ candidates with the following vote shares and $100$ seats in the “Parliament” of a hypothetical parliamentary democracy (number of seats won in brackets).</p><ul><li><p>KH : $41\%$ ($41$ seats)</p></li><li><p>DJT : $37\%$ ($37$ seats)</p></li><li><p>RFK : $22\%$ ($22$ seats)</p></li></ul><p>In a parliamentary democracy, KH narrowly wins the election. However (and this is a big caveat), every time a decision needs to be made, any one party must form an “alliance” with some or all of the other parties to pass the $50\%$ mark. This means that a significant number of independents need to be swayed in order to pass a law (by either side). By the same token, DJT’s influence is not insignificant, as they need to sway just $4$ more independents than KH to pass the laws they want. This system sends a clear message about voting strategy: you can (and should, if you want to) vote for a party that is smaller than the other two, and its voice will be heard at every vote. It also comes with a clear disadvantage: you need to appeal to independents at every voting instance. This is particularly bad in a situation like this,</p><ul><li><p>KH : $49\%$ ($49$ seats)</p></li><li><p>DJT : $48\%$ ($48$ seats)</p></li><li><p>RFK : $3\%$ ($3$ seats)</p></li></ul><p>In situations like this, RFK can hold up legislation that almost $49\%$ of the country wants. Bear in mind that bills in any democracy do not work in isolation, so RFK can hold up a super important bill (Free Childcare) that <em>even</em> their $3\%$ want in exchange for a bill that <strong>only</strong> their $3\%$ want (Bitcoin deregulation). 
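</p><p>The coalition arithmetic above can be checked mechanically. As a small sketch (using the hypothetical parties and seat counts from the $41/37/22$ example), the following enumerates every coalition that reaches a majority of the $100$ seats:</p>

```python
from itertools import combinations

# Hypothetical seat counts from the 41/37/22 parliamentary example.
seats = {"KH": 41, "DJT": 37, "RFK": 22}
majority = 51  # 51 of 100 seats

# Enumerate every coalition of parties that reaches a majority.
winning = [c for r in range(1, len(seats) + 1)
           for c in combinations(seats, r)
           if sum(seats[p] for p in c) >= majority]
print(winning)
# [('KH', 'DJT'), ('KH', 'RFK'), ('DJT', 'RFK'), ('KH', 'DJT', 'RFK')]
```

<p>No party clears the bar alone, and every two-party pairing does, which is exactly why RFK’s $22$ seats buy outsized leverage: RFK sits in two of the three minimal winning coalitions.</p><p>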
There are two other implications that are essential to understanding the Parliamentary system.</p><ul><li><p>The first is that parliamentary democracies encourage a proliferation of independent parties. They do this to the extent that the term “independent party” loses all meaning, and there is simply a large number of parties catering to ever more niche demographics, in combinations that can sometimes seem hilariously contradictory (Pro-Environment, Pro-Socialism) versus (Anti-Environment, Pro-Socialism).</p></li><li><p>The second is that “winning” in a parliamentary democracy ends up being one of two things. You either get $51\%$ of the seats in parliament, or you form a coalition that adds up to $51\%$ using various smaller parties. In such a coalition, parties will often “give up” some of their essential ideas (Environment) in exchange for passing laws that support another, perhaps more important, essential idea (Socialism).</p></li></ul><p>Notice that voting for more and more independent parties does not lead to more diversity in voting ideologies; it just means that the reduction in diversity is left up to the party representatives, not the voters.</p><p>For example, say you voted for a Pro-Environment, Pro-Socialist party. Since they are a niche party, they formed a coalition with a Socialist party and gave up on environmental regulation. Had you known the full result of the election in advance, you might not have wanted to give up on Environmentalism; you might have given up on Socialism instead. For instance, you could think, <em>if I cannot live in a cleaner environment I might as well have free markets</em>.</p><p>This paints a picture of a democracy that is very unstable. It is. Since the resolution of conflicting ideas takes place in parliament, it is very difficult to gauge which issues are deal breakers for the voting population. 
But over time, Parliamentary democracies tend to form $2$ major parties with a constellation of smaller parties that reflect minor interest groups. Governments are formed by one of the two major parties together with a collection of smaller parties. We now turn to the other case.</p><p>To fix the issue of stability and to reduce the outsized influence of smaller parties, another form of democracy has been devised that addresses these issues directly. </p><h2 id="Winner-Takes-All-Democracy"><a href="#Winner-Takes-All-Democracy" class="headerlink" title="Winner Takes All Democracy"></a>Winner Takes All Democracy</h2><p>It is a bit complicated to show an exact example of the winner-takes-all system as it works in the US, but the following is a good approximation. In this example, there is no parliament; there is just a president, who can do whatever they want for the length of their term. Consider an election with the following vote shares,</p><ul><li><p>KH : $45\%$</p></li><li><p>DJT : $44\%$</p></li><li><p>RFK : $11\%$</p></li></ul><p>In this example, KH can pass all the laws they want. It does not matter that they do not have $50\%$ of the vote share. Notice also that <em>more</em> people did <em>not</em> want KH to be in power. Potentially <em>all</em> of RFK’s supporters (more on this later) could have preferred DJT to KH had they known the results of the election beforehand.</p><p>What are the implications of this kind of democracy?</p><ul><li><p>First, notice that <em>after</em> the election the elected person is essentially a dictator. There is no need for any negotiation or working with other parties. This is not necessarily a bad thing, since much of the confusion and instability of Parliamentary democracy is done away with.</p></li><li><p>Second, notice that there is a strong disincentive for other political parties to form, since even at a fairly high vote share you can end up with $0$ seats. 
Consider this example,</p></li><li><p>KH : $41\%$ of the vote (the presidency)</p></li><li><p>DJT : $37\%$ of the vote (nothing)</p></li><li><p>RFK : $22\%$ of the vote (nothing)</p></li></ul><p>While people who voted for KH might well consider voting for her again, some of the supporters of RFK might consider either:</p><ul><li><p>Not voting at all - which is why voter turnout is such an issue in US elections</p></li><li><p>Trying to persuade DJT to accept them into the party and fighting to change some of its core values (maybe considering the environment more).</p></li></ul><h1 id="Summary-of-Differences-in-Democracy-Styles"><a href="#Summary-of-Differences-in-Democracy-Styles" class="headerlink" title="Summary of Differences in Democracy Styles"></a>Summary of Differences in Democracy Styles</h1><p>The key takeaway is that in both systems you eventually have to reconcile your differences to reach that $51\%$ mark. In the Parliamentary system you leave it up to the person you vote for, no matter how small their party is. But in the Winner-Takes-All system, you have to do it yourself, or you risk coming away with nothing (hence the Winner Takes ALL!). Either way, some (or most) of your ideological differences will be resolved to reach a decision.</p><h1 id="Opinion-So-What-Should-You-Do"><a href="#Opinion-So-What-Should-You-Do" class="headerlink" title="Opinion : So What Should You Do?"></a>Opinion : So What Should You Do?</h1><p>Well, one thing is clear: since the US is a Winner-Takes-All system, you should reconcile your differences with the major parties and place your vote there. Whatever the reason this system was chosen in the US, it puts the pressure of reconciling one’s differences on oneself. This is perhaps why we have a two-party system in the first place. The motivation for a voter to vote for an independent is very low (but there is one situation in which it makes sense; more on that later), to the point that it has prevented the formation of more parties. 
This is why it is ironic that many independents run on a platform of plurality of opinion but do not actually advocate changing the voting system so that more political parties are motivated to coalesce around different combinations of ideas. But short of that, it is up to you to vote for a major party after giving up on some of your ideals.</p><h1 id="Implications-for-Reconciling-Differences"><a href="#Implications-for-Reconciling-Differences" class="headerlink" title="Implications for Reconciling Differences"></a>Implications for Reconciling Differences</h1><p>If you are reading this far, it means you are at least considering voting for the major parties. One thing is clear when reconciling your differences: you need to figure out which party you would vote for if your top choice did not exist. Thus two kinds of RFK voters exist,</p><ul><li>$RFK \succ KH \succ DJT$</li><li>$RFK \succ DJT \succ KH$</li></ul><p>where $a \succ b$ means you would vote for $a$ over $b$. For instance, if after casting your vote for RFK and seeing him lose you would rather DJT won (had you known RFK would not win), then DJT is your second choice. Thus, imagine a world in which RFK lost and think about whom you would have preferred; that is who you should vote for. Similarly, if you voted for RFK and DJT won, and you wished you had voted for KH, then your second choice is KH.</p><p>There is, however, one (and only one) situation in which you should vote for RFK: the situation in which you are truly indifferent between DJT and KH. That is, IF, on the day after the election, you truly would not care which of the two had won. I think such voters are likely to be of two kinds (and I do not think readers of this article are likely to be either).</p><p><strong>Non-voters</strong> : They would probably not have voted anyway. 
If you would vote even if RFK were not running, then this is NOT you.</p><p><strong>Ideologically inconsistent</strong> : Since independents like RFK generally seek to appeal to both parties and therefore take centrist positions, it is hardly possible for someone to be truly indifferent between KH and DJT. For example, consider the following policy positions:</p><ul><li>RFK (Pro-Life, Pro-Environment)</li><li>DJT (Pro-Life, Anti-Environment)</li><li>KH (Pro-Choice, Pro-Environment)</li></ul><p>If you really are indifferent between KH and DJT, then you are indifferent between (Pro-Life, Anti-Environment) and (Pro-Choice, Pro-Environment). This is unlikely: these are such salient issues that you would almost certainly have an opinion on which you would rather have. If you really are indifferent about such important issues, you are not an ideological voter and are motivated by something other than getting policies you care about enacted. This could be someone who votes for RFK to “make a point” about the current system, but it could equally be someone who votes based on personality rather than on issues alone.</p><h1 id="Strategic-Implications"><a href="#Strategic-Implications" class="headerlink" title="Strategic Implications"></a>Strategic Implications</h1><p>Interestingly, it is in the interest of the party that thinks it will lose to promote the independent candidate. 
Consider the following strategy by DJT,</p><ul><li>Promote RFK as an independent (ask your donors to donate to him).</li><li>Appear as similar as possible to RFK (public appearances, phone calls, etc.).</li><li>Make sure that RFK is on the ballot in as many states as possible.</li></ul><p>With this strategy, RFK can be made to appear very similar to you but different enough from KH, thereby ensuring that your voter base stays intact while people defect from KH.</p><h2 id="Strategic-Misreporting"><a href="#Strategic-Misreporting" class="headerlink" title="Strategic Misreporting"></a>Strategic Misreporting</h2><p>There is another, more complex issue that is known to occur in voting. The best way to understand it is to note that people disclose their voting intentions voluntarily, and that these disclosures are never verified: you can say you are going to vote for any candidate, and no one will ever know whether you did. People misreport for a variety of reasons, including embarrassment, social pressure and privacy; with the rise of far-right parties in Europe, for instance, people are less likely to admit to having voted for them. One of the most interesting reasons to misreport, however, is strategic. 
Consider the following strategy,</p><ul><li>You are a DJT voter and you know that RFK is more likely to take votes away from KH than from DJT.</li><li>You tell people you are going to vote for RFK; this will encourage other people to vote for RFK.</li><li>This makes it more likely that KH, but not DJT, will lose votes to RFK.</li></ul><p>Thus, when discussing your voting strategy, it is important to remember that a voter whose second choice is KH and a voter whose second choice is DJT are fundamentally different people.</p><h1 id="Conclusion"><a href="#Conclusion" class="headerlink" title="Conclusion"></a>Conclusion</h1><ul><li>“Is it a wasted vote?”</li></ul><p>Yes, it is. For the reasons above, the American system expects you to reconcile your differences with the major parties and then cast your vote. If you do not, you will come away with one of the following:</p><ul><li>your third-choice candidate winning and implementing policies that are objectively worse for you, or</li><li>you voting for an independent while the people telling you to do so do not (strategic misreporting).</li></ul><ul><li>“Do I vote for RFK to make a point?” &#x2F; “If I do not, then how will an independent ever win?”</li></ul><p>No, you should not. The reason that independents do not win has more to do with the system than with the fact that they do not get enough votes. Even if an independent ends up with a very high share of the vote, they can end up with no representation. The system is inherently Winner-Takes-All. Now you could ask, “why not change the system?”, and that is a good question. Unfortunately, that would need to be done by the major parties, and they have no incentive to do so. But the best way to pursue it is to vote for a candidate from the major parties who has a policy of changing the voting system. 
Best of luck with that.</p><p>In the past, many candidates have run as independents and garnered huge amounts of popular support (at the primary stage), but these candidates have inevitably joined one of the two parties. So what ends up happening is one of two things, </p><ul><li>If the major parties think an independent is popular and risks taking a big chunk of vote share, they offer them a ticket.</li><li>If the major parties do not view them as a risk, they ignore them and hope they do not take too much vote share. If they do take vote share, this has the effect of penalizing the candidate with the less fanatical (nationalistic&#x2F; personality-driven) supporters, since those supporters are more open to truly voting based on ideology.</li></ul><p>I think that rank-order voting is a good system to implement in the US, and advocating directly for it is a better strategy than voting for an independent. As I said, it is funny that independents do not directly advocate for this system, but it is likely that they are not able to get enough votes to be taken seriously.</p><p>Let us conclude with an example of rank-order voting. In this system, instead of voting for a single candidate, you rank all the candidates: each candidate scores points equal to their position on each ballot ($1$ for first place, $3$ for last), and the candidate with the <em>least</em> points wins. That is, the count cares not only about how many ballots had your name at the top, but also about how many had you at the bottom. Suppose $100$ ballots are cast (counts in parentheses):<br>$KH \succ DJT \succ RFK$ (1)<br>$KH \succ RFK \succ DJT$ (40)<br>$DJT \succ KH \succ RFK$ (1)<br>$DJT \succ RFK \succ KH$ (36)<br>$RFK \succ KH \succ DJT$ (15)<br>$RFK \succ DJT \succ KH$ (7)</p><p>KH points : $41 \times 1 + 16 \times 2 + 43 \times 3 = 41 + 32 + 129 = 202$<br>DJT points : $37 \times 1 + 8 \times 2 + 55 \times 3 = 37 + 16 + 165 = 218$<br>RFK points : $22 \times 1 + 76 \times 2 + 2 \times 3 = 22 + 152 + 6 = 180$</p><p>This example illustrates what rank-order voting changes; notice several things. 
</p><ul><li>KH no longer wins. She has the most first-place votes ($41$), which would win the plurality election outright, but $43$ ballots rank her last, dragging her total up to $202$ points. </li><li>DJT fares worst of all ($218$ points) because of the huge number of people who had him at the bottom: not just the $40$ voters who had KH at the top, but also the $15$ voters who had RFK at the top and DJT at the bottom. </li><li>RFK, with the least points ($180$), actually wins. Even though he has only $22$ first-place votes, $76$ ballots rank him second and just $2$ rank him last; he is not as bad a candidate as the plurality result makes him seem.</li></ul><p>In the rank-order system you can use your third-place vote to essentially veto a bad candidate: it says “this is who I prefer at the top (RFK), but I definitely do not want my third-place candidate (DJT); I would rather have KH.” This allows the two different kinds of RFK voters to express <em>both</em> of their preferences. </p>]]>
    </content>
    <id>https://franciscormendes.github.io/2024/08/12/rfk/</id>
    <link href="https://franciscormendes.github.io/2024/08/12/rfk/"/>
    <published>2024-08-12T00:00:00.000Z</published>
    <summary>Game theory applied to third-party voting: why winner-takes-all systems punish independent votes, and what Nash equilibrium says about whether voting for RFK is ever strategically rational.</summary>
    <title>What does Game Theory say about voting for RFK?</title>
    <updated>2026-04-10T14:24:00.561Z</updated>
  </entry>
</feed>
