/home/andrew$

Some thoughts on physics, statistics, computing & technology

Delving into focal words on inSPIRE HEP

March 04, 2026 — Andrew Fowlie

You've probably noticed that large language models (LLMs) have favourite words that they use more often than human writers. These are known as focal words, and the phenomenon of focal words is non-trivial. arXiv:2412.11385 calls it 'the puzzle of lexical overrepresentation'.

I thought I'd check the appearance of a focal word in the high-energy physics literature by querying the inSPIRE HEP database. I used the 'fulltext' search and looked at the word 'delve'. I think this does some kind of stemming, so that, e.g., 'delve' also matches 'delving'. I normalized the results to the total number of papers per year. The results are:

Frequency of word delve
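For reproducibility, here is a sketch of how one might count such matches with the inSPIRE REST API. The endpoint is real, but the query shorthand ('ft' for fulltext, 'de' for date) is my recollection of the SPIRES-style syntax, so verify it against the inSPIRE search help before relying on it.

```python
import requests

INSPIRE_API = "https://inspirehep.net/api/literature"


def build_query(word, year):
    # SPIRES-style shorthand: 'ft' for fulltext, 'de' for date.
    # This syntax is an assumption from memory -- check the inSPIRE
    # search documentation.
    return f"ft {word} and de {year}"


def count_hits(query):
    # size=1 because we only want the total hit count, not the records
    resp = requests.get(INSPIRE_API, params={"q": query, "size": 1}, timeout=30)
    resp.raise_for_status()
    return resp.json()["hits"]["total"]
```

The per-year frequency in the plot would then be `count_hits(build_query("delve", year))` divided by the total number of papers that year.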

Of course, authors could be influenced by LLMs or imitating 'good' writing produced by LLMs. I don't know much about this field. Make of it what you will.

I can understand the spike, but I'm not sure why it decreased back down in 2025. Perhaps LLMs have evolved and 'delve' isn't such a common focal word anymore? Perhaps writers are conscious of the hallmarks of LLMs in their work and edit out instances of 'delve'? Perhaps 'delve' was a buzzword that entered popular consciousness because of LLMs?

Tags: ai, academia

Curious trends in arXiv submission data

March 04, 2026 — Andrew Fowlie

There was a curious discussion at Peter Woit's blog concerning recent arXiv submission trends. It was observed (after some initial confusion) that the number of revisions appeared to have increased dramatically in the last month or so.

An (AI generated) analysis, available on GitHub, confirmed this pattern. The data look like this.

arXiv submission trends

What causes that surge in revisions (red) versus posts (blue)? This recent trend appears in all arXiv categories. The AI declares that it is a real trend and speculates that authors are submitting revisions using generative AI tools. I'm naturally skeptical, so I thought someone should at least build a statistical model of what this plot might look like, assuming nothing but a stationary process.

So I did. I took submissions per month to be about 250, $$ n \sim \textrm{Po}(250), $$ and assumed that authors posted a revision upon publication to match the published version, about six months later, $$ d \sim \textrm{Po}(6). $$ What do you know?

arXiv submission trends

In the current month (here month 60), you see the first submissions (that haven't been replaced yet) and revisions (of papers from previous months). In past months, you only see revisions, as the first submissions are later replaced.

This was an interesting example of a stationary process that produces a mirage of non-stationary behaviour (a surge in the current month). The explanation about AI revisions is unwarranted. On the other hand, there is almost certainly non-stationary behaviour in the dataset, as, e.g., the number of academics has increased over time.

Don't take my word for it, of course. Run it yourself. I'd love to see a Bayesian analysis that constructed a principled model and fitted it to the actual data.


"""
arXiv submission patterns
=========================
"""

import numpy as np
import matplotlib.pyplot as plt

rate_per_month = 250
publication_time_months = 6
end_month = 60  # 5 years


def simulate():

    # draw the number of new papers each month; record each paper's month
    initial = []
    for i in range(end_month):
        initial += np.random.poisson(rate_per_month) * [i + 1]

    # post a new version upon publication, some months later
    published = [a + np.random.poisson(publication_time_months)
                 for a in initial]

    # only revisions inside the simulation window are observed;
    # otherwise the initial submission remains the latest version
    final = [b if b <= end_month else a for a, b in zip(initial, published)]

    return initial, final


if __name__ == "__main__":

    initial, final = simulate()

    bins = np.arange(0.5, end_month + 1, 1)

    plt.hist(initial, bins=bins, label="Initial submission month")
    plt.hist(final, bins=bins, histtype="step", label="Final submission month")
    plt.legend()
    plt.xlabel("Month")
    plt.ylabel("Papers")
    plt.savefig("arxiv.png")
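As for the Bayesian analysis: one small ingredient can be written in closed form. With a Gamma prior on the monthly submission rate, the Poisson likelihood gives a conjugate Gamma posterior. This is only a sketch of the stationary piece on simulated counts, not a fit to the real arXiv data, and the prior parameters are arbitrary choices.

```python
import numpy as np


def posterior_rate(counts, alpha=1.0, beta=1e-3):
    """Conjugate update: Gamma(alpha, beta) prior, Poisson likelihood."""
    counts = np.asarray(counts)
    alpha_post = alpha + counts.sum()  # shape gains the total count
    beta_post = beta + counts.size  # rate gains the number of months
    return alpha_post, beta_post


# simulated monthly submission counts, as in the script above
counts = np.random.default_rng(1).poisson(250, size=60)
a, b = posterior_rate(counts)
print(f"posterior mean rate: {a / b:.1f} papers per month")
```

A principled model of the real data would add a drift term for the growing number of academics, then compare stationary and non-stationary fits.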

Tags: code, ai, arxiv, statistics

The Church of Reason

March 02, 2026 — Andrew Fowlie

Reading Zen and the Art of Motorcycle Maintenance. I didn't expect so much relevance to academia (though I knew little about the book):

The real University is not a material object. It is not a group of buildings that can be defended by police. He explained that when a college lost its accreditation, nobody came and shut down the school. There were no legal penalties, no fines, no jail sentences. Classes did not stop. Everything went on just as before. Students got the same education they would if the school didn’t lose its accreditation. All that would happen, Phaedrus said, would simply be an official recognition of a condition that already existed. It would be similar to excommunication. What would happen is that the real University, which no legislature can dictate to and which can never be identified by any location of bricks or boards or glass, would simply declare that this place was no longer “holy ground.” The real University would vanish from it, and all that would be left was the bricks and the books and the material manifestation …

The real University, he said, has no specific location. It owns no property, pays no salaries and receives no material dues. The real University is a state of mind. It is that great heritage of rational thought that has been brought down to us through the centuries and which does not exist at any specific location. It’s a state of mind which is regenerated throughout the centuries by a body of people who traditionally carry the title of professor, but even that title is not part of the real University. The real University is nothing less than the continuing body of reason itself.

Tags: reading, academia, philosophy

mp3 player

February 25, 2026 — Andrew Fowlie

I am trying to reduce my dependence on smartphones, as I have found them intrusive and distracting. I have a simple mp3 player that I can copy files to via USB-A. I like the design as I don't need a separate cable. It looks something like this one.

An MP3 player

It does support Bluetooth headphones, but Bluetooth earbuds are easy to lose and have to be charged, so I am sticking with wired headphones.

With that hardware, I need some software for subscribing to and downloading the podcasts that I listen to. I used to do this with the iPhone podcast app, but I like to drive computers from the keyboard and command line. I found existing libraries overly complicated: I want to run a script that updates the mp3 player, and that's it.

Thus, I created my own solution. The podcasts I want to listen to are stored as good-old-fashioned RSS feeds in a JSON file:


{
  "download_path": "~/podcasts",
  "sync_paths": [
    "/media/sdb1/Podcasts/"
  ],
  "default_keep": 1,
  "feeds": [
    {
      "url": "https://podcasts.files.bbci.co.uk/b0070hz6.rss"
    },
    {
      "url": "https://podcasts.files.bbci.co.uk/p02nrsln.rss"
    },
    {
      "url": "https://podcasts.files.bbci.co.uk/b006qpgr.rss"
    },
    {
      "url": "https://podcasts.files.bbci.co.uk/p02nrsc7.rss"
    },
    {
      "url": "https://podcasts.files.bbci.co.uk/p09k7ctp.rss"
    },
    {
      "url": "https://podcasts.files.bbci.co.uk/p02nrsjn.rss"
    },
    {
      "url": "https://librivox.org/rss/14834",
      "keep": "all"
    }
  ]
}

There are settings to control the number of episodes that are kept. A Python script reads this configuration, downloads episodes using requests, discards old episodes, and syncs them with the device using rsync. Thus, I plug in the mp3 player, run the script, and wait.

"""
Podcast downloader
==================
"""

import os
import re
import json
import shutil
import subprocess
from pathlib import Path

import feedparser
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


RC_FILE = Path.home() / ".config" / "ypod.json"


session = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[502, 503, 504])
adapter = HTTPAdapter(max_retries=retries)
session.mount("http://", adapter)
session.mount("https://", adapter)

MIME_TYPES = {
    "audio/mpeg": ".mp3",
    "audio/mp4": ".m4a",
    "audio/aac": ".m4a",
    "audio/x-m4a": ".m4a",
    "audio/ogg": ".ogg",
    "audio/opus": ".opus",
}


def safe_filename(text):
    """
    Convert string into lowercase, underscore-separated, strictly alphanumeric filename
    """
    text = text.lower()
    # Replace any non-alphanumeric character with underscore
    text = re.sub(r"[^a-z0-9]+", "_", text)
    # Collapse multiple underscores
    text = re.sub(r"_+", "_", text)
    # Strip leading/trailing underscores
    text = text.strip("_")
    return text


class Config:
    """
    Manage a config file
    """

    def __init__(self, rc_file):
        if not rc_file.exists():
            raise FileNotFoundError(f"Config not found: {rc_file}")

        self.rc_file = rc_file

        with open(self.rc_file, "r") as f:
            self.config = json.load(f)

    @property
    def download_path(self):
        return Path(self.config["download_path"]).expanduser()

    @property
    def feeds(self):
        return self.config["feeds"]

    @property
    def default_keep(self):
        return self.config.get("default_keep", 1)

    @property
    def default_archive(self):
        return self.config.get("default_archive", True)

    @property
    def sync_paths(self):
        sync_paths = self.config.get("sync_paths", [])
        return [Path(p).expanduser() for p in sync_paths]


def deduce_file_ext(entry, audio_url):
    """
    Guess file extension ahead of time
    """
    # Try URL
    url_file_ext = os.path.splitext(audio_url)[1].split("?")[0]

    if url_file_ext:
        return url_file_ext

    # Check RSS enclosure type
    if "type" in entry.enclosures[0]:
        mime = entry.enclosures[0].type
        if mime in MIME_TYPES:
            return MIME_TYPES[mime]

    # Head request
    try:
        resp = session.head(audio_url, allow_redirects=True, timeout=10)
        mime = resp.headers.get("Content-Type", "").split(";")[0]
        if mime in MIME_TYPES:
            return MIME_TYPES[mime]
    except requests.RequestException:
        pass

    # Fallback to mp3
    return ".mp3"


def download_entry(index, entry, download_path):
    """
    Download single entry from feed
    """
    episode_title = safe_filename(entry.title)

    if not entry.enclosures:
        print(f"Skipping {episode_title}: no audio found")
        return None

    audio_url = entry.enclosures[0].href
    file_ext = deduce_file_ext(entry, audio_url)
    file_name = download_path / f"{index}_{episode_title}{file_ext}"

    if file_name.exists():
        print(f"{file_name.name} already downloaded")
        return file_name

    with session.get(audio_url, stream=True, timeout=30) as response:
        response.raise_for_status()
        total_size = int(response.headers.get("content-length", 0))
        if total_size == 0:
            total_size_display = "?"
        else:
            total_size_display = f"{total_size / (1024 * 1024):.1f} MB"

        chunk_size = int(0.5 * 1024 * 1024)
        downloaded = 0

        with open(file_name, "wb") as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                if chunk:
                    f.write(chunk)
                    downloaded += len(chunk)
                    downloaded_display = f"{downloaded / (1024 * 1024):.1f} MB"
                    print(
                        f"\r{file_name.name} [{downloaded_display} / {total_size_display}]",
                        end="",
                        flush=True,
                    )

    print()  # move to next line after completion
    return file_name


def process_feed(feed, config):
    """
    Process a feed by downloading latest episode
    and archiving old ones
    """
    print(f"\nProcessing feed: {feed['url']}")
    resp = session.get(feed["url"], timeout=5)
    parsed = feedparser.parse(resp.content)

    if not parsed.entries:
        raise IOError(f"No episodes found in {feed['url']}")

    feed_title = safe_filename(parsed.feed.get("title", "unknown_feed"))
    print(f"Feed title: {feed_title}")

    # Download files

    latest_path = config.download_path / "latest" / feed_title
    latest_path.mkdir(parents=True, exist_ok=True)

    # sort oldest first; entries without a parsed date sort to the front
    entries = sorted(
        parsed.entries, key=lambda e: e.get("published_parsed") or (0,), reverse=False
    )

    indexed_entries = list(enumerate(entries))

    keep = feed.get("keep", config.default_keep)
    if keep == "all":
        keep = len(entries)

    latest_entries = indexed_entries[-keep:]
    latest_file_names = [download_entry(i, e, latest_path) for i, e in latest_entries]
    latest_file_names = [f for f in latest_file_names if f is not None]

    # Move any old files from latest

    archive = feed.get("archive", config.default_archive)
    previous_path = config.download_path / "previous" / feed_title

    for f in latest_path.iterdir():
        if f not in latest_file_names:
            if archive:
                previous_path.mkdir(parents=True, exist_ok=True)
                target = previous_path / f.name
                shutil.move(str(f), str(target))
                print(f"Archived: {f.name}")
            else:
                os.remove(str(f))
                print(f"Deleted: {f.name}")


def sync(download_path, sync_path):
    """
    Sync latest episodes with e.g. USB mp3 player
    """
    if not sync_path.exists():
        print(f"\nSkipping sync: {sync_path} not found")
        return

    print(f"\nSyncing: {sync_path}")

    src_folder = download_path / "latest"
    dst_folder = sync_path / "latest"
    dst_folder.mkdir(parents=True, exist_ok=True)

    cmd = [
        "rsync",
        "-av",  # archive mode + verbose
        "--delete",  # delete files on destination that are missing locally
        str(src_folder) + "/",  # trailing slash important
        str(dst_folder) + "/",
    ]

    subprocess.run(cmd, check=True)

    rc_dest = sync_path / "ypod.json"
    shutil.copy(RC_FILE, rc_dest)
    print(f"Copied RC file to: {rc_dest}")

    script = os.path.abspath(__file__)
    script_dst = sync_path / os.path.basename(script)
    shutil.copy(script, script_dst)
    print(f"Copied current script to: {script_dst}")


def main():
    print(f"ypod! Config at {RC_FILE}")

    config = Config(RC_FILE)
    config.download_path.mkdir(parents=True, exist_ok=True)

    for feed in config.feeds:
        try:
            process_feed(feed, config)
        except IOError as e:
            print(f"Error processing {feed}: {e}")

    for p in config.sync_paths:
        sync(config.download_path, p)


if __name__ == "__main__":
    main()

Tags: mp3, smartphone, podcasts, code