How I track AI model releases: my personal data system

I get this question a lot. "How do you keep track of all these models?" Fair question. The pace in 2023 has been genuinely hard to follow. So here's my actual setup, warts and all.

The system

Four components, running in parallel:

| Component | What it does | Time investment | Reliability |
|-----------|-------------|----------------|-------------|
| RSS feeds | Catches blog posts from major labs | 15 min/day | Good for big announcements |
| arXiv alerts | Catches papers before blog posts | 10 min/day | Good for research models |
| Hugging Face tracker (Python) | Detects new trending models daily | 0 min (automated) | Good for open source |
| Manual spreadsheet | My curated record of every notable release | 30 min/week | Thorough but slow |

Total time: about 30-40 minutes per day, plus a weekly spreadsheet maintenance session.

Component 1: RSS feeds

I use Feedly with 23 feeds. Here are the ones that catch the most model releases:

| Feed | URL pattern | Avg releases caught per month |
|------|-----------|------------------------------|
| OpenAI Blog | openai.com/blog/rss | 2-3 |
| Google AI Blog | blog.google/technology/ai/rss | 2-4 |
| Meta AI Blog | ai.meta.com/blog/rss | 1-2 |
| Anthropic Research | anthropic.com/research/rss | 1-2 |
| Mistral AI Blog | mistral.ai/feed | 0-1 (new) |
| Hugging Face Blog | huggingface.co/blog/feed.xml | 3-5 |
| DeepMind Blog | deepmind.google/blog/rss | 2-3 |
| Together AI Blog | together.ai/blog/rss | 1-2 |
| arXiv CS.CL (new) | arXiv RSS for Computation and Language | 20-30 new papers per day (most not model releases) |

The arXiv CS.CL feed is the noisiest. 20-30 new papers per day, most of which aren't model releases. I skim titles and abstracts during my morning coffee. Takes about 10 minutes once you develop pattern recognition for which titles signal a new model.

Component 2: arXiv alerts

I have custom alerts set up on Semantic Scholar for specific terms:

| Alert term | Hits per week | Signal quality |
|-----------|---------------|---------------|
| "language model" + "we release" | 3-5 | High |
| "we introduce [model name]" | 2-4 | High |
| "open source" + "weights" | 1-3 | Medium |
| "benchmark" + "state of the art" | 8-12 | Low (many false positives) |

The "we release" filter is my best trick. Researchers almost always use that phrase when they're releasing model weights. "We introduce" catches new architectures. The combination catches 80-90% of notable model papers within 24 hours of publication.
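The alert terms above boil down to phrase matching on abstracts. Here is a minimal sketch of that idea in Python, assuming you already have the abstract text in hand. The regexes, the `Alert` type, and the function names are illustrative guesses, not how Semantic Scholar evaluates alerts.

```python
import re
from typing import NamedTuple


class Alert(NamedTuple):
    name: str
    patterns: tuple  # every regex must match somewhere in the abstract
    signal: str      # signal quality, copied from the table above


# Hypothetical regex versions of the four alert terms.
ALERTS = [
    Alert("language model + we release", (r"language model", r"\bwe release\b"), "High"),
    Alert("we introduce", (r"\bwe introduce\b",), "High"),
    Alert("open source + weights", (r"open[- ]source", r"\bweights\b"), "Medium"),
    Alert("benchmark + state of the art", (r"\bbenchmarks?\b", r"state[- ]of[- ]the[- ]art"), "Low"),
]


def matching_alerts(abstract: str) -> list[str]:
    """Return the names of every alert whose patterns all occur in the abstract."""
    return [
        alert.name
        for alert in ALERTS
        if all(re.search(p, abstract, re.IGNORECASE) for p in alert.patterns)
    ]


def looks_like_release(abstract: str) -> bool:
    """True when at least one high-signal alert fires."""
    fired = set(matching_alerts(abstract))
    return any(a.name in fired and a.signal == "High" for a in ALERTS)


# Made-up abstracts for illustration only.
release_abstract = (
    "We introduce Foo-7B, a language model. "
    "We release the weights under an open-source license."
)
benchmark_abstract = "We evaluate on a benchmark and reach state-of-the-art accuracy."

release_hits = matching_alerts(release_abstract)
benchmark_hits = matching_alerts(benchmark_abstract)
```

Treating only the "High" alerts as release signals mirrors the table: the benchmark alert still fires, but it doesn't count as a release on its own.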

Component 3: Hugging Face tracker

I wrote a Python script that runs daily and checks the Hugging Face trending models page. It's simple.

What the script tracks:

- New models that appear on the trending page
- Models with over 1,000 downloads in the first 24 hours
- Models from known organizations (Meta, Mistral, EleutherAI, etc.)

Daily output:

| Date (example) | New trending models | Notable ones |
|--------|-------------------|-------------|
| Oct 18 | 7 | mistralai/Mistral-7B-Instruct-v0.1 |
| Oct 19 | 4 | None over 1K downloads |
| Oct 20 | 6 | teknium/OpenHermes-2-Mistral-7B |
| Oct 21 | 5 | None notable |

The script sends me a daily summary. Most days it's noise (random fine-tunes that trend briefly). But it catches community-driven releases that don't have blog posts or papers, like the Dolphin, OpenHermes, and Neural Chat models that grew out of the Mistral 7B model family.
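For a rough idea of the filtering step, here is a minimal sketch, assuming the trending list has already been fetched into simple records. The fetch itself is left out. `TrendingModel`, `daily_summary`, the org names in `KNOWN_ORGS`, and the download numbers are all hypothetical; only the three tracking rules come from the description above.

```python
from dataclasses import dataclass

# Assumption: org names as they appear in Hugging Face repo ids, lowercased.
KNOWN_ORGS = {"meta-llama", "mistralai", "eleutherai"}
MIN_DOWNLOADS_24H = 1_000


@dataclass(frozen=True)
class TrendingModel:
    repo_id: str        # e.g. "mistralai/Mistral-7B-Instruct-v0.1"
    downloads_24h: int  # downloads in the first 24 hours, supplied by the fetch step


def is_notable(model: TrendingModel) -> bool:
    """Notable = over the download threshold, or published by a known org."""
    org = model.repo_id.split("/", 1)[0].lower()
    return model.downloads_24h > MIN_DOWNLOADS_24H or org in KNOWN_ORGS


def daily_summary(today: list, seen_before: set) -> dict:
    """Count models not seen on earlier days and list the notable ones."""
    new_models = [m for m in today if m.repo_id not in seen_before]
    return {
        "new_trending": len(new_models),
        "notable": [m.repo_id for m in new_models if is_notable(m)],
    }


# Illustrative input; the download counts are made up.
today = [
    TrendingModel("mistralai/Mistral-7B-Instruct-v0.1", 50_000),
    TrendingModel("example-user/random-finetune", 120),
    TrendingModel("example-user/older-model", 9_000),
]
summary = daily_summary(today, seen_before={"example-user/older-model"})
```

Keeping a `seen_before` set between runs is what turns "what's trending" into "what's new on trending", which is the part that matters for a daily summary.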

Component 4: The spreadsheet

My tracking spreadsheet has 312 rows as of today. Each row is a model release I consider "notable" (roughly: a new base model, a significant fine-tune, or a model from a major lab).

Columns I track:

| Column | Example | Why I track it |
|--------|---------|---------------|
| Release date | 2023-09-27 | Timeline charting |
| Model name | Mistral 7B | Identification |
| Organization | Mistral AI | Market mapping |
| Parameters | 7.2B | Size comparison |
| Training tokens | Unknown | Efficiency analysis |
| Open/closed | Open | Market dynamics |
| License | Apache 2.0 | Commercial viability |
| Context window | 8K | Capability tracking |
| MMLU score | 60.1% | Quality comparison |
| HumanEval score | 30.5% | Coding quality |
| Source | mistral.ai | Reference |

I fill in what I can at release time and go back to update when papers or evaluations come out. About 40% of models launch without benchmark numbers and get updated later.
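If you'd rather keep this in code than in a spreadsheet, one way to model a row is a dataclass with optional fields, so "fill in later" is just `None`. This is a sketch under that assumption. The field names, `to_csv`, `missing_benchmarks`, and the second example row are hypothetical; the Mistral 7B values come from the table above.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class ModelRelease:
    release_date: str                      # ISO date, e.g. "2023-09-27"
    model_name: str
    organization: str
    parameters_b: Optional[float] = None   # billions of parameters
    training_tokens: Optional[str] = None  # None when unknown
    open_weights: Optional[bool] = None
    license: Optional[str] = None
    context_window: Optional[str] = None
    mmlu: Optional[float] = None           # percent
    humaneval: Optional[float] = None      # percent
    source: Optional[str] = None


def to_csv(rows: list) -> str:
    """Serialize rows to CSV text; None values become empty cells."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(ModelRelease)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buf.getvalue()


def missing_benchmarks(rows: list) -> list:
    """Names of models that still need a benchmark update."""
    return [r.model_name for r in rows if r.mmlu is None or r.humaneval is None]


rows = [
    ModelRelease("2023-09-27", "Mistral 7B", "Mistral AI", 7.2, None, True,
                 "Apache 2.0", "8K", 60.1, 30.5, "mistral.ai"),
    # Hypothetical entry with no benchmark numbers at launch.
    ModelRelease("2023-10-01", "Example-13B", "Example Lab"),
]
todo = missing_benchmarks(rows)
csv_text = to_csv(rows)
```

`missing_benchmarks` gives you the "go back and update" list for free.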

What the data tells me

Some patterns from 312 model entries:

| Metric | 2022 total | 2023 (Jan-Oct) |
|--------|-----------|----------------|
| Total notable releases | 78 | 234 |
| Open source releases | 31 (40%) | 168 (72%) |
| Releases with published benchmarks | 52 (67%) | 147 (63%) |
| Average parameters (new models) | 18.4B | 14.2B |
| Median parameters | 7.0B | 7.0B |

The shift to open source is dramatic. 40% of notable releases in 2022 vs 72% in 2023. The absolute number went from 31 to 168. Open source isn't just keeping pace. It's dominating the release volume.

Average model size is actually dropping (18.4B to 14.2B). That's the Mistral effect and the broader trend toward efficient smaller models. The median stays at 7B because that's the sweet spot for consumer hardware.

The benchmark publication rate (63%) is lower than I'd like. Over a third of models launch without standardized evaluations. I've started penalizing models without benchmarks in my own assessments. If you won't share your scores, I assume they're bad.
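The percentages in the table are plain ratios of the counts. A small sketch to reproduce them, using only the counts from the table (the dictionary layout and the `share` helper are just for illustration):

```python
# Counts copied from the table above.
counts = {
    "2022": {"total": 78, "open": 31, "benchmarked": 52},
    "2023 (Jan-Oct)": {"total": 234, "open": 168, "benchmarked": 147},
}


def share(part: int, total: int) -> int:
    """Percentage of total, rounded to the nearest whole percent."""
    return round(100 * part / total)


shares = {
    period: {
        "open_pct": share(c["open"], c["total"]),
        "benchmarked_pct": share(c["benchmarked"], c["total"]),
    }
    for period, c in counts.items()
}

# Both periods together should add up to the spreadsheet's row count.
total_rows = sum(c["total"] for c in counts.values())

for period, s in shares.items():
    print(f"{period}: {s['open_pct']}% open, {s['benchmarked_pct']}% with benchmarks")
```

This gives 40% and 67% for 2022, 72% and 63% for 2023, and 312 rows in total, matching the figures quoted in this post.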

The pain points

What doesn't work well:

- Chinese and Asian model releases. My feeds are English-biased. I miss probably 30-40% of Chinese model releases because they're announced on WeChat, Zhihu, or Chinese arXiv mirrors before they appear in English sources.
- Duplicate tracking. The same model gets released on Hugging Face by the original team, then reuploaded by 5 community members with quantized versions. My script counts these separately. I have to deduplicate manually.
- "Notable" is subjective. I decide what's notable based on vibes and experience. Some models I skip turn out to be important later (I initially didn't track Vicuna, which was a mistake).
- Keeping up. At 234 models in 10 months, that's about one new notable model every 1.3 days. The pace is accelerating. I'm not sure my current system scales past 400-500 models per year without automation.
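The duplicate problem is the one most open to automation. A minimal sketch of one possible heuristic: strip common quantization suffixes from the repo name and group by what's left. The suffix list, the function names, and the `community-user` repo ids are assumptions for illustration. Real reuploads use messier naming, so treat this as a first pass before manual review.

```python
import re
from collections import defaultdict

# Assumption: quantized reuploads usually end in one of these format tags.
QUANT_SUFFIX = re.compile(r"[-_.](gguf|ggml|gptq|awq)$", re.IGNORECASE)


def base_name(repo_id: str) -> str:
    """Drop the uploader and any trailing quantization tags, then lowercase."""
    name = repo_id.split("/", 1)[-1]
    previous = None
    while previous != name:  # repeat in case several tags are stacked
        previous = name
        name = QUANT_SUFFIX.sub("", name)
    return name.lower()


def group_duplicates(repo_ids: list) -> dict:
    """Map each base model name to every repo id that resolves to it."""
    groups = defaultdict(list)
    for repo_id in repo_ids:
        groups[base_name(repo_id)].append(repo_id)
    return dict(groups)


# The "community-user" repos are made-up examples of quantized reuploads.
groups = group_duplicates([
    "mistralai/Mistral-7B-v0.1",
    "community-user/Mistral-7B-v0.1-GGUF",
    "community-user/Mistral-7B-v0.1-GPTQ",
    "teknium/OpenHermes-2-Mistral-7B",
])
```

Any group with more than one repo id is a candidate duplicate. Fine-tunes like OpenHermes keep their own group because the base name differs, not just the suffix.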

Why I do this

My morning routine starts with checking the latest model papers. It's become the data equivalent of reading the sports page. Who released what, how it performed, what it means for the standings.

Is it necessary? No. Is it borderline obsessive? Probably. But every article I write on this blog starts with the spreadsheet. The data doesn't collect itself.

If you want to track a subset (say, just open source LLMs over 7B parameters), you can get 80% of the value from just the Hugging Face trending page and Papers With Code. Check those two sources daily and you'll catch most of the important stuff.

The remaining 20% is the obsessive part. That's my job.

-- dataku