It started on my last day at work in Spain. A Wednesday before two bank holidays. The office almost empty.

There was supposed to be a goodbye drink later downstairs at the company bar (yes, my former company had a bar), but realistically there weren't going to be many people around to attend it. Most teams were already offline for the long weekend. The only people still around were a few colleagues from engineering and the IT team, which mattered, because I still had to return my laptop as part of offboarding.

So with a quiet office, a few hours to spare, and curiosity doing what curiosity does, I opened ChatGPT and started exploring something I had been meaning to understand better: HTTPS and page structure before scraping. To experiment, I opened a random company's careers page and clicked on People. Something looked familiar. The page was powered by Teamtailor, an ATS, and it reminded me of my last company's internal People page, except for one difference: only the Talent department was visible.

How many startups and SaaS companies use Teamtailor? My hacker instinct kicked in: if the structure powered by Teamtailor was shared across deployments, then these weren't isolated pages anymore. They were instances of a reusable schema.

So I wrote a small Python script to scrape the Teamtailor client list. Surprisingly, it worked. At that moment I had a funny realization. Imagine you're a salesperson building an integration product for Teamtailor customers.
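I haven't reproduced the original script here. As a minimal sketch of the detection half, one way to tell whether a careers page is Teamtailor-powered is to look for Teamtailor fingerprints in its HTML. The marker strings below are my assumption, not a documented fingerprint:

```python
# Hypothetical markers: Teamtailor career sites typically load assets from
# teamtailor.com domains, but the exact fingerprint is an assumption here.
TEAMTAILOR_MARKERS = (
    "teamtailor.com",
    "data-teamtailor",
)

def looks_like_teamtailor(page_source: str) -> bool:
    """Return True if the HTML appears to come from a Teamtailor career site."""
    lowered = page_source.lower()
    return any(marker in lowered for marker in TEAMTAILOR_MARKERS)
```

Run this check over a list of candidate career-page URLs and you have the beginnings of a client list.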

If you can programmatically discover companies using Teamtailor, then theoretically you could contact decision-makers across the entire ecosystem. You could even claim: "As a Teamtailor partner, they shared your contact because you are compatible with our solution." (I'm joking. Mostly.) Interestingly, I had once been contacted by a salesperson with a pitch like that, and at the time I genuinely thought Teamtailor must have shared their customer list.

Once I had the first extraction working, I switched to a more structured workflow in Cursor. Now I wanted only technical companies. So the architecture started taking shape:

find_teamtailor_companies.py
    ↓
teamtailor_companies.csv
    ↓
filter_tech_companies.py
    ↓
tech_companies.csv
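The filtering stage could be as simple as a keyword match over the company CSV. A sketch of filter_tech_companies.py, where the keyword list and the `description` column are assumptions for illustration:

```python
import csv
import io

# Hypothetical keyword list; the real filter criteria aren't shown in the post.
TECH_KEYWORDS = {"software", "saas", "platform", "engineering", "developer", "data"}

def filter_tech_companies(csv_text: str) -> list[dict]:
    """Keep rows whose description mentions a tech keyword (case-insensitive)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if any(kw in row.get("description", "").lower() for kw in TECH_KEYWORDS)
    ]
```

In the real pipeline this would read teamtailor_companies.csv and write tech_companies.csv; the core is just the row filter.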

At that point the project stopped being an experiment and started becoming a pipeline. Then I called an engineer colleague. I showed him what I was exploring. We opened the HTML together. Inside the <body> structure we could immediately see:

  • names
  • job titles
  • departments
  • profile links

And sometimes something even better: some employees had written mini biographies and added their LinkedIn URLs directly inside their Teamtailor profiles.
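With those fields sitting in the `<body>`, even the standard library can collect them. A sketch using `html.parser`, where the `/people/` link pattern and the markup shape are assumptions (real Teamtailor deployments vary, which is exactly why testing against saved samples mattered):

```python
from html.parser import HTMLParser

class PeopleParser(HTMLParser):
    """Collect profile links and their visible text from a People page."""

    def __init__(self):
        super().__init__()
        self.profiles = []          # list of {"url": ..., "text": ...}
        self._in_profile_link = False

    def handle_starttag(self, tag, attrs):
        # Assumption: profile pages are linked via anchors containing /people/.
        href = dict(attrs).get("href") or ""
        if tag == "a" and "/people/" in href:
            self._in_profile_link = True
            self.profiles.append({"url": href, "text": ""})

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_profile_link = False

    def handle_data(self, data):
        # Name, title, and department all arrive as text inside the anchor.
        if self._in_profile_link and data.strip():
            self.profiles[-1]["text"] += data.strip() + " "
```

Feed it a page source with `parser.feed(html)` and read `parser.profiles` back out.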

That call probably saved me an entire weekend of trial-and-error XPath experiments. Instead of guessing selectors, he showed me how to read the page source of the career site. After that, I understood the page architecture.

At that point the objective became clear:

  1. I already had a list of Teamtailor companies
  2. I knew each People page followed the same structure
  3. Each People page linked to individual profile pages
  4. Those profile pages sometimes exposed LinkedIn URLs

So the next step was obvious:

loop through companies → loop through people pages → extract profiles.

This became:

batch_scrape_people.py
    ↓
all_tech_people.csv
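The nested loop above can be sketched with the fetching and extraction steps injected as functions, so the control flow is clear and testable without network access. The column names are assumptions:

```python
import csv

def batch_scrape_people(companies, fetch, extract):
    """Loop through companies -> fetch each People page -> extract profiles.

    `fetch` and `extract` are passed in: in the real pipeline, fetch would be
    an HTTP GET and extract an HTML parser over the People page source.
    """
    rows = []
    for company in companies:
        html = fetch(company["people_url"])
        for person in extract(html):
            rows.append({"company": company["name"], **person})
    return rows

def write_csv(rows, path):
    """Dump the accumulated rows, e.g. to all_tech_people.csv."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```

Keeping the loop separate from the I/O also made it easy to rerun extraction on saved page sources.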

But something interesting happened during extraction. Some data visible in the interface wasn't appearing in my results. So we enriched the parsing approach by saving full page sources and testing against real HTML samples (easy here: open the page source, copy-paste it, and save it as an .html file in your IDE). These samples helped an LLM understand how Teamtailor structured profile metadata across deployments.

Instead of writing brittle rules manually, the extraction logic became schema-aware.

Once the pipeline produced a structured dataset of people, I remembered something I had experimented with earlier: automatically resolving LinkedIn profiles from scraped identity data.

This time it was easier. Because each row already contained:

  • name
  • company

So I added:

linkedin_search.py
    ↓
writes LinkedIn URLs back to Google Sheet (Sheet1)
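A minimal sketch of the query-construction part of linkedin_search.py, assuming a plain web search with a site: filter (the post doesn't show the actual resolution method, and the Google Sheet write-back is omitted here):

```python
from urllib.parse import quote_plus

def linkedin_search_url(name: str, company: str) -> str:
    """Build a web-search URL likely to surface the person's LinkedIn profile.

    Hypothetical approach: search "name company" restricted to linkedin.com/in.
    """
    query = f"{name} {company} site:linkedin.com/in"
    return "https://www.google.com/search?q=" + quote_plus(query)
```

Because every row already carried a name and a company, this was a one-line lookup per person.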

Now the pipeline didn't just extract profiles. It enriched them.

Later that afternoon I called a junior colleague and showed her what I had built. She said: "That's cool… but what does it give that LinkedIn doesn't, if we can already search by company page?" She had a point, and I am still figuring out how to use this architecture. To be clear, I wasn't building a replacement for LinkedIn search; this was hacking for its own sake. Besides, LinkedIn only shows what users decide to publish, while Teamtailor People pages sometimes expose department structure and internal positioning. It's a different signal surface.

By then I had a working sample pipeline. Now came the real question. Should I run it on the entire Teamtailor client universe? It was my last day. I didn't know when I would next have access to an IDE. So I pressed Start.

Cursor estimated: 1–2 hours runtime. My laptop return deadline: about one hour.

Not ideal.

  • First thought: "How expensive is it really to let something run this long?"
  • Second thought: "Will I lose everything if offboarding shuts down the machine mid-process?"

I walked to IT and tried negotiating an extra 30 minutes. Not possible. So I rushed back. Saved everything locally. Pushed the architecture to GitHub. Watched the progress bar.

One of the IT sysadmins (a French fellow) walked past my desk near the exit door. He tapped his wrist and said: "Seven minutes left."

The script finished about ten minutes after my official deadline. I pushed the final version to GitHub. Closed the laptop. Returned it to IT. And was at the bar downstairs before anyone else.

What the Pipeline Became

https://github.com/diane-michaela/coding_sourcer/tree/main/teamtailor

The system:

  1. maps companies using Teamtailor
  2. filters technical organizations
  3. extracts People directories
  4. parses profile pages
  5. enriches identities with LinkedIn URLs

It transforms employer-branding infrastructure into structured candidate intelligence.