Hello! I’m evaluating tools to track changes in:

  • Government/legal PDFs (new regulations, court rulings)
  • News sites without reliable RSS
  • Tender portals
  • Property management messages (e.g. service notices)
  • Bank terms and policy updates

Current options I’ve tried:
• Huginn — powerful, but requires significant setup and offers no unified feed
• Changedetection-io — good for HTML, limited for documents

Key needs:
✓ Local processing (no cloud dependencies)
✓ Multi-page PDF support
✓ Customizable alert rules
✓ Reduced manual monitoring overhead (robust, offline-first approaches preferred)

What’s working well for others? Especially interested in:

  1. Solutions combining OCR + text analysis
  2. Experience with local LLMs for this (NLP, not just diff)
  3. Creative workarounds you’ve built

(P.S. Testing a deep scraping + LLM pipeline — if results look promising, will share.)

    • theorangeninja@sopuli.xyz · 7 days ago

      Can you point me to a tutorial on how to set that up properly for websites? I tried it a while ago and could not get it to work…

    • alfablend@lemmy.worldOP · 7 days ago

      @xyro Thanks for sharing your case! I’ve also tested changedetection.io — it’s a great tool for basic site monitoring.

      But in my tests, it doesn’t go beyond the surface. If there’s a page with multiple document links, it’ll detect changes in the list (via diff), but it won’t automatically download and analyze the new documents themselves.

      Here’s how I’ve approached this:

      1. Crawl the page to extract links
      2. Detect new document URLs
      3. Download each document and extract keywords
      4. Generate an AI summary using a local LLM
      5. Add the result to a readable feed
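If it helps, steps 1–3 look roughly like this in Python (the link pattern and state file are simplified placeholders, not my production code):

```python
import json
import pathlib
import re

STATE = pathlib.Path("seen_urls.json")  # simple persistent "already seen" store

def extract_links(html: str) -> list[str]:
    # Step 1: pull document links out of the page (a real crawler would
    # use proper CSS/XPath selectors instead of a bare regex).
    return re.findall(r'href="([^"]+\.pdf)"', html)

def new_urls(urls: list[str]) -> list[str]:
    # Step 2: anything not seen on a previous run counts as a new document.
    seen = set(json.loads(STATE.read_text())) if STATE.exists() else set()
    fresh = [u for u in urls if u not in seen]
    STATE.write_text(json.dumps(sorted(seen | set(urls))))
    return fresh
```

Step 3 onward is just a downloader plus a PDF text extractor (e.g. pdfminer) feeding the LLM.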

      P.S. If it helps, I can create a YAML template tailored to your grant-tracking case and run a quick test.

      • xyro@lemmy.ca · 6 days ago

        Do you send the result of the diff to an Ollama instance? I would be curious to see the pipeline 😇

        • alfablend@lemmy.worldOP · 6 days ago

          @xyro Ah, I see! I’m not using Ollama at the moment — my setup is based on GPT4All with a locally hosted DeepSeek model, which handles the semantic parsing directly.

          As mentioned earlier, the pipeline doesn’t just diff pages — it detects new document URLs from the source feed (via selectors), downloads them, and generates structured summaries. Here’s a snippet from the YAML config to illustrate how that works:

          extract:
            events:
              selector: "results[*]"
              fields:
                url: pdf_url
                title: title
                order_number: executive_order_number

          download:
            extensions: [".pdf"]

          gpt:
            prompt: |
              Analyze this Executive Order document:
              - Purpose: 1–2 sentences
              - Key provisions: 3–5 bullet points
              - Agencies involved: list
              - Revokes/amends: if any
              - Policy impact: neutral analysis
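For completeness, combining that prompt with the extracted document text is a small helper (the function name and character cap here are illustrative, not my exact code):

```python
def build_prompt(template: str, doc_text: str, max_chars: int = 8000) -> str:
    # Truncate long PDFs so the final prompt stays inside the local model's
    # context window, then append the document below the instruction block.
    return template.rstrip() + "\n\nDocument text:\n" + doc_text[:max_chars]
```

The result then goes straight into the model's generate() call (GPT4All in my case).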
          

          To keep things efficient, I also support regex-based extraction before passing content to the LLM. That way, I can isolate relevant blocks (e.g. addresses, client names, conclusions) and reduce the noise in the prompt. Example from another config:

          processing:
            extract_regex:
              - "object of cultural heritage"
              - "address[:\\s]\\s*(.{10,100}?)(?=\\n|$)"
              - "project(?:s)?"
              - "circumstances"
              - "client\\s*:?\\s*(.{10,100}?)(?=\\n|$)"
              - "(?:conclusions?)\\s*(.{50,300}?)(?=\\n|$)"
          

          Let me know if you’re experimenting with similar flows — I’d be happy to share templates or compare how DeepSeek performs on your sources!