Hello! I’m evaluating tools to track changes in:

  • Government/legal PDFs (new regulations, court rulings)
  • News sites without reliable RSS
  • Tender portals
  • Property management messages (e.g. service notices)
  • Bank terms and policy updates

Current options I’ve tried:
• Huginn — powerful, but requires significant setup and offers no unified feed
• Changedetection-io — good for HTML, limited for documents

Key needs:
✓ Local processing (no cloud dependencies)
✓ Multi-page PDF support
✓ Customizable alert rules
✓ Reduced manual monitoring overhead (robust, offline-first approaches preferred)

What’s working well for others? Especially interested in:

  1. Solutions combining OCR + text analysis
  2. Experience with local LLMs for this (NLP, not just diff)
  3. Creative workarounds you’ve built

(P.S. Testing a deep scraping + LLM pipeline — if results look promising, will share.)

    • theorangeninja@sopuli.xyz · 7 days ago

      Can you point me to a tutorial on how to set that up properly for websites? I tried it a while ago and could not get it to work…

    • alfablend@lemmy.worldOP · 7 days ago

      @xyro Thanks for sharing your case! I’ve also tested changedetection.io — it’s a great tool for basic site monitoring.

      But in my tests, it doesn’t go beyond the surface. If there’s a page with multiple document links, it’ll detect changes in the list (via diff), but it won’t automatically download and analyze the new documents themselves.

      Here’s how I’ve approached this:

      1. Crawl the page to extract links
      2. Detect new document URLs
      3. Download each document and extract keywords
      4. Generate an AI summary using a local LLM
      5. Add the result to a readable feed
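If it helps, steps 1–3 look roughly like this in Python (the link pattern and state file are simplified placeholders, not my production code):

```python
import json
import pathlib
import re

STATE = pathlib.Path("seen_urls.json")  # simple persistent "already seen" store

def extract_links(html: str) -> list[str]:
    # Step 1: pull document links out of the page (a real crawler would
    # use proper CSS/XPath selectors instead of a bare regex).
    return re.findall(r'href="([^"]+\.pdf)"', html)

def new_urls(urls: list[str]) -> list[str]:
    # Step 2: anything not seen on a previous run counts as a new document.
    seen = set(json.loads(STATE.read_text())) if STATE.exists() else set()
    fresh = [u for u in urls if u not in seen]
    STATE.write_text(json.dumps(sorted(seen | set(urls))))
    return fresh
```

Step 3 onward is just a downloader plus a PDF text extractor (e.g. pdfminer) feeding the LLM.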

      P.S. If it helps, I can create a YAML template tailored to your grant-tracking case and run a quick test.

      • xyro@lemmy.ca · 6 days ago

        Do you send the result of the diff to an Ollama instance? I would be curious to see the pipeline 😇

        • alfablend@lemmy.worldOP · 6 days ago

          @xyro Ah, I see! I’m not using Ollama at the moment — my setup is based on GPT4All with a locally hosted DeepSeek model, which handles the semantic parsing directly.

          As mentioned earlier, the pipeline doesn’t just diff pages — it detects new document URLs from the source feed (via selectors), downloads them, and generates structured summaries. Here’s a snippet from the YAML config to illustrate how that works:

          extract:
            events:
              selector: "results[*]"
              fields:
                url: pdf_url
                title: title
                order_number: executive_order_number

          download:
            extensions: [".pdf"]

          gpt:
            prompt: |
              Analyze this Executive Order document:
              - Purpose: 1–2 sentences
              - Key provisions: 3–5 bullet points
              - Agencies involved: list
              - Revokes/amends: if any
              - Policy impact: neutral analysis
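For completeness, combining that prompt with the extracted document text is a small helper (the function name and character cap here are illustrative, not my exact code):

```python
def build_prompt(template: str, doc_text: str, max_chars: int = 8000) -> str:
    # Truncate long PDFs so the final prompt stays inside the local model's
    # context window, then append the document below the instruction block.
    return template.rstrip() + "\n\nDocument text:\n" + doc_text[:max_chars]
```

The result then goes straight into the model's generate() call (GPT4All in my case).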
          

          To keep things efficient, I also support regex-based extraction before passing content to the LLM. That way, I can isolate relevant blocks (e.g. addresses, client names, conclusions) and reduce the noise in the prompt. Example from another config:

          processing:
            extract_regex:
              - "object of cultural heritage"
              - "address[:\\s]\\s*(.{10,100}?)(?=\\n|$)"
              - "project(?:s)?"
              - "circumstances"
              - "client\\s*:?\\s*(.{10,100}?)(?=\\n|$)"
              - "(?:conclusions?)\\s*(.{50,300}?)(?=\\n|$)"
          

          Let me know if you’re experimenting with similar flows — I’d be happy to share templates or compare how DeepSeek performs on your sources!