• Saik0@lemmy.saik0.com
    English
    arrow-up
    2
    ·
    2 days ago

    They can also crawl this publicly accessible social media source for their data sets.

    Crawling would be silly. They can simply set up a Lemmy node and subscribe to every other server. Consuming ActivityPub would be much more efficient than crawling: instead of accidentally re-fetching things that haven’t changed, they would just read the ActivityPub updates.
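
    To make that concrete, here is a rough sketch of keeping a local copy in sync from pushed activities rather than re-fetching pages. The activity types ("Create", "Update", "Delete") come from the ActivityStreams vocabulary that ActivityPub uses; the function name, the in-memory store, and the example.test URLs are made up for illustration, and real federation involves a lot more than this.

```python
# Hypothetical sketch: apply pushed activities to a local mirror.
# apply_activity and the dict-based store are illustrative, not Lemmy's code.

def apply_activity(store: dict, activity: dict) -> bool:
    """Apply one activity to the store; return True if the store changed."""
    kind = activity.get("type")
    obj = activity.get("object")
    # "object" may be an embedded object or just its id (a URL string).
    obj_id = obj.get("id") if isinstance(obj, dict) else obj
    if obj_id is None:
        return False
    if kind in ("Create", "Update") and isinstance(obj, dict):
        store[obj_id] = obj  # new or edited content replaces the old copy
        return True
    if kind == "Delete":
        return store.pop(obj_id, None) is not None
    return False  # votes, follows, and other types are ignored here


store = {}
inbox = [
    {"type": "Create", "object": {"id": "https://example.test/post/1", "content": "hello"}},
    {"type": "Update", "object": {"id": "https://example.test/post/1", "content": "hello, edited"}},
    {"type": "Like", "object": "https://example.test/post/1"},
    {"type": "Delete", "object": "https://example.test/post/1"},
]
changes = [apply_activity(store, a) for a in inbox]
```

    Only things that actually changed ever reach the node, so nothing is fetched twice just to discover it is the same as before.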

    • Strawberry@lemmy.blahaj.zone
      English
      arrow-up
      2
      ·
      2 days ago

      Sure, but we’re in the comments section of an article about Wikipedia being crawled, which is silly because they could just download a snapshot of Wikipedia.

      • TXL@sopuli.xyz
        English
        arrow-up
        1
        ·
        2 hours ago

        That’s right. It’s not humans making careful decisions about what to download. It’s a program that follows links and saves files.