RSS Redux
9:30 -- Re-read Brian Lamb's blog post.
9:33 -- Poked around Stephen Downes' site, reading over some of the documentation on Edu_Rss. Really, I'm hoping to find an OPML file. Bingo.
9:40 -- Create a database on educon20.org.
9:42 -- Go to Drupal.org -- grab a copy of the 5.7 codebase, and the following modules: FeedAPI, Feed Element Mapper, Views, Views Bonus, Tagadelic, and CCK. At a later point, if nothing blows up, I'll probably add in Similar Content.
9:50 -- untar code. Realize I'm curious how long this will actually take, and resign myself to getting less sleep than I originally hoped. So it goes.
9:59 -- upload code to the server. Crack a beer. A good one.
10:04 -- bring site live.
10:08 -- in the process of installing the modules, realize I have forgotten to download the SimplePie parser. Oy.
10:16 -- create settings for the imported feeds, and create taxonomy categories for the individual posts.
10:23 -- test import with a test feed. It looks good.
10:30 -- import opml file
10:35 -- first attempt at opml import bombs. Time to increase the memory allotted to php scripts in the settings.php file. Bumping it up to 40M ought to do it. If that doesn't work, I'll break up the opml file into multiple parts. At this point, I congratulate myself on the wise choice made at 9:59. A lesser beer would offer less solace during these times of peril.
10:42 -- second attempt bombs again. Time to try a third attempt, and see if it bombs in the same place. Don't know if I'm running into a php timeout, or a malformed xml file.
10:45 -- third attempt. Fingers crossed.
10:46 -- bombs out at close to the same place. In all likelihood, a php timeout issue. Small curses.
10:57 -- finished editing the original opml file into 4 smaller opml files. The first one imports with no issues -- 100 feeds down. Now trying the second opml file, which is larger than the first.
Note: I'm doing all this via a wireless connection, which is rather silly. When I am uploading files, I prefer to use a wired connection, as there is less chance of a transfer getting munged.
11:06 -- the second opml file bombed -- edited it into two smaller opml files. Trying again now.
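The splitting was done by hand that night. For anyone repeating it, here is a minimal sketch of the same step in Python, assuming the feeds are `<outline>` elements that carry an `xmlUrl` attribute (the usual OPML convention). The function name `split_opml` and the default batch size are made up for illustration; this is not part of FeedAPI or Drupal.

```python
import copy
import xml.etree.ElementTree as ET


def split_opml(opml_text, batch_size=50):
    """Split one OPML document into several, each holding at most batch_size feeds.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(opml_text)
    head = root.find("head")
    # Assumption: a feed is any <outline> with an xmlUrl attribute.
    # Outlines without one are treated as folders and are not carried over.
    feeds = [o for o in root.iter("outline") if o.get("xmlUrl")]

    parts = []
    for start in range(0, len(feeds), batch_size):
        new_root = ET.Element("opml", root.attrib)
        if head is not None:
            new_root.append(copy.deepcopy(head))
        body = ET.SubElement(new_root, "body")
        # Copy only the attributes, so nested folders are flattened.
        for outline in feeds[start:start + batch_size]:
            ET.SubElement(body, "outline", outline.attrib)
        parts.append(ET.tostring(new_root, encoding="unicode"))
    return parts
```

Each string in the returned list can be written to its own file and imported separately. A smaller batch size means more files, but each import does less work per page load.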
11:13 -- the first two opml files have imported cleanly. The third is importing now. After this, two more to go.
11:22 -- opml import complete. Now, to begin the process of importing the feeds.
11:23 -- first cron run begun. In Drupal, there are many wonderful things that occur during a cron run. It is a sign of my general disintegration that I now have an active interest in things that occur during a cron run. During the first cron run, nearly 1000 posts were imported from the various feeds.
11:26 -- second cron run begun. An additional 2000 posts imported.
11:30 -- third cron run begun.
11:37 -- fourth cron run begun.
11:45 -- create default views for imported feeds, and keyword directory.
12:06 -- install Similar Terms module -- this is a lightweight content recommendation engine.
12:25 -- for the last 20 minutes or so, I've been lost reading content.
12:40 -- set up a cron job to run automatically. This will serve two main purposes: import new posts, and index the site so that the search actually works. It will probably take about half a day for the site to get fully indexed; after that point, the full text search will work pretty well.
1:00 -- clean up this post. Wonder why I didn't go to bed earlier.
As of this writing, a little over 3.5 hours from when I started, there are nearly 7500 posts imported from around 500 different feeds.
- billfitzgerald's blog
Bill, this is an awesome writeup.
What's becoming clear is that the OPML import needs an option for batch importing on cron. There are only so many nodes you can crank out on one page load in Drupal :)
Alex
Hello, Alex,
I was just going to stop by the Dev Seed blog and make a comment on your announcement --
What's awesome is the work you all have done to make setting up a powerful tool like this so easy.
You have done amazing work with this.
Cheers,
Bill
This is excellent. Awesome work. Now, is it pulling all those feeds, or just some? I noticed certain sites are in the directory but aren't being pulled. Is that just to control the mayhem?
You're the RSS man!
Awesome stuff, as usual.
Site catching up
Hello, Jim,
The site is actually just catching up -- by choice and out of a perverse sense of curiosity, I built it up on a pretty anemic server -- I was curious if the resources needed to run the site would cause the server to keel over and make whimpering noises.
What I should have done is import the feeds in batches of 20-30, and then import all posts associated with those feeds. What I did, however, was to import all 500 feeds at once, and then start aggregating the posts associated with those feeds.
The php settings on the server limit scripts to 4 minutes -- the import is triggered on a cron run, and from that moment on the clock is ticking -- I haven't done an accurate count of how many posts can be imported on a cron run, or what the balance is between checking for updates vs. processing feeds that have yet to be updated.
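To make the ticking clock concrete, here is a rough sketch in Python of one way to work inside a fixed time limit: take feeds from the front of a queue, stop before the limit is reached, and send each processed feed to the back so that untouched feeds go first on the next run. This is an illustration of the trade-off only, not how FeedAPI schedules its work. The name `run_cron_pass` and the 200-second budget (a margin under the 4-minute limit) are assumptions.

```python
import time
from collections import deque


def run_cron_pass(queue, import_feed, budget_seconds=200, clock=time.monotonic):
    """Import feeds from the front of the queue until the time budget runs out.

    Hypothetical helper: queue is a deque of feeds, import_feed is whatever
    function actually fetches and saves one feed.
    """
    deadline = clock() + budget_seconds
    processed = []
    # Visit each feed at most once per pass, so a short queue cannot loop forever.
    for _ in range(len(queue)):
        if clock() >= deadline:
            break
        feed = queue.popleft()
        import_feed(feed)
        queue.append(feed)  # back of the line; unvisited feeds stay in front
        processed.append(feed)
    return processed
```

The time check happens before each feed, so a single slow feed can still overrun the deadline. That is the reason for leaving a margin below the server's hard limit.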
Further complicating matters: many of the feed urls in the original opml file are dated, and no longer accurate.
In short, if I were to do this again, I'd probably do the opml imports in batches of fifty, test the uris for the feeds, and repeat until done.
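That batch-and-test loop could look something like the following Python sketch, which uses only the standard library. The function names are invented for illustration, and the check is deliberately crude: a 200 response only shows that something answered at the address, not that what came back is a valid feed.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error such as 404
    except (urllib.error.URLError, OSError, ValueError):
        return None  # DNS failure, refused connection, timeout, or a bad URL


def batches(items, size=50):
    """Yield consecutive slices of items, each at most size long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def check_feeds(urls, fetch=fetch_status):
    """Sort urls into (working, broken) lists; only a 200 counts as working."""
    working, broken = [], []
    for url in urls:
        (working if fetch(url) == 200 else broken).append(url)
    return working, broken
```

A typical run would loop over `batches(all_urls, 50)`, call `check_feeds` on each batch, import the working feeds, and set the broken ones aside to fix by hand.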
Over time, all of the feeds will get pulled in -- in the meantime, though, I'll probably need to spend some time finding feeds that aren't working, and/or blogs that no longer exist, and removing them from the list.
Cheers,
Bill
Update
There are A LOT of feeds in the opml file I grabbed from the Edu_rss site that are off -- I've corrected close to 40 feeds, and there are still probably a bunch that need to be fixed.
Ironically, one of the uris that was off pointed to the Bava :)