Showing posts with label beautiful soup. Show all posts

Friday, April 17, 2009

What I learned at Pycon 2009

Why repeat what everyone else has already documented about the greatness of Pycon 2009 when I can write about the awesome stuff I learned?
  • How to use Python with xlrd and xlwt to handle truly brobdingnagian Excel files (8 million records, anyone?) without anything Microsoft.
  • New Internet scraping tools like html5lib and mechanize to reinforce lxml and the fading but still lovable BeautifulSoup.
  • Many awesome things about the Sphinx documentation experience.
  • That spending a day in test-focused sessions is an intense experience.
  • argparse makes writing command line interfaces in Python so much easier.
  • Good rules of thumb for breaking apart applications, not just in Django but for Python modules in general.
  • Git! The Pinax community embraced Git just as I started to work on it. I owe a lot to Brian Rosner's patient coaching, and Jannis Leidel's patience.
  • That I am addicted to checking on the Github graphs to see how I compare.
  • 15 minutes of sitting next to James Tauber gave me the grounding I wanted in JQuery event handling.
  • That people who hate XML based template languages HATE them. HATE HATE HATE.
  • Mashed potatoes and bacon pizza is yummy. And spinach and anchovy pizza rocks. Yes, I ate anchovies and oddly enough liked them.
  • I really want to teach python, Django, and Pinax. I may not be the best coder by far, but I think this could be the largest contribution I ever make to this community.
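To give a flavor of the argparse point above, here is a minimal sketch. The program name, arguments, and defaults are all hypothetical, made up only to show how little code a usable command line takes.

```python
import argparse


def build_parser():
    # Hypothetical tool: the name "scrape" and its options are illustrative only.
    parser = argparse.ArgumentParser(
        prog="scrape",
        description="Fetch a page and list its images.",
    )
    parser.add_argument("url", help="address of the page to fetch")
    parser.add_argument("-n", "--limit", type=int, default=10,
                        help="maximum number of images to report")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print progress while working")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["http://example.com/", "--limit", "3"])
print(args.url, args.limit, args.verbose)
```

Help text, type conversion, and error messages for bad input all come for free from the declarations.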
Some of what I need to follow up on
Did I miss anything? Let me know!

Tuesday, April 7, 2009

The end of my Feedfeeder story

Another post about Plone... but this time about me and not about Plone.

For about 18 months I have wrestled with consuming broken RSS feeds to pick up image of the day fields stipulated by customers. These are feeds so broken that no RSS parser, including the masterful Feedparser, can handle them (for example, one image of the day feed usually puts the image in the RSS header and changes that each day - no history is maintained). They aren't actually RSS, they just possess a file name that ends with '.rss'. Plus, periodically the way they are written changes so custom logic fails.

I forked Reinout van Rees's FeedFeeder project, and even proposed complicated logical revisions to handle these broken feeds and their shifting implementation. I called it Feedfeeder v2. Reinout always seemed hesitant, and I watched as other people extended his work and despaired. I knew something was wrong but couldn't put my finger on it. I hesitated to work on it, even though funding for it was readily available.

Then between Spacebook, Pinax, and other efforts I shelved this effort for months, hiding my head in the virtual sand. And yet I knew it needed to be addressed. How could I handle something that broke the otherwise wonderful Feedparser?

During Pycon 2009 I came up with the answer. I took an excellent tutorial on HTML scraping and learned lots of little tricks to reinforce my skills with BeautifulSoup. You see, screen scraping is a secret pleasure I have. Scraping out a bit of data from a page is like a little puzzle. While I was describing the feed problem to someone, the answer became clear as day in the middle of the conversation.

The answer was to turn the problem from an RSS interpretation problem into a simple web page scraping puzzle.
  1. Fetch via urllib the XML file that pretends to be RSS.
  2. Parse it using BeautifulSoup or html5lib.
  3. Get all the images listed.
  4. Discard all but the largest image.
  5. Infer the metadata from the XML file and store that for the image.
Problem solved.
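
A minimal sketch of steps 1 through 4, under some loud assumptions: the images show up as ordinary <img> tags, and "largest" is judged by their width and height attributes (the real feeds may need a different measure, such as the downloaded file size). To run without extra installs it uses the standard library's html.parser as a stand-in for BeautifulSoup or html5lib. The function names and the sample markup are hypothetical.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class ImageCollector(HTMLParser):
    """Collect the attributes of every <img> tag, whatever surrounds it."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs))


def area(image):
    """Pixel area from width/height attributes; 0 if missing or non-numeric."""
    try:
        return int(image.get("width", 0)) * int(image.get("height", 0))
    except (TypeError, ValueError):
        return 0


def largest_image(markup):
    """Steps 2-4: parse, list the images, keep only the largest (or None)."""
    collector = ImageCollector()
    collector.feed(markup)
    return max(collector.images, key=area, default=None)


def fetch(url):
    """Step 1: download the file that pretends to be RSS."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


# Made-up stand-in for a broken feed; a real run would call fetch(url) instead.
sample = """
<rss><channel><item><description>
<img src="small.jpg" width="100" height="80">
<img src="big.jpg" width="1024" height="768" />
<img src="unknown.jpg">
</description></item></channel></rss>
"""
print(largest_image(sample))
```

Because the parser is only asked for tags, it does not care whether the file is valid RSS, which is the whole point of treating this as scraping.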

Now I just need to make a Plone 3 package to do this for me and my angst is finished.

My apologies Reinout for the time spent on trying to cook a solution via Feedfeeder. Thank you for your insights and your extreme patience. I think you tried to tell me to take a different path.

Tuesday, March 24, 2009

Pycon Tutorials attended by me

I'm attending four tutorials. My choices in the tutorials were driven by work I see coming towards me and my own greedy desires.

Session 1 - Working with Excel Files in Python
I chose this tutorial because whether or not I like Excel is moot. What is of importance is that people often want exports in Excel. Some customers can really spin Excel, and it is part of their critical tool set. Supporting this need ensures more work for me (and Python at NASA SMD) going forward. Notice how I don't mention Excel imports?

Session 2 - Django in the Real World
We are using Django for the NASA SMD Spacebook project. I've gotten what I think is a pretty good handle on Django, but a nice reinforcement might save a lot of headaches down the road.

Session 3 - Scrape the Web
I love to use BeautifulSoup (BS) for web scraping. BS makes it more a game than a chore. Mechanize needs to get into my toolset ASAP. My goal is to pick up some handy new tricks for scraping.

Session 4 - Internet Programming with Python
I'll admit a weakness. My understanding of network protocols is negligible. I'm going to use this class as a springboard to more knowledge.

Sunday, October 12, 2008

Help me with zc.testbrowser

I like zc.testbrowser. Toss in some BeautifulSoup to increase the accuracy of some tests and it's a monstrously useful way to run tests. However...

For the life of me I can't get it to properly handle select fields (select or multi-select). Once I get the control, I can't seem to set select fields as selected.

Any help would be appreciated. This ate way too much of my time. What should have been a trivial test has caused me no end of frustration. The documentation is pretty good, and yet it doesn't seem to explain how to do this sort of thing.

In any case, once answered I plan to put the response in the zc.testbrowser reference card I am cooking up.

Update: Fixed the problem with some help from Aaron Van Derlip. Basically, since zc.testbrowser doesn't do JavaScript, sometimes you have to submit forms and links the hard way. I'll be putting that into my upcoming reference card.

Thursday, April 24, 2008

What I want in a feed aggregator

The list is simple:
  1. One page that displays all the content. Maybe do some pagination, or hide descriptions and just show titles. Otherwise have tags, author, description, and link to original post.
  2. One page with a text area that accepts one feed per line.
  3. Include some sort of authentication.
Ways to get this done
Google App Engine handles #3 for me nicely and gives me free hosting. But feedparser doesn't play well with it and I'm not about to do that kind of debugging. Maybe I ought to try BeautifulSoup?

I'm tempted to try a pure Django system, since that could handle all three, but then I would have to pay for hosting. The same would go for Grok as well. I don't want to pay for hosting yet. Or maybe I ought to just pony up a few bucks a month anyhow...

Of course, I can always write my own simple wxPython client.

What to do... what to do...

Update: Never code on two hours sleep. I'm going with Google App Engine because I realized that when you import feedparser you can't do this:
import feedparser.py
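
Why that line fails: the dot makes Python look for a submodule called py inside a package called feedparser, so the module name has to be written without the file extension. Here is a small sketch of the difference, using the standard library's string module as a stand-in so it runs without feedparser installed.

```python
import importlib

# Importing by module name works; this is equivalent to "import string".
module = importlib.import_module("string")

try:
    # Equivalent to "import string.py": Python looks for a submodule named "py".
    importlib.import_module("string.py")
except ImportError as error:
    failure = error
else:
    failure = None

print(module.__name__)
print(type(failure).__name__)
```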

Wednesday, May 16, 2007

Beautiful Soup is Beautiful

I have a bunch of content stored on an old instance of pmwiki. I've never liked pmwiki, since it seems to only have a half-hacked state method, and just in general feels insecure. Also, I've found that wikis can be useful, but if you have short content on each page, often a FAQ style treatment will do better than a regular wiki.

So I decided to convert the pmwiki pages into a pbwiki toc construct. It would put all the content onto one page, and use the tag to provide a top level table of contents. That meant I would have to:
  1. Scrape the pmwiki content index for all the meaningful links.
  2. Scrape out the title and urls of each link.
  3. Grab the content from each link.
  4. Reformat it all to work in the pbwiki format.
I've done screen scraping before, but not in Python, and not in this scope of effort. Well, Python seems to do everything well so I opened up htmllib and started to play, thinking I would be done by brunch-time.

Immediately I'm unhappy with htmllib. The docs suck. And it just seems awkward to use once I figure it out. Doesn't feel Pythonic, although I'm sure I'm wrong in that respect somehow. It's just that, for me, my Python pseudocode often ends up being close to the finished code. And this was not the case.

Then a work buddy told me about Beautiful Soup. It's an HTML/XML parser that is real easy to use and can work with badly formed HTML, like the sort that pmwiki sometimes generates. It's not optimized for speed, but for usability. That's fine with me, because this is a one-time operation on maybe 150-200 entries.

The final effort worked real nice. Not super fast, but real easy to code. Beautiful Soup meant what I thought would be a quick and simple task remained so.
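
For anyone curious what steps 1 and 2 of the list above look like in code, here is a rough sketch, not the script I actually ran. It uses the standard library's html.parser as a stand-in for Beautiful Soup so that it runs without installing anything, and the class name, function name, and example URL are all made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (title, url) pairs for every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the link currently being read, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative wiki links against the index page address.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the link text into single spaces.
            title = " ".join("".join(self._text).split())
            self.links.append((title, self._href))
            self._href = None


def index_links(markup, base_url):
    """Return the titles and absolute URLs of all links in an index page."""
    collector = LinkCollector(base_url)
    collector.feed(markup)
    collector.close()
    return collector.links


# Hypothetical index fragment with an unclosed <li> and a non-link anchor.
page = '<ul><li><a href="Main/Install">Installing  things</a><li><a name="top">skip me</a></ul>'
for title, url in index_links(page, "http://wiki.example.com/pmwiki.php/"):
    print(title, url)
```

Steps 3 and 4 then loop over those pairs, fetch each URL, and write the content out in the target wiki's markup.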