Download links with Python

April 9, 2009

Lately (about 3 days ago) I've started learning Python, and this is a little script I made to download all the links on a page. You give it the HTML file (it has to be a local HTML file; it can't take a URL, so save the page first and use that) and the extension of the files you want to download. It uses wget. I know there are hundreds of scripts and programs that can do this, and probably much better, but I wanted to do it purely as a learning exercise, so here it is:

#!/usr/bin/env python
import subprocess
import sys

# Takes two arguments:
#   1. the local HTML file to search in
#   2. the extension of the links to download
#
# Usage: ./filename file.html extension_to_search

html_path = sys.argv[1]
ext = sys.argv[2]
marker = 'href="'

with open(html_path, 'r') as html_file:
    for line in html_file:
        # only the first href=" on each line is considered
        hrefpos = line.find(marker)
        if hrefpos == -1:
            continue
        # skip past href=" to reach the start of the link
        rest = line[hrefpos + len(marker):]
        lastpos = rest.find(ext)
        if lastpos != -1:
            # add len(ext) so the link includes the extension
            dfile = rest[:lastpos + len(ext)]
            # pass the arguments as a list so the link never goes through a shell
            subprocess.call(["wget", "-c", dfile])
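
The string searching above only sees the first href on each line and will also match an extension that appears in the middle of a link. As a rough alternative, here is a minimal sketch that uses the standard library's html.parser instead, assuming Python 3. The names LinkCollector and find_links are my own, made up for illustration, and it only collects the links; it does not download anything.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that end with a given extension."""

    def __init__(self, ext):
        super().__init__()
        self.ext = ext.lower()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith(self.ext):
                self.links.append(value)


def find_links(html, ext):
    """Return every matching link found in the HTML text, in document order."""
    parser = LinkCollector(ext)
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    sample = '<p><a href="a.pdf">A</a> <a href="b.html">B</a> <a href="c.pdf">C</a></p>'
    print(find_links(sample, ".pdf"))
```

Each link returned by find_links could then be handed to wget the same way the script does.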

Cheers


5 Comments

  • 1. Harshad Joshi  |  April 9, 2009 at 9:57 pm

    Design a web crawler and document indexer… ;)

  • 2. markuz  |  April 9, 2009 at 11:37 pm

    You should take a look at urllib: use it to download the page first, then look for the links and download the targets.

    • 3. istodi  |  April 10, 2009 at 12:36 pm

      urllib, hmm... I think I need a little more Python expertise ;)

  • 4. zodman  |  April 10, 2009 at 12:00 am

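    import urllib.request  # Python 3; in Python 2 the call was urllib.urlopen

    # dfile is the link to fetch; filename is the local name to save it under
    response = urllib.request.urlopen(dfile)
    # "wb" so binary files such as PDFs are written unchanged
    with open(filename, "wb") as myfile:
        myfile.write(response.read())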

    easy!

  • 5. linxe  |  April 10, 2009 at 12:09 pm

    #!/usr/bin/perl -w
    # getLinks.pl < file.html
    my $type = '.pdf'; # File type to download
    while (<>) { system("wget", "-c", $1) if (/href="([^"]+\Q$type\E)"/i); }
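
Following up on the urllib suggestions in the comments, here is a minimal sketch of how the download step might look without wget, assuming Python 3. The helper names absolute_url, local_name and download are hypothetical, chosen only for this example. Links saved from a page are often relative, so the sketch resolves them against the page's address before fetching.

```python
import os
import urllib.request
from urllib.parse import urljoin, urlparse


def absolute_url(page_url, href):
    """Resolve a possibly relative href against the address of the page."""
    return urljoin(page_url, href)


def local_name(url):
    """Pick a local file name from the path part of the URL."""
    name = os.path.basename(urlparse(url).path)
    # fall back to a fixed name when the URL ends with a slash
    return name or "index.html"


def download(url, dest_dir="."):
    """Fetch url and save it in dest_dir; returns the path that was written."""
    target = os.path.join(dest_dir, local_name(url))
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        out.write(response.read())
    return target
```

Unlike wget -c, this does not resume partial downloads; it simply fetches the whole file again.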

