download links with python
April 9, 2009
About three days ago I started learning Python, and this is a little script I made to download the links on a page. You give it an HTML file (it has to be a local file, it can't take a URL, so save the page first) and the extension of the files you want to download. It uses wget to fetch them. I know there are hundreds of scripts and programs that already do this, probably much better, but I wanted to write it just for didactic purposes, so here it is:
#!/usr/bin/env python
import subprocess
import sys

# Takes two arguments:
#   the local HTML file to search in
#   the extension of the links to download
# Usage: ./filename file.html extension_to_search
ext = str(sys.argv[2])
extLength = len(ext)
with open(sys.argv[1], 'r') as html:
    lineas = html.readlines()
for line in lineas:
    hrefpos = line.find('href="')
    if hrefpos != -1:
        # +6 skips past the length of href="
        rest = line[hrefpos+6:]
        lastpos = rest.find(ext)
        if lastpos != -1:
            # extLength is added so the string includes the extension
            dfile = rest[:lastpos+extLength]
            # pass the link as its own argument so no shell ever parses it
            subprocess.call(["wget", "-c", dfile])
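The loop above only picks up the first href on each line, so a page with several links on one line loses all but one of them. A minimal sketch of one way around that, using html.parser from the standard library (Python 3). The names LinkCollector and find_links are made up for this example, and it matches links that end with the extension rather than merely contain it:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values that end with a given extension."""

    def __init__(self, ext):
        super().__init__()
        self.ext = ext
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        for name, value in attrs:
            if name == "href" and value and value.endswith(self.ext):
                self.links.append(value)


def find_links(html_text, ext):
    """Return every href in html_text that ends with ext, in document order."""
    collector = LinkCollector(ext)
    collector.feed(html_text)
    return collector.links


if __name__ == "__main__":
    sample = '<a href="a.pdf">A</a> <a href="b.txt">B</a> <a href="c.pdf">C</a>'
    print(find_links(sample, ".pdf"))  # ['a.pdf', 'c.pdf']
```

Each entry in the returned list could then be handed to wget exactly as the script does with dfile.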
Cheers
1.
Harshad Joshi | April 9, 2009 at 9:57 pm
Design a web crawler and document indexer…
2.
markuz | April 9, 2009 at 11:37 pm
You should take a look at urllib: first download the page, then look for the links and download the targets.
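A rough sketch of that idea, assuming Python 3, where urlopen lives in urllib.request. The function names targets, download and download_all are hypothetical, and the regular expression is only a simple stand-in for a real HTML parser:

```python
import os
import re
from urllib.parse import urljoin, urlsplit
from urllib.request import urlopen


def targets(page_url, html_text, ext):
    """Return absolute URLs for every href in html_text that ends with ext."""
    hrefs = re.findall(r'href="([^"]+)"', html_text)
    # urljoin turns relative links into absolute ones based on the page URL
    return [urljoin(page_url, h) for h in hrefs if h.endswith(ext)]


def download(url):
    """Save url into the current directory under its last path component."""
    filename = os.path.basename(urlsplit(url).path) or "index.html"
    with urlopen(url) as response, open(filename, "wb") as out:
        out.write(response.read())
    return filename


def download_all(page_url, ext):
    """Fetch page_url, then download every linked file ending with ext."""
    with urlopen(page_url) as response:
        html_text = response.read().decode("utf-8", errors="replace")
    return [download(url) for url in targets(page_url, html_text, ext)]
```

Calling download_all with a page address and an extension such as ".pdf" would then replace both the "save the page first" step and the wget call.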
3.
istodi | April 10, 2009 at 12:36 pm
urllib, mmm... I think I need a little more Python expertise first.
4.
zodman | April 10, 2009 at 12:00 am
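# Python 3: urlopen lives in urllib.request (it was urllib.urlopen in Python 2)
from urllib.request import urlopen
wget = urlopen(dfile)
# "wb" because the download may be binary (a PDF, for example)
myfile = open(filename, "wb")
myfile.write(wget.read())
myfile.close()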
easy!
5.
linxe | April 10, 2009 at 12:09 pm
#!/usr/bin/perl -w
# getLinks.pl < file.html
my $type = '.pdf'; # File type to download
while (<>) { system('wget', '-c', $1) if (/href="([^"]+\Q$type\E)"/i); }