Python web-scraper to download table of transistor counts from Wikipedia
I have been looking for a way to easily scrape data from Wikipedia into a CSV file with Beautiful Soup. This is my code so far. Is there an easier way to do it?
import urllib.request
from bs4 import BeautifulSoup
import csv

# the websites
urls = ['https://en.wikipedia.org/wiki/Transistor_count']
data = []

# getting the websites and the data
for url in urls:
    ## my_url = requests.get(url)
    my_url = urllib.request.urlopen(url)
    html = my_url.read()

    soup = BeautifulSoup(html, 'html.parser')
    My_table = soup.find('table', {'class': 'wikitable sortable'})
    My_second_table = My_table.find_next_sibling('table')

    with open('data.csv', 'w', encoding='UTF-8', newline='') as f:
        fields = ['Title', 'Year']
        writer = csv.writer(f, delimiter=',')
        writer.writerow(fields)

    with open('data.csv', "a", encoding='UTF-8') as csv_file:
        writer = csv.writer(csv_file, delimiter=',')
        for tr in My_table.find_all('tr')[2:]:  # [2:] is to skip empty and header
            tds = tr.find_all('td')
            try:
                title = tds[0].text.replace('\n', '')
            except:
                title = ""
            try:
                year = tds[2].text.replace('\n', '')
            except:
                year = ""
            writer.writerow([title, year])

    with open('data.csv', "a", encoding='UTF-8') as csv_file:
        writer = csv.writer(csv_file, delimiter=',')
        for tr in My_second_table.find_all('tr')[2:]:  # [2:] is to skip empty and header
            tds = tr.find_all('td')
            row = "{}, {}".format(tds[0].text.replace('\n', ''), tds[2].text.replace('\n', ''))
            writer.writerow(row.split(','))
Tags: python, csv, beautifulsoup, wikipedia
@AlexV A user on Stack Overflow suggested I post here; I only posted here because they told me to after posting on Stack Overflow. My code is working, but not the way I want. However, the issue has been solved, so if you want you can remove this post. Thank you anyway.
– Sofelia, Sep 13 at 13:22

You can also fix the issue in the post, reword the question and leave your code for an actual review.
– AlexV, Sep 13 at 13:24

@AlexV Thank you, I reworded it. I do not know if I need it anymore, but I guess it is always nice to get your code reviewed.
– Sofelia, Sep 13 at 13:27

In general, when I want a table from a web page, I select it with my mouse, paste
– WGroleau, Sep 15 at 22:45
5 Answers
Is there a way to simplify this code?
Yes. Don't scrape Wikipedia. Your first thought before "should I need to scrape this thing?" should be "Is there an API that can give me the data I want?" In this case, there super is.
There are many informative links such as this StackOverflow question, but in the end reading the API documentation really is the right thing to do. This should get you started:
from pprint import pprint
import requests, wikitextparser

r = requests.get(
    'https://en.wikipedia.org/w/api.php',
    params={
        'action': 'query',
        'titles': 'Transistor_count',
        'prop': 'revisions',
        'rvprop': 'content',
        'format': 'json',
    },
)
r.raise_for_status()

pages = r.json()['query']['pages']
body = next(iter(pages.values()))['revisions'][0]['*']

doc = wikitextparser.parse(body)
print(f'{len(doc.tables)} tables retrieved')
pprint(doc.tables[0].data())
This may seem more roundabout than scraping the page, but API access gets you structured data, which bypasses an HTML rendering step that you shouldn't have to deal with. This structured data is the actual source of the article and is more reliable.
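A minimal follow-on sketch of writing that structured data out to CSV, assuming the first table keeps the processor name in column 0 and the date of introduction in column 2 (and noting that the cells still contain raw wiki markup, so some cleanup is left over):

import csv

rows = doc.tables[0].data()  # list of rows; the first row is the header
with open('data.csv', 'w', encoding='UTF-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Year'])
    for row in rows[1:]:
        writer.writerow([row[0], row[2]])  # assumed columns: 0 = processor, 2 = date of introduction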
At the end of the day, you still have to parse the wikitext. It's not clear to me why that's really better/easier than parsing HTML - they're just two different markup formats. If Wikipedia had an actual semantic API just for extracting data from tables in articles, that would be different.
– Jack M, Sep 15 at 11:04

@JackM - given the readily-available choices: let MediaWiki render to a format intended for browser representation to a human, or let MediaWiki return structured data intended for consumption by either humans or computers, I would choose the latter every time for this kind of application. Not all markup formats are made equal, and here intent is important. It's entirely possible for MediaWiki to change its rendering engine, still returning visually valid web content but breaking all scrapers that make assumptions about HTML.
– Reinderien, Sep 15 at 13:53

You make some good points. For this application however (a one-off script), parsing the HTML is as good, and it avoids having to spend time learning a new technology and a new tool (wikitext and the wikitext parsing library). Basically, even being presumably more experienced than OP, in their shoes I probably would have chosen to do the same thing in this case, although your general point that you should always look for an API before you try scraping is definitely a lesson they should take on board.
– Jack M, Sep 15 at 13:58

@JackM the scraper would break if the page's UI was updated, so it's a less useful approach if the script is going to be re-run at any point in the future. APIs, on the other hand, are often versioned, and tend to change less often than HTML.
– touch my body, Sep 16 at 18:22
Let me tell you about IMPORTHTML()
...
So here's all the code you need in Google Sheets:
=IMPORTHTML("https://en.wikipedia.org/wiki/Transistor_count", "table", 2)
The import seems to work fine:
And it's possible to download the table as CSV:
Processor,Transistor count,Date of introduction,Designer,MOS process,Area
"MP944 (20-bit, *6-chip*)",,1970[14] (declassified 1998),Garrett AiResearch,,
"Intel 4004 (4-bit, 16-pin)","2,250",1971,Intel,"10,000 nm",12 mm²
"Intel 8008 (8-bit, 18-pin)","3,500",1972,Intel,"10,000 nm",14 mm²
"NEC μCOM-4 (4-bit, 42-pin)","2,500[17][18]",1973,NEC,"7,500 nm[19]",*?*
Toshiba TLCS-12 (12-bit),"over 11,000[20]",1973,Toshiba,"6,000 nm",32 mm²
"Intel 4040 (4-bit, 16-pin)","3,000",1974,Intel,"10,000 nm",12 mm²
"Motorola 6800 (8-bit, 40-pin)","4,100",1974,Motorola,"6,000 nm",16 mm²
...
...
You'll just need to clean the numbers up, which you'd have to do anyway with the API or by scraping with BeautifulSoup.
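A minimal sketch of that cleanup in Python, assuming the only noise is footnote markers like [20], stray words such as "over", and thousands separators:

import re

def clean_count(raw):
    """Strip footnote markers like '[20]', words, and thousands separators; keep the digits."""
    no_refs = re.sub(r'\[\d+\]', '', raw)   # drop footnote references
    digits = re.sub(r'\D', '', no_refs)     # keep only digits
    return int(digits) if digits else None

print(clean_count('2,250'))            # 2250
print(clean_count('over 11,000[20]'))  # 11000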
Copy/pasting the raw text into a spreadsheet (in my case, into Numbers.app) automatically picked up the tabular formatting perfectly. Done.
– Alexander, Sep 14 at 0:09

That's seriously spooky.
– Volker Siegel, Sep 15 at 5:27

@VolkerSiegel: It looks like Google knows a thing or two about scraping data from websites. :D
– Eric Duminil, Sep 15 at 8:49
Make sure to follow naming conventions. You name two variables inappropriately:
My_table = soup.find('table', {'class': 'wikitable sortable'})
My_second_table = My_table.find_next_sibling('table')
Those are just normal variables, not class names, so they should be lower-case:
my_table = soup.find('table', {'class': 'wikitable sortable'})
my_second_table = my_table.find_next_sibling('table')
Twice you do
try:
    title = tds[0].text.replace('\n', '')
except:
    title = ""
I'd specify what exact exception you want to catch so you don't accidentally hide a "real" error if you start making changes in the future. I'm assuming here you're intending to catch an AttributeError. Because you have essentially the same code twice, and because the code is bulky, I'd factor that out into its own function.
Something like:
import bs4

def eliminate_newlines(tag: bs4.element.Tag) -> str:  # Maybe pick a better name
    try:
        return tag.text.replace('\n', '')
    except AttributeError:  # I'm assuming this is what you intend to catch
        return ""
Now the with open block is much neater:
with open('data.csv', "a", encoding='UTF-8') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    for tr in My_table.find_all('tr')[2:]:  # [2:] is to skip empty and header
        tds = tr.find_all('td')
        title = eliminate_newlines(tds[0])
        year = eliminate_newlines(tds[2])
        writer.writerow([title, year])
Edit: I was in the shower, and realized that you're actually probably intending to catch an IndexError in case the page is malformed or something. Same idea though, move that code out into a function to reduce duplication. Something like:
from typing import List

def eliminate_newlines(tags: List[bs4.element.Tag], i: int) -> str:
    return tags[i].text.replace('\n', '') if i < len(tags) else ""
This could also be done using a condition statement instead of expression. I figured that it's pretty simple though, so a one-liner should be fine.
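For comparison, a sketch of that same helper written with an if statement instead of a conditional expression:

def eliminate_newlines(tags: List[bs4.element.Tag], i: int) -> str:
    # Same behaviour as the one-liner, written out as a statement
    if i < len(tags):
        return tags[i].text.replace('\n', '')
    return ""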
If you're using a newer version of Python, lines like:

"{}, {}".format(tds[0].text.replace('\n', ''), tds[2].text.replace('\n', ''))

can make use of f-strings to do in-place string interpolation:

f"{tds[0].text.replace('\n', '')}, {tds[2].text.replace('\n', '')}"
In this particular case, the gain isn't much. They're very helpful for more complicated formatting though.
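For example, format specifications inside the braces handle alignment and digit grouping in one place, which is where f-strings start to pay off (illustrative values only):

count = 2_250
year = 1971
# ':>12,' right-aligns in 12 characters and adds thousands separators
print(f"{year}: {count:>12,} transistors")  # e.g. '1971:        2,250 transistors'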
Using the Wikipedia API
As mentioned in another answer, Wikipedia provides an HTTP API for fetching article content, and you can use it to get the content in much cleaner formats than HTML. Often this is much better (for example, it's a much better choice for this little project I wrote that extracts the first sentence of a Wikipedia article).
However, in your case, you have to parse tables anyway. Whether you're parsing them from HTML or from the Wikipedia API's "wikitext" format, I don't think there's much of a difference. So I consider this subjective.
Using requests, not urllib
Never use urllib, unless you're in an environment where for some reason you can't install external libraries. The requests library is preferred for fetching HTML.
In your case, to get the HTML, the code would be:
html = requests.get(url).text
with an import requests up top. For your simple example, this isn't actually any easier, but it's just a general best practice in Python programming to always use requests, not urllib. It's the preferred library.
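A minimal sketch of that fetch with basic error handling added (the timeout value here is an arbitrary choice):

import requests

response = requests.get(url, timeout=30)  # avoid hanging forever on a dead connection
response.raise_for_status()               # turn 4xx/5xx responses into an exception
html = response.text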
What's with all those with blocks?
You don't need three with blocks. When I glance at this code, three with blocks make it seem like the code is doing something much more complicated than it actually is - it makes it seem like maybe you're writing multiple CSVs, which you aren't. Just use one, it works the same:
with open('data.csv', 'w', encoding='UTF-8', newline='') as f:
    fields = ['Title', 'Year']
    writer = csv.writer(f, delimiter=',')
    writer.writerow(fields)

    for tr in My_table.find_all('tr')[2:]:  # [2:] is to skip empty and header
        tds = tr.find_all('td')
        try:
            title = tds[0].text.replace('\n', '')
        except:
            title = ""
        try:
            year = tds[2].text.replace('\n', '')
        except:
            year = ""
        writer.writerow([title, year])

    for tr in My_second_table.find_all('tr')[2:]:  # [2:] is to skip empty and header
        tds = tr.find_all('td')
        row = "{}, {}".format(tds[0].text.replace('\n', ''), tds[2].text.replace('\n', ''))
        writer.writerow(row.split(','))
Are those two tables really that different?
You have two for loops, each one processing a table and writing it to the CSV. The body of the two for loops is different, so at a glance it looks like maybe the two tables have a different format or something... but they don't. You can copy paste the body of the first for loop into the second and it works the same:
for tr in My_table.find_all('tr')[2:]:  # [2:] is to skip empty and header
    tds = tr.find_all('td')
    try:
        title = tds[0].text.replace('\n', '')
    except:
        title = ""
    try:
        year = tds[2].text.replace('\n', '')
    except:
        year = ""
    writer.writerow([title, year])

for tr in My_second_table.find_all('tr')[2:]:  # [2:] is to skip empty and header
    tds = tr.find_all('td')
    try:
        title = tds[0].text.replace('\n', '')
    except:
        title = ""
    try:
        year = tds[2].text.replace('\n', '')
    except:
        year = ""
    writer.writerow([title, year])
With the above code, the only difference to the resulting CSV is that there aren't spaces after the commas, which I assume is not really important to you. Now that we've established the code doesn't need to be different in the two loops, we can just do this:
table_rows = My_table.find_all('tr')[2:] + My_second_table.find_all('tr')[2:]

for tr in table_rows:
    tds = tr.find_all('td')
    try:
        title = tds[0].text.replace('\n', '')
    except:
        title = ""
    try:
        year = tds[2].text.replace('\n', '')
    except:
        year = ""
    writer.writerow([title, year])
Only one for loop. Much easier to understand!
Parsing strings by hand
So that second loop doesn't need to be there at all, but let's look at the code inside of it anyway:
row = "{}, {}".format(tds[0].text.replace('\n', ''), tds[2].text.replace('\n', ''))
writer.writerow(row.split(','))
Um... you just concatenated two strings together with a comma, just to call split and split them apart at the comma on the very next line. I'm sure now that it's been pointed out you can see this is pointless, but I want to pull you up on one other thing in these two lines of code.
You are essentially trying to parse data by hand with that row.split, which is always dangerous. This is an important and general lesson about programming. What if the name of the chip had a comma in it? Then row would contain more commas than just the one that you put in there, and your call to writerow would end up inserting more than two columns!
Never parse data by hand unless you absolutely have to, and never write data in formats like CSV or JSON by hand unless you absolutely have to. Always use a library, because there are always pathological edge cases like a comma in the chip name that you won't think of and which will break your code. The libraries, if they've been around for a while, have had those bugs ironed out. With these two lines:
row = "{}, {}".format(tds[0].text.replace('\n', ''), tds[2].text.replace('\n', ''))
writer.writerow(row.split(','))
you are attempting to split a table row into its two columns yourself, by hand, which is why you made a mistake (just like anyone would). Whereas in the first loop, the code which does this splitting is these two lines:
title = tds[0].text.replace('\n', '')
year = tds[2].text.replace('\n', '')
Here you are relying on BeautifulSoup to have split the columns cleanly into tds[0] and tds[2], which is much safer and is why this code is much better.
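A quick demonstration of why letting the library do the work is safer here: csv.writer quotes any field that contains the delimiter, so a comma inside a processor name stays in one column (illustrative values only):

import csv, io

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['MP944 (20-bit, 6-chip)', '1970'])  # the title itself contains a comma
print(buffer.getvalue())  # "MP944 (20-bit, 6-chip)",1970  -- still exactly two columns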
Mixing of input parsing and output generating
The code that parses the HTML is mixed together with the code that generates the CSV. This is poor breaking of a problem down into sub-problems. The code that writes the CSV should just be thinking in terms of titles and years, it shouldn't have to know that they come from HTML, and the code that parses the HTML should just be solving the problem of extracting the titles and years, it should have no idea that that data is going to be written to a CSV. In other words, I want that for loop that writes the CSV to look like this:
for (title, year) in rows:
    writer.writerow([title, year])
We can do this by rewriting the with block like this:
with open('data.csv', 'w', encoding='UTF-8', newline='') as f:
    fields = ['Title', 'Year']
    writer = csv.writer(f, delimiter=',')
    writer.writerow(fields)

    table_rows = My_table.find_all('tr')[2:] + My_second_table.find_all('tr')[2:]

    parsed_rows = []
    for tr in table_rows:
        tds = tr.find_all('td')
        try:
            title = tds[0].text.replace('\n', '')
        except:
            title = ""
        try:
            year = tds[2].text.replace('\n', '')
        except:
            year = ""
        parsed_rows.append((title, year))

    for (title, year) in parsed_rows:
        writer.writerow([title, year])
Factoring into functions
To make the code more readable and really separate the HTML stuff from the CSV stuff, we can break the script into functions. Here's my complete script.
import requests
from bs4 import BeautifulSoup
import csv

urls = ['https://en.wikipedia.org/wiki/Transistor_count']
data = []

def get_rows(html):
    soup = BeautifulSoup(html, 'html.parser')
    My_table = soup.find('table', {'class': 'wikitable sortable'})
    My_second_table = My_table.find_next_sibling('table')
    table_rows = My_table.find_all('tr')[2:] + My_second_table.find_all('tr')[2:]

    parsed_rows = []
    for tr in table_rows:
        tds = tr.find_all('td')
        try:
            title = tds[0].text.replace('\n', '')
        except:
            title = ""
        try:
            year = tds[2].text.replace('\n', '')
        except:
            year = ""
        parsed_rows.append((title, year))
    return parsed_rows

for url in urls:
    html = requests.get(url).text
    parsed_rows = get_rows(html)

    with open('data.csv', 'w', encoding='UTF-8', newline='') as f:
        fields = ['Title', 'Year']
        writer = csv.writer(f, delimiter=',')
        writer.writerow(fields)
        for (title, year) in parsed_rows:
            writer.writerow([title, year])
What would be even better would be for get_rows to be a generator rather than a regular function, but that's advanced Python programming. This is fine for now.
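A sketch of what that generator version could look like, assuming the same table structure as above (the bare except is also narrowed to IndexError, as another answer suggests):

def get_rows(html):
    soup = BeautifulSoup(html, 'html.parser')
    My_table = soup.find('table', {'class': 'wikitable sortable'})
    My_second_table = My_table.find_next_sibling('table')
    table_rows = My_table.find_all('tr')[2:] + My_second_table.find_all('tr')[2:]

    for tr in table_rows:
        tds = tr.find_all('td')
        try:
            title = tds[0].text.replace('\n', '')
        except IndexError:
            title = ""
        try:
            year = tds[2].text.replace('\n', '')
        except IndexError:
            year = ""
        yield (title, year)  # hand back one row at a time instead of building a list

The loop that writes the CSV stays exactly the same, because iterating over a generator looks just like iterating over a list.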
"In Python 3 the requests library comes built-in…" It does not, unfortunately. But you can easily install it using pip install requests.
– grooveplex, Sep 15 at 17:46

@grooveplex Oh, maybe it just comes with Ubuntu? I don't think I've had to install requests in recent memory.
– Jack M, Sep 15 at 19:10
When you asked for an easier way, do you mean an easier way that uses “real code” as in the cartoon?
That looks pretty easy, but I think slightly easier is possible.
In general, when I want a table from a web page, I select it with my mouse and paste special (unformatted) into a spreadsheet. Bonus is that I don’t have to deal with NSA’s biggest competitor (Google).
$endgroup$
add a comment
|
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/4.0/"u003ecc by-sa 4.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f228969%2fpython-web-scraper-to-download-table-of-transistor-counts-from-wikipedia%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
5 Answers
5
active
oldest
votes
5 Answers
5
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
Is there a way to simplify this code?
Yes. Don't scrape Wikipedia. Your first thought before "should I need to scrape this thing?" should be "Is there an API that can give me the data I want?" In this case, there super is.
There are many informative links such as this StackOverflow question, but in the end reading the API documentation really is the right thing to do. This should get you started:
from pprint import pprint
import requests, wikitextparser
r = requests.get(
'https://en.wikipedia.org/w/api.php',
params=
'action': 'query',
'titles': 'Transistor_count',
'prop': 'revisions',
'rvprop': 'content',
'format': 'json',
)
r.raise_for_status()
pages = r.json()['query']['pages']
body = next(iter(pages.values()))['revisions'][0]['*']
doc = wikitextparser.parse(body)
print(f'len(doc.tables) tables retrieved')
pprint(doc.tables[0].data())
This may seem more roundabout than scraping the page, but API access gets you structured data, which bypasses an HTML rendering step that you shouldn't have to deal with. This structured data is the actual source of the article and is more reliable.
$endgroup$
4
$begingroup$
At the end of the day, you still have to parse the wikitext. It's not clear to me why that's really better/easier than parsing HTML - they're just two different markup formats. If Wikipedia had an actual semantic API just for extracting data from tables in articles, that would be different.
$endgroup$
– Jack M
Sep 15 at 11:04
3
$begingroup$
@JackM - given the readily-available choices: let MediaWiki render to a format intended for browser representation to a human, or let MediaWiki return structured data intended for consumption by either humans or computers, I would choose the latter every time for this kind of application. Not all markup formats are made equal, and here intent is important. It's entirely possible for MediaWiki to change its rendering engine, still returning visually valid web content but breaking all scrapers that make assumptions about HTML.
$endgroup$
– Reinderien
Sep 15 at 13:53
5
$begingroup$
You make some good points. For this application however (a one-off script), parsing the HTML is as good, and it avoids having to spend time learning a new technology and a new tool (wikitext and the wikitext parsing library). Basically, even being presumably more experienced than OP, in their shoes I probably would have chosen to do the same thing in this case, although your general point that you should always look for an API before you try scraping is definitely a lesson they should take on-board.
$endgroup$
– Jack M
Sep 15 at 13:58
$begingroup$
@JackM the scraper would break if the page's UI was updated, so it's a less useful approach if the script is going to be re-run at any point in the future. APIs on the other hand, are often versioned, and tend to change less often than HTML
$endgroup$
– touch my body
Sep 16 at 18:22
add a comment
|
$begingroup$
Is there a way to simplify this code?
Yes. Don't scrape Wikipedia. Your first thought before "should I need to scrape this thing?" should be "Is there an API that can give me the data I want?" In this case, there super is.
There are many informative links such as this StackOverflow question, but in the end reading the API documentation really is the right thing to do. This should get you started:
from pprint import pprint
import requests, wikitextparser
r = requests.get(
'https://en.wikipedia.org/w/api.php',
params=
'action': 'query',
'titles': 'Transistor_count',
'prop': 'revisions',
'rvprop': 'content',
'format': 'json',
)
r.raise_for_status()
pages = r.json()['query']['pages']
body = next(iter(pages.values()))['revisions'][0]['*']
doc = wikitextparser.parse(body)
print(f'len(doc.tables) tables retrieved')
pprint(doc.tables[0].data())
This may seem more roundabout than scraping the page, but API access gets you structured data, which bypasses an HTML rendering step that you shouldn't have to deal with. This structured data is the actual source of the article and is more reliable.
$endgroup$
4
$begingroup$
At the end of the day, you still have to parse the wikitext. It's not clear to me why that's really better/easier than parsing HTML - they're just two different markup formats. If Wikipedia had an actual semantic API just for extracting data from tables in articles, that would be different.
$endgroup$
– Jack M
Sep 15 at 11:04
3
$begingroup$
@JackM - given the readily-available choices: let MediaWiki render to a format intended for browser representation to a human, or let MediaWiki return structured data intended for consumption by either humans or computers, I would choose the latter every time for this kind of application. Not all markup formats are made equal, and here intent is important. It's entirely possible for MediaWiki to change its rendering engine, still returning visually valid web content but breaking all scrapers that make assumptions about HTML.
$endgroup$
– Reinderien
Sep 15 at 13:53
5
$begingroup$
You make some good points. For this application however (a one-off script), parsing the HTML is as good, and it avoids having to spend time learning a new technology and a new tool (wikitext and the wikitext parsing library). Basically, even being presumably more experienced than OP, in their shoes I probably would have chosen to do the same thing in this case, although your general point that you should always look for an API before you try scraping is definitely a lesson they should take on-board.
$endgroup$
– Jack M
Sep 15 at 13:58
$begingroup$
@JackM the scraper would break if the page's UI was updated, so it's a less useful approach if the script is going to be re-run at any point in the future. APIs on the other hand, are often versioned, and tend to change less often than HTML
$endgroup$
– touch my body
Sep 16 at 18:22
add a comment
|
$begingroup$
Is there a way to simplify this code?
Yes. Don't scrape Wikipedia. Your first thought before "should I need to scrape this thing?" should be "Is there an API that can give me the data I want?" In this case, there super is.
There are many informative links such as this StackOverflow question, but in the end reading the API documentation really is the right thing to do. This should get you started:
from pprint import pprint
import requests, wikitextparser
r = requests.get(
'https://en.wikipedia.org/w/api.php',
params=
'action': 'query',
'titles': 'Transistor_count',
'prop': 'revisions',
'rvprop': 'content',
'format': 'json',
)
r.raise_for_status()
pages = r.json()['query']['pages']
body = next(iter(pages.values()))['revisions'][0]['*']
doc = wikitextparser.parse(body)
print(f'len(doc.tables) tables retrieved')
pprint(doc.tables[0].data())
This may seem more roundabout than scraping the page, but API access gets you structured data, which bypasses an HTML rendering step that you shouldn't have to deal with. This structured data is the actual source of the article and is more reliable.
$endgroup$
Is there a way to simplify this code?
Yes. Don't scrape Wikipedia. Your first thought before "should I need to scrape this thing?" should be "Is there an API that can give me the data I want?" In this case, there super is.
There are many informative links such as this StackOverflow question, but in the end reading the API documentation really is the right thing to do. This should get you started:
from pprint import pprint
import requests, wikitextparser
r = requests.get(
'https://en.wikipedia.org/w/api.php',
params=
'action': 'query',
'titles': 'Transistor_count',
'prop': 'revisions',
'rvprop': 'content',
'format': 'json',
)
r.raise_for_status()
pages = r.json()['query']['pages']
body = next(iter(pages.values()))['revisions'][0]['*']
doc = wikitextparser.parse(body)
print(f'len(doc.tables) tables retrieved')
pprint(doc.tables[0].data())
This may seem more roundabout than scraping the page, but API access gets you structured data, which bypasses an HTML rendering step that you shouldn't have to deal with. This structured data is the actual source of the article and is more reliable.
edited Sep 14 at 15:26
answered Sep 13 at 13:59
ReinderienReinderien
12.7k22 silver badges50 bronze badges
12.7k22 silver badges50 bronze badges
4
$begingroup$
At the end of the day, you still have to parse the wikitext. It's not clear to me why that's really better/easier than parsing HTML - they're just two different markup formats. If Wikipedia had an actual semantic API just for extracting data from tables in articles, that would be different.
$endgroup$
– Jack M
Sep 15 at 11:04
3
$begingroup$
@JackM - given the readily-available choices: let MediaWiki render to a format intended for browser representation to a human, or let MediaWiki return structured data intended for consumption by either humans or computers, I would choose the latter every time for this kind of application. Not all markup formats are made equal, and here intent is important. It's entirely possible for MediaWiki to change its rendering engine, still returning visually valid web content but breaking all scrapers that make assumptions about HTML.
$endgroup$
– Reinderien
Sep 15 at 13:53
5
$begingroup$
You make some good points. For this application however (a one-off script), parsing the HTML is as good, and it avoids having to spend time learning a new technology and a new tool (wikitext and the wikitext parsing library). Basically, even being presumably more experienced than OP, in their shoes I probably would have chosen to do the same thing in this case, although your general point that you should always look for an API before you try scraping is definitely a lesson they should take on-board.
$endgroup$
– Jack M
Sep 15 at 13:58
$begingroup$
@JackM the scraper would break if the page's UI was updated, so it's a less useful approach if the script is going to be re-run at any point in the future. APIs on the other hand, are often versioned, and tend to change less often than HTML
$endgroup$
– touch my body
Sep 16 at 18:22
add a comment
|
4
$begingroup$
At the end of the day, you still have to parse the wikitext. It's not clear to me why that's really better/easier than parsing HTML - they're just two different markup formats. If Wikipedia had an actual semantic API just for extracting data from tables in articles, that would be different.
$endgroup$
– Jack M
Sep 15 at 11:04
3
$begingroup$
@JackM - given the readily-available choices: let MediaWiki render to a format intended for browser representation to a human, or let MediaWiki return structured data intended for consumption by either humans or computers, I would choose the latter every time for this kind of application. Not all markup formats are made equal, and here intent is important. It's entirely possible for MediaWiki to change its rendering engine, still returning visually valid web content but breaking all scrapers that make assumptions about HTML.
$endgroup$
– Reinderien
Sep 15 at 13:53
5
$begingroup$
You make some good points. For this application however (a one-off script), parsing the HTML is as good, and it avoids having to spend time learning a new technology and a new tool (wikitext and the wikitext parsing library). Basically, even being presumably more experienced than OP, in their shoes I probably would have chosen to do the same thing in this case, although your general point that you should always look for an API before you try scraping is definitely a lesson they should take on-board.
$endgroup$
– Jack M
Sep 15 at 13:58
$begingroup$
@JackM the scraper would break if the page's UI was updated, so it's a less useful approach if the script is going to be re-run at any point in the future. APIs on the other hand, are often versioned, and tend to change less often than HTML
$endgroup$
– touch my body
Sep 16 at 18:22
4
4
$begingroup$
At the end of the day, you still have to parse the wikitext. It's not clear to me why that's really better/easier than parsing HTML - they're just two different markup formats. If Wikipedia had an actual semantic API just for extracting data from tables in articles, that would be different.
$endgroup$
– Jack M
Sep 15 at 11:04
$begingroup$
At the end of the day, you still have to parse the wikitext. It's not clear to me why that's really better/easier than parsing HTML - they're just two different markup formats. If Wikipedia had an actual semantic API just for extracting data from tables in articles, that would be different.
$endgroup$
– Jack M
Sep 15 at 11:04
3
3
$begingroup$
@JackM - given the readily-available choices: let MediaWiki render to a format intended for browser representation to a human, or let MediaWiki return structured data intended for consumption by either humans or computers, I would choose the latter every time for this kind of application. Not all markup formats are made equal, and here intent is important. It's entirely possible for MediaWiki to change its rendering engine, still returning visually valid web content but breaking all scrapers that make assumptions about HTML.
$endgroup$
– Reinderien
Sep 15 at 13:53
$begingroup$
@JackM - given the readily-available choices: let MediaWiki render to a format intended for browser representation to a human, or let MediaWiki return structured data intended for consumption by either humans or computers, I would choose the latter every time for this kind of application. Not all markup formats are made equal, and here intent is important. It's entirely possible for MediaWiki to change its rendering engine, still returning visually valid web content but breaking all scrapers that make assumptions about HTML.
$endgroup$
– Reinderien
Sep 15 at 13:53
5
5
$begingroup$
You make some good points. For this application however (a one-off script), parsing the HTML is as good, and it avoids having to spend time learning a new technology and a new tool (wikitext and the wikitext parsing library). Basically, even being presumably more experienced than OP, in their shoes I probably would have chosen to do the same thing in this case, although your general point that you should always look for an API before you try scraping is definitely a lesson they should take on-board.
$endgroup$
– Jack M
Sep 15 at 13:58
$begingroup$
You make some good points. For this application however (a one-off script), parsing the HTML is as good, and it avoids having to spend time learning a new technology and a new tool (wikitext and the wikitext parsing library). Basically, even being presumably more experienced than OP, in their shoes I probably would have chosen to do the same thing in this case, although your general point that you should always look for an API before you try scraping is definitely a lesson they should take on-board.
$endgroup$
– Jack M
Sep 15 at 13:58
$begingroup$
@JackM the scraper would break if the page's UI was updated, so it's a less useful approach if the script is going to be re-run at any point in the future. APIs on the other hand, are often versioned, and tend to change less often than HTML
$endgroup$
– touch my body
Sep 16 at 18:22
$begingroup$
@JackM the scraper would break if the page's UI was updated, so it's a less useful approach if the script is going to be re-run at any point in the future. APIs on the other hand, are often versioned, and tend to change less often than HTML
$endgroup$
– touch my body
Sep 16 at 18:22
add a comment
|
$begingroup$
Let me tell you about IMPORTHTML()
...
So here's all the code you need in Google Sheets:
=IMPORTHTML("https://en.wikipedia.org/wiki/Transistor_count", "table", 2)
The import seems to work fine:
And it's possible to download the table as CSV:
Processor,Transistor count,Date of introduction,Designer,MOS process,Area
"MP944 (20-bit, *6-chip*)",,1970[14] (declassified 1998),Garrett AiResearch,,
"Intel 4004 (4-bit, 16-pin)","2,250",1971,Intel,"10,000 nm",12 mm²
"Intel 8008 (8-bit, 18-pin)","3,500",1972,Intel,"10,000 nm",14 mm²
"NEC μCOM-4 (4-bit, 42-pin)","2,500[17][18]",1973,NEC,"7,500 nm[19]",*?*
Toshiba TLCS-12 (12-bit),"over 11,000[20]",1973,Toshiba,"6,000 nm",32 mm²
"Intel 4040 (4-bit, 16-pin)","3,000",1974,Intel,"10,000 nm",12 mm²
"Motorola 6800 (8-bit, 40-pin)","4,100",1974,Motorola,"6,000 nm",16 mm²
...
...
You'll just need to clean the numbers up, which you'd have to do anyway with the API or by scraping with BeautifulSoup.
$endgroup$
6
$begingroup$
Copy/pasting the raw text into a spreadsheet (in my case, into Numbers.app) automatically picked up the tabular formatting perfectly. Done.
$endgroup$
– Alexander
Sep 14 at 0:09
7
$begingroup$
That's seriously spooky.
$endgroup$
– Volker Siegel
Sep 15 at 5:27
17
$begingroup$
@VolkerSiegel: It looks like Google knows a thing or two about scraping data from websites. :D
$endgroup$
– Eric Duminil
Sep 15 at 8:49
add a comment
|
$begingroup$
Let me tell you about IMPORTHTML()
...
So here's all the code you need in Google Sheets:
=IMPORTHTML("https://en.wikipedia.org/wiki/Transistor_count", "table", 2)
The import seems to work fine:
And it's possible to download the table as CSV:
Processor,Transistor count,Date of introduction,Designer,MOS process,Area
"MP944 (20-bit, *6-chip*)",,1970[14] (declassified 1998),Garrett AiResearch,,
"Intel 4004 (4-bit, 16-pin)","2,250",1971,Intel,"10,000 nm",12 mm²
"Intel 8008 (8-bit, 18-pin)","3,500",1972,Intel,"10,000 nm",14 mm²
"NEC μCOM-4 (4-bit, 42-pin)","2,500[17][18]",1973,NEC,"7,500 nm[19]",*?*
Toshiba TLCS-12 (12-bit),"over 11,000[20]",1973,Toshiba,"6,000 nm",32 mm²
"Intel 4040 (4-bit, 16-pin)","3,000",1974,Intel,"10,000 nm",12 mm²
"Motorola 6800 (8-bit, 40-pin)","4,100",1974,Motorola,"6,000 nm",16 mm²
...
...
You'll just need to clean the numbers up, which you'd have to do anyway with the API or by scraping with BeautifulSoup.
$endgroup$
6
$begingroup$
Copy/pasting the raw text into a spreadsheet (in my case, into Numbers.app) automatically picked up the tabular formatting perfectly. Done.
$endgroup$
– Alexander
Sep 14 at 0:09
7
$begingroup$
That's seriously spooky.
$endgroup$
– Volker Siegel
Sep 15 at 5:27
17
$begingroup$
@VolkerSiegel: It looks like Google knows a thing or two about scraping data from websites. :D
$endgroup$
– Eric Duminil
Sep 15 at 8:49
add a comment
|
$begingroup$
Let me tell you about IMPORTHTML()
...
So here's all the code you need in Google Sheets:
=IMPORTHTML("https://en.wikipedia.org/wiki/Transistor_count", "table", 2)
The import seems to work fine:
And it's possible to download the table as CSV:
Processor,Transistor count,Date of introduction,Designer,MOS process,Area
"MP944 (20-bit, *6-chip*)",,1970[14] (declassified 1998),Garrett AiResearch,,
"Intel 4004 (4-bit, 16-pin)","2,250",1971,Intel,"10,000 nm",12 mm²
"Intel 8008 (8-bit, 18-pin)","3,500",1972,Intel,"10,000 nm",14 mm²
"NEC μCOM-4 (4-bit, 42-pin)","2,500[17][18]",1973,NEC,"7,500 nm[19]",*?*
Toshiba TLCS-12 (12-bit),"over 11,000[20]",1973,Toshiba,"6,000 nm",32 mm²
"Intel 4040 (4-bit, 16-pin)","3,000",1974,Intel,"10,000 nm",12 mm²
"Motorola 6800 (8-bit, 40-pin)","4,100",1974,Motorola,"6,000 nm",16 mm²
...
...
You'll just need to clean the numbers up, which you'd have to do anyway with the API or by scraping with BeautifulSoup.
$endgroup$
Let me tell you about IMPORTHTML()
...
So here's all the code you need in Google Sheets:
=IMPORTHTML("https://en.wikipedia.org/wiki/Transistor_count", "table", 2)
The import seems to work fine:
And it's possible to download the table as CSV:
Processor,Transistor count,Date of introduction,Designer,MOS process,Area
"MP944 (20-bit, *6-chip*)",,1970[14] (declassified 1998),Garrett AiResearch,,
"Intel 4004 (4-bit, 16-pin)","2,250",1971,Intel,"10,000 nm",12 mm²
"Intel 8008 (8-bit, 18-pin)","3,500",1972,Intel,"10,000 nm",14 mm²
"NEC μCOM-4 (4-bit, 42-pin)","2,500[17][18]",1973,NEC,"7,500 nm[19]",*?*
Toshiba TLCS-12 (12-bit),"over 11,000[20]",1973,Toshiba,"6,000 nm",32 mm²
"Intel 4040 (4-bit, 16-pin)","3,000",1974,Intel,"10,000 nm",12 mm²
"Motorola 6800 (8-bit, 40-pin)","4,100",1974,Motorola,"6,000 nm",16 mm²
...
...
You'll just need to clean the numbers up, which you'd have to do anyway with the API or by scraping with BeautifulSoup.
edited Sep 13 at 21:39
answered Sep 13 at 21:32
Eric DuminilEric Duminil
3,4711 gold badge10 silver badges18 bronze badges
3,4711 gold badge10 silver badges18 bronze badges
6
$begingroup$
Copy/pasting the raw text into a spreadsheet (in my case, into Numbers.app) automatically picked up the tabular formatting perfectly. Done.
$endgroup$
– Alexander
Sep 14 at 0:09
7
$begingroup$
That's seriously spooky.
$endgroup$
– Volker Siegel
Sep 15 at 5:27
17
$begingroup$
@VolkerSiegel: It looks like Google knows a thing or two about scraping data from websites. :D
$endgroup$
– Eric Duminil
Sep 15 at 8:49
add a comment
|
6
$begingroup$
Copy/pasting the raw text into a spreadsheet (in my case, into Numbers.app) automatically picked up the tabular formatting perfectly. Done.
$endgroup$
– Alexander
Sep 14 at 0:09
7
$begingroup$
That's seriously spooky.
$endgroup$
– Volker Siegel
Sep 15 at 5:27
17
$begingroup$
@VolkerSiegel: It looks like Google knows a thing or two about scraping data from websites. :D
$endgroup$
– Eric Duminil
Sep 15 at 8:49
6
6
$begingroup$
Copy/pasting the raw text into a spreadsheet (in my case, into Numbers.app) automatically picked up the tabular formatting perfectly. Done.
$endgroup$
– Alexander
Sep 14 at 0:09
$begingroup$
Copy/pasting the raw text into a spreadsheet (in my case, into Numbers.app) automatically picked up the tabular formatting perfectly. Done.
$endgroup$
– Alexander
Sep 14 at 0:09
7
7
$begingroup$
That's seriously spooky.
$endgroup$
– Volker Siegel
Sep 15 at 5:27
$begingroup$
That's seriously spooky.
$endgroup$
– Volker Siegel
Sep 15 at 5:27
17
17
$begingroup$
@VolkerSiegel: It looks like Google knows a thing or two about scraping data from websites. :D
$endgroup$
– Eric Duminil
Sep 15 at 8:49
$begingroup$
@VolkerSiegel: It looks like Google knows a thing or two about scraping data from websites. :D
$endgroup$
– Eric Duminil
Sep 15 at 8:49
add a comment
|
$begingroup$
Make sure to follow naming conventions. You name two variables inappropriately:
My_table = soup.find('table','class':'wikitable sortable')
My_second_table = My_table.find_next_sibling('table')
Those are just normal variables, not class names, so they should be lower-case:
my_table = soup.find('table','class':'wikitable sortable')
my_second_table = my_table.find_next_sibling('table')
Twice you do
try:
title = tds[0].text.replace('n','')
except:
title = ""
I'd specify what exact exception you want to catch so you don't accidentally hide a "real" error if you start making changes in the future. I'm assuming here you're intending to catch an
AttributeError
.Because you have essentially the same code twice, and because the code is bulky, I'd factor that out into its own function.
Something like:
import bs4
def eliminate_newlines(tag: bs4.element.Tag) -> str: # Maybe pick a better name
try:
return tag.text.replace('n', '')
except AttributeError: # I'm assuming this is what you intend to catch
return ""
Now that with open
block is much neater:
with open('data.csv', "a", encoding='UTF-8') as csv_file:
writer = csv.writer(csv_file, delimiter=',')
for tr in My_table.find_all('tr')[2:]: # [2:] is to skip empty and header
tds = tr.find_all('td')
title = eliminate_newlines(tds[0])
year = eliminate_newlines(tds[2])
writer.writerow([title, year])
Edit: I was in the shower, and realized that you're actually probably intending to catch an IndexError
in case the page is malformed or something. Same idea though, move that code out into a function to reduce duplication. Something like:
from typing import List
def eliminate_newlines(tags: List[bs4.element.Tag], i: int) -> str:
return tags[i].text.replace('n', '') if len(tags) < i else ""
This could also be done using a condition statement instead of expression. I figured that it's pretty simple though, so a one-liner should be fine.
If you're using a newer version of Python, lines like:
", ".format(tds[0].text.replace('n',''), tds[2].text.replace('n',''))
Can make use of f-strings to do in-place string interpolation:
f"tds[0].text.replace('n', ''), tds[2].text.replace('n', '')"
In this particular case, the gain isn't much. They're very helpful for more complicated formatting though.
$endgroup$
add a comment
|
$begingroup$
Make sure to follow naming conventions. You name two variables inappropriately:
My_table = soup.find('table','class':'wikitable sortable')
My_second_table = My_table.find_next_sibling('table')
Those are just normal variables, not class names, so they should be lower-case:
my_table = soup.find('table','class':'wikitable sortable')
my_second_table = my_table.find_next_sibling('table')
Twice you do
try:
title = tds[0].text.replace('n','')
except:
title = ""
I'd specify what exact exception you want to catch so you don't accidentally hide a "real" error if you start making changes in the future. I'm assuming here you're intending to catch an
AttributeError
.Because you have essentially the same code twice, and because the code is bulky, I'd factor that out into its own function.
Something like:
import bs4
def eliminate_newlines(tag: bs4.element.Tag) -> str: # Maybe pick a better name
try:
return tag.text.replace('n', '')
except AttributeError: # I'm assuming this is what you intend to catch
return ""
Now that with open
block is much neater:
with open('data.csv', "a", encoding='UTF-8') as csv_file:
writer = csv.writer(csv_file, delimiter=',')
for tr in My_table.find_all('tr')[2:]: # [2:] is to skip empty and header
tds = tr.find_all('td')
title = eliminate_newlines(tds[0])
year = eliminate_newlines(tds[2])
writer.writerow([title, year])
Edit: I was in the shower, and realized that you're actually probably intending to catch an IndexError
in case the page is malformed or something. Same idea though, move that code out into a function to reduce duplication. Something like:
from typing import List
def eliminate_newlines(tags: List[bs4.element.Tag], i: int) -> str:
return tags[i].text.replace('n', '') if len(tags) < i else ""
This could also be done using a condition statement instead of expression. I figured that it's pretty simple though, so a one-liner should be fine.
If you're using a newer version of Python, lines like:
", ".format(tds[0].text.replace('n',''), tds[2].text.replace('n',''))
Can make use of f-strings to do in-place string interpolation:
f"tds[0].text.replace('n', ''), tds[2].text.replace('n', '')"
In this particular case, the gain isn't much. They're very helpful for more complicated formatting though.
$endgroup$
add a comment
|
edited Sep 13 at 19:15
answered Sep 13 at 17:47
Carcigenicate
10.4k · 1 gold badge · 20 silver badges · 50 bronze badges
$begingroup$
Using the Wikipedia API
As mentioned in another answer, Wikipedia provides an HTTP API for fetching article content, and you can use it to get the content in much cleaner formats than HTML. Often this is the better option; for example, it was a much better fit for a little project I wrote that extracts the first sentence of a Wikipedia article.
However, in your case, you have to parse tables anyway. Whether you're parsing them from HTML or from the Wikipedia API's "wikitext" format, I don't think there's much of a difference. So I consider this subjective.
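For the curious, a rough sketch of what the API route might look like, using the standard MediaWiki action API endpoint and the requests library discussed below (you would still have to parse the wikitable markup out of the returned wikitext yourself):

import requests

# Fetch the raw wikitext of the article via the MediaWiki action API.
response = requests.get(
    'https://en.wikipedia.org/w/api.php',
    params={
        'action': 'parse',
        'page': 'Transistor_count',
        'prop': 'wikitext',
        'format': 'json',
        'formatversion': 2,
    },
)
wikitext = response.json()['parse']['wikitext']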
Using requests, not urllib
Never use urllib unless you're in an environment where, for some reason, you can't install external libraries. The requests library is preferred for fetching HTML.
In your case, to get the HTML, the code would be:
html = requests.get(url).text
with an import requests up top. For your simple example this isn't actually any easier, but reaching for requests rather than urllib is a general best practice in Python programming.
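If you want to be a little more defensive, requests also makes it easy to add a timeout and fail loudly on HTTP errors. A small sketch, assuming the url variable from your loop:

import requests

response = requests.get(url, timeout=10)  # don't hang forever on a slow connection
response.raise_for_status()               # raises requests.HTTPError for 4xx/5xx responses
html = response.text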
What's with all those with blocks?
You don't need three with blocks. When I glance at this code, three with blocks make it seem like the code is doing something much more complicated than it actually is: it looks as though you might be writing multiple CSV files, which you aren't. Just use one; it works the same:
with open('data.csv', 'w', encoding='UTF-8', newline='') as f:
    fields = ['Title', 'Year']
    writer = csv.writer(f, delimiter=',')
    writer.writerow(fields)
    for tr in My_table.find_all('tr')[2:]:  # [2:] is to skip empty and header
        tds = tr.find_all('td')
        try:
            title = tds[0].text.replace('\n', '')
        except:
            title = ""
        try:
            year = tds[2].text.replace('\n', '')
        except:
            year = ""
        writer.writerow([title, year])
    for tr in My_second_table.find_all('tr')[2:]:  # [2:] is to skip empty and header
        tds = tr.find_all('td')
        row = "{}, {}".format(tds[0].text.replace('\n', ''), tds[2].text.replace('\n', ''))
        writer.writerow(row.split(','))
Are those two tables really that different?
You have two for loops, each one processing a table and writing it to the CSV. The bodies of the two loops are different, so at a glance it looks like the two tables might have a different format... but they don't. You can copy-paste the body of the first loop into the second and it works the same:
for tr in My_table.find_all('tr')[2:]:  # [2:] is to skip empty and header
    tds = tr.find_all('td')
    try:
        title = tds[0].text.replace('\n', '')
    except:
        title = ""
    try:
        year = tds[2].text.replace('\n', '')
    except:
        year = ""
    writer.writerow([title, year])
for tr in My_second_table.find_all('tr')[2:]:  # [2:] is to skip empty and header
    tds = tr.find_all('td')
    try:
        title = tds[0].text.replace('\n', '')
    except:
        title = ""
    try:
        year = tds[2].text.replace('\n', '')
    except:
        year = ""
    writer.writerow([title, year])
With the above code, the only difference to the resulting CSV is that there aren't spaces after the commas, which I assume is not really important to you. Now that we've established the code doesn't need to be different in the two loops, we can just do this:
table_rows = My_table.find_all('tr')[2:] + My_second_table.find_all('tr')[2:]
for tr in table_rows:
    tds = tr.find_all('td')
    try:
        title = tds[0].text.replace('\n', '')
    except:
        title = ""
    try:
        year = tds[2].text.replace('\n', '')
    except:
        year = ""
    writer.writerow([title, year])
Only one for loop. Much easier to understand!
Parsing strings by hand
So that second loop doesn't need to be there at all, but let's look at the code inside of it anyway:
row = ", ".format(tds[0].text.replace('n',''), tds[2].text.replace('n',''))
writer.writerow(row.split(','))
Um... you just concatenated two strings together with a comma, only to call split and break them apart at that same comma on the very next line. I'm sure that now it's been pointed out, you can see this is pointless, but I want to pull you up on one other thing in these two lines of code.
You are essentially trying to parse data by hand with that row.split, which is always dangerous. This is an important and general lesson about programming. What if the name of a chip had a comma in it? Then row would contain more commas than just the one you put in there, and your call to writerow would end up inserting more than two columns!
Never parse data by hand unless you absolutely have to, and never write formats like CSV or JSON by hand unless you absolutely have to. Always use a library, because there are always pathological edge cases, like a comma in the chip name, that you won't think of and that will break your code. Libraries that have been around for a while have had those bugs ironed out. With these two lines:
row = "{}, {}".format(tds[0].text.replace('\n', ''), tds[2].text.replace('\n', ''))
writer.writerow(row.split(','))
you are attempting to split a table row into its two columns yourself, by hand, which is why you made a mistake (just like anyone would). In the first loop, by contrast, the code that does this splitting is these two lines:
title = tds[0].text.replace('\n', '')
year = tds[2].text.replace('\n', '')
Here you are relying on BeautifulSoup to have split the columns cleanly into tds[0] and tds[2], which is much safer and is why this code is much better.
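To make that concrete, here's a small standalone snippet (not part of the scraper; the chip name is just an invented example) showing that csv.writer quotes a field containing a comma, so it stays a single column:

import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Intel 4004 (4-bit, 16-pin)", "1971"])

# The comma inside the title is kept inside quotes instead of
# creating a third column:
print(buffer.getvalue())  # "Intel 4004 (4-bit, 16-pin)",1971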
Mixing of input parsing and output generating
The code that parses the HTML is mixed together with the code that generates the CSV. This is a poor decomposition of the problem into sub-problems. The code that writes the CSV should think only in terms of titles and years; it shouldn't have to know that they come from HTML. Likewise, the code that parses the HTML should only solve the problem of extracting the titles and years; it should have no idea that the data is going to be written to a CSV. In other words, I want the for loop that writes the CSV to look like this:
for (title, year) in rows:
    writer.writerow([title, year])
We can do this by rewriting the with block like this:
with open('data.csv', 'w', encoding='UTF-8', newline='') as f:
    fields = ['Title', 'Year']
    writer = csv.writer(f, delimiter=',')
    writer.writerow(fields)

    table_rows = My_table.find_all('tr')[2:] + My_second_table.find_all('tr')[2:]
    parsed_rows = []
    for tr in table_rows:
        tds = tr.find_all('td')
        try:
            title = tds[0].text.replace('\n', '')
        except:
            title = ""
        try:
            year = tds[2].text.replace('\n', '')
        except:
            year = ""
        parsed_rows.append((title, year))

    for (title, year) in parsed_rows:
        writer.writerow([title, year])
Factoring into functions
To make the code more readable and really separate the HTML stuff from the CSV stuff, we can break the script into functions. Here's my complete script.
import requests
from bs4 import BeautifulSoup
import csv

urls = ['https://en.wikipedia.org/wiki/Transistor_count']
data = []

def get_rows(html):
    soup = BeautifulSoup(html, 'html.parser')
    My_table = soup.find('table', {'class': 'wikitable sortable'})
    My_second_table = My_table.find_next_sibling('table')
    table_rows = My_table.find_all('tr')[2:] + My_second_table.find_all('tr')[2:]
    parsed_rows = []
    for tr in table_rows:
        tds = tr.find_all('td')
        try:
            title = tds[0].text.replace('\n', '')
        except:
            title = ""
        try:
            year = tds[2].text.replace('\n', '')
        except:
            year = ""
        parsed_rows.append((title, year))
    return parsed_rows

for url in urls:
    html = requests.get(url).text
    parsed_rows = get_rows(html)
    with open('data.csv', 'w', encoding='UTF-8', newline='') as f:
        fields = ['Title', 'Year']
        writer = csv.writer(f, delimiter=',')
        writer.writerow(fields)
        for (title, year) in parsed_rows:
            writer.writerow([title, year])
What would be even better would be for get_rows to be a generator rather than a regular function (see the sketch below), but that's more advanced Python programming. This is fine for now.
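For reference, a generator version might look roughly like this sketch (iter_rows is just a name picked for illustration); the writing loop would then iterate over iter_rows(html) directly instead of a pre-built list:

def iter_rows(html):
    # Generator variant of get_rows: yields (title, year) pairs one at a time
    # instead of building the whole list in memory first.
    soup = BeautifulSoup(html, 'html.parser')
    my_table = soup.find('table', {'class': 'wikitable sortable'})
    my_second_table = my_table.find_next_sibling('table')
    for tr in my_table.find_all('tr')[2:] + my_second_table.find_all('tr')[2:]:
        tds = tr.find_all('td')
        try:
            title = tds[0].text.replace('\n', '')
        except IndexError:
            title = ""
        try:
            year = tds[2].text.replace('\n', '')
        except IndexError:
            year = ""
        yield title, year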
$endgroup$
edited Sep 15 at 19:14
answered Sep 15 at 12:02
Jack M
223 · 2 silver badges · 7 bronze badges
$begingroup$
"In Python 3 the requests library comes built-in…" It does not, unfortunately. But you can easily install it using pip install requests.
$endgroup$
– grooveplex
Sep 15 at 17:46
$begingroup$
@grooveplex Oh, maybe it just comes with Ubuntu? I don't think I've had to install requests in recent memory.
$endgroup$
– Jack M
Sep 15 at 19:10
$begingroup$
When you asked for an easier way, do you mean an easier way that uses “real code” as in the cartoon?
That looks pretty easy, but I think slightly easier is possible.
In general, when I want a table from a web page, I select it with my mouse and paste special (unformatted) into a spreadsheet. A bonus is that I don’t have to deal with NSA’s biggest competitor (Google).
$endgroup$
answered Sep 15 at 22:51
WGroleau
137 · 2 bronze badges
$begingroup$
@AlexV A user on Stack Overflow suggested I post here; I only posted here because they told me to after posting there. My code is working, but not the way I want. However, the issue has been solved, so if you want you can remove this post. Thank you anyway.
$endgroup$
– Sofelia
Sep 13 at 13:22
$begingroup$
You can also fix the issue in the post, reword the question and leave your code for an actual review.
$endgroup$
– AlexV
Sep 13 at 13:24
$begingroup$
@AlexV Thank you, I reworded it. I don't know if I still need it, but I guess it's always nice to get your code reviewed.
$endgroup$
– Sofelia
Sep 13 at 13:27
$begingroup$
In general, when I want a table from a web page, I select it with my mouse, paste
$endgroup$
– WGroleau
Sep 15 at 22:45