Python web-scraper to download table of transistor counts from Wikipedia
I have been looking for a way to easily scrape data from Wikipedia into a CSV file with Beautiful Soup. This is my code so far. Is there an easier way to do it?
import urllib.request
from bs4 import BeautifulSoup
import csv

# the websites
urls = ['https://en.wikipedia.org/wiki/Transistor_count']
data = []

# getting the websites and the data
for url in urls:
    ## my_url = requests.get(url)
    my_url = urllib.request.urlopen(url)
    html = my_url.read()

    soup = BeautifulSoup(html, 'html.parser')
    My_table = soup.find('table', {'class': 'wikitable sortable'})
    My_second_table = My_table.find_next_sibling('table')

    with open('data.csv', 'w', encoding='UTF-8', newline='') as f:
        fields = ['Title', 'Year']
        writer = csv.writer(f, delimiter=',')
        writer.writerow(fields)

    with open('data.csv', "a", encoding='UTF-8') as csv_file:
        writer = csv.writer(csv_file, delimiter=',')
        for tr in My_table.find_all('tr')[2:]:  # [2:] is to skip empty and header
            tds = tr.find_all('td')
            try:
                title = tds[0].text.replace('\n', '')
            except:
                title = ""
            try:
                year = tds[2].text.replace('\n', '')
            except:
                year = ""
            writer.writerow([title, year])

    with open('data.csv', "a", encoding='UTF-8') as csv_file:
        writer = csv.writer(csv_file, delimiter=',')
        for tr in My_second_table.find_all('tr')[2:]:  # [2:] is to skip empty and header
            tds = tr.find_all('td')
            row = "{}, {}".format(tds[0].text.replace('\n', ''), tds[2].text.replace('\n', ''))
            writer.writerow(row.split(','))
Tags: python, csv, beautifulsoup, wikipedia
@AlexV A user on Stack Overflow suggested I post here; I only posted here because they told me to after posting on Stack Overflow. My code is working, but not the way I want. However, the issue has been solved, so if you want you can remove this post. Thank you anyway.
– Sofelia, Sep 13 at 13:22

You can also fix the issue in the post, reword the question and leave your code for an actual review.
– AlexV, Sep 13 at 13:24

@AlexV Thank you, I reworded it. I do not know if I need it anymore, but I guess it is always nice to get your code reviewed.
– Sofelia, Sep 13 at 13:27

In general, when I want a table from a web page, I select it with my mouse, paste
– WGroleau, Sep 15 at 22:45
5 Answers
Is there a way to simplify this code?
Yes. Don't scrape Wikipedia. Your first thought before "should I need to scrape this thing?" should be "Is there an API that can give me the data I want?" In this case, there super is.
There are many informative links such as this StackOverflow question, but in the end reading the API documentation really is the right thing to do. This should get you started:
from pprint import pprint
import requests, wikitextparser

r = requests.get(
    'https://en.wikipedia.org/w/api.php',
    params={
        'action': 'query',
        'titles': 'Transistor_count',
        'prop': 'revisions',
        'rvprop': 'content',
        'format': 'json',
    },
)
r.raise_for_status()

pages = r.json()['query']['pages']
body = next(iter(pages.values()))['revisions'][0]['*']

doc = wikitextparser.parse(body)
print(f'{len(doc.tables)} tables retrieved')
pprint(doc.tables[0].data())
This may seem more roundabout than scraping the page, but API access gets you structured data, which bypasses an HTML rendering step that you shouldn't have to deal with. This structured data is the actual source of the article and is more reliable.
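A minimal follow-on sketch of writing that structured data out to CSV, assuming the first table keeps the processor name in column 0 and the date of introduction in column 2 (and noting that the cells still contain raw wiki markup, so some cleanup is left over):

import csv

rows = doc.tables[0].data()  # list of rows; the first row is the header
with open('data.csv', 'w', encoding='UTF-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Year'])
    for row in rows[1:]:
        writer.writerow([row[0], row[2]])  # assumed columns: 0 = processor, 2 = date of introduction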
At the end of the day, you still have to parse the wikitext. It's not clear to me why that's really better/easier than parsing HTML - they're just two different markup formats. If Wikipedia had an actual semantic API just for extracting data from tables in articles, that would be different.
– Jack M, Sep 15 at 11:04

@JackM - given the readily-available choices: let MediaWiki render to a format intended for browser representation to a human, or let MediaWiki return structured data intended for consumption by either humans or computers, I would choose the latter every time for this kind of application. Not all markup formats are made equal, and here intent is important. It's entirely possible for MediaWiki to change its rendering engine, still returning visually valid web content but breaking all scrapers that make assumptions about HTML.
– Reinderien, Sep 15 at 13:53

You make some good points. For this application however (a one-off script), parsing the HTML is as good, and it avoids having to spend time learning a new technology and a new tool (wikitext and the wikitext parsing library). Basically, even being presumably more experienced than OP, in their shoes I probably would have chosen to do the same thing in this case, although your general point that you should always look for an API before you try scraping is definitely a lesson they should take on board.
– Jack M, Sep 15 at 13:58

@JackM the scraper would break if the page's UI was updated, so it's a less useful approach if the script is going to be re-run at any point in the future. APIs, on the other hand, are often versioned, and tend to change less often than HTML.
– touch my body, Sep 16 at 18:22
Let me tell you about IMPORTHTML()
...
So here's all the code you need in Google Sheets:
=IMPORTHTML("https://en.wikipedia.org/wiki/Transistor_count", "table", 2)
The import seems to work fine:
And it's possible to download the table as CSV:
Processor,Transistor count,Date of introduction,Designer,MOS process,Area
"MP944 (20-bit, *6-chip*)",,1970[14] (declassified 1998),Garrett AiResearch,,
"Intel 4004 (4-bit, 16-pin)","2,250",1971,Intel,"10,000 nm",12 mm²
"Intel 8008 (8-bit, 18-pin)","3,500",1972,Intel,"10,000 nm",14 mm²
"NEC μCOM-4 (4-bit, 42-pin)","2,500[17][18]",1973,NEC,"7,500 nm[19]",*?*
Toshiba TLCS-12 (12-bit),"over 11,000[20]",1973,Toshiba,"6,000 nm",32 mm²
"Intel 4040 (4-bit, 16-pin)","3,000",1974,Intel,"10,000 nm",12 mm²
"Motorola 6800 (8-bit, 40-pin)","4,100",1974,Motorola,"6,000 nm",16 mm²
...
...
You'll just need to clean the numbers up, which you'd have to do anyway with the API or by scraping with BeautifulSoup.
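A minimal sketch of that cleanup in Python, assuming the only noise is footnote markers like [20], stray words such as "over", and thousands separators:

import re

def clean_count(raw):
    """Strip footnote markers like '[20]', words, and thousands separators; keep the digits."""
    no_refs = re.sub(r'\[\d+\]', '', raw)   # drop footnote references
    digits = re.sub(r'\D', '', no_refs)     # keep only digits
    return int(digits) if digits else None

print(clean_count('2,250'))            # 2250
print(clean_count('over 11,000[20]'))  # 11000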
Copy/pasting the raw text into a spreadsheet (in my case, into Numbers.app) automatically picked up the tabular formatting perfectly. Done.
– Alexander, Sep 14 at 0:09

That's seriously spooky.
– Volker Siegel, Sep 15 at 5:27

@VolkerSiegel: It looks like Google knows a thing or two about scraping data from websites. :D
– Eric Duminil, Sep 15 at 8:49
Make sure to follow naming conventions. You name two variables inappropriately:
My_table = soup.find('table', {'class': 'wikitable sortable'})
My_second_table = My_table.find_next_sibling('table')
Those are just normal variables, not class names, so they should be lower-case:
my_table = soup.find('table', {'class': 'wikitable sortable'})
my_second_table = my_table.find_next_sibling('table')
Twice you do
try:
    title = tds[0].text.replace('\n', '')
except:
    title = ""
I'd specify what exact exception you want to catch so you don't accidentally hide a "real" error if you start making changes in the future. I'm assuming here you're intending to catch an AttributeError. Because you have essentially the same code twice, and because the code is bulky, I'd factor that out into its own function.
Something like:
import bs4

def eliminate_newlines(tag: bs4.element.Tag) -> str:  # Maybe pick a better name
    try:
        return tag.text.replace('\n', '')
    except AttributeError:  # I'm assuming this is what you intend to catch
        return ""
Now the with open block is much neater:
with open('data.csv', "a", encoding='UTF-8') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    for tr in My_table.find_all('tr')[2:]:  # [2:] is to skip empty and header
        tds = tr.find_all('td')
        title = eliminate_newlines(tds[0])
        year = eliminate_newlines(tds[2])
        writer.writerow([title, year])
Edit: I was in the shower, and realized that you're actually probably intending to catch an IndexError in case the page is malformed or something. Same idea though, move that code out into a function to reduce duplication. Something like:
from typing import List

def eliminate_newlines(tags: List[bs4.element.Tag], i: int) -> str:
    return tags[i].text.replace('\n', '') if i < len(tags) else ""
This could also be done using a condition statement instead of expression. I figured that it's pretty simple though, so a one-liner should be fine.
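For comparison, a sketch of that same helper written with an if statement instead of a conditional expression:

def eliminate_newlines(tags: List[bs4.element.Tag], i: int) -> str:
    # Same behaviour as the one-liner, written out as a statement
    if i < len(tags):
        return tags[i].text.replace('\n', '')
    return ""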
If you're using a newer version of Python, lines like:

"{}, {}".format(tds[0].text.replace('\n', ''), tds[2].text.replace('\n', ''))

can make use of f-strings to do in-place string interpolation:

f"{tds[0].text.replace('\n', '')}, {tds[2].text.replace('\n', '')}"
In this particular case, the gain isn't much. They're very helpful for more complicated formatting though.
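For example, format specifications inside the braces handle alignment and digit grouping in one place, which is where f-strings start to pay off (illustrative values only):

count = 2_250
year = 1971
# ':>12,' right-aligns in 12 characters and adds thousands separators
print(f"{year}: {count:>12,} transistors")  # e.g. '1971:        2,250 transistors'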
Using the Wikipedia API
As mentioned in another answer, Wikipedia provides an HTTP API for fetching article content, and you can use it to get the content in much cleaner formats than HTML. Often this is much better (for example, it's a much better choice for this little project I wrote that extracts the first sentence of a Wikipedia article).
However, in your case, you have to parse tables anyway. Whether you're parsing them from HTML or from the Wikipedia API's "wikitext" format, I don't think there's much of a difference. So I consider this subjective.
Using requests, not urllib
Never use urllib, unless you're in an environment where for some reason you can't install external libraries. The requests library is preferred for fetching HTML.
In your case, to get the HTML, the code would be:
html = requests.get(url).text
with an import requests up top. For your simple example, this isn't actually any easier, but it's just a general best practice in Python programming to always use requests, not urllib. It's the preferred library.
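A minimal sketch of that fetch with basic error handling added (the timeout value here is an arbitrary choice):

import requests

response = requests.get(url, timeout=30)  # avoid hanging forever on a dead connection
response.raise_for_status()               # turn 4xx/5xx responses into an exception
html = response.text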
What's with all those with blocks?
You don't need three with blocks. When I glance at this code, three with blocks make it seem like the code is doing something much more complicated than it actually is - it makes it seem like maybe you're writing multiple CSVs, which you aren't. Just use one, it works the same:
with open('data.csv', 'w', encoding='UTF-8', newline='') as f:
    fields = ['Title', 'Year']
    writer = csv.writer(f, delimiter=',')
    writer.writerow(fields)

    for tr in My_table.find_all('tr')[2:]:  # [2:] is to skip empty and header
        tds = tr.find_all('td')
        try:
            title = tds[0].text.replace('\n', '')
        except:
            title = ""
        try:
            year = tds[2].text.replace('\n', '')
        except:
            year = ""
        writer.writerow([title, year])

    for tr in My_second_table.find_all('tr')[2:]:  # [2:] is to skip empty and header
        tds = tr.find_all('td')
        row = "{}, {}".format(tds[0].text.replace('\n', ''), tds[2].text.replace('\n', ''))
        writer.writerow(row.split(','))
Are those two tables really that different?
You have two for loops, each one processing a table and writing it to the CSV. The body of the two for loops is different, so at a glance it looks like maybe the two tables have a different format or something... but they don't. You can copy paste the body of the first for loop into the second and it works the same:
for tr in My_table.find_all('tr')[2:]:  # [2:] is to skip empty and header
    tds = tr.find_all('td')
    try:
        title = tds[0].text.replace('\n', '')
    except:
        title = ""
    try:
        year = tds[2].text.replace('\n', '')
    except:
        year = ""
    writer.writerow([title, year])

for tr in My_second_table.find_all('tr')[2:]:  # [2:] is to skip empty and header
    tds = tr.find_all('td')
    try:
        title = tds[0].text.replace('\n', '')
    except:
        title = ""
    try:
        year = tds[2].text.replace('\n', '')
    except:
        year = ""
    writer.writerow([title, year])
With the above code, the only difference to the resulting CSV is that there aren't spaces after the commas, which I assume is not really important to you. Now that we've established the code doesn't need to be different in the two loops, we can just do this:
table_rows = My_table.find_all('tr')[2:] + My_second_table.find_all('tr')[2:]

for tr in table_rows:
    tds = tr.find_all('td')
    try:
        title = tds[0].text.replace('\n', '')
    except:
        title = ""
    try:
        year = tds[2].text.replace('\n', '')
    except:
        year = ""
    writer.writerow([title, year])
Only one for loop. Much easier to understand!
Parsing strings by hand
So that second loop doesn't need to be there at all, but let's look at the code inside of it anyway:
row = "{}, {}".format(tds[0].text.replace('\n', ''), tds[2].text.replace('\n', ''))
writer.writerow(row.split(','))
Um... you just concatenated two strings together with a comma, just to call split and split them apart at the comma on the very next line. I'm sure now that it's been pointed out you can see this is pointless, but I want to pull you up on one other thing in these two lines of code.
You are essentially trying to parse data by hand with that row.split, which is always dangerous. This is an important and general lesson about programming. What if the name of the chip had a comma in it? Then row would contain more commas than just the one that you put in there, and your call to writerow would end up inserting more than two columns!
Never parse data by hand unless you absolutely have to, and never write data in formats like CSV or JSON by hand unless you absolutely have to. Always use a library, because there are always pathological edge cases like a comma in the chip name that you won't think of and which will break your code. The libraries, if they've been around for a while, have had those bugs ironed out. With these two lines:
row = "{}, {}".format(tds[0].text.replace('\n', ''), tds[2].text.replace('\n', ''))
writer.writerow(row.split(','))
you are attempting to split a table row into its two columns yourself, by hand, which is why you made a mistake (just like anyone would). Whereas in the first loop, the code which does this splitting is these two lines:
title = tds[0].text.replace('\n', '')
year = tds[2].text.replace('\n', '')
Here you are relying on BeautifulSoup to have split the columns cleanly into tds[0] and tds[2], which is much safer and is why this code is much better.
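A quick demonstration of why letting the library do the work is safer here: csv.writer quotes any field that contains the delimiter, so a comma inside a processor name stays in one column (illustrative values only):

import csv, io

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['MP944 (20-bit, 6-chip)', '1970'])  # the title itself contains a comma
print(buffer.getvalue())  # "MP944 (20-bit, 6-chip)",1970  -- still exactly two columns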
Mixing of input parsing and output generating
The code that parses the HTML is mixed together with the code that generates the CSV. This is poor breaking of a problem down into sub-problems. The code that writes the CSV should just be thinking in terms of titles and years, it shouldn't have to know that they come from HTML, and the code that parses the HTML should just be solving the problem of extracting the titles and years, it should have no idea that that data is going to be written to a CSV. In other words, I want that for loop that writes the CSV to look like this:
for (title, year) in rows:
    writer.writerow([title, year])
We can do this by rewriting the with block like this:
with open('data.csv', 'w', encoding='UTF-8', newline='') as f:
    fields = ['Title', 'Year']
    writer = csv.writer(f, delimiter=',')
    writer.writerow(fields)

    table_rows = My_table.find_all('tr')[2:] + My_second_table.find_all('tr')[2:]

    parsed_rows = []
    for tr in table_rows:
        tds = tr.find_all('td')
        try:
            title = tds[0].text.replace('\n', '')
        except:
            title = ""
        try:
            year = tds[2].text.replace('\n', '')
        except:
            year = ""
        parsed_rows.append((title, year))

    for (title, year) in parsed_rows:
        writer.writerow([title, year])
Factoring into functions
To make the code more readable and really separate the HTML stuff from the CSV stuff, we can break the script into functions. Here's my complete script.
import requests
from bs4 import BeautifulSoup
import csv

urls = ['https://en.wikipedia.org/wiki/Transistor_count']
data = []

def get_rows(html):
    soup = BeautifulSoup(html, 'html.parser')
    My_table = soup.find('table', {'class': 'wikitable sortable'})
    My_second_table = My_table.find_next_sibling('table')
    table_rows = My_table.find_all('tr')[2:] + My_second_table.find_all('tr')[2:]

    parsed_rows = []
    for tr in table_rows:
        tds = tr.find_all('td')
        try:
            title = tds[0].text.replace('\n', '')
        except:
            title = ""
        try:
            year = tds[2].text.replace('\n', '')
        except:
            year = ""
        parsed_rows.append((title, year))
    return parsed_rows

for url in urls:
    html = requests.get(url).text
    parsed_rows = get_rows(html)

    with open('data.csv', 'w', encoding='UTF-8', newline='') as f:
        fields = ['Title', 'Year']
        writer = csv.writer(f, delimiter=',')
        writer.writerow(fields)
        for (title, year) in parsed_rows:
            writer.writerow([title, year])
What would be even better would be for get_rows to be a generator rather than a regular function, but that's advanced Python programming. This is fine for now.
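A sketch of what that generator version could look like, assuming the same table structure as above (the bare except is also narrowed to IndexError, as another answer suggests):

def get_rows(html):
    soup = BeautifulSoup(html, 'html.parser')
    My_table = soup.find('table', {'class': 'wikitable sortable'})
    My_second_table = My_table.find_next_sibling('table')
    table_rows = My_table.find_all('tr')[2:] + My_second_table.find_all('tr')[2:]

    for tr in table_rows:
        tds = tr.find_all('td')
        try:
            title = tds[0].text.replace('\n', '')
        except IndexError:
            title = ""
        try:
            year = tds[2].text.replace('\n', '')
        except IndexError:
            year = ""
        yield (title, year)  # hand back one row at a time instead of building a list

The loop that writes the CSV stays exactly the same, because iterating over a generator looks just like iterating over a list.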
"In Python 3 the requests library comes built-in…" It does not, unfortunately. But you can easily install it using pip install requests.
– grooveplex, Sep 15 at 17:46

@grooveplex Oh, maybe it just comes with Ubuntu? I don't think I've had to install requests in recent memory.
– Jack M, Sep 15 at 19:10
When you asked for an easier way, do you mean an easier way that uses “real code” as in the cartoon?
That looks pretty easy, but I think slightly easier is possible.
In general, when I want a table from a web page, I select it with my mouse and paste special (unformatted) into a spreadsheet. Bonus is that I don’t have to deal with NSA’s biggest competitor (Google).
$endgroup$
add a comment
|
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/4.0/"u003ecc by-sa 4.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f228969%2fpython-web-scraper-to-download-table-of-transistor-counts-from-wikipedia%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
5 Answers
5
active
oldest
votes
5 Answers
5
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
Is there a way to simplify this code?
Yes. Don't scrape Wikipedia. Your first thought before "should I need to scrape this thing?" should be "Is there an API that can give me the data I want?" In this case, there super is.
There are many informative links such as this StackOverflow question, but in the end reading the API documentation really is the right thing to do. This should get you started:
from pprint import pprint
import requests, wikitextparser
r = requests.get(
'https://en.wikipedia.org/w/api.php',
params=
'action': 'query',
'titles': 'Transistor_count',
'prop': 'revisions',
'rvprop': 'content',
'format': 'json',
)
r.raise_for_status()
pages = r.json()['query']['pages']
body = next(iter(pages.values()))['revisions'][0]['*']
doc = wikitextparser.parse(body)
print(f'len(doc.tables) tables retrieved')
pprint(doc.tables[0].data())
This may seem more roundabout than scraping the page, but API access gets you structured data, which bypasses an HTML rendering step that you shouldn't have to deal with. This structured data is the actual source of the article and is more reliable.
$endgroup$
4
$begingroup$
At the end of the day, you still have to parse the wikitext. It's not clear to me why that's really better/easier than parsing HTML - they're just two different markup formats. If Wikipedia had an actual semantic API just for extracting data from tables in articles, that would be different.
$endgroup$
– Jack M
Sep 15 at 11:04
3
$begingroup$
@JackM - given the readily-available choices: let MediaWiki render to a format intended for browser representation to a human, or let MediaWiki return structured data intended for consumption by either humans or computers, I would choose the latter every time for this kind of application. Not all markup formats are made equal, and here intent is important. It's entirely possible for MediaWiki to change its rendering engine, still returning visually valid web content but breaking all scrapers that make assumptions about HTML.
$endgroup$
– Reinderien
Sep 15 at 13:53
5
$begingroup$
You make some good points. For this application however (a one-off script), parsing the HTML is as good, and it avoids having to spend time learning a new technology and a new tool (wikitext and the wikitext parsing library). Basically, even being presumably more experienced than OP, in their shoes I probably would have chosen to do the same thing in this case, although your general point that you should always look for an API before you try scraping is definitely a lesson they should take on-board.
$endgroup$
– Jack M
Sep 15 at 13:58
$begingroup$
@JackM the scraper would break if the page's UI was updated, so it's a less useful approach if the script is going to be re-run at any point in the future. APIs on the other hand, are often versioned, and tend to change less often than HTML
$endgroup$
– touch my body
Sep 16 at 18:22
add a comment
|
$begingroup$
Is there a way to simplify this code?
Yes. Don't scrape Wikipedia. Your first thought before "should I need to scrape this thing?" should be "Is there an API that can give me the data I want?" In this case, there super is.
There are many informative links such as this StackOverflow question, but in the end reading the API documentation really is the right thing to do. This should get you started:
from pprint import pprint
import requests, wikitextparser
r = requests.get(
'https://en.wikipedia.org/w/api.php',
params=
'action': 'query',
'titles': 'Transistor_count',
'prop': 'revisions',
'rvprop': 'content',
'format': 'json',
)
r.raise_for_status()
pages = r.json()['query']['pages']
body = next(iter(pages.values()))['revisions'][0]['*']
doc = wikitextparser.parse(body)
print(f'len(doc.tables) tables retrieved')
pprint(doc.tables[0].data())
This may seem more roundabout than scraping the page, but API access gets you structured data, which bypasses an HTML rendering step that you shouldn't have to deal with. This structured data is the actual source of the article and is more reliable.
$endgroup$
4
$begingroup$
At the end of the day, you still have to parse the wikitext. It's not clear to me why that's really better/easier than parsing HTML - they're just two different markup formats. If Wikipedia had an actual semantic API just for extracting data from tables in articles, that would be different.
$endgroup$
– Jack M
Sep 15 at 11:04
3
$begingroup$
@JackM - given the readily-available choices: let MediaWiki render to a format intended for browser representation to a human, or let MediaWiki return structured data intended for consumption by either humans or computers, I would choose the latter every time for this kind of application. Not all markup formats are made equal, and here intent is important. It's entirely possible for MediaWiki to change its rendering engine, still returning visually valid web content but breaking all scrapers that make assumptions about HTML.
$endgroup$
– Reinderien
Sep 15 at 13:53
5
$begingroup$
You make some good points. For this application however (a one-off script), parsing the HTML is as good, and it avoids having to spend time learning a new technology and a new tool (wikitext and the wikitext parsing library). Basically, even being presumably more experienced than OP, in their shoes I probably would have chosen to do the same thing in this case, although your general point that you should always look for an API before you try scraping is definitely a lesson they should take on-board.
$endgroup$
– Jack M
Sep 15 at 13:58
$begingroup$
@JackM the scraper would break if the page's UI was updated, so it's a less useful approach if the script is going to be re-run at any point in the future. APIs on the other hand, are often versioned, and tend to change less often than HTML
$endgroup$
– touch my body
Sep 16 at 18:22
add a comment
|
$begingroup$
Is there a way to simplify this code?
Yes. Don't scrape Wikipedia. Your first thought before "should I need to scrape this thing?" should be "Is there an API that can give me the data I want?" In this case, there super is.
There are many informative links such as this StackOverflow question, but in the end reading the API documentation really is the right thing to do. This should get you started:
from pprint import pprint
import requests, wikitextparser
r = requests.get(
'https://en.wikipedia.org/w/api.php',
params=
'action': 'query',
'titles': 'Transistor_count',
'prop': 'revisions',
'rvprop': 'content',
'format': 'json',
)
r.raise_for_status()
pages = r.json()['query']['pages']
body = next(iter(pages.values()))['revisions'][0]['*']
doc = wikitextparser.parse(body)
print(f'len(doc.tables) tables retrieved')
pprint(doc.tables[0].data())
This may seem more roundabout than scraping the page, but API access gets you structured data, which bypasses an HTML rendering step that you shouldn't have to deal with. This structured data is the actual source of the article and is more reliable.
$endgroup$
Is there a way to simplify this code?
Yes. Don't scrape Wikipedia. Your first thought before "should I need to scrape this thing?" should be "Is there an API that can give me the data I want?" In this case, there super is.
There are many informative links such as this StackOverflow question, but in the end reading the API documentation really is the right thing to do. This should get you started:
from pprint import pprint
import requests, wikitextparser
r = requests.get(
'https://en.wikipedia.org/w/api.php',
params=
'action': 'query',
'titles': 'Transistor_count',
'prop': 'revisions',
'rvprop': 'content',
'format': 'json',
)
r.raise_for_status()
pages = r.json()['query']['pages']
body = next(iter(pages.values()))['revisions'][0]['*']
doc = wikitextparser.parse(body)
print(f'len(doc.tables) tables retrieved')
pprint(doc.tables[0].data())
This may seem more roundabout than scraping the page, but API access gets you structured data, which bypasses an HTML rendering step that you shouldn't have to deal with. This structured data is the actual source of the article and is more reliable.
edited Sep 14 at 15:26
answered Sep 13 at 13:59
ReinderienReinderien
12.7k22 silver badges50 bronze badges
12.7k22 silver badges50 bronze badges
4
$begingroup$
At the end of the day, you still have to parse the wikitext. It's not clear to me why that's really better/easier than parsing HTML - they're just two different markup formats. If Wikipedia had an actual semantic API just for extracting data from tables in articles, that would be different.
$endgroup$
– Jack M
Sep 15 at 11:04
3
$begingroup$
@JackM - given the readily-available choices: let MediaWiki render to a format intended for browser representation to a human, or let MediaWiki return structured data intended for consumption by either humans or computers, I would choose the latter every time for this kind of application. Not all markup formats are made equal, and here intent is important. It's entirely possible for MediaWiki to change its rendering engine, still returning visually valid web content but breaking all scrapers that make assumptions about HTML.
$endgroup$
– Reinderien
Sep 15 at 13:53
5
$begingroup$
You make some good points. For this application however (a one-off script), parsing the HTML is as good, and it avoids having to spend time learning a new technology and a new tool (wikitext and the wikitext parsing library). Basically, even being presumably more experienced than OP, in their shoes I probably would have chosen to do the same thing in this case, although your general point that you should always look for an API before you try scraping is definitely a lesson they should take on-board.
$endgroup$
– Jack M
Sep 15 at 13:58
$begingroup$
@JackM the scraper would break if the page's UI was updated, so it's a less useful approach if the script is going to be re-run at any point in the future. APIs on the other hand, are often versioned, and tend to change less often than HTML
$endgroup$
– touch my body
Sep 16 at 18:22
add a comment
|
4
$begingroup$
At the end of the day, you still have to parse the wikitext. It's not clear to me why that's really better/easier than parsing HTML - they're just two different markup formats. If Wikipedia had an actual semantic API just for extracting data from tables in articles, that would be different.
$endgroup$
– Jack M
Sep 15 at 11:04
3
$begingroup$
@JackM - given the readily-available choices: let MediaWiki render to a format intended for browser representation to a human, or let MediaWiki return structured data intended for consumption by either humans or computers, I would choose the latter every time for this kind of application. Not all markup formats are made equal, and here intent is important. It's entirely possible for MediaWiki to change its rendering engine, still returning visually valid web content but breaking all scrapers that make assumptions about HTML.
$endgroup$
– Reinderien
Sep 15 at 13:53
5
$begingroup$
You make some good points. For this application however (a one-off script), parsing the HTML is as good, and it avoids having to spend time learning a new technology and a new tool (wikitext and the wikitext parsing library). Basically, even being presumably more experienced than OP, in their shoes I probably would have chosen to do the same thing in this case, although your general point that you should always look for an API before you try scraping is definitely a lesson they should take on-board.
$endgroup$
– Jack M
Sep 15 at 13:58
$begingroup$
@JackM the scraper would break if the page's UI was updated, so it's a less useful approach if the script is going to be re-run at any point in the future. APIs on the other hand, are often versioned, and tend to change less often than HTML
$endgroup$
– touch my body
Sep 16 at 18:22
4
4
$begingroup$
At the end of the day, you still have to parse the wikitext. It's not clear to me why that's really better/easier than parsing HTML - they're just two different markup formats. If Wikipedia had an actual semantic API just for extracting data from tables in articles, that would be different.
$endgroup$
– Jack M
Sep 15 at 11:04
$begingroup$
At the end of the day, you still have to parse the wikitext. It's not clear to me why that's really better/easier than parsing HTML - they're just two different markup formats. If Wikipedia had an actual semantic API just for extracting data from tables in articles, that would be different.
$endgroup$
– Jack M
Sep 15 at 11:04
3
3
$begingroup$
@JackM - given the readily-available choices: let MediaWiki render to a format intended for browser representation to a human, or let MediaWiki return structured data intended for consumption by either humans or computers, I would choose the latter every time for this kind of application. Not all markup formats are made equal, and here intent is important. It's entirely possible for MediaWiki to change its rendering engine, still returning visually valid web content but breaking all scrapers that make assumptions about HTML.
$endgroup$
– Reinderien
Sep 15 at 13:53
$begingroup$
@JackM - given the readily-available choices: let MediaWiki render to a format intended for browser representation to a human, or let MediaWiki return structured data intended for consumption by either humans or computers, I would choose the latter every time for this kind of application. Not all markup formats are made equal, and here intent is important. It's entirely possible for MediaWiki to change its rendering engine, still returning visually valid web content but breaking all scrapers that make assumptions about HTML.
$endgroup$
– Reinderien
Sep 15 at 13:53
5
5
$begingroup$
You make some good points. For this application however (a one-off script), parsing the HTML is as good, and it avoids having to spend time learning a new technology and a new tool (wikitext and the wikitext parsing library). Basically, even being presumably more experienced than OP, in their shoes I probably would have chosen to do the same thing in this case, although your general point that you should always look for an API before you try scraping is definitely a lesson they should take on-board.
$endgroup$
– Jack M
Sep 15 at 13:58
$begingroup$
You make some good points. For this application however (a one-off script), parsing the HTML is as good, and it avoids having to spend time learning a new technology and a new tool (wikitext and the wikitext parsing library). Basically, even being presumably more experienced than OP, in their shoes I probably would have chosen to do the same thing in this case, although your general point that you should always look for an API before you try scraping is definitely a lesson they should take on-board.
$endgroup$
– Jack M
Sep 15 at 13:58
$begingroup$
@JackM the scraper would break if the page's UI was updated, so it's a less useful approach if the script is going to be re-run at any point in the future. APIs on the other hand, are often versioned, and tend to change less often than HTML
$endgroup$
– touch my body
Sep 16 at 18:22
$begingroup$
@JackM the scraper would break if the page's UI was updated, so it's a less useful approach if the script is going to be re-run at any point in the future. APIs on the other hand, are often versioned, and tend to change less often than HTML
$endgroup$
– touch my body
Sep 16 at 18:22
add a comment
|
$begingroup$
Let me tell you about IMPORTHTML()
...
So here's all the code you need in Google Sheets:
=IMPORTHTML("https://en.wikipedia.org/wiki/Transistor_count", "table", 2)
The import seems to work fine:
And it's possible to download the table as CSV:
Processor,Transistor count,Date of introduction,Designer,MOS process,Area
"MP944 (20-bit, *6-chip*)",,1970[14] (declassified 1998),Garrett AiResearch,,
"Intel 4004 (4-bit, 16-pin)","2,250",1971,Intel,"10,000 nm",12 mm²
"Intel 8008 (8-bit, 18-pin)","3,500",1972,Intel,"10,000 nm",14 mm²
"NEC μCOM-4 (4-bit, 42-pin)","2,500[17][18]",1973,NEC,"7,500 nm[19]",*?*
Toshiba TLCS-12 (12-bit),"over 11,000[20]",1973,Toshiba,"6,000 nm",32 mm²
"Intel 4040 (4-bit, 16-pin)","3,000",1974,Intel,"10,000 nm",12 mm²
"Motorola 6800 (8-bit, 40-pin)","4,100",1974,Motorola,"6,000 nm",16 mm²
...
...
You'll just need to clean the numbers up, which you'd have to do anyway with the API or by scraping with BeautifulSoup.
$endgroup$
6
$begingroup$
Copy/pasting the raw text into a spreadsheet (in my case, into Numbers.app) automatically picked up the tabular formatting perfectly. Done.
$endgroup$
– Alexander
Sep 14 at 0:09
7
$begingroup$
That's seriously spooky.
$endgroup$
– Volker Siegel
Sep 15 at 5:27
17
$begingroup$
@VolkerSiegel: It looks like Google knows a thing or two about scraping data from websites. :D
$endgroup$
– Eric Duminil
Sep 15 at 8:49
add a comment
|
$begingroup$
Let me tell you about IMPORTHTML()
...
So here's all the code you need in Google Sheets:
=IMPORTHTML("https://en.wikipedia.org/wiki/Transistor_count", "table", 2)
The import seems to work fine:
And it's possible to download the table as CSV:
Processor,Transistor count,Date of introduction,Designer,MOS process,Area
"MP944 (20-bit, *6-chip*)",,1970[14] (declassified 1998),Garrett AiResearch,,
"Intel 4004 (4-bit, 16-pin)","2,250",1971,Intel,"10,000 nm",12 mm²
"Intel 8008 (8-bit, 18-pin)","3,500",1972,Intel,"10,000 nm",14 mm²
"NEC μCOM-4 (4-bit, 42-pin)","2,500[17][18]",1973,NEC,"7,500 nm[19]",*?*
Toshiba TLCS-12 (12-bit),"over 11,000[20]",1973,Toshiba,"6,000 nm",32 mm²
"Intel 4040 (4-bit, 16-pin)","3,000",1974,Intel,"10,000 nm",12 mm²
"Motorola 6800 (8-bit, 40-pin)","4,100",1974,Motorola,"6,000 nm",16 mm²
...
...
You'll just need to clean the numbers up, which you'd have to do anyway with the API or by scraping with BeautifulSoup.
$endgroup$
6
$begingroup$
Copy/pasting the raw text into a spreadsheet (in my case, into Numbers.app) automatically picked up the tabular formatting perfectly. Done.
$endgroup$
– Alexander
Sep 14 at 0:09
7
$begingroup$
That's seriously spooky.
$endgroup$
– Volker Siegel
Sep 15 at 5:27
17
$begingroup$
@VolkerSiegel: It looks like Google knows a thing or two about scraping data from websites. :D
$endgroup$
– Eric Duminil
Sep 15 at 8:49
add a comment
|
$begingroup$
Let me tell you about IMPORTHTML()
...
So here's all the code you need in Google Sheets:
=IMPORTHTML("https://en.wikipedia.org/wiki/Transistor_count", "table", 2)
The import seems to work fine:
And it's possible to download the table as CSV:
Processor,Transistor count,Date of introduction,Designer,MOS process,Area
"MP944 (20-bit, *6-chip*)",,1970[14] (declassified 1998),Garrett AiResearch,,
"Intel 4004 (4-bit, 16-pin)","2,250",1971,Intel,"10,000 nm",12 mm²
"Intel 8008 (8-bit, 18-pin)","3,500",1972,Intel,"10,000 nm",14 mm²
"NEC μCOM-4 (4-bit, 42-pin)","2,500[17][18]",1973,NEC,"7,500 nm[19]",*?*
Toshiba TLCS-12 (12-bit),"over 11,000[20]",1973,Toshiba,"6,000 nm",32 mm²
"Intel 4040 (4-bit, 16-pin)","3,000",1974,Intel,"10,000 nm",12 mm²
"Motorola 6800 (8-bit, 40-pin)","4,100",1974,Motorola,"6,000 nm",16 mm²
...
...
You'll just need to clean the numbers up, which you'd have to do anyway with the API or by scraping with BeautifulSoup.
$endgroup$
Let me tell you about IMPORTHTML()
...
So here's all the code you need in Google Sheets:
=IMPORTHTML("https://en.wikipedia.org/wiki/Transistor_count", "table", 2)
The import seems to work fine:
And it's possible to download the table as CSV:
Processor,Transistor count,Date of introduction,Designer,MOS process,Area
"MP944 (20-bit, *6-chip*)",,1970[14] (declassified 1998),Garrett AiResearch,,
"Intel 4004 (4-bit, 16-pin)","2,250",1971,Intel,"10,000 nm",12 mm²
"Intel 8008 (8-bit, 18-pin)","3,500",1972,Intel,"10,000 nm",14 mm²
"NEC μCOM-4 (4-bit, 42-pin)","2,500[17][18]",1973,NEC,"7,500 nm[19]",*?*
Toshiba TLCS-12 (12-bit),"over 11,000[20]",1973,Toshiba,"6,000 nm",32 mm²
"Intel 4040 (4-bit, 16-pin)","3,000",1974,Intel,"10,000 nm",12 mm²
"Motorola 6800 (8-bit, 40-pin)","4,100",1974,Motorola,"6,000 nm",16 mm²
...
...
You'll just need to clean the numbers up, which you'd have to do anyway with the API or by scraping with BeautifulSoup.
edited Sep 13 at 21:39
answered Sep 13 at 21:32
Eric DuminilEric Duminil
3,4711 gold badge10 silver badges18 bronze badges
3,4711 gold badge10 silver badges18 bronze badges
6
$begingroup$
Copy/pasting the raw text into a spreadsheet (in my case, into Numbers.app) automatically picked up the tabular formatting perfectly. Done.
$endgroup$
– Alexander
Sep 14 at 0:09
7
$begingroup$
That's seriously spooky.
$endgroup$
– Volker Siegel
Sep 15 at 5:27
17
$begingroup$
@VolkerSiegel: It looks like Google knows a thing or two about scraping data from websites. :D
$endgroup$
– Eric Duminil
Sep 15 at 8:49
add a comment
|
6
$begingroup$
Copy/pasting the raw text into a spreadsheet (in my case, into Numbers.app) automatically picked up the tabular formatting perfectly. Done.
$endgroup$
– Alexander
Sep 14 at 0:09
7
$begingroup$
That's seriously spooky.
$endgroup$
– Volker Siegel
Sep 15 at 5:27
17
$begingroup$
@VolkerSiegel: It looks like Google knows a thing or two about scraping data from websites. :D
$endgroup$
– Eric Duminil
Sep 15 at 8:49
6
6
$begingroup$
Copy/pasting the raw text into a spreadsheet (in my case, into Numbers.app) automatically picked up the tabular formatting perfectly. Done.
$endgroup$
– Alexander
Sep 14 at 0:09
$begingroup$
Copy/pasting the raw text into a spreadsheet (in my case, into Numbers.app) automatically picked up the tabular formatting perfectly. Done.
$endgroup$
– Alexander
Sep 14 at 0:09
7
7
$begingroup$
That's seriously spooky.
$endgroup$
– Volker Siegel
Sep 15 at 5:27
$begingroup$
That's seriously spooky.
$endgroup$
– Volker Siegel
Sep 15 at 5:27
17
17
$begingroup$
@VolkerSiegel: It looks like Google knows a thing or two about scraping data from websites. :D
$endgroup$
– Eric Duminil
Sep 15 at 8:49
$begingroup$
@VolkerSiegel: It looks like Google knows a thing or two about scraping data from websites. :D
$endgroup$
– Eric Duminil
Sep 15 at 8:49
add a comment
|
$begingroup$
Make sure to follow naming conventions. You name two variables inappropriately:
My_table = soup.find('table','class':'wikitable sortable')
My_second_table = My_table.find_next_sibling('table')
Those are just normal variables, not class names, so they should be lower-case:
my_table = soup.find('table','class':'wikitable sortable')
my_second_table = my_table.find_next_sibling('table')
Twice you do
try:
title = tds[0].text.replace('n','')
except:
title = ""
I'd specify what exact exception you want to catch so you don't accidentally hide a "real" error if you start making changes in the future. I'm assuming here you're intending to catch an
AttributeError
.Because you have essentially the same code twice, and because the code is bulky, I'd factor that out into its own function.
Something like:
import bs4
def eliminate_newlines(tag: bs4.element.Tag) -> str: # Maybe pick a better name
try:
return tag.text.replace('n', '')
except AttributeError: # I'm assuming this is what you intend to catch
return ""
Now that with open
block is much neater:
with open('data.csv', "a", encoding='UTF-8') as csv_file:
writer = csv.writer(csv_file, delimiter=',')
for tr in My_table.find_all('tr')[2:]: # [2:] is to skip empty and header
tds = tr.find_all('td')
title = eliminate_newlines(tds[0])
year = eliminate_newlines(tds[2])
writer.writerow([title, year])
Edit: I was in the shower, and realized that you're actually probably intending to catch an IndexError
in case the page is malformed or something. Same idea though, move that code out into a function to reduce duplication. Something like:
from typing import List
def eliminate_newlines(tags: List[bs4.element.Tag], i: int) -> str:
return tags[i].text.replace('n', '') if len(tags) < i else ""
This could also be done using a condition statement instead of expression. I figured that it's pretty simple though, so a one-liner should be fine.
If you're using a newer version of Python, lines like:
", ".format(tds[0].text.replace('n',''), tds[2].text.replace('n',''))
Can make use of f-strings to do in-place string interpolation:
f"tds[0].text.replace('n', ''), tds[2].text.replace('n', '')"
In this particular case, the gain isn't much. They're very helpful for more complicated formatting though.
$endgroup$
add a comment
|
$begingroup$
Make sure to follow naming conventions. You name two variables inappropriately:
My_table = soup.find('table','class':'wikitable sortable')
My_second_table = My_table.find_next_sibling('table')
Those are just normal variables, not class names, so they should be lower-case:
my_table = soup.find('table','class':'wikitable sortable')
my_second_table = my_table.find_next_sibling('table')
Twice you do
try:
title = tds[0].text.replace('n','')
except:
title = ""
I'd specify what exact exception you want to catch so you don't accidentally hide a "real" error if you start making changes in the future. I'm assuming here you're intending to catch an
AttributeError
.Because you have essentially the same code twice, and because the code is bulky, I'd factor that out into its own function.
Something like:
import bs4
def eliminate_newlines(tag: bs4.element.Tag) -> str: # Maybe pick a better name
try:
return tag.text.replace('n', '')
except AttributeError: # I'm assuming this is what you intend to catch
return ""
Now that with open
block is much neater:
with open('data.csv', "a", encoding='UTF-8') as csv_file:
writer = csv.writer(csv_file, delimiter=',')
for tr in My_table.find_all('tr')[2:]: # [2:] is to skip empty and header
tds = tr.find_all('td')
title = eliminate_newlines(tds[0])
year = eliminate_newlines(tds[2])
writer.writerow([title, year])
Edit: I was in the shower, and realized that you're actually probably intending to catch an IndexError
in case the page is malformed or something. Same idea though, move that code out into a function to reduce duplication. Something like:
from typing import List
def eliminate_newlines(tags: List[bs4.element.Tag], i: int) -> str:
return tags[i].text.replace('n', '') if len(tags) < i else ""
This could also be done using a condition statement instead of expression. I figured that it's pretty simple though, so a one-liner should be fine.
If you're using a newer version of Python, lines like:
", ".format(tds[0].text.replace('n',''), tds[2].text.replace('n',''))
Can make use of f-strings to do in-place string interpolation:
f"tds[0].text.replace('n', ''), tds[2].text.replace('n', '')"
In this particular case, the gain isn't much. They're very helpful for more complicated formatting though.
$endgroup$
add a comment
|
edited Sep 13 at 19:15
answered Sep 13 at 17:47
Carcigenicate
10.4k · 1 gold badge · 20 silver badges · 50 bronze badges
$begingroup$
Using the Wikipedia API
As mentioned in another answer, Wikipedia provides an HTTP API for fetching article content, and you can use it to get the content in much cleaner formats than HTML. Often this is the better option; for example, it was a much better fit for a little project I wrote that extracts the first sentence of a Wikipedia article.
However, in your case, you have to parse tables anyway. Whether you're parsing them from HTML or from the Wikipedia API's "wikitext" format, I don't think there's much of a difference. So I consider this subjective.
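For the curious, a rough sketch of what the API route might look like, using the standard MediaWiki action API endpoint and the requests library discussed below (you would still have to parse the wikitable markup out of the returned wikitext yourself):

import requests

# Fetch the raw wikitext of the article via the MediaWiki action API.
response = requests.get(
    'https://en.wikipedia.org/w/api.php',
    params={
        'action': 'parse',
        'page': 'Transistor_count',
        'prop': 'wikitext',
        'format': 'json',
        'formatversion': 2,
    },
)
wikitext = response.json()['parse']['wikitext']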
Using requests, not urllib
Never use urllib unless you're in an environment where, for some reason, you can't install external libraries. The requests library is preferred for fetching HTML.
In your case, to get the HTML, the code would be:
html = requests.get(url).text
with an import requests up top. For your simple example this isn't actually any easier, but reaching for requests rather than urllib is a general best practice in Python programming.
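If you want to be a little more defensive, requests also makes it easy to add a timeout and fail loudly on HTTP errors. A small sketch, assuming the url variable from your loop:

import requests

response = requests.get(url, timeout=10)  # don't hang forever on a slow connection
response.raise_for_status()               # raises requests.HTTPError for 4xx/5xx responses
html = response.text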
What's with all those with blocks?
You don't need three with blocks. When I glance at this code, three with blocks make it seem like the code is doing something much more complicated than it actually is: it looks as though you might be writing multiple CSV files, which you aren't. Just use one; it works the same:
with open('data.csv', 'w', encoding='UTF-8', newline='') as f:
    fields = ['Title', 'Year']
    writer = csv.writer(f, delimiter=',')
    writer.writerow(fields)
    for tr in My_table.find_all('tr')[2:]:  # [2:] is to skip empty and header
        tds = tr.find_all('td')
        try:
            title = tds[0].text.replace('\n', '')
        except:
            title = ""
        try:
            year = tds[2].text.replace('\n', '')
        except:
            year = ""
        writer.writerow([title, year])
    for tr in My_second_table.find_all('tr')[2:]:  # [2:] is to skip empty and header
        tds = tr.find_all('td')
        row = "{}, {}".format(tds[0].text.replace('\n', ''), tds[2].text.replace('\n', ''))
        writer.writerow(row.split(','))
Are those two tables really that different?
You have two for loops, each one processing a table and writing it to the CSV. The bodies of the two loops are different, so at a glance it looks like the two tables might have a different format... but they don't. You can copy-paste the body of the first loop into the second and it works the same:
for tr in My_table.find_all('tr')[2:]:  # [2:] is to skip empty and header
    tds = tr.find_all('td')
    try:
        title = tds[0].text.replace('\n', '')
    except:
        title = ""
    try:
        year = tds[2].text.replace('\n', '')
    except:
        year = ""
    writer.writerow([title, year])
for tr in My_second_table.find_all('tr')[2:]:  # [2:] is to skip empty and header
    tds = tr.find_all('td')
    try:
        title = tds[0].text.replace('\n', '')
    except:
        title = ""
    try:
        year = tds[2].text.replace('\n', '')
    except:
        year = ""
    writer.writerow([title, year])
With the above code, the only difference to the resulting CSV is that there aren't spaces after the commas, which I assume is not really important to you. Now that we've established the code doesn't need to be different in the two loops, we can just do this:
table_rows = My_table.find_all('tr')[2:] + My_second_table.find_all('tr')[2:]
for tr in table_rows:
    tds = tr.find_all('td')
    try:
        title = tds[0].text.replace('\n', '')
    except:
        title = ""
    try:
        year = tds[2].text.replace('\n', '')
    except:
        year = ""
    writer.writerow([title, year])
Only one for loop. Much easier to understand!
Parsing strings by hand
So that second loop doesn't need to be there at all, but let's look at the code inside of it anyway:
row = ", ".format(tds[0].text.replace('n',''), tds[2].text.replace('n',''))
writer.writerow(row.split(','))
Um... you just concatenated two strings together with a comma, only to call split and break them apart at that same comma on the very next line. I'm sure that now it's been pointed out, you can see this is pointless, but I want to pull you up on one other thing in these two lines of code.
You are essentially trying to parse data by hand with that row.split, which is always dangerous. This is an important and general lesson about programming. What if the name of a chip had a comma in it? Then row would contain more commas than just the one you put in there, and your call to writerow would end up inserting more than two columns!
Never parse data by hand unless you absolutely have to, and never write formats like CSV or JSON by hand unless you absolutely have to. Always use a library, because there are always pathological edge cases, like a comma in the chip name, that you won't think of and that will break your code. Libraries that have been around for a while have had those bugs ironed out. With these two lines:
row = "{}, {}".format(tds[0].text.replace('\n', ''), tds[2].text.replace('\n', ''))
writer.writerow(row.split(','))
you are attempting to split a table row into its two columns yourself, by hand, which is why you made a mistake (just like anyone would). In the first loop, by contrast, the code that does this splitting is these two lines:
title = tds[0].text.replace('\n', '')
year = tds[2].text.replace('\n', '')
Here you are relying on BeautifulSoup to have split the columns cleanly into tds[0] and tds[2], which is much safer and is why this code is much better.
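To make that concrete, here's a small standalone snippet (not part of the scraper; the chip name is just an invented example) showing that csv.writer quotes a field containing a comma, so it stays a single column:

import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Intel 4004 (4-bit, 16-pin)", "1971"])

# The comma inside the title is kept inside quotes instead of
# creating a third column:
print(buffer.getvalue())  # "Intel 4004 (4-bit, 16-pin)",1971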
Mixing of input parsing and output generating
The code that parses the HTML is mixed together with the code that generates the CSV. This is a poor decomposition of the problem into sub-problems. The code that writes the CSV should think only in terms of titles and years; it shouldn't have to know that they come from HTML. Likewise, the code that parses the HTML should only solve the problem of extracting the titles and years; it should have no idea that the data is going to be written to a CSV. In other words, I want the for loop that writes the CSV to look like this:
for (title, year) in rows:
    writer.writerow([title, year])
We can do this by rewriting the with block like this:
with open('data.csv', 'w', encoding='UTF-8', newline='') as f:
    fields = ['Title', 'Year']
    writer = csv.writer(f, delimiter=',')
    writer.writerow(fields)

    table_rows = My_table.find_all('tr')[2:] + My_second_table.find_all('tr')[2:]
    parsed_rows = []
    for tr in table_rows:
        tds = tr.find_all('td')
        try:
            title = tds[0].text.replace('\n', '')
        except:
            title = ""
        try:
            year = tds[2].text.replace('\n', '')
        except:
            year = ""
        parsed_rows.append((title, year))

    for (title, year) in parsed_rows:
        writer.writerow([title, year])
Factoring into functions
To make the code more readable and really separate the HTML stuff from the CSV stuff, we can break the script into functions. Here's my complete script.
import requests
from bs4 import BeautifulSoup
import csv

urls = ['https://en.wikipedia.org/wiki/Transistor_count']
data = []

def get_rows(html):
    soup = BeautifulSoup(html, 'html.parser')
    My_table = soup.find('table', {'class': 'wikitable sortable'})
    My_second_table = My_table.find_next_sibling('table')
    table_rows = My_table.find_all('tr')[2:] + My_second_table.find_all('tr')[2:]
    parsed_rows = []
    for tr in table_rows:
        tds = tr.find_all('td')
        try:
            title = tds[0].text.replace('\n', '')
        except:
            title = ""
        try:
            year = tds[2].text.replace('\n', '')
        except:
            year = ""
        parsed_rows.append((title, year))
    return parsed_rows

for url in urls:
    html = requests.get(url).text
    parsed_rows = get_rows(html)
    with open('data.csv', 'w', encoding='UTF-8', newline='') as f:
        fields = ['Title', 'Year']
        writer = csv.writer(f, delimiter=',')
        writer.writerow(fields)
        for (title, year) in parsed_rows:
            writer.writerow([title, year])
What would be even better would be for get_rows to be a generator rather than a regular function (see the sketch below), but that's more advanced Python programming. This is fine for now.
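For reference, a generator version might look roughly like this sketch (iter_rows is just a name picked for illustration); the writing loop would then iterate over iter_rows(html) directly instead of a pre-built list:

def iter_rows(html):
    # Generator variant of get_rows: yields (title, year) pairs one at a time
    # instead of building the whole list in memory first.
    soup = BeautifulSoup(html, 'html.parser')
    my_table = soup.find('table', {'class': 'wikitable sortable'})
    my_second_table = my_table.find_next_sibling('table')
    for tr in my_table.find_all('tr')[2:] + my_second_table.find_all('tr')[2:]:
        tds = tr.find_all('td')
        try:
            title = tds[0].text.replace('\n', '')
        except IndexError:
            title = ""
        try:
            year = tds[2].text.replace('\n', '')
        except IndexError:
            year = ""
        yield title, year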
$endgroup$
edited Sep 15 at 19:14
answered Sep 15 at 12:02
Jack M
223 · 2 silver badges · 7 bronze badges
$begingroup$
"In Python 3 the requests library comes built-in…" It does not, unfortunately. But you can easily install it using pip install requests.
$endgroup$
– grooveplex
Sep 15 at 17:46
$begingroup$
@grooveplex Oh, maybe it just comes with Ubuntu? I don't think I've had to install requests in recent memory.
$endgroup$
– Jack M
Sep 15 at 19:10
$begingroup$
When you asked for an easier way, do you mean an easier way that uses “real code” as in the cartoon?
That looks pretty easy, but I think slightly easier is possible.
In general, when I want a table from a web page, I select it with my mouse and paste special (unformatted) into a spreadsheet. A bonus is that I don’t have to deal with NSA’s biggest competitor (Google).
$endgroup$
answered Sep 15 at 22:51
WGroleau
137 · 2 bronze badges
$begingroup$
@AlexV A user on Stack Overflow suggested I post here; I only posted here because they told me to after posting there. My code is working, but not the way I want. However, the issue has been solved, so if you want you can remove this post. Thank you anyway.
$endgroup$
– Sofelia
Sep 13 at 13:22
$begingroup$
You can also fix the issue in the post, reword the question and leave your code for an actual review.
$endgroup$
– AlexV
Sep 13 at 13:24
$begingroup$
@AlexV Thank you, I reworded it. I don't know if I still need it, but I guess it's always nice to get your code reviewed.
$endgroup$
– Sofelia
Sep 13 at 13:27
$begingroup$
In general, when I want a table from a web page, I select it with my mouse, paste
$endgroup$
– WGroleau
Sep 15 at 22:45