The eighth Python Meetup was successfully held at Leapfrog Technology yesterday (December 13, 2015). The meetup was all about web scraping with different technologies. It focused mainly on how web scraping in Python can be used both for general purposes and commercially, for example in data processing services like Grepsr.
Sudip Kafle, Co-Founder and CTO at Phunka Technologies and an active member of Python User Group Nepal, opened the meetup with a brief introduction to the Python community and how it started back in 2013 at Pulchowk Campus. After that, Krishna Sunuwar, CEO of Ontreat, gave a presentation on the web scraping technologies he has used commercially to build his products. He gave a brief introduction to web scraping tools and described how he has used Python scraping in business to mirror one site's content onto many other sites. He also said that Python is the best language for this work because its rich set of libraries gives a developer everything needed to become productive. He then talked about the things to consider while scraping data from a website.
1. Network (Connecting to other websites)
Sunuwar covered network libraries like urllib2, requests and mechanize. He said that he mainly uses requests and mechanize, but that the choice of library depends on what the developer needs. He added that some libraries have more advanced features that make a developer's life easier.
2. HTML Parser (To parse the response HTML)
Sunuwar presented different libraries for parsing HTML and extracting content from a web page. He gave demos of BeautifulSoup (a third-party library), lxml, HTMLParser (a core library) and regex, and explained the advantages and disadvantages of each. “Among all the libraries, regex has all the features that other libraries don’t have,” he stated.
3. Data Saving
On this topic, Sunuwar showed how the scraped data can be saved and then passed on for further processing.
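The three steps above (network, parsing, saving) can be sketched end to end. This is only a minimal illustration, not the code shown at the meetup: it uses Python 3's standard library (`urllib.request` in place of urllib2 or requests, and `html.parser` in place of BeautifulSoup or lxml) so that it runs without extra installs. The sample HTML, the `LinkParser` class and the function names are assumptions made up for the example.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Network step: download a page and return its HTML as text."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkParser(HTMLParser):
    """Parsing step: collect (href, text) pairs for <a> tags that have an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


def save_links(links, out_file):
    """Saving step: write the rows as CSV to any file-like object."""
    writer = csv.writer(out_file)
    writer.writerow(["href", "text"])
    writer.writerows(links)


# Hypothetical sample page, used here instead of a live request.
SAMPLE_HTML = """
<html><body>
  <a href="/talks/1">Web scraping tools</a>
  <a href="/talks/2">Jupyter Notebook</a>
</body></html>
"""

links = extract_links(SAMPLE_HTML)
buffer = io.StringIO()
save_links(links, buffer)
print(buffer.getvalue())
```

For a real page, `extract_links(fetch(url))` would replace the sample string, and an opened file (`open("links.csv", "w", newline="")`) would replace the in-memory buffer.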
After this, he talked about the challenges we may face while scraping data from different websites. According to Sunuwar, these include:
1. Throttle Limit
Some websites detect the scraper as a bot and limit its requests or ban the IP of the requesting server.
2. IP Banning
As stated above, sending too many requests can lead to the IP being banned.
3. Authentication Required and Captcha
Some websites need authentication to view the content and some need captcha verification. This can be a serious problem, because the scraper cannot view the content at all.
To overcome these challenges, Sunuwar gave some tips, such as rotating the User-Agent to get around a website's throttle limit. When we rotate the User-Agent, we can trick websites into thinking that the requests are coming from different users. Similarly, using proxy IPs can solve IP ban issues, and services like DeathByCaptcha can be used to bypass a website's captcha. Finally, he gave demos of the Scrapy framework, which we can use to build a web crawler, web spider or web bot.
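The User-Agent rotation tip can be sketched as follows. This is a rough illustration under stated assumptions rather than the approach demonstrated at the meetup: the User-Agent strings and URLs are made-up placeholders, `build_request` and `paced` are hypothetical helper names, and the pause between requests is an extra pacing step added for the example. The code only prepares requests with the standard library and does not send them.

```python
import itertools
import time
from urllib.request import Request

# Placeholder User-Agent strings (assumed values, not real browser signatures).
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11) ExampleBrowser/3.0",
]
_agent_cycle = itertools.cycle(USER_AGENTS)


def build_request(url):
    """Prepare a request whose User-Agent is the next one in the rotation."""
    return Request(url, headers={"User-Agent": next(_agent_cycle)})


def paced(urls, delay_seconds=1.0):
    """Yield one prepared request per URL, pausing between consecutive ones."""
    for index, url in enumerate(urls):
        if index:
            time.sleep(delay_seconds)
        yield build_request(url)


# Hypothetical URLs; delay set to 0 so the demo finishes immediately.
urls = ["http://example.com/page/%d" % n for n in range(1, 5)]
prepared = list(paced(urls, delay_seconds=0))
for request in prepared:
    print(request.full_url, "->", request.get_header("User-agent"))
```

Each prepared request could then be passed to `urllib.request.urlopen`. Rotating the header alone may not be enough on sites that also track IP addresses, which is where the proxy tip comes in.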
The last speaker was Diwaker Ghimire, Facilitator of Web Development with Python and Django at Leapfrog Academy. He gave a short hands-on presentation on Jupyter Notebook, previously known as IPython Notebook, and how it can be used to write Python code efficiently. He gave a demo on how to write code, save the work and present it as slides.
After the meetup was over, coffee and cookies were offered to all the participants during a networking session, where participants and presenters talked informally.
Sagar Giri is a third-year B.Sc.CSIT student. He is a Python enthusiast and Python developer. He has one year of experience as a front-end developer on Makalu Plan Analytics Product (a data visualization product). He has worked with the Grails framework, using Java and Groovy as the main languages. Besides programming, he is a movie and series lover.