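Code for fetching a web page's HTML source in ASP.NET is common, and most developers know how to write it. Today, though, I ran into a small snag while doing so: after I added gzip content encoding to the request headers, the response kept coming back garbled. I did get it solved in the end, so here are my notes.
Fetching a web page with gzip in ASP.NET
Ordinarily, ASP.NET code that fetches a page's source does not use gzip and simply downloads the page as-is. The key code looks like this: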
string PageUrl = "http://www.webkaka.com/";
WebRequest request = WebRequest.Create(PageUrl);
WebResponse response = request.GetResponse();
Stream resStream = response.GetResponseStream();
Encoding enc = Encoding.GetEncoding("GB2312");
StreamReader sr = new StreamReader(resStream, enc);
string strHtml = sr.ReadToEnd();
resStream.Close();
sr.Close();
With the gzip content encoding added to the request headers:
string PageUrl = "http://www.webkaka.com/";
WebRequest request = WebRequest.Create(PageUrl);
request.Headers.Add("Accept-Encoding", "gzip,deflate");
WebResponse response = request.GetResponse();
Stream resStream = response.GetResponseStream();
Encoding enc = Encoding.GetEncoding("GB2312");
StreamReader sr = new StreamReader(resStream, enc);
string strHtml = sr.ReadToEnd();
resStream.Close();
sr.Close();
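However, the page source returned by this code is garbled. More precisely, the response body is still gzip-compressed and its bytes are being decoded directly as text, so a further step is needed to turn that data back into readable HTML.
The final working code is as follows: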
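To see why the result is unreadable, the effect can be reproduced offline without any HTTP request. The sketch below is only an illustration: the `GzipDemo` class and its method names are made up for this example, and UTF-8 is used instead of GB2312 so that it runs without registering extra code pages.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipDemo
{
    // Compress text roughly the way a server would for "Content-Encoding: gzip".
    public static byte[] Compress(string text, Encoding enc)
    {
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                byte[] raw = enc.GetBytes(text);
                gzip.Write(raw, 0, raw.Length);
            } // disposing the GZipStream writes the remaining gzip data
            return buffer.ToArray();
        }
    }

    // What the listing above effectively does: treat compressed bytes as text.
    public static string DecodeWithoutDecompressing(byte[] body, Encoding enc)
    {
        return enc.GetString(body);
    }

    // The missing step: decompress first, then decode the characters.
    public static string DecompressThenDecode(byte[] body, Encoding enc)
    {
        using (var gzip = new GZipStream(new MemoryStream(body), CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, enc))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Calling `DecodeWithoutDecompressing` on the output of `Compress` gives the same kind of garbage as the request above, while `DecompressThenDecode` returns the original markup.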
string PageUrl = "http://www.webkaka.com/";
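// AutomaticDecompression is defined on HttpWebRequest, not on WebRequest, so cast first.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(PageUrl);
request.Headers.Add("Accept-Encoding", "gzip,deflate");
// Enable both encodings advertised in the header above.
request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;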
WebResponse response = request.GetResponse();
Stream resStream = response.GetResponseStream();
Encoding enc = Encoding.GetEncoding("GB2312");
StreamReader sr = new StreamReader(resStream, enc);
string strHtml = sr.ReadToEnd();
resStream.Close();
sr.Close();
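If `AutomaticDecompression` cannot be used for some reason, an alternative is to decompress the stream by hand according to the response's `Content-Encoding` value. This is a different technique from the listing above, shown only as a minimal sketch. `ResponseDecoding` and `WrapForContentEncoding` are hypothetical names, and only gzip is handled; any other encoding is passed through unchanged.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class ResponseDecoding
{
    // Returns a stream that yields the decompressed body when the server
    // reported "gzip"; otherwise returns the original stream untouched.
    public static Stream WrapForContentEncoding(Stream body, string contentEncoding)
    {
        string encoding = (contentEncoding ?? string.Empty).Trim();
        if (encoding.Equals("gzip", StringComparison.OrdinalIgnoreCase))
        {
            return new GZipStream(body, CompressionMode.Decompress);
        }
        // Assumption of this sketch: anything else is read as-is.
        return body;
    }
}
```

With an `HttpWebResponse`, the value to pass in would be its `ContentEncoding` property, and the returned stream would then go into the `StreamReader` in place of `resStream`.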
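Can gzip be added when fetching page source with WebClient?
There is another way to fetch a page's source in ASP.NET: the WebClient class, which looks even simpler to use than WebRequest. The code is as follows: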
string PageUrl = "http://www.webkaka.com/";
WebClient wc = new WebClient();
wc.Credentials = CredentialCache.DefaultCredentials;
Encoding enc = Encoding.GetEncoding("GB2312");
Byte[] pageData = wc.DownloadData(PageUrl);
string strHtml = enc.GetString(pageData);
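Can this approach also add a header such as ("Accept-Encoding", "gzip,deflate")? For example:
WebClient wc = new WebClient();
wc.Headers[HttpRequestHeader.AcceptEncoding] = "gzip";
From what I could find, WebClient does not appear to support gzip out of the box. Getting it to work may take some extra code or an additional component.
WebClient does not expose a decompression property directly. You have to derive from it and set the property on the underlying HttpWebRequest.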
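class MyWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        // Only HTTP(S) requests are HttpWebRequest; leave other schemes untouched.
        HttpWebRequest httpRequest = request as HttpWebRequest;
        if (httpRequest != null)
        {
            httpRequest.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
        }
        return request;
    }
}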
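As a rough usage sketch of the derived-class idea, the variant below adds two illustrative members. `GzipWebClient`, `DownloadText`, and `BuildRequest` are names invented here, not part of WebClient. `BuildRequest` exists only so that the decompression setting can be inspected without making a network call.

```csharp
using System;
using System.Net;
using System.Text;

public class GzipWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        HttpWebRequest http = request as HttpWebRequest;
        if (http != null)
        {
            // Ask the framework to decompress gzip and deflate responses.
            http.AutomaticDecompression =
                DecompressionMethods.GZip | DecompressionMethods.Deflate;
        }
        return request;
    }

    // Hypothetical helper: download the page and decode it with the given encoding.
    public string DownloadText(string url, Encoding enc)
    {
        byte[] pageData = DownloadData(url);
        return enc.GetString(pageData);
    }

    // Hypothetical helper: returns the configured request without sending it.
    public WebRequest BuildRequest(Uri address)
    {
        return GetWebRequest(address);
    }
}
```

A call such as `new GzipWebClient().DownloadText(PageUrl, Encoding.GetEncoding("GB2312"))` would then mirror the earlier WebClient listing, assuming the derived class behaves as intended.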
I have not tested this myself; interested readers are welcome to try the approach sketched in the code above. *^_^*
from:http://www.webkaka.com/tutorial/asp.net/2015/021120/