09 | May | 2018 | LongSheng

More on SEO After Fully Separating the Front End from the Back End

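If you use Node.js as a middle layer, you can stop reading here: there is no SEO problem, because the front end is no longer just a front end; it has become full stack. This article is about pages that are rendered after calling an API via Ajax. If you view the HTML source of such a page, it contains no data, so search engines cannot index anything useful, let alone pick up updates.

My approach is roughly as follows:

- Do not modify the source code of the original project.
- Optimize only for search engines.
- Use the open-source Java project HtmlUnit as a relay. HtmlUnit simulates the popular browser engines but has no user interface. A page passed through it comes out as the HTML source after Ajax has filled in the data.

That gives the following steps:

1. Set up the HtmlUnit conversion project first. HtmlUnit also has a .NET version, but in my tests it was slow; the original Java project performs better, so just use the Java one. It does not matter if you do not know Java: a few dozen lines of code are enough. See my previous few posts for reference.
2. Write an interceptor that suits the language of your project. For Java there is little to say: you can skip step 1, import the library, and use it directly. For .NET it is best to write an HttpModule: first, it does not pollute the original project; second, it performs well. For PHP, frameworks such as laravel/yii/tp are built on an interceptor mechanism anyway.
3. Prepare the key identifiers from each search engine's User-Agent for the interceptor to match. Here are mine: Baiduspider, Googlebot, bingbot, 360Spider, Sogou web spider, Yahoo! Slurp, YoudaoBot, Sosospider. The names make it easy to guess which search engine each belongs to.
4. When the interceptor detects a search engine, call HtmlUnit to produce the complete HTML source; the data can then be indexed.

I hope this helps. If anything is unclear or you have comments, feel free to contact me through the email address at the bottom of this blog.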
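The detection in steps 3 and 4 can be sketched in plain Java. This is only a minimal illustration, not the author's interceptor: the class name `CrawlerDetector` and the method `isCrawler` are hypothetical, and the matching is a simple case-insensitive substring check against the identifiers listed above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: decides whether a request looks like a search-engine spider. */
final class CrawlerDetector {

    // Identifiers taken from the list in the post.
    private static final List<String> SPIDER_TOKENS = Arrays.asList(
            "Baiduspider", "Googlebot", "bingbot", "360Spider",
            "Sogou web spider", "Yahoo! Slurp", "YoudaoBot", "Sosospider");

    private CrawlerDetector() { }

    /** Returns true when the User-Agent header contains one of the tokens, ignoring case. */
    static boolean isCrawler(String userAgent) {
        if (userAgent == null) {
            return false; // no header, treat as a normal visitor
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        for (String token : SPIDER_TOKENS) {
            if (ua.contains(token.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isCrawler(
                "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"));
        System.out.println(isCrawler("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"));
    }
}
```

An interceptor would call `isCrawler` with the request's User-Agent header and, when it returns true, hand the URL to the HtmlUnit service instead of serving the empty shell page. The header is easy to forge, so the result is best treated as a routing hint rather than proof of identity.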

龙生 09 May 2018

A working login filter for ASP.NET

using System;
using System.Text.RegularExpressions;
using System.Web;

namespace MyMook
{
    public class MyHttpModule : IHttpModule
    {
        public void Dispose()
        {
        }

        public void Init(HttpApplication application)
        {
            // Be careful not to hook BeginRequest here: the session is not available at that stage.
            application.AcquireRequestState += new EventHandler(context_AcquireRequestState);
            // application.BeginRequest += new EventHandler(context_AcquireRequestState);
        }

        void context_AcquireRequestState(Object source, EventArgs e)
        {
            HttpApplication application = (HttpApplication)source;
            HttpContext context = application.Context;
            string path = context.Request.Path;

            // Only filter .aspx and .ashx requests.
            string extension = context.Request.CurrentExecutionFilePathExtension;
            if (!extension.Equals(".aspx") && !extension.Equals(".ashx"))
            {
                return;
            }

            // Do not filter anything inside the WebLogin folder.
            if (Regex.IsMatch(path, @"/WebLogin/+"))
            {
                return;
            }

            // Session can be null for handlers without session state; treat that as "not logged in".
            // A null check replaces the original try/catch, which would also have caught the
            // ThreadAbortException that Response.Redirect raises.
            object user = context.Session == null ? null : context.Session["user"];
            if (user == null)
            {
                context.Response.Redirect("~/WebLogin/Login.aspx");
            }
        }
    }
}

In web.config:

<system.webServer>
  <modules>
    <add name="MyHttpModule" type="MyMook.MyHttpModule,MyMook"/>
  </modules>
</system.webServer>

Key point: 1. When registering the event, do not use application.BeginRequest; the Session cannot be obtained at that stage.

application.AcquireRequestState += new EventHandler(context_AcquireRequestState);
// application.BeginRequest += new EventHandler(context_AcquireRequestState);

from:https://blog.csdn.net/touch_the_world/article/details/37936297

龙生 09 May 2018

User Agents of Search Engine Spiders and Crawlers: An Overview

Today I analyzed the Apache logs of two websites. Log analysis is tedious but worthwhile, for example for tracking the User Agents used by SPAM. Along the way I compiled the User Agents of some search engine crawlers, which I share here; additions are welcome.

Microsoft
"msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)": msnbot, mostly replaced by bingbot by now, though it still shows up occasionally.
"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)": Bing search.

"Sosospider+(+http://help.soso.com/webspider.htm)": Tencent Soso.
"Sosoimagespider+(+http://help.soso.com/soso-image-spider.htm)": Soso image search.

Yahoo
"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)": Yahoo (English).
"Yahoo! Slurp China" "Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)": Yahoo China.

Sogou
"http://pic.sogou.com" "Sogou Pic Spider/3.0(+http://www.sogou.com/docs/help/webmasters.htm#07)": Sogou image search.
"Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)": Sogou. Its spider is poorly written and keeps getting stuck in infinite loops, so I have blocked it both in robots.txt and in the settings.

Google
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)": Google.
"Googlebot-Image/1.0": Google image search.
"Mediapartners-Google": unknown.
"FeedBurner/1.0 (http://www.FeedBurner.com)": FeedBurner.
"AdsBot-Google-Mobile (+http://www.google.com/mobile/adsbot.html) Mozilla (iPhone; U; CPU iPhone OS 3 0 like Mac OS X) AppleWebKit (KHTML, like Gecko) Mobile Safari": AdWords mobile network.

Baidu
"Baiduspider-image+(+http://www.baidu.com/search/spider.htm)": Baidu image search.
"Mozilla/5.0 […]

龙生 09 May 2018

User-Agents of the Major Search Engine Spiders

GOOGLE
----------
66.249.70.212 - - [11/Jan/2009:00:03:35 -0700] "GET www.vidun.com/user-f2fc990265c712c49d51a18a32b39f0c.html?umid=f2fc990265c712c49d51a18a32b39f0c HTTP/1.1" 200 8148 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Referer: ""
UserAgent: "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.70.212 - - [11/Jan/2009:03:27:23 -0700] "GET www.youxigao.com/images/pink/demo.gif HTTP/1.1" 200 2367 "-" "Googlebot-Image/1.0"
Referer: ""
UserAgent: "Googlebot-Image/1.0"

209.85.238.7 - - [11/Jan/2009:00:02:58 -0700] "GET www.youxigao.com/rss/c/1009 HTTP/1.1" 404 37 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 10 subscribers; feed-id=8474979256887526569)"
Referer: ""
UserAgent: "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 10 subscribers; feed-id=8474979256887526569)"

Baidu
----------
60.28.22.38 - - [11/Jan/2009:01:28:09 -0700] "GET www.vidun.com/vwsoft-vwantileechs-download.html?pr=vwantileechs&vi=download HTTP/1.1" 200 27406 "http://www.vidun.com/" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
Referer: ""
UserAgent: "Baiduspider+(+http://www.baidu.com/search/spider.htm)"

YAHOO
----------
202.160.180.81 - - [11/Jan/2009:00:02:44 -0700] […]
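Pulling the User-Agent out of log lines like these can be automated. The following is a minimal sketch, assuming the lines follow the Apache combined log layout seen above (client, two identity fields, timestamp, request, status, size, referer, user agent); the class name `AccessLogParser` and its method are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that extracts the User-Agent field from a combined-format log line. */
final class AccessLogParser {

    // Groups: 1 client, 2 timestamp, 3 request, 4 status, 5 size, 6 referer, 7 user agent.
    private static final Pattern COMBINED = Pattern.compile(
            "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"");

    private AccessLogParser() { }

    /** Returns the quoted User-Agent, or an empty Optional when the line does not match. */
    static Optional<String> userAgentOf(String logLine) {
        if (logLine == null) {
            return Optional.empty();
        }
        Matcher m = COMBINED.matcher(logLine);
        // find() tolerates trailing text after the user-agent field.
        return m.find() ? Optional.of(m.group(7)) : Optional.empty();
    }

    public static void main(String[] args) {
        String line = "66.249.70.212 - - [11/Jan/2009:03:27:23 -0700] "
                + "\"GET www.youxigao.com/images/pink/demo.gif HTTP/1.1\" 200 2367 \"-\" "
                + "\"Googlebot-Image/1.0\"";
        System.out.println(userAgentOf(line).orElse("(no match)"));
    }
}
```

Counting the returned strings over a whole log file gives a quick picture of which spiders visit a site and how often.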

龙生 09 May 2018

Using HtmlUnit to Crawl Ajax-Generated Pages and Automatically Call Page JavaScript Functions

From the introduction on the HtmlUnit website: HtmlUnit is a Java-based browser program without a graphical interface. It models HTML documents and provides an API that lets developers work as if in a normal browser: fetching page content, filling in forms, clicking links, and so on. It has very good JavaScript support, which keeps improving, can handle quite complex AJAX libraries, and can simulate Chrome, Firefox, and IE through different configurations.

This post uses the scraping of a football-lottery site as an example to get familiar with HtmlUnit.

WebClient wc = new WebClient(BrowserVersion.FIREFOX_38);
wc.getOptions().setJavaScriptEnabled(true);               // enable the JS interpreter (true by default)
wc.setJavaScriptTimeout(100000);                          // timeout for JS execution
wc.getOptions().setCssEnabled(false);                     // disable CSS support
wc.getOptions().setThrowExceptionOnScriptError(false);    // whether to throw when a script fails
wc.getOptions().setTimeout(10000);                        // connection timeout, 10 s here; 0 waits indefinitely
wc.setAjaxController(new NicelyResynchronizingAjaxController()); // enable AJAX support
wc.setWebConnection(
    new WebConnectionWrapper(wc) {
        public WebResponse getResponse(WebRequest request) throws IOException {
            ……
        }
    }
);
HtmlPage page = wc.getPage("http://XXXX.com/");
FileWriter fileWriter = new FileWriter("D:\\text.html");
String str = "";
// get the page as XML
str = page.asXml();
fileWriter.write(str);
// close the WebClient
wc.close();
fileWriter.close();

Fixing garbled data

The site's data is loaded dynamically by JS, and the JS files use two different encodings:

<script language="javascript" src="XXX.js" charset="gb2312"></script>
<script language="javascript" src="XXX.js" charset="utf-8"></script>

You can change the returned value by overriding the getResponse method of the WebConnectionWrapper class. For example, to modify the result returned for bfdata.js:

wc.setWebConnection(
    new WebConnectionWrapper(wc) {
        public WebResponse getResponse(WebRequest request) throws IOException {
            WebResponse response = super.getResponse(request);
            if […]
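The garbling itself comes from decoding bytes with a charset other than the one they were written in. The sketch below shows only that effect with the JDK's own classes and does not use HtmlUnit; the class name `CharsetDemo` is hypothetical, and it assumes the running JDK ships the GB2312 charset, as standard JDK builds normally do.

```java
import java.nio.charset.Charset;

/** Hypothetical demo: the same bytes decoded with the right and the wrong charset. */
final class CharsetDemo {

    private CharsetDemo() { }

    /** Decodes a response body using the charset it was actually written in. */
    static String decode(byte[] body, String charsetName) {
        return new String(body, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the source file encoding does not matter.
        String original = "\u4e2d\u6587";
        byte[] gb2312Bytes = original.getBytes(Charset.forName("GB2312"));

        // Decoding with the matching charset restores the text.
        System.out.println(decode(gb2312Bytes, "GB2312").equals(original));
        // Decoding the same bytes as UTF-8 does not.
        System.out.println(decode(gb2312Bytes, "UTF-8").equals(original));
    }
}
```

When a page mixes scripts in both encodings, each script's bytes need to be decoded with its own declared charset, which is why intercepting the response per request is a natural place for the fix.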

龙生 09 May 2018

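Writing a string to the response in Spring MVC

/**
 * Writes text to the response.
 * @param response the servlet response to write to
 * @param s the text to write
 */
public static void responseOut(HttpServletResponse response, String s) {
    response.setContentType("text/html;charset=UTF-8");
    response.setCharacterEncoding("UTF-8");
    // try-with-resources closes the writer even if write() fails
    try (PrintWriter pw = response.getWriter()) {
        pw.write(s);
    } catch (IOException e) {
        e.printStackTrace();
    }
}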

from:https://www.cnblogs.com/yanqin/p/7463294.html

龙生 09 May 2018

Simulating a login with HtmlUnit

PS: the only line I actually needed was webClient.getOptions().setThrowExceptionOnScriptError(false); The HtmlUnit jar is available at http://sourceforge.net/projects/htmlunit/files/htmlunit/ and the demo code is as follows:

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlPasswordInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

public class AutoLogin {

    /** Login page */
    private static final String LOGIN_URL = "http://website/login.aspx";

    /** Task list page */
    private static final String TASK_LIST_URL = "http://website/Banli.aspx";

    /**
     * @param args
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        testHomePage();
    }

    public static void testHomePage() throws Exception {
        final WebClient webClient = new WebClient(BrowserVersion.INTERNET_EXPLORER_8);
        webClient.getOptions().setThrowExceptionOnScriptError(false); // this line is required
        webClient.getOptions().setCssEnabled(false);
        // webClient.getOptions().setJavaScriptEnabled(true);
        // webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setTimeout(300000);

        // Fetch the login page
        HtmlPage page = (HtmlPage) webClient.getPage(LOGIN_URL);

        // Look up the form by name; it can also be fetched by index: page.getForms().get(0)
        final HtmlForm form = page.getFormByName("form1");

        // User name and password
        HtmlTextInput textUserName = form.getInputByName("txtUserName");
        textUserName.setText("username");
        HtmlPasswordInput txtPwd = form.getInputByName("txtPwd");
        txtPwd.setText("pass");

        // Trigger the login button through JavaScript
        Page page1 = page.executeJavaScript("$('#btn').click()").getNewPage();

        // The session cookie is now set, so the task list page can be fetched
        page1 = webClient.getPage(TASK_LIST_URL);

        System.out.println("*************************************************************************************");
        System.out.println(page1.getWebResponse().getContentAsString());
        System.out.println("*************************************************************************************");
        System.out.println("");
        System.out.println("Cookies : " + webClient.getCookieManager().getCookies().toString());
    }
}

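I could not work out what ASP.NET was doing internally. I tried many approaches without success and went through countless sites before stumbling on this configuration: http://stackoverflow.com/questions/20352284/scraping-aspx-page-using-htmlunit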

import java.io.IOException;
import java.net.MalformedURLException;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class teste {

    public static void main(String args[]) throws FailingHttpStatusCodeException, MalformedURLException, IOException
    {
        HtmlPage page = null;
        String url = "http://www.bmfbovespa.com.br/cias-listadas/empresas-listadas/BuscaEmpresaListada.aspx?Idioma=pt-br";

        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_17);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(false);
        webClient.getOptions().setJavaScriptEnabled(false);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setTimeout(30000);

        page = webClient.getPage(url);
        System.out.println("Current page: Empresas Listadas | BM&FBOVESPA");

        // Click the button identified by its ASP.NET-generated id and follow to the resulting page
        HtmlElement theElement1 = (HtmlElement) page.getElementById("ctl00_contentPlaceHolderConteudo_BuscaNomeEmpresa1_btnTodas");
        page = theElement1.click();

        System.out.println(page.asText());
        System.out.println("Test has completed successfully");
    }
}

In my final tests, leaving out webClient.getOptions().setThrowExceptionOnScriptError(false); produced the following error every time:

Exception in thread "main" ======= EXCEPTION START ========

Exception class=[java.lang.RuntimeException]

com.gargoylesoftware.htmlunit.ScriptException: Exception invoking click

at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:954)

at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628)

at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:513)

at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:836)

at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:812)

at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:800)

at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScriptIfPossible(HtmlPage.java:910)

at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScript(HtmlPage.java:878)

at com.suypower.AutoLogin12345.testHomePage(AutoLogin12345.java:48)

at com.suypower.AutoLogin12345.main(AutoLogin12345.java:23)

Caused by: java.lang.RuntimeException: Exception invoking click

at net.sourceforge.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:181)

at net.sourceforge.htmlunit.corejs.javascript.FunctionObject.call(FunctionObject.java:449)

at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1536)

at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:798)

at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105)

at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:411)

at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:309)

at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3286)

at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:115)

at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$3.doRun(JavaScriptEngine.java:827)

at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:939)

... 9 more

Caused by: com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot read property "nodeName" from null (http://xxxx/305000772#7)

at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:954)

at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628)

at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:513)

at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:836)

at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:812)

at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:800)

at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScriptIfPossible(HtmlPage.java:910)

at com.gargoylesoftware.htmlunit.html.HtmlScript.executeInlineScriptIfNeeded(HtmlScript.java:354)

at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:415)

at com.gargoylesoftware.htmlunit.html.HtmlScript$3.execute(HtmlScript.java:271)

at com.gargoylesoftware.htmlunit.html.HtmlScript.onAllChildrenAddedToPage(HtmlScript.java:293)

at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:799)

at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)

at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:756)

at org.cyberneko.html.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:1170)

at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1072)

at org.cyberneko.html.filters.DefaultFilter.endElement(DefaultFilter.java:206)

at org.cyberneko.html.filters.NamespaceBinder.endElement(NamespaceBinder.java:330)

at org.cyberneko.html.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:3126)

at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2093)

at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:920)

at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)

at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)

at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)

at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:1039)

at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:252)

at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:198)

at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:271)

at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:159)

at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:478)

at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:352)

at com.gargoylesoftware.htmlunit.html.BaseFrameElement.loadInnerPageIfPossible(BaseFrameElement.java:183)

at com.gargoylesoftware.htmlunit.html.BaseFrameElement.loadInnerPage(BaseFrameElement.java:121)

at com.gargoylesoftware.htmlunit.html.HtmlPage.loadFrames(HtmlPage.java:1893)

at com.gargoylesoftware.htmlunit.html.HtmlPage.initialize(HtmlPage.java:227)

at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:485)

at com.gargoylesoftware.htmlunit.WebClient.loadDownloadedResponses(WebClient.java:2135)

at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.doProcessPostponedActions(JavaScriptEngine.java:982)

at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.processPostponedActions(JavaScriptEngine.java:1072)

at com.gargoylesoftware.htmlunit.html.DomElement.click(DomElement.java:789)

at com.gargoylesoftware.htmlunit.html.HtmlImageInput.click(HtmlImageInput.java:152)

at com.gargoylesoftware.htmlunit.javascript.host.html.HTMLInputElement.click(HTMLInputElement.java:477)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at net.sourceforge.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:153)

... 19 more

Caused by: net.sourceforge.htmlunit.corejs.javascript.EcmaError: TypeError: Cannot read property "nodeName" from null (http://xxxx/305000772#7)

at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3935)

at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3919)

at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError(ScriptRuntime.java:3944)

at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError2(ScriptRuntime.java:3960)

at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.undefReadError(ScriptRuntime.java:3971)

at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.getObjectProp(ScriptRuntime.java:1519)

at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1243)

at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:798)

at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:118)

at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$3.doRun(JavaScriptEngine.java:827)

at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:939)

... 65 more

Enclosed exception:

java.lang.RuntimeException: Exception invoking click
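The trace above wraps the real problem, the EcmaError about reading "nodeName" from null, under several "Caused by" layers. When reading such output in code rather than by eye, walking the cause chain to its end is a common technique. The sketch below uses only the JDK; the class name `RootCause` is hypothetical and the exceptions in `main` merely imitate the nesting above.

```java
/** Hypothetical helper that walks a chain of wrapped exceptions to the innermost cause. */
final class RootCause {

    private RootCause() { }

    /** Returns the deepest cause of the given throwable, or the throwable itself if it has none. */
    static Throwable of(Throwable t) {
        Throwable current = t;
        // getCause() returns null at the end of the chain; the second check guards against self-reference.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-ins for the layers in the trace; the real classes come from HtmlUnit.
        Throwable scriptError = new IllegalStateException(
                "TypeError: Cannot read property \"nodeName\" from null");
        Throwable wrapped = new RuntimeException("Exception invoking click",
                new RuntimeException("ScriptException", scriptError));

        System.out.println(of(wrapped).getMessage());
    }
}
```

Logging only the root cause keeps the output short while still pointing at the failing script, which is usually the line worth investigating.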