ZooKeeper
ZooKeeper是一个分布式的,开放源码的分布式应用程序协调服务,是Google的Chubby一个开源的实现,是Hadoop和Hbase的重要组件。它是一个为分布式应用提供一致性服务的软件,提供的功能包括:配置维护、域名服务、分布式同步、组服务等。 ZooKeeper的目标就是封装好复杂易出错的关键服务,将简单易用的接口和性能高效、功能稳定的系统提供给用户。 ZooKeeper包含一个简单的原语集, 提供Java和C的接口。 ZooKeeper代码版本中,提供了分布式独享锁、选举、队列的接口,代码在zookeeper-3.4.3\src\recipes。其中分布锁和队列有Java和C两个版本,选举只有Java版本。
View Details也谈前后端完全分离后,关于SEO优化的方案
如果用了nodejs作为中间件,您就不需要往下看了,不存在SEO的问题。此时前端不再只是前端,变成了全栈。 此文章指的是用Ajax调用API后渲染页面的情况,这时查看html源码,也是没有数据的,所以搜索引擎也收录不到有用的数据,更不用说更新了。 我的思路大概如下: 不修改原项目的源码。 只针对搜索引擎做优化。 用Java的开源项目HtmlUnit做中转,HtmlUnit模拟了流行的浏览器内核,却没有界面。经过转换的页面会输出ajax填充数据后的html源码。 由此得步骤如下: 先架设HtmlUnit转换项目。虽然HtmlUnit也有.Net版本,但测试后效率不高。还是Java的原项目效率高。所以就直接用Java的,不懂Java也没关系,几十行代码搞定。可以参考我前几篇文章。 根据你项目的语言做相应的拦截器,Java的语言就不说了,可以省略步骤1,直接导包开用就行了。.Net的话最好做HttpModule,一是不污染原项目;二是性能也高。php的话,如果用laravel/yii/tp等框架,本来就是拦截器机制。 准备拦截器要用到的各搜索引擎UserAgent里的关键标识,发一下我的:Baiduspider,Googlebot,bingbot,360Spider,Sogou web spider,Yahoo! Slurp,YoudaoBot,Sosospider。看名字也能猜到各自是哪个搜索引擎吧。 然后检测到是搜索引擎来了,就调用HtmlUnit转换出完整的html源码;然后有数据就会被收录了。 希望对各位同仁有些帮助。有不明白的、有意见的,欢迎通过本博客底部的邮箱和我联系~~
View Detailsasp.net一个已实现的登陆过滤器
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 |
using System; using System.Collections.Generic; using System.Linq; using System.Text; using System.Web; using System.Text.RegularExpressions; namespace MyMook { public class MyHttpModule : IHttpModule { public void Dispose() { } public void Init(HttpApplication application) { application.AcquireRequestState += new EventHandler(context_AcquireRequestState); // application.BeginRequest += new EventHandler(context_AcquireRequestState); //这里面要注意千万不要写成BeginRequest,那样就会无法获得session } void context_AcquireRequestState(Object source, EventArgs e) { HttpApplication application = (HttpApplication)source; HttpContext context = application.Context; string path=context.Request.Path; if (!context.Request.CurrentExecutionFilePathExtension.Equals(".aspx") && !context.Request.CurrentExecutionFilePathExtension.Equals(".ashx") ) { return; }//此处保证只过滤aspx/ashx/htm的请求 Match m = Regex.Match(path,@"/WebLogin/+"); if (m.Success) { return; }//不过滤文件夹WebLogin中的内容 try { object user = context.Session["user"]; if (user == null) { context.Response.Redirect("~/WebLogin/Login.aspx"); } else { return; } } catch { context.Response.Redirect("~/WebLogin/Login.aspx"); } } } } |
在web.config中:
1 2 3 4 5 |
<system.webServer> <modules> <add name="<span class="attribute-value">MyHttpModule"</span> <span class="attribute">type</span>=<span class="attribute-value">"MyMook.MyHttpModule,MyMook</span>"/> </modules> </system.webServer> |
要点: 1.注册事件时,不要写application.BeginRequest,这样会导致无法获得Session.
1 2 |
application.AcquireRequestState += new EventHandler(context_AcquireRequestState); // application.BeginRequest += new EventHandler(context_AcquireRequestState); |
from:https://blog.csdn.net/touch_the_world/article/details/37936297
View Details搜索引擎蜘蛛爬虫 User Agent 一览
今天分析研究了两个网站的 Apache 日志,分析日志虽然很无聊,但却是很有意义的事情,比如跟踪 SPAM 的 User Agent。顺便整理出一些搜索引擎爬虫的 User Agent,在这里分享一下,也欢迎补充。 微软 “msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)” msnbot,大多数已经被bingbot替代了,现在偶尔还可以看到。 “Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)” bing,必应 搜搜 “Sosospider+(+http://help.soso.com/webspider.htm)” 腾讯搜搜 “Sosoimagespider+(+http://help.soso.com/soso-image-spider.htm)” 搜搜图片 雅虎 “Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)” 雅虎英文 “Yahoo! Slurp China” “Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)” 雅虎中国 搜狗 “http://pic.sogou.com” “Sogou Pic Spider/3.0(+http://www.sogou.com/docs/help/webmasters.htm#07)” 搜狗图片 “Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)” 搜狗,搜狗的蜘蛛程序做的很不好,总是进入死循环,已经分别在 robots.txt 和 设置中屏蔽掉 Google “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)” Google “Googlebot-Image/1.0″ Google图片搜索 “Mediapartners-Google” 未知 “FeedBurner/1.0 (http://www.FeedBurner.com)” feedburner “AdsBot-Google-Mobile (+http://www.google.com/mobile/adsbot.html) Mozilla (iPhone; U; CPU iPhone OS 3 0 like Mac OS X) AppleWebKit (KHTML, like Gecko) Mobile Safari” Adwords移动网络 百度 “Baiduspider-image+(+http://www.baidu.com/search/spider.htm)” 百度图片 “Mozilla/5.0 […]
View Details各大搜索引擎蜘蛛的UserAgent
GOOGLE ——————————————————————— 66.249.70.212 – – [11/Jan/2009:00:03:35 -0700] "GET www.vidun.com/user-f2fc990265c712c49d51a18a32b39f0c.html?umid=f2fc990265c712c49d51a18a32b39f0c HTTP/1.1" 200 8148 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" Referer: "" UserAgent: "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 66.249.70.212 – – [11/Jan/2009:03:27:23 -0700] "GET www.youxigao.com/images/pink/demo.gif HTTP/1.1" 200 2367 "-" "Googlebot-Image/1.0" Referer: "" UserAgent: "Googlebot-Image/1.0" 209.85.238.7 – – [11/Jan/2009:00:02:58 -0700] "GET www.youxigao.com/rss/c/1009 HTTP/1.1" 404 37 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 10 subscribers; feed-id=8474979256887526569)" Referer: "" UserAgent: "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 10 subscribers; feed-id=8474979256887526569)" 百度 ——————————————————————— 60.28.22.38 – – [11/Jan/2009:01:28:09 -0700] "GET www.vidun.com/vwsoft-vwantileechs-download.html?pr=vwantileechs&vi=download HTTP/1.1" 200 27406 "http://www.vidun.com/" "Baiduspider+(+http://www.baidu.com/search/spider.htm)" Referer: "" UserAgent: "Baiduspider+(+http://www.baidu.com/search/spider.htm)" YAHOO ——————————————————————— 202.160.180.81 – – [11/Jan/2009:00:02:44 -0700] […]
View DetailsHtmlUnit爬取Ajax动态生成的网页以及自动调用页面javascript函数
HtmlUnit官网的介绍: HtmlUnit是一款基于Java的没有图形界面的浏览器程序。它模仿HTML document并且提供API让开发人员像是在一个正常的浏览器上操作一样,获取网页内容,填充表单,点击超链接等等。 它非常好的支持JavaScript并且仍在不断改进,同时能够解析非常复杂的AJAX库,通过不同的配置来模拟Chrome、Firefox和IE浏览器。 本文针对一个足彩网站抓取的例子,来熟悉HtmlUnit WebClient wc = new WebClient(BrowserVersion.FIREFOX_38); wc.getOptions().setJavaScriptEnabled(true); //启用JS解释器,默认为true wc.setJavaScriptTimeout(100000);//设置JS执行的超时时间 wc.getOptions().setCssEnabled(false); //禁用css支持 wc.getOptions().setThrowExceptionOnScriptError(false); //js运行错误时,是否抛出异常 wc.getOptions().setTimeout(10000); //设置连接超时时间 ,这里是10S。如果为0,则无限期等待 wc.setAjaxController(new NicelyResynchronizingAjaxController());//设置支持AJAX wc.setWebConnection( new WebConnectionWrapper(wc) { public WebResponse getResponse(WebRequest request) throws IOException { …… } } ); HtmlPage page = wc.getPage("http://XXXX.com/"); FileWriter fileWriter = new FileWriter("D:\\text.html"); String str = ""; //获取页面的XML代码 str = page.asXml(); fileWriter.write( str ); //关闭webclient wc.close(); fileWriter.close(); 解决数据乱码问题 该网站数据是由js动态载入,并且js有2种编码: <script language="javascript" src="XXX.js" charset="gb2312"></script> <script language="javascript" src="XXX.js" charset="utf-8"></script> 可以通过重写WebConnectionWrapper类的getResponse方法来修改返回值 例如,对bfdata.js的返回结果做修改 wc.setWebConnection( new WebConnectionWrapper(wc) { public WebResponse getResponse(WebRequest request) throws IOException { WebResponse response = super.getResponse(request); if […]
View Detailsspringmvc中输出字符串
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
/** * 输出文字 * @param response * @param s */ public static void responseOut(HttpServletResponse response,String s){ response.setContentType("text/html;charset=UTF-8"); response.setCharacterEncoding("UTF-8"); try ( PrintWriter pw = response.getWriter() ){ pw.write(s); } catch (IOException e) { e.printStackTrace(); } } |
from:https://www.cnblogs.com/yanqin/p/7463294.html
View Detailshtmlunit模拟登录
PS:我只用到了这一句 webClient.getOptions().setThrowExceptionOnScriptError(false); htmlunit jar项目路径http://sourceforge.net/projects/htmlunit/files/htmlunit/ demo代码如下
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 |
public class AutoLogin { /** 登录页面 */ private static final String LOGIN_URL = "http://website/login.aspx"; /** 任务列表页面 */ private static final String TASK_LIST_URL = "http://website/Banli.aspx"; /** * @param args * @throws Exception */ public static void main(String[] args) throws Exception { testHomePage(); } public static void testHomePage() throws Exception { final WebClient webClient = new WebClient(BrowserVersion.INTERNET_EXPLORER_8); webClient.getOptions().setThrowExceptionOnScriptError(false); //此行必须要加 webClient.getOptions().setCssEnabled(false); // webClient.getOptions().setJavaScriptEnabled(true); // webClient.getOptions().setThrowExceptionOnFailingStatusCode(false); webClient.getOptions().setTimeout(300000); // 获取首页 HtmlPage page = (HtmlPage) webClient.getPage(LOGIN_URL); // 根据form的名字获取页面表单,也可以通过索引来获取:page.getForms().get(0) final HtmlForm form = page.getFormByName("form1"); // 用户名/密码 HtmlTextInput textUserName = form.getInputByName("txtUserName"); textUserName.setText("username"); HtmlPasswordInput txtPwd = form.getInputByName("txtPwd"); txtPwd.setText("pass"); //调用JS触发登录按钮 Page page1 = page.executeJavaScript("$('#btn').click()").getNewPage(); page1 = webClient.getPage(TASK_LIST_URL); System.out.println("*************************************************************************************"); System.out.println(page1.getWebResponse().getContentAsString()); System.out.println("*************************************************************************************"); System.out.println(""); System.out.println("Cookies : " + webClient.getCookieManager().getCookies().toString()); } } |
搞不清ASP.NET内部什么逻辑,试了很多方法都不行,查看了无所网站,无意中看到一个这个配置http://stackoverflow.com/questions/20352284/scraping-aspx-page-using-htmlunit
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 |
import java.net.MalformedURLException; import com.gargoylesoftware.htmlunit.BrowserVersion; import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException; import com.gargoylesoftware.htmlunit.WebClient; import com.gargoylesoftware.htmlunit.html.HtmlElement; import com.gargoylesoftware.htmlunit.html.HtmlPage; public class teste { public static void main(String args[]) throws FailingHttpStatusCodeException, MalformedURLException, IOException { HtmlPage page = null; String url = "http://www.bmfbovespa.com.br/cias-listadas/empresas-listadas/BuscaEmpresaListada.aspx?Idioma=pt-br"; WebClient webClient = new WebClient(BrowserVersion.FIREFOX_17); webClient.getOptions().setThrowExceptionOnScriptError(false); webClient.getOptions().setCssEnabled(false); webClient.getOptions().setJavaScriptEnabled(false); webClient.getOptions().setThrowExceptionOnFailingStatusCode(false); webClient.getOptions().setTimeout(30000); page = webClient.getPage( url ); System.out.println("Current page: Empresas Listadas | BM&FBOVESPA"); HtmlElement theElement1 = (HtmlElement) page.getElementById("ctl00_contentPlaceHolderConteudo_BuscaNomeEmpresa1_btnTodas"); page = theElement1.click(); System.out.println(page.asText()); System.out.println("Test has completed successfully"); } } |
最后测试下来,如果不加 webClient.getOptions().setThrowExceptionOnScriptError(false);就一直报这个错误
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 |
Exception in thread "main" ======= EXCEPTION START ======== Exception class=[java.lang.RuntimeException] com.gargoylesoftware.htmlunit.ScriptException: Exception invoking click at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:954) at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628) at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:513) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:836) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:812) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:800) at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScriptIfPossible(HtmlPage.java:910) at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScript(HtmlPage.java:878) at com.suypower.AutoLogin12345.testHomePage(AutoLogin12345.java:48) at com.suypower.AutoLogin12345.main(AutoLogin12345.java:23) Caused by: java.lang.RuntimeException: Exception invoking click at net.sourceforge.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:181) at net.sourceforge.htmlunit.corejs.javascript.FunctionObject.call(FunctionObject.java:449) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1536) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:798) at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105) at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:411) at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:309) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3286) at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:115) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$3.doRun(JavaScriptEngine.java:827) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:939) ... 9 more Caused by: com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot read property "nodeName" from null (http://xxxx/305000772#7) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:954) at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628) at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:513) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:836) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:812) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:800) at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScriptIfPossible(HtmlPage.java:910) at com.gargoylesoftware.htmlunit.html.HtmlScript.executeInlineScriptIfNeeded(HtmlScript.java:354) at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:415) at com.gargoylesoftware.htmlunit.html.HtmlScript$3.execute(HtmlScript.java:271) at com.gargoylesoftware.htmlunit.html.HtmlScript.onAllChildrenAddedToPage(HtmlScript.java:293) at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:799) at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source) at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:756) at org.cyberneko.html.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:1170) at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1072) at org.cyberneko.html.filters.DefaultFilter.endElement(DefaultFilter.java:206) at org.cyberneko.html.filters.NamespaceBinder.endElement(NamespaceBinder.java:330) at org.cyberneko.html.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:3126) at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2093) at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:920) at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499) at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452) at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:1039) at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:252) at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:198) at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:271) at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:159) at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:478) at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:352) at com.gargoylesoftware.htmlunit.html.BaseFrameElement.loadInnerPageIfPossible(BaseFrameElement.java:183) at com.gargoylesoftware.htmlunit.html.BaseFrameElement.loadInnerPage(BaseFrameElement.java:121) at com.gargoylesoftware.htmlunit.html.HtmlPage.loadFrames(HtmlPage.java:1893) at com.gargoylesoftware.htmlunit.html.HtmlPage.initialize(HtmlPage.java:227) at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:485) at com.gargoylesoftware.htmlunit.WebClient.loadDownloadedResponses(WebClient.java:2135) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.doProcessPostponedActions(JavaScriptEngine.java:982) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.processPostponedActions(JavaScriptEngine.java:1072) at com.gargoylesoftware.htmlunit.html.DomElement.click(DomElement.java:789) at com.gargoylesoftware.htmlunit.html.HtmlImageInput.click(HtmlImageInput.java:152) at com.gargoylesoftware.htmlunit.javascript.host.html.HTMLInputElement.click(HTMLInputElement.java:477) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at net.sourceforge.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:153) ... 19 more Caused by: net.sourceforge.htmlunit.corejs.javascript.EcmaError: TypeError: Cannot read property "nodeName" from null (http://xxxx/305000772#7) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3935) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3919) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError(ScriptRuntime.java:3944) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError2(ScriptRuntime.java:3960) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.undefReadError(ScriptRuntime.java:3971) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.getObjectProp(ScriptRuntime.java:1519) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1243) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:798) at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:118) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$3.doRun(JavaScriptEngine.java:827) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:939) ... 65 more Enclosed exception: java.lang.RuntimeException: Exception invoking click at net.sourceforge.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:181) at net.sourceforge.htmlunit.corejs.javascript.FunctionObject.call(FunctionObject.java:449) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1536) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:798) at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105) at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:411) at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:309) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3286) at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:115) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$3.doRun(JavaScriptEngine.java:827) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:939) at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628) at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:513) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:836) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:812) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:800) at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScriptIfPossible(HtmlPage.java:910) at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScript(HtmlPage.java:878) at com.suypower.AutoLogin12345.testHomePage(AutoLogin12345.java:48) at com.suypower.AutoLogin12345.main(AutoLogin12345.java:23) Caused by: com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot read property "nodeName" from null (http://xxx/305000772#7) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:954) at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628) at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:513) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:836) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:812) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:800) at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScriptIfPossible(HtmlPage.java:910) at com.gargoylesoftware.htmlunit.html.HtmlScript.executeInlineScriptIfNeeded(HtmlScript.java:354) at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:415) at com.gargoylesoftware.htmlunit.html.HtmlScript$3.execute(HtmlScript.java:271) at com.gargoylesoftware.htmlunit.html.HtmlScript.onAllChildrenAddedToPage(HtmlScript.java:293) at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:799) at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source) at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:756) at org.cyberneko.html.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:1170) at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1072) at org.cyberneko.html.filters.DefaultFilter.endElement(DefaultFilter.java:206) at org.cyberneko.html.filters.NamespaceBinder.endElement(NamespaceBinder.java:330) at org.cyberneko.html.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:3126) at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2093) at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:920) at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499) at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452) at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:1039) at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:252) at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:198) at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:271) at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:159) at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:478) at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:352) at com.gargoylesoftware.htmlunit.html.BaseFrameElement.loadInnerPageIfPossible(BaseFrameElement.java:183) at com.gargoylesoftware.htmlunit.html.BaseFrameElement.loadInnerPage(BaseFrameElement.java:121) at com.gargoylesoftware.htmlunit.html.HtmlPage.loadFrames(HtmlPage.java:1893) at com.gargoylesoftware.htmlunit.html.HtmlPage.initialize(HtmlPage.java:227) at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:485) at com.gargoylesoftware.htmlunit.WebClient.loadDownloadedResponses(WebClient.java:2135) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.doProcessPostponedActions(JavaScriptEngine.java:982) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.processPostponedActions(JavaScriptEngine.java:1072) at com.gargoylesoftware.htmlunit.html.DomElement.click(DomElement.java:789) at com.gargoylesoftware.htmlunit.html.HtmlImageInput.click(HtmlImageInput.java:152) at com.gargoylesoftware.htmlunit.javascript.host.html.HTMLInputElement.click(HTMLInputElement.java:477) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at net.sourceforge.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:153) ... 19 more Caused by: net.sourceforge.htmlunit.corejs.javascript.EcmaError: TypeError: Cannot read property "nodeName" from null (http://xxx/305000772#7) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3935) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3919) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError(ScriptRuntime.java:3944) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError2(ScriptRuntime.java:3960) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.undefReadError(ScriptRuntime.java:3971) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.getObjectProp(ScriptRuntime.java:1519) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1243) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:798) at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:118) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$3.doRun(JavaScriptEngine.java:827) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:939) ... 65 more ======= EXCEPTION END ======== |
希望能帮助到你,晚安! from:https://www.cnblogs.com/yimu/p/LOVE_HCJ.html
View Details模拟浏览器的神器 – HtmlUnit
PS:下面这个低本息我测试成功了,高版本怎么试都有问题。 随着Web的发展,RIA越来越多,JavaScript和Complex AJAX Libraries给网络爬虫带来了极大的挑战,解析页面的时候需要模拟浏览器执行JavaScript才能获得需要的文本内容。 好在有一个Java开源项目HtmlUnit,它能模拟Firefox、IE、Chrome等浏览器,不但可以用来测试Web应用,还可以用来解析包含JS的页面以提取信息。 下面看看HtmlUnit的效果如何: 首先,建立一个maven工程,引入junit依赖和HtmlUnit依赖:
1 2 3 4 5 6 7 8 9 10 11 |
<dependency> <groupId>junit</groupId> <artifactId>junit</artifactId> <version>4.8.2</version> <scope>test</scope> </dependency> <dependency> <groupId>net.sourceforge.htmlunit</groupId> <artifactId>htmlunit</artifactId> <version>2.14</version> </dependency> |
其次,写一个junit单元测试来使用HtmlUnit提取页面信息:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 |
/** * 使用HtmlUnit模拟浏览器执行JS来获取网页内容 * @author 杨尚川 */ public class HtmlUnitTest { @Test public void homePage() throws Exception { final WebClient webClient = new WebClient(BrowserVersion.INTERNET_EXPLORER_11); final HtmlPage page = webClient.getPage("http://yangshangchuan.iteye.com"); Assert.assertEquals("杨尚川的博客 - ITeye技术网站", page.getTitleText()); final String pageAsXml = page.asXml(); Assert.assertTrue(pageAsXml.contains("杨尚川,系统架构设计师,系统分析师,2013年度优秀开源项目APDPlat发起人,资深Nutch搜索引擎专家。多年专业的软件研发经验,从事过管理信息系统(MIS)开发、移动智能终端(Win CE、Android、Java ME)开发、搜索引擎(nutch、lucene、solr、elasticsearch)开发、大数据分析处理(Hadoop、Hbase、Pig、Hive)等工作。目前为独立咨询顾问,专注于大数据、搜索引擎等相关技术,为客户提供Nutch、Lucene、Hadoop、Solr、ElasticSearch、HBase、Pig、Hive、Gora等框架的解决方案、技术支持、技术咨询以及培训等服务。")); final String pageAsText = page.asText(); Assert.assertTrue(pageAsText.contains("[置顶] 国内首套免费的《Nutch相关框架视频教程》(1-20)")); webClient.closeAllWindows(); } @Test public void homePage_Firefox() throws Exception { final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24); final HtmlPage page = webClient.getPage("http://yangshangchuan.iteye.com"); Assert.assertEquals("杨尚川的博客 - ITeye技术网站", page.getTitleText()); webClient.closeAllWindows(); } @Test public void getElements() throws Exception { final WebClient webClient = new WebClient(BrowserVersion.CHROME); final HtmlPage page = webClient.getPage("http://yangshangchuan.iteye.com"); final HtmlDivision div = page.getHtmlElementById("blog_actions"); //获取子元素 Iterator<DomElement> iter = div.getChildElements().iterator(); while(iter.hasNext()){ System.out.println(iter.next().getTextContent()); } //获取所有输出链接 for(HtmlAnchor anchor : page.getAnchors()){ System.out.println(anchor.getTextContent()+" : "+anchor.getAttribute("href")); } webClient.closeAllWindows(); } @Test public void xpath() throws Exception { final WebClient webClient = new WebClient(); final HtmlPage page = webClient.getPage("http://yangshangchuan.iteye.com"); //获取所有博文标题 final List<HtmlAnchor> titles = (List<HtmlAnchor>)page.getByXPath("/html/body/div[2]/div[2]/div/div[16]/div/h3/a"); for(HtmlAnchor title : titles){ System.out.println(title.getTextContent()+" : "+title.getAttribute("href")); } //获取博主信息 final HtmlDivision div = (HtmlDivision) page.getByXPath("//div[@id='blog_owner_name']").get(0); System.out.println(div.getTextContent()); webClient.closeAllWindows(); } @Test public void submittingForm() throws Exception { final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24); final HtmlPage page = webClient.getPage("http://www.oschina.net"); // Form没有name和id属性 final HtmlForm form = page.getForms().get(0); final HtmlTextInput textField = form.getInputByName("q"); final HtmlButton button = form.getButtonByName(""); textField.setValueAttribute("APDPlat"); final HtmlPage resultPage = button.click(); final String pageAsText = resultPage.asText(); Assert.assertTrue(pageAsText.contains("找到约")); Assert.assertTrue(pageAsText.contains("条结果")); webClient.closeAllWindows(); } } |
最后,我们运行单元测试, 全部通过测试! from:http://yangshangchuan.iteye.com/blog/2036809
View Details前后端分离的思考与实践(一)
也谈基于 Node.js 的全栈式开发(基于 Node.js 的前后端分离) 前言 为了解决传统 Web 开发模式带来的各种问题,我们进行了许多尝试,但由于前/后端的物理鸿沟,尝试的方案都大同小异。痛定思痛,今天我们重新思考了“前后端”的定义,引入前端同学都熟悉的 Node.js,试图探索一条全新的前后端分离模式。 随着不同终端(Pad/Mobile/PC)的兴起,对开发人员的要求越来越高,纯浏览器端的响应式已经不能满足用户体验的高要求,我们往往需要针对不同的终端开发定制的版本。为了提升开发效率,前后端分离的需求越来越被重视,后端负责业务/数据接口,前端负责展现/交互逻辑,同一份数据接口,我们可以定制开发多个版本。 这个话题最近被讨论得比较多,阿里有些 BU 也在进行一些尝试。讨论了很久之后,我们团队决定探索一套基于 Node.js 的前后端分离方案,过程中有一些不断变化的认识以及思考,记录在这里,也希望看到的同学参与讨论,帮我们完善。 一、什么是前后端分离? 最开始组内讨论的过程中我发现,每个人对前后端分离的理解不一样,为了保证能在同一个频道讨论,先就什么是”前后端分离”达成一致。 大家一致认同的前后端分离的例子就是 SPA(Single Page Application),所有用到的展现数据都是后端通过异步接口(AJAX/JSONP)的方式提供的,前端只管展现。 从某种意义上来说,SPA 确实做到了前后端分离,但这种方式存在两个问题: WEB 服务中,SPA 类占的比例很少。很多场景下还有同步/同步+异步混合的模式,SPA 不能作为一种通用的解决方案。 现阶段的 SPA 开发模式,接口通常是按照展现逻辑来提供的,有时候为了提高效率,后端会帮我们处理一些展现逻辑,这就意味着后端还是涉足了 View 层的工作,不是真正的前后端分离。 SPA 式的前后端分离,是从物理层做区分(认为只要是客户端的就是前端,服务器端的就是后端),这种分法已经无法满足我们前后端分离的需求,我们认为从职责上划分才能满足目前我们的使用场景: 前端:负责 View 和 Controller 层。 后端:只负责 Model 层,业务处理/数据等。 为什么去做这种职责的划分,后面会继续探讨。 二、为什么要前后端分离? 关于这个问题,玉伯的文章《Web 研发模式演变》中解释得非常全面,我们再大概理一下: 2.1 现有开发模式的适用场景 玉伯提到的几种开发模式,各有各的适用场景,没有哪一种完全取代另外一种。 比如后端为主的 MVC,做一些同步展现的业务效率很高,但是遇到同步异步结合的页面,与后端开发沟通起来就会比较麻烦。 AJAX 为主 SPA 型开发模式,比较适合开发 APP 类型的场景,但是只适合做 APP,因为 SEO 等问题不好解决,对于很多类型的系统,这种开发方式也过重。 2.2 前后端职责不清 在业务逻辑复杂的系统里,我们最怕维护前后端混杂在一起的代码,因为没有约束,M-V-C每一层都可能出现别的层的代码,日积月累,完全没有维护性可言。 虽然前后端分离没办法完全解决这种问题,但是可以大大缓解。因为从物理层次上保证了你不可能这么做。 2.3 开发效率问题 淘宝的 Web 基本上都是基于 MVC 框架 webx,架构决定了前端只能依赖后端。 所以我们的开发模式依然是,前端写好静态 demo,后端翻译成 VM 模版,这种模式的问题就不说了,被吐槽了很久。 直接基于后端环境开发也很痛苦,配置安装使用都很麻烦。 为了解决这个问题,我们发明了各种工具,比如 VMarket,但是前端还是要写 VM,而且依赖后端数据,效率依然不高。 另外,后端也没法摆脱对展现的强关注,从而专心于业务逻辑层的开发。 2.4 对前端发挥的局限 性能优化如果只在前端做空间非常有限,于是我们经常需要后端合作才能碰撞出火花,但由于后端框架限制,我们很难使用 Comet、Bigpipe 等技术方案来优化性能。 为了解决以上提到的一些问题,我们进行了很多尝试,开发了各种工具,但始终没有太多起色,主要是因为我们只能在后端给我们划分的那一小块空间去发挥。只有真正做到前后端分离,我们才能彻底解决以上问题。 三、怎么做前后端分离? 怎么做前后端分离,其实第一节中已经有了答案: 前端:负责 […]
View Details