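What do the most popular Chinese open source projects look like? Which licenses do they choose? Which languages do they use? What do they do? We selected 144 high-quality open source projects hosted on Gitee.com (码云) to take a close look at the top open source projects in China. (Selection criteria: the project is part of GVP, Gitee's annual Most Valuable Project program, or has more than 1,000 stars.)

1. License distribution: permissive licenses come first

1.1 Permissive licenses are the first choice; Apache-2.0 accounts for 45.14%

Apache-2.0 is the first choice of open source authors on Gitee, at 45.14%; JFinal, t-io, and iBase4J all use it. Next comes MIT at 17.36%, with zheng and layui as representatives. Permissive licenses let users do almost anything with the software, so that everyone gets the most benefit from it. Apache-2.0 is the representative permissive license, and it also contains clauses under which contributors grant patent licenses to users; well-known software under Apache-2.0 includes Android, Apache, and Swift. The MIT license places almost no restrictions on users: they only have to keep the copyright notice and the license notice, and the authors accept no liability. That is a major reason for its popularity. Well-known international software under MIT includes jQuery, .NET Core, and Rails.

1.2 The restrictive licenses LGPL, GPL, and AGPL follow

LGPL, GPL, and AGPL are used by 2.78%, 8.35%, and 1.39% of the projects respectively. Their restrictions range from weak to strong:
If a project calls an LGPL-licensed library through dynamic linking, the project itself does not have to be open-sourced;
If a project contains GPL-licensed code, the whole project must be licensed under the GPL when it is distributed;
If a cloud service (SaaS) uses AGPL-licensed code, the code of the cloud service must be open-sourced as well.

Restrictive licenses were designed to help open source projects succeed: their detailed terms prevent developers from modifying the code without giving back to the community. But everything has two sides. A highly complex license restricts not only others but also the author, and that risk may be exactly why projects have been moving from restrictive to permissive licenses.

1.3 Awareness and use of open source licenses still need to improve

Of the 144 top projects in this survey, 24.31% have not chosen a license. Of the 7,000+ open source projects that Gitee has recommended, 43.95% have no license, and across all open source projects on Gitee the figure rises to 77.12%. Rules and constraints are a precondition for real freedom: open source values "freedom, openness, and sharing", and it works efficiently only when everyone follows the rules. Gitee calls on open source authors to make good use of licenses so that projects develop in a more orderly and healthy way. For guidance on choosing a license, see here.

2. Language distribution: Java is far ahead

2.1 Java projects account for well over half, at 65.73%

Java is far ahead among the top open source projects, at 65.73%. This includes high-quality projects such as guns, nutz, and jeecg; framework-style projects like these are popular with many developers. By 2018 Java was 22 years old. It performs well in practicality, performance, backward compatibility, and portability. With technology changing so quickly, can Java keep this advantage for the next ten or even twenty years? We will see.

2.2 PHP […]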
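One of my projects had a small requirement: restrict visitors by IP address and block unauthorized requests. A servlet filter is a very good fit for this scenario. SecurityFilter.java: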
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Assumption: Log/LogFactory come from Apache Commons Logging.
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class SecurityFilter implements Filter {

    private Log log = LogFactory.getLog(SecurityFilter.class);

    // Exact addresses and wildcard patterns (already converted to regex)
    private List<String> whitelist = new ArrayList<String>();
    private List<String> regexlist = new ArrayList<String>();

    private static final String _JSON_CONTENT = "application/json; charset=UTF-8";
    private static final String _HTML_CONTENT = "text/html; charset=UTF-8";
    // JSON requires double quotes around names and string values
    private static final String _403_JSON = "{\"code\": \"403\", \"msg\": \"Access denied: client is not authorized!\"}";
    private static final String _403_HTML = "<html><body><div style='text-align:center'><h1 style='margin-top: 10px;'>403 Forbidden!</h1><hr><span>@lichmama</span></div></body></html>";

    @Override
    public void destroy() {
    }

    @Override
    public void doFilter(ServletRequest servletrequest, ServletResponse servletresponse, FilterChain filterchain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) servletrequest;
        HttpServletResponse response = (HttpServletResponse) servletresponse;
        if (isSecurityRequest(request)) {
            filterchain.doFilter(request, response);
        } else {
            log.info("Rejected request from [" + request.getRemoteAddr() + "]: " + request.getRequestURI());
            response.setStatus(403);
            if (isAjaxRequest(request)) {
                response.setContentType(_JSON_CONTENT);
                response.getWriter().print(_403_JSON);
            } else {
                response.setContentType(_HTML_CONTENT);
                response.getWriter().print(_403_HTML);
            }
        }
    }

    @Override
    public void init(FilterConfig filterconfig) throws ServletException {
        String allowedIP = filterconfig.getInitParameter("allowedIP");
        if (allowedIP != null && allowedIP.length() > 0) {
            for (String item : allowedIP.split(",\\s*")) {
                // The wildcard * is supported: it stands for one 1-3 digit segment
                if (item.contains("*")) {
                    String regex = item.replace(".", "\\.").replace("*", "\\d{1,3}");
                    regexlist.add(regex);
                } else {
                    whitelist.add(item);
                }
            }
        }
    }

    /**
     * Checks whether the current request comes from a trusted address.
     *
     * @param request the incoming request
     * @return true if the remote address is whitelisted
     */
    private boolean isSecurityRequest(HttpServletRequest request) {
        String ip = request.getRemoteAddr();
        for (String item : whitelist) {
            if (ip.equals(item))
                return true;
        }
        for (String item : regexlist) {
            if (ip.matches(item))
                return true;
        }
        return false;
    }

    /**
     * Checks whether the request is an AJAX request.
     *
     * @param request the incoming request
     * @return true if the X-Requested-With header is XMLHttpRequest
     */
    private boolean isAjaxRequest(HttpServletRequest request) {
        String header = request.getHeader("X-Requested-With");
        return "XMLHttpRequest".equalsIgnoreCase(header);
    }
}
Add the following configuration to web.xml:
<filter>
    <filter-name>securityFilter</filter-name>
    <filter-class>com.lichmama.webdemo.filter.SecurityFilter</filter-class>
    <init-param>
        <param-name>allowedIP</param-name>
        <param-value>192.168.5.*</param-value>
    </init-param>
</filter>
<filter-mapping>
    <filter-name>securityFilter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>
from:https://www.cnblogs.com/lichmama/p/7063587.html
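The wildcard handling in init() can be tried out without a servlet container. The following is a minimal sketch, not the author's code: IpWhitelist is a hypothetical helper class that applies the same split-and-convert rule to a comma-separated allowedIP value and then answers whether a given address is allowed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: the whitelist rule of SecurityFilter without the servlet API. */
final class IpWhitelist {
    private final List<String> exact = new ArrayList<>();
    private final List<Pattern> patterns = new ArrayList<>();

    IpWhitelist(String allowedIp) {
        if (allowedIp == null || allowedIp.trim().isEmpty()) {
            return;
        }
        for (String item : allowedIp.trim().split(",\\s*")) {
            if (item.contains("*")) {
                // Escape the dots, then let each * stand for one 1-3 digit segment.
                String regex = item.replace(".", "\\.").replace("*", "\\d{1,3}");
                patterns.add(Pattern.compile(regex));
            } else {
                exact.add(item);
            }
        }
    }

    boolean isAllowed(String ip) {
        if (exact.contains(ip)) {
            return true;
        }
        for (Pattern p : patterns) {
            // matches() requires the whole address to match, not just a prefix.
            if (p.matcher(ip).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        IpWhitelist whitelist = new IpWhitelist("127.0.0.1, 192.168.5.*");
        System.out.println(whitelist.isAllowed("192.168.5.23"));
        System.out.println(whitelist.isAllowed("192.168.50.1"));
    }
}
```

Precompiling the patterns once is a design choice of this sketch; the filter above calls String.matches on every request, which recompiles the regex each time. Note also that behind a reverse proxy getRemoteAddr() usually returns the proxy's address, so the whitelist would then need to be checked against whatever address the deployment actually trusts.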
This article describes how to get the URL of the current request and the server root path in Java, with examples, for anyone who wants to use it as a reference.

1. Getting the URL of the current request

String requestUrl = request.getScheme()   // protocol used by the current link
    + "://" + request.getServerName()      // server address
    + ":" + request.getServerPort()        // port number
    + request.getContextPath()             // context path (application name)
    + request.getServletPath()             // relative URL of the request
    + "?" + request.getQueryString();      // request parameters

Example:

http://127.0.0.1:8080/world/index.jsp?name=lilei&sex=1
<Context path="/world" docBase="/home/webapps" debug="0" reloadable="true"/>
request.getScheme() = "http";
request.getServerName() = "127.0.0.1";
request.getServerPort() = 8080;
request.getContextPath() = "/world";
request.getServletPath() = "/index.jsp";
request.getQueryString() = "name=lilei&sex=1";

http://127.0.0.1:8080/world/index.jsp?name=lilei&sex=1
<Context path="" docBase="/home/webapps" debug="0" reloadable="true"/>
request.getScheme() = "http";
request.getServerName() = […]
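The concatenation at the start of this post can be wrapped in a small helper. This is only a sketch under stated assumptions: RequestUrls and buildUrl are hypothetical names, the parameters stand for the values returned by the corresponding HttpServletRequest getters, and skipping the default port and a null query string are choices of this sketch rather than something the original post does.

```java
/** Hypothetical helper: rebuilds a request URL from the parts returned by HttpServletRequest. */
final class RequestUrls {

    static String buildUrl(String scheme, String serverName, int port,
                           String contextPath, String servletPath, String queryString) {
        StringBuilder url = new StringBuilder();
        url.append(scheme).append("://").append(serverName);
        // Leave out the port when it is the default one for the scheme.
        boolean defaultPort = ("http".equals(scheme) && port == 80)
                || ("https".equals(scheme) && port == 443);
        if (!defaultPort) {
            url.append(':').append(port);
        }
        // The context path is "" for the root application and "/name" otherwise.
        url.append(contextPath).append(servletPath);
        // getQueryString() returns null when the URL has no query part.
        if (queryString != null) {
            url.append('?').append(queryString);
        }
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("http", "127.0.0.1", 8080,
                "/world", "/index.jsp", "name=lilei&sex=1"));
    }
}
```

Inside a servlet the call would look like buildUrl(request.getScheme(), request.getServerName(), request.getServerPort(), request.getContextPath(), request.getServletPath(), request.getQueryString()).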
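/** Detects the client platform in Java. */
public static String getPlatform(HttpServletRequest request) {
    /*
     * The User-Agent (UA) is a special request header that lets the server
     * identify the client's operating system and version, CPU type, browser
     * and version, rendering engine, browser language, browser plugins, etc.
     */
    String agent = request.getHeader("user-agent");
    if (agent == null) {
        // No User-Agent header: fall back to the default type
        return "pc";
    }
    // Client type
    String type;
    if (agent.contains("iPhone") || agent.contains("iPod") || agent.contains("iPad")) {
        type = "ios";
    } else if (agent.contains("Android") || agent.contains("Linux")) {
        type = "apk";
    } else if (agent.toLowerCase().contains("micromessenger")) {
        type = "wx";
    } else {
        type = "pc";
    }
    return type;
}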
from:https://blog.csdn.net/mr_caoshuai/article/details/78284010
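MyBatis started out as iBatis, an Apache open source project. In 2010 the project moved from the Apache Software Foundation to Google Code and was renamed MyBatis. In November 2013 it moved to GitHub. The name iBATIS is a combination of "internet" and "abatis"; it is a Java-based persistence framework, and what it provides includes SQL Maps and Data Access Objects (DAOs). MyBatis is an excellent persistence framework that supports custom SQL, stored procedures, and advanced mappings. It eliminates almost all of the JDBC code, the manual setting of parameters, and the retrieval of result sets. MyBatis can be configured with simple XML or annotations, which it uses to map primitives, interfaces, and Java POJOs (Plain Old Java Objects) to database records.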
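If you use Node.js as a middle layer, you can stop reading here: the SEO problem does not exist, and the front end is no longer just a front end but full stack. This article is about pages that are rendered after calling an API through Ajax. In that case the HTML source contains no data, so search engines cannot index anything useful, let alone pick up updates.

My approach is roughly as follows:
1. Do not modify the source code of the original project.
2. Optimize only for search engines.
3. Use the open source Java project HtmlUnit as a relay. HtmlUnit emulates popular browser engines but has no user interface. A page that passes through it comes out as the HTML source after Ajax has filled in the data.

That leads to these steps:
1. Set up an HtmlUnit conversion project. HtmlUnit also has a .NET version, but in my tests it was not efficient; the original Java project performs better, so just use Java. It does not matter if you do not know Java: a few dozen lines of code are enough. See my previous posts.
2. Write an interceptor in the language of your project. For Java there is nothing to add: you can skip step 1, import the library, and use it directly. For .NET it is best to write an HttpModule, which does not pollute the original project and also performs well. For PHP, frameworks such as laravel/yii/tp already work on an interceptor mechanism.
3. Prepare the key User-Agent tokens of the search engines for the interceptor. Here are mine: Baiduspider, Googlebot, bingbot, 360Spider, Sogou web spider, Yahoo! Slurp, YoudaoBot, Sosospider. The names make it easy to guess which search engine each belongs to.
4. When the interceptor detects a search engine, call HtmlUnit to produce the complete HTML source; once the data is there, it gets indexed.

I hope this helps. If anything is unclear or you have comments, feel free to contact me through the email address at the bottom of this blog.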
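The detection part of the steps above can be sketched as follows. This is an illustration under assumptions, not the author's interceptor: CrawlerDetector and isCrawler are hypothetical names, the token list is the one given in the post, and comparing case-insensitively is a choice of this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: decides whether a User-Agent belongs to a search engine crawler. */
final class CrawlerDetector {
    // Tokens taken from the list in the post above.
    private static final List<String> TOKENS = Arrays.asList(
            "Baiduspider", "Googlebot", "bingbot", "360Spider",
            "Sogou web spider", "Yahoo! Slurp", "YoudaoBot", "Sosospider");

    static boolean isCrawler(String userAgent) {
        if (userAgent == null) {
            return false;
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        for (String token : TOKENS) {
            // Case-insensitive substring match against each known token.
            if (ua.contains(token.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isCrawler(
                "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"));
        System.out.println(isCrawler(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"));
    }
}
```

In an interceptor, a true result would route the request to the HtmlUnit conversion, while every other request goes through unchanged. User-Agent strings are easy to spoof, so this check is only suitable for deciding what to render, not for access control.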
The introduction on the HtmlUnit website: HtmlUnit is a Java-based browser without a graphical interface. It models HTML documents and provides an API that lets developers work as if in a normal browser: fetch page content, fill in forms, click links, and so on. Its JavaScript support is very good and still improving, it can handle quite complex AJAX libraries, and it can simulate Chrome, Firefox, and IE depending on the configuration.

This article uses the scraping of a football lottery site as an example to get familiar with HtmlUnit.

WebClient wc = new WebClient(BrowserVersion.FIREFOX_38);
wc.getOptions().setJavaScriptEnabled(true);              // enable the JS interpreter (true by default)
wc.setJavaScriptTimeout(100000);                         // timeout for JS execution
wc.getOptions().setCssEnabled(false);                    // disable CSS support
wc.getOptions().setThrowExceptionOnScriptError(false);   // whether to throw an exception on JS errors
wc.getOptions().setTimeout(10000);                       // connection timeout, 10 s here; 0 means wait indefinitely
wc.setAjaxController(new NicelyResynchronizingAjaxController()); // enable AJAX support
wc.setWebConnection(
    new WebConnectionWrapper(wc) {
        public WebResponse getResponse(WebRequest request) throws IOException {
            ……
        }
    }
);
HtmlPage page = wc.getPage("http://XXXX.com/");
FileWriter fileWriter = new FileWriter("D:\\text.html");
String str = "";
// get the XML source of the page
str = page.asXml();
fileWriter.write(str);
// close the WebClient
wc.close();
fileWriter.close();

Fixing garbled data

The site's data is loaded dynamically by JS, and the JS files use two different encodings:

<script language="javascript" src="XXX.js" charset="gb2312"></script>
<script language="javascript" src="XXX.js" charset="utf-8"></script>

The response can be modified by overriding the getResponse method of WebConnectionWrapper. For example, to modify the result returned for bfdata.js:

wc.setWebConnection(
    new WebConnectionWrapper(wc) {
        public WebResponse getResponse(WebRequest request) throws IOException {
            WebResponse response = super.getResponse(request);
            if […]
/**
 * Writes text to the response.
 * @param response the servlet response
 * @param s the text to write
 */
public static void responseOut(HttpServletResponse response, String s) {
    response.setContentType("text/html;charset=UTF-8");
    response.setCharacterEncoding("UTF-8");
    try (PrintWriter pw = response.getWriter()) {
        pw.write(s);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
from:https://www.cnblogs.com/yanqin/p/7463294.html
PS: the only line I actually needed was this one: webClient.getOptions().setThrowExceptionOnScriptError(false); The HtmlUnit jars are available at http://sourceforge.net/projects/htmlunit/files/htmlunit/ and the demo code is as follows:
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlPasswordInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

public class AutoLogin {

    /** Login page */
    private static final String LOGIN_URL = "http://website/login.aspx";
    /** Task list page */
    private static final String TASK_LIST_URL = "http://website/Banli.aspx";

    /**
     * @param args
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        testHomePage();
    }

    public static void testHomePage() throws Exception {
        final WebClient webClient = new WebClient(BrowserVersion.INTERNET_EXPLORER_8);
        webClient.getOptions().setThrowExceptionOnScriptError(false); // this line is required
        webClient.getOptions().setCssEnabled(false);
        // webClient.getOptions().setJavaScriptEnabled(true);
        // webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setTimeout(300000);

        // Load the login page
        HtmlPage page = (HtmlPage) webClient.getPage(LOGIN_URL);

        // Look up the form by name; it can also be fetched by index: page.getForms().get(0)
        final HtmlForm form = page.getFormByName("form1");

        // User name / password
        HtmlTextInput textUserName = form.getInputByName("txtUserName");
        textUserName.setText("username");
        HtmlPasswordInput txtPwd = form.getInputByName("txtPwd");
        txtPwd.setText("pass");

        // Trigger the login button through JavaScript
        Page page1 = page.executeJavaScript("$('#btn').click()").getNewPage();
        page1 = webClient.getPage(TASK_LIST_URL);

        System.out.println("*************************************************************************************");
        System.out.println(page1.getWebResponse().getContentAsString());
        System.out.println("*************************************************************************************");
        System.out.println("");
        System.out.println("Cookies : " + webClient.getCookieManager().getCookies().toString());
    }
}
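I could not work out what ASP.NET was doing internally and tried many approaches without success. After going through countless sites, I stumbled on this configuration: http://stackoverflow.com/questions/20352284/scraping-aspx-page-using-htmlunit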
import java.io.IOException;
import java.net.MalformedURLException;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class teste {

    public static void main(String args[])
            throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        HtmlPage page = null;
        String url = "http://www.bmfbovespa.com.br/cias-listadas/empresas-listadas/BuscaEmpresaListada.aspx?Idioma=pt-br";

        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_17);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(false);
        webClient.getOptions().setJavaScriptEnabled(false);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setTimeout(30000);

        page = webClient.getPage(url);
        System.out.println("Current page: Empresas Listadas | BM&FBOVESPA");

        HtmlElement theElement1 = (HtmlElement) page
                .getElementById("ctl00_contentPlaceHolderConteudo_BuscaNomeEmpresa1_btnTodas");
        page = theElement1.click();

        System.out.println(page.asText());
        System.out.println("Test has completed successfully");
    }
}
In my final tests, without webClient.getOptions().setThrowExceptionOnScriptError(false); the following error was thrown every time:
Exception in thread "main" ======= EXCEPTION START ======== Exception class=[java.lang.RuntimeException] com.gargoylesoftware.htmlunit.ScriptException: Exception invoking click at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:954) at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628) at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:513) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:836) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:812) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:800) at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScriptIfPossible(HtmlPage.java:910) at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScript(HtmlPage.java:878) at com.suypower.AutoLogin12345.testHomePage(AutoLogin12345.java:48) at com.suypower.AutoLogin12345.main(AutoLogin12345.java:23) Caused by: java.lang.RuntimeException: Exception invoking click at net.sourceforge.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:181) at net.sourceforge.htmlunit.corejs.javascript.FunctionObject.call(FunctionObject.java:449) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1536) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:798) at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105) at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:411) at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:309) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3286) at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:115) at 
com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$3.doRun(JavaScriptEngine.java:827) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:939) ... 9 more Caused by: com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot read property "nodeName" from null (http://xxxx/305000772#7) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:954) at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628) at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:513) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:836) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:812) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:800) at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScriptIfPossible(HtmlPage.java:910) at com.gargoylesoftware.htmlunit.html.HtmlScript.executeInlineScriptIfNeeded(HtmlScript.java:354) at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:415) at com.gargoylesoftware.htmlunit.html.HtmlScript$3.execute(HtmlScript.java:271) at com.gargoylesoftware.htmlunit.html.HtmlScript.onAllChildrenAddedToPage(HtmlScript.java:293) at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:799) at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source) at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:756) at org.cyberneko.html.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:1170) at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1072) at org.cyberneko.html.filters.DefaultFilter.endElement(DefaultFilter.java:206) at org.cyberneko.html.filters.NamespaceBinder.endElement(NamespaceBinder.java:330) at 
org.cyberneko.html.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:3126) at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2093) at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:920) at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499) at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452) at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:1039) at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:252) at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:198) at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:271) at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:159) at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:478) at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:352) at com.gargoylesoftware.htmlunit.html.BaseFrameElement.loadInnerPageIfPossible(BaseFrameElement.java:183) at com.gargoylesoftware.htmlunit.html.BaseFrameElement.loadInnerPage(BaseFrameElement.java:121) at com.gargoylesoftware.htmlunit.html.HtmlPage.loadFrames(HtmlPage.java:1893) at com.gargoylesoftware.htmlunit.html.HtmlPage.initialize(HtmlPage.java:227) at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:485) at com.gargoylesoftware.htmlunit.WebClient.loadDownloadedResponses(WebClient.java:2135) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.doProcessPostponedActions(JavaScriptEngine.java:982) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.processPostponedActions(JavaScriptEngine.java:1072) at com.gargoylesoftware.htmlunit.html.DomElement.click(DomElement.java:789) at com.gargoylesoftware.htmlunit.html.HtmlImageInput.click(HtmlImageInput.java:152) at 
com.gargoylesoftware.htmlunit.javascript.host.html.HTMLInputElement.click(HTMLInputElement.java:477) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at net.sourceforge.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:153) ... 19 more Caused by: net.sourceforge.htmlunit.corejs.javascript.EcmaError: TypeError: Cannot read property "nodeName" from null (http://xxxx/305000772#7) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3935) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3919) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError(ScriptRuntime.java:3944) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError2(ScriptRuntime.java:3960) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.undefReadError(ScriptRuntime.java:3971) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.getObjectProp(ScriptRuntime.java:1519) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1243) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:798) at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:118) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$3.doRun(JavaScriptEngine.java:827) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:939) ... 
65 more Enclosed exception: java.lang.RuntimeException: Exception invoking click at net.sourceforge.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:181) at net.sourceforge.htmlunit.corejs.javascript.FunctionObject.call(FunctionObject.java:449) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1536) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:798) at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105) at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:411) at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:309) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3286) at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:115) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$3.doRun(JavaScriptEngine.java:827) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:939) at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628) at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:513) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:836) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:812) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:800) at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScriptIfPossible(HtmlPage.java:910) at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScript(HtmlPage.java:878) at com.suypower.AutoLogin12345.testHomePage(AutoLogin12345.java:48) at com.suypower.AutoLogin12345.main(AutoLogin12345.java:23) Caused by: com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot read 
property "nodeName" from null (http://xxx/305000772#7) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:954) at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628) at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:513) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:836) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:812) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:800) at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScriptIfPossible(HtmlPage.java:910) at com.gargoylesoftware.htmlunit.html.HtmlScript.executeInlineScriptIfNeeded(HtmlScript.java:354) at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:415) at com.gargoylesoftware.htmlunit.html.HtmlScript$3.execute(HtmlScript.java:271) at com.gargoylesoftware.htmlunit.html.HtmlScript.onAllChildrenAddedToPage(HtmlScript.java:293) at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:799) at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source) at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:756) at org.cyberneko.html.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:1170) at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1072) at org.cyberneko.html.filters.DefaultFilter.endElement(DefaultFilter.java:206) at org.cyberneko.html.filters.NamespaceBinder.endElement(NamespaceBinder.java:330) at org.cyberneko.html.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:3126) at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2093) at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:920) at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499) at 
org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452) at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:1039) at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:252) at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:198) at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:271) at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:159) at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:478) at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:352) at com.gargoylesoftware.htmlunit.html.BaseFrameElement.loadInnerPageIfPossible(BaseFrameElement.java:183) at com.gargoylesoftware.htmlunit.html.BaseFrameElement.loadInnerPage(BaseFrameElement.java:121) at com.gargoylesoftware.htmlunit.html.HtmlPage.loadFrames(HtmlPage.java:1893) at com.gargoylesoftware.htmlunit.html.HtmlPage.initialize(HtmlPage.java:227) at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:485) at com.gargoylesoftware.htmlunit.WebClient.loadDownloadedResponses(WebClient.java:2135) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.doProcessPostponedActions(JavaScriptEngine.java:982) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.processPostponedActions(JavaScriptEngine.java:1072) at com.gargoylesoftware.htmlunit.html.DomElement.click(DomElement.java:789) at com.gargoylesoftware.htmlunit.html.HtmlImageInput.click(HtmlImageInput.java:152) at com.gargoylesoftware.htmlunit.javascript.host.html.HTMLInputElement.click(HTMLInputElement.java:477) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at 
java.lang.reflect.Method.invoke(Method.java:606) at net.sourceforge.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:153) ... 19 more Caused by: net.sourceforge.htmlunit.corejs.javascript.EcmaError: TypeError: Cannot read property "nodeName" from null (http://xxx/305000772#7) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3935) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3919) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError(ScriptRuntime.java:3944) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError2(ScriptRuntime.java:3960) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.undefReadError(ScriptRuntime.java:3971) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.getObjectProp(ScriptRuntime.java:1519) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1243) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:798) at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:118) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$3.doRun(JavaScriptEngine.java:827) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:939) ... 65 more ======= EXCEPTION END ======== |
I hope this helps. Good night! from:https://www.cnblogs.com/yimu/p/LOVE_HCJ.html
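PS: I got the older version below to work; with newer versions I ran into problems no matter what I tried. As the Web has developed, RIAs have become more common, and JavaScript and complex AJAX libraries pose a big challenge for web crawlers: to get the text you need, the page has to be parsed while simulating a browser that executes the JavaScript. Fortunately there is an open source Java project, HtmlUnit, that can simulate Firefox, IE, Chrome, and other browsers. It can be used not only to test web applications but also to parse pages that contain JS and extract information from them. Let's see how HtmlUnit does. First, create a Maven project and add the JUnit and HtmlUnit dependencies: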
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.8.2</version>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.14</version>
</dependency>
Next, write a JUnit test that uses HtmlUnit to extract information from a page:
import java.util.Iterator;
import java.util.List;

import org.junit.Assert;
import org.junit.Test;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomElement;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlButton;
import com.gargoylesoftware.htmlunit.html.HtmlDivision;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

/**
 * Uses HtmlUnit to simulate a browser that executes JS in order to fetch page content.
 * @author 杨尚川
 */
public class HtmlUnitTest {

    @Test
    public void homePage() throws Exception {
        final WebClient webClient = new WebClient(BrowserVersion.INTERNET_EXPLORER_11);
        final HtmlPage page = webClient.getPage("http://yangshangchuan.iteye.com");
        Assert.assertEquals("杨尚川的博客 - ITeye技术网站", page.getTitleText());
        final String pageAsXml = page.asXml();
        Assert.assertTrue(pageAsXml.contains("杨尚川,系统架构设计师,系统分析师,2013年度优秀开源项目APDPlat发起人,资深Nutch搜索引擎专家。多年专业的软件研发经验,从事过管理信息系统(MIS)开发、移动智能终端(Win CE、Android、Java ME)开发、搜索引擎(nutch、lucene、solr、elasticsearch)开发、大数据分析处理(Hadoop、Hbase、Pig、Hive)等工作。目前为独立咨询顾问,专注于大数据、搜索引擎等相关技术,为客户提供Nutch、Lucene、Hadoop、Solr、ElasticSearch、HBase、Pig、Hive、Gora等框架的解决方案、技术支持、技术咨询以及培训等服务。"));
        final String pageAsText = page.asText();
        Assert.assertTrue(pageAsText.contains("[置顶] 国内首套免费的《Nutch相关框架视频教程》(1-20)"));
        webClient.closeAllWindows();
    }

    @Test
    public void homePage_Firefox() throws Exception {
        final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24);
        final HtmlPage page = webClient.getPage("http://yangshangchuan.iteye.com");
        Assert.assertEquals("杨尚川的博客 - ITeye技术网站", page.getTitleText());
        webClient.closeAllWindows();
    }

    @Test
    public void getElements() throws Exception {
        final WebClient webClient = new WebClient(BrowserVersion.CHROME);
        final HtmlPage page = webClient.getPage("http://yangshangchuan.iteye.com");
        final HtmlDivision div = page.getHtmlElementById("blog_actions");
        // Get the child elements
        Iterator<DomElement> iter = div.getChildElements().iterator();
        while (iter.hasNext()) {
            System.out.println(iter.next().getTextContent());
        }
        // Get all outgoing links
        for (HtmlAnchor anchor : page.getAnchors()) {
            System.out.println(anchor.getTextContent() + " : " + anchor.getAttribute("href"));
        }
        webClient.closeAllWindows();
    }

    @Test
    @SuppressWarnings("unchecked")
    public void xpath() throws Exception {
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage("http://yangshangchuan.iteye.com");
        // Get the titles of all blog posts
        final List<HtmlAnchor> titles = (List<HtmlAnchor>) page.getByXPath("/html/body/div[2]/div[2]/div/div[16]/div/h3/a");
        for (HtmlAnchor title : titles) {
            System.out.println(title.getTextContent() + " : " + title.getAttribute("href"));
        }
        // Get the blog owner's information
        final HtmlDivision div = (HtmlDivision) page.getByXPath("//div[@id='blog_owner_name']").get(0);
        System.out.println(div.getTextContent());
        webClient.closeAllWindows();
    }

    @Test
    public void submittingForm() throws Exception {
        final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24);
        final HtmlPage page = webClient.getPage("http://www.oschina.net");
        // The form has no name or id attribute
        final HtmlForm form = page.getForms().get(0);
        final HtmlTextInput textField = form.getInputByName("q");
        final HtmlButton button = form.getButtonByName("");
        textField.setValueAttribute("APDPlat");
        final HtmlPage resultPage = button.click();
        final String pageAsText = resultPage.asText();
        Assert.assertTrue(pageAsText.contains("找到约"));
        Assert.assertTrue(pageAsText.contains("条结果"));
        webClient.closeAllWindows();
    }
}
Finally, run the unit tests: they all pass! from:http://yangshangchuan.iteye.com/blog/2036809
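The XPath expressions in the xpath() test can be tried out offline on a small document. The sketch below deliberately swaps HtmlUnit for the XPath engine built into the JDK (javax.xml.xpath) so that it needs no network and no extra dependency; XPathSketch, the SAMPLE markup, and the class='blog' attribute are made up for illustration and do not reflect the real structure of the blog page.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical sketch: JDK XPath on a tiny well-formed document instead of a live page. */
final class XPathSketch {
    // Invented sample markup; only the id blog_owner_name is borrowed from the test above.
    static final String SAMPLE = "<html><body>"
            + "<div id='blog_owner_name'>owner</div>"
            + "<div class='blog'><h3><a href='/blog/1'>First post</a></h3></div>"
            + "<div class='blog'><h3><a href='/blog/2'>Second post</a></h3></div>"
            + "</body></html>";

    /** Returns "title : href" for every post link, in document order. */
    static List<String> titles(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate("//div[@class='blog']/h3/a",
                    new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element a = (Element) nodes.item(i);
                result.add(a.getTextContent() + " : " + a.getAttribute("href"));
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the text of the element with id blog_owner_name. */
    static String ownerName(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate("//div[@id='blog_owner_name']",
                    new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String title : titles(SAMPLE)) {
            System.out.println(title);
        }
        System.out.println(ownerName(SAMPLE));
    }
}
```

An attribute-based expression such as //div[@id='blog_owner_name'] tends to survive layout changes better than a positional path like /html/body/div[2]/div[2]/div/div[16]/div/h3/a, which stops matching as soon as one container is added or removed. Unlike HtmlUnit, the JDK parser only accepts well-formed XML, so this sketch is for experimenting with expressions, not for parsing real pages.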