Simple design is zen

Dan North (a.k.a. the BDD guy) tweets:

SOLID is a well-known set of object-oriented design principles; Uncle Bob introduced the acronym in his book. If someone asks you about design principles in a technical interview, it’s probably the safest answer. But I totally get Dan’s point: the acronym is cool but hard to apply in practice. Everyone who is good at code knows what good design is, and the principles describe it well. But most of the time they are like a review of the stock market: the review makes sense, but you can’t use it to predict the market. They feel like confirmation bias to me.

Please take a look at Dan’s deck “Why Every Element of SOLID is Wrong” – it’s humorous. His point is “Write Simple Code”, which I dig. It’s a way of saying: don’t trust stock-market predictions, use your common sense. It’s not that any simple code is good code; it’s just saying you should “Chop Wood, Carry Water”.

The goal of Zen, his master taught him, was to “achieve a void … noiseless, colorless, heatless void” – to get to that state of emptiness, whether it was on the mound or in the batter’s box or at practice.

Before that, Zhuang Zhou, the Chinese philosopher, said, “Tao is in the emptiness. Emptiness is the fast of the mind.”

Quoted from the book Stillness Is the Key.

I feel the “Write Simple Code” tagline is in the spirit of Zen. Always wanting to apply best practices and design principles feels like ego. We have the impression that pattern matching produces perfect code, but the order is wrong: we can refactor code toward patterns, while paying the upfront cost of a pattern may be a waste. If we can put down our ego, we can go back to emptiness and write simple code. Make It Work, Make It Right, Make It Fast. Go back to basics: Chop Wood, Carry Water; Chop Wood, Carry Water.

rMBP kernel_task CPU spikes when connecting more than one external monitor

This “bug” bothers me frequently, and I never knew why. I didn’t think it was related to connecting more than one external monitor, so I always searched for “kernel_task high cpu”, which leads to fixes similar to the one I finally found.

TL;DR

  • If the problem is highly correlated with connecting more than one monitor and you don’t really need that many, the easy fix is to repurpose one of your monitors for another task and work with only one external monitor.
  • There’s a scientific way to figure out which kernel extension causes the issue. If you’d like to fix it that way, please read on.

So some faulty kernel extension tricks kernel_task, and kernel_task tries to steal CPU time to cool the CPU down, which makes the system very unresponsive. This issue normally costs me a couple of hours, and I can’t get back to productivity.

I thought these three things might fix it:

But those methods are like voodoo 🙂 I even tweeted that an SMC reset fixed my problem, but it turned out not to be true.

Until I finally found that the issue is correlated with connecting more than one monitor. Yes: connecting no external monitor, or only one, fixes the problem. My daily setup connects the rMBP (MacBookPro11,3, MacBook Pro Retina, 15-inch, Late 2013) to two monitors: a 24-inch Dell and a 30-inch Dell. I found that unplugging one monitor fixes the kernel_task CPU usage issue within a couple of minutes.

So I changed my Google keywords, and found a better answer immediately.

That article actually overlaps mostly with How to fix kernel_task CPU usage on El Capitan. The difference is that it tells you to “poke each of the kexts to figure out which one is the lemon”. This makes it more like science 🙂

Macs have serious software quality issues, but the intention is good: you can seamlessly upgrade across major OS versions, and you aren’t bothered by “driver updates” like on Windows (although Windows 10 has addressed the driver issues pretty well already). That doesn’t mean the OS won’t load the wrong thing as a kernel extension, which ultimately causes weird issues. Apple also has a bad knowledge base: you can’t find anything useful on their “wiki”, only noise in the user forums, and Apple is very bad at indexing them.

OK, rant finished. This is what I learned:

  • Listen to the problem carefully. Try to understand the symptom better.
  • Avoid voodoo fixes.
  • When you get low-quality results, change your search keywords. Sometimes it’s hard to craft better keywords, but keywords are the key in the search engine era.

And how do you fix this issue on your MacBook?

  • See if it’s related to connecting more than one monitor, because if unplugging solves the problem, you don’t need a deeper fix while there are other high-priority tasks.
  • Perform an SMC + PRAM reset. This is almost free.
  • Disable SIP, and re-enable it once you’ve fixed the issue.
  • Disable the kexts one by one, as [Technology] kernel_task consumes almost 100% of CPU on Mac OS X suggests. Don’t follow other instructions to delete your exact Mac model’s plist; just disable the kext as a whole, which is easier to recover from and saves your Googling time too.
  • Make sure you Time Machine your Mac.

Finally, I need to say OS X has been breaking often recently. Maybe I should say it never works correctly; it almost always accidentally, barely works. These components break for me frequently:

  • kernel_task, which is what I complained about above.
  • Keychain breaks frequently, and Keychain First Aid is gone.
  • Disk integrity and permissions break frequently, and you need to restart into recovery mode to check and fix them.
  • Spotlight breaks frequently.
  • The network (and VPN) is not as stable as before, but this might be a false impression due to lack of trust.
  • Updates no longer always make the OS better.

Dan Abramov’s Redux lessons are great

My awesome colleague Dustan Kasten recommended Dan Abramov’s Getting Started with Redux on egghead.io.

It includes 30 short video lessons, which are a great example of Refactoring to Patterns, although the goal is not to arrive at any particular pattern; the goal is just code that makes sense. The refactoring process is exciting to watch. Nowadays the JavaScript community is full of micro-libraries, like react-redux, and these videos actually show the intention behind refactoring and extracting reusable code into a tiny library.

Thanks Dan and egghead.io!

Why I prefer using new/prototype/this to ‘createClass’?

Here are my thoughts on “factory” vs. “new/class”. First, I agree that class is not necessary when we use the prototype system in JavaScript. And the prototype is superior to the class; that’s why we love JavaScript.

But when we design a system, we need tools to help us minimize the side effects between our internal API calls. Building a zero-side-effect system is possible, but it doesn’t make much sense. Closures are an important feature of JavaScript; a closure is mutable, and we commonly use that to keep state inside a function. Such functions are not pure anymore, but that’s a sweet spot in the middle of the spectrum between pure and impure functional programming. So the rule of thumb here is to localize the side effects, and keep them physically close to the functions. That’s one way to describe how object-oriented design marries functional programming happily.

The previous paragraph helps explain why the prototype is good. We can use the object to localize its state (the side effects), and the prototype holds the functions that apply side effects to it. This style helps us get rid of the dirt of classical OO’s class system. So here’s a new question: are class and new harmful in JavaScript? My answer is no.

So the intention behind functional, localized side effects is good, and we should think about the right tool or pattern to achieve it. There are two common patterns in JavaScript:

  1. Factory:
var User = function() {
    var privateState = {}

    var setPrivateState = function(value1) {
        privateState.state1 = value1;
    }

    return {
        publicMethod: function(value1) {
            setPrivateState(value1);
            this.otherPublicMethod();
        },
        otherPublicMethod: function() {}
    }
}
var user = User();
  2. Use function, prototype, and new:
var User = function constructor() {
    this.privateState = {}
}

User.prototype = Object.create({
    _setPrivateState: function(value1) {
        this.privateState.state1 = value1;
    },
    otherPublicMethod: function() {}
});

User.prototype.publicMethod = function(value1) {
    this._setPrivateState(value1);
    this.otherPublicMethod();
}

var user = new User();

The factory uses closures to simulate private methods and variables; the this reference implicitly refers to the returned object literal itself. The magical part is that the factory doesn’t need the new keyword, which is why some people love it.

The prototype way loses private methods and variables (you can get them back by defining private functions inside the constructor, but let’s put that aside). The benefits are the explicit this binding that new performs, and getting a prototype chain from new (that’s how the prototype chain works). Although the prototype chain is commonly used to simulate inheritance (which is bad), we can also do mixins via the prototype (which shares all methods on the prototype; in comparison, a mixin in the factory style is normally done by copying methods between objects).
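
To make that contrast concrete, here is a minimal sketch; the Greets/User names are made up, and Object.assign stands in for a manual method-copy loop:

```javascript
// Mixin on the prototype (shared) vs. method copy per object (factory style).
var Greets = { greet: function () { return 'hi ' + this.name; } };

// Prototype mixin: every instance shares the same greet function.
var User = function (name) { this.name = name; };
Object.assign(User.prototype, Greets);

// Factory-style mixin: greet is copied onto each object we build.
var makeUser = function (name) {
  return Object.assign({ name: name }, Greets);
};

var a = new User('a');
var b = makeUser('b');
console.log(a.greet());                         // "hi a"
console.log(b.greet());                         // "hi b"
console.log(a.greet === User.prototype.greet);  // true: shared via the prototype
console.log(b.hasOwnProperty('greet'));         // true: copied onto the object itself
```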

Because the prototype is a killer feature IMHO, I lean towards the new/prototype/class pattern. I don’t have a strong opinion on the new keyword itself (some friends of mine don’t like it), but let’s review what new does:

  1. A new object is created, inheriting from foo.prototype.
  2. The constructor function foo is called with the specified arguments and this bound to the newly created object. new foo is equivalent to new foo(), i.e. if no argument list is specified, foo is called without arguments.
  3. The object returned by the constructor function becomes the result of the whole new expression. If the constructor function doesn’t explicitly return an object, the object created in step 1 is used instead. (Normally constructors don’t return a value, but they can choose to do so if they want to override the normal object creation process.)
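
Those three steps can be sketched by hand. This is an illustration of the semantics, not a full replacement for the operator (it ignores edge cases such as constructors returning functions):

```javascript
// A hand-written sketch of what `new foo(args)` does.
var simulateNew = function (Ctor) {
  var args = Array.prototype.slice.call(arguments, 1);
  var obj = Object.create(Ctor.prototype);   // step 1: new object inheriting from Ctor.prototype
  var result = Ctor.apply(obj, args);        // step 2: call Ctor with `this` bound to obj
  // step 3: an object returned by the constructor wins; otherwise use obj
  return (typeof result === 'object' && result !== null) ? result : obj;
};

var Foo = function (name) { this.name = name; };
Foo.prototype.greet = function () { return 'hi ' + this.name; };

var viaNew = new Foo('bob');
var viaSim = simulateNew(Foo, 'bob');
console.log(viaSim.greet());                     // "hi bob"
console.log(viaSim instanceof Foo);              // true
console.log(viaNew.greet() === viaSim.greet());  // true
```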

So I think new is still quite useful 🙂

Back to the implementation side. new/prototype/class is my choice, but it has some drawbacks. The prototype is very flexible, and we may misuse it. I’m especially against using it for inheritance-based reuse, because most people believe in composition over inheritance (for reuse).

So what do I need?

  • Make it clear that we need a constructor for an object factory (I don’t call it a class)
  • Define a set of own methods for that object, assigned onto the constructor’s prototype
  • Use mixins (on the prototype) for reuse

And you probably know that React is hot in our community. React has a method React.createClass, which does exactly the three things I describe above. It’s like a factory of object factories: it puts some restrictions on you, but shows you a schema for an object factory. I like it. But you don’t need React to use createClass; you can do it in a couple of lines of code.

var _ = require('underscore');

var createClass = function(options) {
    options = options || {};

    // Pull the constructor out of the options (fall back to a no-op).
    var constructor = options.hasOwnProperty('constructor') ? options.constructor : (function() {});
    delete options.constructor;

    // Pull the mixins out so they don't end up on the prototype as data.
    var mixins = options.mixins;
    delete options.mixins;

    // Whatever remains in options becomes the prototype (the own methods).
    constructor.prototype = options;

    // Copy each mixin's methods onto the prototype.
    if (mixins) {
        mixins.forEach(function(mixin) {
            _.extend(constructor.prototype, mixin);
        });
    }

    return constructor;
}

I used methods from underscore; you can copy them out to get a standalone createClass.
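
For illustration, here is how such a createClass might be used. In this sketch Object.assign stands in for _.extend so it runs without underscore, and the User/Serializable names are made up:

```javascript
// Standalone createClass sketch (Object.assign replaces _.extend).
var createClass = function (options) {
  options = options || {};
  var constructor = options.hasOwnProperty('constructor') ? options.constructor : function () {};
  delete options.constructor;
  var mixins = options.mixins;
  delete options.mixins;
  constructor.prototype = options;
  if (mixins) {
    mixins.forEach(function (mixin) { Object.assign(constructor.prototype, mixin); });
  }
  return constructor;
};

// A mixin shared across object factories.
var Serializable = { toJSON: function () { return JSON.stringify({ name: this.name }); } };

var User = createClass({
  constructor: function (name) { this.name = name; },
  mixins: [Serializable],
  greet: function () { return 'hello ' + this.name; }
});

var user = new User('ada');
console.log(user.greet());   // "hello ada"
console.log(user.toJSON());  // {"name":"ada"}
```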

How about class support in CoffeeScript and ES6? I’m a believer in both. I’ve used CoffeeScript for years, and I love its class implementation. The class support doesn’t give you mixins out of the box, because the mixin strategy is a personal choice. So a class is just a factory of object factories, the same as what we introduced above.

The how is not very important in this post, because you can easily write your own (better) implementation. What matters more is why we do this: the core is adopting a functional, localized-side-effects design style.

SQLAlchemy gevent Mysql python drivers comparison

Compare different MySQL drivers working with SQLAlchemy and gevent, and see whether they support cooperative multitasking using coroutines.

The main purpose of this test is to find which MySQL driver has the best concurrency performance when we use SQLAlchemy and gevent together.
So we won’t test umysql, which isn’t compatible with DBAPI 2.0.

Code is here. Thank you CMGS.

Result example

100 sessions, 20 concurrent, each session takes 0.5 seconds (via SELECT SLEEP)

mysql://root:@localhost:3306/mysql_drivers_test total 100 (20) 50.5239 seconds
mysql+pymysql://root:@localhost:3306/mysql_drivers_test total 100 (20) 2.6847 seconds
mysql+oursql://root:@localhost:3306/mysql_drivers_test total 100 (20) 50.4289 seconds
mysql+mysqlconnector://root:@localhost:3306/mysql_drivers_test total 100 (20) 2.6682 seconds

With greenify-patched MySQLdb (mysql-python):

mysql://root:@localhost:3306/mysql_drivers_test total 100 (20) 2.5790 seconds
mysql+pymysql://root:@localhost:3306/mysql_drivers_test total 100 (20) 2.6618 seconds
mysql+oursql://root:@localhost:3306/mysql_drivers_test total 100 (20) 50.4437 seconds
mysql+mysqlconnector://root:@localhost:3306/mysql_drivers_test total 100 (20) 2.6340 seconds

Pure-Python drivers support gevent’s monkey patching, so they support cooperative multitasking using coroutines.
That means the main thread won’t be blocked by MySQL calls when you use PyMySQL or mysql-connector-python.

1000 sessions, 100 concurrent, each session only does 1 insert and 1 select

mysql://root:@localhost:3306/mysql_drivers_test total 1000 (100) 10.1098 seconds
mysql+pymysql://root:@localhost:3306/mysql_drivers_test total 1000 (100) 26.8285 seconds
mysql+oursql://root:@localhost:3306/mysql_drivers_test total 1000 (100) 6.4626 seconds
mysql+mysqlconnector://root:@localhost:3306/mysql_drivers_test total 1000 (100) 22.4569 seconds

oursql is faster than MySQLdb in this case.
Among the pure-Python drivers, mysql-connector-python is a bit faster than PyMySQL.
Using greenify or not doesn’t affect the result in this scenario.

Setup

mkvirtualenv mysql_drivers_test # workon mysql_drivers_test
pip install --upgrade setuptools pip cython
pip install -r requirements.txt
python mysql_drivers_comparison.py

Test with greenify-patched MySQL-python (on OS X with Homebrew).

mkvirtualenv mysql_drivers_test # workon mysql_drivers_test
pip install --upgrade setuptools pip cython
pip install -r requirements.txt
git clone https://github.com/CMGS/greenify
cd greenify
cmake -G 'Unix Makefiles' -D CMAKE_INSTALL_PREFIX=$VIRTUAL_ENV CMakeLists.txt
make && make install
cd ..
export LIBGREENIFY_PREFIX=$VIRTUAL_ENV
pip install git+git://github.com/CMGS/greenify.git#egg=greenify # if you are using zsh, use \#egg=greenify
git clone https://github.com/CMGS/mysql-connector-c
cd mysql-connector-c
export DYLD_LIBRARY_PATH=$VIRTUAL_ENV/lib
cmake -G 'Unix Makefiles' -D GREENIFY_INCLUDE_DIR=$VIRTUAL_ENV/include -D GREENIFY_LIB_DIR=$VIRTUAL_ENV/lib -D WITH_GREENIFY=1 -D CMAKE_INSTALL_PREFIX=$VIRTUAL_ENV CMakeLists.txt
make && make install
cd ..
git clone https://github.com/CMGS/MySQL-python.git
cd MySQL-python
export DYLD_LIBRARY_PATH=$VIRTUAL_ENV/lib
export LIBRARY_DIRS=$VIRTUAL_ENV/lib
export INCLUDE_DIRS=$VIRTUAL_ENV/include
unlink /usr/local/lib/libmysqlclient.18.dylib
ln -s $VIRTUAL_ENV/lib/libmysql.16.dylib /usr/local/lib/libmysqlclient.18.dylib
python setup.py install
brew switch mysql [version] # brew switch mysql 5.6.15 on my env, brew info mysql to check which version is available on your env
cd ..
python mysql_drivers_comparison.py

If greenify doesn’t work for you, you can run otool -L _mysql.so in the MySQL-python folder under $VIRTUAL_ENV/lib/python2.7/site-packages. Can’t find otool even after you installed Xcode’s command line tools? Follow this link.

I need to say thank you to CMGS. He guided me through installing greenify and how it works, and he also helped me triage the issues I met (including the otool part). Making greenify and MySQL work on OS X makes little sense by itself; you should do it on your application server, which will probably be Linux. I hope this helps you figure out how.

ThoughtWorks to host the 11th B’QConf (Beijing Software Quality Conference) on March 8, 2014

This post is forwarded on behalf of a friend.

Dear friends in the software industry:

Is the 10th B’QConf (Beijing Software Quality Conference) still on your mind? Are the packed venue and heated discussions still fresh in your memory?

The 11th B’QConf (Beijing Software Quality Conference), co-hosted by ThoughtWorks China and Alibaba Group and co-organized by the China Software Test Managers Alliance, will be held as scheduled on March 8, 2014. B’QConf invites IT testing and quality professionals to gather, share experience, discuss topics, and meet peers in the industry. Whether you are an old friend of B’QConf or a new one, as long as you love testing, care about quality, want to learn, and like to share, you are welcome to join us.

Highlights of this session:

  • Topic 1: A web performance-testing model for small and medium-sized projects

Performance testing for small and medium projects differs from large ones. With limited time and resources, let’s talk about how to choose performance-testing metrics, how to design performance test cases, and what kind of final deliverable brings the most value to the project.

Speaker: Sun Hong, ThoughtWorks QA

  • Topic 2: Android test automation in practice

This talk introduces an automation framework for testing Android phone apps, which solves common automation problems on Android, covering its principles and usage, with worked examples.

Speaker: Zhao Wanping (Jingjing), Alimama Mobile

  • Topic 3: A first look at virtual machines

In daily work we need several different environments, but owning multiple physical machines is impractical. We tried virtualization to solve this problem. After some research, we decided to use KVM to define and manage our virtual machines. This talk shares some experience with defining, configuring, and tuning the performance of VMs on KVM.

Speaker: Yang Rui, ThoughtWorks QA

  • Topic 4: Quality monitoring and assurance for big data

This talk introduces monitoring for distributed systems in the big-data era, and the testing methods and principles for data warehouses and data mining.

Speaker: Li Chunyuan (Chunyuan), Alimama-CNZZ

  • Registration: Sign up for the 11th B’QConf
  • Time: March 8, 2014, 13:30 – 17:30 (refreshments, fruit, and soft drinks provided on site)
  • Location: Room 1105, 11F, Guohua Investment Building, 3 Dongzhimen South Street, Dongcheng District, Beijing (ThoughtWorks Beijing office)

By subway: Line 2 to Dongzhimen Station, Exit D; Guohua Investment Building is next to the Raffles City mall

By bus: more than 20 routes, including 106, 107, 117, 24, 612, and 635, stop at Dongzhimennei / Dongzhimenwai / the Dongzhimen transit hub
By car: Guohua Investment Building, next to Raffles City at the southwest corner of the Dongzhimen bridge

For more news, please follow B’QConf on Weibo

Places to find essential new things

Had a discussion with friends this morning about how to explore and learn new things. I ended up with a list for myself:

  • new.me digest email (daily)
  • zhihu.com and quora.com weekly digest emails
  • HN (Hacker News), the best realtime tech-world aggregator
  • GitHub trends, a place to find interesting ideas and high-quality projects
  • Twitter geek lists (not your Twitter timeline; you need to pick your own geek list)
  • Techmeme, technology industry headlines
  • The Verge, for high-quality reviews and news posts with good pictures
  • A reader (Digg Reader, Feedly, or whatever); the baseline is that you need to follow interesting people there

Some of my best coworkers and friends contributed these:

  • Lifehacker, for interesting how-tos
  • ScienceDaily, latest science news
  • Reddit, everything fun
  • Google Plus, depending on your circle quality and the time you spend
  • Engadget

Comparing Redis and memcached as a cache in Python

I hate posting micro-benchmarks of libraries; it’s kind of stupid when they are not the bottleneck. Now I will hate myself 😀

The verdict is: using memcached and Redis gives similar performance. In our system we already use Redis as a fast data-structure store, so using it as the cache too saves a moving part.

  • cPickle 50000 rounds, used 1.61123394966
  • msgpack 50000 rounds, used 0.296160936356
  • memcached with pylibmc save/load 15000 rounds, used 1.68719887733
  • memcached python-memcached save/load 15000 rounds, used 3.92766404152
  • redis with hiredis save/load 15000 rounds, used 2.76974511147
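
The serializer rows above can be reproduced with a standard-library-only sketch; pickle stands in for cPickle on Python 3, and msgpack plus the memcached/Redis client calls are omitted because they are third-party:

```python
# Micro-benchmark sketch: time serializer round-trips, like the cPickle row above.
import pickle
import timeit

payload = {"user_id": 42, "scores": list(range(100))}

def round_trip():
    return pickle.loads(pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL))

elapsed = timeit.timeit(round_trip, number=50000)
print("pickle 50000 rounds, used %.4f seconds" % elapsed)
assert round_trip() == payload  # a cache must hand back exactly what was stored
```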

Actually, we are using Redis + msgpack now.

By the way, our new leaderboard service is built on a gevent-patched Flask + Redis + SQLAlchemy stack. An AWS 8-core instance gives us 900 rps, 1.5x the throughput of a Ruby version (with fibers and EventMachine) written by my ex-coworker @flyerhzm.

Output ISO 8601 format datetime string in UTC timezone

I hate timezones. Especially in Python: since timezone data is not in the standard library, I always need to install pytz. I know it makes sense, since the timezone database changes sometimes and putting it in the standard library makes no sense, but this makes life harder. Every time I work with timezone-aware datetimes, it’s #FML.

The formal way is to name your timezone in a config file, but for the case of outputting an ISO 8601 datetime string in UTC, it’s not elegant to ask for that information. The tricky part is getting your local timezone. Since the timezone is not in a system environment variable, it takes a bit of a hack to make it work with a small code footprint. I know I’m stupid, but let me post my solution here. If you know a better solution, please let me know.

import pytz

try:
    # Build a tzinfo object from the system's zone file.
    local_tz = pytz.build_tzinfo('localtime', open('/etc/localtime', 'rb'))
except Exception:
    # Fall back to the hand-rolled LocalTimezone from the datetime docs.
    from .poorman_tz import LocalTimezone
    local_tz = LocalTimezone()


def isoformat(dt):
    if dt:
        # Attach the local zone, drop microseconds, then convert to UTC.
        return local_tz.localize(dt).replace(microsecond=0).astimezone(pytz.utc).isoformat()
    return None

The LocalTimezone is from Python’s datetime documentation; just copy that section of code and you get this poor man’s implementation of LocalTimezone. Thanks to my coworker Stan for pointing me to the solution of building a timezone from /etc/localtime, and to the smart guy who answered that Stack Overflow thread. BTW: Python’s datetime carries microsecond information; I don’t need it, so I replace it with 0, which makes the datetime string shorter.
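
As an aside, on Python 3 the standard library alone can do the same conversion, so the pytz and /etc/localtime tricks aren’t needed; this is an alternative sketch, not the approach above:

```python
# Python 3 sketch: naive datetimes are assumed to be local time, then converted to UTC.
from datetime import datetime, timezone

def isoformat_utc(dt):
    if dt is None:
        return None
    local = dt.astimezone()  # attach the system local zone to a naive datetime
    return local.replace(microsecond=0).astimezone(timezone.utc).isoformat()

print(isoformat_utc(datetime(2020, 1, 1, 12, 30, 45, 123456)))
```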

Web semantics really is the precious humanity of web developers!

The discussion got long on Zhihu, so I’m posting a copy on my own blog too.

The Semantic Web is about letting machines understand data. Semantic Web technology includes a set of description languages and inference logic. It describes ontologies through certain formats; W3C’s RDF is one such specification. It describes what the data means and the possible relations (verbs?) between terms, so a computer can produce the data views we need through queries (inference rules). In other words, if you ask the computer a question, because it understands the data it can infer the answer you want, even if that answer wasn’t prepared in advance. Most Semantic Web representation formats are based on XML, because XML is a complete, general-purpose description language.

HTML chose a text protocol because text is easy for both humans and computers to read. Note an important historical detail: it was email that inspired and helped create Internet technology. The first email-related RFC (RFC 561) appeared back in 1973, while TBL proposed hyperlink technology in 1989, which marks the birth of the WWW.

Text protocols were used because the messages weren’t originally meant to be stored and understood by computers. The earliest email was much like today’s text messages: two computers online at the same time, passing some text over a modem for the humans at the computers to read. That text had no links (no links means no web); it was purely blocks of text.

Later, to let email carry non-text data without breaking the compatibility of the original protocol, the very, very important MIME protocol (Multipurpose Internet Mail Extensions) was born.

With the email protocol family (the transport protocols plus MIME), computers could already exchange all kinds of data through plain-text message bodies. But at that point only data was being transferred; there were no relations between the data.

Hypertext extended text protocols with the ability to express relationships between documents (hyperlinks), turning plain text into a network (of relations). The first RFC for this text format, HTML, was RFC 1866 in 1995, and our lovely HTTP only got HTTP 1.0 (RFC 1945) in 1996.

You can basically rank the seniority of these technologies by their RFCs…

I emphasize this history to show that machine understanding was not the goal of this protocol family; computers can parse binary, and binary is more efficient (to transmit and to parse). These protocols were designed for human readability first, so they are all based on text protocols that are not that friendly to computers but are much more efficient for humans to debug.

So what does this have to do with the Semantic Web?

Because HTML was not optimized for machine readability, parsing HTML is actually quite a headache, as anyone who has written an HTML parser knows. HTML is often a real mess; much of the fuzzy semantics has to be guessed at.

Hence the famous XHTML, whose goal was to dress HTML in XML’s clothes. What is XML? XML was originally designed as a protocol readable by both computers and humans; since humans can only really read text, it is a text description language. Making HTML conform to XML made both computers (robots) and humans happy. Then came XHTML 1.0; everyone who lived through that era personally experienced the “website refactoring” movement. With a few more mandatory rules, those writing XHTML parsers no longer had to cry. But then people noticed that authors writing XHTML often made syntax errors, while others remained attached to HTML 4.x’s flexible (fuzzy) syntax.

Since many people found that XHTML really did make it easier for computers to understand the structure of text, they pushed on fanatically toward XHTML 2.0, but that was a road of no return. For most of the humans consuming HTML there was no love in it, so the standard was eventually abandoned.

Beyond the literal text, what matters more to humans is extracting the concepts; humans need to know the structure of the document. The people who design protocols are high-end talent who write papers all day… (I am purely speculating here.) So they wanted HTML to present well the chapter-like structure of the text they wrote, and they built the document-structure metaphor into HTML’s document model (BOM and DOM). So the HTML spec defines the tags and the document metaphors those tags express; when humans read the text they map it onto the structure of a document (some paper), and the browser can render it in a way that makes the standard writers happy.

But then people found that documents alone weren’t enough without an index, so search engines became more and more important. Search engines aren’t all human-curated like Yahoo was; the later mainstream engines search by the relevance between the query text and the text on the web. Fetching web text and parsing it by document structure requires robots (crawlers). So the robots’ right to read web pages received more and more attention; search engine optimization is precisely an attempt to fool those robots’ algorithms.

This brings us back to the Semantic Web: queries are not just literal matching, and people want smarter search engines. A search engine should know the user’s intent. That isn’t artificial intelligence but statistics-based algorithms. Still, those algorithms correlate with pieces of the Semantic Web, because people need to get data, find the ontology behind it, and make logical deductions through predefined relations between ontologies (today this is done with hard-coded algorithms rather than the Semantic Web’s inference systems). In other words, the model is conceptually similar to the Semantic Web, but since the technology isn’t quite feasible yet, it took another path. Still, when it comes to understanding text, every modern search engine has logic of this kind: it wants to describe the collected text as data that supports inference. In the Semantic Web, one way to describe such data is RDF. RDF is based on XML, and XHTML is a kind of XML. Storing Semantic Web data in HTML attributes is called RDFa (Resource Description Framework in attributes), which brings HTML/XHTML and Semantic Web technology together. Of course, representing the data is only one part of the Semantic Web.

Similar to RDFa there are microformats (I championed microformats early on; later many people declared them dead, but luckily microdata took off afterwards), which put semantic data in node text or attributes and express the data structure through CSS classes. But they only express structure; we still can’t map to an ontology. Microdata is such an attempt: it defines vocabularies that express certain common formats, and through those vocabularies we map to the ontology of the data.

And here the circle closes. Semantics requires making the mapping between data and the ontology it expresses possible: first the structure must be expressible, and then, through conventions (or explicit declarations) about structure, computers can find the ontology behind those structures and make logical deductions through ontology relations. Today very few things have truly become conventions; just look at microdata. But setting aside the ultimate ideal of turning our big web into a Semantic Web: with the technology available today, we should fully account for the poor robots’ very weak understanding, and try to say things in conventional forms, so the robots can do the inference and computation they are good at and we are not. Ideally, when our documents describe the relevant ontology, they use structures computers understand more easily; that is semantics. In other words, expressing vocabulary computers can understand through certain patterns: that is HTML semantics.

Of course there is now a huge problem: HTML’s document model has no metaphorical relation to the things we usually want to express, and the gap is enormous. Developing web applications, we feel this limitation everywhere, which is why impatient people turned HTML5 into a perpetually evolving standard, so that the new structures and semantics we want can be added to this text description protocol more promptly.

Finally, to emphasize: semantics is really not for us humans. Semantics is an expression of human benevolence; we also take care of the poor robots, letting them gradually learn our human vocabulary through self-describing structures and understand what we humans are saying, so they can serve us better. Web semantics really is the precious humanity of web developers!

Related links: