SQLAlchemy gevent MySQL Python drivers comparison

sqlalchemy-gevent-mysql-drivers-comparison

Compare different MySQL drivers working with SQLAlchemy and gevent, and see whether they support cooperative multitasking using coroutines.

The main purpose of this test is to find which MySQL driver has the best concurrency performance when we use SQLAlchemy and gevent together.
So we won’t test umysql, which isn’t compatible with DBAPI 2.0.

Code is here. Thank you CMGS.

Result example

100 sessions, 20 concurrent, each session takes 0.5 seconds (via SELECT SLEEP)

mysql://root:@localhost:3306/mysql_drivers_test total 100 (20) 50.5239 seconds
mysql+pymysql://root:@localhost:3306/mysql_drivers_test total 100 (20) 2.6847 seconds
mysql+oursql://root:@localhost:3306/mysql_drivers_test total 100 (20) 50.4289 seconds
mysql+mysqlconnector://root:@localhost:3306/mysql_drivers_test total 100 (20) 2.6682 seconds

With greenify-patched MySQLdb (mysql-python).

mysql://root:@localhost:3306/mysql_drivers_test total 100 (20) 2.5790 seconds
mysql+pymysql://root:@localhost:3306/mysql_drivers_test total 100 (20) 2.6618 seconds
mysql+oursql://root:@localhost:3306/mysql_drivers_test total 100 (20) 50.4437 seconds
mysql+mysqlconnector://root:@localhost:3306/mysql_drivers_test total 100 (20) 2.6340 seconds

Pure Python drivers support gevent’s monkey patching, so they support cooperative multitasking with coroutines.
That means the main thread won’t be blocked by MySQL calls when you use PyMySQL or mysql-connector-python.
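
For reference, here is a minimal sketch of the “select sleep” scenario, a gevent pool on top of a SQLAlchemy engine. It is not the exact benchmark script; DB_URL, the pool sizes, and the session function are illustrative.

# Minimal sketch of the benchmark loop; DB_URL and the pool sizes are illustrative.
from gevent import monkey
monkey.patch_all()  # pure Python drivers yield to the gevent hub on socket I/O after this

import time
import gevent.pool
from sqlalchemy import create_engine, text

DB_URL = 'mysql+pymysql://root:@localhost:3306/mysql_drivers_test'
engine = create_engine(DB_URL, pool_size=20, max_overflow=0)

def session(_):
    # each session just sleeps 0.5 seconds inside MySQL
    with engine.connect() as conn:
        conn.execute(text('SELECT SLEEP(0.5)'))

start = time.time()
gevent.pool.Pool(20).map(session, range(100))  # 100 sessions, 20 concurrent
print('%s total 100 (20) %.4f seconds' % (DB_URL, time.time() - start))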

1000 sessions, 100 concurrent, each session does only 1 insert and 1 select

mysql://root:@localhost:3306/mysql_drivers_test total 1000 (100) 10.1098 seconds
mysql+pymysql://root:@localhost:3306/mysql_drivers_test total 1000 (100) 26.8285 seconds
mysql+oursql://root:@localhost:3306/mysql_drivers_test total 1000 (100) 6.4626 seconds
mysql+mysqlconnector://root:@localhost:3306/mysql_drivers_test total 1000 (100) 22.4569 seconds

Oursql is faster than MySQLdb in this case.
Among the pure Python drivers, mysql-connector-python is a bit faster than PyMySQL.
Whether you use greenify or not doesn’t affect the results in this scenario.

Setup

mkvirtualenv mysql_drivers_test # workon mysql_drivers_test
pip install --upgrade setuptools pip cython
pip install -r requirements.txt
python mysql_drivers_comparison.py

Test with greenify-patched MySQL-python (on OS X with Homebrew).

mkvirtualenv mysql_drivers_test # workon mysql_drivers_test
pip install --upgrade setuptools pip cython
pip install -r requirements.txt
git clone https://github.com/CMGS/greenify
cd greenify
cmake -G 'Unix Makefiles' -D CMAKE_INSTALL_PREFIX=$VIRTUAL_ENV CMakeLists.txt
make && make install
cd ..
export LIBGREENIFY_PREFIX=$VIRTUAL_ENV
pip install git+git://github.com/CMGS/greenify.git#egg=greenify # if you are using zsh, use \#egg=greenify
git clone https://github.com/CMGS/mysql-connector-c
cd mysql-connector-c
export DYLD_LIBRARY_PATH=$VIRTUAL_ENV/lib
cmake -G 'Unix Makefiles' -D GREENIFY_INCLUDE_DIR=$VIRTUAL_ENV/include -D GREENIFY_LIB_DIR=$VIRTUAL_ENV/lib -D WITH_GREENIFY=1 -D CMAKE_INSTALL_PREFIX=$VIRTUAL_ENV CMakeLists.txt
make && make install
cd ..
git clone https://github.com/CMGS/MySQL-python.git
cd MySQL-python
export DYLD_LIBRARY_PATH=$VIRTUAL_ENV/lib
export LIBRARY_DIRS=$VIRTUAL_ENV/lib
export INCLUDE_DIRS=$VIRTUAL_ENV/include
unlink /usr/local/lib/libmysqlclient.18.dylib
ln -s $VIRTUAL_ENV/lib/libmysql.16.dylib /usr/local/lib/libmysqlclient.18.dylib
python setup.py install
brew switch mysql [version] # brew switch mysql 5.6.15 on my env, brew info mysql to check which version is available on your env
cd ..
python mysql_drivers_comparison.py

If greenify doesn’t work for you, run otool -L _mysql.so inside the MySQL-python folder under $VIRTUAL_ENV/lib/python2.7/site-packages to check which libraries it links against. Can’t find otool even after you installed Xcode’s command line tools? Follow this link.

I need to say thank you to CMGS. He guided me through installing greenify and explained how it works, and he also helped me triage the issues I ran into (including the otool part). Making greenify and MySQL work on OS X makes little sense; you should do it on your application server, which will probably be Linux. I hope you figure out how.

ThoughtWorks to host the 11th B’QConf (Beijing Software Quality Conference) on March 8, 2014

This post is forwarded on behalf of a friend.

Dear friends in the software industry:

Is the 10th B’QConf (Beijing Software Quality Conference) still echoing in your mind? Are the packed venue and the heated discussions still fresh in your memory?

The 11th B’QConf (Beijing Software Quality Conference), co-hosted by ThoughtWorks China and Alibaba Group and co-organized by the China Software Test Manager Alliance, will be held as scheduled on March 8, 2014. B’QConf invites IT testing and quality professionals to gather, share experience, discuss topics, and meet peers in the industry. Whether you are an old friend of B’QConf or a newcomer, as long as you love testing, care about quality, want to learn, and like to share, you are welcome to join our event.

Highlights of this session:

  • Topic 1: A web performance testing model for small and medium-sized projects

Performance testing for small and medium-sized projects is different from that for large ones. With limited time and resources, let’s talk about how to choose performance metrics, how to design performance test cases, and what kind of result is most valuable to the project.

Speaker: Sun Hong, ThoughtWorks QA

  • Topic 2: Android test automation in practice

This talk introduces an automation framework for testing Android mobile apps, covering the automation problems it solves on typical Android systems, how it works, and how to use it, illustrated with examples.

Speaker: Zhao Wanping (Jingjing), Alimama Mobile

  • Topic 3: A first look at virtual machines

In daily work we need several different environments, and owning that many physical machines is not realistic. We tried virtualization to solve this. After some research we decided to use KVM to define and manage our virtual machines; this talk shares our experience with defining, configuring, and tuning the performance of virtual machines on KVM.

Speaker: Yang Rui, ThoughtWorks QA

  • Topic 4: Quality monitoring and assurance for big data

This talk covers the monitoring of distributed systems in the big-data era, plus testing methods and principles for data warehouses and data mining.

Speaker: Li Chunyuan (Chunyuan), Alimama CNZZ

  • Registration: Sign up for the 11th B’QConf
  • Time: March 8, 2014, 13:30 – 17:30 (refreshments, fruit, and soft drinks provided on site)
  • Venue: Room 1105, 11th Floor, Guohua Investment Building, 3 Dongzhimen South Street, Dongcheng District, Beijing (ThoughtWorks Beijing office)

By subway: Line 2 to Dongzhimen Station, Exit D; Guohua Investment Building is next to the Raffles City mall

By bus: more than 20 routes, including 106, 107, 117, 24, 612, and 635, stop at Dongzhimennei / Dongzhimenwai / Dongzhimen Transport Hub
By car: Guohua Investment Building, next to the Raffles City mall at the southwest corner of Dongzhimen Bridge

For more updates, please follow B’QConf on Weibo.

Places to find essential new things

Had a discussion with friends this morning about how to explore new things and learn new stuff. Then I came up with a list for myself:

  • new.me digest email (daily)
  • zhihu.com and quora.com weekly digest emails
  • HN (Hacker News), the best realtime tech world aggregator
  • GitHub trends, a place to find interesting ideas or high-quality projects
  • Twitter geek list (not the Twitter timeline; you need to pick your own geek list)
  • Techmeme, technology industry headlines
  • The Verge, for high-quality reviews and news posts with good pictures
  • Reader (Digg Reader, Feedly, or whatever); the baseline is that you need to follow interesting people here

Some of my best coworkers and friends contributed these:

  • Lifehacker, for interesting how-tos
  • ScienceDaily, the latest science news
  • reddit, everything fun
  • Google Plus, depending on the quality of your circles and the time you spend
  • Engadget

Recommendation for my coworker Stan

I wrote a recommendation for my previous coworker Stan on LinkedIn, but another coworker said it wasn’t proper and was too funny for such a serious site. So I guess I should rewrite it, but I will leave the original here 😀

To be honest, Stan makes me cry; I can’t catch up with him. But it’s an honor to work with his brilliant mind. He is kind: he never called us dummies. He has a great ability to abstract, which IMHO is the most precious skill a software craftsman can have. He has a good sense for new technology, and he builds his own sharp toolsets and leverages the other folks on his team. One thing I think he could improve: he doesn’t drink beer with us, so he will always be sober while we are drunk.

Compare redis and memcache as a cache in Python

I hate posting micro-benchmarks of libraries; it’s kinda stupid when they are not the bottleneck. Now I will hate myself 😀

The verdict is: memcache and redis have similar performance as a cache. In our system we already use redis as a fast data-structure store, so using it as the cache also saves a moving part.

  • cPickle 50000 rounds, used 1.61123394966
  • msgpack 50000 rounds, used 0.296160936356
  • memcached with pylibmc save/load 15000 rounds, used 1.68719887733
  • memcached python-memcached save/load 15000 rounds, used 3.92766404152
  • redis with hiredis save/load 15000 rounds, used 2.76974511147

Actually, we are using redis + msgpack now.
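
For the curious, here is a minimal sketch of a redis + msgpack cache in that spirit. It is not our production code; the client settings, key prefix, and TTL are made up for the example.

# Minimal redis + msgpack cache sketch; connection settings, key prefix, and TTL are illustrative.
import msgpack
import redis

client = redis.StrictRedis(host='localhost', port=6379, db=0)

def cache_set(key, value, ttl=300):
    # serialize with msgpack and expire after ttl seconds
    client.setex('cache:' + key, ttl, msgpack.packb(value))

def cache_get(key):
    raw = client.get('cache:' + key)
    return msgpack.unpackb(raw) if raw is not None else None

cache_set('leaderboard:top', [{'user': 'alice', 'score': 42}])
print(cache_get('leaderboard:top'))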

By the way, our new leaderboard service is built on a gevent-patched Flask + redis + SQLAlchemy stack. An AWS 8-core instance gives us 900 rps, 1.5x the throughput of a Ruby version (with fibers and EventMachine) written by my ex-coworker @flyerhzm.

Draft letter to Caltrain office

I broke the law by accident. This is a big deal for me. I was definitely wrong in this case, but I also found there are some issues in the Caltrain system. So instead of staying down forever, I will write a letter to the Caltrain office. My friend tells me this is like burning a letter to the immortals: no one will read or reply. But I trust in goodwill, so I will publish this letter on my blog. If you read this by accident, please learn that breaking the law is painful; train yourself not to make stupid mistakes. That will make everyone happy and make a better world. This letter is just a first-cut draft; I will correct grammar mistakes and improve weird sentences over the coming days.

Hello Caltrain officers,

I’m a Caltrain commuter living in Hillsdale. I started using Caltrain last December, and I like it. Caltrain has cleaner cabins than BART and Muni, and the passengers are more polite. My company sponsors our Caltrain commute through Clipper Direct, and most of my coworkers have their Caltrain monthly pass auto-loaded.

But I got a citation recently, which almost ruined my life. It happened on June 3rd, a Monday, the first working day of June. My coworker and I were discussing a topic from our current project, and we boarded Caltrain together at Hillsdale. Then two conductors came to us and checked our tickets. We showed our Clipper cards as usual, but the reader said no. We realized it was the first working day of the month, so we should have tagged our Clipper cards. We each explained this to the conductors, but they gave both of us a citation for fare evasion. I didn’t argue, because I’m an adult. The conductors refused to read our monthly pass online order (which we showed), which meant we had nothing to show as proof of payment.

I said this almost ruined my life because I broke the law, which makes me desperate. I’m an alien worker, so I’m new here. I have tried to behave as a moral citizen, be nice to everyone, and stay well disciplined. But when I got this citation, that dream collapsed.

But I don’t want to be depressed forever. I should tell you how I feel, and let’s figure out the best way to prevent this kind of situation in the future, which will make the world a better place.

The monthly pass is not just a way to save money; I think its ultimate goal is to make our daily commute smoother. Buying a ticket at rush hour is a pain, so Clipper is an express lane, and paying monthly saves most of us tons of time. I cherish that intention. But in reality there is a gap. Other commute systems like BART and Muni have Clipper gates, so you must tag your card as you pass through. Caltrain only has that at San Francisco’s 4th and King station, where conductors stand to check tickets. The other stations are open, and some have only a few Clipper tagging machines. So most monthly pass commuters don’t tag on most days of the month; they only tag on the very first day. The gap is in the habits we build around public transportation: on Caltrain we don’t tag daily, but on Muni and BART we tag daily even with a monthly pass. That means we have to remind ourselves to tag on the first day, and memory doesn’t work as precisely as a clock.

So I hope Caltrain can give us a notification on those critical days (the very first working day of each month). Clipper has our email addresses, and most Caltrain stations have message boards (LED screens); both could work as reminders and help us avoid this mistake. For a system that depends on good self-discipline, a reminder would reduce stupidity a great deal.

Clipper Direct lets us set up a recurring order for the monthly pass, so I don’t know why it can’t be pre-loaded the way a monthly pass bought at Walgreens is. If it loaded automatically, we wouldn’t make a mistake like this, and it would free us from the pressure of loading the monthly pass ourselves. In our case we actually paid for the monthly pass, but the ticket-checking machine failed to load the monthly pass information, which put us in the position of having no valid ticket, and that is breaking the law. Breaking the law this way is miserable. We don’t want to be fare evaders; that is immoral, it makes us hateful, and it would be shameful to tell our children that we were fare evaders. I mean that when we have paid but cannot show that we paid, because a separate process is still needed to activate the monthly pass, it is inconvenient and feels like a trap that turns good people into outlaws. There are probably technical or other reasons why Clipper works this way, but please make it friendlier and help good people keep being good.

Because we have commuted on Caltrain for months, we have seen many interactions between passengers and conductors, and almost all conductors are friendly and helpful. I have seen a conductor explain to foreigners who bought the wrong tickets and ask them to buy the right ones next time. I have seen a conductor ask people who forgot to tag their Clipper cards to tag them at the next stop. I have heard from coworkers who were simply educated when they forgot to tag on the first day of the month. All those stories told me that conductors help us avoid mistakes and educate us when we are wrong. But when this happened to me for the first time, I got a citation directly. I am not asking for mercy; everyone is equal before the law, and I deserve this punishment. But in the spirit of goodwill, we should use punishment to deter intentional fare evasion and education for everyone else. That will keep our faith in justice. I still trust that the conductors on the train are trying to keep us from making this mistake. But on the day I got my citation, I saw two conductors checking tickets on the first working day of the month, and three people in the same cabin got citations. That looks more like punishment than education.

I won’t argue here about whether I’m guilty; I’m supposed to discuss that with the judge in court. I only want to ask for some changes to this system to ease our commute, make our lives easier, and hopefully keep anyone else from breaking the law this way. Would you consider these actions?

  • Give us the choice to pre-load the monthly pass
  • Give us notifications by email (for Clipper Direct users), or show notifications on the station displays on the first (working) day of the month
  • Give us a convenient way to tag the card, especially a way to tag on the train, or alternatively a tagging machine at each station entrance, so that tagging daily is smooth for monthly pass commuters
  • Give us a channel to show proof of a monthly pass if we get cited in this situation after we have paid for the pass

Thank you for reading. I won’t be depressed forever; I trust that Caltrain will get better, and we will be happier commuting on this transportation system.

Me,
Jun 9, 2013.

Output ISO 8601 format datetime string in UTC timezone

I hate timezones. Especially in Python: since timezone data is not in the standard library, I always need to install pytz. I know it makes sense, since the timezone database changes from time to time and it would make no sense to bundle it into the standard library, but it makes life harder. Every time I work on timezone-aware datetimes, it’s #FML.

The formal way is to name your timezone in a config file, but just to output an ISO 8601 datetime string in UTC, it’s not elegant to ask for that information. The tricky part is getting your local timezone. Since the timezone isn’t in a system environment variable, it takes a bit of a hack to make it work with a small code footprint. I know I’m stupid, but let me post my solution here. If you know a better solution, please let me know.

import pytz

try:
    # build a tzinfo from the system zoneinfo file (works on Linux and OS X)
    local_tz = pytz.build_tzinfo('localtime', open('/etc/localtime', 'rb'))
except Exception:
    # fall back to the LocalTimezone class copied from Python's datetime docs
    from .poorman_tz import LocalTimezone
    local_tz = LocalTimezone()


def isoformat(dt):
    # localize the naive datetime, drop microseconds, and format it in UTC
    if dt:
        return local_tz.localize(dt).replace(microsecond=0).astimezone(pytz.utc).isoformat()
    return None

The LocalTimezone class comes from Python’s datetime documentation; just copy that section of code and you get this poor man’s implementation of LocalTimezone. Thanks to my coworker Stan for pointing me to the trick of building a timezone from /etc/localtime, and to the smart guy who answered that Stack Overflow thread. BTW: Python’s datetime carries microsecond information; I don’t need it, so I replace it with 0, which makes the datetime string shorter.
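
And a quick usage check, assuming the snippet above is importable; the sample datetime and the commented output are only illustrative and depend on your local timezone.

from datetime import datetime

# a naive local datetime, e.g. read from a DB column without tzinfo
dt = datetime(2013, 6, 9, 12, 30, 45, 123456)

print(isoformat(dt))    # e.g. '2013-06-09T19:30:45+00:00' when the local zone is US/Pacific
print(isoformat(None))  # None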

The benefits of beer tasting

I’ve been tasting beer for about two years and really enjoy the process. So why taste beer at all?

First we need to be clear about the difference between drinking and tasting. Tasting should be a subset of drinking, because only smelling without actually drinking is not a reliable way to taste. The biggest difference is in the Why and the How: the purpose of tasting is to distinguish, to appreciate, and to communicate.

Tasting is an exercise in pattern recognition. Pattern recognition is the R-mode of our mental model, an image-based way of thinking; it is a kind of intuition. Tasting must be recorded in language, either by discussing impressions with friends or by writing them down to share with more people. Why must tasting involve communication? Because pattern recognition is R-mode, but communication between people uses language and logic, which belong to L-mode. According to the theory of mental models, L-mode and R-mode usually cannot work at the same time; if the two modes exchange quickly and frequently in someone, that person comes across as smart. There is a vivid name for the interaction between R-mode and L-mode: we call it a figure of speech. People who are good at figures of speech are usually said to be literary, but in a broader sense they are simply smart. We have a more fashionable word, “metaphor”, which is a very important intuition-based skill in every professional field; we often say that a master of a domain has exceptional intuition, in other words exceptional pattern recognition and an exceptional command of metaphor. So, as that book says, humor is actually a special use of metaphor; both making a joke and getting a joke require it.

If you have read beer reviews, you will know why I talk about pattern recognition and metaphor. Here is an example:

Prominent floral aroma with fresh fruit notes: citrus, apricot, lychee, malt, and a hint of sugar. Deep amber with a reddish tint; strong carbonation on the palate makes the body feel light. The liquid is slightly hazy, with rich, sticky foam. Medium body, a somewhat weak alcohol presence, with considerable bitterness and astringency. The bitterness is strong, giving an illusion of sourness. Toasted biscuit, red fruit, hazelnut and almond. Floral notes return in the finish, with pronounced bitterness. The aftertaste is bittersweet and astringent, with fruit, malt, and hop aromas. All in all, very good complexity.

Quoted from 《比利时啤酒——品饮与风味指南》 (Belgian Beer: A Tasting and Flavor Guide)

This passage uses many metaphors. Beer lovers who communicate often can see the different metaphors others use and adopt them into their own vocabulary. The process also trains your ability to notice details of flavor: after tasting regularly, your ability to describe the faint traces of many flavors improves, so you become better at detecting flavors you previously could not describe. This matches a feeling many friends have while learning: once you can define and name a phenomenon, your ability to recognize it when you hit a problem improves, and eventually your ability to solve the problem improves too. Among programmers, everyone knows Martin Fowler, who is famous for exactly this ability.

Finally, I want to emphasize that I am talking about tasting. Many hardcore tasters spit the beer out after tasting to keep their judgment sharp. We should keep that spirit and not let drinking become a mere habit, which easily slides into alcohol abuse with serious consequences. Keep the Know How together with the Know Why, and you can taste healthily for the long run.

These days I mainly do my tasting on Instagram: I take a photo of the beer and the glass so I can observe the body and the foam, then write my tasting notes on the flavor in the comments. I mostly share under the #mahbeer hashtag, where there are many excellent beer posts. The most important tool for looking up beers is Ratebeer, which has information on most beers you can get your hands on, plus reviews from many fellow beer lovers.

Here is one of my very early tasting notes:

  • Uses premium hops similar to Duvel’s; not an aged version; the flavor is fairly bright and slightly bitter
  • Malty aroma, restrained fruitiness, sweet taste with a sweet finish
  • The aroma is not very pronounced; not sour, not very bitter either, with no single flavor standing out
  • Rich, fine, very white foam, about half the height of the liquid
  • Golden color, slightly hazy
  • A nutmeg aroma lingers; the alcohol is noticeably sharp going down the throat, leaving a faint mint flavor
  • No sour or bitter aftertaste, which suggests the beer is not very sweet and not heavily hopped

Can you guess which beer this is? (It’s Westmalle Blond, a Trappist beer, though a rather unremarkable one.)

Web semantics is truly a precious display of humanity by web developers!

I rambled on about this at some length on Zhihu, so I’m posting a copy on my own blog as well.

The Semantic Web is about letting machines understand data. Semantic Web technology includes a set of description languages and inference logic; it describes ontologies through certain formats. The W3C’s RDF is one such description standard: it describes what the data means and the relationships that may exist between the terms (verbs?), so that a computer can produce the data views we need through queries (inference rules). In other words, if you ask the computer a question, it can infer the answer you want because it understands the data, even if that answer was not prepared in advance. Most Semantic Web representation standards are based on XML, because XML is a complete, general-purpose description language.

HTML chose a text-based protocol because text protocols are easy for both humans and computers to read. Note an important historical detail: it was email that inspired and helped create internet technology. The first email-related RFC (RFC 561) appeared as early as 1973, while the great TBL (Tim Berners-Lee) proposed hyperlink technology in 1989, which marked the birth of the WWW.

Text protocols were used because the messages being transferred were originally not meant to be stored and understood by computers. The earliest email was much like today’s text messages: two computers online at the same time, passing some text over a modem for the people at the computers to read. The text at that point had no links (and without links it is not a web); it was just blocks of text.

Later, to let email carry non-text data without breaking compatibility with the original protocol, the extremely important MIME standard (Multipurpose Internet Mail Extensions) was born.

With the email protocol family in place (transport protocols plus MIME), computers could already exchange all kinds of data through plain-text message bodies. But at that point only data was being transferred; there were no relationships between the pieces of data.

Hypertext extended text protocols with the ability to express relationships between documents (hyperlinks), which turned plain text into a network of relationships. The first RFC for this text markup protocol, HTML, was RFC 1866 in 1995, while our dear HTTP only got HTTP 1.0 (RFC 1945) in 1996.

You can pretty much rank the seniority of these technologies by their RFCs…

The point of stressing this history is that machine understanding was never the goal of this family of protocols; computers can parse binary, and binary is more efficient (for both transfer and parsing). These protocols were originally designed for human readability, so they are all based on text protocols that are not so friendly to computers. Text protocols are much more efficient for humans to debug.

So what does this have to do with the Semantic Web?

Because HTML is not optimized for machine readability, parsing HTML is actually quite a headache, as anyone who has written an HTML parser knows. HTML is often a real mess, and its fuzzy semantics frequently have to be guessed at.

Hence the famous XHTML, whose goal was to dress HTML in XML’s clothing. What is XML? XML’s original purpose was to design a protocol that both computers and humans can read; since humans can really only read text, it is a text description language. Making HTML conform to XML made both computers (robots) and humans happy. Then came XHTML 1.0, and everyone who lived through that era experienced the “website refactoring” movement firsthand. With a few more mandatory rules, the people writing XHTML parsers no longer had to cry. But later it turned out that people writing XHTML often made syntax errors, while others remained attached to the flexible (fuzzy) syntax of HTML 4.x.

Since many people still found that XHTML really did make it easier for computers to understand the structure of text, they pushed on enthusiastically with XHTML 2.0, but that was a road of no return. For most of the humans who consume HTML there was no love in it, so in the end the standard was abandoned.

Beyond the literal text, what matters more for human understanding is extracting the concepts; humans need to know what the structure of a document is. The people who work on protocols are elite folks who write papers all day… (this part is pure deduction on my part), so they felt HTML should present the chapter-and-section structure of the text they write. They put the metaphor of document structure into HTML’s document model (BOM and DOM), so the HTML specification defines the tags and the document metaphors those tags express. When humans read such text they map it onto the structure of a document (some paper), and the browser can render it in a way that pleases the standard writers.

But later people realized that documents without an index were not enough, so search engines became more and more important. Search engines are not all hand-curated the way Yahoo was; the mainstream engines that followed search by the relevance between the query text and the text on the web. Fetching text from the web and parsing it by document structure requires robots (crawlers). So the robots’ right to read web pages became more and more valued; after all, isn’t search engine optimization just an attempt to fool those robots’ algorithms?

This brings us back to the Semantic Web from the beginning. People’s queries are not just literal matching; people want smarter search engines. A search engine should know the user’s intent. That is not artificial intelligence but a set of statistics-based algorithms. Still, those algorithms are related to pieces of the Semantic Web, because people need to get data, find the ontology behind it, and do logical inference over predefined relationships between ontologies (today these are hard-coded algorithms rather than the inference systems of the Semantic Web). In other words, the model is conceptually similar to the Semantic Web, but since the technology is not quite feasible yet, it took a different path. When it comes to understanding text, though, every modern search engine has logic for this: it wants to describe the collected text as data that can be reasoned over, and one way to describe such data in the Semantic Web is RDF. RDF is based on XML, and XHTML is a kind of XML. Storing Semantic Web data in HTML attributes is called RDFa (Resource Description Framework in attributes), which brings HTML/XHTML and Semantic Web technology together, although representing the data is only one part of the Semantic Web.

Something similar to RDFa is microformats (I championed microformats early on, though later many people declared them dead; fortunately microdata took off afterwards), which put semantic data in node text or attributes and express the data structure through CSS classes. But they only express structure; we still cannot map it to an ontology. Microdata is exactly that attempt: it defines vocabularies that express certain common formats, and through those vocabularies the data maps to its ontology.

And that closes the loop. Semantic markup has to make the mapping between data and the ontology it describes possible: the structure has to be expressible first, then, through structural conventions (or explicit declarations), computers can find the ontology behind those structures, and finally they can reason logically over the relationships between ontologies. Today very little has truly become convention; just look at microdata. But set aside the ultimate ideal of turning our whole web into the Semantic Web. With the technology available today, we should take the poor robots’ very weak understanding fully into account and stick to conventional vocabulary as much as we can, so the robots can help us with the inference and computation they are much better at than we are. Ideally, when our documents describe the relevant ontologies, they should use structures that computers can understand more easily. That is semantic markup: using certain patterns to express vocabulary a computer can understand. That is what HTML semantics means.

Of course there is still a huge problem: there is no metaphorical relationship between HTML’s document model and what we usually want to express, and the gap is enormous. Web application development is constrained by this everywhere, which is why impatient people turned HTML5 into a perpetually evolving standard, so that we can add the new structures and semantics we want to the HTML text description protocol in a more timely way.

To emphasize one last time: semantic markup is really not for us humans. It is an expression of human compassion. We should also take care of the poor robots, letting them gradually pick up our human vocabulary through self-describing structures and understand what we are saying, so they can serve us better. Web semantics is truly a precious display of humanity by web developers!


Optimize Sparrow.app’s sparrowdb data file size

Intrepid Blog has a post about how to shrink the sparrowdb data file of Sparrow.app, the popular Mac mail app.

Good to know that Sparrow is using Tokyo Cabinet, and it sounds like a safe way to optimize the data file.

Please quit Sparrow.app before you run this.

But when I ran: tchmgr optimize ~/Library/Application\ Support/Sparrow/my.email.account.sparrowdb/data.db/data.tch

I got this error:

tchmgr: ~/Library/Application\ Support/Sparrow/my.email.account.sparrowdb/data.db/data.tch: 6: invalid record header

After some googling, I found that the -nl flag should fix it.

So I ran the command again:

tchmgr optimize -nl ~/Library/Application\ Support/Sparrow/my.email.account.sparrowdb/data.db/data.tch

Now it fixed the db and renamed the original file with a temporary name:

data.tch                    data.tch.tmp.1209295.broken

Open Sparrow to check whether the db is OK. If you can see your messages, you can delete the .broken file. Otherwise, copy the broken file back.

In my case it did reduce the db size:

-rw-r--r--  1 tin  staff   1.7G Aug 21 13:12 data.tch
-rw-r--r--  1 tin  staff   2.6G Aug 21 13:14 data.tch.tmp.1209295.broken

Unfortunately I saw that some messages couldn’t be rendered anymore, so I copied the old db back.

If the db is broken and can’t be recovered, Sparrow has an official way to reset the local cache and sync it again: just delete the Info.plist file in your sparrowdb folder.