Monday, August 10, 2009

japanese in mysql databases

grr.. what is it today... to get nihongo working in mysql

1) verify that the issue is the db encoding, it needs to read like this example

mysql -u[username] -p[password] [database]
show variables like "%character%";show variables like "%collation%";
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database   | utf8_general_ci |
| collation_server     | utf8_general_ci |
+----------------------+-----------------+
3 rows in set (0.00 sec)
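
If eyeballing the two tables gets old, the check can be scripted. This is only a rough sketch: it assumes the variables were dumped in tab-separated batch form (for example with `mysql -B -N -e 'show variables like "%character%"'`), and the names `IGNORED` and `non_utf8_variables` are made up for this example.

```ruby
# Variables that are not expected to be utf8 in the listing above
# (an assumption based on that sample output).
IGNORED = %w[character_set_filesystem character_sets_dir].freeze

# Takes tab-separated "name<TAB>value" lines and returns the settings
# whose value does not start with the expected character set name.
def non_utf8_variables(batch_output, expected = "utf8")
  batch_output.each_line
              .map { |line| line.chomp.split("\t", 2) }
              .select { |name, value| name && value }
              .reject { |name, _| IGNORED.include?(name) }
              .reject { |_, value| value.start_with?(expected) }
              .map { |name, value| "#{name}=#{value}" }
end
```

An empty array means every checked variable starts with utf8 (so utf8_general_ci passes too); anything returned is a setting that still needs fixing.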

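2) the fix;
sudo emacs /etc/mysql/my.cnf

then add/edit the lines (in the correct sections); on mysql 5.5 and later the [mysqld] option is spelled character-set-server=utf8, since default-character-set was removed from that section
[mysqld]
default-character-set=utf8

[mysql]
default-character-set=utf8

then restart the server
sudo /etc/init.d/mysql restart

existing tables keep the character set they were created with, so recreate the db from scratch, or export all the data, recreate the db and import it again..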

parsing a japanese file to create a word list

mecab japaneseIn.txt | egrep -v "^EOS" | egrep -v "記号|\w助詞" | sed "s/\t.*//" | sort | uniq -c | sort -n -r | grep -v "^\s*[0-9]*\s*[0-9a-zA-Z,./\\<>?_;:@{}^~。*%()\-]*$" > wordListOut.txt

mecab Asic-wiki.txt | grep -v "記号,.*,\*,\*" | grep -v "名詞,数,\*,\*,\*,\*,\*" | grep -v "名詞,サ変接続,\*,\*,\*,\*,\*" | grep -v "^EOS"

particles (助詞 means particle, swap in 動詞 to list verbs);
mecab Asic-wiki.txt | grep "助詞"

verb (dictionary form) occurrence count;
mecab Asic-wiki.txt | grep "動詞" | egrep "一段|五段" | sed "s/.*[一五]段[^,]*,[^,]*,\([^,]*\),.*/\1/" | sort | uniq -c | sort -n -r

verb conjugation form occurrence count;
mecab Asic-wiki.txt | grep "動詞" | egrep "一段|五段" | sed "s/.*[一五]段[^,]*,\([^,]*\),.*/\1/" | sort | uniq -c | sort -n -r

of course these are hacky one-liners, i have a ruby function that parses the file and gives me the various results

the critical column (the 7th comma-separated field, which is the dictionary form in the ipadic output) is given by this
mecab Asic-wiki.txt | sed "s/[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,\([^,]*\).*/\1/"
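
For comparison, here is a minimal Ruby sketch of the same idea. It is not the function mentioned above, just an illustration: it assumes the ipadic output layout of "surface<TAB>feature,feature,..." with the dictionary form in the 7th feature field, and the name `verb_base_form_counts` is made up for this example. Unlike the grep one-liners it matches the part of speech exactly, so auxiliary verbs (助動詞) are not counted, and it does not restrict to 一段/五段 verbs.

```ruby
# Count dictionary forms of verbs in MeCab output (ipadic layout assumed).
# Returns [[word, count], ...] sorted by count, highest first.
def verb_base_form_counts(mecab_output)
  counts = Hash.new(0)
  mecab_output.each_line do |line|
    line = line.chomp
    next if line.empty? || line == "EOS"   # EOS marks the end of a sentence

    surface, features = line.split("\t", 2)
    next if features.nil?                   # not a token line

    fields = features.split(",")
    next unless fields[0] == "動詞"          # first feature is the part of speech

    base = fields[6]                        # 7th feature: dictionary form
    base = surface if base.nil? || base == "*"
    counts[base] += 1
  end
  counts.sort_by { |word, n| [-n, word] }
end
```

Something like `verb_base_form_counts(IO.popen(["mecab", "Asic-wiki.txt"], &:read))` would feed it the real parser output.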

rails "bugs": 2 belongs_to pointing to the same table

2 belongs_to pointing to the same table and searching fails...

(rdb:1)  Customer.find(:all, :conditions => ["#{Address.table_name}.name_given = 'kdFJf6J6Mdx'"], :include => [:address, :shipping_address] )
[]
(rdb:1)  Customer.find(:all, :conditions => ["#{Address.table_name}.name_given = 'kdFJf6J6Mdx'"], :include => [:shipping_address, :address] )
[#<Customer ...>]

decide for yourself...
1) The problem can be fixed by using the :joins option to search both fields.
2) The problem can be exploited: with eager loading you can load the full set of child records linked via has_many while searching on a field contained in the children to locate the parents. Generally a search from the parent into the children limits the loaded children to only the matching ones (which is a problem 90% of the time).

mecab - japanese parser

Mecab is an excellent parser for Japanese text. im using it to auto generate a weighted flash card list from the relevant sections of Japanese newspapers... the hardest part of studying Japanese is figuring out which words and phrases are applicable for your field of expertise..

but mecab is a pain in the butt to install in ubuntu, and the default output isnt utf-8 which is a total pain... heres the fixed install instructions

Download the main mecab and the ipa dict from here;
http://sourceforge.net/projects/mecab/files/

Then run these commands... note that i had to monkey fix the config LDFLAGS for ubuntu...

tar xvzf mecab-0.97.tar.gz mecab-0.97/
tar xzvf mecab-ipadic-2.7.0-20070801.tar.gz mecab-ipadic-2.7.0-20070801/
cd mecab-0.97/
LDFLAGS=-R/usr/local/lib ./configure  --with-charset=utf8
make
make check
sudo make install
cd ..
cd mecab-ipadic-2.7.0-20070801/
LDFLAGS=-R/usr/local/lib ./configure  --with-charset=utf8
make
sudo make install

Check here for Japanese version of the instructions and download links:
http://mecab.sourceforge.net/#install-unix

http://sourceforge.jp/projects/mecab/lists/archive/users/2007-January/000204.html

Thursday, August 6, 2009

ruby on rails - dumping the stack -hack..

maybe not the best.. but im a semicon guy and my rails stuff is for play so sue me

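begin
  asd  # undefined name, raises NameError on purpose so we get a backtrace
rescue => e
  logger.info ""
  logger.info "skip_full_validation #{ret} -  #{self.shipping_flag}"
  # keep only the frames that come from the app directory (meat-guy)
  bt = e.backtrace.select { |l| l =~ /meat-guy/ }
  logger.info YAML::dump(bt)
end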
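
The raise-and-rescue trick is not strictly needed, since Ruby's built-in `caller` returns the current stack as an array of strings. A small sketch along those lines; the helper name `filtered_stack` is made up, and /meat-guy/ is just the app directory pattern from the snippet above:

```ruby
# Return the stack frames that match the pattern.
# The default for frames is evaluated inside the method, so caller
# gives the stack of whoever called filtered_stack.
def filtered_stack(pattern, frames = caller)
  frames.grep(pattern)
end
```

In a model or controller this would be used as `logger.info YAML::dump(filtered_stack(/meat-guy/))`, assuming yaml is already loaded as it is in Rails.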

passwordless shutdown of ubuntu

the correct way to edit the /etc/sudoers file is via "sudo visudo". using "chmod u+w /etc/sudoers" is a no-no

if you brick it, restart the computer with physical access, enter recovery mode, drop to the root prompt and fix the file.. a reinstall is not required

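to get a passwordless shutdown go;
sudo visudo

add the line and save to its default location
%admin ALL=NOPASSWD: /sbin/shutdown

then (this sets the setuid bit so any local user can run reboot without sudo; adding /sbin/reboot to the sudoers line above is the tighter option)
sudo chmod +s /sbin/reboot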

disabling clear text password access to ssh

Disable Password Authentication

To disable password authentication,
sudo vi /etc/ssh/sshd_config

find the PasswordAuthentication line (it may be commented out) and replace it with a line that looks like this:
#PasswordAuthentication yes
PasswordAuthentication no

before restarting, make certain that you have at least one public ssh key in the relevant user's .ssh/authorized_keys file, otherwise you lock yourself out

sudo /etc/init.d/ssh restart