How can you update a big table efficiently?

Facts:

I have a big table in MySQL with roughly 5 million rows. The schema of this table is simple and I hope to update the “entities” column entirely.

table big_tech:

image

Solution 1: update each row directly

update big_tech set entities='…' where postid='…'

MySQL does not provide a batch update operation. You have to update each row one by one. Each if you update multiple rows in a transaction, this will not make too much difference.

The problem: it is fast in the beginning of updating, but the speed will slow down as the time rolls on.

Solution 2: use batch insert

  • Create a new table with simple columns, and insert into this table using batch inserts supported by MySQL natively.

image

  • Then make a join update with table big_tech. The problem: Overall, this solution will spend less time. However, the batch insertion is not efficient enough after the temp table grows big.

Solution 3: output to a CSV file and then load into database

  • Output “postid”, ”entities” into a CSV file, which costs nearly no time;
  • Load this CSV file into a temp table in database;
  • Make a join update between temp and big_tech.
  • It is worth noting that the encoding of the CSV file should be the same as the encoding of database in international language environment. Or else the loading will be problematic when special chars are met.

    Conclusion:
    Sometimes, the most efficient way to do things is very simple. No matter what kind of approaches, they are just tools with their specific pros and cons; it is the job of designers to utilize these tools appropriately to make the things done efficiently.

Sketch-based Approach for Graphs

January 26, 2012 Leave a comment

We discussed these two papers about sketch-based approach today:

This key of sketch-based approach is finding the landmarks by random sampling for every node in a graph. Some comments:

  • The assumption for sketch-based approach is the nodes are uniformly selected. This assumption is only suitable for random graphs. For scale-free graphs, nodes have distinct priority, so the uniform sampling technique may be not the case;
  • Making whatever incremental computation based on sketch-based approach seems an explored research topic;
  • The combination of sketch-based approach and random walk may generate some very fast algorithms for PageRank/SimRank on large graphs.
Categories: Graph Data Management Tags:

Mass Data Storage

January 26, 2012 Leave a comment
Categories: Structured Storage

How to Set Up Twitter4J Quickly in Your Java Project

January 25, 2012 Leave a comment

1. Go to http://twitter4j.org/en/index.html and download the most recently stable version.

2. Add twitter4j-core-2.2.5.jar into project for general use. If you want to use streaming API, add twitter4j-stream-2.2.5.jar into your project.

3. Copy Twitter API property file into your project/bin directory.

4. Test your code example:

public static void main(String[] args) throws Exception {

        Twitter twitter = new TwitterFactory().getInstance();
        int hits = twitter.getRateLimitStatus().getRemainingHits();
        System.out.println(hits);
        Status status = twitter.updateStatus("Really weird!");
        System.out.println("Successfully updated the status to ["
                + status.getText() + "].");

}

Categories: Social Web Tags:

The Organization of Social Posts

December 18, 2011 Leave a comment

Observations:

1. Twitter

A typical tweet has the following attributes:

  • Author, associated with a  profile which includes: unique ID, screen name, location, portrait, following and followers, etc;
  • Tweet, including tweet content, time stamp, etc;
  • Attributes: favorite (like/dislike), retweet, reply.

image

2. Forum

Take http://stackoverflow.com as an example.

image

Attributes:

  • Agree (upper triangle) and disagree (lower triangle);
  • Favorite (star);
  • Share to Facebook and Twitter;
  • Tags such as “ruby”;
  • Author demographic data;
  • Replies.

3. News Article Review

Take Readwriteweb.com as an example.

An initial post is actually an article, so it contains title, (abstract), pictures/videos, and a long textual content. The following is what you can do for the initial post:

  • Share: Twitter, Facebook, Email, Digg and Print out.
  • Comments
  • Subscribe to the author or category.

image

If a post is a reply, it looks like this:

image

Attributes for such a reply post:

  • Author;
  • Content;
  • Time stamp;
  • Like/dislike;
  • Reply.

Modelling:

A general post model:

1. Each post has an unique URL, reallocated, to indicate the structures of the whole discussion thread. For example,

twitter/3456/23/1

means a post which is the first reply to post twitter/3456/23 and the latter is the 23rd post to the initial post twitter/3456 in Twitter.com.

2. Authors, with demographic data, are connected by social networks;

3. Posts can be organized by semantic ties, which indicates the cohesive of the topic (like hashtags in Twitter);

4. Time stamps are important in time series analysis.

Categories: Social Web

Smart Java Code Blocks

December 16, 2011 Leave a comment

Read entries in a Map<Key, Value>:

for (Map.Entry entry : entityWeight.entrySet()) {
	System.out.println(entry.getKey() + ":" + entry.getValue());
}

Open a file and append text to the end of the file:

File file = new File("X:/Spritzer/Daily/" + stringDate + ".txt");
if (!file.exists())
	file.createNewFile();
FileWriter fileWritter = new FileWriter(file, true); // append
BufferedWriter bufferWritter = new BufferedWriter(fileWritter);
bufferWritter.write("\r\n");

Obtain designated date format:

Date date = new Date();
SimpleDateFormat dateformat = new SimpleDateFormat("yyyyMMddhhmm");
String stringDate = dateformat.format(date);
Categories: Programming Languages

Linux Commands for Installing Software

December 9, 2011 2 comments

Wireshark on Fedora:

su —
yum install wireshark
yum install wireshark-gnome


Categories: Software Center Tags:
Follow

Get every new post delivered to your Inbox.