How can you update a big table efficiently?
Facts:
I have a big table in MySQL with roughly 5 million rows. The schema of this table is simple and I hope to update the “entities” column entirely.
table big_tech:
Solution 1: update each row directly
update big_tech set entities='…' where postid='…'
MySQL does not provide a batch update operation. You have to update each row one by one. Each if you update multiple rows in a transaction, this will not make too much difference.
The problem: it is fast in the beginning of updating, but the speed will slow down as the time rolls on.
Solution 2: use batch insert
- Create a new table with simple columns, and insert into this table using batch inserts supported by MySQL natively.
- Then make a join update with table big_tech. The problem: Overall, this solution will spend less time. However, the batch insertion is not efficient enough after the temp table grows big.
Solution 3: output to a CSV file and then load into database
- Output “postid”, ”entities” into a CSV file, which costs nearly no time;
- Load this CSV file into a temp table in database;
- Make a join update between temp and big_tech.
It is worth noting that the encoding of the CSV file should be the same as the encoding of database in international language environment. Or else the loading will be problematic when special chars are met.
- Conclusion:
- Sometimes, the most efficient way to do things is very simple. No matter what kind of approaches, they are just tools with their specific pros and cons; it is the job of designers to utilize these tools appropriately to make the things done efficiently.
Sketch-based Approach for Graphs
We discussed these two papers about sketch-based approach today:
- Atish Das Sarma, Sreenivas Gollapudi, Marc Najork, Rina Panigrahy: A sketch-based distance oracle for web-scale graphs. WSDM 2010:401-410
- Andrey Gubichev, Srikanta J. Bedathur, Stephan Seufert, Gerhard Weikum: Fast and accurate estimation of shortest paths in large graphs. CIKM 2010:499-508
This key of sketch-based approach is finding the landmarks by random sampling for every node in a graph. Some comments:
- The assumption for sketch-based approach is the nodes are uniformly selected. This assumption is only suitable for random graphs. For scale-free graphs, nodes have distinct priority, so the uniform sampling technique may be not the case;
- Making whatever incremental computation based on sketch-based approach seems an explored research topic;
- The combination of sketch-based approach and random walk may generate some very fast algorithms for PageRank/SimRank on large graphs.
Mass Data Storage
Some Chinese articles:
Smart Java Code Blocks
Read entries in a Map<Key, Value>:
for (Map.Entry entry : entityWeight.entrySet()) {
System.out.println(entry.getKey() + ":" + entry.getValue());
}
Open a file and append text to the end of the file:
File file = new File("X:/Spritzer/Daily/" + stringDate + ".txt");
if (!file.exists())
file.createNewFile();
FileWriter fileWritter = new FileWriter(file, true); // append
BufferedWriter bufferWritter = new BufferedWriter(fileWritter);
bufferWritter.write("\r\n");
Obtain designated date format:
Date date = new Date();
SimpleDateFormat dateformat = new SimpleDateFormat("yyyyMMddhhmm");
String stringDate = dateformat.format(date);
Linux Commands for Installing Software
Wireshark on Fedora:
su —
yum install wireshark
yum install wireshark-gnome