An Example
This example demonstrates all the current CHPOX features. To compile the program, you can use the provided Makefile in Listing Two. The program takes three arguments:
- The first parameter defines the sleep interval (in seconds) when reading the source file line by line.
- The second parameter defines the name of the input file (plain text). The input file contains nothing but numbers separated by newlines. To generate such a sequence, use something like: $ seq 1 100 > input-file.txt
- The third parameter defines the name of the output file. The test program prints the square of each number read from the input file.
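To check the program's output later, for instance after a restore, it helps to have a reference result computed independently. The following is a minimal sketch, not part of the CHPOX package: the function name expected_squares and the file names are made up for illustration, and it assumes that the test program writes one square per line, so check the listing for the actual output format.

```shell
#!/bin/sh
# Hypothetical helper, not part of CHPOX: print the square of each number
# read from standard input, one result per line.
expected_squares() {
    awk '{ print $1 * $1 }'
}

# Example usage (file names are placeholders):
#   seq 1 100 > input-file.txt
#   expected_squares < input-file.txt > expected-output.txt
```

The reference file can then be compared against the output file that the test program produces.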
Here are the steps to do checkpointing with CHPOX:
- Make sure you have loaded the CHPOX kernel module. If you haven't, type modprobe chpox_mod. Confirm it with lsmod.
- Create the appropriate special character file in the /dev directory. Skip this if you use devfs. To automate the work, the CHPOX package provides a script to do this. Simply execute the mkchpoxdev script from the tools directory. Remember to load the module (the first step) beforehand, because the script checks for the existence of /proc/chpox and bails out if it does not find it.
- Run the target application. In this case, you can execute the above "chpoxtest" program for a quick start.
- Register the target program with chpoxctl. The syntax is:
# chpoxctl add <target-pid> <assigned signal number> <child flag> <checkpoint filename>
For example, if the PID of the process is 4500, and you want to use 31 as the signal number assigned to the CHPOX handler and to checkpoint the process's children as well, type:
# chpoxctl add 4500 31 9 /tmp/test.dmp
(See the chpoxctl(1) man page for a complete explanation of each parameter.)
If you need to checkpoint a single process only, replace the fourth parameter (the child flag) with "1". For the checkpoint filename, you can give any valid pathname.
For the signal number, it is best to pick one that the target does not use. The safe choices are SIGUSR1, SIGUSR2, or SIGUNUSED, but if you know that the target process installs its own handler for one of these signals, you must select another. Check /usr/include/bits/signum.h or type kill -l to see the actual signal number for a given signal name.
Chpoxtest already installs a signal handler for SIGUSR1, so you must pick another signal. Here we assume you pick SIGSYS (31). Note that you need the PID of the parent process, not of a child. You can use "ps" or "pgrep" to look for "chpoxtest" instances and pick the process with the lowest PID (the parent). Another way is to use "ps auf" or "ps auxf" to get a tree showing the parent-child relationships.
- Now you are ready to checkpoint. During the execution of chpoxtest, send it a SIGSYS:
# kill -31 4500
If you succeed, a new file "test.dmp" is created in the /tmp directory. Repeat as needed if you want to checkpoint chpoxtest at different times. Notice that only the latest checkpoint is kept in the specified dump file, so if you want to keep the previous checkpoint state, you need to copy or move the file before executing kill again. CHPOX can't do this for you automatically.
- Wait until chpoxtest finishes or just stop it (using Ctrl-C). Now you can restore the task using "ld-chpox". The syntax is:
# ld-chpox <path and filename of the dump file>
In this case, to restore chpoxtest, you can type:
# ld-chpox /tmp/test.dmp
As you can see from the console output, it continues right from the spot where you checkpointed it. Make sure the paths and names of the input and output files are still the same, or you will get an error that the files can't be found.
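The registration, checkpoint, and restore steps above can be collected into a small script. This is a minimal sketch, not part of the CHPOX package: the helper names (run, lowest_pid, save_dump, checkpoint) and the DRY_RUN variable are made up for illustration, and the chpoxctl, kill, and ld-chpox invocations in the usage comments simply repeat the ones shown in the steps. The save_dump helper addresses the fact that each checkpoint overwrites the dump file.

```shell
#!/bin/sh
# Sketch only: helper names and DRY_RUN are illustrative assumptions.

# Print the command instead of running it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Read PIDs (one per line) on stdin and print the lowest one, which the
# steps above treat as the parent process.
lowest_pid() {
    sort -n | head -n 1
}

# Keep an existing dump file by renaming it to <dump>.1, <dump>.2, ...
save_dump() {
    dump=$1
    [ -f "$dump" ] || return 0
    n=1
    while [ -e "$dump.$n" ]; do
        n=$((n + 1))
    done
    run mv "$dump" "$dump.$n"
}

# Preserve the previous dump, then send the checkpoint signal.
checkpoint() {
    pid=$1
    sig=$2
    dump=$3
    save_dump "$dump"
    run kill "-$sig" "$pid"
}

# Example usage, with the values assumed in the steps above:
#   pid=$(pgrep chpoxtest | lowest_pid)
#   run chpoxctl add "$pid" 31 9 /tmp/test.dmp
#   checkpoint "$pid" 31 /tmp/test.dmp
#   run ld-chpox /tmp/test.dmp
```

Setting DRY_RUN=1 prints each command without executing it, which is a convenient way to review the sequence before running it as root.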
At this stage, you no longer need CHPOX to remember the PID (process ID) you registered earlier. You can therefore safely delete it from the CHPOX list using chpoxctl. This time, the syntax is:
# chpoxctl del <pid of process>
Because we registered PID 4500 (remember, this value is just an assumption), type:
# chpoxctl del 4500
You can also type:
# chpoxctl clear
to clean up all the PIDs in the CHPOX checkpoint list.
You can also try another possibility here. As you can see in the code, sleep() is used inside the SIGUSR1 handler to suspend the parent process's execution for several seconds. This gives you the chance to checkpoint while the code is inside the signal handler:
- Send the parent process the USR1 signal:
# kill -USR1 4500
- The application prints the status that it is now inside the signal handler. Quickly checkpoint it:
# kill -31 4500
Later, when you restore it from the dump file, the parent process starts in the sleeping state, while both children are still running. Remember that the SIGUSR1 signal handler is installed ONLY in the parent process, so only the parent process reacts to SIGUSR1.
As another test case, you can pick grep. To make grep run long enough to checkpoint it, you can grep over something large; for example, the Linux kernel source. Here we assume you have extracted Linux kernel source in the /usr/src/linux directory. You can use any version of Linux kernel source. Try:
$ grep -I -r -i -n schedule /usr/src/linux/* > /tmp/grep-result.txt
The term "schedule" is selected because it returns a large number of hits. You are free to pick any term you like, but try to pick one that occurs many times so that grep is kept busy and you have plenty of time to checkpoint.
As before, while grep is running, register it and send the checkpoint signal to the grep PID. At this point, you might wonder how to prove that the restoration was actually successful. Here is the trick:
- Run grep and let it finish. Here we assume that you save the output as grep-complete.txt.
- Run grep again. This time redirect the output to grep-partial.txt.
- Press Ctrl-Z to stop grep (remember: stop it, do not terminate it).
- While grep is in the stopped state, register and checkpoint it. Notice that the dump file is not created yet.
- Copy grep-partial.txt to grep-partial-a.txt.
- Resume grep by typing:
$ fg %1
You might need to adjust the job number; check the output of "jobs".
- Terminate it by pressing Ctrl-C as soon as possible. Check that the dump file now exists.
- Delete and recreate grep-partial.txt. Simply use touch:
# rm -f grep-partial.txt
# touch grep-partial.txt
- Now restore the process by using the dump file. Let it finish. Now you have grep-partial.txt and grep-partial-a.txt. Concatenate them using cat:
$ cat grep-partial-a.txt grep-partial.txt > ./grep-cat.txt
- Compare the content of grep-cat.txt with grep-complete.txt. By observing the result, you can draw your own conclusion. Hint: to get a thorough view of a file's content, use a viewer that can dump the raw content of the file, for example, the hexdump utility.
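The comparison in the last step can also be done byte for byte with cmp, which exits with status 0 only when its two inputs are identical. The following is a minimal sketch that uses the file names from the steps above; the function name verify_restore is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the output captured before the
# checkpoint, followed by the output of the restored run, is byte-for-byte
# identical to the output of the uninterrupted run.
verify_restore() {
    complete=$1
    before=$2
    after=$3
    # "-" makes cmp read the concatenated data from standard input;
    # -s suppresses output so that only the exit status matters.
    cat "$before" "$after" | cmp -s - "$complete"
}

# Example usage:
#   if verify_restore grep-complete.txt grep-partial-a.txt grep-partial.txt; then
#       echo "restored output matches"
#   else
#       echo "restored output differs"
#   fi
```

If the function reports a difference, running cmp without -s, or inspecting both files with hexdump, shows where the two outputs diverge.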
Conclusion
CHPOX still needs lots of improvement. As noted on its web site, support for Internet sockets, shared memory, System V IPC, and processes with multiple threads is still on the to-do list. So don't expect CHPOX to work flawlessly in every situation. Always experiment before using CHPOX in a real production environment.
Reference
Understanding the Linux Kernel, Second Edition. Daniel P. Bovet and Marco Cesati, O'Reilly and Associates, ISBN 0-596-00213-0.
Linux Kernel Development. Robert Love, Sams Publishing, ISBN 0-672-32512-8.
"Comparison of the Existing Checkpoint Systems," Byoung-Jip Kim, IBM Watson.
"Process Checkpointing and Restarting System for Linux," Sudakov O.O., Boyko Yu.V., Tretyak O.V., Korotkova T.P., Meshcheryakov E.S., Mathematical Machines and Systems, 2003. N.2, p.146-153.
Mulyadi Santosa is a software developer in Indonesia. He can be contacted at [email protected]. Eugeniy Meshcheryakov is a developer in Ukraine. He can be contacted at [email protected].