In this short series, we would like to introduce you to the basic concepts of Data Science and the tools to get you started on your Data Science path.
 

As you make your transition into Data Science you will, sooner or later, feel the need to start using Unix-based operating systems, and one of the reasons is the availability and usefulness of the Unix shell. It lets you manipulate and explore files, set security permissions, and handle countless other tasks, small and large.

What is the shell?

Simply put, the shell is a program that takes commands from the keyboard and gives them to the operating system to perform. In the old days, it was the only user interface available on a Unix-like system such as Linux. On most Linux systems a program called bash acts as the shell program. Besides bash, there are other shell programs that can be installed on a Linux system.

 

5 commands for every Data Scientist

There are plenty of really cool things you can do with the shell but, as with everything, you have to start somewhere. Based on our own experience, we want to give you the five commands you will use the most as a Data Scientist. Ready?

1. cd

What better way to start than learning how to move around your files? Changing directories is easy as long as you know where you are (your current directory) and how that relates to where you want to go.

To change directories, use the cd command. Typing this command by itself returns you to your home directory; moving to any other directory requires a path name.

You can use absolute or relative path names. Absolute paths start at the top of the file system with / (referred to as root) and then look down for the requested directory; relative paths look down from your current directory, wherever that may be.

These are the things you can do with the cd command:

These two will return you to your login (home) directory

cd ~
cd

Takes you to the entire system's root directory

cd /

Takes you to the home directory of the root, or superuser, account created at installation; you must be the root user to access this directory

cd /root

Moves you up one directory

cd ..

Regardless of which directory you are in, this absolute path takes you directly to subdirfoo, a subdirectory of dir1

cd /dir1/subdirfoo

This relative path takes you up two directories, then to dir3, then to the dir2 directory

cd ../../dir3/dir2

 Source: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/3/html/Step_by_Step_Guide/s1-navigating-cd.html
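If you would like to try absolute and relative paths without touching your real files, here is a minimal sketch. It assumes mktemp is available, and the directory names (project, data, scripts) are made up for illustration.

```shell
# Build a throwaway directory tree so nothing real is touched.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/project/data" "$sandbox/project/scripts"

# Absolute path: starts at / and works from any current directory.
cd "$sandbox/project/data"
abs_dir=$(pwd)
echo "after absolute cd: $abs_dir"

# Relative path: up one level, then down into scripts.
cd ../scripts
rel_dir=$(pwd)
echo "after relative cd: $rel_dir"

# Up two levels lands back at the top of the sandbox.
cd ../..
up_dir=$(pwd)
echo "after cd ../..: $up_dir"

# Leave the sandbox before deleting it.
cd /
rm -rf "$sandbox"
```

Printing pwd after each step is a handy habit while you are still getting used to where a relative path will take you.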


2. ls

Once you get where you want to be in your file system, you will need to find out what that folder contains. For this you can use the ls command. It lists the contents of your current folder and prints their names to the terminal. There are other fun things you can do here as well:

To list one file per line

ls -1

To list all properties of the files, including permissions, size and the last edit date (you can use ls -lh instead of ls -l to display the file size in a friendlier format).

ls -l

To show the hidden files as well

ls -A
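The options above can also be combined into a single call. The sketch below assumes mktemp is available and uses two made-up file names, one of them hidden, to show the difference.

```shell
# Throwaway directory with one visible and one hidden file.
sandbox=$(mktemp -d)
touch "$sandbox/data.csv" "$sandbox/.hidden_config"

# Plain ls normally leaves out names that start with a dot.
visible=$(ls "$sandbox")
echo "ls:    $visible"

# -A also lists hidden entries (but not . and ..).
all=$(ls -A "$sandbox")
echo "ls -A:"
echo "$all"

# Combined: long listing, human-readable sizes, hidden files included.
ls -lhA "$sandbox"

rm -rf "$sandbox"
```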

3. chmod

More often than you would like, files do not have the security permissions you want. Either one of your programs cannot access a file it needs and keeps breaking on you, or you are exposing files to the public that should not be that open (stay tuned for Part 3 of our tutorial on APIs).

If you need to change the permissions of a file, the chmod command comes to the rescue. Let's imagine we have these files in our directory:

example% ls -lg
   total 28
   -rw-r--r--  1 user       group     273 Mar 24 11:28 file1
   -rwxrwxrwx  1 user       group    1449 Jan 29 14:01 file2
   -rwx------  1 user       group    4119 Jan 26 13:22 file3

We are interested in the first column in the output, which deals with permissions. These are the explanations of each of the 10 characters in that column.

Character    What it means
1            "d" if a directory, "-" if a file
2            "r" if file is readable to user, "-" if not
3            "w" if file is writable to user, "-" if not
4            "x" if file is executable to user, "-" if not
5-7          same as 2-4, with reference to group
8-10         same as 2-4, with reference to everyone else

We can change these at will (provided we have the rights to do so) with the chmod command. The command has the following syntax:

chmod options permissions filename

If no options are specified, chmod sets the permissions of the file named by filename to those given by permissions.
Permissions defines the permissions for the owner of the file (the "user"), members of the group that owns the file (the "group"), and anyone else ("others"). There are two ways to represent these permissions: with symbols (alphanumeric characters), or with octal numbers (the digits 0 through 7).
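For the symbolic form, a rough sketch: the letters u, g and o stand for user, group and others, and +, - and = add, remove or set permissions. The file name run_analysis.sh is hypothetical, and the example assumes mktemp is available.

```shell
# Throwaway file so no real permissions are changed.
sandbox=$(mktemp -d)
script="$sandbox/run_analysis.sh"   # hypothetical file name
touch "$script"

# Set a known starting point: user may read and write, group and others get nothing.
chmod u=rw,go= "$script"

# u+x adds execute for the user, g+r adds read for the group.
chmod u+x,g+r "$script"

# Keep only the 10 permission characters from the long listing.
perms=$(ls -l "$script" | cut -c1-10)
echo "$perms"

rm -rf "$sandbox"
```

The symbolic form is convenient when you want to tweak a single permission without restating all of the others.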

We will go over the octal numbers option with the following example:

chmod 574 awesomefile

Here the digits 5, 7, and 4 each individually represent the permissions for the user, group, and others, in that order. Each digit is a combination of the numbers 4, 2, 1, and 0:

  • 4 stands for "read",
  • 2 stands for "write",
  • 1 stands for "execute", and
  • 0 stands for "no permission."

So 5 is the combination of permissions 4+0+1 (read, no write, and execute), 7 is 4+2+1 (read, write, and execute), and 4 is 4+0+0 (read, no write, and no execute).
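To check the arithmetic yourself, the sketch below applies 574 to a throwaway file and reads the result back with ls -l. It assumes mktemp is available; awesomefile is the example name used above.

```shell
# Throwaway copy of the example file.
sandbox=$(mktemp -d)
touch "$sandbox/awesomefile"

# Each digit is the sum of read (4), write (2) and execute (1).
user=$((4 + 0 + 1))     # read + execute
group=$((4 + 2 + 1))    # read + write + execute
others=$((4 + 0 + 0))   # read only
mode="$user$group$others"
echo "mode: $mode"

chmod "$mode" "$sandbox/awesomefile"

# Characters 2-4 are the user, 5-7 the group, 8-10 everyone else.
perms=$(ls -l "$sandbox/awesomefile" | cut -c1-10)
echo "$perms"

rm -rf "$sandbox"
```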


4. head / tail

The next logical step is to take a partial look at your files before you start your analysis. We will go over two commands, head and tail. Head displays the beginning of a file and is useful for finding out how your files are formatted, checking the column names in a csv, or debugging when your data isn't loading properly. Tail displays the end of a file and is great for examining log files or checking whether the files you write out during your analysis are being written correctly (e.g. overwrite vs. append).

Unless stated otherwise, tail accepts the same options as head in the following examples; it simply works from the end of the file instead of the beginning.

To display the first 10 lines of a file.

head /path/to/file

To print the first n lines of a file.

head -n <number_of_lines> /path/to/file

To keep the file open and "follow" the lines being added after opening it (works only for tail).

tail -f /path/to/file

To print the first c bytes of the file.

head -c <number_of_bytes> /path/to/file
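A small sketch of these options on a file you control. It assumes seq and mktemp are available; numbers.txt is a made-up file holding the numbers 1 to 20, one per line.

```shell
# Throwaway file with 20 numbered lines.
sandbox=$(mktemp -d)
file="$sandbox/numbers.txt"
seq 1 20 > "$file"

# First three lines and last two lines.
first_three=$(head -n 3 "$file")
last_two=$(tail -n 2 "$file")

# Piping head into tail picks out a range: lines 5 to 7 here
# (take the first 7 lines, then keep the last 3 of those).
middle=$(head -n 7 "$file" | tail -n 3)

echo "first three:"; echo "$first_three"
echo "last two:";    echo "$last_two"
echo "lines 5-7:";   echo "$middle"

rm -rf "$sandbox"
```

The head-then-tail pipe is one simple way to peek into the middle of a large file without opening it in an editor.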

5. grep

The grep command searches the given files for lines that match one or more patterns, which can be plain words or regular expressions. By default, grep prints only the matching lines.

Grep uses the following syntax:

grep 'word' filename
grep 'word' file1 file2 file3
grep 'string1 string2' filename

You can also search through all files in the current directory (or only files of a particular type).

grep 'word' *
grep 'word' *.py

If you want to search directories recursively, you can use the -r or -R option.

grep -r "192.168.1.5" /etc/
grep -R "192.168.1.5" /etc/
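A few other widely supported grep options are worth knowing: -n (show line numbers), -i (ignore case), -c (count matching lines), -l (print only file names) and -E (extended regular expressions). The sketch below tries them on made-up log files; the file names and log contents are invented for illustration, and mktemp is assumed to be available.

```shell
# Throwaway log files, one of them in a subdirectory.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/logs/archive"
printf 'INFO start\nERROR disk full\ninfo done\n' > "$sandbox/logs/app.log"
printf 'ERROR timeout\n' > "$sandbox/logs/archive/old.log"

# -n prefixes each matching line with its line number.
grep -n 'ERROR' "$sandbox/logs/app.log"

# -i ignores case, -c prints the number of matching lines instead of the lines.
info_count=$(grep -ic 'info' "$sandbox/logs/app.log")
echo "lines mentioning info: $info_count"

# -r descends into subdirectories, -l prints only the names of matching files.
error_files=$(grep -rl 'ERROR' "$sandbox/logs" | wc -l)
echo "files containing ERROR: $error_files"

# -E enables extended regular expressions: lines starting with INFO or ERROR.
regex_count=$(grep -Ec '^(INFO|ERROR) ' "$sandbox/logs/app.log")
echo "lines starting with INFO or ERROR: $regex_count"

rm -rf "$sandbox"
```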

Conclusion

Now you are ready to start browsing around your Unix file system. If you have any questions please leave a comment. If you want to learn more about how to be a Data Scientist in the real world, check out our bootcamp.